Friday, April 29, 2011

Parallelizing and cross-validating feature selection in R

This is an example piece of code for the Overfitting competition at kaggle.com. This method has an AUC score of ~0.91, which is currently good enough for about 38th place on the leaderboard. If you read the competition forums closely, you will find code that is good enough to tie for 25th place, as well as hints as to how to break into the top 10.

However, I like this script because it does two tricky things well, without overfitting:
1. It selects features, despite the curse of dimensionality (250 observations, 200 features).
2. It fits a linear model, using the elastic net.
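
To give a rough feel for those two steps, here is a minimal sketch. It is not the competition script itself: the data are simulated to match the 250 x 200 shape, the "first 10 features carry signal" setup is a made-up assumption, and alpha = 0.5 is an arbitrary mixing value chosen for illustration. It assumes the glmnet package is installed.

```r
# Sketch only: simulated data stand in for the competition files.
library(glmnet)

set.seed(42)
n <- 250  # observations
p <- 200  # features
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Hypothetical ground truth: only the first 10 features carry signal.
beta <- c(rep(1, 10), rep(0, p - 10))
y <- rbinom(n, size = 1, prob = as.vector(plogis(x %*% beta)))

# alpha = 0.5 mixes the lasso (alpha = 1) and ridge (alpha = 0) penalties.
# cv.glmnet scores each lambda on its path by 10-fold cross-validated AUC.
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5,
                 type.measure = "auc", nfolds = 10)

# Features with non-zero coefficients at the chosen lambda are the selected ones.
coefs <- as.matrix(coef(fit, s = "lambda.1se"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")

# Best cross-validated AUC along the lambda path.
cv_auc <- max(fit$cvm)
```

Because the penalty shrinks many coefficients to exactly zero, the fit does feature selection and model fitting in one pass, and the cross-validated AUC gives a more honest estimate than the training AUC would. Printing `selected` and `cv_auc` shows which columns survived and how well they score on held-out folds.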

In future posts, I will walk you through how this code works, but for now, download the data and give it a shot!

Friday, April 22, 2011

Intro

This blog will show you how to build tools to survive in the modern world. I will focus on statistics and machine learning, because that's where my strengths lie, but sometimes we may find ourselves veering far off course.

My primary interest lies in using computers to solve problems, and I will spend the majority of my time discussing practical rather than theoretical issues.

If you wish to play the home game, start by installing R, and hang on to your seat...
