The Kaggle Don't Overfit competition is over, and I took 11th place! Additionally, I tied with tks for contributing the most to the forum, so thanks to everyone who voted for me! I voted for tks, and I'm very happy to share the prize with him, as most of my code is based on his work.

The top finishers in this competition did a writeup of their methods in the forums, and they are definitely worth a read. In my last post, I promised a writeup of a method that would beat the benchmark by a significant margin, so here it is. This variable selection technique is based on tks' code.

To start, we set things up the same way as in my previous posts: load the data, split train vs. test, etc.
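The gist with that code is no longer embedded here, so here is a minimal sketch of the setup, assuming the competition file is named overfitting.csv and the predictors are named var_1 through var_200 (adjust the file and column names to your local copy):

```r
# Minimal setup sketch (file and column names are assumptions)
library(caret)

Data <- read.csv('overfitting.csv')

# Recode the practice target as a factor with valid R level names (needed by twoClassSummary)
Data$Target <- factor(ifelse(Data$Target_Practice == 1, 'X1', 'X0'))
Data$Target_Practice    <- NULL
Data$Target_Leaderboard <- NULL
Data$Target_Evaluate    <- NULL

# The 'train' column flags the training rows
trainset <- Data[Data$train == 1, ]
testset  <- Data[Data$train == 0, ]
```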

To improve over our previous attempt, which fits a glmnet model on all 200 variables, we want to select a subset of those variables that performs better than the entire set. This whole process must be cross-validated to avoid overfitting, but fortunately the caret package in R has a great function called rfe that handles most of the grunt work. The process is described in great detail here, but it can be thought of as 2 nested loops: the outer loop resamples your dataset, either using cross-validation or bootstrap sampling, and then feeds the resampled data to the inner loop, which fits a model, calculates variable importances based on that model, eliminates the least important variables, and re-calculates the model's fit. The results of the inner loop are collected by the outer loop, which then stores a re-sampled metric of each variable's importance, as well as the number of variables that produced the best fit. If this sounds complicated, read the document at my previous link for more detailed information.


This section of code initializes an object to control the RFE function. We use the 'caretFuncs' object as our framework, and the 'twoClassSummary' function as our summary function, as we are doing a 2-class problem and want to use AUC to evaluate our predictive accuracy. Then we use a custom function to rank the variables from a fitted glmnet object, based on tks' method. We've decided to rank the variables by their coefficients, and consider variables with larger coefficients more important. I'm still not 100% sure why this worked for this competition, but it definitely gives better results than glmnet on its own. Finally, we create our RFE control object, where we specify that we want to use 25 repeats of bootstrap sampling for our outer loop.
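Since the gist is gone, here is a reconstruction of that control object. The ranking function in particular is my best guess at tks' approach, not the exact original:

```r
# RFE framework: caretFuncs + twoClassSummary + a custom glmnet-based ranking
glmnetFuncs <- caretFuncs
glmnetFuncs$summary <- twoClassSummary

glmnetFuncs$rank <- function(object, x, y) {
  # 'object' is the train() fit from the inner loop: pull the glmnet
  # coefficients at the best lambda and rank variables by coefficient size
  # (wrapping this in abs() would be the more conventional choice)
  coefs <- as.matrix(coef(object$finalModel, s = object$bestTune$lambda))[-1, 1]
  vimp <- data.frame(Overall = coefs, var = names(coefs))
  vimp[order(vimp$Overall, decreasing = TRUE), ]
}

RFEcontrol <- rfeControl(
  functions    = glmnetFuncs,
  method       = 'boot',
  number       = 25,           # 25 bootstrap resamples for the outer loop
  returnResamp = 'final',
  verbose      = TRUE
)
```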

Next, we have to set up the training parameters and multicore parameters, all of which are explained in my previous posts. In both loops of my RFE function I am using 25 repeats of bootstrap sampling, which you could turn up to 50 or 100 for higher accuracy at the cost of longer runtimes.
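Roughly, that setup looks like this (doMC is mac/linux only; Windows users can substitute doParallel):

```r
# Parallel backend for the resampling loops
library(doMC)
registerDoMC(cores = 2)

# Inner-loop training parameters: 25 bootstrap resamples, AUC via twoClassSummary
MyTrainControl <- trainControl(
  method          = 'boot',
  number          = 25,
  returnResamp    = 'final',
  classProbs      = TRUE,
  summaryFunction = twoClassSummary
)
```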


Now we get to actually run the RFE function and recursively eliminate features!
The structure of the RFE function is very similar to caret's train function. In fact, the inner loop of RFE is fitting a model using the train function, so we need to pass 2 sets of parameters to RFE: one for RFE and one for train. I have indented the parameters for the train function twice so you can see what's going on. After running RFE, we can access the variables it selected using RFE$optVariables, which we use to construct a formula we will use to fit the final model. We can also plot the RFE object for a graphical representation of how well the various subsets of variables performed.
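The original gist is gone, so the call below is a sketch: the subset sizes and the glmnet tuning grid are illustrative rather than the exact values I used.

```r
# Run RFE: the outer arguments control feature elimination; the double-indented
# ones are passed straight through to train() for the inner glmnet fits
RFE <- rfe(
  x          = trainset[, grep('^var_', names(trainset))],
  y          = trainset$Target,
  sizes      = seq(10, 100, by = 10),   # candidate subset sizes (illustrative)
  metric     = 'ROC',
  maximize   = TRUE,
  rfeControl = RFEcontrol,
    method    = 'glmnet',
    tuneGrid  = expand.grid(alpha = 1, lambda = c(0.001, 0.01, 0.1)),  # assumed grid
    trControl = MyTrainControl
)

RFE
plot(RFE)   # how well each subset size performed

# Build a formula from the selected variables for the final model
FL <- as.formula(paste('Target ~', paste(RFE$optVariables, collapse = ' + ')))
```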

Lastly, we fit our final model using caret's train function. This step is a little unnecessary, because RFE also fit a final model on the full dataset, but I included it anyway to try out a longer sequence of lambdas (lambda is a parameter for the glmnet model).
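A sketch of that final fit, with a longer (again illustrative) lambda sequence:

```r
# Final glmnet fit on the RFE-selected variables
model <- train(
  FL,
  data      = trainset,
  method    = 'glmnet',
  metric    = 'ROC',
  tuneGrid  = expand.grid(alpha = 1, lambda = seq(0, 0.05, by = 0.005)),
  trControl = MyTrainControl
)
model
```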

Because we fit our model on Target_Practice, we can score the model ourselves using colAUC. This model gets .91 on Target_Practice, and also scored about .91 on Target_Leaderboard and Target_Evaluate. Unfortunately, .91 was only good for 80th place on the leaderboard, as Ockham released the set of variables used for the leaderboard dataset, and I was unable to fit a good model using this information. Many other competitors were able to use this information to their advantage, but this yielded no edge on the final Target_Evaluate, which is what really mattered.
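For reference, the self-scoring step looks like this (colAUC is in the caTools package; the 'X1' level name comes from the setup sketch above):

```r
# Score our predictions on Target_Practice with colAUC
library(caTools)
probs <- predict(model, newdata = testset, type = 'prob')[, 'X1']
colAUC(probs, testset$Target)
```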

Overall, I'm very happy with the result of this competition. If anyone has any code they'd like to share that scored higher than .96 on the leaderboard, I'd be happy to hear from you!

  1. Hello, Zach. I've just read your article - very interesting. But I have some questions. What is the best way to contact you?

  2. Hi Dmitri. Feel free to post your questions in the comments, or you can email me at zach.mayer(at)gmail. I'm glad you enjoyed the post!

  3. I sent you an email. Did you get it?

  4. Yes, I got your email. I'll send a reply after I've had some time to compose it. The short answer is that I used the algorithm I did because, after an extended period of trial and error, it seemed to work. The recursive feature elimination I used has no theoretical basis that I know of, but happened to work on this dataset so I used it.

  5. Dear Zach,
    this is very impressive coding. Learned a lot from going through it line by line. Thank you.
    I am trying to apply it to a multinomial problem (Target has 6 levels), as glmnet is able to handle these problems. However, I need to change twoClassSummary, but unfortunately it is not working. Any hints?

    Best,

    Simon

    1. You'll need to write a multiClassSummary -- caret does not include one by default. One idea would be to calculate AUC for each class and report the average.


  2. My package caretEnsemble, for making ensembles of caret models, is now on CRAN.

    Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker)

  3. Update: The links to all my github gists on blogger are broken, and I can't figure out how to fix them.  If you know how to insert github gists on a dynamic blogger template, please let me know.

    In the meantime, here are instructions with links to the code:
    First of all, use homebrew to compile openblas.  It's easy!  Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary)

    To use openblas with R, follow these instructions:
    https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-openblas-sh

    To use veclib with R, follow these instructions:
    https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-veclib-sh

    OLD POST:

    Inspired by this post, I decided to try using OpenBLAS for R on my mac.  However, it turns out there's a simpler option, using the vecLib BLAS library, which is provided by Apple as part of the accelerate framework.

    If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:


    However, as noted in r-sig-mac, these instructions do not work for R 3.0.  You have to directly link to the accelerate framework's version of vecLib:


    Finally, test your new BLAS using this script:
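    The benchmark gist is gone; a rough stand-in (not the original script, whose timings are quoted below) is to time a large matrix cross-product under each BLAS:

    ```r
    # Rough stand-in for the benchmark: time a large matrix cross-product
    set.seed(1)
    n <- 5000
    A <- matrix(rnorm(n * n), n, n)
    system.time(B <- crossprod(A))   # run before and after switching BLAS
    ```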


    On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup.  If you plan to use vecLib, note the following warning from the R development team "Although fast, it is not under our control and may possibly deliver inaccurate results."

    So far, I have not encountered any issues using vecLib, but it's only been a few hours :-).

    UPDATE: you can also install OpenBLAS on a mac:

    If you do this, make sure to change the directories to point to the correct location on your system  (e.g. change /users/zach/source to whatever directory you clone the git repo into).  On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

  4. Here's a quick demo of how to fit a binary classification model with caretEnsemble.  Please note that I haven't spent as much time debugging caretEnsemble for classification models, so there are probably more bugs than in my last post.  Also note that multi-class models are not yet supported.






    Right now, this code fails for me if I try a model like a nnet or an SVM for stacking, so there's clearly bugs to fix.

    The greedy model relies 100% on the gbm, which makes sense as the gbm has an AUC of 1 on the training set.  The linear model uses all of the models, and achieves an AUC of .5.  This is a little weird, as the gbm, rf, SVM, and knn all achieve an AUC of close to 1.0 on the training set, and I would have expected the linear model to focus on these predictions. I'm not sure if this is a bug, or a failure of my stacking model.
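    Since the gist is gone, here is a minimal sketch of this kind of workflow using the caretList / caretEnsemble / caretStack interface that later shipped to CRAN; the interface and the simulated data are assumptions, not the original code:

    ```r
    # Minimal caretEnsemble classification sketch (current CRAN-style API)
    library(caret)
    library(caretEnsemble)

    set.seed(42)
    dat <- twoClassSim(1000)   # simulated 2-class data from caret

    # Shared resampling folds so the component models line up
    ctrl <- trainControl(
      method = 'cv', number = 5,
      classProbs = TRUE, summaryFunction = twoClassSummary,
      savePredictions = 'final',
      index = createFolds(dat$Class, k = 5, returnTrain = TRUE)
    )

    models <- caretList(
      Class ~ ., data = dat,
      trControl = ctrl, metric = 'ROC',
      methodList = c('glmnet', 'rpart', 'knn')
    )

    ens   <- caretEnsemble(models)   # weighted blend of the component models
    stack <- caretStack(models, method = 'glm', metric = 'ROC',
                        trControl = trainControl(method = 'cv', number = 5,
                                                 classProbs = TRUE,
                                                 summaryFunction = twoClassSummary))
    ens
    stack
    ```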


  5. I've written a new R package called caretEnsemble for creating ensembles of caret models in R.  It currently works well for regression models, and I've written some preliminary support for binary classification models.


    At this point, I've got 2 different algorithms for combining models:

    1. Greedy stepwise ensembles (returns a weight for each model)
    2. Stacks of caret models

    (You can also manually specify weights for a greedy ensemble)

    The greedy algorithm is based on the work of Caruana et al., 2004, and inspired by the medley package here on github.  The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.

    All the models in the ensemble must use the same training/test folds.  Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here's a brief script demonstrating how to use the package:
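    The demo gist is gone; a brief regression sketch with the caretList interface (the package API has evolved since this post, so treat this as illustrative) is:

    ```r
    # Brief caretEnsemble regression sketch on a built-in dataset
    library(caret)
    library(caretEnsemble)

    set.seed(42)
    # All component models share the same training folds
    folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)
    ctrl  <- trainControl(method = 'cv', index = folds, savePredictions = 'final')

    models <- caretList(
      mpg ~ ., data = mtcars,
      trControl  = ctrl,
      methodList = c('lm', 'rpart', 'glmnet')
    )

    ens   <- caretEnsemble(models)               # weighted combination of the models
    stack <- caretStack(models, method = 'glm')  # stack a second caret model on top
    ens
    stack
    ```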


    Please feel free to submit any comments here or on github.  I'd also be happy to include any patches you feel like submitting.  In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.

  6. The caret package for R now supports time series cross-validation!  (Look for version 5.15-052 in the news file).  You can use the createTimeSlices function to do time-series cross-validation with a fixed window, as well as a growing window.  This function generates a list of indexes for the training set, as well as a list of indexes for the test set, which you can then pass to the trainControl object.


    Caret does not currently support univariate time series models (like arima, auto.arima and ets), but perhaps that functionality is coming in the future?  I'd also love to see someone write a timeSeriesSummary function for caret that calculates error at each horizon in the test set and a createTimeResamples function, perhaps using the Maximum Entropy Bootstrap.

    Here's a quick demo of how you might use this new functionality:
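    The gist with the demo is gone; here is a small illustration along the same lines (the data and window sizes are arbitrary):

    ```r
    # createTimeSlices feeding trainControl via index/indexOut
    library(caret)

    y <- as.numeric(LakeHuron)            # a built-in univariate series
    X <- data.frame(lag1 = head(y, -1))   # one lagged predictor
    y <- tail(y, -1)

    slices <- createTimeSlices(y, initialWindow = 36, horizon = 12,
                               fixedWindow = TRUE)  # fixedWindow = FALSE grows the window

    ctrl <- trainControl(method = 'cv',
                         index = slices$train, indexOut = slices$test)

    fit <- train(X, y, method = 'lm', trControl = ctrl)
    fit
    ```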

  7. The caret package for R provides a variety of error metrics for regression models and 2-class classification models, but only calculates Accuracy and Kappa for multi-class models.  Therefore, I wrote the following function to allow caret:::train to calculate a wide variety of error metrics for multi-class problems:

    This function was prompted by a question on cross-validated, asking what the optimal value of k is for a knn model fit to the iris dataset.  I wanted to look at statistics besides accuracy and kappa, so I wrote a wrapper function for caret:::confusionMatrix and auc and logLoss from the Metrics package.  Use the following code to fit a knn model to the iris dataset, aggregate all of the metrics, and save a plot for each metric to a pdf file:
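    Both gists are gone, so the code below is a reconstruction rather than the original: it averages one-vs-all AUC via caTools::colAUC and computes multi-class log loss directly instead of calling the Metrics package, then uses the summary function to tune k:

    ```r
    # A multi-class summary function plus a knn fit on iris (reconstruction)
    library(caret)
    library(caTools)

    multiSummary <- function(data, lev = NULL, model = NULL) {
      out <- defaultSummary(data, lev, model)   # Accuracy and Kappa
      # Average one-vs-all AUC from the class-probability columns
      aucs <- sapply(lev, function(cl) colAUC(data[, cl], ifelse(data$obs == cl, 1, 0)))
      # Multi-class log loss on the probability matrix
      eps   <- 1e-15
      probs <- pmax(pmin(as.matrix(data[, lev]), 1 - eps), eps)
      obs   <- model.matrix(~ obs - 1, data = data)
      c(out, meanAUC = mean(aucs), logLoss = -mean(rowSums(obs * log(probs))))
    }

    set.seed(42)
    fit <- train(
      Species ~ ., data = iris, method = 'knn',
      tuneGrid = data.frame(k = seq(1, 25, by = 2)),
      metric = 'meanAUC',
      trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5,
                               classProbs = TRUE, summaryFunction = multiSummary)
    )
    plot(fit, metric = 'logLoss')
    ```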


    This demonstrates that, depending on what metric you use, you will end up with a different model.  For example, Accuracy seems to peak around k = 17:

    While AUC and logLoss seem to peak around k = 6:


    You can also increase the number of cross-validation repeats, or use a different method of re-sampling, such as bootstrap re-sampling.

  8. I finally got around to publishing my time series cross-validation package to github, and I plan to push it out to CRAN  shortly.

    You can clone the repo using github for mac, for windows, or linux, and then run the following script to check it out:
    This script downloads monthly data for the S&P 500 (adjusted for splits and dividends) and, for each month from 1995 to the present, fits a naive model, an auto.arima() model, and an ets() model to the past 5 years' worth of data, then uses those models to predict S&P 500 prices for the next 12 months.  (Note that the progress bar doesn't update if you register a parallel backend; I can't figure out how to fix this bug.)
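    The gist is gone; here is a rough reconstruction using quantmod and the forecast package directly rather than my cv.ts code (the start date and the use of mean absolute error are my choices):

    ```r
    # Rolling 5-year window, 12-month-ahead forecasts of monthly S&P 500 prices
    library(quantmod)
    library(forecast)

    getSymbols('^GSPC', from = '1990-01-01')
    y <- as.numeric(Ad(to.monthly(GSPC)))   # monthly, adjusted for splits/dividends

    win     <- 60   # fit to the past 5 years of monthly data
    horizon <- 12   # forecast the next 12 months
    origins <- seq(win, length(y) - horizon)   # forecast origins from 1995 onward

    mae <- sapply(origins, function(i) {
      train  <- ts(y[(i - win + 1):i], frequency = 12)
      actual <- y[(i + 1):(i + horizon)]
      c(naive = mean(abs(naive(train, h = horizon)$mean - actual)),
        arima = mean(abs(forecast(auto.arima(train), h = horizon)$mean - actual)),
        ets   = mean(abs(forecast(ets(train), h = horizon)$mean - actual)))
    })
    rowMeans(mae)   # average MAE per model across all forecast origins
    ```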


    The naive model outperforms the arima and exponential smoothing models, both of which take into account seasonal patterns, trends, and mean-reversion!  Furthermore, we're not just using any arima/exponential smoothing model: at each step we're selecting the best model, based on the last 5 years' worth of data.  (The ets model slightly outperforms the naive model at the 3-month horizon, but not at the 2-month or 4-month horizons.)

    Forecasting equities prices is hard!

  9. UPDATE: a better parallel algorithm will be included in a future version of DEoptim, so I've removed my package from CRAN.  You can still use the code from this post, but keep Josh's comments in mind.


    Last night I was working on a difficult optimization problem, using the wonderful DEoptim package for R. Unfortunately, the optimization was taking a long time, so I thought I'd speed it up using a foreach loop, which resulted in the following function:


    Here's what's going on: I divide the bounds for each parameter into n segments, and use a foreach loop to run DEoptim on each segment, collect the results of the loop, and then return the optimization results for the segment with the lowest value of the objective function.  Additionally, I defined a "parDEoptim" class to make it easier to combine the results during the foreach loop.  All of the work is still being done by the DEoptim algorithm.  All I've done is split up the problem into several chunks.

    Here is an example, straight out of the DEoptim documentation:
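    Since both gists are gone, here is a hedged sketch of the idea (splitting only the first parameter's range, for simplicity, rather than the full parDEoptim class) together with the Rosenbrock example from the DEoptim documentation:

    ```r
    # Split the search space into segments, run DEoptim on each in parallel,
    # and keep the run with the lowest objective value
    library(DEoptim)
    library(foreach)
    library(doParallel)
    registerDoParallel(cores = 4)

    parDEoptim <- function(fn, lower, upper, n = 4, control = DEoptim.control()) {
      breaks <- seq(lower[1], upper[1], length.out = n + 1)   # segment the first parameter
      results <- foreach(i = seq_len(n), .packages = 'DEoptim') %dopar% {
        lo <- lower; hi <- upper
        lo[1] <- breaks[i]
        hi[1] <- breaks[i + 1]
        DEoptim(fn, lower = lo, upper = hi, control = control)
      }
      best <- which.min(sapply(results, function(r) r$optim$bestval))
      results[[best]]
    }

    # Rosenbrock banana function, with its global minimum at c(1, 1)
    Rosenbrock <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2

    res <- parDEoptim(Rosenbrock, lower = c(-10, -10), upper = c(10, 10), n = 4,
                      control = DEoptim.control(itermax = 200, trace = FALSE))
    res$optim$bestmem
    ```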


    In theory, on a 20-core machine, this should run a bit faster than the serial example.  Note that you may need to set itermax for the parallel run at a higher value than (itermax for the serial run)/(number of segments), as you want to make sure the algorithm can find the minimum of each segment.  Also note that, in this example, there are 20 segments on the interval c(-10,-10) to c(10,10), which means that 2 of the segments have boundaries at c(1,1), which is the global minimum of the function.  The DEoptim algorithm has no trouble finding a solution at the boundary of the parameter space, which is why it's so easy to parallelize.

    Rumor has it that the next version of DEoptim will include foreach parallelization, but if you can't wait until then, I rolled up the above function into an R package and posted it to CRAN.  Let me know what you think!

  10. This is a quick post on the importance of benchmarking time-series forecasts.  First we need to reload the functions from my last few posts on time-series cross-validation.  (I copied the relevant code at the bottom of this post so you don't have to find it.)

    Next, we need to load data for the S&P 500.  To simplify things, and allow us to explore seasonality effects, I'm going to load monthly data, back to 1980.




    The object "Data" has monthly closing prices for the S&P 500 back until 1980.  Next, we cross validate 3 time series forecasting models:  auto.arima, from the forecast package,  a mean forecast, that returns the mean value over the last year, and a naive forecast, which assumes the next value of the series will be equal to the present value.  These last 2 forecasts serve as benchmarks, to help determine if auto.arima would be useful for forecasting the S&P 500.  Also note that I'm using BIC as a criteria for selecting arima models, and I have trace on so you can see the results of the model selection process.


    After the 3 models finish cross-validating, it is useful to plot their forecast errors at different horizons.  As you can see, auto.arima performs much better than the mean model, but is consistently worse than the naive model.  This illustrates the importance of benchmarking forecasts: if you can't consistently beat a naive forecast, there's no reason to waste processing power on a useless model.



    Finally, here is all the code in one place.  Note that you can parallelize the cv.ts function by loading your favorite foreach backend.

