Practical tools for predictive modeling, data science, machine learning and web scraping
Wednesday, March 13, 2013
New package for ensembling R models
I've written a new R package called caretEnsemble for creating ensembles of caret models in R. It currently works well for regression models, and I've written some preliminary support for binary classification models.
At this point, I've got 2 different algorithms for combining models:
1. Greedy stepwise ensembles (returns a weight for each model)
2. Stacks of caret models
(You can also manually specify weights for a greedy ensemble)
The greedy algorithm is based on the work of Caruana et al., 2004, and inspired by the medley package on GitHub. The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.
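To make the stacking idea concrete, here is a minimal sketch in base R. It is not the caretEnsemble implementation: the simulated data, the two lm base models, and the names oof and stack_fit are all assumptions made for illustration. The point is only that the second-level model is fit to out-of-fold predictions from the base models.

```r
# Minimal stacking sketch in base R (illustrative only, not caretEnsemble code).
set.seed(42)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 2 * dat$x2 + rnorm(n, sd = 0.2)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# One fold assignment, shared by every base model
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Collect out-of-fold predictions from two hypothetical base models
oof <- data.frame(linear = numeric(n), cubic = numeric(n))
for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  m_lin <- lm(y ~ x1 + x2, data = train)
  m_cub <- lm(y ~ poly(x1, 3) + x2, data = train)
  oof$linear[fold == i] <- predict(m_lin, newdata = test)
  oof$cubic[fold == i]  <- predict(m_cub, newdata = test)
}
oof$y <- dat$y

# Second-level model: the base models' predictions are its inputs
stack_fit <- lm(y ~ linear + cubic, data = oof)

base_rmse  <- c(linear = rmse(oof$linear, oof$y), cubic = rmse(oof$cubic, oof$y))
stack_rmse <- rmse(fitted(stack_fit), oof$y)  # in-sample for the stacker, so optimistic
print(base_rmse)
print(stack_rmse)
```

The stacker's error here is measured on the same rows it was fit to, so a fair comparison would need a separate test set.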
All the models in the ensemble must use the same training/test folds. Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here's a brief script demonstrating how to use the package:
Please feel free to submit any comments here or on github. I'd also be happy to include any patches you feel like submitting. In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.
Great news...
Looks like Max Kuhn started to work on this too (package FuseBox).
Instead of building from scratch it would be nice to have a unified interface.
Can you contact him to continue what's started here : https://r-forge.r-project.org/R/?group_id=1036 ?
source here : https://r-forge.r-project.org/scm/viewvc.php/?root=fusebox
Hi dickoa,
I hope you find my code useful! It looks like FuseBox is primarily intended for weighted averages of classification models. I wanted to include regression models, and additionally allow non-linear methods of creating ensembles.
If Max would like me to add a function for greedy weight selection to FuseBox, I'd be happy to, but I'm not sure I want to take over maintenance of the entire package.
Thank you Zach...your package is very useful.
I think that in one way or another, Kaggle will have an impact on the R ecosystem, and what you are doing is excellent.
I'll probably fork your repos on github to see how things evolve and hope that at the end, we will have a robust package for ensemble modeling.
Cheers
I'm happy to hear that! I totally agree, Kaggle will make a significant impact on the R ecosystem.
As soon as you feel like you've made improvements (or significant bug fixes) I'd be happy to add you as a collaborator to caretEnsemble.
Wow, I've been looking for a good way of ensembling models in R and just as I couldn't put it off any longer your project shows up; thanks a lot!
I'm particularly interested in calibrating a blend of random forests and gradient boosted tree models. Any suggestions on the best way of proceeding? Cheers
I'm glad to hear that! Are you working on a classification problem, or a regression problem? Note that classification might be a little bit buggier with this code.
I would probably train a few GBMs with different settings for interaction.depth and shrinkage, and a few RFs with different settings for .mtry. Make sure you use the same CV folds for each model.
Then try a greedy ensemble (which finds a weighted average) and a caretEnsemble, perhaps using ridge regression or a gam. Test both ensembles (and your original models) on a test set, and see which performs the best.
You could also take this one step further and do nested cross-validation. If you do that, it'd be awesome if you could share your code.
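One way to guarantee that every model sees the same CV folds is to build the fold indices once and reuse them. The sketch below does this in base R. The helper make_shared_folds is hypothetical and written only for illustration. With caret, createMultiFolds() returns a list of the same shape (training-row indices per resample), and passing that one list as trainControl(index = ...) to every train() call keeps the folds aligned.

```r
# Build repeated k-fold *training* indices once, then reuse them for every model.
# make_shared_folds() is a hypothetical helper, not part of caret or caretEnsemble.
make_shared_folds <- function(n, k = 5, times = 2, seed = 1) {
  set.seed(seed)
  folds <- list()
  for (r in seq_len(times)) {
    # Random assignment of each row to one of k folds for this repeat
    assignment <- sample(rep(seq_len(k), length.out = n))
    for (i in seq_len(k)) {
      # Store the rows used for training (everything except fold i)
      folds[[sprintf("Fold%d.Rep%d", i, r)]] <- which(assignment != i)
    }
  }
  folds
}

shared_index <- make_shared_folds(n = 100, k = 5, times = 2)

# The held-out rows for any resample are simply the complement of the training rows
held_out_first <- setdiff(seq_len(100), shared_index[[1]])
print(names(shared_index))
print(length(held_out_first))
```

Because the list is created once, the out-of-sample predictions from different models line up row for row, which is what the ensembling step needs.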
Hi. I am doing some hydrologic prediction and hoped that ensembles would help me improve it, but although I have tried a bunch of combinations, I always get approximately the same results as with an SVM alone. These problems have, e.g., 5000 rows and 40 variables, and the task is prediction, basically regression. Do you think it is worth continuing to try ensembles, or are ensembles meant for bigger problems than mine, so that their strength cannot be fully revealed on smaller ones? I like the ideas behind caretEnsemble and would like to try it. I hope I will use it correctly.
Hi,
I hope you get a chance to try my package and give me your feedback! Unfortunately, I haven't had much time to write a new-user guide, so you're sort of on your own to figure it out. If you follow the example I posted above, with your own data, you should be good to go.
How have you been creating your ensembles? It's possible that the SVM is far superior to the other models, so they don't add much new information to the ensemble. You could try fitting a bunch of different SVMs with different parameter combinations and different kernels (linear, poly, radial), and make an SVM-only ensemble. That could improve your results.
-Zach
Great package! I've been searching for a stacking package, and this is it!
I'd like to try your package on a classification problem. Could you show me some classification examples?
I wrote up a quick classification example here:
http://moderntoolmaking.blogspot.com/2013/03/caretensemble-classification-example.html
Please note that I currently only support binary classification models; multi-class models are a long way off. Also note that I haven't spent nearly enough time debugging my ensembler and stacker for classification models, so there are likely a lot more errors in my classification code. Feel free to submit bug reports and patches.
Thanks for your quick response :)
No problem! Please send me your feedback on classification models. I suspect there are more bugs there than in the regression mode.
Hello Zach
Thanks for this package.
Is there a reason for your selection of models in this example?
If not, could you suggest your most used or most successful models for regression from the selection within the caret package. I know that it's probably problem dependent but there must be a best trial ordering of models (i.e. best X to try in the ensemble), if averaged over a large set of diverse regression problems. Maybe your experience with caret tells you this.
Hi Daniel,
There's no real reason behind my choice of models in the example, except that I tried to use a pretty wide variety (some tree based, some regression based, a neural network, a support vector machine). It's very hard to tell ahead of time what models will perform best on a given problem. I had a dataset I was working on last night where a logistic regression far outperformed a random forest!
I would say at the least start with a linear regression and a random forest. Then add something completely different, like a support vector machine or a neural network. I've also had a lot of luck with gbm's, but they tend to have similar performance to random forests and are harder to tune.
Sorry I can't be more specific, but you really just need to try a bunch of models before you can get a feel for which ones work on your dataset. Again, try to use models that are very different from each other. E.g. random forest, blackboost, and gbm are all tree based, so pick one of the three. glm and glmnet are different types of linear regression, so pick one of the two. Etc.
-Zach
Thanks Zach, that helps.
Continuing this, my training set is quite large and trying multiple combinations of models is prohibitive time-wise. So, I'm thinking about trying an approach where I subset the training data (e.g. use 1/4 of it), train the diverse models you suggest on that, get a CV error for each, and cull models that perform badly on the smaller training set before I scale up to the full set. Do you see any problems with this approach?
If I'm not mistaken, when using ensembles, the predictive ability of the individual models is important, but so is the difference between model predictions (say, measured by Pearson's correlation between hold-out sets). So it might be preferable to include outlier models even if they are less predictive. Is that right? I wouldn't know how to trade off uniqueness against predictive ability when culling models here, though.
With your greedy and stack methods, can including additional models ever worsen performance or will they just be weighted down to insignificance?
That seems like a good approach to me. I think it makes sense to try to find a diverse set of GOOD models. E.g. spend some time tuning your neural network and your random forest, and see if you can get a good performance from both. Then ensemble those 2 models.
Additional models should never make the greedy method worse, as it starts with the best model and only adds models that improve the ensemble from there (so you might end up with a single model in the ensemble). Additional models in the caret stacker can cause it to over-fit.
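The "start with the best model, only add what helps" behaviour can be sketched in a few lines of base R. This is an illustration under assumptions, not the package's actual code: the function name greedy_weights, the stopping rule (stop at the first step with no improvement), and the simulated predictions are all made up for the example. Models are selected with replacement, so a model picked several times gets a larger weight.

```r
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Greedy forward selection with replacement over a matrix of out-of-sample
# predictions (one column per model). Hypothetical helper for illustration.
greedy_weights <- function(preds, obs, max_iter = 100) {
  preds <- as.matrix(preds)
  errs <- apply(preds, 2, rmse, obs = obs)
  counts <- setNames(integer(ncol(preds)), colnames(preds))

  # Start from the single best model
  best <- which.min(errs)
  counts[best] <- 1L
  running_sum <- preds[, best]
  best_err <- errs[[best]]

  for (iter in seq_len(max_iter)) {
    n_sel <- sum(counts)
    # Error of the averaged ensemble if each candidate were added once more
    cand_err <- apply(preds, 2, function(p) rmse((running_sum + p) / (n_sel + 1), obs))
    j <- which.min(cand_err)
    if (cand_err[[j]] >= best_err) break  # assumed stopping rule: no improvement
    counts[j] <- counts[j] + 1L
    running_sum <- running_sum + preds[, j]
    best_err <- cand_err[[j]]
  }
  list(weights = counts / sum(counts), rmse = best_err)
}

# Simulated out-of-sample predictions: two useful models and one pure-noise model
set.seed(1)
obs <- rnorm(300)
preds <- cbind(a = obs + rnorm(300, sd = 0.5),
               b = obs + rnorm(300, sd = 0.6),
               c = rnorm(300))
fit <- greedy_weights(preds, obs)
print(fit$weights)
print(fit$rmse)
```

By construction the ensemble error on these predictions can only stay equal to or drop below that of the best single model, which is the property described above. Whether it holds up on new data is a separate question.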
Hi Zach,
The caretEnsemble package looks nice. For stacking, you might want to also check out the SuperLearner R package:
CRAN: http://cran.r-project.org/web/packages/SuperLearner/index.html
GitHub: https://github.com/ecpolley/SuperLearner
Thanks for the suggestion! caretEnsemble was definitely inspired by the medley and SuperLearner packages, but I wanted to use caret models as my base and I wanted to allow greedy or caret-based ensembles. I use caret for almost all of my modeling, so it made sense to me to use it for my ensembles as well.
Hi Zach,
This package looks very convenient. I've also read the paper and implemented several of the ensemble algorithms it mentions. It seems that your code is a little different from the paper. The weighting procedure in the paper is achieved by selecting one model multiple times. And yours does not seem to include the bagging and initialization variations of the ensemble model, which are there to avoid overfitting.
I think these are useful as well. Maybe you would like to add them to your package later.
Hi,
Thanks for the feedback. My greedyEnsemble function WILL select one model multiple times, but it currently does not include bagging or the initial variation. I'm working on some code to implement this efficiently, but haven't had time to wrap it up yet. If you'd like to submit a patched version of the function, I'd be happy to include it.
Also, caretEnsemble does not include bagging, and I don't have any idea how to efficiently bag caret models. If you have any suggestions there, I'd be happy to hear them.
Thanks,
Zach
Hi Zach!
Can one use another metric such as RMSLE? If yes, how should one do it?
RMSLE is easy: just take the log of your target variable, and then use RMSE =D. For arbitrary error metrics, use caretEnsemble, and specify the summaryFunction in trainControl.
OK! Something like this:
myControl <- trainControl(method='cv', number=folds, repeats=repeats, returnResamp='none',
                          returnData=FALSE, savePredictions=TRUE,
                          verboseIter=TRUE, allowParallel=TRUE,
                          summaryFunction=RMSE(log(X[,1]),log(Y)),
                          index=createMultiFolds(Y[train], k=folds, times=repeats))
Where X[,1] is cmedv
and Y =BostonHousing2$cmedv
Is my formulation correct if we take your example?
That works. If you want to use greedyEnsemble, you could also do something like this:
Y[train] <- log(Y[train])
Then any model you fit that minimizes RMSE will actually be minimizing RMSLE. (Don't forget to exp() your final predictions).
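A small base R check of the log-transform trick, plus a sketch of a custom summary function. The simulated data and the names rmsle and rmsle_summary are assumptions for illustration. caret documents summaryFunction as a function taking (data, lev, model), where data has obs and pred columns, and returning a named numeric vector, so rmsle_summary is written in that shape. It assumes strictly positive targets and predictions, since it uses log() as in the comments above.

```r
rmse  <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmsle <- function(pred, obs) sqrt(mean((log(pred) - log(obs))^2))

# Hypothetical summary function in the (data, lev, model) shape caret expects.
# With caret it could be passed as trainControl(summaryFunction = rmsle_summary),
# together with train(..., metric = "RMSLE", maximize = FALSE).
rmsle_summary <- function(data, lev = NULL, model = NULL) {
  c(RMSLE = rmsle(data$pred, data$obs))
}

# Simulated, strictly positive target
set.seed(7)
y <- exp(rnorm(50, mean = 3, sd = 0.4))
x <- log(y) + rnorm(50, sd = 0.1)

# Fit on the log scale, so least squares minimizes RMSE of log(y)
fit <- lm(log(y) ~ x)
pred_log <- fitted(fit)
pred <- exp(pred_log)  # back-transform the final predictions

print(rmse(pred_log, log(y)))  # RMSE on the log scale
print(rmsle(pred, y))          # RMSLE on the original scale: the same quantity
print(rmsle_summary(data.frame(obs = y, pred = pred)))
```

The two printed errors agree because RMSLE on the original scale is, by definition, RMSE computed after taking logs.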
I am playing with your code and would like to ask about one detail, although it is more of a caret question than a caretEnsemble one. When the demo code specifies pre-processing for a model, the training data are scaled, so there is a center and a scale for every variable. When the models and ensembles are then run on test data, are the same variables in the test set scaled with the same center and scale parameters? Does caret have some mechanism that remembers these values? I think that outside caret I would have to handle this myself while coding.
When you pass a pre-processing argument to caret's "train" function, it remembers all the pre-processing parameters, including the center and scale parameters.
These parameters get applied to new data before it is scored by the model. This way, caret can correctly incorporate the entire modeling pipeline (imputing NAs, centering, scaling, transforming, and even PCA). This allows you to include the pre-processing in your cross-validation, and also apply the correct transformations to new data.
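Outside caret, the same idea looks roughly like this in base R: estimate the center and scale on the training data only, store them, and reuse them on new data. The simulated matrices and variable names are assumptions for illustration.

```r
set.seed(3)
train_x <- matrix(rnorm(200, mean = 10, sd = 4), ncol = 2,
                  dimnames = list(NULL, c("a", "b")))
test_x  <- matrix(rnorm(40, mean = 10, sd = 4), ncol = 2,
                  dimnames = list(NULL, c("a", "b")))

# Estimate the pre-processing parameters on the training data only
centers <- colMeans(train_x)
scales  <- apply(train_x, 2, sd)

# Apply the *training* parameters to both sets
train_scaled <- scale(train_x, center = centers, scale = scales)
test_scaled  <- scale(test_x,  center = centers, scale = scales)

print(colMeans(train_scaled))  # approximately 0 by construction
print(colMeans(test_scaled))   # close to 0, but generally not exactly 0
```

The test columns are not exactly centered after the transformation, and that is expected: they are shifted by the training means, not their own.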
Thank you. By the way, in a previous post I expressed my doubt that ensemble modelling would give me better results than a properly trained SVM. I was wrong: I get quite significantly better results with a caretEnsemble model. One question: do you think that the weights of the individual models could also be selected by some heuristic methodology, such as genetic algorithms or something similar? (I want to try so-called harmony search, which is here: https://sites.google.com/site/fesangharyweb/downloads ).
You can use any method you want to find the weights. Note, however, that a genetic algorithm would risk overfitting, which greedy ensembling seems to avoid. Also, I'm not sure I would call a genetic algorithm a heuristic; it's probably a lot more work than a simple, greedy ensemble!
I am trying to understand what the training data for the greedy algorithm (the greedOptRMSE function) are. I had been thinking that the columns of the input matrix to this function are the outputs Y' of the ensemble members, each computed with that model's best parameters after training on X (X and Y being the training data). That is, each model is trained with its best parameters on X and Y, run once with X as input, and its output Y' becomes one column of the matrix for the greedy algorithm. But I think I was wrong, and maybe it works like this instead:
1. Caret's model tuning finds the best parameters by repeated cross-validation.
2. Then (in the case of 10-fold cross-validation) 9 folds are used again for training with these best parameters and 1 fold is predicted as a test set (this is repeated 10 times).
3. The values predicted on these test folds with the best parameters are used as the input matrix for the greedy search for weights.
4. Because I used repeated cross-validation (5 times), this is repeated 5 times.
5. So the input to the greedy algorithm is de facto test data (from the training set, via cross-validation).
I would be happy if you let me know if I am correct.
Thanks,
Milan