Friday, July 22, 2011

Parallel random forests using foreach

There's been some discussion on the Kaggle forums and on a few blogs about various ways to parallelize random forests, so I thought I'd add my thoughts on the issue.

Here's my version of the 'parRF' function, which is based on the elegant version in the foreach vignette:
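For reference, here is a minimal sketch of the function, matching the version a reader quotes in the comments below. It assumes X (predictors) and Y (response) already exist in the calling environment; the iris data and the sequential backend are stand-ins so the snippet runs on its own.

```r
library(foreach)
library(randomForest)

# Assumption: X and Y live in the calling environment.
# iris is only a stand-in dataset, not the data from the post.
X <- iris[, -5]
Y <- iris[, 5]

# Fit one forest per element of x (each element is an mtry value) and merge
# the fits with randomForest::combine. Extra arguments go to randomForest().
multiRF <- function(x, ...) {
  foreach(i = x, .combine = randomForest::combine, .packages = "randomForest",
          .export = c("X", "Y"), .inorder = FALSE) %dopar% {
    randomForest(X, Y, mtry = i, ...)
  }
}

# Sequential fallback so the sketch runs anywhere; register a real
# parallel backend instead to spread the fits across workers.
registerDoSEQ()
rf <- multiRF(c(2, 3), ntree = 50)
```

Note that randomForest::combine drops the out-of-bag error summaries (err.rate, confusion) from the merged object, so error estimates have to be computed separately.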

This function works very simply: you pass it a vector of mtry values, and it fits a random forest using each of those values and returns the combined result. You can also pass any additional parameters you want (like ntree) to the randomForest function.

I think this function provides 2 improvements over previous implementations. #1 is that you can use any parallel backend you want. doRedis is my current favorite, as it's cross-platform and fault-tolerant and lets me commandeer idle laptops around the house/office when a random forest is taking too long to fit. #2 is the argument .inorder=FALSE in the foreach function, which provides a small performance improvement as it lets R combine the random forests as they finish, rather than forcing R to combine them in the order they start.
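To make the backend point concrete, here is a hedged sketch that registers a local doParallel cluster before running a foreach loop with .inorder=FALSE. doParallel, the 2-worker cluster size, and the iris data are illustrative choices; any registered backend (doRedis included) would slot in at the registration step.

```r
library(foreach)
library(doParallel)
library(randomForest)

# Stand-in data; not the dataset from the post.
X <- iris[, -5]
Y <- iris[, 5]

# Start 2 local workers (an arbitrary choice) and register them as the backend.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)
n_workers <- getDoParWorkers()

# Four forests of 100 trees each, merged in whatever order they finish.
rf <- foreach(m = rep(2, 4), .combine = randomForest::combine,
              .packages = "randomForest", .inorder = FALSE) %dopar% {
  randomForest(X, Y, mtry = m, ntree = 100)
}

# Shut the workers down and fall back to sequential execution.
parallel::stopCluster(cl)
registerDoSEQ()
```

The loop itself never mentions the backend, which is what makes swapping it possible.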

Let's say you want a random forest with 5000 trees. The default value for ntree is 500, so we use rep(4,10) as the argument for the function: ten forests of 500 trees each, all with mtry=4.

Maybe we're unsure of the optimal mtry value, and want to combine 2 ensembles of 2500 trees. Then we use the argument c(rep(3,5),rep(4,5)). This gives us 2500 trees with mtry=3 and 2500 with mtry=4. I like to think of this as a sort of meta-ensemble of decision trees, but I've yet to see it improve my predictive accuracy.
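The two calls described above can be sketched as follows. This assumes the multiRF definition quoted in the comments, uses iris as a stand-in dataset, and runs sequentially so it works without a parallel backend.

```r
library(foreach)
library(randomForest)

# Stand-in data; iris has 4 predictors, so mtry = 3 and mtry = 4 are both valid.
X <- iris[, -5]
Y <- iris[, 5]
registerDoSEQ()  # sequential fallback for the sketch

multiRF <- function(x, ...) {
  foreach(i = x, .combine = randomForest::combine, .packages = "randomForest",
          .export = c("X", "Y"), .inorder = FALSE) %dopar% {
    randomForest(X, Y, mtry = i, ...)
  }
}

# Ten forests of 500 trees (the ntree default), all with mtry = 4.
rf_single <- multiRF(rep(4, 10))

# Five forests with mtry = 3 plus five with mtry = 4: 2500 trees of each.
rf_mixed <- multiRF(c(rep(3, 5), rep(4, 5)))

# The merged objects report the total number of trees.
rf_single$ntree
rf_mixed$ntree
```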

At the very least, this can help with those damn 'out of memory' errors I've been getting on my laptop when fitting random forests to large datasets.

13 comments:

  1. Hi Zach - Thanks for the snippet. One question though. Are the individual forests randomized among clusters ? For example, if the individual clusters start with same random seed, won't the trees be simply duplicated ?

  2. Hi Karmic,

    That's a really good question, and since I'm not quite sure how R generates random numbers, I can't really answer it. I suspect that each cluster will end up with a different random seed, as the computations will start at slightly different times on different machines. However, there could be problems if the random seeds are correlated, and I'm not sure how to control for this.

    Your best bet is probably to use the doRNG package, to make sure each node of your cluster gets an independent stream of random numbers to feed to the randomForest function.
    http://cran.r-project.org/web/packages/doRNG/doRNG.pdf

    -Zach

  3. Hi Zach

    Looks like doRNG needs doMC, which isn't available on Windows :( I will need to look for something else.

    Karmic.Menace

    Replies
    1. I am currently running doRNG on Windows 7. Update R and update your packages and try again.

  4. Hi Zach,

    my question is actually about caret and RF, since i found a few of your replies online stating you are using caret.
    i see the parRF and rf model in caret's train(). what i fail to be able to do is use the final model to predict anything.

    e.g.:
    predict(model$finalModel, newdata=test)

    this would result in "variables in the training data missing in newdata". that seems rather clear (it seems to split a factor variable into 3 variables in the final model) but i have no idea if caret actually provides a method to transform my test set somehow or am i missing something else completely?

    thank you

    Replies
    1. Hi Tom,

      I've found 2 things make working with randomForests much easier:
      1. Convert all your factors to dummies using model.matrix PRIOR to running caret.
      2. Make sure your newdata has EXACTLY the same column names and order as your original dataset.

      I'd first check that all(names(newdata)==names(data)). If that's true, then try converting your factors into dummies using model.matrix.
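A minimal sketch of both suggestions, using made-up toy data (the column names size and band are hypothetical, not taken from the commenter's dataset). The key assumption is that the new data's factor is built with the training levels, so model.matrix produces the same dummy columns for both.

```r
# Toy training data with one numeric column and one factor (hypothetical names).
train_df <- data.frame(
  size = c(1.2, 3.4, 2.2, 5.1),
  band = factor(c("Low", "High", "Medium", "Low"))
)

# New data: reuse the training levels so no dummy column goes missing,
# even though "Medium" never appears here.
new_df <- data.frame(
  size = c(2.0, 4.0),
  band = factor(c("High", "Low"), levels = levels(train_df$band))
)

# Expand factors into dummy columns; "- 1" drops the intercept column.
X_train <- model.matrix(~ . - 1, data = train_df)
X_new   <- model.matrix(~ . - 1, data = new_df)

# Same column names in the same order, as recommended above.
same_cols <- identical(colnames(X_train), colnames(X_new))
```

If the new data's factor were created without the shared levels, X_new would lack the bandMedium column and prediction would fail with a missing-variable error.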

    2. Hi Zach,

      1) will look into it. so far i have simply done as.numeric().
      2) they are exactly the same. all(names(newdata)==names(data)) returns TRUE. if i use randomForest() directly it works.
      here is the code I am talking about. https://www.refheap.com/paste/055e5e1d23e1769cddb66e6e3
      any ideas?

    3. Hi Tom,

      If I'm going to help you debug your code, you're going to need to give me a fully reproducible example.

      I suspect the issue is the "" level in "UsageBand" which does not parse into a valid R column name. You could try replacing this level, but I don't know for sure.

      Also try converting your entire data.frame using model.matrix, and running train with the non-formula interface: train(X, Y), where X is your model matrix, and Y is your target.

      -Zach

    4. Zach,

      the "" level was not the problem, anyways, i gave up on it and went with the model.matrix. Thank you very much for your help and offer to debug my code, highly appreciated.

      Thomas

  5. Hello Zach, all,
    Thanks, this is extremely interesting to me, if I can get it to work.

    Please can you tell me if I am doing something obviously wrong? I installed doRedis first and then copied and pasted the above code;

    library(doRedis)
    > multiRF <- function(x,...) {
    + foreach(i=x,.combine=combine,.packages='randomForest',
    + .export=c('X','Y'),.inorder=FALSE) %dopar% {
    + randomForest(X,Y,mtry=i,...)
    + }
    + }
    > multiRF(c(rep(3,10),rep(4,10),rep(5,10)),ntree=500)
    randomForest 4.6-7
    Type rfNews() to see new features/changes/bug fixes.

    Attaching package: ‘randomForest’

    The following object(s) are masked from ‘package:Biobase’:

    combine

    The following object(s) are masked from ‘package:BiocGenerics’:

    combine

    Error in { : task 1 failed - "object 'X' not found"
    In addition: Warning message:
    executing %dopar% sequentially: no parallel backend registered

    Please let me know what the problem is.

    Thank you.

    J.

    Replies
    1. It looks like you never created an X and Y object. You need data to fit the model on! Try this:

      X <- iris[,-5]
      Y <- iris[,5]

      Also, you need to register a parallel back end. Just loading the library won't make it run in parallel.

  6. Dear Zach,
    Thanks for the advice. This self contained version should be ok?

    library(doParallel)
    library(foreach)
    library(randomForest)

    data(iris)

    X <- iris[,-5]
    Y <- iris[,5]

    multiRF <- function(x,...) {
      foreach(i=x,.combine=randomForest::combine,.packages='randomForest',
              .export=c('X','Y'),.inorder=FALSE) %dopar% {
        randomForest(X,Y,mtry=i,...)
      }
    }

    registerDoParallel(cores=1)
    system.time(output<-multiRF(c(rep(3,10),rep(4,10),rep(5,10)),ntree=500))
    # user system elapsed
    # 5.40 0.22 6.48

    registerDoParallel(cores=8)
    system.time(output<-multiRF(c(rep(3,10),rep(4,10),rep(5,10)),ntree=500))
    # user system elapsed
    # 4.29 0.22 5.45

    I would be interested to know why the X and Y variables are exported. Could they not be passed in like the ntree variable? I am learning about randomForest with respect to predicting disease stages using gene expression data, so I was interested in your posts when I stumbled on them.

    J.

    Replies
    1. Hi John,

      That's a good point. Here's an improved version of the multiRF function. I've made 2 updates:

      1. X and Y are now arguments for the function.
      2. I changed from %dopar% to %dorng%. This makes the random forests reproducible, and also makes sure the random numbers used on each core won't be correlated.

      https://gist.github.com/zachmayer/5554373
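A rough sketch of what those two updates could look like. This is not copied from the gist: the argument names (mtry_values, ntree) are assumptions for illustration, iris is a stand-in dataset, and the sequential backend is used only so the snippet runs anywhere.

```r
library(foreach)
library(doRNG)
library(randomForest)

# X and Y are now arguments, so nothing needs to be exported by name.
# %dorng% gives each iteration its own reproducible random-number stream.
multiRF <- function(X, Y, mtry_values, ntree = 500) {
  foreach(m = mtry_values, .combine = randomForest::combine,
          .packages = "randomForest", .inorder = FALSE) %dorng% {
    randomForest(X, Y, mtry = m, ntree = ntree)
  }
}

registerDoSEQ()  # sequential fallback; any registered backend should work

# Same seed, same forests: the two fits give matching class probabilities.
set.seed(42)
rf_a <- multiRF(iris[, -5], iris[, 5], c(2, 3), ntree = 50)
set.seed(42)
rf_b <- multiRF(iris[, -5], iris[, 5], c(2, 3), ntree = 50)

reproducible <- isTRUE(all.equal(
  predict(rf_a, iris[, -5], type = "prob"),
  predict(rf_b, iris[, -5], type = "prob")
))
```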

      -Zach

