Forest and Boosted Tree Prediction Speed on Hadoop
- By default, rxPredict launches one MapReduce (MR) job per tree to minimize memory usage
- For small data sets, call rxPredict inside rxExec, or set scheduleOnce=TRUE (available in 7.3), to reduce scheduling overhead
- For larger data sets, set scheduleOnce=1 to run prediction in parallel in a single MR job (available in 7.3; internally uses rxDataStep to call predict.randomForest, so the randomForest package is required)