Forest and Tree Modeling Accuracy
Tune rxDForest parameters; each change trades speed for accuracy (*: OSR randomForest and RRE rxDForest defaults)
– Increase nTree, e.g. to 20 or more (OSR=500, RRE=10)*
– Increase maxDepth, e.g. to 20 or more (OSR=N/A, RRE=10)*
– Decrease minSplit, e.g. to 2 (OSR=5, RRE=sqrt(N))*
– Increase mTry, e.g. to 40 or more (OSR/RRE=sqrt(p) for classification, p/3 for regression)*
– Increase maxNumBins, e.g. to 1e5 or 1e6
– Accuracy on the KDD dataset reached 81.4% with the settings below, rising to 82.3% with nTree=200:
nTree=20, mTry=40, minSplit=2, maxDepth=20, maxNumBins=1e6
– Alternatively, run the open source randomForest routine across the Hadoop cluster using rxExec
– Adjust MapReduce (MR) memory limits if needed, since the data must fit in memory on each node.
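
The settings above can be sketched in R roughly as follows. This is an illustration under assumptions, not the code behind the reported KDD numbers: the wrapper names fit_tuned_dforest and fit_parallel_rf, the runs and treesPerRun arguments, and the formula and data inputs are hypothetical placeholders. It assumes RevoScaleR and randomForest are installed and that a compute context has already been set.

```r
# Sketch only: assumes RevoScaleR and randomForest are installed and that a
# compute context (e.g. Hadoop) has already been set with rxSetComputeContext().

# Tuned rxDForest call using the settings listed on the slide.
# `formula` and `data` are placeholders supplied by the caller.
fit_tuned_dforest <- function(formula, data, nTree = 20) {
  RevoScaleR::rxDForest(
    formula, data = data,
    nTree      = nTree,  # the slide reports 82.3% accuracy with nTree = 200
    mTry       = 40,     # only meaningful with at least 40 predictors
    minSplit   = 2,
    maxDepth   = 20,
    maxNumBins = 1e6
  )
}

# Alternative: grow several independent randomForest models with rxExec and
# merge them into one forest. `data` must be an in-memory data frame that
# fits on each node. `runs` and `treesPerRun` are hypothetical names.
fit_parallel_rf <- function(formula, data, runs = 4, treesPerRun = 50) {
  forests <- RevoScaleR::rxExec(
    randomForest::randomForest,
    formula = formula, data = data, ntree = treesPerRun,
    timesToRun = runs,
    packagesToLoad = "randomForest"
  )
  # rxExec returns a list with one forest per run; combine() merges them.
  do.call(randomForest::combine, forests)
}
```

A call would look like fit_tuned_dforest(y ~ x1 + x2, data = trainData, nTree = 200), where y, x1, x2 and trainData stand in for your own variables and data source. With the rxExec variant, check that each run uses a different random seed, otherwise the merged forest may contain repeated trees.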