General Hadoop Performance Considerations
MapReduce Jobs and Tasks
- Each ScaleR algorithm running in MapReduce invokes one or more MapReduce jobs, which execute sequentially, one after another
- Each MapReduce Job consists of one or more Map tasks
- Map tasks can execute in parallel
- Set RxHadoopMR( … consoleOutput=TRUE … ) to track job progress
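A minimal sketch of this setup, assuming the RevoScaleR package and an already-configured Hadoop cluster; all other RxHadoopMR arguments are left at their defaults:

```r
library(RevoScaleR)

# Hadoop compute context; consoleOutput = TRUE streams each
# MapReduce job's progress back to the R console.
cc <- RxHadoopMR(consoleOutput = TRUE)
rxSetComputeContext(cc)
```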
- Random Forest with rxExec (small to medium data)
- #jobs = 1
- #tasks = nTrees (default is 10)
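One way to realize the one-task-per-tree pattern above is to grow each tree in its own rxExec task and combine the results. This is a sketch, not ScaleR's internal method: it assumes the open-source randomForest package is installed on every node, and `trainDF` is a hypothetical in-memory data frame with a `label` column.

```r
library(RevoScaleR)

# One MapReduce job; timesToRun = 10 creates 10 tasks (one per
# tree) that can execute in parallel across the cluster.
forests <- rxExec(
  function(df) randomForest::randomForest(label ~ ., data = df, ntree = 1),
  df = trainDF,          # hypothetical in-memory training data
  timesToRun = 10
)

# Merge the single-tree forests into one 10-tree forest.
fit <- do.call(randomForest::combine, forests)
```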
- Random Forest (large data, e.g. 100 GB+)
- #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2)
- #tasks = #inputSplits
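For the large-data case, rxDForest runs inside the RxHadoopMR compute context itself. Starting small, as suggested above, might look like the following; the HDFS path and formula are placeholders:

```r
library(RevoScaleR)
rxSetComputeContext(RxHadoopMR(consoleOutput = TRUE))

# Input read from HDFS; each MapReduce job gets one map task per
# input split.
bigCsv <- RxTextData("/user/revo/data/big.csv",
                     fileSystem = RxHdfsFileSystem())

# 2 trees x depth 2 keeps the job count low while testing; scale
# toward the defaults (nTree = 10, maxDepth = 10, ~100 jobs) later.
rf <- rxDForest(label ~ x1 + x2, data = bigCsv,
                nTree = 2, maxDepth = 2)
```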
- Logistic Regression, GLM, k-Means
- #jobs = #iterations (typically 4-15 iterations)
- #tasks = #inputSplits
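Each iteration of these estimators is a separate MapReduce job, so capping the iteration count bounds the job count. A sketch with placeholder data and formulas, using the maxIterations argument of rxLogit and rxKmeans:

```r
library(RevoScaleR)

xdf <- RxXdfData("/user/revo/data/big.xdf",
                 fileSystem = RxHdfsFileSystem())

# One MapReduce job per IRLS iteration, each job with one task
# per input split.
logit <- rxLogit(label ~ x1 + x2, data = xdf, maxIterations = 15)

# rxKmeans behaves the same way: one job per Lloyd iteration.
km <- rxKmeans(~ x1 + x2, data = xdf,
               numClusters = 5, maxIterations = 10)
```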
- Linear Regression, Ridge Regression, rxImport
- #jobs = 1-2
- #tasks = #inputSplits
- Control #inputSplits by setting mapred.min.split.size
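Since the task count tracks the split count, raising mapred.min.split.size yields fewer, larger map tasks. One way to pass the setting is RxHadoopMR's hadoopSwitches argument; the 1 GB value below is illustrative, not a recommendation:

```r
library(RevoScaleR)

# Ask for input splits of at least ~1 GB (1073741824 bytes).
cc <- RxHadoopMR(
  consoleOutput  = TRUE,
  hadoopSwitches = "-Dmapred.min.split.size=1073741824"
)
rxSetComputeContext(cc)

# A single rxLinMod pass then runs as 1-2 jobs whose task count
# equals the (now smaller) number of input splits.
csv <- RxTextData("/user/revo/data/big.csv",
                  fileSystem = RxHdfsFileSystem())
fit <- rxLinMod(label ~ x1, data = csv)
```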