General Hadoop Performance Considerations
MapReduce Jobs and Tasks
- Each ScaleR algorithm running in MapReduce invokes one or more MapReduce jobs, which execute sequentially, one after another
- Each MapReduce Job consists of one or more Map tasks
- Map tasks can execute in parallel
- Set RxHadoopMR( … consoleOutput=TRUE … ) to track job progress
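A minimal sketch of this setup, assuming the RevoScaleR package and an already-configured Hadoop cluster; all other RxHadoopMR arguments are left at their defaults:

```r
library(RevoScaleR)

# Hadoop compute context; consoleOutput = TRUE streams each
# MapReduce job's progress back to the R console.
cc <- RxHadoopMR(consoleOutput = TRUE)
rxSetComputeContext(cc)
```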
- Random Forest with rxExec (small to medium data)
- #jobs = 1
- #tasks = nTrees (default is 10)
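One way to realize the one-task-per-tree pattern above is to grow each tree in its own rxExec task and combine the results. This is a sketch, not ScaleR's internal method: it assumes the open-source randomForest package is installed on every node, and `trainDF` is a hypothetical in-memory data frame with a `label` column.

```r
library(RevoScaleR)

# One MapReduce job; timesToRun = 10 creates 10 tasks (one per
# tree) that can execute in parallel across the cluster.
forests <- rxExec(
  function(df) randomForest::randomForest(label ~ ., data = df, ntree = 1),
  df = trainDF,          # hypothetical in-memory training data
  timesToRun = 10
)

# Merge the single-tree forests into one 10-tree forest.
fit <- do.call(randomForest::combine, forests)
```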
- Random Forest (large data, e.g. 100 GB+)
- #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2)
- #tasks = #inputSplits
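For the large-data case, rxDForest runs inside the RxHadoopMR compute context itself. Starting small, as suggested above, might look like the following; the HDFS path and formula are placeholders:

```r
library(RevoScaleR)
rxSetComputeContext(RxHadoopMR(consoleOutput = TRUE))

# Input read from HDFS; each MapReduce job gets one map task per
# input split.
bigCsv <- RxTextData("/user/revo/data/big.csv",
                     fileSystem = RxHdfsFileSystem())

# 2 trees x depth 2 keeps the job count low while testing; scale
# toward the defaults (nTree = 10, maxDepth = 10, ~100 jobs) later.
rf <- rxDForest(label ~ x1 + x2, data = bigCsv,
                nTree = 2, maxDepth = 2)
```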
- Logistic Regression, GLM, k-Means
- #jobs = #iterations (typically 4-15 iterations)
- #tasks = #inputSplits
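Each iteration of these estimators is a separate MapReduce job, so capping the iteration count bounds the job count. A sketch with placeholder data and formulas, using the maxIterations argument of rxLogit and rxKmeans:

```r
library(RevoScaleR)

xdf <- RxXdfData("/user/revo/data/big.xdf",
                 fileSystem = RxHdfsFileSystem())

# One MapReduce job per IRLS iteration, each job with one task
# per input split.
logit <- rxLogit(label ~ x1 + x2, data = xdf, maxIterations = 15)

# rxKmeans behaves the same way: one job per Lloyd iteration.
km <- rxKmeans(~ x1 + x2, data = xdf,
               numClusters = 5, maxIterations = 10)
```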
- Linear Regression, Ridge Regression, rxImport
- #jobs = 1-2
- #tasks = #inputSplits
- Control #inputSplits by setting mapred.min.split.size
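Since the task count tracks the split count, raising mapred.min.split.size yields fewer, larger map tasks. One way to pass the setting is RxHadoopMR's hadoopSwitches argument; the 1 GB value below is illustrative, not a recommendation:

```r
library(RevoScaleR)

# Ask for input splits of at least ~1 GB (1073741824 bytes).
cc <- RxHadoopMR(
  consoleOutput  = TRUE,
  hadoopSwitches = "-Dmapred.min.split.size=1073741824"
)
rxSetComputeContext(cc)

# A single rxLinMod pass then runs as 1-2 jobs whose task count
# equals the (now smaller) number of input splits.
csv <- RxTextData("/user/revo/data/big.csv",
                  fileSystem = RxHdfsFileSystem())
fit <- rxLinMod(label ~ x1, data = csv)
```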