
General Hadoop Performance Considerations



MapReduce Jobs and Tasks
  • Each ScaleR algorithm running in MapReduce invokes one or more MapReduce Jobs, one after another
  • Each MapReduce Job consists of one or more Map tasks
  • Map tasks can execute in parallel
  • Set RxHadoopMR( …  consoleOutput=TRUE … ) to track job progress
MapReduce Job and Task Scaling
  • Random Forest with rxExec (small to medium data)
    • #jobs = 1
    • #tasks = nTrees (default is 10)
  • Random Forest (large data, e.g. 100 GB+)
    • #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2)
    • #tasks = #inputSplits
  • Logistic Regression, GLM, k-Means
    • #jobs = #iterations (typically 4 - 15 iterations)
    • #tasks = #inputSplits
  • Linear Regression, Ridge Regression, rxImport
    • #jobs = 1-2
    • #tasks = #inputSplits
  • Control #inputSplits by setting mapred.min.split.size
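
The scaling rules above can be turned into a rough back-of-the-envelope estimate. The following Python sketch is illustrative only and is not part of ScaleR: the function names, the algorithm labels, and the 128 MiB default block size are assumptions made for this example. It also assumes a single splittable input file whose split size is the larger of the block size and mapred.min.split.size, and it reads "#tasks" as the number of map tasks per job. Real clusters may produce somewhat different counts.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_input_splits(file_bytes, block_bytes=128 * MIB, min_split_bytes=1):
    """Approximate #inputSplits for one splittable file.

    Assumption: split size = max(min split size, block size). Raising
    min_split_bytes above the block size therefore lowers the split count.
    """
    split_bytes = max(min_split_bytes, block_bytes)
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-file_bytes // split_bytes))


def estimate_jobs_and_tasks(algorithm, input_splits,
                            n_trees=10, max_depth=10, iterations=10):
    """Return (min_jobs, max_jobs, map_tasks_per_job) from the rules listed above.

    The algorithm labels are hypothetical names used only in this sketch.
    """
    if algorithm == "forest_rxexec":      # Random Forest with rxExec
        return (1, 1, n_trees)
    if algorithm == "forest_large":       # Random Forest on large data
        jobs = n_trees * max_depth        # approximate, per the "~" in the rule
        return (jobs, jobs, input_splits)
    if algorithm == "iterative":          # Logistic Regression, GLM, k-Means
        return (iterations, iterations, input_splits)
    if algorithm == "single_pass":        # Linear/Ridge Regression, rxImport
        return (1, 2, input_splits)
    raise ValueError("unknown algorithm label: " + algorithm)


if __name__ == "__main__":
    splits = estimate_input_splits(100 * GIB)
    print("input splits:", splits)
    print("forest 2 x 2:", estimate_jobs_and_tasks("forest_large", splits,
                                                   n_trees=2, max_depth=2))
    print("forest 10 x 10:", estimate_jobs_and_tasks("forest_large", splits))
```

Under these assumptions, a 100 GiB file yields 800 splits, so a large-data Random Forest at the 10 x 10 defaults would launch roughly 100 jobs of 800 map tasks each, which is why starting at 2 x 2 is suggested. Raising the minimum split size to 1 GiB in the same model cuts the per-job task count to 100.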




Article Info
Article ID : 3104164
Revision : 1
Created on : 1/7/2017
Published on : 11/1/2015
Exists online : False
Views : 58