Hadoop Composite XDF Block Size
MapReduce splits each input text file into one or more input splits, which by default are the size of an HDFS block (for example, 128 MB).
- Each input split is converted from uncompressed, unparsed text into a compressed, parsed binary xdfd file in the “data” subdirectory of the output directory. Header information for the full set of xdfd’s is stored in a single xdfm metadata file in the “metadata” subdirectory.
- For efficiency in subsequent analyses, each output xdfd file should roughly match the HDFS block size (a quick way to check this is sketched after the example below)
- To compensate for XDF compression, you’ll therefore usually need to increase the output xdfd file size by increasing the input split size, using this parameter to RxHadoopMR():
- hadoopSwitches="-Dmapred.min.split.size=1000000000"
- For more recent Hadoop installations that use YARN, the parameter is mapreduce.input.fileinputformat.split.minsize, as shown in the example below
- Increasing the input split size further may reduce the number of composite XDF files and hence the number of parallelized map tasks in subsequent analyses. This may be useful if the number of available map slots or containers is small relative to the number of splits. Conversely, when many map slots or containers are available, smaller input splits and more xdfd’s may result in faster completion. A back-of-the-envelope sizing sketch follows the example below.
- Example
# Raise the minimum input split size to roughly 150 MB for the import
rxSetComputeContext(RxHadoopMR(hadoopSwitches =
    "-Dmapreduce.input.fileinputformat.split.minsize=150000000"))
rxImport(myCSV, myCXdf, overwrite=TRUE)  # myCSV: input text data source; myCXdf: composite XDF output
rxSetComputeContext(RxHadoopMR())        # set it back when done
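To connect the tuning point above to concrete numbers, here is a back-of-the-envelope way to pick a minimum split size when containers are scarce. This is only a sketch; the total input size and target split count are hypothetical placeholders, not values from this example:

# Hypothetical figures: substitute the real size of your input text files and
# the number of map tasks you want available to subsequent analyses
totalInputBytes <- 600e9     # assumed total size of the uncompressed input CSVs
targetSplits    <- 400       # e.g., roughly the number of available containers
minSplitBytes   <- ceiling(totalInputBytes / targetSplits)   # 1.5 GB per split

rxSetComputeContext(RxHadoopMR(hadoopSwitches = sprintf(
    "-Dmapreduce.input.fileinputformat.split.minsize=%.0f", minSplitBytes)))

Because minsize only raises the lower bound on the split size, values at or below the HDFS block size have no effect.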
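After the import, it can also be worth confirming that the xdfd files landed at roughly the HDFS block size, as noted in the list above. A minimal check, assuming RevoScaleR’s rxHadoopListFiles() (which wraps hadoop fs -ls) is available from your R session, and using a hypothetical HDFS path for the composite XDF output:

# Paths are hypothetical; point them at the directory your composite XDF was written to.
# Expect one xdfd file per input split and a single xdfm metadata file;
# compare the reported xdfd sizes to the HDFS block size.
rxHadoopListFiles("/user/me/myCXdf/data")
rxHadoopListFiles("/user/me/myCXdf/metadata")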