Hadoop Composite XDF Block Size
MapReduce splits each input text file into one or more input splits, which by default are the size of an HDFS block (for example, 128 MB).
- Each input split is converted from uncompressed, unparsed text into a compressed, parsed binary xdfd file in the “data” subdirectory of the output directory. Header information for the full set of xdfd’s is stored in a single xdfm metadata file in the “metadata” subdirectory.
- For efficiency in subsequent analyses, each output xdfd file should roughly match the HDFS block size (a quick way to check this is sketched after the example below)
- To compensate for XDF compression, you’ll therefore usually need to increase the output xdfd file size by increasing the input split size, using this parameter to RxHadoopMR():
- hadoopSwitches="-Dmapred.min.split.size=1000000000"
- For more recent Hadoop installations that use YARN, the parameter is mapreduce.input.fileinputformat.split.minsize, as shown in the example below
- Increasing the input split size further may reduce the number of composite XDF files and hence the number of parallelized map tasks in subsequent analyses. This may be useful if the number of available map slots or containers is small relative to the number of splits. Conversely, when many map slots or containers are available, smaller input splits and more xdfd’s may result in faster completion. A back-of-the-envelope sizing sketch follows the example below.
- Example
# Raise the minimum input split size to roughly 150 MB for the import
rxSetComputeContext(RxHadoopMR(hadoopSwitches =
    "-Dmapreduce.input.fileinputformat.split.minsize=150000000"))
rxImport(myCSV, myCXdf, overwrite=TRUE)  # myCSV: input text data source; myCXdf: composite XDF output
rxSetComputeContext(RxHadoopMR())        # set it back when done
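To connect the tuning point above to concrete numbers, here is a back-of-the-envelope way to pick a minimum split size when containers are scarce. This is only a sketch; the total input size and target split count are hypothetical placeholders, not values from this example:

# Hypothetical figures: substitute the real size of your input text files and
# the number of map tasks you want available to subsequent analyses
totalInputBytes <- 600e9     # assumed total size of the uncompressed input CSVs
targetSplits    <- 400       # e.g., roughly the number of available containers
minSplitBytes   <- ceiling(totalInputBytes / targetSplits)   # 1.5 GB per split

rxSetComputeContext(RxHadoopMR(hadoopSwitches = sprintf(
    "-Dmapreduce.input.fileinputformat.split.minsize=%.0f", minSplitBytes)))

Because minsize only raises the lower bound on the split size, values at or below the HDFS block size have no effect.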
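After the import, it can also be worth confirming that the xdfd files landed at roughly the HDFS block size, as noted in the list above. A minimal check, assuming RevoScaleR’s rxHadoopListFiles() (which wraps hadoop fs -ls) is available from your R session, and using a hypothetical HDFS path for the composite XDF output:

# Paths are hypothetical; point them at the directory your composite XDF was written to.
# Expect one xdfd file per input split and a single xdfm metadata file;
# compare the reported xdfd sizes to the HDFS block size.
rxHadoopListFiles("/user/me/myCXdf/data")
rxHadoopListFiles("/user/me/myCXdf/metadata")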