You can use the same RevoScaleR functions to process huge data sets stored on disk as you use to analyze in-memory data frames, because RevoScaleR functions are implemented as 'chunking' (external-memory) algorithms that operate on one block of data at a time. A chunking algorithm follows this general process, illustrated in the sketch after the list:
1. Initialization: initialize the intermediate results needed to compute the final statistics.
2. Read data: read a chunk of data (a set of observations of the variables).
3. Transform data: perform transformations and row selections on the chunk as needed; write out the data if the task is only an import or data step.
4. Process data: compute intermediate results for the chunk.
5. Update results: combine the results for the chunk with those from previous chunks.
6. Repeat steps 2 through 5 (possibly in parallel) until all of the data has been processed.
7. Process results: when the results from all chunks are complete, perform the final computations and return the results.
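
To make the steps concrete, here is a minimal sketch of the chunking pattern written in plain R. It is an illustration only, not RevoScaleR's internal implementation; the function `chunked_mean`, the file name, and the column name are hypothetical, and the parser assumes a simple comma-separated file with a header row and no quoted fields.

```r
# Compute the mean of one numeric column in a large CSV file without
# ever loading the whole file into memory. Hypothetical sketch of the
# chunking pattern; not RevoScaleR's actual implementation.
chunked_mean <- function(file, column, chunk_size = 10000) {
  con <- file(file, open = "r")
  on.exit(close(con))

  # Read the header row to locate the column of interest
  header  <- strsplit(readLines(con, n = 1), ",")[[1]]
  col_idx <- match(column, header)

  # Step 1: initialize intermediate results
  total <- 0
  n     <- 0

  repeat {
    # Step 2: read a chunk of observations
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break

    # Step 3: transform the chunk (parse out the column of interest;
    # unparseable values become NA)
    values <- suppressWarnings(as.numeric(
      vapply(strsplit(lines, ","), function(row) row[col_idx], character(1))
    ))

    # Step 4: compute intermediate results for this chunk
    chunk_sum <- sum(values, na.rm = TRUE)
    chunk_n   <- sum(!is.na(values))

    # Step 5: update: combine with results from previous chunks
    total <- total + chunk_sum
    n     <- n + chunk_n
  }

  # Final step: compute and return the final statistic
  total / n
}

# Example (hypothetical file and column):
# chunked_mean("big.csv", "x")
```

In RevoScaleR itself, this loop is internal: a single call such as `rxSummary(~ ArrDelay, data = myData)` performs the chunked reading, per-chunk computation, and combination of results automatically.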