The RevoScaleR function rxSummary() comes in very handy for this kind of thing. By default it will report group means and standard deviations as well as cell counts and number of valid observations, if you specify a formula that includes interaction terms. In the below examples we use the shipped dataset 'CensusWorkers.xdf'.
Example:
--------
testDataDir <- file.path(rxGetOption("sampleDataDir"))
rxSummary(incwage ~ sex:state, data = file.path(testDataDir, "CensusWorkers.xdf"))
Number of valid observations: 351121
Number of missing observations: 0
Name Mean StdDev Min Max ValidObs
incwage:sex:state 35333.84 40444.54 0 354000 351121
Statistics by category (6 categories):
Category sex state Means StdDev
incwage for sex=Male, state=Connecticut Male Connecticut 55002.00 67742.93
incwage for sex=Female, state=Connecticut Female Connecticut 32605.03 34426.27
incwage for sex=Male, state=Indiana Male Indiana 38325.33 36160.07
incwage for sex=Female, state=Indiana Female Indiana 23117.71 20371.68
incwage for sex=Male, state=Washington Male Washington 41001.14 43962.65
incwage for sex=Female, state=Washington Female Washington 25765.03 25191.75
Min Max ValidObs MissingObs
0 354000 41751 0
0 354000 37666 0
0 314000 74221 0
0 314000 63702 0
0 336000 73372 0
0 336000 60409 0
You can use the 'rowSelection' argument to further subset your data based on specific values of some of the variables. Another useful function is the 'F()' function, which tells RevoScaleR to treat a specific variable as a factor:
Example:
--------
### Get information on variable types in XDF file
rxGetInfo(file.path(testDataDir,"CensusWorkers.xdf"), getVarInfo=TRUE)
rxSummary(incwage ~ F(age):sex, data = file.path(testDataDir,"CensusWorkers.xdf"), rowSelection = (state == "Indiana"))
Example:
--------
testDataDir <- file.path(rxGetOption("sampleDataDir"))
rxSummary(incwage ~ sex:state, data = file.path(testDataDir, "CensusWorkers.xdf"))
Number of valid observations: 351121
Number of missing observations: 0
Name Mean StdDev Min Max ValidObs
incwage:sex:state 35333.84 40444.54 0 354000 351121
Statistics by category (6 categories):
Category sex state Means StdDev
incwage for sex=Male, state=Connecticut Male Connecticut 55002.00 67742.93
incwage for sex=Female, state=Connecticut Female Connecticut 32605.03 34426.27
incwage for sex=Male, state=Indiana Male Indiana 38325.33 36160.07
incwage for sex=Female, state=Indiana Female Indiana 23117.71 20371.68
incwage for sex=Male, state=Washington Male Washington 41001.14 43962.65
incwage for sex=Female, state=Washington Female Washington 25765.03 25191.75
Min Max ValidObs MissingObs
0 354000 41751 0
0 354000 37666 0
0 314000 74221 0
0 314000 63702 0
0 336000 73372 0
0 336000 60409 0
You can use the 'rowSelection' argument to further subset your data based on specific values of some of the variables. Another useful function is the 'F()' function, which tells RevoScaleR to treat a specific variable as a factor:
Example:
--------
### Get information on variable types in XDF file
rxGetInfo(file.path(testDataDir,"CensusWorkers.xdf"), getVarInfo=TRUE)
rxSummary(incwage ~ F(age):sex, data = file.path(testDataDir,"CensusWorkers.xdf"), rowSelection = (state == "Indiana"))