‘vtreat’ is a package that prepares arbitrary data frames into clean data frames that are ready for analysis (usually supervised learning). A clean data frame:
To effect this encoding ‘vtreat’ replaces original variables or columns with new derived variables. In this note we will use variables and columns as interchangeable concepts. This note describes the current family of ‘vtreat’ derived variable types.
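‘vtreat’ usage splits into three main cases, distinguished by the kind of target being predicted: a categorical target, a numeric target, or no target.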
In all cases vtreat variable names are built by appending a notation onto the original user supplied column name. In all cases the easiest way to examine the derived variables is to look at the scoreFrame component of the returned treatment plan.
We will outline each of these situations below:
An example treatment for a categorical target is demonstrated below:
library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE),
                      stringsAsFactors = FALSE)
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "desigining treatments Mon May 2 21:02:03 2016"
## [1] "design var x Mon May 2 21:02:03 2016"
## [1] "design var z Mon May 2 21:02:03 2016"
## [1] "scoring treatments Mon May 2 21:02:03 2016"
## [1] "have treatment plan Mon May 2 21:02:03 2016"
## [1] "rescoring complex variables Mon May 2 21:02:03 2016"
## [1] "done rescoring complex variables Mon May 2 21:02:03 2016"
print(treatmentsC$scoreFrame[,c('origName','varName','code','varMoves','sig')])
##   origName   varName  code varMoves        sig
## 1        x  x_lev_NA   lev     TRUE 0.20766228
## 2        x x_lev_x.a   lev     TRUE 0.40972582
## 3        x x_lev_x.b   lev     TRUE 1.00000000
## 4        x    x_catP  catP     TRUE 0.25493078
## 5        x    x_catB  catB     TRUE 0.05044486
## 6        z   z_clean clean     TRUE 0.14299775
## 7        z   z_isBAD isBAD     TRUE 0.20766228
For each user supplied variable or column (in this case x and z) ‘vtreat’ proposes derived or treated variables (in this case x_lev_NA, x_lev_x.a, x_lev_x.b, x_catP, x_catB, z_clean, and z_isBAD). The mapping from original variable name to derived variable name is given by comparing the columns origName and varName. One can map facts about the new variables back to the original variables as follows:
# Build a map from vtreat names back to reasonable display names
vmap <- as.list(treatmentsC$scoreFrame$origName)
names(vmap) <- treatmentsC$scoreFrame$varName
print(vmap['x_catB'])
## $x_catB
## [1] "x"
# Map significances back to original variables
aggregate(sig~origName,data=treatmentsC$scoreFrame,FUN=min)
##   origName        sig
## 1        x 0.05044486
## 2        z 0.14299775
Essentially a derived variable name is built by concatenating an original variable name and a treatment type (also recorded in the code column for convenience). The codes give the different ‘vtreat’ variable types (or really meanings, as all derived variables are numeric).
For categorical targets the possible variable types are as follows:
lev: an indicator variable for a single level. For example, x_lev_x.a is 1 when the original x variable had a value of “a”. These indicators are essentially variables representing explicit encoding of levels as contrasts. In some cases a special level code is used to represent pooled rare values.
catB: a re-encoding of the categorical variable as x_catB = log(P[y|x]/P[y]). This encoding is especially useful for categorical variables that have a large number of levels, but be aware it can obscure degrees of freedom if not used properly.
An example treatment for a numeric target is demonstrated below:
library(vtreat)
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
                      stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "desigining treatments Mon May 2 21:02:03 2016"
## [1] "design var x Mon May 2 21:02:03 2016"
## [1] "design var z Mon May 2 21:02:03 2016"
## [1] "scoring treatments Mon May 2 21:02:03 2016"
## [1] "have treatment plan Mon May 2 21:02:03 2016"
## [1] "rescoring complex variables Mon May 2 21:02:03 2016"
## [1] "done rescoring complex variables Mon May 2 21:02:03 2016"
print(treatmentsN$scoreFrame[,c('origName','varName','code','varMoves','sig')])
##   origName   varName  code varMoves       sig
## 1        x  x_lev_NA   lev     TRUE 0.3739010
## 2        x x_lev_x.a   lev     TRUE 0.5185185
## 3        x x_lev_x.b   lev     TRUE 1.0000000
## 4        x    x_catP  catP     TRUE 0.3739010
## 5        x    x_catN  catN     TRUE 0.1933777
## 6        x    x_catD  catD     TRUE 0.3739010
## 7        z   z_clean clean     TRUE 0.2562868
## 8        z   z_isBAD isBAD     TRUE 0.3739010
The treatment of numeric targets is similar to that of categorical targets. In the numeric case the possible derived variable types are:
lev: an indicator variable for a single level. For example, x_lev_x.a is 1 when the original x variable had a value of “a”. These indicators are essentially variables representing explicit encoding of levels as contrasts. In some cases a special level code is used to represent pooled rare values.
catN: a re-encoding of the categorical variable as x_catN = E[y|x] - E[y]. This encoding is especially useful for categorical variables that have a large number of levels, but be aware it can obscure degrees of freedom if not used properly.
Note: for categorical targets we don’t need catD variables as this information is already encoded in catB variables.
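To make the catN and catB formulas concrete, here is a minimal base-R sketch that evaluates them directly on the toy data used in these examples. It illustrates the formulas only and is not vtreat's implementation: it assumes no smoothing and no cross-validation, so the values vtreat itself produces will differ. The helper names impact_catN and impact_catB are hypothetical.

```r
# Toy data from the examples; a missing x is treated as its own level.
x <- c('a','a','a','b','b',NA)
y <- c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)
xLevel <- ifelse(is.na(x), 'NA', x)

# Hypothetical helper: naive catN-style code, E[y|x] - E[y].
impact_catN <- function(level, y) {
  y <- as.numeric(y)
  ave(y, level, FUN = mean) - mean(y)
}

# Hypothetical helper: naive catB-style code, log(P[y|x]/P[y]).
# The result is -Inf for a level where y is never TRUE.
impact_catB <- function(level, y) {
  y <- as.numeric(y)
  log(ave(y, level, FUN = mean) / mean(y))
}

naiveCodes <- data.frame(x = xLevel,
                         catN = impact_catN(xLevel, y),
                         catB = impact_catB(xLevel, y),
                         stringsAsFactors = FALSE)
print(unique(naiveCodes))
```

Each level is replaced by a single number, which is why these encodings remain practical for variables with many levels. Because that number is estimated from y, it behaves like a small model rather than raw data.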
An example “no target” variable treatment is demonstrated below:
library(vtreat)
dTrainZ <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      stringsAsFactors = FALSE)
treatmentsZ <- designTreatmentsZ(dTrainZ,colnames(dTrainZ))
## [1] "desigining treatments Mon May 2 21:02:03 2016"
## [1] "design var x Mon May 2 21:02:03 2016"
## [1] "design var z Mon May 2 21:02:03 2016"
## [1] "scoring treatments Mon May 2 21:02:03 2016"
## [1] "have treatment plan Mon May 2 21:02:03 2016"
print(treatmentsZ$scoreFrame[,c('origName','varName','code','varMoves')])
##   origName varName  code varMoves
## 1        x  x_catP  catP     TRUE
## 2        z z_clean clean     TRUE
## 3        z z_isBAD isBAD     TRUE
Note: because there is no user supplied target the scoreFrame significance columns are not meaningful, and are populated only for regularity of code interface. Beyond that the no-target treatments are similar to the earlier treatments. Possible derived variable types in this case are:
lev: an indicator variable for a single level. For example, x_lev_x.a is 1 when the original x variable had a value of “a”. These indicators are essentially variables representing explicit encoding of levels as contrasts. In some cases a special level code is used to represent pooled rare values.
Variables that “do not move” (don’t take on at least two values during treatment design) or don’t achieve at least a minimal significance are suppressed. The catB/catN variables are essentially single variable models and are very useful for re-encoding categorical variables that take on a very large number of values (such as zip-codes).
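As a rough illustration of indicator variables and of the “does it move” rule, the following base-R sketch builds lev-style 0/1 columns by hand and checks whether each column takes at least two values. It is a sketch only: the helper names make_indicators and var_moves are hypothetical, the column names do not follow vtreat's exact naming, and vtreat's own level selection, rare-level pooling, and significance-based suppression are not reproduced.

```r
x <- c('a','a','a','b','b',NA)

# Hypothetical helper: one 0/1 column per observed level, NA as its own level.
make_indicators <- function(v, name) {
  lev <- ifelse(is.na(v), 'NA', v)
  levels_seen <- sort(unique(lev))
  cols <- lapply(levels_seen, function(l) as.numeric(lev == l))
  names(cols) <- paste0(name, '_lev_', levels_seen)
  as.data.frame(cols)
}

# Hypothetical helper: a column "moves" if it takes at least two distinct values.
var_moves <- function(col) {
  length(unique(col)) >= 2
}

ind <- make_indicators(x, 'x')
ind$constant <- 1   # a column that never changes, for contrast
print(sapply(ind, var_moves))
```

A column such as constant carries no information for any model, which is the motivation for suppressing variables that do not move.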
The intended use of ‘vtreat’ is as follows:
‘vtreat’ attempts to compute “out of sample” significances for each variable effect (the sig column in scoreFrame) through cross-validation techniques.
‘vtreat’ is primarily intended to be “y-aware” processing. Of particular interest is using vtreat::prepare() with scale=TRUE which tries to put most columns in ‘y-effect’ units. This can be an important pre-processing step before attempting dimension reduction (such as principal components methods).
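The idea of ‘y-effect’ units can be illustrated without vtreat. The base-R sketch below rescales a single numeric column so that a unit change in the column corresponds to a unit change in the fitted outcome. This is offered as an assumption about what scale=TRUE aims for, not as vtreat's implementation, and the helper name to_y_units is hypothetical.

```r
# Hypothetical helper: put a numeric column into units of the outcome y by
# multiplying the centered column by its single-variable regression slope.
to_y_units <- function(v, y) {
  slope <- coef(lm(y ~ v))[['v']]
  slope * (v - mean(v))
}

# Toy data: z with its missing value replaced by the mean of the known values.
z <- c(1, 2, 3, 4, 3.2, 6)
y <- c(0, 0, 1, 0, 1, 1)

zScaled <- to_y_units(z, y)

# After rescaling, regressing y on the new column gives slope 1,
# and the new column has mean 0.
print(coef(lm(y ~ zScaled)))
print(mean(zScaled))
```

When every column is on a common outcome scale like this, the directions of large variance found by a later principal components step reflect effect on y rather than arbitrary input units.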
The vtreat user should pick which sorts of variables they want and also filter on estimated significance. Doing this looks like the following:
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
                      stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "desigining treatments Mon May 2 21:02:03 2016"
## [1] "design var x Mon May 2 21:02:03 2016"
## [1] "design var z Mon May 2 21:02:03 2016"
## [1] "scoring treatments Mon May 2 21:02:03 2016"
## [1] "have treatment plan Mon May 2 21:02:03 2016"
## [1] "rescoring complex variables Mon May 2 21:02:03 2016"
## [1] "done rescoring complex variables Mon May 2 21:02:03 2016"
print(treatmentsN$scoreFrame[,c('origName','varName','code','varMoves','sig')])
##   origName   varName  code varMoves       sig
## 1        x  x_lev_NA   lev     TRUE 0.3739010
## 2        x x_lev_x.a   lev     TRUE 0.5185185
## 3        x x_lev_x.b   lev     TRUE 1.0000000
## 4        x    x_catP  catP     TRUE 0.3739010
## 5        x    x_catN  catN     TRUE 0.1933777
## 6        x    x_catD  catD     TRUE 0.3739010
## 7        z   z_clean clean     TRUE 0.2562868
## 8        z   z_isBAD isBAD     TRUE 0.3739010
pruneSig <- 1.0 # don't filter on significance for this tiny example
vScoreFrame <- treatmentsN$scoreFrame
# codes must match the scoreFrame code column exactly (matching is case-sensitive)
varsToUse <- vScoreFrame$varName[(vScoreFrame$sig<=pruneSig) &
                                 vScoreFrame$code %in% c('lev','catN','clean','isBAD')]
print(varsToUse)
## [1] "x_lev_NA" "x_lev_x.a" "x_lev_x.b" "x_catN" "z_clean" "z_isBAD"
origVarNames <- sort(unique(vScoreFrame$origName[vScoreFrame$varName %in% varsToUse]))
print(origVarNames)
## [1] "x" "z"
We strongly suggest using the “y aware” variables coded as ‘lev’, ‘catN’, ‘catB’, ‘clean’, and ‘isBAD’. The non y aware variables (‘catP’ and ‘catD’) can be useful (possibly as interactions or guards on the corresponding ‘catN’ and ‘catB’ variables) but also encode distributional facts about the data that may or may not be appropriate depending on your problem domain.
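The selection step can also be wrapped in a small helper. The sketch below is base R and runs on a hand-copied stand-in for the numeric-target scoreFrame printed above, so it does not need vtreat to run. The function name select_vars and its argument names are hypothetical and not part of vtreat.

```r
# Stand-in for treatmentsN$scoreFrame, copied from the printed output.
sf <- data.frame(
  origName = c('x','x','x','x','x','x','z','z'),
  varName  = c('x_lev_NA','x_lev_x.a','x_lev_x.b','x_catP','x_catN','x_catD',
               'z_clean','z_isBAD'),
  code     = c('lev','lev','lev','catP','catN','catD','clean','isBAD'),
  sig      = c(0.3739010,0.5185185,1.0000000,0.3739010,0.1933777,0.3739010,
               0.2562868,0.3739010),
  stringsAsFactors = FALSE)

# Hypothetical helper: keep derived variables of the wanted types whose
# significance is at or below pruneSig. Requested codes that never occur are
# reported with a warning, since code matching is case-sensitive.
select_vars <- function(scoreFrame, codes, pruneSig) {
  unknown <- setdiff(codes, unique(scoreFrame$code))
  if (length(unknown) > 0) {
    warning('codes not present in scoreFrame: ', paste(unknown, collapse = ', '))
  }
  keep <- (scoreFrame$sig <= pruneSig) & (scoreFrame$code %in% codes)
  scoreFrame$varName[keep]
}

wanted <- c('lev','catN','clean','isBAD')
varsAll   <- select_vars(sf, wanted, pruneSig = 1.0)  # no significance filter
varsTight <- select_vars(sf, wanted, pruneSig = 0.3)  # stricter filter
print(varsAll)
print(varsTight)
```

Keeping the code list and the significance threshold as explicit arguments makes it easy to compare how many variables survive under different settings.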
When displaying variables to end users we suggest using the original names and the min significance seen on any derived variable:
origVarNames <- sort(unique(vScoreFrame$origName[vScoreFrame$varName %in% varsToUse]))
print(origVarNames)
## [1] "x" "z"
origVarSigs <- vScoreFrame[vScoreFrame$varName %in% varsToUse,]
aggregate(sig~origName,data=origVarSigs,FUN=min)
##   origName       sig
## 1        x 0.1933777
## 2        z 0.2562868