‘vtreat’ is a package that prepares arbitrary data frames into clean data frames that are ready for analysis. A clean data frame has only numeric effective-variable columns and no Infinite/NA/NaN values in those columns.

To achieve this a number of techniques are used. Principally: categorical variables are re-encoded (as indicators, prevalence codes, and impact codes) and missing numeric values are replaced with stand-in values plus an indicator column. For more details see: the ‘vtreat’ article and update.
The use pattern is:

1. Use designTreatmentsC() or designTreatmentsN() to design a treatment plan.
2. Use prepare() to apply the plan to data frames.

The main feature of ‘vtreat’ is that all data preparation is “y-aware”: it uses the relations of the effective variables to the dependent or outcome variable to encode the effective variables.
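A minimal sketch of this two-step pattern (the toy data frame d below is hypothetical; pruneSig=c() means no significance pruning):

library(vtreat)
# hypothetical toy data with a logical outcome column 'y'
d <- data.frame(x=c('a','a','b','b',NA),
                z=c(1,2,NA,4,5),
                y=c(FALSE,TRUE,FALSE,TRUE,TRUE))
# step 1: design a treatment plan for a categorical/binary outcome (target value TRUE)
plan <- designTreatmentsC(d,colnames(d),'y',TRUE)
# step 2: apply the plan to any data frame with the same input columns
dTreated <- prepare(plan,d,pruneSig=c())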
The structure returned from designTreatmentsN() or designTreatmentsC() includes a list of “treatments”: objects that encapsulate the transformation process from the original variables to the new “clean” variables.
In addition to the treatment objects, designTreatmentsC() and designTreatmentsN() also return a data frame named scoreFrame which contains the columns:

- varName: name of the new variable.
- origName: name of the original variable the new variable was derived from (can repeat).
- code: what type of treatment was performed to create the derived variable (useful for filtering).
- varMoves: logical, TRUE if the variable varied during training; only variables that move will be in the treated frame.
- psig: linear significance of regressing the derived variable against a 0/1 indicator target.
- csig: for categorical outcomes, the significance of the observed variable catPRSquared value under an in-sample permutation test.
- sig: csig for categorical outcomes, psig otherwise.
- needsSplit: whether the variable is a sub-model and requires out-of-sample scoring.
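The scoreFrame can be used to pick which derived variables to keep; a small sketch (using the treatmentsC plan built in the examples below; the 0.2 significance cutoff is only an illustrative choice):

scoreFrame <- treatmentsC$scoreFrame
# keep variables that vary and look significant (0.2 is an arbitrary illustrative cutoff)
goodVars <- scoreFrame$varName[scoreFrame$varMoves & scoreFrame$sig<0.2]
# or filter by treatment type, e.g. only the indicator ("lev") variables
levVars <- scoreFrame$varName[scoreFrame$code=='lev']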
In all cases we have two undesirable upward biases on the scores: variables derived from large categoricals can essentially memorize the training data, and the impact-style variables are themselves the result of a fit against the outcome (a linear regression for PRESSRsquared, or a logistic regression for catPRSquared). So in each of these cases we would like the regression itself to only be evaluated on held-out data. The PRESS statistic does just that (fast 1-way cross validation). The catPseudoRSquared calculation performs explicit 1-way cross validation for small data sets and hold-out scoring for larger data sets.

‘vtreat’ uses a number of cross-training and jackknife style procedures to try to mitigate these effects. The suggested best practice (if you have enough data) is to split your data randomly into at least the following disjoint data sets:
1. A calibration set: used for the designTreatmentsC() or designTreatmentsN() step and not used again for training or test.
2. A training set: treated via prepare() and used for training your model.
3. A test set: treated via prepare() and used for estimating your model’s out-of-training performance.

Taking the extra step to perform the designTreatmentsC() or designTreatmentsN() step on data disjoint from training makes the training data more exchangeable with test and avoids the issue that ‘vtreat’ may be hiding a large number of degrees of freedom in variables it derives from large categoricals.
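A minimal sketch of such a split (the synthetic data and the 40/30/30 proportions are only illustrative choices; pruneSig=c() means no significance pruning):

set.seed(2352)
# hypothetical synthetic data: categorical 'x', numeric 'z', numeric outcome 'y'
dAll <- data.frame(x=sample(c('a','b','c',NA),100,replace=TRUE),
                   z=runif(100),
                   y=rnorm(100))
splitProb <- runif(nrow(dAll))
dCal   <- dAll[splitProb<0.4,,drop=FALSE]                  # calibration: design the treatments
dTrain <- dAll[splitProb>=0.4 & splitProb<0.7,,drop=FALSE] # training: fit your model
dTest  <- dAll[splitProb>=0.7,,drop=FALSE]                 # test: estimate out-of-training performance
treatments <- designTreatmentsN(dCal,setdiff(colnames(dCal),'y'),'y')
dTrainTreated <- prepare(treatments,dTrain,pruneSig=c())
dTestTreated  <- prepare(treatments,dTest,pruneSig=c())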
Some trivial execution examples (not demonstrating any calibration/train/test split) are given below. Variables that do not vary during hold-out testing are considered “not to move.”
library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
head(dTrainC)
## x z y
## 1 a 1 FALSE
## 2 a 2 FALSE
## 3 a 3 TRUE
## 4 b 4 FALSE
## 5 b NA TRUE
## 6 <NA> 6 TRUE
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestC)
## x z
## 1 a 10
## 2 b 20
## 3 c 30
## 4 <NA> NA
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "desigining treatments Mon May 2 21:01:59 2016"
## [1] "design var x Mon May 2 21:01:59 2016"
## [1] "design var z Mon May 2 21:01:59 2016"
## [1] "scoring treatments Mon May 2 21:01:59 2016"
## [1] "have treatment plan Mon May 2 21:01:59 2016"
## [1] "rescoring complex variables Mon May 2 21:01:59 2016"
## [1] "done rescoring complex variables Mon May 2 21:01:59 2016"
print(treatmentsC)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
##
## $treatments[[2]]
## [1] "vtreat 'Prevalence Code'('x'(integer,factor)->character->'x_catP')"
##
## $treatments[[3]]
## [1] "vtreat 'Bayesian Impact Code'('x'(integer,factor)->character->'x_catB')"
##
## $treatments[[4]]
## [1] "vtreat 'Scalable pass through'('z'(double,numeric)->numeric->'z_clean')"
##
## $treatments[[5]]
## [1] "vtreat 'is.bad'('z'(double,numeric)->numeric->'z_isBAD')"
##
##
## $scoreFrame
## varName varMoves psig sig csig needsSplit origName
## 1 x_lev_NA TRUE 0.3739010 0.20766228 0.20766228 FALSE x
## 2 x_lev_x.a TRUE 0.5185185 0.40972582 0.40972582 FALSE x
## 3 x_lev_x.b TRUE 1.0000000 1.00000000 1.00000000 FALSE x
## 4 x_catP TRUE 0.3739010 0.25493078 0.25493078 TRUE x
## 5 x_catB TRUE 0.1155611 0.05044486 0.05044486 TRUE x
## 6 z_clean TRUE 0.2562868 0.14299775 0.14299775 FALSE z
## 7 z_isBAD TRUE 0.3739010 0.20766228 0.20766228 FALSE z
## code
## 1 lev
## 2 lev
## 3 lev
## 4 catP
## 5 catB
## 6 clean
## 7 isBAD
##
## $outcomename
## [1] "y"
##
## $outcomeTarget
## [1] TRUE
##
## $outcomeType
## [1] "Binary"
##
## $vtreatVersion
## [1] '0.5.25'
##
## attr(,"class")
## [1] "treatmentplan"
print(treatmentsC$treatments[[1]])
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
Here we demonstrate the optional scaling feature of prepare(), which centers all significant variables to mean 0 and rescales them to slope 1 with respect to y; in other words, it rescales the variables to “y-units”. This is useful for downstream principal components analysis. Note: variables perfectly uncorrelated with y necessarily have slope 0 and can’t be “scaled” to slope 1; however, for the same reason these variables will be insignificant and can be pruned by pruneSig. Scaling is off (scale=FALSE) by default.
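For example, pruning can be requested at prepare() time by passing a numeric significance threshold (the 0.3 cutoff below is only an illustrative choice):

# keep only derived variables whose scoreFrame significance is below 0.3
dTrainCPruned <- prepare(treatmentsC,dTrainC,pruneSig=0.3)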
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
head(dTrainCTreated)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catB z_clean z_isBAD
## 1 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 -0.38648649 -0.1
## 2 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 -0.21081081 -0.1
## 3 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 -0.03513514 -0.1
## 4 -0.1 0.1666667 -9.614813e-17 0.1 0.05164882 0.14054054 -0.1
## 5 -0.1 0.1666667 -9.614813e-17 0.1 0.05164882 0.00000000 0.5
## 6 0.5 0.1666667 4.807407e-17 0.4 0.46186845 0.49189189 -0.1
## y
## 1 FALSE
## 2 FALSE
## 3 TRUE
## 4 FALSE
## 5 TRUE
## 6 TRUE
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catB
## -6.938894e-18 -2.775558e-17 4.108149e-33 1.850372e-17 -2.312965e-18
## z_clean z_isBAD
## 9.251859e-18 -6.938894e-18
# all slopes should be 1 for variables with treatmentsC$scoreFrame$sig<1
sapply(varsC,function(c) { lm(paste('y',c,sep='~'),
data=dTrainCTreated)$coefficients[[2]]})
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catB z_clean z_isBAD
## 1 1 0 1 1 1 1
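Since the scaled variables are centered and in “y-units”, they can be passed directly to a principal components analysis; a small sketch using base R’s prcomp() (not part of ‘vtreat’):

# principal components on the y-scaled variables (columns are already centered at mean 0)
pcC <- prcomp(dTrainCTreated[,varsC,drop=FALSE],center=FALSE,scale.=FALSE)
summary(pcC)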
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)
head(dTestCTreated)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catB z_clean z_isBAD
## 1 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 0.4918919 -0.1
## 2 -0.1 0.1666667 -9.614813e-17 0.1 0.05164882 0.4918919 -0.1
## 3 -0.1 0.1666667 4.807407e-17 0.7 0.05164882 0.4918919 -0.1
## 4 0.5 0.1666667 4.807407e-17 0.4 0.46186845 0.0000000 0.5
# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA),
                      z=c(1,2,3,4,5,NA,7),
                      y=c(0,0,0,1,0,1,1))
head(dTrainN)
## x z y
## 1 a 1 0
## 2 a 2 0
## 3 a 3 0
## 4 a 4 1
## 5 b 5 0
## 6 b NA 1
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestN)
## x z
## 1 a 10
## 2 b 20
## 3 c 30
## 4 <NA> NA
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "desigining treatments Mon May 2 21:01:59 2016"
## [1] "design var x Mon May 2 21:01:59 2016"
## [1] "design var z Mon May 2 21:01:59 2016"
## [1] "scoring treatments Mon May 2 21:01:59 2016"
## [1] "have treatment plan Mon May 2 21:01:59 2016"
## [1] "rescoring complex variables Mon May 2 21:01:59 2016"
## [1] "done rescoring complex variables Mon May 2 21:01:59 2016"
print(treatmentsN)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
##
## $treatments[[2]]
## [1] "vtreat 'Prevalence Code'('x'(integer,factor)->character->'x_catP')"
##
## $treatments[[3]]
## [1] "vtreat 'Scalable Impact Code'('x'(integer,factor)->character->'x_catN')"
##
## $treatments[[4]]
## [1] "vtreat 'Deviation Fact'('x'(integer,factor)->character->'x_catD')"
##
## $treatments[[5]]
## [1] "vtreat 'Scalable pass through'('z'(double,numeric)->numeric->'z_clean')"
##
## $treatments[[6]]
## [1] "vtreat 'is.bad'('z'(double,numeric)->numeric->'z_isBAD')"
##
##
## $scoreFrame
## varName varMoves psig sig needsSplit origName code
## 1 x_lev_NA TRUE 0.2855909 0.2855909 FALSE x lev
## 2 x_lev_x.a TRUE 0.3524132 0.3524132 FALSE x lev
## 3 x_lev_x.b TRUE 0.8456711 0.8456711 FALSE x lev
## 4 x_catP TRUE 0.2721791 0.2721791 TRUE x catP
## 5 x_catN TRUE 0.3548219 0.3548219 TRUE x catN
## 6 x_catD TRUE 0.4835362 0.4835362 TRUE x catD
## 7 z_clean TRUE 0.1724763 0.1724763 FALSE z clean
## 8 z_isBAD TRUE 0.2855909 0.2855909 FALSE z isBAD
##
## $outcomename
## [1] "y"
##
## $outcomeType
## [1] "Numeric"
##
## $vtreatVersion
## [1] '0.5.25'
##
## attr(,"class")
## [1] "treatmentplan"
dTrainNTreated <- prepare(treatmentsN,dTrainN,
pruneSig=c(),scale=TRUE)
head(dTrainNTreated)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catN x_catD
## 1 -0.0952381 -0.1785714 -0.02857143 -0.2 -0.17857143 -0.1785714
## 2 -0.0952381 -0.1785714 -0.02857143 -0.2 -0.17857143 -0.1785714
## 3 -0.0952381 -0.1785714 -0.02857143 -0.2 -0.17857143 -0.1785714
## 4 -0.0952381 -0.1785714 -0.02857143 -0.2 -0.17857143 -0.1785714
## 5 -0.0952381 0.2380952 0.07142857 0.2 0.07142857 0.2380952
## 6 -0.0952381 0.2380952 0.07142857 0.2 0.07142857 0.2380952
## z_clean z_isBAD y
## 1 -0.41904762 -0.0952381 0
## 2 -0.26190476 -0.0952381 0
## 3 -0.10476190 -0.0952381 0
## 4 0.05238095 -0.0952381 1
## 5 0.20952381 -0.0952381 0
## 6 0.00000000 0.5714286 1
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catN
## -3.965082e-18 0.000000e+00 -2.974054e-18 7.930164e-17 -3.965082e-18
## x_catD z_clean z_isBAD
## -9.515810e-17 4.757324e-17 -3.967986e-18
# all slopes should be 1 for variables with treatmentsN$scoreFrame$sig<1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
data=dTrainNTreated)$coefficients[[2]]})
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catN x_catD z_clean
## 1 1 1 1 1 1 1
## z_isBAD
## 1
# prepared frame (without scaling)
dTestNTreated <- prepare(treatmentsN,dTestN,
pruneSig=c())
head(dTestNTreated)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catN x_catD z_clean
## 1 0 1 0 0.5714286 -0.17857143 0.5000000 7.000000
## 2 0 0 1 0.2857143 0.07142857 0.7071068 7.000000
## 3 0 0 0 0.0000000 0.00000000 0.7071068 7.000000
## 4 1 0 0 0.1428571 0.57142857 0.7071068 3.666667
## z_isBAD
## 1 0
## 2 0
## 3 0
## 4 1
# scaled prepared frame
dTestNTreatedS <- prepare(treatmentsN,dTestN,
pruneSig=c(),scale=TRUE)
head(dTestNTreatedS)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catN x_catD
## 1 -0.0952381 -0.1785714 -0.02857143 -0.2 -1.785714e-01 -0.1785714
## 2 -0.0952381 0.2380952 0.07142857 0.2 7.142857e-02 0.2380952
## 3 -0.0952381 0.2380952 -0.02857143 0.6 -1.586033e-17 0.2380952
## 4 0.5714286 0.2380952 -0.02857143 0.4 5.714286e-01 0.2380952
## z_clean z_isBAD
## 1 0.5238095 -0.0952381
## 2 0.5238095 -0.0952381
## 3 0.5238095 -0.0952381
## 4 0.0000000 0.5714286
Related work: