vtreat::prepare(scale=TRUE) is a variation of vtreat::prepare() intended to prepare data frames so that all the derived input or independent (x) variables are fully in outcome or dependent variable (y) units (in the sense of a regression; categorical/logical y’s are treated as 0/1 indicators) and mean-zero.
This is the appropriate preparation before a geometry/metric sensitive modeling step such as principal components analysis or clustering (such as k-means clustering).
Normally (with vtreat::prepare(scale=FALSE)) vtreat passes through a number of variables with minimal alteration (cleaned numerics), builds 0/1 indicator variables for various conditions (categorical levels, presence of NAs, and so on), and builds some “in y-units” variables (catN, catB) that are in fact sub-models. With vtreat::prepare(scale=TRUE) all of these numeric variables are then re-processed to have mean zero and, when possible, slope 1 when regressed against the y-variable.
This is easiest to illustrate with a concrete example.
library('vtreat')
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE,
                                 verbose=FALSE)
dTrainCTreatedUnscaled <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=FALSE)
dTrainCTreatedScaled <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
The standard vtreat treatment converts the original data from this:
print(dTrainC)
## x y
## 1 a FALSE
## 2 a FALSE
## 3 a TRUE
## 4 b FALSE
## 5 b TRUE
## 6 <NA> TRUE
into this:
print(dTrainCTreatedUnscaled)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catB y
## 1 0 1 0 0.5000000 -0.4052985 FALSE
## 2 0 1 0 0.5000000 -0.4052985 FALSE
## 3 0 1 0 0.5000000 -0.4052985 TRUE
## 4 0 0 1 0.3333333 0.0000000 FALSE
## 5 0 0 1 0.3333333 0.0000000 TRUE
## 6 1 0 0 0.1666667 0.6926476 TRUE
This is the “standard way” to run vtreat, with the exception that for this example we set pruneSig to NULL (written as c() in the calls above) to suppress variable pruning, instead of setting it to a value in the interval (0,1). The principle is: vtreat inflicts the minimal possible alterations on the data, leaving as much as possible to the downstream machine learning code. This does turn out to already be a lot of alteration. Mostly vtreat is taking only steps that are unsafe to leave for later: re-encoding of large categoricals, re-coding of aberrant values, and bulk pruning of variables.
However, some procedures, in particular principal components analysis or geometric clustering, assume all of the columns have been fully transformed. The usual assumption (“more honored in the breach than the observance”) is that the columns are centered (mean zero) and scaled. The non y-aware meaning of “scaled” is unit variance. However, vtreat is designed to emphasize y-aware processing and we feel the y-aware sense of scaling should be: unit slope when regressed against y. If you want standard scaling you can use the standard frame produced by vtreat and scale it yourself. If you want vtreat style y-aware scaling (which we strongly think is the right thing to do) you can use vtreat::prepare(scale=TRUE), which produces a frame that looks like the following:
print(dTrainCTreatedScaled)
## x_lev_NA x_lev_x.a x_lev_x.b x_catP x_catB y
## 1 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 FALSE
## 2 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 FALSE
## 3 -0.1 -0.1666667 4.807407e-17 -0.2 -0.18838870 TRUE
## 4 -0.1 0.1666667 -9.614813e-17 0.1 0.05164882 FALSE
## 5 -0.1 0.1666667 -9.614813e-17 0.1 0.05164882 TRUE
## 6 0.5 0.1666667 4.807407e-17 0.4 0.46186845 TRUE
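The two senses of "scaled" can be contrasted by hand on a single column. The following base R sketch is only an illustration, not vtreat's internal code: the helper name yAwareScaleSketch is hypothetical, and it assumes that y-aware scaling amounts to centering a column and multiplying it by the slope of y regressed on that column. That assumption does reproduce the x_lev_NA values printed above.

```r
# Rebuild the example data and the missingness indicator by hand
# (this matches x_lev_NA in the unscaled frame).
d <- data.frame(x = c('a','a','a','b','b',NA),
                y = c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
xLevNA <- as.numeric(is.na(d$x))
yNum <- as.numeric(d$y)   # treat the logical outcome as a 0/1 indicator

# Standard (non y-aware) scaling: mean zero, unit variance.
standardScaled <- as.numeric(scale(xLevNA))

# Hypothetical helper (an assumption, not a vtreat function):
# center the column, then multiply by the slope of y ~ x.
yAwareScaleSketch <- function(x, y) {
  slope <- unname(coef(lm(y ~ x))[2])
  if (is.na(slope)) {
    slope <- 0   # constant column: no slope can be estimated
  }
  (x - mean(x)) * slope
}
yAwareScaled <- yAwareScaleSketch(xLevNA, yNum)

print(round(standardScaled, 4))
print(round(yAwareScaled, 4))   # compare with the x_lev_NA column above

# Regressing y on the y-aware column gives a slope of 1.
print(unname(coef(lm(yNum ~ yAwareScaled))[2]))
```

The standard version depends only on the spread of the column, while the y-aware version shrinks or stretches the column according to how strongly it relates to y.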
First we can check the claims. Are the variables mean-zero and slope 1 when regressed against y?
slopeFrame <- data.frame(varName = treatmentsC$scoreFrame$varName,
                         stringsAsFactors = FALSE)
slopeFrame$mean <-
  vapply(dTrainCTreatedScaled[, slopeFrame$varName, drop = FALSE], mean,
         numeric(1))
slopeFrame$slope <- vapply(slopeFrame$varName,
                           function(c) {
                             lm(paste('y', c, sep = '~'),
                                data = dTrainCTreatedScaled)$coefficients[[2]]
                           },
                           numeric(1))
slopeFrame$sig <- vapply(slopeFrame$varName,
                         function(c) {
                           treatmentsC$scoreFrame[treatmentsC$scoreFrame$varName == c, 'sig']
                         },
                         numeric(1))
slopeFrame$badSlope <-
  ifelse(is.na(slopeFrame$slope), TRUE, abs(slopeFrame$slope - 1) > 1.e-8)
print(slopeFrame)
## varName mean slope sig badSlope
## 1 x_lev_NA -6.938894e-18 1 0.20766228 FALSE
## 2 x_lev_x.a -2.775558e-17 1 0.40972582 FALSE
## 3 x_lev_x.b 4.108149e-33 0 1.00000000 TRUE
## 4 x_catP 1.850372e-17 1 0.25493078 FALSE
## 5 x_catB -2.312965e-18 1 0.05044486 FALSE
The above claims are true with the exception of the derived variable x_lev_x.b. This is because the outcome variable y has an identical distribution when the original variable x=='b' and when x!='b' (TRUE half the time in both cases). This means that, in this sample, y is perfectly independent of x=='b' and the regression slope must be zero (thus, cannot be 1). vtreat now treats this as needing to scale by a multiplicative factor of zero. Note also that the significance level associated with x_lev_x.b is large, making this variable easy to prune. The varMoves and significance facts in treatmentsC$scoreFrame are about the unscaled frame (where x_lev_x.b does in fact move).
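The zero-slope case can be reproduced by hand. The following base R sketch (an illustration with our own variable names, not vtreat's code) rebuilds the x=='b' indicator, checks the rate of y==TRUE on each side of it, and confirms that the fitted slope is numerically zero.

```r
d <- data.frame(x = c('a','a','a','b','b',NA),
                y = c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))

# Indicator for x == 'b'; the NA row counts as "not b".
isB <- as.numeric(!is.na(d$x) & d$x == 'b')
yNum <- as.numeric(d$y)

# Rate of y == TRUE within each side of the indicator.
rateByGroup <- tapply(yNum, isB, mean)
print(rateByGroup)

# Slope of y regressed on the indicator: zero up to floating point error.
slope <- unname(coef(lm(yNum ~ isB))[2])
print(slope)

# Centering and multiplying by a zero slope leaves an (effectively) all-zero column.
scaledB <- (isB - mean(isB)) * slope
print(max(abs(scaledB)))
```

Because the two group rates agree, no rescaling of the indicator can give it a slope of 1 against y, which is why a multiplicative factor of zero is the only consistent choice.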
Previous versions of vtreat (0.5.22 and earlier) would copy variables that could not be sensibly scaled into the treated frame unaltered. This was considered the “most faithful” thing to do. However, we now feel that this practice was not safe for many downstream procedures, such as principal components analysis and geometric clustering.
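A rough base R sketch of why the pass-through behavior can matter for a variance-driven method such as principal components analysis. This is our illustration, not part of vtreat: the scaled values are copied (rounded) from the printed y-aware frame, and the "pass-through" frame is a hypothetical reconstruction in which x_lev_x.b keeps its raw 0/1 values.

```r
# Scaled columns copied (rounded) from the printed y-aware frame.
scaled <- data.frame(
  x_lev_NA  = c(-0.1, -0.1, -0.1, -0.1, -0.1, 0.5),
  x_lev_x.a = c(-1, -1, -1, 1, 1, 1) / 6,
  x_lev_x.b = 0,
  x_catP    = c(-0.2, -0.2, -0.2, 0.1, 0.1, 0.4),
  x_catB    = c(-0.18838870, -0.18838870, -0.18838870,
                0.05164882, 0.05164882, 0.46186845))

# Hypothetical pass-through frame: x_lev_x.b left as the raw 0/1 indicator.
passThrough <- scaled
passThrough$x_lev_x.b <- c(0, 0, 0, 1, 1, 0)

# Do not re-scale inside prcomp(): scale. = TRUE would undo the y-aware scaling.
pcaScaled <- prcomp(scaled, center = TRUE, scale. = FALSE)
pcaPassThrough <- prcomp(passThrough, center = TRUE, scale. = FALSE)

# Share of the total variance contributed by x_lev_x.b in each frame.
shareScaled <- var(scaled$x_lev_x.b) / sum(pcaScaled$sdev^2)
sharePassThrough <- var(passThrough$x_lev_x.b) / sum(pcaPassThrough$sdev^2)
print(c(scaled = shareScaled, passThrough = sharePassThrough))
```

In this toy example the raw indicator, which carries no information about y, would account for a bit over half of the total variance that the principal components are fit to, while the zeroed column contributes none.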