The SuperML R package is designed to unify the model training process in R, much as scikit-learn does in Python. People generally spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. This package provides a scikit-learn-like interface (fit, predict) to train models faster.
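To illustrate, here is a minimal sketch of that fit/predict pattern on the built-in iris data, using the RFTrainer class covered later in this vignette:
library(superml)
# every trainer follows the same pattern: create a trainer object,
# fit it on a data set, then predict on new data
rf <- RFTrainer$new(n_estimators = 100, classification = 1)
rf$fit(X = iris, y = "Species")
preds <- rf$predict(df = iris)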
In addition to building machine learning models, the package includes handy functions for feature engineering.
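For instance, a minimal sketch of the LabelEncoder utility (toy values here; the same fit/transform calls are applied to real columns later in this tutorial):
lbl <- LabelEncoder$new()
lbl$fit(c("red", "green", "red"))
lbl$transform(c("red", "green"))  # returns integer codes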
This ambitious package is my ongoing effort to help the R community build ML models easily and quickly.
You can install the latest CRAN version (recommended) using:
install.packages("superml")
You can install the development version directly from GitHub using:
devtools::install_github("saraswatmks/superml")
This package uses existing R packages to build machine learning models. In this tutorial, we'll use the data.table package for all data-manipulation tasks.
We'll quickly prepare the data set for model training.
load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")
library(data.table)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(superml)
#> Loading required package: R6
library(kableExtra)
library(Metrics)
#>
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#>
#> precision, recall
kable(head(reg_train, 10)) %>%
scroll_box(width = "100%", height = "300px")
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
7 | 20 | RL | 75 | 10084 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | WD | Normal | 307000 |
8 | 60 | RL | NA | 10382 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NA | NA | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
9 | 50 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2008 | WD | Abnorml | 129900 |
10 | 190 | RL | 50 | 7420 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2008 | WD | Normal | 118000 |
# createDataPartition returns a list of training-row indices;
# we keep 70% of the rows for training and the rest for testing
split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])
xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]
# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]
for(c in cat_cols){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
(this message repeats for every encoded column that contains NA values)
# remove noisy columns
noise <- c('GrLivArea','TotalBsmtSF')
xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]
# fill missing values with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1
KNN Regression
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'reg')
knn$fit(train = xtrain, test = xtest, y = 'SalePrice')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
rmse(actual = xtest$SalePrice, predicted = labels)
#> [1] 5517.411
SVM Regression
# least-squares SVM for regression
svm <- SVMTrainer$new(type="ls")
svm$fit(xtrain, 'SalePrice')
#> Removing invalid columns. The names should not start with anumber: 1stFlrSF,2ndFlrSF,3SsnPorch
pred <- svm$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 34516.13
Simple Regression
lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -348560 -14786 -618 12638 217495
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.284e+06 1.547e+06 -1.477 0.140123
#> MSSubClass -8.554e+01 6.572e+01 -1.301 0.193421
#> MSZoning 2.818e+02 1.411e+03 0.200 0.841784
#> LotFrontage -2.791e+01 3.368e+01 -0.829 0.407503
#> LotArea 2.761e-01 1.509e-01 1.830 0.067623 .
#> Street -5.565e+04 2.055e+04 -2.708 0.006892 **
#> LotShape -1.492e+03 2.002e+03 -0.745 0.456236
#> LandContour 2.793e+03 2.143e+03 1.304 0.192665
#> Utilities -6.427e+04 3.343e+04 -1.923 0.054806 .
#> LotConfig 1.502e+03 1.063e+03 1.413 0.158130
#> LandSlope 5.832e+03 5.571e+03 1.047 0.295470
#> Neighborhood -2.862e+02 2.020e+02 -1.416 0.156957
#> Condition1 -2.634e+03 8.796e+02 -2.994 0.002824 **
#> Condition2 -2.429e+03 4.164e+03 -0.583 0.559909
#> BldgType -2.294e+03 2.764e+03 -0.830 0.406867
#> HouseStyle 2.063e+02 9.941e+02 0.208 0.835639
#> OverallQual 1.456e+04 1.378e+03 10.564 < 2e-16 ***
#> OverallCond 6.761e+03 1.201e+03 5.630 2.38e-08 ***
#> YearBuilt 5.409e+02 7.887e+01 6.858 1.26e-11 ***
#> YearRemodAdd 8.737e+01 7.926e+01 1.102 0.270629
#> RoofStyle 6.421e+03 1.949e+03 3.294 0.001023 **
#> RoofMatl -3.760e+04 3.973e+03 -9.464 < 2e-16 ***
#> Exterior1st -1.720e+03 6.438e+02 -2.672 0.007668 **
#> Exterior2nd 1.869e+03 6.179e+02 3.024 0.002562 **
#> MasVnrType 4.639e+03 1.612e+03 2.879 0.004082 **
#> MasVnrArea 2.500e+01 7.432e+00 3.364 0.000798 ***
#> ExterQual 1.575e+03 2.342e+03 0.673 0.501381
#> ExterCond 1.249e+03 2.355e+03 0.530 0.596122
#> Foundation -2.028e+03 1.744e+03 -1.163 0.245210
#> BsmtQual 6.699e+03 1.530e+03 4.379 1.33e-05 ***
#> BsmtCond -3.194e+03 1.849e+03 -1.728 0.084363 .
#> BsmtExposure 1.133e+03 8.694e+02 1.304 0.192677
#> BsmtFinType1 -1.473e+03 7.050e+02 -2.090 0.036911 *
#> BsmtFinSF1 1.929e+01 5.932e+00 3.251 0.001190 **
#> BsmtFinType2 1.788e+02 1.373e+03 0.130 0.896396
#> BsmtFinSF2 1.196e+01 1.145e+01 1.044 0.296809
#> BsmtUnfSF 7.807e+00 5.589e+00 1.397 0.162741
#> Heating -7.083e+02 3.262e+03 -0.217 0.828160
#> HeatingQC -1.448e+03 1.390e+03 -1.042 0.297795
#> CentralAir 7.169e+03 5.572e+03 1.287 0.198490
#> Electrical 2.958e+03 1.978e+03 1.496 0.135019
#> `1stFlrSF` 5.727e+01 7.244e+00 7.906 7.36e-15 ***
#> `2ndFlrSF` 5.095e+01 6.095e+00 8.360 2.22e-16 ***
#> LowQualFinSF 3.968e+00 2.178e+01 0.182 0.855466
#> BsmtFullBath 8.272e+03 2.946e+03 2.808 0.005083 **
#> BsmtHalfBath -2.294e+03 4.444e+03 -0.516 0.605775
#> FullBath 8.174e+03 3.218e+03 2.540 0.011231 *
#> HalfBath -6.042e+02 2.993e+03 -0.202 0.840077
#> BedroomAbvGr -7.474e+03 1.921e+03 -3.891 0.000107 ***
#> KitchenAbvGr -2.916e+04 5.940e+03 -4.909 1.08e-06 ***
#> KitchenQual 9.395e+03 1.784e+03 5.267 1.71e-07 ***
#> TotRmsAbvGrd 4.822e+03 1.383e+03 3.487 0.000511 ***
#> Functional -4.269e+03 1.419e+03 -3.007 0.002704 **
#> Fireplaces -4.591e+03 2.727e+03 -1.684 0.092581 .
#> FireplaceQu 3.733e+03 1.390e+03 2.686 0.007352 **
#> GarageType 1.801e+03 1.308e+03 1.376 0.169025
#> GarageYrBlt -1.002e+00 4.676e+00 -0.214 0.830379
#> GarageFinish 1.446e+03 1.470e+03 0.983 0.325648
#> GarageCars 1.411e+04 3.395e+03 4.155 3.54e-05 ***
#> GarageArea -6.969e+00 1.137e+01 -0.613 0.540048
#> GarageQual 2.293e+03 2.931e+03 0.782 0.434192
#> GarageCond -7.950e+02 2.957e+03 -0.269 0.788099
#> PavedDrive -1.471e+03 3.072e+03 -0.479 0.632228
#> WoodDeckSF 3.517e+01 8.970e+00 3.921 9.44e-05 ***
#> OpenPorchSF 1.330e+01 1.682e+01 0.791 0.429295
#> EnclosedPorch 1.675e+01 1.782e+01 0.940 0.347490
#> `3SsnPorch` 1.839e+01 3.125e+01 0.588 0.556421
#> ScreenPorch 8.479e+01 1.971e+01 4.302 1.87e-05 ***
#> PoolArea 1.273e+01 2.728e+01 0.467 0.640909
#> Fence -1.434e+03 1.355e+03 -1.058 0.290293
#> MiscVal -5.677e-01 1.815e+00 -0.313 0.754468
#> MoSold 2.421e+02 3.779e+02 0.640 0.522020
#> YrSold 4.812e+02 7.699e+02 0.625 0.532145
#> SaleType 2.692e+03 1.213e+03 2.220 0.026647 *
#> SaleCondition 6.574e+02 1.317e+03 0.499 0.617920
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 961763883)
#>
#> Null deviance: 6.7644e+12 on 1023 degrees of freedom
#> Residual deviance: 9.1271e+11 on 949 degrees of freedom
#> AIC: 24161
#>
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 42085.83
Lasso Regression
In LMTrainer, alpha = 1 selects the lasso (L1) penalty and alpha = 0 the ridge (L2) penalty, following the glmnet convention.
lf <- LMTrainer$new(family = "gaussian", alpha = 1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39251.53
Ridge Regression
lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39048.85
Linear Regression with CV
lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
#> Computation done.
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39868.51
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> OverallQual 853779674918
#> GarageCars 540979092620
#> 1stFlrSF 524075273991
#> GarageArea 470339867542
#> YearBuilt 389244470841
#> FullBath 316880539159
#> BsmtFinSF1 302814894689
#> GarageYrBlt 299711687944
#> TotRmsAbvGrd 222563877576
#> 2ndFlrSF 195898795460
#> ExterQual 191977400578
#> LotArea 187452249182
#> YearRemodAdd 174720881654
#> KitchenQual 153588077672
#> Fireplaces 139497407084
#> FireplaceQu 135849726989
#> BsmtQual 130356448141
#> MasVnrArea 116484015593
#> Foundation 111896573727
#> LotFrontage 102110064822
#> OpenPorchSF 95971655823
#> BsmtFinType1 91272652787
#> BsmtUnfSF 72496983186
#> WoodDeckSF 60744463856
#> BedroomAbvGr 47606931429
#> HeatingQC 46648901825
#> GarageType 46294317752
#> RoofStyle 46232079711
#> Neighborhood 45190484978
#> Exterior2nd 37110890894
#> HalfBath 36835880001
#> MoSold 36165618613
#> MSSubClass 35471695037
#> OverallCond 34034719437
#> HouseStyle 32011515205
#> GarageFinish 28439661504
#> Exterior1st 27414958735
#> BsmtFullBath 23095348746
#> YrSold 23032533124
#> SaleCondition 21796081615
#> LotShape 20844040940
#> BsmtExposure 20493537059
#> PoolArea 18153017479
#> MasVnrType 16526542389
#> LotConfig 16440419662
#> MSZoning 15354466600
#> ScreenPorch 14894425924
#> CentralAir 14083855484
#> SaleType 13252524029
#> LandContour 12787679270
#> BldgType 11920439520
#> EnclosedPorch 11833958367
#> GarageCond 11496596053
#> BsmtCond 10602907496
#> Condition1 9880847832
#> KitchenAbvGr 9506013503
#> GarageQual 9468060341
#> Fence 9028409858
#> LandSlope 8876637237
#> ExterCond 8784784300
#> BsmtFinType2 6324964125
#> PavedDrive 5700637874
#> BsmtFinSF2 5460122633
#> Functional 4918683792
#> RoofMatl 2946224767
#> Electrical 2909961298
#> MiscVal 2600616431
#> 3SsnPorch 1714230264
#> BsmtHalfBath 1462471611
#> LowQualFinSF 1332849288
#> Heating 1219512331
#> Street 883151232
#> Condition2 525051591
#> Utilities 21047446
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 32191.64
Xgboost
xgb <- XGBTrainer$new(objective = "reg:linear"
, n_estimators = 500
, eval_metric = "rmse"
, maximize = F
, learning_rate = 0.1
,max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:180448.046875 val-rmse:175184.421875
#> [51] train-rmse:8857.614258 val-rmse:32693.087891
#> [101] train-rmse:4940.895020 val-rmse:32328.214844
#> [151] train-rmse:3089.641357 val-rmse:32272.939453
#> [201] train-rmse:1999.938965 val-rmse:32301.414062
#> [251] train-rmse:1348.789185 val-rmse:32299.994141
#> [301] train-rmse:918.838562 val-rmse:32285.113281
#> [351] train-rmse:624.947327 val-rmse:32284.828125
#> [401] train-rmse:413.633789 val-rmse:32282.865234
#> [451] train-rmse:297.058624 val-rmse:32280.492188
#> [500] train-rmse:208.659836 val-rmse:32278.873047
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 32278.87
Grid Search
Note that accuracy and auc are classification metrics: with a continuous target like SalePrice they come back as 0/NaN in the summary below (the same holds for the random search that follows), so the scores themselves are not meaningful here.
xgb <- XGBTrainer$new(objective="reg:linear")
gst <-GridSearchCV$new(trainer = xgb,
parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'))
gst$fit(xtrain, "SalePrice")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:142583.875000
#> [10] train-rmse:14989.334961
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:144377.453125
#> [10] train-rmse:15865.090820
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143024.562500
#> [10] train-rmse:16595.208984
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:142583.875000
#> [50] train-rmse:3297.738525
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:144377.453125
#> [50] train-rmse:4253.146973
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143024.562500
#> [50] train-rmse:4584.068848
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143632.109375
#> [10] train-rmse:30646.046875
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:145269.359375
#> [10] train-rmse:30542.128906
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143839.015625
#> [10] train-rmse:27930.759766
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143632.109375
#> [50] train-rmse:17434.611328
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:145269.359375
#> [50] train-rmse:16815.578125
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143839.015625
#> [50] train-rmse:15829.471680
gst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0
#>
#> $accuracy_sd
#> [1] 0
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                          parameters = list(n_estimators = c(10, 50),
                                            max_depth = c(5, 2)),
                          n_folds = 3,
                          scoring = c('accuracy', 'auc'),
                          n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#>
#> $max_depth
#> [1] 2
#>
#> $accuracy_avg
#> [1] 0.0127107
#>
#> $accuracy_sd
#> [1] 0.006822147
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Here, we will solve a simple binary classification problem: predicting which passengers survived the sinking of the Titanic. The idea is to demonstrate how to use this package to solve classification problems.
Data Preparation
# load class
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")
kable(head(cla_train, 10)) %>%
scroll_box(width = "100%", height = "300px")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | | S |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | | C |
# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]
# encode categorical variables (same LabelEncoder loop as before)
for(c in c('Embarked','Sex','Cabin')){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA'
(this message repeats for each encoded column)
# impute missing Age values with the median
# (in practice, reuse the training median for xtest)
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = TRUE))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = TRUE))]
# drop these features
to_drop <- c('PassengerId','Ticket','Name')
xtrain <- xtrain[, -c(to_drop), with = FALSE]
xtest <- xtest[, -c(to_drop), with = FALSE]
Now, our data is ready to be served for model training. Let’s do it.
KNN Classification
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'class')
knn$fit(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
auc(actual = xtest$Survived, predicted = labels)
#> [1] 0.6776491
Naive Bayes Classification
nb <- NBTrainer$new()
nb$fit(xtrain, 'Survived')
pred <- nb$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)
#> [1] 0.710828
SVM Classification
# predicts probabilities; must specify mc_type ("OvA_hinge", "AvA_hinge")
svm <- SVMTrainer$new(predict.prob = TRUE, type = "bc", mc_type = "OvA_hinge")
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred[,2])
#> [1] 0.784916
# predicts labels
svm <- SVMTrainer$new(predict.prob = FALSE, type = "bc")
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)
#> [1] 0.7381008
Logistic Regression
lf <- LMTrainer$new(family="binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0647 -0.5139 -0.3550 0.5659 2.5979
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.882835 0.667638 2.820 0.00480 **
#> Pclass -0.991285 0.198153 -5.003 5.66e-07 ***
#> Sex 3.014839 0.250533 12.034 < 2e-16 ***
#> Age -0.050270 0.010402 -4.833 1.35e-06 ***
#> SibSp -0.376242 0.132598 -2.837 0.00455 **
#> Parch -0.137521 0.146524 -0.939 0.34796
#> Fare 0.001671 0.002794 0.598 0.54981
#> Cabin 0.017868 0.005923 3.017 0.00256 **
#> Embarked 0.076637 0.148818 0.515 0.60657
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 823.56 on 623 degrees of freedom
#> Residual deviance: 495.21 on 615 degrees of freedom
#> AIC: 513.21
#>
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.7930805
Lasso Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7981181
Ridge Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7937464
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 83.179331
#> Fare 49.530245
#> Age 44.026986
#> Cabin 27.806777
#> Pclass 22.210427
#> SibSp 13.742906
#> Parch 9.837351
#> Embarked 6.777296
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739
Xgboost
xgb <- XGBTrainer$new(objective = "binary:logistic"
, n_estimators = 500
, eval_metric = "auc"
, maximize = T
, learning_rate = 0.1
,max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-auc:0.910654 val-auc:0.815229
#> [51] train-auc:0.977882 val-auc:0.803243
#> [101] train-auc:0.990142 val-auc:0.808280
#> [151] train-auc:0.994508 val-auc:0.807354
#> [201] train-auc:0.996520 val-auc:0.809352
#> [251] train-auc:0.997454 val-auc:0.809496
#> [301] train-auc:0.998147 val-auc:0.808309
#> [351] train-auc:0.998554 val-auc:0.808136
#> [401] train-auc:0.998796 val-auc:0.809120
#> [451] train-auc:0.999060 val-auc:0.809699
#> [500] train-auc:0.999104 val-auc:0.809699
pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8096989
Grid Search
As in the regression example, the accuracy summary below shows 0 (likely because xgboost returns probabilities rather than hard class labels), but the auc values are meaningful.
xgb <- XGBTrainer$new(objective="binary:logistic")
gst <- GridSearchCV$new(trainer = xgb,
                        parameters = list(n_estimators = c(10, 50),
                                          max_depth = c(5, 2)),
                        n_folds = 3,
                        scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "Survived")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.129808
#> [10] train-error:0.098558
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.105769
#> [10] train-error:0.088942
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.098558
#> [10] train-error:0.069712
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.129808
#> [50] train-error:0.038462
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.105769
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.098558
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.177885
#> [10] train-error:0.153846
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.201923
#> [10] train-error:0.137019
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.182692
#> [10] train-error:0.115385
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.177885
#> [50] train-error:0.110577
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.201923
#> [50] train-error:0.096154
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.182692
#> [50] train-error:0.081731
gst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0
#>
#> $accuracy_sd
#> [1] 0
#>
#> $auc_avg
#> [1] 0.883034
#>
#> $auc_sd
#> [1] 0.0242347
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                          parameters = list(n_estimators = c(10, 50),
                                            max_depth = c(5, 2)),
                          n_folds = 3,
                          scoring = c('accuracy', 'auc'),
                          n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0.849359
#>
#> $accuracy_sd
#> [1] 0.0264787
#>
#> $auc_avg
#> [1] 0.8279856
#>
#> $auc_sd
#> [1] 0.02242134
Let's create a new feature from the target variable using target encoding and test a model.
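Conceptually, target encoding replaces each category level with a blend of that level's mean target value and the global mean, so rare levels shrink toward the global mean. Here is a minimal base-R sketch of the idea (the weighting below is illustrative only; superml's smoothMean() handles this for us in the real code that follows):
# illustrative smoothed target mean; `weight` controls how strongly
# rare categories shrink toward the global mean
smooth_target_mean <- function(x, y, weight = 10) {
  global_mean <- mean(y)
  enc <- tapply(y, x, function(v) (sum(v) + weight * global_mean) / (length(v) + weight))
  unname(enc[as.character(x)])
}
smooth_target_mean(x = c("S", "S", "C"), y = c(0, 1, 1))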
# add target encoding features
# compute the target-encoded feature once, then attach it to both sets
tmp <- smoothMean(train_df = xtrain,
                  test_df = xtest,
                  colname = "Embarked",
                  target = "Survived")
xtrain[, feat_01 := tmp$train[[2]]]
xtest[, feat_01 := tmp$test[[2]]]
# train a random forest with the new feature
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 85.213569
#> Fare 51.676287
#> Age 47.071256
#> Cabin 28.804936
#> Pclass 22.431287
#> SibSp 13.735815
#> Parch 9.643044
#> feat_01 4.449812
#> Embarked 4.385365
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739