Variable Selection Methods

Introduction

Variable selection refers to the process of choosing the most relevant variables to include in a regression model. Selection methods help to improve model performance and avoid overfitting.

Before we explore stepwise selection methods, let us take a quick look at all possible and best subset regression. Because they evaluate every possible combination of variables, these methods are computationally intensive and may crash your system if used with a large set of variables. We have included them in the package purely for educational purposes.

All Possible Regression

All subset regression tests all possible subsets of the set of potential independent variables. If there are \(k\) potential independent variables (besides the constant), then there are \(2^{k}\) distinct subsets of them to be tested. For example, if you have 10 candidate independent variables, the number of subsets to be tested is \(2^{10}\), which is 1024, and if you have 20 candidate variables, the number is \(2^{20}\), which is more than one million.

model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_all_possible(model)
##    Index N      Predictors  R-Square Adj. R-Square Mallow's Cp
## 3      1 1              wt 0.7528328     0.7445939  0.70869536
## 1      2 1            disp 0.7183433     0.7089548  0.67512054
## 2      3 1              hp 0.6024373     0.5891853  0.50969578
## 4      4 1            qsec 0.1752963     0.1478062  0.07541973
## 8      5 2           hp wt 0.8267855     0.8148396  0.78108710
## 10     6 2         wt qsec 0.8264161     0.8144448  0.77856272
## 6      7 2         disp wt 0.7809306     0.7658223  0.72532105
## 5      8 2         disp hp 0.7482402     0.7308774  0.69454380
## 7      9 2       disp qsec 0.7215598     0.7023571  0.66395284
## 9     10 2         hp qsec 0.6368769     0.6118339  0.52014395
## 14    11 3      hp wt qsec 0.8347678     0.8170643  0.78199548
## 11    12 3      disp hp wt 0.8268361     0.8082829  0.76789526
## 13    13 3    disp wt qsec 0.8264170     0.8078189  0.76988533
## 12    14 3    disp hp qsec 0.7541953     0.7278591  0.68301440
## 15    15 4 disp hp wt qsec 0.8351443     0.8107212  0.77102968
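
The listing above has 15 rows because one of the \(2^{4} = 16\) subsets is the empty (intercept-only) model, which is not listed. As a quick check of how fast the count grows, the base R sketch below evaluates \(2^{k}\) for a few values of \(k\); the variable names are chosen here for illustration only.

```r
# Number of candidate subsets for k potential predictors
k <- c(4, 10, 20)
n_subsets <- 2^k

# Subsets that contain at least one predictor (drop the empty model)
n_nonempty <- n_subsets - 1

data.frame(k = k, subsets = n_subsets, non_empty = n_nonempty)
```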

Best Subset Regression

Select the subset of predictors that does the best at meeting some well-defined objective criterion, such as having the largest \(R^{2}\) value or the smallest MSE, Mallow’s Cp or AIC.

model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_best_subset(model)
##    Best Subsets Regression    
## ------------------------------
## Model Index    Predictors
## ------------------------------
##      1         wt              
##      2         hp wt           
##      3         hp wt qsec      
##      4         disp hp wt qsec 
## ------------------------------
## 
##                                                    Subsets Regression Summary                                                    
## ---------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                           
## Model    R-Square    R-Square    R-Square     C(p)        AIC        SBIC        SBC         MSEP       FPE       HSP       APC  
## ---------------------------------------------------------------------------------------------------------------------------------
##   1        0.7528      0.7446      0.7087    12.4809    166.0294    74.2916    170.4266    296.9167    9.8572    0.3199    0.2801 
##   2        0.8268      0.8148      0.7811     2.3690    156.6523    66.5755    162.5153    215.5104    7.3563    0.2402    0.2091 
##   3        0.8348      0.8171       0.782     3.0617    157.1426    67.7238    164.4713    213.1929    7.4756    0.2461    0.2124 
##   4        0.8351      0.8107       0.771     5.0000    159.0696    70.0408    167.8640    220.8882    7.9497    0.2644    0.2259 
## ---------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria
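
To make the idea concrete, the following base R sketch picks, for each subset size, the combination of predictors with the largest R-squared. It is only an illustration of the principle under that single criterion, not a description of how olsrr computes its table; the helper names (`predictors`, `best_by_size`, `best`) are made up for this example.

```r
# Candidate predictors, as in the model above
predictors <- c("disp", "hp", "wt", "qsec")

# For each subset size, fit every combination and keep the one
# with the largest R-squared
best_by_size <- lapply(seq_along(predictors), function(size) {
  combos <- combn(predictors, size, simplify = FALSE)
  r2 <- vapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
    summary(fit)$r.squared
  }, numeric(1))
  data.frame(
    n          = size,
    predictors = paste(combos[[which.max(r2)]], collapse = " "),
    r_square   = max(r2)
  )
})

# One row per subset size
best <- do.call(rbind, best_by_size)
best
```

Because R-squared never decreases when a predictor is added, it can only rank models of the same size; comparing models across sizes needs a penalized criterion such as adjusted R-squared, Cp or AIC.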

Stepwise Selection

Stepwise regression is a method of fitting regression models that involves the iterative selection of independent variables to use in a model. It can be achieved through forward selection, backward elimination, or a combination of both methods. The forward selection approach starts with no variables and adds each new variable incrementally, testing for statistical significance, while the backward elimination method begins with a full model and then removes the least statistically significant variables one at a time.
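
The forward selection loop described above can be sketched in a few lines of base R. This is a minimal illustration on mtcars, not the olsrr implementation: the entry threshold `p_enter = 0.05` and all variable names are assumptions made for the example.

```r
# Forward selection by p-value, using only base R
remaining <- c("disp", "hp", "wt", "qsec")  # candidates not yet in the model
selected  <- character(0)                   # variables added so far
p_enter   <- 0.05                           # illustrative entry threshold

while (length(remaining) > 0) {
  # p-value of each candidate when added to the current model
  p_values <- sapply(remaining, function(v) {
    fit <- lm(reformulate(c(selected, v), response = "mpg"), data = mtcars)
    coef(summary(fit))[v, "Pr(>|t|)"]
  })

  # Stop when no candidate is significant at the chosen threshold
  if (min(p_values) > p_enter) break

  # Otherwise add the most significant candidate and continue
  best      <- names(which.min(p_values))
  selected  <- c(selected, best)
  remaining <- setdiff(remaining, best)
}

selected
```

Backward elimination follows the same pattern in reverse: start with all candidates in the model and repeatedly drop the variable with the largest p-value until every remaining variable passes the threshold.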

Model

We will use the model below throughout this article, except in the case of hierarchical selection. It is fitted to the surgical data set; the documentation of the data set (?surgical) describes the variables.

model <- lm(y ~ ., data = surgical)
summary(model)
## 
## Call:
## lm(formula = y ~ ., data = surgical)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285.36 -132.75  -10.00   89.48  790.12 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1148.823    242.328  -4.741 2.17e-05 ***
## bcs            62.390     24.470   2.550 0.014258 *  
## pindex          8.973      1.874   4.788 1.86e-05 ***
## enzyme_test     9.888      1.742   5.677 9.39e-07 ***
## liver_test     50.413     44.959   1.121 0.268109    
## age            -0.951      2.649  -0.359 0.721231    
## gender         15.874     58.475   0.271 0.787269    
## alc_mod         7.713     64.956   0.119 0.906007    
## alc_heavy     320.697     85.070   3.770 0.000474 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201.4 on 45 degrees of freedom
## Multiple R-squared:  0.7818, Adjusted R-squared:  0.7431 
## F-statistic: 20.16 on 8 and 45 DF,  p-value: 1.607e-12

Model specification

Irrespective of the stepwise method being used, we have to specify the full model, i.e. all the variables/predictors under consideration, as olsrr extracts the candidate variables for selection/elimination from the model specified.
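
As an illustration of this point, the sketch below contrasts the full model with a deliberately restricted one. The object names `full_model` and `small_model` are hypothetical; the term labels are read with base R, and the final call assumes that the selection function only considers the terms present in the model it receives, as described above.

```r
library(olsrr)

# Full model: all eight predictors in `surgical` are candidates
full_model <- lm(y ~ ., data = surgical)

# Restricted model: only these three predictors are candidates
small_model <- lm(y ~ bcs + pindex + enzyme_test, data = surgical)

# Candidate terms available for selection in each case
attr(terms(full_model), "term.labels")
attr(terms(small_model), "term.labels")

# Forward selection restricted to bcs, pindex and enzyme_test
ols_step_forward_p(small_model)
```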

Forward selection

# stepwise forward regression
ols_step_forward_p(model)
## 
## 
##                               Stepwise Summary                              
## --------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2 
## --------------------------------------------------------------------------
##  0      Base Model     802.606    806.584    646.794    0.00000    0.00000 
##  1      liver_test     771.875    777.842    616.009    0.45454    0.44405 
##  2      alc_heavy      761.439    769.395    605.506    0.56674    0.54975 
##  3      enzyme_test    750.509    760.454    595.297    0.65900    0.63854 
##  4      pindex         735.715    747.649    582.943    0.75015    0.72975 
##  5      bcs            730.620    744.543    579.638    0.78091    0.75808 
## --------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.884       RMSE                 184.276 
## R-Squared                 0.781       MSE                38202.426 
## Adj. R-Squared            0.758       Coef. Var             27.839 
## Pred R-Squared            0.700       AIC                  730.620 
## MAE                     137.656       SBC                  744.543 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6535804.090         5    1307160.818    34.217    0.0000 
## Residual      1833716.447        48      38202.426                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------
## (Intercept)    -1178.330       208.682                 -5.647    0.000    -1597.914    -758.746 
##  liver_test       58.064        40.144        0.156     1.446    0.155      -22.652     138.779 
##   alc_heavy      317.848        71.634        0.314     4.437    0.000      173.818     461.878 
## enzyme_test        9.748         1.656        0.521     5.887    0.000        6.419      13.077 
##      pindex        8.924         1.808        0.380     4.935    0.000        5.288      12.559 
##         bcs       59.864        23.060        0.241     2.596    0.012       13.498     106.230 
## ------------------------------------------------------------------------------------------------

Backward elimination

# stepwise backward regression
ols_step_backward_p(model)
## 
## 
##                              Stepwise Summary                              
## -------------------------------------------------------------------------
## Step    Variable        AIC        SBC       SBIC        R2       Adj. R2 
## -------------------------------------------------------------------------
##  0      Full Model    736.390    756.280    586.665    0.78184    0.74305 
##  1      alc_mod       734.407    752.308    584.276    0.78177    0.74856 
##  2      gender        732.494    748.406    581.938    0.78142    0.75351 
##  3      age           730.620    744.543    579.638    0.78091    0.75808 
## -------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.884       RMSE                 184.276 
## R-Squared                 0.781       MSE                38202.426 
## Adj. R-Squared            0.758       Coef. Var             27.839 
## Pred R-Squared            0.700       AIC                  730.620 
## MAE                     137.656       SBC                  744.543 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6535804.090         5    1307160.818    34.217    0.0000 
## Residual      1833716.447        48      38202.426                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------
## (Intercept)    -1178.330       208.682                 -5.647    0.000    -1597.914    -758.746 
##         bcs       59.864        23.060        0.241     2.596    0.012       13.498     106.230 
##      pindex        8.924         1.808        0.380     4.935    0.000        5.288      12.559 
## enzyme_test        9.748         1.656        0.521     5.887    0.000        6.419      13.077 
##  liver_test       58.064        40.144        0.156     1.446    0.155      -22.652     138.779 
##   alc_heavy      317.848        71.634        0.314     4.437    0.000      173.818     461.878 
## ------------------------------------------------------------------------------------------------

Criteria

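The criterion for selecting variables may be one of the following: p value, Akaike information criterion (AIC), Schwarz Bayesian criterion (SBC), Sawa's Bayesian information criterion (SBIC), R-squared, or adjusted R-squared.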

Include/exclude variables

We can force variables to be included or excluded from the model at all stages of variable selection. The variables may be specified either by name or position in the model specified.

By name

ols_step_forward_p(model, include = c("age", "alc_mod"))
## 
## 
##                               Stepwise Summary                               
## ---------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2  
## ---------------------------------------------------------------------------
##  0      Base Model     804.340    812.295    645.675    0.04110     0.00350 
##  1      age            803.834    809.801    646.572    0.01420    -0.00476 
##  2      alc_mod        804.340    812.295    645.675    0.04110     0.00350 
##  3      liver_test     772.922    782.867    615.246    0.48357     0.45258 
##  4      enzyme_test    763.665    775.599    606.382    0.58074     0.54652 
##  5      alc_heavy      754.332    768.255    598.224    0.66012     0.62471 
##  6      pindex         739.680    755.592    587.108    0.75031     0.71843 
## ---------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.866       RMSE                 196.724 
## R-Squared                 0.750       MSE                44464.323 
## Adj. R-Squared            0.718       Coef. Var             30.034 
## Pred R-Squared            0.649       AIC                  739.680 
## MAE                     146.418       SBC                  755.592 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6279697.346         6    1046616.224    23.538    0.0000 
## Residual      2089823.191        47      44464.323                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                       
## -----------------------------------------------------------------------------------------------
##       model        Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## -----------------------------------------------------------------------------------------------
## (Intercept)    -814.092       213.222                 -3.818    0.000    -1243.041    -385.144 
##         age       0.458         2.706        0.013     0.169    0.866       -4.985       5.902 
##     alc_mod       1.088        67.941        0.001     0.016    0.987     -135.591     137.768 
##  liver_test     126.675        33.832        0.341     3.744    0.000       58.613     194.737 
## enzyme_test       7.523         1.543        0.402     4.874    0.000        4.418      10.628 
##   alc_heavy     361.751        87.140        0.357     4.151    0.000      186.448     537.053 
##      pindex       7.862         1.908        0.334     4.120    0.000        4.023      11.700 
## -----------------------------------------------------------------------------------------------

By index

ols_step_forward_p(model, include = c(5, 7))
## 
## 
##                               Stepwise Summary                               
## ---------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2  
## ---------------------------------------------------------------------------
##  0      Base Model     804.340    812.295    645.675    0.04110     0.00350 
##  1      age            803.834    809.801    646.572    0.01420    -0.00476 
##  2      alc_mod        804.340    812.295    645.675    0.04110     0.00350 
##  3      liver_test     772.922    782.867    615.246    0.48357     0.45258 
##  4      enzyme_test    763.665    775.599    606.382    0.58074     0.54652 
##  5      alc_heavy      754.332    768.255    598.224    0.66012     0.62471 
##  6      pindex         739.680    755.592    587.108    0.75031     0.71843 
## ---------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.866       RMSE                 196.724 
## R-Squared                 0.750       MSE                44464.323 
## Adj. R-Squared            0.718       Coef. Var             30.034 
## Pred R-Squared            0.649       AIC                  739.680 
## MAE                     146.418       SBC                  755.592 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6279697.346         6    1046616.224    23.538    0.0000 
## Residual      2089823.191        47      44464.323                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                       
## -----------------------------------------------------------------------------------------------
##       model        Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## -----------------------------------------------------------------------------------------------
## (Intercept)    -814.092       213.222                 -3.818    0.000    -1243.041    -385.144 
##         age       0.458         2.706        0.013     0.169    0.866       -4.985       5.902 
##     alc_mod       1.088        67.941        0.001     0.016    0.987     -135.591     137.768 
##  liver_test     126.675        33.832        0.341     3.744    0.000       58.613     194.737 
## enzyme_test       7.523         1.543        0.402     4.874    0.000        4.418      10.628 
##   alc_heavy     361.751        87.140        0.357     4.151    0.000      186.448     537.053 
##      pindex       7.862         1.908        0.334     4.120    0.000        4.023      11.700 
## -----------------------------------------------------------------------------------------------
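
The two calls above give the same result because positions 5 and 7 in the model specification correspond to age and alc_mod. The sketch below shows one way to look up that mapping with base R before passing positions. The last line is an assumption: it supposes that variables are excluded through an `exclude` argument used the same way as `include`, so check the function's help page before relying on it.

```r
library(olsrr)

model <- lm(y ~ ., data = surgical)

# Positions refer to the order of the terms in the model specification
term_labels <- attr(terms(model), "term.labels")
forced <- term_labels[c(5, 7)]  # the terms forced in by include = c(5, 7)
forced

# Assumption: `exclude` mirrors `include` and keeps liver_test out of the model
ols_step_forward_p(model, exclude = c("liver_test"))
```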

Standardized output

All stepwise selection methods display standard output, which includes the stepwise summary, the model summary, the ANOVA table, and the parameter estimates.

# adjusted r-square 
ols_step_forward_adj_r2(model)
## 
## 
##                               Stepwise Summary                              
## --------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2 
## --------------------------------------------------------------------------
##  0      Base Model     802.606    806.584    646.794    0.00000    0.00000 
##  1      liver_test     771.875    777.842    616.009    0.45454    0.44405 
##  2      alc_heavy      761.439    769.395    605.506    0.56674    0.54975 
##  3      enzyme_test    750.509    760.454    595.297    0.65900    0.63854 
##  4      pindex         735.715    747.649    582.943    0.75015    0.72975 
##  5      bcs            730.620    744.543    579.638    0.78091    0.75808 
## --------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.884       RMSE                 184.276 
## R-Squared                 0.781       MSE                38202.426 
## Adj. R-Squared            0.758       Coef. Var             27.839 
## Pred R-Squared            0.700       AIC                  730.620 
## MAE                     137.656       SBC                  744.543 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6535804.090         5    1307160.818    34.217    0.0000 
## Residual      1833716.447        48      38202.426                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------
## (Intercept)    -1178.330       208.682                 -5.647    0.000    -1597.914    -758.746 
##  liver_test       58.064        40.144        0.156     1.446    0.155      -22.652     138.779 
##   alc_heavy      317.848        71.634        0.314     4.437    0.000      173.818     461.878 
## enzyme_test        9.748         1.656        0.521     5.887    0.000        6.419      13.077 
##      pindex        8.924         1.808        0.380     4.935    0.000        5.288      12.559 
##         bcs       59.864        23.060        0.241     2.596    0.012       13.498     106.230 
## ------------------------------------------------------------------------------------------------
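
The Adj. R2 column of the stepwise summary above can be reproduced from the R2 column with the usual formula \(1 - (1 - R^{2})(n - 1)/(n - p - 1)\), where \(n\) is the number of observations and \(p\) the number of predictors in the model. The sketch below copies the R2 values from the summary and takes \(n = 54\) from the ANOVA table (total DF of 53); since the inputs are rounded, the results agree with the printed column only up to rounding.

```r
# R2 after each forward step, copied from the stepwise summary above
r2 <- c(0.45454, 0.56674, 0.65900, 0.75015, 0.78091)

n <- 54              # observations (total DF 53 + 1)
p <- seq_along(r2)   # number of predictors after each step

# Adjusted R-squared penalizes R2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)

# Selection on adjusted R-squared continues only while the value increases
all(diff(adj_r2) > 0)
```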

Visualization

Use the plot() method to visualize variable selection. It displays how the variable selection criterion changes at each step of the selection process, along with the variable selected.

# adjusted r-square 
k <- ols_step_forward_adj_r2(model)
plot(k)

Verbose output

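To view the detailed regression output at each stage of variable selection/elimination, set details to TRUE. At each step, it displays the step number, the variable added or removed, the resulting model, and the value of the selection criterion, along with a table of the remaining candidate variables.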

# adjusted r-square 
ols_step_forward_adj_r2(model, details = TRUE)
## Forward Selection Method 
## ------------------------
## 
## Candidate Terms: 
## 
## 1. bcs 
## 2. pindex 
## 3. enzyme_test 
## 4. liver_test 
## 5. age 
## 6. gender 
## 7. alc_mod 
## 8. alc_heavy 
## 
## 
## Step     => 0 
## Model    => y ~ 1 
## Adj. R2  => 0 
## 
## Initiating stepwise selection... 
## 
##                        Table: Adding New Variables                        
## -------------------------------------------------------------------------
## Predictor      DF      AIC        SBC       SBIC        R2       Adj. R2  
## -------------------------------------------------------------------------
## liver_test      1    771.875    777.842    616.009    0.45454     0.44405 
## enzyme_test     1    782.629    788.596    626.220    0.33435     0.32154 
## pindex          1    794.100    800.067    637.196    0.17680     0.16097 
## alc_heavy       1    794.301    800.268    637.389    0.17373     0.15784 
## bcs             1    797.697    803.664    640.655    0.12010     0.10318 
## alc_mod         1    802.828    808.795    645.601    0.03239     0.01378 
## gender          1    802.956    808.923    645.725    0.03009     0.01143 
## age             1    803.834    809.801    646.572    0.01420    -0.00476 
## -------------------------------------------------------------------------
## 
## Step     => 1 
## Added    => liver_test 
## Model    => y ~ liver_test 
## Adj. R2  => 0.44405 
## 
##                       Table: Adding New Variables                        
## ------------------------------------------------------------------------
## Predictor      DF      AIC        SBC       SBIC        R2       Adj. R2 
## ------------------------------------------------------------------------
## alc_heavy       1    761.439    769.395    605.506    0.56674    0.54975 
## enzyme_test     1    762.077    770.033    606.090    0.56159    0.54440 
## pindex          1    770.387    778.343    613.737    0.48866    0.46861 
## alc_mod         1    771.141    779.097    614.435    0.48147    0.46113 
## gender          1    773.802    781.758    616.901    0.45528    0.43391 
## age             1    773.831    781.787    616.928    0.45498    0.43361 
## bcs             1    773.867    781.823    616.961    0.45462    0.43323 
## ------------------------------------------------------------------------
## 
## Step     => 2 
## Added    => alc_heavy 
## Model    => y ~ liver_test + alc_heavy 
## Adj. R2  => 0.54975 
## 
##                       Table: Adding New Variables                        
## ------------------------------------------------------------------------
## Predictor      DF      AIC        SBC       SBIC        R2       Adj. R2 
## ------------------------------------------------------------------------
## enzyme_test     1    750.509    760.454    595.297    0.65900    0.63854 
## pindex          1    756.125    766.070    600.225    0.62163    0.59892 
## bcs             1    763.063    773.008    606.379    0.56975    0.54394 
## age             1    763.110    773.055    606.421    0.56938    0.54354 
## alc_mod         1    763.428    773.373    606.704    0.56683    0.54084 
## gender          1    763.433    773.378    606.709    0.56679    0.54080 
## ------------------------------------------------------------------------
## 
## Step     => 3 
## Added    => enzyme_test 
## Model    => y ~ liver_test + alc_heavy + enzyme_test 
## Adj. R2  => 0.63854 
## 
##                      Table: Adding New Variables                       
## ----------------------------------------------------------------------
## Predictor    DF      AIC        SBC       SBIC        R2       Adj. R2 
## ----------------------------------------------------------------------
## pindex        1    735.715    747.649    582.943    0.75015    0.72975 
## bcs           1    750.782    762.716    595.377    0.66973    0.64277 
## alc_mod       1    752.403    764.337    596.743    0.65967    0.63189 
## age           1    752.416    764.350    596.755    0.65959    0.63180 
## gender        1    752.509    764.443    596.833    0.65900    0.63116 
## ----------------------------------------------------------------------
## 
## Step     => 4 
## Added    => pindex 
## Model    => y ~ liver_test + alc_heavy + enzyme_test + pindex 
## Adj. R2  => 0.72975 
## 
##                      Table: Adding New Variables                       
## ----------------------------------------------------------------------
## Predictor    DF      AIC        SBC       SBIC        R2       Adj. R2 
## ----------------------------------------------------------------------
## bcs           1    730.620    744.543    579.638    0.78091    0.75808 
## age           1    737.680    751.603    585.012    0.75030    0.72429 
## gender        1    737.712    751.635    585.036    0.75016    0.72413 
## alc_mod       1    737.713    751.636    585.037    0.75015    0.72413 
## ----------------------------------------------------------------------
## 
## Step     => 5 
## Added    => bcs 
## Model    => y ~ liver_test + alc_heavy + enzyme_test + pindex + bcs 
## Adj. R2  => 0.75808 
## 
##                      Table: Adding New Variables                       
## ----------------------------------------------------------------------
## Predictor    DF      AIC        SBC       SBIC        R2       Adj. R2 
## ----------------------------------------------------------------------
## age           1    732.494    748.406    581.938    0.78142    0.75351 
## gender        1    732.551    748.463    581.978    0.78119    0.75325 
## alc_mod       1    732.614    748.526    582.023    0.78093    0.75297 
## ----------------------------------------------------------------------
## 
## 
## No more variables to be added.
## 
## Variables Selected: 
## 
## => liver_test 
## => alc_heavy 
## => enzyme_test 
## => pindex 
## => bcs
## 
## 
##                               Stepwise Summary                              
## --------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2 
## --------------------------------------------------------------------------
##  0      Base Model     802.606    806.584    646.794    0.00000    0.00000 
##  1      liver_test     771.875    777.842    616.009    0.45454    0.44405 
##  2      alc_heavy      761.439    769.395    605.506    0.56674    0.54975 
##  3      enzyme_test    750.509    760.454    595.297    0.65900    0.63854 
##  4      pindex         735.715    747.649    582.943    0.75015    0.72975 
##  5      bcs            730.620    744.543    579.638    0.78091    0.75808 
## --------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.884       RMSE                 184.276 
## R-Squared                 0.781       MSE                38202.426 
## Adj. R-Squared            0.758       Coef. Var             27.839 
## Pred R-Squared            0.700       AIC                  730.620 
## MAE                     137.656       SBC                  744.543 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6535804.090         5    1307160.818    34.217    0.0000 
## Residual      1833716.447        48      38202.426                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------
## (Intercept)    -1178.330       208.682                 -5.647    0.000    -1597.914    -758.746 
##  liver_test       58.064        40.144        0.156     1.446    0.155      -22.652     138.779 
##   alc_heavy      317.848        71.634        0.314     4.437    0.000      173.818     461.878 
## enzyme_test        9.748         1.656        0.521     5.887    0.000        6.419      13.077 
##      pindex        8.924         1.808        0.380     4.935    0.000        5.288      12.559 
##         bcs       59.864        23.060        0.241     2.596    0.012       13.498     106.230 
## ------------------------------------------------------------------------------------------------
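
The figures in the Model Summary can be cross-checked against the ANOVA table. The following is a minimal sketch in base R that recomputes them from the sums of squares printed above. The variable names are illustrative. Dividing the residual sum of squares by n for RMSE is an assumption, made because it reproduces the reported value, whereas MSE uses the residual degrees of freedom.

```r
# Values copied from the ANOVA table of the final model
ss_reg <- 6535804.090          # regression sum of squares
ss_res <- 1833716.447          # residual sum of squares
ss_tot <- 8369520.537          # total sum of squares
df_reg <- 5                    # number of predictors in the final model
df_res <- 48                   # residual degrees of freedom
n      <- df_reg + df_res + 1  # number of observations (54)

r2     <- ss_reg / ss_tot                       # R-Squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res       # Adj. R-Squared
mse    <- ss_res / df_res                       # MSE
f_stat <- (ss_reg / df_reg) / mse               # F statistic
rmse   <- sqrt(ss_res / n)                      # assumption: divisor is n, not df_res

print(round(c(r2 = r2, adj_r2 = adj_r2, mse = mse, f = f_stat, rmse = rmse), 3))
```

The recomputed values should agree with the Model Summary and the last row of the Stepwise Summary up to rounding.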

Progress

To follow the variable selection procedure as it runs, set progress to TRUE. The output lists the candidate terms and then displays the variable selected or eliminated at each step, until no further variable can be added or removed.

# adjusted r-square 
ols_step_forward_adj_r2(model, progress = TRUE)
## Forward Selection Method 
## ------------------------
## 
## Candidate Terms: 
## 
## 1. bcs 
## 2. pindex 
## 3. enzyme_test 
## 4. liver_test 
## 5. age 
## 6. gender 
## 7. alc_mod 
## 8. alc_heavy 
## 
## 
## Variables Entered: 
## 
## => liver_test 
## => alc_heavy 
## => enzyme_test 
## => pindex 
## => bcs 
## 
## No more variables to be added.
## 
## 
##                               Stepwise Summary                              
## --------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2 
## --------------------------------------------------------------------------
##  0      Base Model     802.606    806.584    646.794    0.00000    0.00000 
##  1      liver_test     771.875    777.842    616.009    0.45454    0.44405 
##  2      alc_heavy      761.439    769.395    605.506    0.56674    0.54975 
##  3      enzyme_test    750.509    760.454    595.297    0.65900    0.63854 
##  4      pindex         735.715    747.649    582.943    0.75015    0.72975 
##  5      bcs            730.620    744.543    579.638    0.78091    0.75808 
## --------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.884       RMSE                 184.276 
## R-Squared                 0.781       MSE                38202.426 
## Adj. R-Squared            0.758       Coef. Var             27.839 
## Pred R-Squared            0.700       AIC                  730.620 
## MAE                     137.656       SBC                  744.543 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6535804.090         5    1307160.818    34.217    0.0000 
## Residual      1833716.447        48      38202.426                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------
## (Intercept)    -1178.330       208.682                 -5.647    0.000    -1597.914    -758.746 
##  liver_test       58.064        40.144        0.156     1.446    0.155      -22.652     138.779 
##   alc_heavy      317.848        71.634        0.314     4.437    0.000      173.818     461.878 
## enzyme_test        9.748         1.656        0.521     5.887    0.000        6.419      13.077 
##      pindex        8.924         1.808        0.380     4.935    0.000        5.288      12.559 
##         bcs       59.864        23.060        0.241     2.596    0.012       13.498     106.230 
## ------------------------------------------------------------------------------------------------
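
To make the step-by-step behaviour more concrete, here is a minimal sketch of forward selection on adjusted R-squared with an optional progress trace. It is written in base R and is not the olsrr implementation. The function forward_adj_r2 and its arguments are hypothetical, and it uses the built-in mtcars data so that it runs without the package.

```r
# Hypothetical helper, not part of olsrr: greedy forward selection that
# adds, at each step, the candidate giving the largest adjusted R-squared.
forward_adj_r2 <- function(response, candidates, data, progress = FALSE) {
  selected <- character(0)
  best_adj <- 0  # adjusted R-squared of the intercept-only base model
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # adjusted R-squared of every model that adds one more candidate
    adj <- vapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    }, numeric(1))
    if (max(adj) <= best_adj) break  # no candidate improves the criterion
    winner   <- names(which.max(adj))
    selected <- c(selected, winner)
    best_adj <- max(adj)
    if (progress) message("=> ", winner, " (Adj. R2 = ", round(best_adj, 5), ")")
  }
  if (progress) message("No more variables to be added.")
  selected
}

sel <- forward_adj_r2("mpg", c("disp", "hp", "wt", "qsec"), mtcars, progress = TRUE)
print(sel)
```

As in the output above, the procedure can stop while candidates remain: a variable is left out when adding it would not raise the adjusted R-squared.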

Hierarchical Selection

When using p values as the criterion for selecting/eliminating variables, we can enable hierarchical selection. In this method, the search for the most significant variable is restricted to the next available variable, so variables are considered in the order in which they appear in the model formula. In the example below, as liver_test does not meet the threshold for selection, none of the variables after liver_test are considered for further selection, i.e. the stepwise selection ends as soon as it comes across a variable that does not meet the selection threshold. You can learn more about hierarchical selection here.

# hierarchical selection: variables are tested in formula order, with 0.1 as the p value threshold
m <- lm(y ~ bcs + alc_heavy + pindex + enzyme_test + liver_test + age + gender + alc_mod, data = surgical)
ols_step_forward_p(m, 0.1, hierarchical = TRUE)
## 
## 
##                               Stepwise Summary                              
## --------------------------------------------------------------------------
## Step    Variable         AIC        SBC       SBIC        R2       Adj. R2 
## --------------------------------------------------------------------------
##  0      Base Model     802.606    806.584    646.794    0.00000    0.00000 
##  1      bcs            797.697    803.664    640.655    0.12010    0.10318 
##  2      alc_heavy      791.701    799.657    633.668    0.24119    0.21144 
##  3      pindex         778.574    788.519    620.390    0.42659    0.39218 
##  4      enzyme_test    730.924    742.858    579.087    0.77136    0.75269 
## --------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                         0.878       RMSE                 188.249 
## R-Squared                 0.771       MSE                39053.801 
## Adj. R-Squared            0.753       Coef. Var             28.147 
## Pred R-Squared            0.695       AIC                  730.924 
## MAE                     140.619       SBC                  742.858 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                    Sum of                                              
##                   Squares        DF    Mean Square      F         Sig. 
## -----------------------------------------------------------------------
## Regression    6455884.265         4    1613971.066    41.327    0.0000 
## Residual      1913636.272        49      39053.801                     
## Total         8369520.537        53                                    
## -----------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------
## (Intercept)    -1334.424       180.589                 -7.389    0.000    -1697.332    -971.516 
##         bcs       81.439        17.781        0.329     4.580    0.000       45.706     117.171 
##   alc_heavy      312.777        72.341        0.309     4.324    0.000      167.402     458.152 
##      pindex       10.131         1.622        0.431     6.246    0.000        6.871      13.390 
## enzyme_test       11.243         1.308        0.601     8.596    0.000        8.614      13.871 
## ------------------------------------------------------------------------------------------------
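
The stopping rule described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the olsrr implementation: the function forward_hierarchical and its p_val argument are hypothetical, the p value checked is that of the newly added term in the enlarged model, and the built-in mtcars data is used so the example runs without the package.

```r
# Hypothetical helper, not part of olsrr: hierarchical forward selection.
# Terms are tried strictly in the order given; the search stops at the
# first term whose p value exceeds the threshold.
forward_hierarchical <- function(response, ordered_terms, data, p_val = 0.1) {
  selected <- character(0)
  for (v in ordered_terms) {
    f   <- reformulate(c(selected, v), response = response)
    fit <- summary(lm(f, data = data))
    # p value of the newly added term (assumes a numeric predictor,
    # so the coefficient row is named after the variable)
    p <- fit$coefficients[v, "Pr(>|t|)"]
    if (p > p_val) break  # later terms are never examined
    selected <- c(selected, v)
  }
  selected
}

hier <- forward_hierarchical("mpg", c("wt", "hp", "disp", "qsec"), mtcars)
print(hier)
```

The order of the terms matters here: a variable listed after the first one that fails the threshold is never tested, even if it would have been significant on its own.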