ale function handling of various datatypes for x

Chitu Okoli

January 9, 2024

This vignette demonstrates how ale works for various datatypes of input (x) values. You should first read the introductory vignette that explains general functionality of the package; this vignette is a demonstration of specific functionality.

We begin by loading the necessary libraries.

library(ale)
library(dplyr)

var_cars: modified mtcars dataset (Motor Trend Car Road Tests)

For this demonstration, we use a modified version of the built-in mtcars dataset so that it has binary (logical), multinomial (factor, that is, non-ordered categories), ordinal (ordered factor), discrete interval (integer), and continuous interval (numeric or double) values. This modified version, called var_cars, will let us test all the different basic variations of x variables. For the factor, it adds the country of the car manufacturer.
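
As a rough illustration of the kinds of conversions involved, the datatype changes could look something like the sketch below. This is illustrative only; it is not the code that builds var_cars, and it omits the country column, whose values are assigned per manufacturer.

# Sketch of the datatype conversions behind a var_cars-style dataset
# (illustrative only; does not reproduce the actual var_cars object)
cars_sketch <- mtcars |>
  tibble::as_tibble() |>
  dplyr::mutate(
    cyl  = as.integer(cyl),                   # discrete interval
    vs   = as.logical(vs),                    # binary
    am   = as.logical(am),                    # binary
    gear = factor(gear, levels = c(3, 4, 5),
                  labels = c("three", "four", "five"),
                  ordered = TRUE),            # ordinal
    carb = as.integer(carb)                   # discrete interval
  )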

The data is a tibble with 32 observations on 12 variables:

Variable Format Description
mpg double Miles/(US) gallon
cyl integer Number of cylinders
disp double Displacement (cu.in.)
hp double Gross horsepower
drat double Rear axle ratio
wt double Weight (1000 lbs)
qsec double 1/4 mile time
vs logical Engine (0 = V-shaped, 1 = straight)
am logical Transmission (0 = automatic, 1 = manual)
gear ordered Number of forward gears
carb integer Number of carburetors
country factor Country of car manufacturer
print(var_cars)
#> # A tibble: 32 × 12
#>      mpg   cyl  disp    hp  drat    wt  qsec vs    am    gear   carb country
#>    <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <ord> <int> <fct>  
#>  1  21       6  160    110  3.9   2.62  16.5 FALSE TRUE  four      4 Japan  
#>  2  21       6  160    110  3.9   2.88  17.0 FALSE TRUE  four      4 Japan  
#>  3  22.8     4  108     93  3.85  2.32  18.6 TRUE  TRUE  four      1 Japan  
#>  4  21.4     6  258    110  3.08  3.22  19.4 TRUE  FALSE three     1 USA    
#>  5  18.7     8  360    175  3.15  3.44  17.0 FALSE FALSE three     2 USA    
#>  6  18.1     6  225    105  2.76  3.46  20.2 TRUE  FALSE three     1 USA    
#>  7  14.3     8  360    245  3.21  3.57  15.8 FALSE FALSE three     4 USA    
#>  8  24.4     4  147.    62  3.69  3.19  20   TRUE  FALSE four      2 Germany
#>  9  22.8     4  141.    95  3.92  3.15  22.9 TRUE  FALSE four      2 Germany
#> 10  19.2     6  168.   123  3.92  3.44  18.3 TRUE  FALSE four      4 Germany
#> # ℹ 22 more rows
summary(var_cars)
#>       mpg             cyl             disp             hp       
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
#>  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
#>  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
#>  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
#>  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
#>  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
#>       drat             wt             qsec           vs         
#>  Min.   :2.760   Min.   :1.513   Min.   :14.50   Mode :logical  
#>  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   FALSE:18       
#>  Median :3.695   Median :3.325   Median :17.71   TRUE :14       
#>  Mean   :3.597   Mean   :3.217   Mean   :17.85                  
#>  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90                  
#>  Max.   :4.930   Max.   :5.424   Max.   :22.90                  
#>      am             gear         carb          country  
#>  Mode :logical   three:15   Min.   :1.000   Germany: 8  
#>  FALSE:19        four :12   1st Qu.:2.000   Italy  : 4  
#>  TRUE :13        five : 5   Median :2.000   Japan  : 6  
#>                             Mean   :2.812   Sweden : 1  
#>                             3rd Qu.:4.000   UK     : 1  
#>                             Max.   :8.000   USA    :12

Modelling with ALE and GAM

With GAM, only numeric variables can be smoothed, not binary or categorical ones. However, smoothing does not always improve a model: some variables are not related to the outcome, and some that are related have a simple linear relationship that does not need smoothing. To keep this demonstration simple, we did some preliminary analysis (not shown here) to determine where smoothing is worthwhile on the modified var_cars dataset, so in the model below only qsec is smoothed. Our goal here is not to demonstrate the best modelling procedure but rather to demonstrate the flexibility of the ale package.
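
Though we do not reproduce that preliminary analysis, one plausible way to screen for worthwhile smooth terms is sketched here: smooth the candidate numeric predictors and inspect their estimated degrees of freedom (edf), on the assumption that edf near 1 indicates an essentially linear term. This is not necessarily the analysis that was used to choose the terms in the model that follows.

# Hedged sketch of one way to screen for worthwhile smooth terms;
# k is kept small because the dataset has only 32 rows
gam_screen <- mgcv::gam(
  mpg ~ cyl + s(disp, k = 5) + s(hp, k = 5) + s(drat, k = 5) +
    s(wt, k = 5) + s(qsec, k = 5),
  data = var_cars
)
summary(gam_screen)  # terms with edf close to 1 are effectively linear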

cm <- mgcv::gam(mpg ~ cyl + disp + hp + drat + wt + s(qsec) +
                  vs + am + gear + carb + country,
                data = var_cars)
summary(cm)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> mpg ~ cyl + disp + hp + drat + wt + s(qsec) + vs + am + gear + 
#>     carb + country
#> 
#> Parametric coefficients:
#>               Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   -7.84775   12.47080  -0.629  0.54628   
#> cyl            1.66078    1.09449   1.517  0.16671   
#> disp           0.06627    0.01861   3.561  0.00710 **
#> hp            -0.01241    0.02502  -0.496  0.63305   
#> drat           4.54975    1.48971   3.054  0.01526 * 
#> wt            -5.03736    1.53979  -3.271  0.01095 * 
#> vsTRUE        12.45630    3.62342   3.438  0.00852 **
#> amTRUE         8.77813    2.67611   3.280  0.01080 * 
#> gear.L         0.53111    3.03337   0.175  0.86525   
#> gear.Q         0.57129    1.18201   0.483  0.64150   
#> carb          -0.34479    0.78600  -0.439  0.67223   
#> countryItaly  -0.08633    2.22316  -0.039  0.96995   
#> countryJapan  -3.31948    2.22723  -1.490  0.17353   
#> countrySweden -3.83437    2.74934  -1.395  0.19973   
#> countryUK     -7.24222    3.81985  -1.896  0.09365 . 
#> countryUSA    -7.69317    2.37998  -3.232  0.01162 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>           edf Ref.df     F p-value  
#> s(qsec) 7.797  8.641 5.975  0.0101 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.955   Deviance explained = 98.8%
#> GCV = 6.4263  Scale est. = 1.6474    n = 32

Before starting, we recommend that you enable progress bars to see how long procedures will take. Simply run the following code at the beginning of your R session:

# Run this in an R console; it will not work directly within an R Markdown or Quarto block
progressr::handlers(global = TRUE)
progressr::handlers('cli')

If you forget to do that, the {ale} package will do it automatically for you with a notification message.

Now we generate ALE data from the var_cars GAM model and plot it.

cars_ale <- ale(
  var_cars, cm,
  parallel = 2  # CRAN limit (delete this line on your own computer)
)

# Print all plots
cars_ale$plots |> 
  patchwork::wrap_plots(ncol = 2)

We can see that ale has no trouble modelling any of the datatypes in our sample (logical, factor, ordered, integer, or double). It plots line charts for the numeric predictors and column charts for everything else.

The numeric predictors have rug plots that indicate the ranges of x (predictor) and y (mpg) values where data actually exists in the dataset. This helps us avoid over-interpreting regions where data is sparse. Since column charts are on a discrete scale, they cannot have rug plots; instead, the percentage of data represented by each column is displayed.
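
To look at one variable more closely, an individual plot can be pulled out of the plots list by name (this assumes the list is named by predictor, consistent with its use with patchwork::wrap_plots() above):

# Display a single ALE plot by variable name (assumes the plots list is
# named by predictor, as implied by the wrap_plots() call above)
cars_ale$plots$country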

We can also generate and plot the ALE data for all two-way interactions.

cars_ale_ixn <- ale_ixn(
  var_cars, cm,
  parallel = 2  # CRAN limit (delete this line on your own computer)
)

# Print plots
cars_ale_ixn$plots |>
  # extract list of x1 ALE outputs
  purrr::walk(\(.x1) {  
    # plot all x2 plots in each .x1 element
    patchwork::wrap_plots(.x1, ncol = 2) |> 
      print()
  })

There are no interactions in this dataset. (To see what ALE interaction plots look like in the presence of interactions, see the {ALEPlot} comparison vignette, which explains the interaction plots in more detail.)
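
Although there is nothing noteworthy to zoom in on here, a single interaction plot can presumably be extracted from the nested list in the same way; this assumes the list is keyed first by x1 and then by x2, as the walk() loop above implies:

# A single two-way interaction plot (assumes a plots[[x1]][[x2]] structure,
# an assumption based on the iteration pattern shown above)
cars_ale_ixn$plots$wt$qsec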

Finally, as explained in the vignette on modelling with small datasets, a more appropriate modelling workflow would require bootstrapping the entire model, not just the ALE data. So, let’s do that now.

mb <- model_bootstrap(
  var_cars, 
  cm,
  boot_it = 10,  # 100 by default but reduced here for a faster demonstration
  parallel = 2,  # CRAN limit (delete this line on your own computer)
  seed = 2  # workaround to avoid random error on such a small dataset
)

mb$ale$plots |> 
  patchwork::wrap_plots(ncol = 2)

(By default, model_bootstrap() creates 100 bootstrap samples but, so that this illustration runs faster, we demonstrate it here with only 10 iterations.)

With such a small dataset, the bootstrap confidence intervals always overlap with the middle band, indicating that this dataset cannot support the claim that any of its variables has a meaningful effect on fuel efficiency (mpg). Considering that the average bootstrapped ALE values suggest various intriguing patterns, the problem is no doubt that the dataset is too small; if more data were collected and analyzed, some of the patterns would probably be confirmed.
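
The numbers behind these bootstrapped plots can also be examined directly. Assuming the bootstrapped ALE data is stored alongside the plots in the returned object (the data element name here is an assumption, mirroring the plots element used above), a single variable's ALE values could be inspected like this:

# Inspect the bootstrapped ALE data for one variable (assumes an ale$data
# element parallel to ale$plots; this element name is not shown above)
mb$ale$data$wt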