David Rügamer 2021-04-13

This vignette describes the generic use of the `selfmade`

software to produce valid post-selection inference, or more specifically *selective inference*, for linear mixed and additive models after any type of variable selection mechanism, which can be repeated in a bootstrap-like manner.

Rügamer, D., & Greven, S. (2018). Selective inference after likelihood-or test-based model selection in linear models. Statistics & Probability Letters, 140, 7-12.

Rügamer, D., & Greven, S. (2020). Inference for (L_2)-Boosting. Statistics and Computing, 30(2), 279-289.

- The framework assumes that covariates in all models are fixed.
- It must be possible to fit the final model with the
`gamm4`

function of the eponymous R package. Support for models from`mgcv`

will be added in the future. - It must be possible to define a
**deterministic**function of a vector (y \in \mathbb{R}^n) (referred to as`selection_function`

in the following) determining the selection result for which the practioner seeks valid inference statements. In other words, the user has to define a function similar to the function`selection_function`

defined below, which is deterministic in the sense that for the same input`y`

the output should also be exactly the same. - It must be possible to define a function, which checks the congruency of the result of the
`selection_function`

and the original selection given when performing model selection on the original data (y). This is usually trivial and just a wrapper for the`selection_function`

.

```
selection_function <- function(y)
{
# based on any input y of the same dimension as the original response
# a model is selected and mapped to an integer value
# ....
best_model_index <- get_best_model(list_of_models)
return(best_model_index)
}
```

Note that the `selection_function`

should return the original result when called with the original data vector (y).

- Run the experiment with the
`original_response`

- Save the model selection result (e.g., as integer indicating the selected model) as well as the final model
`final_model`

(after refitting the model with`gamm4`

if a different package has been used for model selection) - Define the model selection function (
`selection_function`

) - Define the wrapper (
`check_congruency`

) function returning a logical value whether the result of any model call of`selection_function`

is equivalent to the original model selection result - Run the
`mocasin`

-function providing selective inference as follows:

```
res <- mocasin(mod = final_model,
checkFun = check_congruency,
this_y = original_response
# further options
)
```

- Example 1 demonstrates the package’s ability to reproduce classical inference if no model selection was done.
- Example 2 demonstrates the use of the package for model selection with only
`gamm4`

models - Example 3 demonstrates the package’s ability to calculate valid inference regardless of the type of model selection and packages involved (as long as the prerequisites are met)

```
library(selfmade)
library(gamm4)
set.seed(0)
dat <- gamSim(1,n=500,scale=2) ## simulate 4 term additive truth
dat$y <- 3 + dat$x0^2 + rnorm(n=500)
br <- gamm4(y~ s(x0) + s(x1), data = dat)
summary(br$gam) ## summary of gam
# do not use any selection
# - hence it's not necessary to define selection_function
# and the checl_congruency always returns TRUE
checkFun <- function(yb) TRUE
# calculate selective inference, which, in this case,
# except for an approximation error, should be equivalent
# to the unconditional inference
res <- mocasin(br, this_y = dat$y,
checkFun = checkFun,
nrlocs = c(0.7,1),
nrSamples = 1000, trace = FALSE)
# we get very similar results using
do.call("rbind", res$selinf)
```

```
library(selfmade)
library(lme4)
library(lmerTest)
# use the ham data and use scaled information liking
# as response
ham$Informed.liking <- scale(ham$Informed.liking)
# We first define a function to fit a model based on response
# This function is usually not required but can be
# specified in order to define the selection_function
# as a function of the model instead of as a function
# of the response vector
modFun <- function(y)
{
ham$y <- y
lmer(y ~ Gender + Information * Product + (1 | Consumer) +
(1 | Product), data=ham)
}
# define the selection_function:
# here this is done as function of a model
# which, in combination with modFun, can then
# be used as function
selFun <- function(mod) step(mod, reduce.fixed = FALSE)
# define a function which extracts the results
# of the selection procedure
extractSelFun <- function(this_mod){
this_mod <- attr(this_mod, "model")
if(class(this_mod)=="lm")
return(attr(this_mod$coefficients, "names")) else
return(c(names(fixef(this_mod)),
names(getME(this_mod, "theta"))))
}
# Now we run the initial model selection on the
# orginal data, which is a
# backward elimination of non-significant effects:
(step_result <- selFun(modFun(ham$Informed.liking)))
attr(step_result, "model")
# Elimination tables for random- and fixed-effect terms:
(sel <- extractSelFun(step_result))
# Now we can define the function checking the congruency
# with the original selection
checkFun <- function(yb){
this_mod <- modFun(yb)
setequal( extractSelFun(selFun(this_mod)), sel )
}
# and compute valid p-values conditional on the selection
# (this takes some time and will produce a lot of warnings)
res <- mocasin(attr(step_result, "model"), this_y = ham$Informed.liking,
checkFun = checkFun, which = 1:4, nrSamples = 50, trace = FALSE)
print(res)
```

```
# Run an AIC comparison between different additive models
# fitted with mgcv::gam
library(selfmade)
library(mgcv)
library(gamm4)
# create data and models
set.seed(2)
# use enough noise to get a diverse model selection
dat <- gamSim(1,n=400,dist="normal",scale=10)
b0123 <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
b123 <- gam(y~s(x1)+s(x2)+s(x3),data=dat)
b013 <- gam(y~s(x0)+s(x1)+s(x3),data=dat)
# seems that the second model seems to be the most
# 'appropriate' one
which.min(AIC(b0123, b123, b013)$AIC)
# and refit the model with gamm4
b123_gamm4 <- gamm4(y~s(x0)+s(x1)+s(x3),data=dat)
# define selection_function
selection_function <- function(y)
{
dat$y <- y
list_of_models <- list(
gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat),
gam(y~s(x1)+s(x2)+s(x3),data=dat),
gam(y~s(x0)+s(x1)+s(x3),data=dat)
)
# return an integer value which model is best
return(
which.min(sapply(list_of_models, AIC))
)
}
# define the congruency function
checkFun <- function(y) selection_function(y)==2
# compute inference
res <- mocasin(mod = b123_gamm4,
checkFun = checkFun,
nrlocs = 3, # test one position of the spline
nrSamples = 10)
print(res)
```