The R package penalizedclr provides an implementation of the penalized conditional logistic regression model that can be used in the analysis of matched case-control studies. The implementation allows different penalties to be applied to different blocks of covariates, and is therefore particularly useful in the presence of multi-source heterogeneous data, such as multiple layers of omics measurements. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection to allow for variable selection in the considered regression model.
You can install the released version of penalizedclr from CRAN with:
install.packages("penalizedclr")
And the development version from GitHub with:
library(devtools)
install_github("veradjordjilovic/penalizedclr")
Load the package with:
library(penalizedclr)
Assume that we have \(K\) independent matched case-control pairs \((Y_{ki}, X_{ki})\), where \(Y_{ki}\), \(k=1,\ldots,K;\) \(i=1,2,\) is a binary variable indicating case control status (1 if case, 0 if control) and \(X_{ki}\) is the associated \(p\)-dimensional vector of covariates. The conditional logistic regression models the probability of being a case given that the observation belongs to the \(k\)-th pair as: \[ {\mathrm {logit}}\left[P(Y_{ki}=1 \mid S=k)\right] = \beta_{0k} + \beta^{T}X_{ki}, \quad k\in\left\{1,\ldots, K\right\}, i\in\left\{1,2\right\} \] where \(S\) is the matched pair id, \(\beta_{0k}\) is the pair specific intercept and \(\beta=(\beta_1, \ldots,\beta_p)^T\) is a \(p\)-dimensional vector of fixed effects.
In the present package we:
estimate \(\beta\) in the high-dimensional setting in which the number of covariates \(p\) is much higher than the number of pairs \(K\). We consider a penalized conditional logistic regression, which adds a penalty to the conditional log-likelihood. Motivated by current medical applications considering clinical and molecular data, we allow \(X_{ki}\) to combine heterogeneous data sources;
perform stability selection to identify important variables, that is, variables for which \(\beta_j\neq 0\), \(j\in\left\{1,\ldots,p\right\}\).
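To make the first point concrete, here is a sketch of the criterion for 1:1 matching. Conditioning on each pair containing exactly one case removes the pair-specific intercepts \(\beta_{0k}\). Assume, purely as a labeling convention for this illustration, that \(i=1\) is the case and \(i=2\) the control in every pair. The conditional log-likelihood is then
\[ \ell_c(\beta) = \sum_{k=1}^{K} \log \frac{\exp(\beta^{T}X_{k1})}{\exp(\beta^{T}X_{k1}) + \exp(\beta^{T}X_{k2})} = -\sum_{k=1}^{K} \log\left[1 + \exp\left\{-\beta^{T}(X_{k1} - X_{k2})\right\}\right]. \]
If the covariates are split into \(M\) blocks with coefficient sub-vectors \(\beta^{(1)},\ldots,\beta^{(M)}\), an L1-penalized estimate with block-specific penalties \(\lambda_1,\ldots,\lambda_M\) would maximize
\[ \ell_c(\beta) - \sum_{m=1}^{M} \lambda_m \lVert \beta^{(m)} \rVert_1 . \]
The symbols \(M\), \(\beta^{(m)}\) and \(\lambda_m\) are notation introduced here for illustration; the scaling of the penalty used internally by the package may differ.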
In this section we provide examples of how to fit a penalized conditional logistic regression model with source-specific penalty parameters and how to perform variable selection with penalizedclr.
Initial settings and libraries to be loaded:
set.seed(123)
require(tidyverse)
We generate a simple multi-source data set, with two groups of covariates containing 80 and 20 variables, respectively. As specified above, each case is matched to a control, and the probability of being a case in each stratum (case-control pair) is obtained from the linear predictor. An intercept is generated independently for each stratum.
# two groups of predictors
p <- c(80, 20)
# percentage and number of non-null variables
p_nz <- c(0.2, 0.8)
m_nz <- round(p*p_nz, 0)
# number of different strata (case-control pairs)
K <- 125
# number of cases and controls in each stratum (not necessarily 1:1 matching, other designs are also allowed)
n_cases <- 1
n_ctrl <- 1
# generating covariates
X <- cbind(matrix(rnorm(p[1] * K * (n_cases + n_ctrl), 0, 1), ncol = p[1]),
           matrix(rnorm(p[2] * K * (n_cases + n_ctrl), 0, 2), ncol = p[2]))
# coefficients
beta <- as.matrix(c(rnorm(m_nz[1], 0, 0.8),
                    rep(0, p[1] - m_nz[1]),
                    rnorm(m_nz[2], 0.1, 0.4),
                    rep(0, p[2] - m_nz[2])), ncol = 1)
beta_stratum <- rep(rnorm(K, 0, 2), each = n_cases + n_ctrl)
# stratum membership
stratum <- rep(1:K, each = n_cases + n_ctrl)
# linear predictor
lin_pred <- beta_stratum + X %*% beta
prob_case <- exp(lin_pred) / (1 + exp(lin_pred))
# generate the response
Y <- rep(0, length(stratum))
data_sim <- as_tibble(data.frame(stratum = stratum,
                                 probability = prob_case,
                                 obs_id = 1 : length(stratum)))
data_sim_cases <- data_sim %>%
  group_by(stratum) %>%
  sample_n(n_cases, weight = probability)
Y[data_sim_cases$obs_id] <- 1
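The last step above draws, within each stratum, the observations that become cases, with sampling weights given by the case probabilities. As a rough illustration of the same idea without tidyverse, here is a minimal base R sketch on a small toy data set. All names and sizes are illustrative, and `prob_case` is only a uniform stand-in for the probabilities from the linear predictor. It also checks that every stratum ends up with exactly `n_cases` cases.

```r
# Self-contained toy example (base R only); sizes and names are illustrative
set.seed(1)
K <- 5                                   # number of strata
n_cases <- 1
n_ctrl <- 1
stratum <- rep(1:K, each = n_cases + n_ctrl)
prob_case <- runif(length(stratum))      # stand-in for the case probabilities

Y <- rep(0, length(stratum))
for (k in unique(stratum)) {
  idx <- which(stratum == k)
  # weighted draw of the case position(s) within stratum k
  pick <- sample.int(length(idx), size = n_cases, prob = prob_case[idx])
  Y[idx[pick]] <- 1
}

# sanity check: number of cases in each stratum
cases_per_stratum <- tapply(Y, stratum, sum)
cases_per_stratum
```

A check like `cases_per_stratum` is a cheap way to confirm that the simulated design really is matched before fitting any model.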
The function penalized.clr fits a penalized conditional logistic regression model with different penalties for different blocks of covariates. The L1 penalty parameter lambda (a numeric vector of length equal to the number of blocks) can be specified by the user or computed internally. In the latter case, cross-validation is performed internally for each data layer separately. It is also possible to apply an elastic net penalty through the parameter alpha.
This code illustrates how to fit penalized.clr with penalty parameters specified by the user. Here Y is the response vector, X is the multi-source matrix of covariates, stratum is the vector of ids of the matching pairs, and p is the vector of block dimensions. The vector p has the same length as the vector of penalty parameters lambda, so that one penalty parameter is specified for each data source, i.e., for each block of covariates. It is possible to standardize the variables by setting standardize = TRUE (FALSE by default).
fit1 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
                      lambda = c(6,7), p = p, standardize = TRUE)
The function returns a list with elements:
penalized - Regression coefficients for the penalized covariates.
unpenalized - Regression coefficients for the unpenalized covariates.
converged - Whether the fitting process was judged to be converged.
lambda - The tuning parameter for L1 used.
alpha - The elastic net mixing parameter used.
For instance, fit1$penalized is a numeric vector containing the regression coefficients for the penalized covariates. We can compare the estimated coefficients with the true coefficients (the ones that generated our data) by constructing a 2 x 2 contingency table cross-tabulating true non-zero and estimated non-zero coefficients:
nonzero_index <- (beta != 0) * 1 # index of nonzero coefficients
table(fitted = (fit1$penalized != 0) * 1, nonzero_index)
#>       nonzero_index
#> fitted  0  1
#>      0 51 17
#>      1 17 15
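To summarize such a table, one can compute standard selection metrics from its four cells. The sketch below re-enters the counts printed above by hand (rows are fitted status, columns are true status) so that it runs on its own; the object names are illustrative.

```r
# Counts copied from the printed table; matrix() fills column by column
tab <- matrix(c(51, 17, 17, 15), nrow = 2,
              dimnames = list(fitted = c("0", "1"),
                              nonzero_index = c("0", "1")))

TP <- tab["1", "1"]   # truly non-zero and selected
FP <- tab["1", "0"]   # truly zero but selected
FN <- tab["0", "1"]   # truly non-zero but not selected
TN <- tab["0", "0"]   # truly zero and not selected

sensitivity <- TP / (TP + FN)   # share of true signals recovered
specificity <- TN / (TN + FP)   # share of null variables correctly excluded
precision   <- TP / (TP + FP)   # share of selected variables that are true signals

round(c(sensitivity = sensitivity,
        specificity = specificity,
        precision = precision), 3)
```

With these counts, sensitivity and precision both equal 15/32 (about 0.47) and specificity equals 51/68 (0.75).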
When the user does not specify the penalty parameters, they are computed internally for each data source separately based on cross-validation.
fit2 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
                      p = p,
                      standardize = TRUE)
The selected penalty coefficients are:
fit2$lambda
#> [1] 4.958049 5.641397
The package uses fast cross-validation implemented in the R package clogitL1 (see the function find.default.lambda described below). We recommend manually inspecting the obtained lambda parameters. It is also recommended to investigate different ratios of data-source-specific penalties (see, for instance, Boulesteix et al. 2017 and examples of penalty grids therein). Note that cross-validation depends on random data splits, so different runs can return different values of lambda.
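One simple way to explore different penalty ratios is to fix a reference penalty for the first block and scale the second block's penalty by a set of factors. The sketch below only builds such a grid in base R; the reference value `lambda_ref` and the factors in `ratios` are arbitrary illustrative choices, not recommendations from the package.

```r
# Illustrative penalty grid for two blocks (values are assumptions)
lambda_ref <- 5                 # reference penalty for block 1
ratios <- c(0.5, 1, 2, 4)       # penalty of block 2 relative to block 1

lambda_grid <- cbind(block1 = lambda_ref,
                     block2 = lambda_ref * ratios)
lambda_grid

# Each row is a candidate value for the `lambda` argument, e.g.
# penalized.clr(response = Y, penalized = X, stratum = stratum,
#               lambda = lambda_grid[i, ], p = p, standardize = TRUE)
```

How the resulting fits are compared (for example by cross-validated performance or by the sparsity of the solution) is left to the user.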
This code shows what happens when no information is given about the blocks of predictors (p). In this case, all covariates are considered to form a single block and are penalized equally.
fit3 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
                      standardize = TRUE)
The selected penalty coefficient is:
fit3$lambda
#> [1] 4.62712
Sometimes, a subset of covariates should be excluded from penalization. This can be achieved by specifying the unpenalized argument. In what follows, we penalize the first block of covariates and leave the remaining block unpenalized.
X1 <- X[, 1:p[1]]
X2 <- X[, (p[1]+1):(p[1]+p[2])]
fit4 <- penalized.clr(response = Y, penalized = X1, unpenalized = X2,
                      stratum = stratum, p = p[1],
                      standardize = TRUE)
This can be particularly useful when performing variable selection on omics variables (penalized) while adjusting for clinical covariates that should not be penalized.
The package penalizedclr is not limited to the L1, i.e. lasso, penalty. While the L1 penalty is better suited for variable selection, in the presence of highly correlated covariates it can be useful to add a small amount of L2 penalty. The two can be combined in an elastic net framework by specifying the mixing parameter alpha, which can take values between 0 and 1. The default alpha = 1 gives the lasso.
fit2 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
                      p = p,
                      standardize = TRUE, alpha = 0.6)
The function stable.clr performs stability selection (Meinshausen and Bühlmann 2010 and Shah and Samworth 2013) for variable selection in penalized conditional logistic regression. For details on stability selection, we refer to the original publications. Briefly, a set of L1 penalties is considered, and for each considered value of the penalty, \(2B\) random subsamples of \(\lfloor K/2 \rfloor\) matched pairs are taken and a penalized model is estimated. For each covariate and penalty value, selection probability is estimated as the proportion of estimated models in which the associated coefficient estimate is different from zero. Finally, the overall selection probability of a variable is obtained by taking the maximum selection probability over all considered penalty values.
Parameter \(B\) is set to 100 by default, but can be changed by the user. Note that this choice will have an impact on the computation time, and higher values of \(B\) will lead to a slower computation.
The function returns a list containing the selection probabilities for each covariate, i.e. the proportion of estimated models in which the associated coefficient estimate is different from zero. The user can then set a threshold for selection probability (values ranging from 0.55 to 0.9 are recommended) and obtain the set of selected covariates.
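The subsampling scheme described above can be sketched in a few lines of base R. This is not the package's implementation: the selector `select_toy` is a deliberately simple stand-in (a thresholded t-statistic on within-pair differences) used in place of the L1-penalized conditional logistic fit, and all names, sizes and penalty values are illustrative assumptions. What the sketch shows is the structure only: 2B subsamples of half the pairs per penalty value, a selection proportion per variable, and the maximum over penalty values.

```r
set.seed(42)
K <- 60        # number of matched pairs (toy value)
p_toy <- 10    # number of covariates (toy value)
B <- 25        # 2 * B subsamples are drawn per penalty value

# Within-pair differences (case minus control); only variable 1 carries signal
D <- matrix(rnorm(K * p_toy), nrow = K)
D[, 1] <- D[, 1] + 1.5

# Stand-in selector (NOT the package's L1 fit): keep variables whose
# absolute one-sample t-statistic of the differences exceeds `lambda`
select_toy <- function(D_sub, lambda) {
  t_stat <- colMeans(D_sub) / (apply(D_sub, 2, sd) / sqrt(nrow(D_sub)))
  abs(t_stat) > lambda
}

lambda_seq <- c(2, 3)
sel_prob <- sapply(lambda_seq, function(lam) {
  hits <- replicate(2 * B, {
    sub <- sample.int(K, floor(K / 2))       # subsample half of the pairs
    select_toy(D[sub, , drop = FALSE], lam)
  })
  rowMeans(hits)                             # selection proportion per variable
})

P_max <- apply(sel_prob, 1, max)             # maximum over penalty values
selected <- which(P_max > 0.6)               # threshold on selection probability
selected
```

Replacing `select_toy` with a call that fits the penalized model on the subsampled pairs and returns which coefficients are non-zero gives the general shape of the procedure.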
The following code performs stability selection when all covariates are considered as a single block (a single data source) and the sequence of L1 penalties contains only two values.
stable1 <- stable.clr(response = Y,
                      penalized = X, stratum = stratum,
                      lambda.seq = c(10,20))
The function returns a list with two elements:
Pilambda - A numeric vector giving estimates of selection probabilities for each penalized covariate.
lambda.seq - The sequence of L1 penalties used.
To inspect the results, we can, for instance, print the covariates with selection probability higher than 0.6:
which(stable1$P>0.6)
#> [1] 82 86 87 90
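Because the data are simulated, the selected indices can be mapped back to their blocks and compared with the positions of the truly non-zero coefficients. The sketch below re-enters the printed indices and the simulation settings by hand so that it is self-contained; the object names `selected`, `block` and `true_nz` are illustrative.

```r
p <- c(80, 20)                     # block sizes used in the simulation
selected <- c(82, 86, 87, 90)      # indices printed above

# block membership: columns 1-80 form block 1, columns 81-100 form block 2
block <- findInterval(selected, c(1, cumsum(p) + 1))
table(block)

# positions of the truly non-zero coefficients in the simulated beta:
# the first m_nz[1] entries of block 1 and the first m_nz[2] entries of block 2
m_nz <- round(p * c(0.2, 0.8))
true_nz <- c(seq_len(m_nz[1]), p[1] + seq_len(m_nz[2]))

all(selected %in% true_nz)
```

Here all four selected covariates belong to the second block and correspond to truly non-zero coefficients.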
It is possible to obtain the sequence of L1 penalty parameters via the function find.default.lambda. It relies on the cv.clogitL1 function of the clogitL1 package to perform cross-validation to determine a suitable lambda sequence. Note that it considers each data source separately and returns a sequence of four values per source (the optimal penalty along with three additional values, see package documentation for details). The number of folds is set to 10 by default but can also be specified by the user.
lambda.seq <- find.default.lambda(response = Y,
                                  stratum = stratum,
                                  penalized = X,
                                  alpha = 1,
                                  nfolds = 10)
lambda.seq
#> [1]  5.300950  6.988013  9.211995 12.143774
stable2 <- stable.clr(response = Y,
                      penalized = X, stratum = stratum,
                      lambda.seq = lambda.seq)
Covariates with selection probability higher than 0.6:
which(stable2$P>0.6)
#> [1] 1 14 58 82 86 87 90 93
If we do not specify the penalty parameters as we did in the previous example, stable.clr computes them by calling find.default.lambda with default options. We only need to run:
stable3 <- stable.clr(response = Y,
                      penalized = X,
                      stratum = stratum)
The selected sequence of lambda is:
stable3$lambda.seq
#> [1]  6.521593  8.023312  9.870830 12.143774
Covariates with selection probability higher than 0.6:
which(stable3$P>0.6)
#> [1] 1 82 86 87 90 93
This code implements stability selection while taking into account the block structure of covariates (\(p\)). To achieve this, we use the function stable.clr.g, which extends stable.clr to allow for blocks (groups) of covariates. In this case, the function takes as an argument lambda.list, a list of sequences of L1 penalties, one for each block. Note that the computation time depends on the length of the block-specific sequences: for each combination of penalties, \(2B\) subsamples are taken and used to estimate a penalized conditional logistic regression model.
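To get a feeling for the computational cost, the number of model fits can be estimated before running the procedure. The sketch below assumes that every combination from the Cartesian product of the block-specific sequences is evaluated and that \(B = 100\) is also used by stable.clr.g; both are assumptions made for this illustration. The penalty values are those of the two four-value sequences shown in this section's example output, re-entered by hand.

```r
# Block-specific penalty sequences (values re-entered from the example output)
lambda.list <- list(c(10.67917, 13.13825, 16.16358, 19.88555),
                    c(9.211995, 12.143774, 16.008612, 21.103461))
B <- 100                                   # assumed number of subsample pairs

penalty_grid <- expand.grid(lambda.list)   # one row per penalty combination
n_comb <- nrow(penalty_grid)               # 4 * 4 combinations
n_fits <- n_comb * 2 * B                   # 2B fits per combination

c(combinations = n_comb, model_fits = n_fits)
```

Under these assumptions, two sequences of four values lead to 16 combinations and 3200 model fits, so shortening the sequences or lowering \(B\) reduces the run time proportionally.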
lambda.list <- find.default.lambda(response = Y,
                                   penalized = X, stratum = stratum,
                                   p = p)
lambda.list
#> [[1]]
#> [1] 10.67917 13.13825 16.16358 19.88555
#>
#> [[2]]
#> [1]  9.211995 12.143774 16.008612 21.103461
stable.g1 <- stable.clr.g(response = Y,
                          penalized = X,
                          stratum = stratum,
                          p = p,
                          lambda.list = lambda.list)
which(stable.g1$P>0.6)
#> [1] 82 86 87 90 93
Note that if \(p\) is not specified, the package will automatically run stable.clr and apply equal penalties for all covariates.
Boulesteix, A. L., De Bin, R., Jiang, X., & Fuchs, M. (2017). IPF-LASSO: Integrative-penalized regression with penalty factors for prediction based on multi-omics data. Computational and mathematical methods in medicine, 2017.
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.
Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80.