Introduction to rags2ridges

rags2ridges is an R package for fast and proper L2-penalized estimation of precision (and covariance) matrices, also called ridge estimation. Its L2 penalty can shrink towards a target matrix, allowing for incorporation of prior knowledge. It also features a fused L2 ridge penalty that allows for simultaneous estimation of multiple matrices. The package also contains additional functions for post-processing the L2-penalized estimates, which is useful for feature selection and graphical modelling. The fused ridge estimation is useful when dealing with grouped data, as in meta or integrative analysis.

This vignette provides a light introduction on how to get started with regular ridge estimation of precision matrices and further steps.

Getting started

Package installation

The README details how to install the rags2ridges package. Once installed, the package is loaded as seen below.

library(rags2ridges)

Small theoretical primer and package usage

The sample variance-covariance matrix, or simply covariance matrix, is well-known and ubiquitous. It is given by

\[ S = \frac{1}{n - 1}X^TX \]

where \(X\) is the \(n \times p\) data matrix that is zero-centered, with the \(p\)-dimensional observations in the rows. I.e. each row of \(X\) is an observation and each column is a feature. Often high-dimensional data is organised this way (or transposed).
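As a minimal sketch in base R (no rags2ridges functions; the names X_raw, X_centered and S_manual are chosen here for illustration only), the centering and the cross-product of the centered columns can be written out by hand and compared with the built-in cov():

```r
set.seed(1)                                # for reproducibility
n <- 20
p <- 6
X_raw <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Zero-center: subtract each column mean from its column
X_centered <- sweep(X_raw, 2, colMeans(X_raw))

# p x p matrix of cross-products of the centered columns, divided by n - 1
S_manual <- crossprod(X_centered) / (n - 1)

# Compare with the built-in sample covariance
all.equal(S_manual, cov(X_raw))
```

Note that crossprod(X_centered) computes t(X_centered) %*% X_centered, which is a \(p \times p\) matrix, as a covariance matrix of \(p\) features should be.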

That \(X\) is zero-centered simply means that the column means have been subtracted from the columns. The very similar estimate \(S = \frac{1}{n}X^TX\) without Bessel’s correction is the maximum likelihood estimate in a multivariate normal model with mean \(0\) and covariance \(\Sigma\). The log-likelihood function in this case is, up to additive constants and a positive scaling factor, given by

\[ \ell(\Omega; S) = \ln|\Omega| - \text{tr}(S\Omega) \]

where \(\Omega = \Sigma^{-1}\) is the so-called precision matrix (also sometimes called the concentration matrix). It is precisely this \(\Omega\) for which we seek an estimate we will denote \(P\). Indeed, one can naturally try to use the inverse of \(S\) for this:

\[ P = S^{-1} \]

Let’s try.

The createS() function can easily simulate covariance matrices, but we take a more verbose route for illustration:

p <- 6
n <- 20
X <- createS(n = n, p = p, dataset = TRUE)
head(X, n = 4) # Show 4 first of the n rows

##            A      B       C      D     E       F
## [1,]  0.1929 -0.166  0.9021  0.125 0.187 -0.5412
## [2,]  0.7405  0.889  0.0172 -0.635 0.537 -0.0827
## [3,] -1.2461 -0.683 -1.3470  0.693 0.648 -1.7891
## [4,]  0.0166 -0.851  1.9025 -1.222 1.392  1.0383

Here the columns correspond to features A, B, C, and so on.

We can then arrive at the MLE using covML(), which centers X (subtracting the column means) and then computes the estimate:

S <- covML(X)
print(S)

##        A       B       C       D       E       F
## A  1.512  0.2049  0.1829  0.4241 -0.1348  0.3695
## B  0.205  1.0297  0.0048 -0.4313 -0.2056 -0.0964
## C  0.183  0.0048  0.9096 -0.0915 -0.2594  0.0195
## D  0.424 -0.4313 -0.0915  0.9431 -0.0461  0.2925
## E -0.135 -0.2056 -0.2594 -0.0461  0.7429  0.2179
## F  0.370 -0.0964  0.0195  0.2925  0.2179  0.8464

Using cov2cor() the well-known correlation matrix could be obtained.
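For instance, a small base R sketch (using a simulated matrix S_demo as a stand-in for the S above; the names are illustrative only):

```r
set.seed(1)
X_demo <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)
S_demo <- cov(X_demo)

# Rescale the covariance matrix to a correlation matrix
R_demo <- cov2cor(S_demo)

# Equivalent to dividing each entry by the product of the standard deviations
sds <- sqrt(diag(S_demo))
all.equal(R_demo, S_demo / outer(sds, sds))
```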

By default, createS() simulates zero-mean i.i.d. normal variables (corresponding to \(\Sigma=\Omega=I\) being the identity matrix), but it has plenty of possibilities for more intricate covariance structures. The S matrix could have been obtained directly had we omitted the dataset argument, leaving it to be the default FALSE. The rmvnormal() function is utilized by createS() to generate the normal sample.

We can obtain the precision estimate P using solve() to invert S:

P <- solve(S)
print(P)

##          A      B      C      D        E      F
## A  0.99181 -0.473 -0.250 -0.601  0.00514 -0.275
## B -0.47262  1.612  0.365  1.056  0.59331 -0.136
## C -0.25002  0.365  1.386  0.514  0.63669 -0.223
## D -0.60098  1.056  0.514  2.050  0.63716 -0.502
## E  0.00514  0.593  0.637  0.637  1.97175 -0.677
## F -0.27466 -0.136 -0.223 -0.502 -0.67718  1.639

That’s it! Everything goes well here only because \(p < n\). However, when \(p\) is close to \(n\), the estimate becomes unstable and varies wildly, and when \(p\) exceeds \(n\) one can no longer invert \(S\) and this strategy fails:

p <- 25
S2 <- createS(n = n, p = p)  # Direct to S
P2 <- solve(S2)

## Error in solve.default(S2): system is computationally singular: reciprocal condition number = 3.93946e-19

Note that this is now a \(25 \times 25\) precision matrix we are trying to estimate. Datasets where \(p > n\) are starting to be common, so what now?
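The reason for the failure can be made concrete. Centered data with \(n\) rows has rank at most \(n - 1\), so a \(p \times p\) sample covariance with \(p > n\) must have zero eigenvalues and cannot be inverted. The following base R sketch illustrates this on simulated data (X_wide, S_wide and the 1e-8 cut-off are choices made for this illustration, not part of the package):

```r
set.seed(1)
n <- 20
p <- 25
X_wide <- matrix(rnorm(n * p), nrow = n, ncol = p)
S_wide <- cov(X_wide)

# Eigenvalues of the 25 x 25 sample covariance, largest first
ev <- eigen(S_wide, symmetric = TRUE, only.values = TRUE)$values

# Count eigenvalues that are numerically non-zero
# (the 1e-8 cut-off is an arbitrary choice for this illustration)
n_nonzero <- sum(ev > 1e-8)
n_nonzero   # no more than n - 1, so S_wide is singular
```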

To solve the problem, rags2ridges adds a so-called ridge penalty to the likelihood above. This method is also called \(L_2\) shrinkage and works by “shrinking” the eigenvalues of \(S\) in a particular manner, which keeps the eigenvalues of the resulting precision estimate from “exploding” when \(p \geq n\).

The core problem that rags2ridges solves is maximizing the penalized log-likelihood

\[ \ell(\Omega; S) = \ln|\Omega| - \text{tr}(S\Omega) - \frac{\lambda}{2}|| \Omega - T||^2_2 \]

where \(\lambda > 0\) is the ridge penalty parameter, \(T\) is a \(p \times p\) known target matrix (which we will get back to) and \(||\cdot||_2\) is the \(L_2\)-norm (for a matrix, the Frobenius norm). The maximizing solution is, surprisingly, available in closed form, but it is rather complicated¹. Assume for now the target matrix is an all zero matrix and thus out of the equation.
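To make the closed form a little less mysterious, here is a sketch for the special case \(T = 0\). It is derived here by setting the gradient of the penalized log-likelihood to zero, \(\Omega^{-1} - S - \lambda\Omega = 0\), and is not taken from the package source. Writing \(S = VDV^T\), the condition is solved by keeping the eigenvectors \(V\) and mapping each eigenvalue \(d\) of \(S\) to \((\sqrt{d^2 + 4\lambda} - d) / (2\lambda)\). The helper name ridge_zero_target() is hypothetical, and the numbers need not match ridgeP() output, since ridgeP() may use a non-zero default target.

```r
# Hypothetical helper: ridge precision estimate with an all-zero target,
# obtained by transforming the eigenvalues of S
ridge_zero_target <- function(S, lambda) {
  eig <- eigen(S, symmetric = TRUE)
  omega <- (sqrt(eig$values^2 + 4 * lambda) - eig$values) / (2 * lambda)
  # V %*% diag(omega) %*% t(V), written without forming diag(omega)
  eig$vectors %*% (omega * t(eig$vectors))
}

set.seed(1)
n <- 20
p <- 25
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
S_ml <- crossprod(sweep(X_sim, 2, colMeans(X_sim))) / n  # singular, p > n

lambda <- 1.17
P_ridge <- ridge_zero_target(S_ml, lambda)

# The gradient of the penalized log-likelihood should vanish at the estimate
gradient <- solve(P_ridge) - S_ml - lambda * P_ridge
max(abs(gradient))   # numerically close to zero
```

Even for a zero eigenvalue of \(S\), the mapped eigenvalue is \(1 / \sqrt{\lambda}\), which is finite and positive. This is why the estimate exists and is positive definite although \(S\) itself cannot be inverted.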

The core function of rags2ridges is ridgeP(), which computes this estimate in a fast manner.

P2 <- ridgeP(S2, lambda = 1.17)
print(P2[1:7, 1:7]) # Showing only the 7 first cols and rows

##         A         B         C       D       E       F       G
## A  4.1380  0.074299 -0.077388  0.2333  0.1517 -0.0611 -0.0441
## B  0.0743  3.852059  0.000171  0.0495 -0.3440  0.1268  0.4402
## C -0.0774  0.000171  4.141499 -0.1013 -0.1807  0.0651 -0.0723
## D  0.2333  0.049464 -0.101335  3.5453  0.0713 -0.0169  0.0986
## E  0.1517 -0.344025 -0.180677  0.0713  3.8917  0.2976  0.1880
## F -0.0611  0.126784  0.065059 -0.0169  0.2976  4.0244 -0.1250
## G -0.0441  0.440229 -0.072276  0.0986  0.1880 -0.1250  3.6523

And voilà, we have our estimate. We will now discuss the penalty parameters and target matrix and how to choose them.

The penalty parameter

The penalty parameter \(\lambda\) (lambda) shrinks the values of \(P\) toward 0 (when \(T = 0\)). I.e. very large values of \(\lambda\) make \(P\) “small” and more stable, whereas smaller values of \(\lambda\) make \(P\) tend toward the (possibly non-existent) \(S^{-1}\). So what lambda should you choose? One strategy is to select \(\lambda\) so that the estimate is stable yet precise (a bias-variance trade-off). Automatic k-fold cross-validation with optPenalty.kCVauto() is well suited for this:

Y <- createS(n, p, dataset = TRUE)
opt <- optPenalty.kCVauto(Y, lambdaMin = 0.001, lambdaMax = 100)
str(opt)

## List of 2
##  $ optLambda: num 0.76
##  $ optPrec  : 'ridgeP' num [1:25, 1:25] 2.9202 -0.2749 -0.0744 -0.0204 -0.1199 ...
##   ..- attr(*, "lambda")= num 0.76
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:25] "A" "B" "C" "D" ...
##   .. ..$ : chr [1:25] "A" "B" "C" "D" ...

As seen, the function returns a list with the optimal penalty parameter and the corresponding ridge precision estimate. By default, the function performs leave-one-out cross-validation. See ?optPenalty.kCVauto for more information.
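To illustrate the idea behind cross-validating \(\lambda\), the sketch below runs a plain 5-fold cross-validation over a grid of penalties in base R, scoring each fit by the negative log-likelihood on the held-out fold. It is a simplified stand-in under stated assumptions (zero target, a fixed grid, the hypothetical helper ridge_zero_target()) and not the procedure implemented in optPenalty.kCVauto().

```r
# Hypothetical helper: ridge precision estimate with an all-zero target
ridge_zero_target <- function(S, lambda) {
  eig <- eigen(S, symmetric = TRUE)
  omega <- (sqrt(eig$values^2 + 4 * lambda) - eig$values) / (2 * lambda)
  eig$vectors %*% (omega * t(eig$vectors))
}

set.seed(42)
n <- 20
p <- 25
Y_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)

n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = n))  # random fold labels
lambdas <- 10^seq(-2, 2, length.out = 25)                 # candidate penalties

cv_score <- sapply(lambdas, function(lambda) {
  fold_loss <- sapply(seq_len(n_folds), function(k) {
    train <- Y_sim[fold_id != k, , drop = FALSE]
    test  <- Y_sim[fold_id == k, , drop = FALSE]
    mu <- colMeans(train)                     # center both parts by training means
    S_train <- crossprod(sweep(train, 2, mu)) / nrow(train)
    S_test  <- crossprod(sweep(test, 2, mu)) / nrow(test)
    P_train <- ridge_zero_target(S_train, lambda)
    log_det <- as.numeric(determinant(P_train, logarithm = TRUE)$modulus)
    # Negative log-likelihood on held-out data: -ln|P| + tr(S_test P)
    -log_det + sum(S_test * P_train)
  })
  mean(fold_loss)
})

best_lambda <- lambdas[which.min(cv_score)]
best_lambda
```

A grid search is used here only because it is easy to read; a numerical optimizer over \(\lambda\) would avoid having to fix the grid in advance.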

The target matrix

The target matrix \(T\) is a matrix of the same size as \(P\) which the estimate is “shrunken” toward, i.e. for large values of \(\lambda\) the estimate goes toward \(T\). The choice of the target is another subject. While one might first think that the all-zeros \(T = [0]\) would be a natural default, it is intuitively not a good target. This is because we’d like an estimate that is positive definite (the matrix equivalent of a positive number) and the null matrix is not positive definite.

If one has a very good prior estimate or some other information, this might be used to construct the target. E.g. the function kegg.target() utilizes the Kyoto Encyclopedia of Genes and Genomes (KEGG) database of genes and gene networks together with pilot data to construct a target.

In the absence of such knowledge, the default could be a data-driven diagonal matrix. The function default.target() offers some different approaches to selecting this. A good choice here is often the diagonal matrix with the reciprocal mean of the eigenvalues of the sample covariance as entries. See ?default.target for more choices.
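As a rough base R sketch of such a diagonal target, the code below reads “reciprocal mean of the eigenvalues” as one over the mean eigenvalue of the sample covariance. That reading is an assumption made here, and the exact options of default.target() may be defined differently; the names S_t and target_sketch are illustrative.

```r
set.seed(1)
X_t <- matrix(rnorm(20 * 25), nrow = 20, ncol = 25)
S_t <- cov(X_t)
p <- ncol(S_t)

# The mean eigenvalue of S equals trace(S) / p, so no decomposition is needed
mean_eigenvalue <- sum(diag(S_t)) / p

# Scalar multiple of the identity: positive definite even though S_t is singular
target_sketch <- diag(1 / mean_eigenvalue, p)
target_sketch[1:3, 1:3]
```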

Gaussian graphical modeling and post processing

What is so interesting with the precision matrix anyway? I’m always interested in correlations and thus the correlation matrix.

As you may know, correlation does not imply causation. Nor does covariance imply causation. However, the precision matrix provides stronger hints at causation. A relatively simple transformation of \(P\) maps it to partial correlations, much like how the sample covariance \(S\) easily maps to the correlation matrix. More precisely, the \(ij\)th partial correlation is given by

\[ \rho_{ij|\text{all others}} = \frac{- p_{ij}}{\sqrt{p_{ii}p_{jj}}} \]

where \(p_{ij}\) is the \(ij\)th entry of \(P\).

Partial correlations measure the linear association between two random variables whilst removing the effect of other random variables; in this case, it is all the remaining variables. This somewhat resolves the issue with “regular” correlations, where variables are often correlated but only indirectly; either by a sort of ‘transitivity of correlations’ (which does not hold generally and is not² so³ simple⁴) or by common underlying variables.
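The formula for \(\rho_{ij|\text{all others}}\) and the “removing the effect of other variables” interpretation can be checked against each other in base R. The sketch below uses simulated data and illustrative names (X_pc, P_pc, partial_cor). It computes the partial correlations from the inverse sample covariance and compares one of them with the correlation between regression residuals.

```r
set.seed(1)
X_pc <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6,
               dimnames = list(NULL, LETTERS[1:6]))
P_pc <- solve(cov(X_pc))        # precision estimate (p < n, so S is invertible)

# rho_ij = -p_ij / sqrt(p_ii * p_jj)
partial_cor <- -P_pc / sqrt(outer(diag(P_pc), diag(P_pc)))
diag(partial_cor) <- 1          # by convention, set the diagonal to 1

# Same quantity for the pair (A, B): regress each on the remaining variables
# and correlate what is left over
others <- X_pc[, c("C", "D", "E", "F")]
res_A <- resid(lm(X_pc[, "A"] ~ others))
res_B <- resid(lm(X_pc[, "B"] ~ others))

c(from_precision = partial_cor["A", "B"],
  from_residuals = cor(res_A, res_B))
```

The two numbers agree because, for the inverse of the sample covariance, the scaled off-diagonal entries are exactly the residual correlations after linear adjustment for all other variables.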

OK, but what is graphical about the graphical ridge estimate?

In a multivariate normal model, \(p_{ij} = p_{ji} = 0\) if and only if \(X_i\) and \(X_j\) are conditionally independent when conditioning on all other variables. I.e. \(X_i\) and \(X_j\) are conditionally independent given all \(X_k\) where \(k \neq i\) and \(k \neq j\) if and only if the \(ij\)th and \({ji}\)th elements of \(P\) are zero. In real world applications, this means that \(P\) is often relatively sparse (lots of zeros). This also points to the close relationship between \(P\) and the partial correlations.

The non-zero off-diagonal entries of a symmetric PD matrix can then be interpreted as the edges of a graph where nodes correspond to the variables.

Graphical ridge estimation? Why not graphical Lasso?

The graphical lasso (gLasso) is the L1-equivalent to graphical ridge. A nice feature of the L1 penalty is that it automatically induces sparsity and thus also selects the edges in the underlying graph. The L2 penalty of rags2ridges relies on an extra step that selects the edges after \(P\) is estimated. While some may regard this as a drawback (typically due to a lack of perceived simplicity), it is often beneficial to separate the “variable selection” and estimation.

First, a separate post-hoc selection step allows for greater flexibility.

Secondly, when co-linearity is present the L1 penalty is “unstable” in the selection between the items. I.e. if 2 covariates are co-linear only one of them will typically be selected in an unpredictable way, whereas the L2 penalty will put equal weight on both and “average” their effect. Ultimately, this means that the L2 estimate is typically more stable than the L1.

A last point to mention here is that the true underlying graph might not always be very sparse (or sparse at all).

How do I select the edges then?

The sparsify() function lets you select the non-zero entries of P corresponding to edges. It supports a handful of different approaches ranging from simple thresholding to false discovery rate based selection.
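To give a feel for the simplest of these ideas, the base R sketch below thresholds absolute partial correlations by hand. It does not call sparsify() or reproduce its options; the cut-off of 0.25 and all names (P_s, adjacency, P_sparse) are arbitrary choices for illustration.

```r
set.seed(1)
X_s <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6,
              dimnames = list(NULL, LETTERS[1:6]))
P_s <- solve(cov(X_s))                 # precision estimate

# Partial correlations: scaled precision matrix with flipped sign
partial_cor <- -cov2cor(P_s)
diag(partial_cor) <- 1

threshold <- 0.25                      # arbitrary cut-off for this sketch
adjacency <- abs(partial_cor) >= threshold
diag(adjacency) <- FALSE               # no self-loops in the graph

# Zero out the entries of P that were not selected (keep the diagonal)
keep <- adjacency
diag(keep) <- TRUE
P_sparse <- P_s
P_sparse[!keep] <- 0

# Number of edges attached to each node
degree <- rowSums(adjacency)
degree
```

One caveat of this naive approach is that zeroing entries of a positive definite matrix does not guarantee that the result is still positive definite, so the thresholded matrix is best read as a description of the selected graph.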

After edge selection, GGMnetworkStats() can be utilized to get summary statistics of the resulting graph topology.