gif: Graphical Independence Filtering for Learning Large-Scale Sparse Graphical Models

Shiyun Lin, linshy27@mail2.sysu.edu.cn

May 15, 2020

Introduction

One of the fundamental problems in data mining and statistical analysis is to detect the relationships among a set of variables. To this end, researchers apply undirected graphical models in work, which combine graph theory and probability theory to create networks that model complex probabilistic relationships. By estimating the underlying graphical model, one can capture the direct dependence between variables. In the last few decades, undirected graphical models have attracted numerous attention in various areas such as genetics, neuroscience, finance and social science.

When the data is multivariate Gaussian distributed, detecting the graphical model is equivalent to estimating the inverse covariance matrix. gif package provides efficient solutions for this problem. The core functions in gif package are hgt and sgt.

These functions based on graphical independence filtering have several advantages:

Method \(p = 1000\) \(p = 4000\) \(p = 10000\)
hgt 0.395s 6.668s 46.993s
sgt 0.225s 3.099s 21.454s
QUIC 1.580s 117.041s 1945.648s
fastclime 62.704s *** ***

Particularly, hgt provides a solution for best subset selection problem in Gaussian graphical models and sgt offers closed-form solution equivalent to graphical lasso when the graph structure is acyclic.

Installation

CRAN version

To install the gif R package from CRAN, just run:

install.packages("gif")

Github version

To install the development version from Github, run:

library(devtools)
install_github("Mamba413/gif/R-package", build_vignettes = TRUE)

Windows user will need to install Rtools first.

Usage

Take a synthetic dataset as a simple example to illustrate how to use hgt and sgt to estimate the underlying graphical model.

Simulated Data

Using the function ggm.generator, we extract 200 samples from the graphical model with \(p = 100\) and whose graph structure is the so-called AR(1). A sketch of the example could be seen in the following picture.

set.seed(1)
n <- 200
p <- 100
Omega <- diag(1, p, p)
for(i in 1:(p - 1)) {
  Omega[i, i + 1] <- 0.5
  Omega[i + 1, i] <- 0.5
}
x <- ggm.generator(n, Omega)

hgt

For Hard Graphical Thresholding algorithm, users could choose to estimate the underlying model with given model size or given active entries.

When the model size in given, the program would return a \(p \times p\) matrix with number of non-zero off-diagonal entries equal to the given model size and a \(K \times 2\) matrix marks down the corresponding active entries.

non_zero_num <- sum(Omega != 0) - p
res <- hgt(x, size = non_zero_num / 2)
Omega_hat <- as.matrix(res[["Omega"]])
head(Omega_hat[, 1:6])
#           [,1]      [,2]      [,3]      [,4]      [,5]       [,6]
# [1,] 0.5353892 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
# [2,] 0.0000000 0.2576383 0.0000000 0.0000000 0.0000000 0.00000000
# [3,] 0.0000000 0.0000000 0.1609424 0.0000000 0.0000000 0.00000000
# [4,] 0.0000000 0.0000000 0.0000000 0.1340856 0.0000000 0.00000000
# [5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.1082484 0.00000000
# [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.09268247
active.entry <- res[["active.entry"]]
head(active.entry)
#      row col
# [1,]   9  10
# [2,]  10  11
# [3,]  11  12
# [4,]  13  14
# [5,]  14  15
# [6,]  15  16

When the active entries are given directly, the model fitting procedure is the so-called covariance selection and the program would return a \(p \times p\) matrix whose non-zero off-diagonal entries correspond to the given active entries.

non_zero_index <- which(as.matrix(Omega) != 0, arr.ind = TRUE)
active.entry <- non_zero_index[which(non_zero_index[,1] < non_zero_index[,2]),]
res <- hgt(x, active.entry = active.entry)

sgt

For Soft Graphpical Thresholding algorithm, users could choose to estimate the underlying model with given model size or given regularization parameter \(\lambda\). In the return, we not only provide the parameter matrix and the corresponding active entries mentioned above, but also a boolean flag indicating whether the estimated graph structure is acyclic or not, since the solution would be equivalent to graphical lasso if the graph is acyclic.

The usage is similar to hgt when model size is given, and when regularization parameter is given, users could proceed as follows.

res <- sgt(x, lambda = 0.01)
res[["is.acyclic"]]
# [1] FALSE

License

GPL (>= 2)

Reference