Overview

This document is part of the “DrBats” project, whose goal is to implement exploratory statistical analysis of large data sets with uncertainty. The idea is to visualize the results of the analysis in a way that explicitly illustrates the uncertainty in the data.

The “DrBats” project applies a Bayesian Latent Factor Model.

This project involves the following persons, listed in alphabetical order:

Bayesian Latent Factor Model

Notations

We want to visualize \(\mathbf{Y}\), the \(N \times P\) matrix of observations, in a lower dimension. The \(\beta_i \in \mathbb{R}^D\) are the coordinates of the observations in the lower-dimensional space, \(\mathbf{W}\) is the \(P \times D\) transition matrix, and \(\epsilon_i\) represents the difference between the low-rank representation and the actual data.

Model

Without loss of generality, we assume that \(\mathbf{Y}\) is centered. The Latent Factor Model is therefore:

\[ \mathbf{Y_i}^T = \mathbf{W}\beta_i + \epsilon_i, \quad i = 1, \dots, N \]

For identifiability reasons, we estimate a \(P \times D\) positive lower triangular (PLT) matrix instead of the full matrix \(\mathbf{W}\). Indeed, \(\mathbf{W}\) admits a decomposition into an orthogonal matrix and an upper triangular matrix, \(\mathbf{W} = \mathbf{Q} \mathbf{R}\) (a QR decomposition). We therefore estimate the PLT matrix \(\mathbf{R}^T = \left(r_{j, k}\right)_{j = 1:P,\, k = 1:D}\), the rotation matrix \(\mathbf{Q}\) being known (for instance, estimated from the classical PCA):

\[ \mathbf{R}^T = \begin{pmatrix} r_{1, 1} & 0 & \cdots & 0 \\ r_{2, 1} & r_{2, 2} & \ddots & \vdots \\ \vdots & \vdots & \ddots & 0 \\ r_{D, 1} & r_{D, 2} & \cdots & r_{D, D} \\ \vdots & \vdots & & \vdots \\ r_{P, 1} & r_{P, 2} & \cdots & r_{P, D} \end{pmatrix} \]
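This decomposition can be computed with base R’s qr() function. Below is a minimal illustration on a hypothetical full matrix W (the dimensions are placeholders, not the package’s defaults):

# QR decomposition of a hypothetical P x D matrix W (illustration only)
P <- 6; D <- 2
W <- matrix(rnorm(P * D), nrow = P, ncol = D)
qr.W <- qr(W)
Q <- qr.Q(qr.W)         # orthogonal factor
R <- qr.R(qr.W)         # upper triangular factor
max(abs(W - Q %*% R))   # reconstruction error, numerically zero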

We assume for now that all rows of \(\mathbf{Y}\) share the same residual variance, with covariance matrix \(\sigma^2 \mathbf{1}_P\). The full Bayesian model is:

\[ \begin{aligned} \mathbf{Y_i}^T \mid \mathbf{R}^T, \beta_i, \sigma^2 \;&\overset{i.i.d.}{\sim}\; \mathcal{N}_P(\mathbf{R}^T\beta_i,\ \sigma^2 \mathbf{1}_P) \\ \beta_i \;&\overset{i.i.d.}{\sim}\; \mathcal{N}_D(0, \mathbf{1}_D) \\ r_{j, k} \;&\overset{i.i.d.}{\sim}\; \mathcal{N}(0, \tau^2) \\ \sigma^2, \tau^2 \;&\sim\; IG(0.001, 0.001) \end{aligned} \]

We assume that the non-null entries of the PLT matrix are independent, centered and normally distributed, with common variance \(\tau^2\).
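To make these priors concrete, here is a minimal sketch drawing one parameter set from them (D and P are placeholders; an inverse gamma draw is obtained as the reciprocal of a gamma draw):

D <- 2; P <- 150
tau2   <- 1 / rgamma(1, shape = 0.001, rate = 0.001)   # tau^2 ~ IG(0.001, 0.001)
sigma2 <- 1 / rgamma(1, shape = 0.001, rate = 0.001)   # sigma^2 ~ IG(0.001, 0.001)
Rt <- matrix(rnorm(P * D, sd = sqrt(tau2)), P, D)      # r_jk ~ N(0, tau^2)
Rt[upper.tri(Rt)] <- 0                                 # impose the PLT zero pattern
beta.i <- rnorm(D)                                     # beta_i ~ N_D(0, 1_D)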

To integrate information about uncertainty for interval-valued data, we can put an informative prior on the variance of each individual \(\mathbf{Y_i}\). This could take the form of a weight matrix, for instance \(\sigma^2 \Phi_i\), where \(\Phi_i\) is either fixed by the user or estimated.
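A purely hypothetical construction of such a weight matrix, where wider observation intervals inflate the residual variance (the half-widths and the formula for Phi.i are illustrative, not the package’s choice):

# Hypothetical weights for one individual with interval-valued data
half.width <- runif(10, min = 0.1, max = 1)   # placeholder interval half-widths
Phi.i <- diag(1 + half.width^2)               # one possible user-fixed choice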

Finally, contrary to classical Principal Component Analysis, in this model the factor loadings \(\beta_i\) of each individual are random variables. This accounts for the uncertainty of the projection, and results in non-elliptical confidence regions around the estimated factor loadings.

Data Simulation

We can simulate data using \(\mathbf{Y_i}^T = \mathbf{R}^T\beta_i + \epsilon_i\) with the function drbats.simul(). We choose a matrix \(\mathbf{R}\), and then build \(\mathbf{Y_i}\) by simulating \(\beta_i\) and \(\epsilon_i\) for each individual. To obtain the full matrix \(\mathbf{Y}\), we stack the rows \(\mathbf{Y_i}\).
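A self-contained sketch of this generative scheme (the matrix \(\mathbf{R}^T\) is chosen arbitrarily here; drbats.simul() builds a more realistic one):

set.seed(45)
N <- 5; P <- 150; D <- 2; sigma2 <- 0.2
Rt <- matrix(rnorm(P * D), P, D)
Rt[upper.tri(Rt)] <- 0                                 # arbitrary PLT-shaped R^T
beta <- matrix(rnorm(D * N), D, N)                     # one score vector per individual
eps  <- matrix(rnorm(P * N, sd = sqrt(sigma2)), P, N)  # residuals
Y    <- t(Rt %*% beta + eps)                           # stack the Y_i as rows: N x P
dim(Y)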

The matrix \(\mathbf{R}\) built in this package is, as previously stated, the result of the matrix decomposition of a full low-rank matrix \(\mathbf{W}\). To choose an \(\mathbf{R}\) that resembles an agronomic dataset, we build \(\mathbf{W}\) as the low-rank matrix of an extreme case of data found in agronomy: observations are often signals that trend over time, with peaks at certain moments of the observation period. In addition, the number of observations can be small precisely when these peaks occur.

To build \(\mathbf{W}\) and, subsequently, \(\mathbf{R}\), we first simulate bi-modal signals observed unevenly over time.

suppressPackageStartupMessages(library(DrBats))
set.seed(45)

# Simulate N = 5 bi-modal signals observed at up to P = 150 irregular time points
toydata <- drbats.simul(N = 5,
                        P = 150,
                        t.range = c(0, 1000),
                        b.range = c(0.2, 0.4),
                        c.range = c(0.6, 0.8),
                        b.sd = 5,
                        c.sd = 5,
                        y.range = c(-5, 5),
                        sigma2 = 0.2,
                        breaks = 8,
                        data.type = 'sparse.tend')

# Plot the simulated signals over time
matplot(t(toydata$t), t(toydata$X), type = 'l', lty = 1, lwd = 1,
        xlab = 'Time', ylab = ' ')
points(t(toydata$t), t(toydata$X), pch = '.')

For details please refer to the simul_and_project.pdf vignette.

We set the dimensionality of the low-rank matrix as the number of axes retained after Principal Component Analysis:

# Percentage of inertia explained by each principal axis
mycol <- rainbow(length(toydata$proj.pca$lambda.perc))  # colour palette for the bars
barplot(toydata$proj.pca$lambda.perc, ylim = c(0, 1), col = mycol)

## [1] "Number of retained axes: 2"

If you want to use the PCA rotation to anchor the latent factors, the function wlu() performs an LU matrix decomposition of the matrix of latent factors.
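For intuition, an LU decomposition can also be computed directly with the Matrix package; a minimal sketch on a hypothetical square matrix A (wlu() is the package’s own wrapper for the latent factor matrix):

# LU decomposition of a hypothetical square matrix A
library(Matrix)
A <- matrix(rnorm(9), 3, 3)
dec <- expand(lu(Matrix(A)))    # permutation P, unit lower triangular L, upper triangular U
dec$P %*% dec$L %*% dec$U       # reconstructs A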

Model Inference

We now fit a Stan model to the simulated data with the No-U-Turn Sampler, using the modelFit() function and the rstan package.

See the modelFit.pdf vignette for details.

# Fit the PLT model with inverse gamma priors ("igPLT") via Stan
fit <- modelFit(model = "igPLT", prog = "stan", Xhisto = toydata$Y.simul$Y,
                nchains = 4, nthin = 50, niter = 10000, D = toydata$wlu$D)

The modelFit() function outputs an object, here called fit, containing the posterior estimates of the model parameters.

Evaluation

For evaluation, we can convert this object to an mcmc.list and apply the diagnostic tests described in the coda package manual. For consistency, our package’s functions also operate on mcmc.list objects.
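A sketch of this conversion and two standard coda diagnostics, assuming fit is (or contains) a stanfit object; the exact structure of modelFit()’s output may differ:

library(coda)
codafit <- rstan::As.mcmc.list(fit)   # posterior draws as an mcmc.list
gelman.diag(codafit)                  # potential scale reduction factors (Rhat)
effectiveSize(codafit)                # effective sample size per parameter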

If post-processing is needed, our package provides a function to repair reflection (sign-switching) problems; in what follows, codafit.clean denotes the resulting cleaned mcmc.list. Refer to the modelEval.pdf vignette for details.
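The package function should be preferred; purely for intuition, a generic sign alignment for the posterior draws of one factor could look like this (not the package’s implementation):

# Flip iterations in which the reference entry of the factor is negative,
# so that all draws lie on the same side of the reflection
align.reflection <- function(draws) {      # draws: iterations x P matrix
  ref <- which.max(abs(colMeans(draws)))   # entry with largest mean magnitude
  flip <- ifelse(draws[, ref] < 0, -1, 1)
  draws * flip                             # row-wise sign flip
}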

We can plot the histogram of the posterior density of the data:

# Posterior density of the data, evaluated on chain 1
post <- postdens(codafit.clean, Y = toydata$Y.simul$Y, D = toydata$wlu$D, chain = 1)
hist(post, main = "Histogram of the posterior density", xlab = "Density")

It is possible to visualize the projection of the observations onto the lower-dimensional space with the function visbeta(). We can project onto the latent factors of our choice; here we chose the first and second (in fact the only possible choice, since the toy example has just two latent factors). The \(95\%\) uncertainty envelope is also plotted if we set quant = c(0.05, 0.95).

beta.res <- visbeta(codafit.clean, toydata$Y.simul$Y, toydata$wlu$D,
                    chain = 1, axes = c(1, 2), quant = c(0.05, 0.95))

ggplot2::ggplot() +
  ggplot2::geom_path(data = beta.res$contour.df, ggplot2::aes(x = x, y = y, colour = ind)) +
  ggplot2::geom_point(data = beta.res$mean.df, ggplot2::aes(x = x, y = y, colour = ind)) +
  ggplot2::ggtitle("Convex hull of HMC estimates of the scores")

# Credible intervals for the entries of each factor, plus the inertia per factor
W.res <- visW(codafit.clean, toydata$Y.simul$Y, toydata$wlu$D,
              chain = 1, factors = c(1, 2))
W.res
## $res.W
##     Lower.est Estimation Upper.est Factor
## 1   202.47211   325.0718  832.0374      1
## 2   310.74754  1038.2422 2332.0514      1
## 3   197.37565   614.9283 1316.8594      1
## 4    42.13060   549.0696 1279.2705      1
## 5    24.53771   666.8108 2340.6475      1
## 6    55.22115   927.8369 2424.4718      1
## 7    35.36126   581.7619 1338.2913      1
## 8     0.00000     0.0000    0.0000      2
## 9   586.89446   815.4527 1081.9533      2
## 10  207.19608   282.9008  418.2881      2
## 11  370.17494   512.6679  718.9451      2
## 12  547.32480   749.1254 1100.3490      2
## 13 1722.95012  2431.3896 3510.9093      2
## 14  845.69432  1168.7538 1649.5560      2
## 
## $Inertia
## [1] 0.08413625 0.91586375

References

G. Weinrott, B. Fontez, N. Hilgert & S. Holmes, “Modèle Bayésien à facteurs latents pour l’analyse de données fonctionnelles” (Bayesian latent factor model for the analysis of functional data), Actes des JdS 2016.