Dealing with label switching: relabelling in Bayesian mixture models by pivotal units

Leonardo Egidi

2023-02-22

In this vignette we explore the fit of Gaussian mixture models and the relabelling pivotal method proposed in Egidi et al. (2018b) through the pivmet package. First of all, we load the package:

library(pivmet)
library(bayesmix)
library(bayesplot)

The pivmet package provides a simple framework to (i) fit univariate and multivariate mixture models according to a Bayesian flavour and detect the pivotal units, via the piv_MCMC function; (ii) perform the relabelling step via the piv_rel function.

There are two main functions used for this task.

The function piv_MCMC:

\[c_{ip} = \frac{n_{ip}}{H}=\frac{1}{H} \sum_{h=1}^{H}|[z_i]_h=[z_p]_h|,\] where \(|\cdot|\) denotes the event indicator and \(n_{ip}\) is the number of times the units \(i, \ p\) belong to the same group along the sampling.

These three methods are applied by the internal function piv_sel. Alternatively, when \(k <5\), the user can set piv.criterion="MUS" (Egidi et al. 2018a) which performs a sequential search of identity submatrices within the matrix \(C\) via the internal function MUS.

The function piv_rel:

\[\begin{align*} [\mu_{j}]_h=&[\mu_{z_{i_{j}}}]_h \\ [z_{i}]_h =j & \mbox{ for } i : [z_i]_h=[z_{i_{j}}]_h,\\ \end{align*}\] where \(\boldsymbol{\mu}=(\mu_1,\mu_2,\ldots,\mu_k)\) is the vector of the means parameters and \(\boldsymbol{z}=(z_1,z_2,\ldots,z_N)\) an i.i.d. vector of latent variables taking values in \(\{1,2,\ldots,k \}\) and denoting the group membership of each statistical unit.

piv_rel takes as input the MCMC output from piv_MCMC and returns the relabelled chains and the corresponding posterior estimates.

Example: bivariate Gaussian data

Suppose now that \(\boldsymbol{y}_i \in \mathbb{R}^2\) and assume that:

\[\boldsymbol{y}_i \sim \sum_{j=1}^{k}\eta_{j}\mathcal{N}_{2}(\boldsymbol{\mu}_{j}, \boldsymbol{\Sigma})\] where \(\boldsymbol{\mu}_j\) is the mean vector for group \(j\), \(\boldsymbol{\Sigma}\) is a positive-definite covariance matrix and the mixture weight \(\eta_j= P(z_i=j)\) as for the one-dimensional case. We may generate Gaussian mixture data through the function piv_sim, specifying the sample size \(N\), the desired number of groups \(k\) and the \(k \times 2\) matrix for the \(k\) mean vectors. The argument W handles the weights for a nested mixture, in which the \(j\)-th component is in turn modelled as a two-component mixture, with covariance matrices \(\boldsymbol{\Sigma}_{p1}, \boldsymbol{\Sigma}_{p2}\), respectively.

set.seed(500)
N  <- 200
k  <- 3
D <- 2
nMC <- 2000
M1 <- c(-10,8)
M2 <- c(10,.1)
M3 <- c(30,8)
# matrix of input means
Mu <- rbind(M1,M2,M3)
# covariance matrices for the two subgroups
Sigma.p1 <- diag(D)
Sigma.p2 <- (10^2)*diag(D)
# subgroups' weights
W   <- c(0.2,0.8)
# simulate data
sim <- piv_sim(N = N, k = k, Mu = Mu,
 Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)

The function piv_MCMC requires only three mandatory arguments: the data object y, the number of components k and the number of MCMC iterations, nMC. By default, it performs Gibbs sampling using the runjags package. If software="rjags", for bivariate data the priors are specified as:

\[\begin{align} \boldsymbol{\mu}_j \sim & \mathcal{N}_2(\boldsymbol{\mu}_0, S_2)\\ \Sigma^{-1} \sim & \mbox{Wishart}(S_3, 3)\\ \eta \sim & \mbox{Dirichlet}(\boldsymbol{\alpha}), \end{align}\]

where \(\boldsymbol{\alpha}\) is a \(k\)-dimensional vector and \(S_2\) and \(S_3\) are positive definite matrices. By default, \(\boldsymbol{\mu}_0=\boldsymbol{0}\), \(\boldsymbol{\alpha}=(1,\ldots,1)\) and \(S_2\) and \(S_3\) are diagonal matrices, with diagonal elements equal to 1e+05. Different values can be specified for the hyperparameters \(\boldsymbol{\mu}_0, S_2, S_3\) and \(\boldsymbol{\alpha}\): priors =list(mu_0 = c(1,1), S2 = ..., S3 = ..., alpha = ...)}, with the constraint for \(S2\) and \(S3\) to be positive definite, and \(\boldsymbol{\alpha}\) a vector of dimension \(k\) with nonnegative elements.

If software="rstan", the function performs Hamiltonian Monte Carlo (HMC) sampling. In this case the priors are specified as:

\[\begin{align} \boldsymbol{\mu}_j \sim & \mathcal{N}_2(\boldsymbol{\mu}_0, LD^*L^{T})\\ L \sim & \mbox{LKJ}(\eta)\\ D_{1,2} \sim & \mbox{HalfCauchy}(0, \sigma_d). \end{align}\]

The covariance matrix is expressed in terms of the LDL decomposition as \(LD^*L^{T}\), a variant of the classical Cholesky decomposition, where \(L\) is a \(2 \times 2\) lower unit triangular matrix and \(D^*\) is a \(2 \times 2\) diagonal matrix. The Cholesky correlation factor \(L\) is assigned a LKJ prior with \(\epsilon\) degrees of freedom, which, combined with priors on the standard deviations of each component, induces a prior on the covariance matrix; as \(\epsilon \rightarrow \infty\) the magnitude of correlations between components decreases, whereas \(\epsilon=1\) leads to a uniform prior distribution for \(L\). By default, the hyperparameters are \(\boldsymbol{\mu}_0=\boldsymbol{0}\), \(\sigma_d=2.5, \epsilon=1\). Different values can be chosen with the argument: priors=list(mu_0=c(1,2), sigma_d = 4, epsilon = 2).

We fit the model using rjags with 2000 MCMC iterations and default priors:

res <- piv_MCMC(y = sim$y, k= k, nMC =nMC, 
                piv.criterion = "maxsumdiff")
#> Compiling rjags model...
#> Calling the simulation using the rjags method...
#> Note: the model did not require adaptation
#> Burning in the model for 1000 iterations...
#> Running the model for 2000 iterations...
#> Simulation complete
#> Note: Summary statistics were not produced as there are >50 monitored
#> variables
#> [To override this behaviour see ?add.summary and ?runjags.options]
#> FALSEFinished running the simulation
#> Calculating summary statistics...
#> Calculating the Gelman-Rubin statistic for 13 variables....
#> Note: Unable to calculate the multivariate psrf
#> 
#> JAGS model summary statistics from 8000 samples (chains = 4; adapt+burnin = 2000):
#>                                                                               
#>             Lower95     Median   Upper95       Mean        SD Mode       MCerr
#> mu[1,1]     -14.929    -11.022    30.206    -4.8522    12.318   --     0.55816
#> mu[2,1]     -12.979     8.3939    11.677     3.8098     8.827   --     0.39286
#> mu[3,1]      10.383     31.196    34.631     29.862     5.289   --     0.18869
#> mu[1,2]     -2.2595     6.3831    10.315     5.0314     4.037   --    0.099145
#> mu[2,2]     -1.6616      1.373    8.4781     2.3568    2.9636   --    0.096032
#> mu[3,2]       0.871     5.4391      9.44     5.3455    2.0762   --    0.053396
#> tau[1,1]  0.0083046   0.018274  0.024732   0.017846 0.0039067   --  0.00017228
#> tau[2,1] -0.0029414 0.00072118 0.0043499 0.00084049 0.0020083   -- 0.000069349
#> tau[1,2] -0.0029414 0.00072118 0.0043499 0.00084049 0.0020083   -- 0.000069349
#> tau[2,2]  0.0083925   0.011006  0.014041   0.011262 0.0019103   -- 0.000043734
#> eta[1]      0.18232    0.33145   0.50197    0.32907  0.075866   --   0.0028267
#> eta[2]       0.2599    0.40914   0.53553    0.40818  0.070245   --   0.0023241
#> eta[3]      0.17094    0.25408   0.39454    0.26275  0.055515   --   0.0013749
#>                                      
#>          MC%ofSD SSeff   AC.10   psrf
#> mu[1,1]      4.5   487 0.55553 1.5041
#> mu[2,1]      4.5   505  0.5613 1.9658
#> mu[3,1]      3.6   786 0.41516 1.4078
#> mu[1,2]      2.5  1658 0.42937 1.5711
#> mu[2,2]      3.2   952 0.44329 1.7288
#> mu[3,2]      2.6  1512  0.2307 1.0752
#> tau[1,1]     4.4   514 0.33367 1.0298
#> tau[2,1]     3.5   839 0.22096 1.0213
#> tau[1,2]     3.5   839 0.22096 1.0213
#> tau[2,2]     2.3  1908 0.31946 1.0414
#> eta[1]       3.7   720 0.42281 1.2206
#> eta[2]       3.3   914 0.38269 1.1382
#> eta[3]       2.5  1630 0.23669 1.1617
#> 
#> Total time taken: 12.9 seconds

Once we obtain posterior estimates, label switching is likely to occurr. For such a reason, we need to relabel our chains as explained above. In order to relabel the chains, the function piv_rel can be used, which only needs the mcmc = res argument. Relabelled outputs can be displayed through the function piv_plot, with different options for the argument type:

The optional argument par takes four possible alternative choices: mean, sd, weight and all for the means, standard deviations, weights or all the three mentioned parameters, respectively. By default, par="all".

rel <- piv_rel(mcmc=res)
piv_plot(y = sim$y, mcmc = res, rel_est = rel, par = "mean", type = "chains")
#> Description: traceplot of the raw MCMC chains and the relabelled chains for the means parameters (coordinate 1 and 2). Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sample is not able to distinguish between the components.
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "hist")
#> Description: 3d histogram of the data along with the posterior estimates of the relabelled means (triangle points)

Example: fishery data

The Fishery dataset has been previously used by Titterington, Smith, and Makov (1985) and Papastamoulis (2016) and consists of 256 snapper length measurements. It is contained in the bayesmix package (Grün 2011). We may display the histogram of the data, along with an estimated kernel density.

data(fish)
y <- fish[,1]
hist(y, breaks=40, prob = TRUE, cex.lab=1.6,
             main ="Fishery data", cex.main =1.7,
             col="navajowhite1", border="navajowhite1")
 lines(density(y), lty=1, lwd=3, col="blue")

We assume a mixture model with \(k=5\) groups:

\[\begin{equation} y_i \sim \sum_{j=1}^k \eta_j \mathcal{N}(\mu_j, \sigma^2_j), \ \ i=1,\ldots,n, \label{eq:fishery} \end{equation}\] where \(\mu_j, \sigma_j\) are the mean and the standard deviation of group \(j\), respectively. Moreover, assume that \(z_1,\ldots,z_n\) is an unobserved latent sequence of i.i.d. random variables following the multinomial distribution with weights \(\boldsymbol{\eta}=(\eta_{1},\dots,\eta_{k})\), such that:

\[P(z_i=j)=\eta_j,\] where \(\eta_j\) is the mixture weight assigned to the group \(j\).

We fit our model by simulating \(H=15000\) samples from the posterior distribution of \((\boldsymbol{z}, \boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\eta})\). In the univariate case, if the argument software="rjags" is selected (the default option), Gibbs sampling is performed by the package bayesmix, and the priors are:

\[\begin{eqnarray} \mu_j \sim & \mathcal{N}(\mu_0, 1/B_0)\\ \sigma_j \sim & \mbox{invGamma}(\nu_0/2, \nu_0S_0/2)\\ \eta \sim & \mbox{Dirichlet}(\boldsymbol{\alpha})\\ S_0 \sim & \mbox{Gamma}(g_0/2, g_0G_0/2), \end{eqnarray}\]

with default values: \(B_0=0.1\), \(\nu_0 =20\), \(g_0 = 10^{-16}\), \(G_0 = 10^{-16}\), \(\boldsymbol{\alpha}=(1,1,\ldots,1)\). The users may specify their own hyperparameters with the priors arguments, declaring a names list such as: priors = list(mu_0=2, alpha = rep(2, k), ...).

If software="rstan" is selected, the priors are:

\[\begin{eqnarray} \mu_j & \sim \mathcal{N}(\mu_0, 1/B0inv)\\ \sigma_j & \sim \mbox{Lognormal}(\mu_{\sigma}, \tau_{\sigma})\\ \eta_j & \sim \mbox{Uniform}(0,1), \end{eqnarray}\]

where the vector of the weights \(\boldsymbol{\eta}=(\eta_1,\ldots,\eta_k)\) is a \(k\)-simplex. Default hyperparameters values are: \(\mu_0=0, B0inv=0.1, \mu_{\sigma}=0, \tau_{\sigma}=2\). Here also, the users may choose their own hyperparameters values in the following way: priors = list(mu_sigma = 0, tau_sigma = 1,...).

We fit the model using the rjags method, and we set the burnin period to 7500.

k <- 5
nMC <- 15000
res <- piv_MCMC(y = y, k = k, nMC = nMC, 
                burn = 0.5*nMC, software = "rjags")
#> Compiling model graph
#>    Declaring variables
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 256
#>    Unobserved stochastic nodes: 268
#>    Total graph size: 1050
#> 
#> Initializing model
#> 
#> 
#> Call:
#> JAGSrun(y = y, model = mod.mist.univ, control = control)
#> 
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 7501 
#> End = 22500 
#> Thinning interval = 1 
#> 
#>  Empirical mean, standard deviation and 95% CI for eta 
#>          Mean      SD     2.5%  97.5%
#> eta[1] 0.2030 0.20230 0.009156 0.5672
#> eta[2] 0.2006 0.13765 0.009236 0.5198
#> eta[3] 0.1674 0.12353 0.018239 0.5051
#> eta[4] 0.1217 0.05231 0.071350 0.1988
#> eta[5] 0.3073 0.22254 0.009916 0.5752
#> 
#>  Empirical mean, standard deviation and 95% CI for mu 
#>        Mean     SD  2.5%  97.5%
#> mu[1] 8.386 2.7806 3.377 12.341
#> mu[2] 8.413 2.1134 5.178 12.347
#> mu[3] 8.369 1.5851 5.201 10.862
#> mu[4] 3.472 0.6172 3.126  5.368
#> mu[5] 7.358 2.7863 5.069 12.292
#> 
#>  Empirical mean, standard deviation and 95% CI for sigma2 
#>             Mean     SD   2.5%  97.5%
#> sigma2[1] 0.4473 0.2510 0.1932 1.1808
#> sigma2[2] 0.4387 0.2108 0.2009 1.0234
#> sigma2[3] 0.4526 0.2415 0.2021 1.1223
#> sigma2[4] 0.2606 0.1128 0.1280 0.5479
#> sigma2[5] 0.4132 0.2442 0.2035 1.1182

First of all, we may access the true number of iterations by tiping:

res$true.iter
#> [1] 7421

We may have a glimpse if label switching ocurred or not by looking at the traceplot for the mean parameters, \(\mu_j\). To do this, we apply the function piv_rel to relabel the chains and obtain useful inferences; the only argument for this function is the MCMC result just obtained with piv_MCMC. The function piv_plot displays some graphical tools, both traceplots (argument type="chains") and histograms along with the final relabelled means (argument type="hist"). For both plot ttpes, the function returns a printed message explaining how to interpret the results.

rel <- piv_rel(mcmc=res)
piv_plot(y=y, res, rel, par = "mean", type="chains")

#> Description: traceplot of the raw MCMC chains and the relabelled chains for the means parameters. Each colored chain corresponds to one of the k distinct parameters of the mixture model. Overlapping chains may reveal that the MCMC sample is not able to distinguish between the components.
piv_plot(y=y, res, rel, type="hist")

#> Description: histograms of the data along with the estimated posterior means (red points) from raw MCMC and relabelling algorithm. The blue line is the estimated density curve.

The first plot displays the traceplots for the parameters \(\boldsymbol{\mu}\). From the left plot showing the raw outputs as given by the Gibbs sampling, we note that label switching clearly occurred. Our algorithm seems able to reorder the mean \(\mu_j\) and the weights \(\eta_j\), for \(j=1,\ldots,k\). Of course, a MCMC sampler which does not switch the labels would ideal, but nearly impossible to program. However, we could assess how two diferent sampler perform, by repeating the analysis above by selecting software="rstan" in the piv_MCMC function.

# stan code not evaluated here
res2 <- piv_MCMC(y = y, k = k, nMC = 3000, 
                 software = "rstan")
rel2 <- piv_rel(res2)
piv_plot(y=y, res2, rel2, par = "mean", type="chains")

With the rstan option, we can use the bayesplot functions on the argument:

# stan code not evaluated here
posterior <- as.array(res2$stanfit)
mcmc_intervals(posterior, regex_pars = c("mu"))

Regardless of the software that we chose, we may extract the JAGS/Stan model by typing:

cat(res$model)
#> var 
#>  b0,
#>      B0inv,
#>      nu0Half,
#>      g0Half,
#>      g0G0Half,
#>      k,
#>      N,
#>      eta[5], 
#>  mu[5], 
#>  tau[5], 
#>  nu0S0Half,
#>      S0,
#>      e[5], 
#>  y[256], 
#>  S[256]; 
#> 
#> model    {
#>  for (i in 1:N) {
#>      y[i] ~ dnorm(mu[S[i]],tau[S[i]]);
#>      S[i] ~ dcat(eta[]);
#>  }
#>  for (j in 1:k) {
#>      mu[j] ~ dnorm(b0,B0inv);
#>      tau[j] ~ dgamma(nu0Half,nu0S0Half);
#>  }
#>  S0 ~ dgamma(g0Half,g0G0Half);
#>  nu0S0Half <- nu0Half * S0;
#> 
#>  eta[] ~ ddirch(e[]);
#> }

In order to estimate the number of clusters in the data, we can now fit sparse finite mixtures as proposed by Frühwirth-Schnatter and Malsiner-Walli (2019) by assuming:

\[\boldsymbol{\eta} \sim \text{Dirichlet}(e_0),\]

where the smaller \(e_0\) and the the smaller the number of clusters a-posteriori. The function allows for a Gamma prior on \(e_0\) with hyperparameters \(a_e, b_e\) that may be chosen by the users.

res3 <- piv_MCMC(y = y, k = k, nMC = nMC, sparsity = TRUE,
                 priors = list(alpha = rep(0.001, k))) # sparse on eta
#> Compiling model graph
#>    Declaring variables
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 256
#>    Unobserved stochastic nodes: 268
#>    Total graph size: 1050
#> 
#> Initializing model
#> 
#> 
#> Call:
#> JAGSrun(y = y, model = mod.mist.univ, control = control)
#> 
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 7501 
#> End = 22500 
#> Thinning interval = 1 
#> 
#>  Empirical mean, standard deviation and 95% CI for eta 
#>             Mean        SD    2.5%     97.5%
#> eta[1] 3.674e-05 0.0008714 0.00000 1.086e-12
#> eta[2] 7.515e-01 0.1190594 0.48700 9.288e-01
#> eta[3] 3.925e-04 0.0058635 0.00000 3.997e-11
#> eta[4] 2.480e-01 0.1190932 0.07107 5.130e-01
#> eta[5] 2.549e-05 0.0007312 0.00000 2.014e-14
#> 
#>  Empirical mean, standard deviation and 95% CI for mu 
#>        Mean     SD    2.5%  97.5%
#> mu[1] 5.677 3.1578 -0.4744 11.816
#> mu[2] 5.425 0.2327  4.9587  5.846
#> mu[3] 5.613 3.1466 -0.5620 11.721
#> mu[4] 8.394 0.8106  6.9791 10.047
#> mu[5] 5.633 3.1208 -0.4693 11.684
#> 
#>  Empirical mean, standard deviation and 95% CI for sigma2 
#>            Mean     SD   2.5% 97.5%
#> sigma2[1] 2.438 1.1416 0.9608 5.312
#> sigma2[2] 1.814 0.3876 1.1119 2.623
#> sigma2[3] 2.436 1.1367 0.9649 5.276
#> sigma2[4] 3.002 0.8899 1.5451 4.944
#> sigma2[5] 2.440 1.1286 0.9741 5.291
#> [1] "MCMC has not never been able to identify the required number of groups and the process has been interrupted"
barplot(table(res3$nclusters), xlab= expression(K["+"]),
        col = "blue", border = "red", main = expression(paste("p(",K["+"], "|y)")),
        cex.main=3, yaxt ="n", cex.axis=2.4, cex.names=2.4,
        cex.lab=2)

References

Egidi, Leonardo, Roberta Pappadà, Francesco Pauli, and Nicola Torelli. 2018a. “Maxima Units Search (MUS) Algorithm: Methodology and Applications.” In Studies in Theoretical and Applied Statistics, edited by C. Perna, M. Pratesi, and A. Ruiz-Gazen, 71–81. Springer.
———. 2018b. “Relabelling in Bayesian Mixture Models by Pivotal Units.” Statistics and Computing 28 (4): 957–69.
Frühwirth-Schnatter, Sylvia. 2001. “Markov Chain Monte Carlo Estimation of Classical and Dynamic Switching and Mixture Models.” Journal of the American Statistical Association 96 (453): 194–209.
Frühwirth-Schnatter, Sylvia, and Gertraud Malsiner-Walli. 2019. “From Here to Infinity: Sparse Finite Versus Dirichlet Process Mixtures in Model-Based Clustering.” Advances in Data Analysis and Classification 13 (1): 33–64.
Grün, B. 2011. bayesmix: Bayesian Mixture Models with JAGS. R Package Version 0.7-2.” https://CRAN.R-project.org/package=bayesmix.
Papastamoulis, Panagiotis. 2016. label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs.” Journal of Statistical Software, Code Snippets 69 (1): 1–24. https://doi.org/10.18637/jss.v069.c01.
Titterington, D Michael, Adrian FM Smith, and Udi E Makov. 1985. Statistical Analysis of Finite Mixture Distributions. Wiley.