Abstract

A vector of \(k\) positive coordinates lies in the \(k\)-dimensional simplex when the sum of all coordinates is constrained to equal 1. Sampling distributions efficiently on the simplex can be difficult because of this constraint. This paper introduces a transformed logit-scale proposal for Monte Carlo Markov chain (MCMC) that naturally adjusts step size based on the position in the simplex. This enables efficient sampling on the simplex even when the simplex is high dimensional and/or includes coordinates of differing orders of magnitude. Implementation of this method is shown with the SALTSampler R package and comparisons are made to other simpler sampling schemes to illustrate this methods’ improvement in performence. A simulation of a typical calibration problem is included to demonstrate the use of the R package.

Note: This document is a code-heavy preview of a paper to be submitted on the SALTSampler package. Many of the sample sizes for the MCMC chains used in the paper have been reduced since the primary purpose of this vignette is efficient illustration of the package code while in the paper we are more focused on demonstrating robust results.

1. Introduction

A \(k\)-dimensional simplex is a space in \(\mathbb{R}^{k}\) where any point, \((x_{1},...,x_{k})\), is constrained such that \(\sum_{i=1}^{k}x_{i} = 1\) where \(x_{k} > 0\). Geometrically, this corresponds to a \(k-1\) dimensional subspace of \(\mathbb{R}^{k}\). Sampling on the simplex is common in statistics. For example, the multinomial model works in the simplex space, since the parameters of a multinomial are constrained to sum to 1. In this paper, we present the Self-Adjusting Logit Transformation (SALT) proposal to enable sampling distributions on a simplex and a corresponding SALTSampler R package to facilitate its implementation.

Monte Carlo Markov Chain (MCMC) sampling is a fundamental tool for simulating the parameters for some posterior distribution, \(p(\theta|y)\). We let \(y\) represent observed data and \(\theta=\left(\theta_{1}, ..., \theta_{k}\right)\) represent the parameter(s) of interest. New values for \(\theta\) are proposed and accepted in such a way that they converge to a sample from \(p(\theta|y)\). Specifically, for Metropolis-Hasting sampling, the parameters are set to some initial value, \(\theta^{0}\). Then, for each step, \(t = 1,2,...\) a proposal, \(\theta'\), is drawn from some proposal distribution, \(q(\theta'|\theta)\). The proposal is accepted with probability \(min(1, r)\), where \[ \begin{equation} r = \frac{p\left(\theta'|y\right)}{p(\theta|y)}\cdot \frac{q(\theta|\theta')}{q(\theta'|\theta)}. \tag{1} \end{equation} \] We refer to the factor \(p(\theta'|y)/p(\theta|y)\) as the ratio of the posterior densities and the factor \(q(\theta|\theta')/q(\theta'|\theta)\) as the ratio of the transition densities.

In particular, we show how to move one coordinate in \(\theta\) and then correspondingly adjust the other coordinates such that the proposal stays on the simplex. In addition to our package, several other software packages allow for sampling on a simplex in specific cases. In the Stan software designed to facilitate fully Bayesian modeling, all constrained variables are transformed to unconstrained variables to simplify Hamiltonian Monte Carlo (HMC). The software uses an element-wise transformation \[ \begin{equation} y_{k} = logit(z_{k}) - log \left( \frac{1}{K- k} \right) \tag{2} \end{equation} \] that is understood as a stick-breaking process. \(K\) refers to the number of pieces into which the stick is divided and \(z_{k}\) refers to the ratio of the length of the kth piece of the stick relative to the length of the \(1 - \sum \limits_{k' = 1}^{k-1}x_{k'}\) remaining pieces of the stick at the time the kth piece is broken (Stan Development Team 2014). A derivation of this same transformation using hyperspherical coordinates is provided by (Betancourt 2010). This approach works well, but is only available within the framework of Stan. In R, the hitandrun package uses linear constraints to sample a range of convex shapes including the simplex; however, its functionality is limited to drawing uniform samples (van Valkenhoef and Tervonen 2015). The functionality in Stan and in the hitandrun package are both useful. However, neither software offers the generalized simplex sampler available directly in R that the SALTSampler package provides.

The remainder of the paper focuses on the derivation, motivation, and usage of the Self-Adjusting Logit Transform (SALT) proposal to sample on a simplex. In Section 2, we define the SALT proposal and in Sections 3 and 4 we discuss details of its implementation. Section 5 compares how our approach performs relative to other simpler options for sampling on a simplex. In general, simple proposals lead to ineffective sampling, difficulties in high dimensions, and artifacts such as thinning while the SALT proposal method remains effective even in difficult cases such as when the mass of the distribution is heavily concentrated on one coordinate or the simplex is high-dimensional. Finally, in Section 7, we use the SALTSampler package to calibrate a functional model with simulated data to illustrate how to use the package to easily solve sampling problems that would othewise be difficult.

2. Self-Adjusting Logit Transform (SALT) proposal

2.1. Defining the proposal

We begin by defining a proposal for use with MCMC on a simplex that results in more effective sampling and can be broadly applied. In particular, we derive the process for updating a single element \(\theta_{i}\) and correspondingly adjusting the remaining \(\theta_{j}, i \neq j\). This process can be applied succesively for each \(\theta_{i}\) on every iteration. We begin by defining the logit transformation and its inverse \[ \begin{equation} logit(p) = log\left(\frac{p}{1-p}\right) \tag{3} \end{equation} \]

\[ \begin{equation} ilogit(x) = \frac{exp(x)}{1 + exp(x)}. \tag{4} \end{equation} \]

Using these definitions, a proposal for \(\theta_{i}'\) is given by

\[ \begin{equation} \theta_{i}' = \text{ilogit}[\text{logit}(\theta_{i}) + h_{i}Z] \tag{5} \end{equation} \]

where \(Z \sim N(0,1)\) and \(h_{i} \in \mathbb{R}^{+}\). To preserve the constraint that \(\sum_{i=1}^{k} \theta_{i} = 1\), all other coordinates of the proposal, \(\theta'\), must be correspondingly rescaled. However, care is taken to ensure that these remaining coordinates retain their relative proportions to each other. Reserve one element of \(\theta'\), \(\theta_{\ell}'\), \(\ell \neq i\), and define the others, \(\theta_{j}'\), \(j \neq i,\ell\), as

\[ \begin{equation} \theta_{j}' = (1 - \theta_{i}')\left( \frac{\theta_{j}}{1 - \theta_{i}} + U_{j} \right), \text{ } U_{j} \sim Unif (-\epsilon, \epsilon) \tag{6} \end{equation} \]

for some \(\epsilon > 0\) with \(U_{j}\) independent over \(j\). The factor \(\theta_{j}/(1 - \theta_{i})\) represents the proportion of the mass that \(\theta_{j}\) had relative to all other coordinates except \(\theta_{i}\). The factor \((1 - \theta_{i}')\) rescales each proportion relative to the remaining mass after \(\theta_{i}'\) has been determined. Adding the random noise term, \(U_{j}\), in Equation 6 expands the space on which \(\theta_{j}, j \neq i, \ell\), is defined. This reduces mathematical complexity when we later determine transition densities, since \(\theta\) has a well-defined density in the \(k-1\) dimensional space of the k-simplex, which is not the case if we omit the \(U_{j}\). Finally, we set

\[ \begin{equation} \theta_{\ell}' = (1 - \theta_{i}') \left( \frac{\theta_{\ell}}{1 - \theta_{i}} - \sum_{j \neq i,\ell}U_{j} \right) = 1 - \sum_{j \neq \ell }\theta_{j}'. \tag{7} \end{equation} \]

This transition creates a candidate, \(\theta'\), that sums to 1 where all coordinates are within the bounds \(-\epsilon \leq \theta ' \leq 1 + \epsilon\). This bound does allow coordinates to be greater than 1 or less than 0 when \(\epsilon \neq 0\), meaning a proposal could be outside the simplex. However, we will later show that \(\epsilon\) can be taken arbitrarily close to 0, thus, guaranteeing that all proposals are on the simplex.

2.2. Determining the acceptance probability

Expressing \(q(\theta'|\theta)\) requires a density that accounts for the transitions of each coordinate in the \(k-1\) dimensional space. Fortunately, this density can be easily factored,

\[ \begin{equation} q(\theta'|\theta) = q(\theta_{i}'|\theta_{i})\prod_{j \neq i,\ell} q(\theta_{j}'|\theta_{i}', \theta). \tag{8} \end{equation} \]

Considering each component seperately, we begin by deriving the density \(q(\theta_{i}'|\theta_{i})\). From Equation 5, we obtain

\[ \begin{align} logit(\theta_{i}') \sim N(logit(\theta_{i}), h_{i}^{2}). \tag{9} \end{align} \]

Transforming from the logit scale to the natural scale, we obtain

\[ \begin{align} \label{TransDensityComp1b} q(\theta_{i}'|\theta) &= \frac{1}{h_{i}}\phi\left(\frac{logit(\theta_{i}') - logit(\theta_{i})}{h_{i}}\right)|\frac{d }{d \theta_{i}'} logit(\theta_{i}')| \tag{10} \\ &= \frac{1}{h_{i}}\phi\left(\frac{logit(\theta_{i}') - logit(\theta_{i})}{h_{i}}\right)\frac{1}{\theta_{i}' (1 - \theta_{i})} \end{align} \]

where \(\phi ( \cdot )\) is the standard normal density function. Turning now to the transitions, \(q(\theta_{j}'|\theta_{i}', \theta), j \neq i,\ell\), we note that \([\theta_{j}'|\theta_{i}', \theta]\) is uniformly distributed with

\[ \begin{align} q(\theta_{j}'|\theta_{i}', \theta) =& \frac{1}{2 \epsilon (1 - \theta_{i}')} \tag{11} \end{align} \]

for \(\theta_{j}'\) consistent with Equation 6. This implies

\[ \begin{align} \prod_{j \neq i, \ell} q(\theta_{j}'|\theta_{i}', \theta) =& \prod_{j \neq i,\ell} \frac{1}{2 \epsilon (1 - \theta_{i}')} = \left[\frac{1}{ 2\epsilon (1 - \theta_{i}')}\right]^{k-2}. \tag{12} \end{align} \]

So, the overall transition density is

\[ \begin{align} q \left( \theta'|\theta \right) &= q \left( \theta_{i}'|\theta_{i} \right)\prod_{j \neq i, \ell} q(\theta_{j}'|\theta_{i}', \theta) \tag{13} \\ &= \frac{1}{h_{i}}\phi \left(\frac{logit(\theta_{i}') - logit(\theta_{i})}{h_{i}} \right) \frac{1}{\theta_{i}'(1 - \theta_{i}')} \left[\frac{1}{ 2\epsilon (1 - \theta_{i}')}\right]^{k-2}\\ \nonumber &= \frac{1}{h_{i}}\phi \left(\frac{logit(\theta_{i}') - logit(\theta_{i})}{h_{i}} \right)\frac{1}{\theta_{i}'}\left[\frac{1}{1 - \theta_{i}'}\right]^{k-1}\left[\frac{1}{ 2\epsilon}\right]^{k-2}. \end{align} \]

Noting that the Normal distribution is symmetric, the transition ratio simplifies to

\[ \begin{align} \frac{q(\theta|\theta')}{q(\theta'|\theta)} = \frac{\theta_{i}'}{\theta_{i}} \left[ \frac{(1 - \theta_{i}')}{(1 - \theta_{i})} \right]^{k-1}. \tag{14} \end{align} \] This means the acceptance ratio, \[ \begin{equation} r = \frac{p(\theta'|y)}{p(\theta|y)}\cdot \frac{q(\theta|\theta')}{q(\theta'|\theta)} = \frac{p(\theta'|y)}{p(\theta|y)}\cdot \frac{\theta_{i}'}{\theta_{i}} \left[ \frac{(1 - \theta_{i}')}{(1 - \theta_{i}^{t})} \right]^{k-1} \tag{15} \end{equation} \] is independent of \(\epsilon\). So, we can now take \(\epsilon\) arbitrarily close to 0 and ensure that all proposals are on the simplex.

In summary, to update \(\theta_{i}\), we use Equation 5 to propose a value for \(\theta_{i}'\). This value is accepted with probability \(min(1, r)\) where the acceptance ratio is as given in Equation 15. If accepted, \(\theta_{i}'\) is set to the new proposed value and all \(\theta_{j}, j \neq i\), are set to their new values as given in Equations 6 and 7. Otherwise, \(\theta' = \theta\). New values are proposed and accepted or rejected for each \(i\) on every iteration until convergence.

3. Use of logits

Our approach to moving on the simplex represents all coordinates of \(\theta\) on the logit scale. This choice is made to preserve numerical accuracy in the extreme case when one coordinate in \(\theta\) is very near 1. Specifically, in double precision arithmetic, \(1 \pm m\) is numerically equivalent to 1 when \(m \sim 5 \times 10^{-16}\). The exact value for \(m\), referred to as “machine epsilon”, may change based on the program, but can usually be found easily. For example, in R, \(m\) is the value of .Machine\$double.eps (R Core Team 2015). Because of these limits in accuracy, if any coordinate reaches a value where \(|1 - \theta_{i}| < m\), all other coordinates will be forced to have value 0 in order to satisfy the constraint \(\sum_{i=1}^{k}\theta_{i} = 1.\) So, any information about the relative proportion of the other \(\theta\) values is lost, since these coordinates are all identically zero. Using a logit scale ameliorates this issue by mapping the range \((0,1)\) monotonically onto \((-\infty, \infty)\). In particular, the relative precision on \(1 - \theta_{i}\) for \(\theta_{i}\) near 1 is vastly improved, which, consequently, improves the relative precision of all other elements as well.

Additionally, working on the logit scale causes the step size of the proposal to change appropriately based on the current value of \(\theta_{i}\). Large steps are proposed when \(\theta_{i}\) is in the middle of the simplex and small steps are proposed when \(\theta_{i}\) is near 0 or 1. We refer to this property as “self-adjustment”. It has considerable value, since taking small steps near the boundary of the simplex ensures that new proposals continue to be accepted, while taking large steps near the center of the simplex ensures that the entire posterior is explored. To show this concretely, we estimate the variance of the proposal, \(\theta_{i}'\), in Equation 5 as a function of \(\theta_{i}\) using the delta method. Applying a Taylor series expansion, we can aproximate \(\theta_{i}'\) around \(Z= 0\) \[ \begin{align} \theta_{i}' & \approx ilogit(logit(\theta_{i})) + h_{i}Z \left[ \frac{d}{dZ} ilogit(logit(\theta_{i}) + h_{i}Z) \right]_{Z = 0} \tag{16}\\ &= \theta_{i} + h_{i}Z\left[\frac{ilogit(logit(\theta_{i}) + h_{i}Z)}{1 + exp(logit(\theta_{i}) + h_{i}Z)} \right]_{Z=0}\\ &= \theta_{i} + h_{i}Z \theta_{i} (1 - \theta_{i}). \end{align} \] This gives \(var(\theta_{i}') = [h_{i}\theta_{i}(1 - \theta_{i})]^{2}\), which is maximized when \(\theta_{i} = 0.5\) and decreases monotically to 0 when \(\theta_{i}\) either decreases to 0 or increases to 1. This means that proposals made from the center of the simplex have larger variance than proposals near the edge of the simplex, where \(\theta_{i}\) is approximately 0 or 1. Since the variance of a proposal is a reflection of the step size, this illustrates that the proposal suitably adjusts based on the current value of \(\theta_{i}\).

4. Computational issues

As discussed in Section 3, transforming the problem to the logit scale preserves relative precision. However, to ensure that this precision is not lost in intermediate steps of the analysis, care must be taken when moving from the logit scale to the natural scale. To do this, we make use of two valuable R functions: log1p and expm1. “log1p(x) computes \(log(1+x)\) accurately for \(|x| << 1\)” and “expm1(x) computes \(exp(x) - 1\) accurately for \(|x| << 1\)(R Core Team 2015).

As an example of one of the many cases where this care is needed, we discuss rescaling the remaining elements of \(\theta\) after a new value of \(\theta_{i}'\) is proposed. Specifically, new proposal values are drawn for \(\theta_{i}'\) on the logit scale as defined in Equation 5. Then, the remaining \(\theta\) coordinates are updated by a scaling factor of \((1 - \theta_{i}')/(1 - \theta_{i}) = (1 - \theta_{i}')/(\sum_{j \neq i} \theta_{j})\). From the \(logit(\theta_{i}')\), we cannot directly determine this scaling factor. Instead, we find \(log(1 - \theta_{i}')\) from the \(logit(\theta_{i}')\) and then express the scaling factor on the log scale as the \(log(1 - \theta_{i}') - log(\sum_{j \neq i} \theta_{j})\). Finding \(log(1 - \theta_{i}')\) from \(logit(\theta_{i}')\) for small \(\theta_{i}'\) reduces precision if care is not taken. We demonstrate this in the general case where we aim to calculate \(log(1-p)\) from \(x = logit(p)\) for small \(p\). Mathematically, a simple approach would be to solve \(p = e^{x}/(1 - e^{x})\) and then take the \(log(1-p)\). However, applying this approach directly gives an imprecise result for \(log(1-p)\) when \(p\) is small. Letting \(p = 1 \times 10^{-17}\), we show this calculation in R

 p <- 1*10^-17
 x <- log(p/(1-p))
 log(1 - exp(x)/(1 - exp(x)))
## [1] 0

Since this calculation gives 0 numerically, we must write \(log(1-p)\) using the log1p function instead of the log function to obtain greater precision. For \(x < 0\), we can write \[ \begin{align} log(1-p) = -log(\frac{1}{1-p}) \tag{17} \\ = -log(1 + \frac{p}{1-p})\\ =-log(1 + e^{x}) \end{align} \] which can be evaluated with expression -log1p(exp(x)). For \(x>0\), we can write \[ \begin{align} log(1-p) &= log(p) - log(p) + log(1-p) \tag{18} \\ &= -log(\frac{1}{p}) - log(\frac{p}{1-p})\\ &= -log(1 + \frac{1-p}{p}) - x \\ &= -log(1 + e^{-x}) - x \end{align} \] which can be evaluated with expression -log1p(exp(-x)) - x. Applying this approach to the original example, we obtain

p <- 1*10^-17
x <- log(p/(1-p))
-log1p(exp(x))
## [1] -1e-17

which is much more precise than our initial result of zero. Thus, using this technique when rescaling the remaining coordinates of \(\theta\) after a new value of \(logit(\theta_{i}')\) is proposed gives precise values for each \(\theta_{j}', j \neq i\), rather than setting all these elements to 0. By applying this level of care both in this example and throughout our analysis, we are able to ensure precise results from the MCMC, even in cases where some elements of \(\theta\) are very small.

5. Limitations of simple proposals

The method introduced in the previous sections appears quite complicated, which suggests that a simpler alternative might exist. Unfortunately this is not the case. In this section, we demonstrate how seemingly natural proposals are inadequate for sampling on a simplex in non-trivial cases.

5.1. Succesive Dirichlet proposal

Because proposals for a simplex must meet the constraint \(\sum_{i=1}^{k}\theta_{i}' = 1\), it seems reasonable to use a Dirichlet distribution proposal for \(\theta'\), which provides a candidate with elements that sum to 1. MCMC is guaranteed to converge to the true posterior distribution, but this does not mean convergence occurs quickly. Using this naive proposal leads to a chain that does not fully explore the posterior in a reasonable length of time. Specifically, let

\[ \begin{equation} q(\theta'|\theta) \sim Dirichlet(\theta, s) \tag{19} \end{equation} \] where the Dirichlet distribution is parameterized with probability density function \[ \begin{equation} \frac{1}{\beta(s\theta)}\prod_{i=1}^{k}(\theta_{i}')^{s\theta_{i} - 1} \tag{20} \end{equation} \]

with \(\theta = (\theta_{1},...,\theta_{k})\) and \(s\) a concentration parameter. Since many programs, including R, do not have built-in functionality for sampling from a Dirichlet distribution, samples can be found using gamma distributions. First sample, \(y_{i} \sim Gamma(\theta_{i}s, 1)\) and then renormalize, \(\theta'_{i} = (1/\sum_{i=1}^{k}y_{i})(y_{1},...,y_{k})\) (Gelman et al. 2014). Alternatively, samples can be found for each coordinate sequentially using beta distributions. This approach can be more numerically stable. Beginning with \(\theta_{1}'\), draw \(\theta_{1}' \sim Beta(\theta_{1}s, \sum_{i=2}^{k}\theta_{i}s)\). Then for each \(\theta_{j}', j = 2,...,k-1\), sample \(\phi_{j} \sim Beta(\theta_{j}s, \sum_{j+1}^{k}\theta_{j}s)\) and let \(\theta_{j}' = (1 - \sum_{j=1}^{k-1}\theta_{i}'s)\phi_{j}\). Finally, let \(\theta_{k}' = 1 - \sum_{i=1}^{k-1}\theta_{i}'\) (Gelman et al. 2014).

Though these Dirichlet proposals defined in Equation 19 are theoretically valid, they perform poorly in practice. When a chain reaches a point where one or more coordinates of \(\theta\) are near 0, the corresponding proposal distribution is narrow. This results in only small steps being proposed, meaning that all coordinates sampled for a long period of time may remain in a particular corner or edge of the simplex. Other areas of the posterior are then not explored, resulting in poor mixing and slow convergence of the sampler. Numerical issues also occur in sampling the Dirichlet distribution when any of the coordinates are numerically close to 0. We illustrate this poor sampling in the following set of code where we generate the acceptance rates of MCMC chains grouped by their minimum coordinate value. To ensure that the chain does not become fixed in a position on the edge of the simplex due to numerical issues, proposals which contain coordinates with values less than 0.01 were automatically rejected in this analysis.

First, we define a Dirichlet target distribution function which returns the ratio of the log-likelihood of the posterior distribution for the proposal, \(\theta\) = ycand, to the log-likelihood of the posterior for the current value, \(\theta\) = ycurrent. For numerical accuracy, we use the Logit and LogPq functions from the SALTSampler Package.

library(SALTSampler)
## Loading required package: lattice
TargetDir <- function(ycand, ycurrent, a = NULL, dat = NULL) {
  ycand <- Logit(ycand)
  ycurrent <- Logit(ycurrent)
  out <- sum((a - 1)*(LogPq(ycand)$logp - LogPq(ycurrent)$logp))
  return(out)
}

Next, we define a proposal function for the succesive Dirichlet proposal:

PropStepDir <- function(y, s, guard) {
  p <- length(y)
  ynew <- rep(NA, p)
  #Sample first y value 
  ynew[1] <- rbeta(n = 1, shape1 = y[1]*s, shape2 = sum(y[2:p]*s))
  #Sample next p - 2  y values
  for (i in 2:(p-1)) {
    ynew[i] <- rbeta(n = 1, shape1 = y[i]*s, 
                     shape2 = sum(y[(i + 1):p]*s))*(1 - sum(ynew[1:(i - 1)]))
  }
  #Sample last y value
  ynew[p] <- 1 - sum(ynew[1:(p - 1)])
  #Calculate detailed balance term
  if (any(ynew < guard)) {
    dbt <- NA #Outside guard, always reject
  } else {
    dbt <- lgamma(sum(ynew*s)) - sum(lgamma(ynew*s)) + sum((ynew*s - 1)*log(y))
           - lgamma(sum(y*s))
    + sum(lgamma(y*s)) - sum((y*s - 1)*log(ynew)) 
  }
  #Return new logit-scaled point and corresponding dbt
  attr(ynew, 'dbt') <- dbt
  return(ynew)
}

Then, we write a function to conduct the sampling:

RunMhDir <- function(center, B, concentration, s, type, dat = NULL, guard) {
  #Redefining parameters, setting initial time, and making empty vectors and matrices
  zz = proc.time();
  p <- length(center)
  Y <- array(0, c(B, p)) 
  moveCount <- 0
  center <- center/sum(center)
  a <- concentration*center 
  ycurrent <- center
  #Run sampler
  for (i in 1:B) { 
    #Propose step
    ycand <- PropStepDir(ycurrent, s, guard)
    #Decide to accept or reject
    if (any(ycand < guard)) {
      move <- FALSE #Outside guard, always reject
    } else {
      move <- (log(runif(1)) < attr(ycand, 'dbt') + TargetDir(ycand, ycurrent, a, dat))
    }
    if (!is.na(move) & move) {
      ycurrent <- ycand
      moveCount <- moveCount + 1
    }
    #Store y for this iteration
    Y[i, ] <- ycurrent
  }
  #Timing
  runTime <- proc.time() - zz
  #Return results
  return(list(Y = Y, runTime = runTime, moveCount = moveCount, p = p, 
              center = center, B = B, concentration = concentration, s = s, 
              type = type, dat = NULL, a = a, moveCount = moveCount))
}

Finally, we run the sampler:

succDir <- RunMhDir(center = c(1/3, 1/3, 1/3), B = 5e3, concentration = 3, 
                    s = 1.2, type = 'dirichlet', dat = NULL, guard = 0.01)

To evaluate the results, we make a table that displays the proportion of times a proposal, \(\theta'\), is selected sorted by the value of the minimum coordinate of the current simplex point \(\theta\). \(s\) has been selected to achieve roughly optimal acceptance rates from samples where the minimum coordinate of \(\theta\) is greater than 0.1.

tab1 <- matrix(nrow = 6, ncol = 3)
tab1[, 1] <- c("0.01-0.02", "0.02-0.03", "0.03-0.05", "0.05-0.1", ">0.1", "Overall")
colnames(tab1) <- c("Minimum Coordinate", "Number of Samples", "Acceptance Rate")
minVal <- apply(succDir$Y[1:succDir$B - 1, ], 1, min)
g1 <- which(minVal < .02)
tab1[1, 2] <- length(g1)
tab1[1, 3] <- round(sum(succDir$Y[g1, ] != succDir$Y[g1 + 1, ])/length(g1), 5)
g2 <- which(minVal < .03 & minVal >= 0.02)
tab1[2, 2] <- length(g2)
tab1[2, 3] <- round(sum(succDir$Y[g2, ] != succDir$Y[g2 + 1, ])/length(g2), 5)
g3 <- which(minVal < .05 & minVal >= 0.03)
tab1[3, 2] <- length(g3)
tab1[3, 3] <- round(sum(succDir$Y[g3, ] != succDir$Y[g3 + 1, ])/length(g3), 5)
g4 <- which(minVal < .1 & minVal >= .05)
tab1[4, 2] <- length(g4)
tab1[4, 3] <- round(sum(succDir$Y[g4, ] != succDir$Y[g4 + 1, ])/length(g4), 5)
g5 <- which( minVal >= 0.1)
tab1[5, 2] <- length(g5)
tab1[5, 3] <- round(sum(succDir$Y[g5, ] != succDir$Y[g5 + 1, ])/length(g5), 5)
tab1[6, 2] <- succDir$B - 1
tab1[6, 3] <- round(succDir$moveCount/(succDir$B - 1), 5)
tab1
##      Minimum Coordinate Number of Samples Acceptance Rate
## [1,] "0.01-0.02"        "686"             "0.11808"      
## [2,] "0.02-0.03"        "282"             "0.18085"      
## [3,] "0.03-0.05"        "914"             "0.18381"      
## [4,] "0.05-0.1"         "1155"            "0.24675"      
## [5,] ">0.1"             "1962"            "0.33945"      
## [6,] "Overall"          "4999"            "0.08342"

As this table illustrates, this proposal results in low acceptance rates for samples with lower minimum coordinates of \(\theta\), resulting in the chain primarily sampling in these regions. Since the acceptance rates are dependent on \(s\), changing \(s\) will only change which values of \(\theta\) cause low acceptance, not eliminate them. So, we have shown that an efficient sampler can not be obtained with this approach.

5.2. Restricted proposal

Another natural choice for the proposal would be to make updates successively for each element \(\theta_{i}\) on each iteration of the algorithm. However, when \(\theta = \left(\theta_{1},..., \theta_{k}\right)\) is restricted to the simplex, updating one element per iteration is impossible, because when any new \(\theta_{i}\) is selected the constraint \(\sum_{i=1}^{k}\theta_{i} = 1\) is violated if the other \(\theta_{j}, i \neq j\), remain constant. This means that any given \(\theta_{k}\) cannot be moved independently of any entry in the vector \(\theta\), so, proposals that treat individual coordinates seperately are unusable.

To avoid this issue, a seemingly good option is to update all coordinates on every iteration and then determine whether to accept or reject the entire set of coordinates. In this framework, one coordinate can be defined as a function of the other coordinates. Often, though not always, this results in a proposal that satisfies the constraints of a simplex and so has positive probability of acceptance. If the proposal does not fall on the simplex, it has a log-likelihood value of \(-\infty\) and so is always rejected. Thus, all sampled coordinates accepted in the chain are on the simplex. As a precise example of this approach, consider for each \(\theta_{i}', i = 1,..., k-1\), drawing \(\theta_{i}' \sim N(\theta_{i}, h)\) where \(h\) is a selected step size. Then let \(\theta_{k}' = 1 - \sum_{i=1}^{k-1} \theta_{i}'\). To improve mixing, which \(\theta_{i}'\) is defined as \(\theta_{k}'\) should be selected randomly at each iteration. Some proposals may have \(\theta_{i} \leq 0\) or \(\theta_{i} \geq 1\), but these will always be rejected.

Though mathematically sound, this approach has some noteworthy disadvantages. This sampler is increasingly ineffective as the dimension of the simplex increases. Each individual coordinate has positive probability of being less than or equal to 0 or greater than or equal to 1. So, as the dimension increases, so does the probability of drawing at least one coordinate that violates the bounds of a simplex. Consequently, in higher dimensions, very small steps must be taken to gain an adequate acceptance rate. However, taking such small steps gives highly correlated samples and so the chain fails to explore the space well. This means that the chain never performs well. With small steps, the chain mixes poorly and with large steps, the chain rejects the majority of proposals. To illustrate this effect, we draw 5,000 samples from a twenty-dimensional

Dirichlet(\(\theta\), 20) distribution using this proposal with different step sizes where \(\theta = (1/20, ..., 1/20)\).

We can use the same Dirichlet target function as we used with the succesive Dirichlet proposal, TargetDir, so we now only need to define the proposal function

PropStepRP <- function(y, s) {
  p <- length(y)
  ynew <- rep(NA, p)
  #Set one coordinate aside at random
  hold <- sample(c(1:p), 1)
  use <- 1:p
  use <- use[use != hold]
  #Sample other p - 1 coordinates
  for (i in sample(use)) {
    ynew[i] <- rnorm(1, y[i], s)
  }
  #Calculate coordinate p from other coordinates
  ynew[hold] <- 1 - sum(ynew[use])
  #Calculate detailed balance term
  if (sum(ynew) != 1 || any(ynew <= 0) ) {
    dbt <- 0 #Outside simplex, always reject
  } else {
    dbt <- sum(dnorm(y, ynew, s, log = TRUE) - dnorm(ynew, y, s, log = TRUE))
  }
  #Return new logit-scaled point and corresponding dbt
  attr(ynew, 'dbt') <- dbt
  return(ynew)
}

and a sampler function:

RunMhRP <- function(center, B, concentration, s, type, dat = NULL){
  #Redefining parameters, setting initial time, and making empty vectors and matrices
  zz = proc.time();
  p <- length(center)
  S <- array(0, c(B, p)) 
  moveCount <- 0
  center <- center/sum(center)
  a <- concentration*center 
  ycurrent <- center
  #Run sampler
  for (i in 1:B) { 
    #Propose new y
    ycand <- PropStepRP(ycurrent, s)
    #Decide to accept or reject
    if(sum(ycand) != 1 || any(ycand <= 0)){
      S[i, ] <- ycurrent #reject if outside simplex
    } else {
      move <- (log(runif(1)) < attr(ycand, 'dbt') + TargetDir(ycand, ycurrent, a, dat))
      if (!is.na(move) & move){
        ycurrent <- ycand
        moveCount <- moveCount + 1
      } 
      #Store y for this iteration
      S[i, ] <- ycurrent
    }
  }
  #Timing
  runTime <- proc.time()-zz
  #Return results
  return(list(S = S, runTime = runTime, moveCount = moveCount, p = p, 
              center = center, B = B, concentration = concentration, s = s,
              type = type, dat = NULL, a = a))
}

Then, we can sample values and make a table showing the step sizes with their corresponding acceptance rates and average effective sample size across the twenty dimensions.

library(coda) #for effective sample size function
#Vector of possible s values
sVec <- c(0.0005, 0.001, 0.005, 0.01, 0.05, 0.1)
#Make empty table for results
tab2 <- matrix(nrow = 6, ncol = 3)
colnames(tab2) <- c("Step Size(s)", "Acceptance Rate", "Mean Effective Sample Size")
tab2[, 1] <- sVec
#Run sampler for each s value and record acceptance rate and mean effective sample size
for(i in 1:6){
  s <- rep(sVec[i], 20)
  highDim <- RunMhRP(center = rep(1/20, 20), B = 5e3, concentration = 20, s = s,
                    type = 'dirichlet', dat = NULL)
  tab2[i, 2] <- round(highDim$moveCount/highDim$B, 5)
  tab2[i, 3] <- round(mean(effectiveSize(highDim$S)), 5)
}
tab2
##      Step Size(s) Acceptance Rate Mean Effective Sample Size
## [1,]        5e-04          0.9696                    4.29610
## [2,]        1e-03          0.9056                    5.33494
## [3,]        5e-03          0.4046                   15.85932
## [4,]        1e-02          0.1638                   20.29800
## [5,]        5e-02          0.0004                    3.47266
## [6,]        1e-01          0.0000                    0.00000

As expected, the acceptance rates are high for small step sizes, but the effective sample sizes are also small, because of the high autocorrelation. For the larger step sizes, the acceptance rates are low, which also gives small effective sample sizes. This illustrates that no efficient sampler can be found using this approach in high dimensions, either too many proposals will be rejected or the autocorrelation among samples will be too high.

Finally, samples obtained with this approach are often difficult to interpret. We now sample and plot a chain of 10,000 samples from a Dirichlet(\(\theta\), 3) distribution where \(\theta = (1/3, ..., 1/3)\).

#Run chain
thinning <- RunMhRP(center = c(1/3, 1/3, 1/3), B = 1e4, concentration = 3,
                  s = rep(.3, 3), type = 'dirichlet', dat = NULL)
#Plot
TriPlot(thinning) #Use SALTSampler plotting function
mtext(sprintf("h = (2, 2, 2) Proposal, Acceptance Rate: %s", 
              round(thinning$moveCount/thinning$B, 3)[1], 
              round(thinning$moveCount/thinning$B, 3)[2],
              round(thinning$moveCount/thinning$B, 3)[3]), side = 1, line = 0, cex = 0.8)

The sample mean of the chain matches the theoretical mean and theoretical mode of the true posterior distribution; however, the observed samples do not look as one would expect. The corners of the simplex appear thinly sampled while the center of the simplex appears densely sampled. In reality, near the corner of the simplex proposals are often not on the simplex and so are always rejected. This lowers the acceptance rates in these areas, so more points are repeated which gives the appearance of thin sampling. Though not technically wrong, this effect can be misleading and so ideally should be avoided.

6. Utility of the SALT proposal

Each of the problems outlined in Section 5 are solved by using the SALT proposal. Beginning with the simple task of sampling a uniform Dirichlet distribution in three dimensions, we again plot 10,000 samples from Dirichlet(\(\theta\), 3) distribution projected into two dimensions where \(\theta = (1/3, 1/3, 1/3)\). This can be done in just a few lines of code with the functions in the SALTSampler R package.

#Run chain
noThinning <- RunMh(center = c(1/3, 1/3, 1/3), B = 1e4, concentration = 3, h = c(2, 2, 2), 
                    type = 'dirichlet')
#Plotting
TriPlot(noThinning) 
mtext(sprintf("h = (2, 2, 2) Proposal, Acceptance Rates: (%s, %s, %s)",
              round(noThinning$moveCount/noThinning$B, 3)[1], 
              round(noThinning$moveCount/noThinning$B, 3)[2], 
              round(noThinning$moveCount/noThinning$B, 3)[3]), side = 1, line = 0, cex = 0.8)

The results for the SALT proposal shows a uniform sampling across the space. So, unlike for the simpler proposals, visualizing the posterior is straightforward. There is no need to account for apparent changes in sampling density due to changing acceptance rates.

Further, the SALT proposal performs well even for high-dimensional simplexes. Its design ensures that all proposals are on the simplex, so an increasing number of dimensions does not reduce the acceptance rate. This ensures that sampling is still efficient for high-dimensional simplexes. To illustrate, we return to drawing 5,000 samples from a uniform 20-dimensional Dirichlet distribution, but now use the SALT proposal. For example, with step size 2.4, we obtain reasonable mean acceptance rates and effective sample sizes:

#Run chain
highDim <- RunMh(center = rep(1/20, 20), B = 5e3, concentration = 20, h = rep(2.4, 20),
                 type = 'dirichlet')
#Mean acceptance rate
mean(highDim$moveCount/highDim$B)
## [1] 0.48688
#Mean effective sample size
mean(effectiveSize(highDim$Y)) 
## [1] 908.6697

So, the chain both accepts a good number of proposals and adequately explores the full posterior space.

Additionally, when the values of coordinates differ by orders of magnitude, our proposed method remains accurate. As discussed in Section 3, working on the logit scale allows for precision to be maintained even when coordinates differ greatly. For example, sampling a Dirichlet(\(\theta\), \(1 \times 10^{6}\)) distribution where \(\theta = (0.0001, 0.01, 0.9899)\) with step size, \(h = (0.20 ,0.02, 0.02)\), we obtain the following posterior means and acceptance rates:

#Run chain
ordMag <- RunMh(center = c(.0001, .01, .9899), B = 5e3, concentration = 1e6, 
                h = c(.2, .02, .02), type = 'dirichlet')
#Acceptance rates
ordMag$moveCount/ordMag$B
## [1] 0.5050 0.5014 0.4928
#Posterior means
apply(ordMag$S, 2, mean) 
## [1] 9.994319e-05 9.997400e-03 9.899027e-01

These values are consistent with the true posterior means, highlighting the good performance of the sampler even for this more challenging posterior. Step sizes were found relatively easily via trial and error bearing in mind that larger steps are more appropriate for larger coordinates and vice versa.

7. Calibration example

As an illustration of the utility of this proposal and the SALTSampler package, we present the following simulated example. In calibration problems, the underlying relationship between a set of inputs and outputs is known and the outputs are observed. The goal is to deduce what inputs could have generated the observed outputs. In many real models, it will be known that the inputs are constrained to a simplex. For example, when the inputs represent the composition of different substances in a material, the sum of the proportion of each type of material must be 1.

To simulate this common problem, we generate a noisy set of responses, \(z = g(y) + \epsilon\) given a relationship, \(g(y) = (1000y_{1}^{4}*y_{2}^{3})/\sqrt{20 + y_{3}}\), where \(g(y)\) is the expected response given inputs \(y\). For this example, we let the unknown \(y = (1/3, 1/3, 1/3)\) and \(\epsilon \sim N(0, 2^{2})\). Using R, we define a function for \(g(y)\) and then draw 1000 samples from a normal distribution centered around \(g(y = (1/3, 1/3, 1/3))\) with variance \(2^{2}\). We allow inputs on the logit scale in the calibration function to enable using this function for sampling as well.

#Function to calibrate
CalibFn <- function(y, logit = FALSE) {
  if (logit == TRUE) {
    y <- exp(LogPq(y)$logp)
  }
  out <- 1e3*y[1]^3*y[2]^3/sqrt(20 + y[3])
  return(out)
}
#Generate samples
z <- rnorm(1000, CalibFn(c(1/3, 1/3, 1/3), 2))

We now find the distribution of \(y\) that could have generated these simulated \(z\) values. Taking a Bayesian approach, we set a uniform prior for \(y\) on the simplex and obtain the following posterior distribution for \(f(y|z)\) \[ \begin{align} f(y|z) &\propto f(z|y)f(y)\\ \tag{21} & = N(z|g(y), 2^{2}) Dirichlet(y| a = (1, 1, 1), s = 1). \end{align} \] This is a unique posterior distribution to this problem, so we write the log likelihood for the posterior as a target function,

Target <- function(ycand, ycurrent, a, dat, pars = NULL) {
  out <- sum(dnorm(dat, CalibFn(ycand, logit = TRUE), 2, log = TRUE)) - 
         sum(dnorm(dat, CalibFn(ycurrent, logit = TRUE), 2, log = TRUE)) + 
         sum((a - 1)*(LogPq(ycand)$logp - LogPq(ycurrent)$logp))
  return(out)
} 

Finally to sample the possible inputs we use the RunMh function to execute the Metropolis Hasting algorithm, selecting type = 'user' to indicate that we have defined a non-standard posterior.

inputDist <- RunMh(center = c(1/3, 1/3, 1/3), B = 3e4, concentration = 3, 
                   h = c(0.2, 0.2, 0.2), type = 'user', dat = z)

We plot the resulting distribution projected into a two-dimensional space using the TriPlot function and obtain the acceptance rate for each coordinate as well as traceplots on the logit scale and true scale with the Diagnostics function.

TriPlot(inputDist)
mtext("Projected Samples of Y", side = 3, cex = 1.25, font = 2)

par(mfrow = c(1,2))
Diagnostics(inputDist)

## $acceptRate
## [1] 0.6886333 0.6940000 0.6476000

8. Conclusion

Conducting MCMC on a simplex can pose challenges, especially when the simplex is high dimensional and/or has coordinates that differ by orders of magnitude. However, with care in selecting the proposal distribution and attention to the numerical constraints of a given programming language, these challenges are not insurmountable. Using this approach and the corresponding SALTSampler R package will simplify this process, allowing MCMC sampling on a simplex to be done with relative ease.

9. Acknowledgements

The authors would like to thank Todd Graves for helpful discussions.

References

Betancourt, MJ. 2010. “Cruising the Simplex: Hamiltonian Monte Carlo and the Dirichlet Distribution.” In AIP Conf. Proc., 1443:157. arXiv: 1010.3436.

Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2014. Bayesian Data Analysis. Vol. 2. Taylor & Francis.

R Core Team. 2015. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org.

Stan Development Team. 2014. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0. http://mc-stan.org/.

van Valkenhoef, Gert, and Tommi Tervonen. 2015. Hitandrun: “Hit and Run” and “Shake and Bake” for Sampling Uniformly from Convex Shapes. http://CRAN.R-project.org/package=hitandrun.