Zero-Inflated and Two-Part Mixed Effects Models

Dimitris Rizopoulos

2018-07-03

Mixed Models with an Extra Zero Part

Function mixed_model() of GLMMadaptive can also be used to fit zero-inflated and two-part mixed effects models. For both types of models, a suitable family object needs to be specified, as outlined in the vignette Custom Models, and the arguments zi_fixed and zi_random of mixed_model() also come into play. In these arguments, the user specifies the fixed and random effects formulas of the logistic regression for the zero-part of the distribution of the outcome. We should note that the user has the option to leave zi_random set to NULL, in which case the zero-part is a logistic regression with only fixed effects and no random effects.

In addition, in the specification of the family object and in order to better facilitate the internal computations, the user may specify the function score_eta_zi_fun that calculates the derivative of the log probability density function or the log probability mass function with respect to eta_zi that denotes the linear predictor of the logistic regression for the zero-part. Here we provide three examples to illustrate how such models can be fitted.
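
For instance, a schematic call has the following form (placeholder variable names, shown only to illustrate where the extra arguments go):

mixed_model(fixed = y ~ x1 + x2, random = ~ 1 | id, data = DF,
            family = zi.poisson(), zi_fixed = ~ x1, zi_random = ~ 1 | id)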

Zero-Inflated Poisson Mixed Effects Model

We start our illustrations by showing how we can fit a zero-inflated Poisson mixed effects model. The specification of the required family object is already available in the package as the object returned by zi.poisson(). Currently, only the log link is allowed. First, we simulate longitudinal data from a zero-inflated negative binomial distribution:

set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time

# we construct a data frame with the design: 
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
                 time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                 sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))

# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ 1, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)

betas <- c(1.5, 0.05, 0.05, -0.03) # fixed effects coefficients non-zero part
shape <- 2
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.4 # variance of random intercepts zero part

# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 2, drop = FALSE]))
# we simulate negative binomial longitudinal data
DF$y <- rnbinom(n * K, size = shape, mu = exp(eta_y))
# we set the extra zeros
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
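
Before fitting any models, it is useful to check how many zeros the simulation produced; for example:

# proportion of zeros in the simulated outcome
mean(DF$y == 0)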

A zero-inflated Poisson mixed model with only fixed effects in the zero part is fitted with the following call to mixed_model():

fm1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
                   family = zi.poisson(), zi_fixed = ~ sex)

fm1
#> 
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF, 
#>     family = zi.poisson(), zi_fixed = ~sex)
#> 
#> 
#> Model:
#>  family: zero-inflated poisson
#>  link: log 
#> 
#> Random effects covariance matrix:
#>                StdDev
#> (Intercept) 0.8272898
#> 
#> Fixed effects:
#>    (Intercept)      sexfemale           time sexfemale:time 
#>   1.5275644687   0.0173647260  -0.0040365383   0.0004326841 
#> 
#> Zero-part coefficients:
#> (Intercept)   sexfemale 
#>   -1.210597    0.498611 
#> 
#> log-Lik: -2223.937

As noted above, only the log link is currently available for the non-zero part and the logit link for the zero-part; hence, the estimated fixed effects for the two parts are interpreted accordingly. We extend fm1 by also allowing for a random intercept in the zero-part. We should note that, by default, the random intercept of the non-zero part is correlated with the random intercept of the zero-part:

fm2 <- update(fm1, zi_random = ~ 1 | id)

fm2
#> 
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF, 
#>     family = zi.poisson(), zi_fixed = ~sex, zi_random = ~1 | 
#>         id)
#> 
#> 
#> Model:
#>  family: zero-inflated poisson
#>  link: log 
#> 
#> Random effects covariance matrix:
#>                 StdDev    Corr
#> (Intercept)     0.7391        
#> zi_(Intercept)  0.7157 -0.6334
#> 
#> Fixed effects:
#>    (Intercept)      sexfemale           time sexfemale:time 
#>   1.5902225072  -0.0117290754  -0.0029500093   0.0001046371 
#> 
#> Zero-part coefficients:
#> (Intercept)   sexfemale 
#>  -1.1201182   0.4416912 
#> 
#> log-Lik: -2214.306

We test if we need the extra random effect using a likelihood ratio test:

anova(fm1, fm2)
#> 
#>         AIC     BIC  log.Lik   LRT df p.value
#> fm1 4461.87 4480.11 -2223.94                 
#> fm2 4446.61 4446.61 -2214.31 19.26  2   1e-04

The results suggest that the extra random effect improves the fit of the model.
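
Because the non-zero part uses the log link, exponentiating its fixed effects gives multiplicative effects on the expected count scale; for example:

# multiplicative effects of the non-zero part on the expected count scale
exp(fixef(fm2))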

Zero-Inflated Negative Binomial Mixed Effects Model

We continue with the same data, but we now take into account the potential over-dispersion in the data using a zero-inflated negative binomial model. To fit this mixed model we use almost identical syntax to what we just did above; the only difference is that we now specify the zi.negative.binomial() object as the family:

gm1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
                   family = zi.negative.binomial(), zi_fixed = ~ sex)

gm1
#> 
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF, 
#>     family = zi.negative.binomial(), zi_fixed = ~sex)
#> 
#> 
#> Model:
#>  family: zero-inflated negative binomial
#>  link: log 
#> 
#> Random effects covariance matrix:
#>                StdDev
#> (Intercept) 0.8291406
#> 
#> Fixed effects:
#>    (Intercept)      sexfemale           time sexfemale:time 
#>    1.471442764    0.027018363    0.003096235   -0.006574982 
#> 
#> Zero-part coefficients:
#> (Intercept)   sexfemale 
#>  -1.5551763   0.5913118 
#> 
#> dispersion parameter:
#>  2.227321 
#> 
#> log-Lik: -1958.746
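
Since the zero part uses the logit link, exponentiating its coefficients gives odds ratios for observing an extra zero. A possible way to extract them, assuming the sub_model argument of the fixef() method for MixMod objects, is:

# odds ratios for the zero part (sub_model argument assumed available)
exp(fixef(gm1, sub_model = "zero_part"))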

Similarly to fm1, in gm1 we specified only fixed effects in the logistic regression for the zero part. We now compare this model with the zero-inflated Poisson model that allowed for a random intercept in the zero-part. The comparison can be done with the anova() method; because the two models are not nested, we set test = FALSE in the call to anova(), i.e.:

anova(gm1, fm2, test = FALSE)
#> 
#>         AIC     BIC  log.Lik df
#> gm1 3933.49 3954.33 -1958.75   
#> fm2 4446.61 4446.61 -2214.31  1

We observe that accounting for the over-dispersion seems to improve the fit more than including the random intercepts term in the zero-part.

Two-Part Mixed Effects Model for Semi-Continuous Data

To further illustrate the flexibility provided by GLMMadaptive in allowing users to specify their own family objects with a specific log density function, we consider the setting of multivariate semi-continuous data. That is, continuous data with excess zeros. In the literature the class of two-part / hurdle mixed models has been proposed to analyze such data. These models specify a logistic regression for the dichotomous indicator that the outcome is zero or not, and a standard linear mixed model for the logarithmic transformation of the non-zero responses.

We start again by simulating some longitudinal data from this model:

set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time

# we construct a data frame with the design: 
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
                 time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                 sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))

# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)

betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients non-zero part
sigma <- 0.5 # standard deviation error terms non-zero part
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.1 # variance of random slopes non-zero part
D33 <- 0.4 # variance of random intercepts zero part

# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)), rnorm(n, sd = sqrt(D33)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1:2, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 3, drop = FALSE]))
# we simulate log-normal longitudinal data
DF$y <- exp(rnorm(n * K, mean = eta_y, sd = sigma))
# we set the zeros from the logistic regression
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0

We define the family object: the minimal requirement is to specify the log_dens component and the link function. However, as also explained in the Custom Models vignette, the internal calculations will be faster and more stable if the user also specifies the derivative of the log density with respect to the linear predictor of the non-zero part (function score_eta_fun()), the derivative of the log density with respect to phis (function score_phis_fun()), and, because we have a model with a zero-part, the derivative of the log density with respect to the linear predictor of the zero part (function score_eta_zi_fun()). In our case, log_dens() returns, for positive responses, the log of one minus the zero probability plus the log-normal log density and, for zero responses, the log of the zero probability, with the zero probability modelled on the logit scale. Finally, to be able to simulate from the model using the simulate() method, a simulate() function can also be included in the family object. Hence, the family object for the two-part model is defined as:

two_part.lognormal <- function () {
    stats <- make.link("identity")
    log_dens <- function (y, eta, mu_fun, phis, eta_zi) {
        sigma <- exp(phis)
        # binary indicator for y > 0
        ind <- y > 0
        # non-zero part
        eta <- as.matrix(eta)
        eta_zi <- as.matrix(eta_zi)
        out <- eta
        out[ind, ] <- plogis(eta_zi[ind, ], lower.tail = FALSE, log.p = TRUE) + 
            dnorm(x = log(y[ind]), mean = eta[ind, ], sd = sigma, log = TRUE)
        # zero part
        out[!ind, ] <- plogis(eta_zi[!ind, ], log.p = TRUE)
        attr(out, "mu_y") <- eta
        out
    }
    score_eta_fun <- function (y, mu, phis, eta_zi) {
        sigma <- exp(phis)
        # binary indicator for y > 0
        ind <- y > 0
        # non-zero part
        eta <- as.matrix(mu)
        out <- eta
        out[!ind, ] <- 0
        out[ind, ] <- (log(y[ind]) - eta[ind, ]) / sigma^2
        out
    }
    score_eta_zi_fun <- function (y, mu, phis, eta_zi) {
        ind <- y > 0
        probs <- plogis(as.matrix(eta_zi))
        out <- 1 - probs
        out[ind, ] <- - probs[ind, ]
        out
    }
    score_phis_fun <- function (y, mu, phis, eta_zi) {
        sigma <- exp(phis)
        # binary indicator for y > 0
        ind <- y > 0
        # non-zero part
        eta <- as.matrix(mu)
        out <- eta
        out[!ind, ] <- 0
        out[ind, ] <- - 1 + (log(y[ind]) - eta[ind, ])^2 / sigma^2
        out
    }
    simulate <- function (n, mu, phis, eta_zi) {
        y <- rnorm(n = n, mean = mu, sd = exp(phis))
        y[as.logical(rbinom(n, 1, plogis(eta_zi)))] <- 0
        y
    }
    structure(list(family = "two-part log-normal", link = stats$name, 
                   linkfun = stats$linkfun, linkinv = stats$linkinv, log_dens = log_dens,
                   score_eta_fun = score_eta_fun, score_eta_zi_fun = score_eta_zi_fun,
                   score_phis_fun = score_phis_fun, simulate = simulate),
              class = "family")
}
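
As a quick, purely illustrative sanity check (not part of the fitting procedure), we can evaluate the log_dens() component of this family object at a few artificial values and verify that it returns finite log-densities for both zero and positive responses:

fam <- two_part.lognormal()
y_test <- c(0, 0.1, 0.5, 1, 2)
eta_test <- matrix(0, nrow = length(y_test))     # linear predictor non-zero part
eta_zi_test <- matrix(0, nrow = length(y_test))  # linear predictor zero part
fam$log_dens(y_test, eta_test, mu_fun = identity, phis = log(0.5),
             eta_zi = eta_zi_test)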

Then to fit the model we provide the user-defined family object in the family argument of mixed_model() specifying also that we have one dispersion parameter in the family (i.e., n_phis = 1), and that in the zero-part we only include fixed effects:

km1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF, 
                  family = two_part.lognormal(), n_phis = 1,
                  zi_fixed = ~ sex)

km1
#> 
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF, 
#>     family = two_part.lognormal(), zi_fixed = ~sex, n_phis = 1)
#> 
#> 
#> Model:
#>  family: two-part log-normal
#>  link: identity 
#> 
#> Random effects covariance matrix:
#>                StdDev
#> (Intercept) 0.9615228
#> 
#> Fixed effects:
#>    (Intercept)      sexfemale           time sexfemale:time 
#>    -2.08102787    -0.24015083     0.22302346    -0.05696434 
#> 
#> Zero-part coefficients:
#> (Intercept)   sexfemale 
#>  -1.6034499   0.6713555 
#> 
#> phi parameters:
#>  -0.3352479 
#> 
#> log-Lik: -1214.208
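
The phi parameter is reported on the log scale; back-transforming it gives the residual standard deviation (here assuming the fitted MixMod object stores the estimate in its phis component):

# residual standard deviation on the original scale
exp(km1$phis)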

Hence, the estimated standard deviation of the error terms is exp(phis) ≈ 0.72. We extend the model by also allowing for a random intercept in the zero-part; using the || symbol in the random argument, we specify that the covariance matrix of the random effects is diagonal, i.e., that the two random intercept terms are uncorrelated:

km2 <- update(km1, random = ~ 1 || id, zi_random = ~ 1 | id)

km2
#> 
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 || id, data = DF, 
#>     family = two_part.lognormal(), zi_fixed = ~sex, zi_random = ~1 | 
#>         id, n_phis = 1)
#> 
#> 
#> Model:
#>  family: two-part log-normal
#>  link: identity 
#> 
#> Random effects covariance matrix:
#>                StdDev
#> (Intercept)    0.9619
#> zi_(Intercept) 0.6043
#> 
#> Fixed effects:
#>    (Intercept)      sexfemale           time sexfemale:time 
#>    -2.08026578    -0.24025852     0.22301074    -0.05698636 
#> 
#> Zero-part coefficients:
#> (Intercept)   sexfemale 
#>  -1.7160273   0.7096789 
#> 
#> phi parameters:
#>  -0.3350702 
#> 
#> log-Lik: -1210.313
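
As with the zero-inflated Poisson models, the contribution of the extra random intercept can be assessed with a likelihood ratio test (keeping in mind that the null hypothesis places the extra variance on the boundary of its parameter space):

anova(km1, km2)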

Finally, we show how the simulate() method can be used to perform a replication / posterior predictive check. In particular, in the following code we compare the empirical distribution function estimated from the observed data with estimates of the empirical distribution function obtained from data simulated/replicated from the model. In the call to simulate() below we also account for the variability in the maximum likelihood estimates by setting acount_MLEs_var = TRUE:

par(mar = c(2.5, 2.5, 0, 0), mgp = c(1.1, 0.5, 0), cex.axis = 0.7, cex.lab = 0.8)
y <- DF$y
y[y > 0] <- log(y[y > 0])
x_vals <- seq(min(y), max(y), length.out = 500)
out <- simulate(km2, nsim = 30, acount_MLEs_var = TRUE)
rep_y <- apply(out, 2, function (x, x_vals) ecdf(x)(x_vals), x_vals = x_vals)
matplot(x_vals, rep_y, type = "l", lty = 1, col = "lightgrey", 
        xlab = "Response Variable", ylab = "Empirical CDF")
lines(x_vals, ecdf(y)(x_vals))
legend("bottomright", c("replicated data", "observed data"), lty = 1, 
       col = c("lightgrey", "black"), bty = "n", cex = 0.8)