Function mixed_model()
of GLMMadaptive can also be used to fit zero-inflated and two-part mixed effects models. For both types of models, a suitable family
object needs to be specified as outlined in the vignette Custom Models, and also arguments zi_fixed
and zi_random
of mixed_model()
come into play. In these arguments, the user can specify the fixed and random effects formulas
of the logistic regression for the zero-part of the distribution of the outcome. We should note that the user has the option to leave zi_random
set to NULL
, in which case the for the zero-part we have a logistic regression with only fixed effect and no random effects.
In addition, in the specification of the family
object and in order to better facilitate the internal computations, the user may specify the function score_eta_zi_fun
that calculates the derivative of the log probability density function or the log probability mass function with respect to eta_zi
that denotes the linear predictor of the logistic regression for the zero-part. Here we provide three examples to illustrate how such models can be fitted.
We start our illustrations by showing how we can fit a zero-inflated Poisson mixed effects model. The specification of the required family
object is already available in the package as the object returned by zi.poisson()
. Currently, only the log link is allowed. First, we simulate longitudinal data from a zero-inflated negative binomial distribution:
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we constuct a data frame with the design:
# everyone has a baseline measurment, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ 1, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(1.5, 0.05, 0.05, -0.03) # fixed effects coefficients non-zero part
shape <- 2
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 2, drop = FALSE]))
# we simulate negative binomial longitudinal data
DF$y <- rnbinom(n * K, size = shape, mu = exp(eta_y))
# we set the extra zeros
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
A zero-inflated Poisson mixed model with only fixed effects in the zero part is fitted with the following call to mixed_model()
:
fm1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
family = zi.poisson(), zi_fixed = ~ sex)
fm1
#>
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF,
#> family = zi.poisson(), zi_fixed = ~sex)
#>
#>
#> Model:
#> family: zero-inflated poisson
#> link: log
#>
#> Random effects covariance matrix:
#> StdDev
#> (Intercept) 0.8272898
#>
#> Fixed effects:
#> (Intercept) sexfemale time sexfemale:time
#> 1.5275644687 0.0173647260 -0.0040365383 0.0004326841
#>
#> Zero-part coefficients:
#> (Intercept) sexfemale
#> -1.210597 0.498611
#>
#> log-Lik: -2223.937
As noted above, only the log link is currently available for the non-zero part and the logit link for the zero-part. Hence, the estimated fixed effects for the two parts are interpreted accordingly. We extend fm1
by allowing also for random intercept in the zero-part. We should note that by default the random intercept from the non-zero part is correlated with the random intercept from the zero-part:
fm2 <- update(fm1, zi_random = ~ 1 | id)
fm2
#>
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF,
#> family = zi.poisson(), zi_fixed = ~sex, zi_random = ~1 |
#> id)
#>
#>
#> Model:
#> family: zero-inflated poisson
#> link: log
#>
#> Random effects covariance matrix:
#> StdDev Corr
#> (Intercept) 0.7391
#> zi_(Intercept) 0.7157 -0.6334
#>
#> Fixed effects:
#> (Intercept) sexfemale time sexfemale:time
#> 1.5902225072 -0.0117290754 -0.0029500093 0.0001046371
#>
#> Zero-part coefficients:
#> (Intercept) sexfemale
#> -1.1201182 0.4416912
#>
#> log-Lik: -2214.306
We test if we need the extra random effect using a likelihood ratio test:
anova(fm1, fm2)
#>
#> AIC BIC log.Lik LRT df p.value
#> fm1 4461.87 4480.11 -2223.94
#> fm2 4446.61 4446.61 -2214.31 19.26 2 1e-04
The results suggest that the extra random effect improves the fit of the model.
We continue with the same data, but we now take into account the potential over-dispersion in the data using a zero-inflated negative binomial model. To fit this mixed model we use almost identical syntax to what just did above - the only difference that now we specify as family the zi.negative.binomial()
object:
gm1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
family = zi.negative.binomial(), zi_fixed = ~ sex)
gm1
#>
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF,
#> family = zi.negative.binomial(), zi_fixed = ~sex)
#>
#>
#> Model:
#> family: zero-inflated negative binomial
#> link: log
#>
#> Random effects covariance matrix:
#> StdDev
#> (Intercept) 0.8291406
#>
#> Fixed effects:
#> (Intercept) sexfemale time sexfemale:time
#> 1.471442764 0.027018363 0.003096235 -0.006574982
#>
#> Zero-part coefficients:
#> (Intercept) sexfemale
#> -1.5551763 0.5913118
#>
#> dispersion parameter:
#> 2.227321
#>
#> log-Lik: -1958.746
Similarly to fm1
in gm1
we specified only fixed effects for the logistic regression for the zero part. We now compare this model with the zero-inflated Poisson model that allowed for a random intercept in the zero-part. The comparison can be done with the anova()
method; because the two models are not nested we set test = FALSE
in the call to anova()
, i.e.:
anova(gm1, fm2, test = FALSE)
#>
#> AIC BIC log.Lik df
#> gm1 3933.49 3954.33 -1958.75
#> fm2 4446.61 4446.61 -2214.31 1
We observe that accounting for the over-dispersion seems to better improve the fit than including the random intercepts term in the zero-part.
To further illustrate the flexibility provided by GLMMadaptive in allowing users to specify their own family objects with a specific log density function, we consider the setting of multivariate semi-continuous data. That is, continuous data with excess zeros. In the literature the class of two-part / hurdle mixed models has been proposed to analyze such data. These models specify a logistic regression for the dichotomous indicator that the outcome is zero or not, and a standard linear mixed model for the logarithmic transformation of the non-zero responses.
We start again by simulating some longitudinal data from this model:
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we constuct a data frame with the design:
# everyone has a baseline measurment, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients non-zero part
sigma <- 0.5 # standard deviation error terms non-zero part
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.1 # variance of random slopes non-zero part
D33 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)), rnorm(n, sd = sqrt(D33)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1:2, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 3, drop = FALSE]))
# we simulate log-normal longitudinal data
DF$y <- exp(rnorm(n * K, mean = eta_y, sd = sigma))
# we set the zeros from the logistic regression
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
We define the family
object: The minimal requirement is to specify the log_dens
component and the link function; however, as also explained in the Custom Models vignette, the internal calculations will be faster and more stable if the user also specifies the score vector for the linear predictor of the non-zero part (function score_eta_fun()
), the derivative of the log density with respect to phis
(function score_phis_fun()
), and because we have a model with a zero-part, also the derivative of the log density with respect to the linear predictor of the zero part (function score_eta_zi_fun()
). Finally, for being able to simulate from the model using the simulate()
method, the function simulate()
within the family object can also be specified. Hence, the family object for the two-part model is defined as:
two_part.lognormal <- function () {
stats <- make.link("identity")
log_dens <- function (y, eta, mu_fun, phis, eta_zi) {
sigma <- exp(phis)
# binary indicator for y > 0
ind <- y > 0
# non-zero part
eta <- as.matrix(eta)
eta_zi <- as.matrix(eta_zi)
out <- eta
out[ind, ] <- plogis(eta_zi[ind, ], lower.tail = FALSE, log.p = TRUE) +
dnorm(x = log(y[ind]), mean = eta[ind, ], sd = sigma, log = TRUE)
# zero part
out[!ind, ] <- plogis(eta_zi[!ind, ], log.p = TRUE)
attr(out, "mu_y") <- eta
out
}
score_eta_fun <- function (y, mu, phis, eta_zi) {
sigma <- exp(phis)
# binary indicator for y > 0
ind <- y > 0
# non-zero part
eta <- as.matrix(mu)
out <- eta
out[!ind, ] <- 0
out[ind, ] <- (log(y[ind]) - eta[ind, ]) / sigma^2
out
}
score_eta_zi_fun <- function (y, mu, phis, eta_zi) {
ind <- y > 0
probs <- plogis(as.matrix(eta_zi))
out <- 1 - probs
out[ind, ] <- - probs[ind, ]
out
}
score_phis_fun <- function (y, mu, phis, eta_zi) {
sigma <- exp(phis)
# binary indicator for y > 0
ind <- y > 0
# non-zero part
eta <- as.matrix(mu)
out <- eta
out[!ind, ] <- 0
out[ind, ] <- - 1 + (log(y[ind]) - eta[ind, ])^2 / sigma^2
out
}
simulate <- function (n, mu, phis, eta_zi) {
y <- rnorm(n = n, mean = mu, sd = exp(phis))
y[as.logical(rbinom(n, 1, plogis(eta_zi)))] <- 0
y
}
structure(list(family = "two-part log-normal", link = stats$name,
linkfun = stats$linkfun, linkinv = stats$linkinv, log_dens = log_dens,
score_eta_fun = score_eta_fun, score_eta_zi_fun = score_eta_zi_fun,
score_phis_fun = score_phis_fun, simulate = simulate),
class = "family")
}
Then to fit the model we provide the user-defined family object in the family
argument of mixed_model()
specifying also that we have one dispersion parameter in the family (i.e., n_phis = 1
), and that in the zero-part we only include fixed effects:
km1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
family = two_part.lognormal(), n_phis = 1,
zi_fixed = ~ sex)
km1
#>
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 | id, data = DF,
#> family = two_part.lognormal(), zi_fixed = ~sex, n_phis = 1)
#>
#>
#> Model:
#> family: two-part log-normal
#> link: identity
#>
#> Random effects covariance matrix:
#> StdDev
#> (Intercept) 0.9615228
#>
#> Fixed effects:
#> (Intercept) sexfemale time sexfemale:time
#> -2.08102787 -0.24015083 0.22302346 -0.05696434
#>
#> Zero-part coefficients:
#> (Intercept) sexfemale
#> -1.6034499 0.6713555
#>
#> phi parameters:
#> -0.3352479
#>
#> log-Lik: -1214.208
The estimated standard deviation for the error terms is exp(phis) =
0.7 We extend the model by allowing for a random intercept in the zero-part, but using the ||
symbol in the random
argument we specify the covariance matrix of the random effects is diagonal; hence, that the two random intercepts terms are uncorrelated:
km2 <- update(km1, random = ~ 1 || id, zi_random = ~ 1 | id)
km2
#>
#> Call:
#> mixed_model(fixed = y ~ sex * time, random = ~1 || id, data = DF,
#> family = two_part.lognormal(), zi_fixed = ~sex, zi_random = ~1 |
#> id, n_phis = 1)
#>
#>
#> Model:
#> family: two-part log-normal
#> link: identity
#>
#> Random effects covariance matrix:
#> StdDev
#> (Intercept) 0.9619
#> zi_(Intercept) 0.6043
#>
#> Fixed effects:
#> (Intercept) sexfemale time sexfemale:time
#> -2.08026578 -0.24025852 0.22301074 -0.05698636
#>
#> Zero-part coefficients:
#> (Intercept) sexfemale
#> -1.7160273 0.7096789
#>
#> phi parameters:
#> -0.3350702
#>
#> log-Lik: -1210.313
Finally, we show how the simulate()
method can be used to perform a replication / posterior predictive check. In particular, in the following code we compare the empirical distribution function estimated in the observed data with estimates of the empirical distribution function obtained from simulated/replicated data from the model. In the call to simulate()
below we also specify to account for the variability in the maximum likelihood estimates by setting acount_MLEs_var = TRUE
:
par(mar = c(2.5, 2.5, 0, 0), mgp = c(1.1, 0.5, 0), cex.axis = 0.7, cex.lab = 0.8)
y <- DF$y
y[y > 0] <- log(y[y > 0])
x_vals <- seq(min(y), max(y), length.out = 500)
out <- simulate(km2, nsim = 30, acount_MLEs_var = TRUE)
rep_y <- apply(out, 2, function (x, x_vals) ecdf(x)(x_vals), x_vals = x_vals)
matplot(x_vals, rep_y, type = "l", lty = 1, col = "lightgrey",
xlab = "Response Variable", ylab = "Empirical CDF")
lines(x_vals, ecdf(y)(x_vals))
legend("bottomright", c("replicated data", "observed data"), lty = 1,
col = c("lightgrey", "black"), bty = "n", cex = 0.8)