Model Specification

Nicole Erler

2018-08-14

In this vignette we use the NHANES data for examples in cross-sectional data and the dataset simLong for examples in longitudinal data. For more information on these datasets, see the vignette Visualizing Incomplete Data, in which the distributions of variables and missing values in both sets are explored.

Note:
In many of the examples we use n.adapt = 0 (and n.iter = 0, which is the default) in order to prevent the MCMC sampling and, hence, reduce computational time. mess = FALSE is used to suppress messages that are not of interest in this vignette.

Analysis model type

JointAI has three main functions: lm_imp(), glm_imp() and lme_imp().

Specification of these functions is similar to the specification of the complete data versions lm(), glm() and lme() (from package nlme).

lm_imp() and glm_imp() take arguments formula and data, whereas lme_imp() requires the specification of a fixed effects and a random effects formula. Specification of the fixed effects formula is discussed in section Model formula, specification of the random effects is discussed in section Multi level structure & longitudinal covariates.

Additionally, glm_imp() requires the specification of the model family and link function.
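As a minimal sketch of the additional family argument, the call below sets up a logistic regression for the binary variable educ. The choice of outcome and covariates is an assumption made for illustration only, and the argument names follow the JointAI version this vignette was written for. n.adapt = 0 skips the MCMC sampling.

```r
library(JointAI)

# Logistic regression: family and link function are given as in glm().
# The model (educ explained by age and gender) is illustrative only.
mod_glm <- glm_imp(educ ~ age + gender,
                   family = binomial(link = "logit"),
                   data = NHANES, n.adapt = 0, mess = FALSE)
```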

Model formula

The arguments formula (in lm_imp() and glm_imp()) and fixed (in lme_imp()) take a two-sided formula object, where ~ separates the response (outcome / dependent variable) from the linear predictor, in which covariates (independent variables) are separated by +. An intercept is added automatically.

Interactions

Interactions between variables can be introduced using : or *. The operator : adds only the interaction term, whereas * adds the interaction term and the main effects, i.e., y ~ a * b is equivalent to y ~ a + b + a:b.

Interaction with multiple variables

Interactions between multiple variables can be specified using parentheses:
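How such a formula expands can be checked in base R with terms(), without fitting a model. The formula below is an illustrative choice using NHANES variable names; it is not necessarily the one used in the original example.

```r
# gender interacts with each of the variables inside the parentheses
f <- SBP ~ gender * (age + smoke + creat)

# List the terms that R derives from the formula
attr(terms(f), "term.labels")
#> [1] "gender"       "age"          "smoke"        "creat"
#> [5] "gender:age"   "gender:smoke" "gender:creat"
```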

The function parameters() returns the parameters that are specified to be followed (even for models where no MCMC sampling was performed, i.e., when n.iter = 0 and n.adapt = 0).
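A short sketch of this use of parameters(), assuming the JointAI version this vignette was written for. The model is illustrative, and the object name mod_int is hypothetical.

```r
library(JointAI)

# Set up the model without any MCMC sampling (n.adapt = 0, n.iter = 0 by default)
mod_int <- lm_imp(SBP ~ gender * (age + smoke + creat),
                  data = NHANES, n.adapt = 0, mess = FALSE)

# Show which parameters would be monitored
parameters(mod_int)
```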

Non-linear functional forms

In practice, associations between outcome and covariates do not always meet the standard assumption that all covariate effects are linear. Often, assuming a logarithmic, quadratic or other non-linear effect is more appropriate.

Non-linear associations can be specified in the model formula using either functions such as log() (the natural logarithm), sqrt() (the square root) or exp() (the exponential function), or by specifying a function of a variable using I(), for example, I(x^2) would be the quadratic term of a variable x.
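The effect of these formula elements on the design matrix can be inspected in base R. The sketch below uses a small made-up data frame (the values are hypothetical) and also shows why I() is needed for powers: inside a formula, ^ is a formula operator and not arithmetic.

```r
# Hypothetical toy data
d <- data.frame(y = c(2.1, 3.9, 6.2, 7.8), x = c(1, 2, 3, 4))

# log(x) and the quadratic term I(x^2) become columns of the design matrix
X <- model.matrix(y ~ log(x) + I(x^2), data = d)
colnames(X)
#> [1] "(Intercept)" "log(x)"      "I(x^2)"

# Without I(), x^2 is read as a formula operation and reduces to x
attr(terms(y ~ x^2), "term.labels")
#> [1] "x"
```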

For completely observed covariates, JointAI can handle any common type of function implemented in R, including splines, e.g., using ns() or bs() from the package splines (which is automatically installed with R).

Since functions involving variables that have missing values need to be re-calculated in each iteration of the MCMC sampling, currently, only functions that are available in JAGS can be used for incomplete variables. Those functions include:

The list of functions implemented in JAGS can be found in the JAGS user manual.

Some examples (see note 1):
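One possible model of this kind, written as a sketch: a natural cubic spline of the covariate age combined with polynomial terms of the incomplete covariate bili. The object name mod_spline and the choice df = 2 are assumptions made for illustration.

```r
library(JointAI)
library(splines)

# Spline of age (used via ns()) and quadratic and cubic terms of bili (via I())
mod_spline <- lm_imp(SBP ~ ns(age, df = 2) + gender + I(bili^2) + I(bili^3),
                     data = NHANES, n.adapt = 0, mess = FALSE)
```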

What happens inside JointAI?

When a function of a complete or incomplete variable is used in the model formula, the main effect of that variable is automatically added as an auxiliary variable (more on auxiliary variables in section Auxiliary variables), and only the main effects are used as predictors in the imputation models.

In mod3b, for example, the spline of age is used as predictor for SBP, but in the imputation model for bili, age enters with a linear effect.

The function list_impmodels() prints a list of the imputation models used in a JointAI object. Since, at the moment, we are only interested in the predictor variables, we suppress printing of information on prior distributions, regression coefficients and other parameters by setting priors, regcoef and otherpars to FALSE.


When a function of a variable is specified as an auxiliary variable, this function is used in the imputation models. For example, in mod3e, waist circumference (WC) is not part of the model for SBP, and I(WC^2) is used in the linear predictor of the imputation model for bili:

Incomplete variables are always imputed on their original scale, i.e.,

  • in mod3b the variable bili is imputed and the quadratic and cubic versions are calculated from the imputed values.
  • Likewise, creat and albu in mod3c are imputed separately, and I(creat/albu^2) is calculated from the imputed (and observed) values.

Functions with restricted support

When a function has restricted support, e.g., log(x) is only defined for x > 0, the distribution used to impute that variable needs to comply with these restrictions. This can either be achieved by truncating the distribution, using the argument trunc, or by selecting an imputation method that meets the restrictions. For more information on imputation methods, see the section Imputation model types.

Example:
When using a log() transformation for the covariate bili, there are two options. We can keep the default imputation method norm (a normal distribution) and truncate it by specifying trunc = list(bili = c(lower, upper)), where lower and upper are the smallest and largest allowed values. Alternatively, we can choose an imputation method (using the argument meth; for more details, see the section Imputation model types) that only imputes positive values, such as a log-normal distribution (lognorm) or a Gamma distribution (gamma):
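The three options could be written as follows. This is a sketch: the object names are hypothetical, the lower limit 1e-5 is an assumed small positive value, and the arguments trunc and meth are used as described in the text for the JointAI version of this vignette.

```r
library(JointAI)

# Option 1: default normal imputation model for bili, truncated to positive values.
# 1e-5 is an assumed lower limit; 1e10 lies far above the range of bili.
mod_trunc <- lm_imp(SBP ~ age + gender + log(bili), data = NHANES,
                    trunc = list(bili = c(1e-5, 1e10)),
                    n.adapt = 0, mess = FALSE)

# Option 2: log-normal imputation model for bili
mod_lognorm <- lm_imp(SBP ~ age + gender + log(bili), data = NHANES,
                      meth = c(bili = "lognorm"),
                      n.adapt = 0, mess = FALSE)

# Option 3: Gamma imputation model for bili
mod_gamma <- lm_imp(SBP ~ age + gender + log(bili), data = NHANES,
                    meth = c(bili = "gamma"),
                    n.adapt = 0, mess = FALSE)
```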

Truncation always requires both limits to be specified. Since -Inf and Inf are not accepted, a value outside the range of the variable (here: 1e10) can be selected for the second boundary of the truncation interval.

Functions that are not available in R

It is possible to use functions that have different names in R and JAGS, or that do exist in JAGS, but not in R, by defining a new function in R that has the name of the function in JAGS.

Example:
In JAGS the inverse logit transformation is defined in the function ilogit. In R, there is no function ilogit, but the inverse logit is available as the distribution function of the logistic distribution plogis.
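In base R, the definition is a one-line assignment. The formula at the end only illustrates how the new function could then appear in a model formula; the model itself is an assumption.

```r
# Make the JAGS name available in R by pointing it to plogis()
ilogit <- plogis

ilogit(0)
#> [1] 0.5

# The function can now be used in a model formula (illustrative model)
f <- SBP ~ age + gender + ilogit(creat)
```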

Nested functions

It is also possible to nest a function in another function.

Example (see note 2):

The complementary log log transformation is restricted to values larger than 0 and smaller than 1. In order to use this function on a variable that exceeds this range, as is the case for creat, a second transformation might be used, for instance the inverse logit from the previous example.

In JAGS, the complementary log log transformation is implemented as cloglog, but since this function does not exist in (base) R, we first need to define it:
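A base R sketch of these definitions, using the standard formula log(-log(1 - x)) for the complementary log log transformation. The values assigned to creat are hypothetical and only serve to show that the nested call is defined for values above 1.

```r
# Inverse logit under its JAGS name, and the complementary log log transformation
ilogit  <- plogis
cloglog <- function(x) log(-log(1 - x))

# Hypothetical values, some of them outside the interval (0, 1)
creat <- c(0.5, 0.9, 1.4)

# ilogit() maps the values into (0, 1), so cloglog() is defined for all of them
cloglog(ilogit(creat))

# The nested transformation as it could appear in a model formula (illustrative)
f <- SBP ~ age + gender + cloglog(ilogit(creat))
```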

Multi level structure & longitudinal covariates

In multi-level models, in addition to the fixed effects structure specified by the argument fixed, a random effects structure needs to be provided via the argument random.

Random effects

random takes a one-sided formula starting with a ~. Variables for which a random effect should be included are usually separated by a +, and the grouping variable is separated by |. A random intercept is added automatically and only needs to be specified in a random intercept only model.

A few examples:
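The following formulas sketch typical specifications. The variable names time and id are hypothetical placeholders for a time variable and a grouping variable; the formulas are plain R objects and can be created without reference to a dataset.

```r
# Random intercept only: the intercept has to be written explicitly
random_int <- ~ 1 | id

# Random intercept (added automatically) and random slope for time
random_slope <- ~ time | id

# Random intercept, slope and quadratic effect of time
random_quad <- ~ time + I(time^2) | id

# Variables involved in a random effects formula
all.vars(random_slope)
#> [1] "time" "id"
```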

It is possible to use splines in the random effects structure, e.g.:

Multiple levels of nesting are not yet available.

Longitudinal covariates

Longitudinal (“time-varying”) covariates can be used in the model; however, they cannot yet be imputed.

Imputation model types

JointAI automatically selects an imputation model for each of the incomplete variables, based on the class of the variable and the number of levels. The automatically selected types are:

name        model                    variable type
norm        linear regression        continuous variables
logit       logistic regression      factors with two levels
multilogit  multinomial logit model  unordered factors with >2 levels
cumlogit    cumulative logit model   ordered factors with >2 levels

The imputation models that are chosen by default may not necessarily be appropriate for the data at hand, especially for continuous variables, which often do not comply with the assumptions of (conditional) normality.

Therefore, alternative imputation methods are available for continuous covariates:

name     model                                              variable type
lognorm  normal regression of the log-transformed variable  right-skewed variables >0
gamma    Gamma regression (with log-link)                   right-skewed variables >0
beta     beta regression (with logit-link)                  continuous variables with values in [0, 1]

lognorm assumes a normal distribution for the natural logarithm of the variable, but the variable enters the linear predictor of the analysis model (and possibly other imputation models) on its original scale.

Specification of imputation model types

In models mod4b and mod4c we have already seen two examples in which the imputation model type was changed using the argument meth. When the model has many incomplete predictor variables, but not all need to be changed, it can be useful to perform a set-up run of the model without any iterations, extract the vector of imputation models that were automatically selected, and apply changes to that vector:
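This workflow could look as follows. The sketch assumes that the fitted object stores the selected imputation methods in an element named meth (as in the JointAI version this vignette was written for); the model and the object names are illustrative.

```r
library(JointAI)

# Set-up run without any iterations
mod_setup <- lm_imp(SBP ~ age + gender + bili + creat + occup + smoke,
                    data = NHANES, n.adapt = 0, mess = FALSE)

# Extract the automatically selected imputation methods
# (assumption: they are stored in the element 'meth')
meth <- mod_setup$meth
meth

# Change the method for bili only and pass the modified vector to a new call
meth["bili"] <- "lognorm"
mod_new <- lm_imp(SBP ~ age + gender + bili + creat + occup + smoke,
                  data = NHANES, meth = meth, n.adapt = 0, mess = FALSE)
```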

Alternatively, the function get_imp_meth() can be called directly. get_imp_meth has arguments

Order of the sequence of imputation models

By default, the imputation models are ordered by number of missing values (decreasing), and each model has the (cross-sectional) complete covariates and all incomplete variables that appear earlier in the sequence in its linear predictor:

By re-ordering the elements in meth, the order of the sequence of imputation models will be changed.
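Re-ordering a named vector is plain base R indexing. In the sketch below the vector meth is written out by hand with hypothetical content; in practice it would be taken from a set-up run and the re-ordered vector passed to the argument meth.

```r
# Hypothetical vector of imputation methods, in the automatically chosen order
meth <- c(smoke = "cumlogit", bili = "norm", occup = "multilogit")

# Put occup first, so that its imputation model comes first in the sequence
meth_reordered <- meth[c("occup", "smoke", "bili")]
meth_reordered
#>        occup        smoke         bili
#> "multilogit"   "cumlogit"       "norm"
```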

Auxiliary variables

Auxiliary variables are variables that are not part of the analysis model, but should be considered as predictor variables in the imputation models because they can inform the imputation of unobserved values.

Good auxiliary variables are

In lm_imp(), glm_imp() and lme_imp(), auxiliary variables can be specified with the argument auxvars, which is a vector containing the names of the auxiliary variables.

Example:
We might consider the variables educ and smoke as predictors for the imputation of occup:

mod9a <- lm_imp(SBP ~ gender + age + occup, auxvars = c('educ', 'smoke'),
                data = NHANES, n.iter = 100, progress.bar = 'none', mess = FALSE)

The regression coefficients for educ and smoke in the analysis model are fixed to zero so that these two variables do not contribute to the model, and are omitted from the model summary:

summary(mod9a)
#> 
#>  Linear model fitted with JointAI 
#> 
#> Call:
#> lm_imp(formula = SBP ~ gender + age + occup, data = NHANES, n.iter = 100, 
#>     progress.bar = "none", mess = FALSE, auxvars = c("educ", 
#>         "smoke"))
#> 
#> Posterior summary:
#>                           Mean      SD    2.5%    97.5% tail-prob.
#> (Intercept)           105.3069 3.35782 98.6359 112.1304    0.00000
#> genderfemale           -5.6170 2.06830 -9.4011  -1.1724    0.01333
#> age                     0.3852 0.07564  0.2155   0.5252    0.00000
#> occuplooking for work   3.7841 6.70179 -8.7130  16.0765    0.60000
#> occupnot working       -0.8844 2.72687 -6.1402   4.5735    0.76667
#> 
#> Posterior summary of residual std. deviation:
#>            Mean     SD  2.5% 97.5%
#> sigma_SBP 14.47 0.7648 12.97 16.02
#> 
#> 
#> MCMC settings:
#> Iterations = 101:200
#> Sample size per chain = 100 
#> Thinning interval = 1 
#> Number of chains = 3 
#> 
#> Number of observations: 186

They are, however, used as predictors in the imputation for occup and imputed themselves (if they have missing values):

list_impmodels(mod9a, priors = FALSE, regcoef = FALSE, otherpars = FALSE, refcat = FALSE)
#> Cumulative logit imputation model for 'smoke'
#> * Predictor variables: 
#>   genderfemale, age, educhigh
#> 
#> Multinomial logit imputation model for 'occup'
#> * Predictor variables: 
#>   (Intercept), genderfemale, age, educhigh, smokeformer, smokecurrent

Functions of variables as auxiliary variables

As shown above in mod3e, it is possible to specify functions of auxiliary variables. In that case, the auxiliary variable is not considered as a linear effect but as specified by the function:
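A sketch of such a specification, consistent with the description of mod3e given earlier (WC is not part of the model for SBP, and I(WC^2) enters the imputation model for bili). The object name mod_aux is hypothetical, and passing the function as a character string to auxvars is an assumption about the JointAI version of this vignette.

```r
library(JointAI)

# The quadratic term of WC is used as auxiliary variable
mod_aux <- lm_imp(SBP ~ age + gender + bili, auxvars = "I(WC^2)",
                  data = NHANES, n.adapt = 0, mess = FALSE)

# Inspect the predictor variables of the imputation models
list_impmodels(mod_aux, priors = FALSE, regcoef = FALSE,
               otherpars = FALSE, refcat = FALSE)
```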

Reference values for categorical covariates

In JointAI dummy coding is used when categorical variables enter a linear predictor, i.e., a binary variable is created for each category, except the reference category. These binary variables have value one when that category was observed and zero otherwise.
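Dummy coding can be illustrated in base R with model.matrix(). The factor below is a hand-made example with the category names of occup; the order of the levels (and hence "working" as reference category) is an assumption.

```r
# Hypothetical factor with three categories; the first level is the reference
occup <- factor(c("working", "not working", "looking for work", "working"),
                levels = c("working", "looking for work", "not working"))

# One binary column per category, except the reference category
X <- model.matrix(~ occup)
colnames(X)
#> [1] "(Intercept)"           "occuplooking for work" "occupnot working"
```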

By default, the first category of a categorical variable (ordered or unordered) is used as reference. However, this may not always allow the desired interpretation of the regression coefficients. Moreover, when categories are unbalanced, setting the largest group as reference may result in better mixing of the MCMC chains.

Therefore, JointAI allows specification of the reference category separately for each variable, via the argument refcats.

Setting reference categories for all variables

To specify the choice of reference category globally for all variables in the model, refcats can be set to one of “first”, “last” or “largest”.

For example:
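A sketch of a global specification, in which the largest category of every categorical covariate is used as reference. The model and the object name are illustrative.

```r
library(JointAI)

# Use the largest category as reference for all categorical variables
mod_ref <- lm_imp(SBP ~ gender + age + educ + occup + smoke,
                  refcats = "largest",
                  data = NHANES, n.adapt = 0, mess = FALSE)
```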

Setting reference categories for individual variables

Alternatively, refcats takes a named vector, in which the reference category for each variable can be specified either by its number or its name, or one of the three global types: “first”, “last” or “largest”. For variables for which no reference category is specified, the default is used.
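A sketch of a variable-specific choice. It is written as a named list so that category numbers and names can be mixed; that a list is accepted in this form is an assumption about the JointAI version of this vignette. gender is not listed and therefore keeps the default.

```r
library(JointAI)

# occup: reference given by name; smoke: by number (second category);
# educ: by one of the global types
mod_ref2 <- lm_imp(SBP ~ gender + age + educ + occup + smoke,
                   refcats = list(occup = "not working", smoke = 2,
                                  educ = "largest"),
                   data = NHANES, n.adapt = 0, mess = FALSE)
```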

To help specify the reference category, the function set_refcat() can be used. It prints the names of the categorical variables that are selected by

or all categorical variables in the data if only data is provided, and asks a number of questions, to which the user replies by entering a number.

When option 4 is chosen, a question for each categorical variable is asked, for example:

After specification of the reference categories for all categorical variables, the determined specification for the argument refcats is printed:

set_refcat() also returns a named vector that can be passed to the argument refcats:


  1. Note: these examples are chosen to demonstrate functionality and may not fit the data.

  2. Again, this is just a demonstration of the possibilities in JointAI, but nesting transformations will most often result in coefficients that do not have a meaningful interpretation in practice.