Clark, J.S. 2016. Why species tell us more about traits than traits tell us about species, Ecology, 97, 1979-1993.
Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecological Monographs, in press.
Because it accommodates different data types, gjam can be used to model ecological traits by either of two approaches (Clark 2016). One approach uses community weighted mean/mode (CWMM) trait values for a plot \(i\) as a response vector \(\mathbf{u}_{i}\), where each trait has a corresponding data type designation in typeNames. I discuss this approach first. I then summarize the second approach, predictive trait modeling.
There are \(n\) observations of \(M\) traits to be explained by \(Q - 1\) predictors in design matrix \(\mathbf{X}\). The Trait Response Model (TRM) in Clark (2016) is
\[\mathbf{w}_{i} \sim MVN(\mathbf{u}_i,\Omega)\]
\[\mathbf{u}_i = \mathbf{A}'\mathbf{x}_{i}\]
where \(\mathbf{u}_{i}\) is a length-\(M\) vector of CWMM values, corresponding to \(\mathbf{w}_{i}\) on the latent scale, \(\mathbf{A}\) is the \(Q \times M\) matrix of coefficients, and \(\Omega\) is the \(M \times M\) residual covariance (Fig. 1). After describing the setup and model fitting I show how gjam summarizes the estimates and predictions.
Figure 1. Trait response model showing the sizes of matrices for a sample containing n observations, M traits, and Q predictors.
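The dimensions in the TRM can be checked with a small simulation. This is only a sketch in base R under assumed toy sizes and made-up parameter values; it does not use gjam and is not how gjam fits the model.

```r
# Toy simulation of the trait response model; all sizes and values are illustrative
set.seed(1)
n <- 50   # observations (plots)
Q <- 3    # columns of X: intercept plus Q - 1 predictors
M <- 4    # traits

X <- cbind(intercept = 1, matrix(rnorm(n * (Q - 1)), n, Q - 1))  # n x Q design matrix
A <- matrix(rnorm(Q * M, sd = 0.5), Q, M)                        # Q x M coefficients
Omega <- diag(M) * 0.2 + 0.05                                    # M x M residual covariance

U <- X %*% A                                      # n x M; row i is u_i' = x_i' A
E <- matrix(rnorm(n * M), n, M) %*% chol(Omega)   # residuals with covariance Omega
W <- U + E                                        # latent-scale responses w_i

dim(W)   # n rows, M columns
```

Each row of W is one draw of \(\mathbf{w}_{i}\) around its mean \(\mathbf{u}_{i} = \mathbf{A}'\mathbf{x}_{i}\).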
Data contained in forestTraits include predictors in xdata, a character vector of data types in traitTypes, species traits in specByTrait, and treesDeZero, which contains tree biomass in de-zeroed format. Here the data are loaded and re-zeroed with gjamReZero:
library(gjam)
library(repmis)
source_data("https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True")
xdata <- forestTraits$xdata # n X Q
types <- forestTraits$traitTypes # 12 trait types
sbyt <- forestTraits$specByTrait # S X 12
pbys <- gjamReZero(forestTraits$treesDeZero) # n X S
pbys <- gjamTrimY(pbys,5)$y # at least 5 plots
head(sbyt)
The matrix pbys holds biomass values for species, rounded off to reduce storage. The first six columns of sbyt are centered and standardized. The three ordinal classes are integer values, but do not represent an absolute scale (see below). The three groups of categorical variables in data.frame sbyt have different numbers of levels, shown here:
table(sbyt$leaf) # four levels
table(sbyt$xylem) # diffuse/tracheid vs ring-porous
table(sbyt$repro) # two levels
These species traits are translated into community-weighted means and modes (CWMM) by the function gjamSpec2Trait:
tmp <- gjamSpec2Trait(pbys, sbyt, types)
tTypes <- tmp$traitTypes # M = 15 values
u <- tmp$plotByCWM # n X M
censor <- tmp$censor # (0, 1) censoring, two-level CAT's
specByTrait <- tmp$specByTrait # S X M
M <- ncol(u)
n <- nrow(u)
types # 12 individual trait types
cbind(colnames(u),tTypes) # M trait names and types
Note the change in data types by comparing types for individuals of a species with tTypes for CWMM values at the plot scale. At the plot scale tTypes has \(M = 15\) values, because the leaf 'CAT' group in types includes four levels, which are expanded to four 'FC' columns in u. The two-level groups 'xylem' and 'repro' are transformed to censored continuous values on (0, 1), and thus each occupies a single column in u.
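The translation from species traits to plot-level values can be sketched in base R. This is a toy illustration with made-up abundances and traits, not the code inside gjamSpec2Trait: continuous traits become abundance-weighted means, a multilevel categorical trait becomes one fraction per level, and an ordinal trait becomes the class holding the most abundance.

```r
# Toy data: 3 plots, 4 species (all values are made up for illustration)
abund <- matrix(c(10, 0, 5, 5,
                   0, 8, 2, 0,
                   4, 4, 4, 4), nrow = 3, byrow = TRUE,
                dimnames = list(paste0("plot", 1:3), paste0("sp", 1:4)))
traits <- data.frame(
  SLA   = c(-1.0, 0.5, 1.2, -0.3),                       # continuous, standardized
  shade = c(1, 3, 3, 5),                                 # ordinal class
  leaf  = factor(c("broaddeciduous", "needleevergreen",
                   "broaddeciduous", "other")),          # multilevel categorical
  row.names = colnames(abund))

wt <- abund / rowSums(abund)          # abundance weights; each row sums to 1

# Continuous trait: community-weighted mean
cwmSLA <- drop(wt %*% traits$SLA)

# Categorical trait: one indicator column per level, then weighted fractions
leafInd <- model.matrix(~ leaf - 1, data = traits)   # S x (number of levels)
leafFC  <- wt %*% leafInd                            # each row sums to 1

# Ordinal trait: community-weighted mode (class holding the most abundance)
cwMode <- apply(wt, 1, function(w) {
  tot <- tapply(w, traits$shade, sum)
  as.numeric(names(tot)[which.max(tot)])
})
```

In this toy example the three-level factor expands to three fraction columns, in the same way that the four-level leaf group expands to four 'FC' columns in u.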
As discussed in Clark (2016), the interpretation of CWMM values in u is not the same as the interpretation of species-level traits assigned in forestTraits$specByTrait. Let \(\mathbf{T}\) be the \(S \times M\) species-by-traits matrix specByTrait returned by the function gjamSpec2Trait. The row names of specByTrait match the column names of the \(n \times S\) species abundance matrix pbys. The species-by-traits matrix is referenced to individuals of a species.
The plot-by-trait matrix u is referenced to a location, i.e., one row in matrix u. Each row is a CWMM, with values derived from measurements on individual trees but combined to produce a weighted value for each location. Ordinal traits (shade, drought, flood) are community weighted modes, because ordinal scores cannot be averaged. The CWMM value for a plot may not have the same data type as the trait assigned to a species in sbyt. Here is a table of the 15 columns in u:
trait | typeName | partition | comment |
---|---|---|---|
gmPerSeed | CON | \((-\infty, \infty)\) | centered, standardized |
maxHt | CON | " | " |
leafN | CON | " | " |
leafP | CON | " | " |
SLA | CON | " | " |
woodSG | CON | " | " |
shade | OC | \((-\infty, 0, p_{s1}, p_{s2}, p_{s3}, p_{s4}, \infty)\) | five tolerance bins |
drought | OC | " | " |
flood | OC | " | " |
leaf_broaddeciduous | FC | \((-\infty, 0, 1, \infty)\) | categorical traits become FC data as CWMs |
leaf_broadevergreen | FC | " | " |
leaf_needleevergreen | FC | " | " |
leaf_other | FC | " | " |
repro_monoecious | CA | \((-\infty, 0, 1, \infty)\) | two categories become continuous (censored) |
xylem_ring | CA | " | " |
The first six CON variables are continuous, centered, and standardized, as is often done in trait studies. In gjam CON is the only type that is not assumed to be censored at zero.
The three OC variables are ordinal classes, lacking an absolute scale; the partition must be estimated.
The four fractional composition FC columns are the levels of the single CAT variable leaf, expanded by the function gjamSpec2Trait.
The last two traits in u are fractions with two classes, only one of which is included here. They are censored at both 0 and 1, with the intervals \((-\infty, 0)\) and \((1, \infty)\). This censoring can be generated using gjamCensorY:
censorList <- gjamCensorY(values = c(0,1), intervals = cbind( c(-Inf,0),c(1,Inf) ),
y = u, whichcol = c(14:15))$censor # the two CA traits are the last two of the 15 columns in u
This censoring was already done with gjamSpec2Trait, which knows to treat 'CAT' data with only two levels as censored 'CA' data. Here values = c(0,1) specifies that zeros and ones in the data are censored values. The intervals matrix gives their ranges.
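To make the meaning of values and intervals concrete, here is a base R sketch on a made-up fraction vector. It only illustrates which observations count as censored and the interval each one is assigned; it is not the internal logic of gjamCensorY.

```r
# Toy fraction column with some plots at the bounds (values are illustrative)
frac <- c(0, 0.2, 1, 0.65, 0, 1, 0.4)

# Observations equal to 0 or 1 are treated as censored: the latent value is
# only known to lie below 0 or above 1. Other observations are taken as observed.
censored <- frac %in% c(0, 1)
lower <- ifelse(frac == 0, -Inf, ifelse(frac == 1, 1,   frac))
upper <- ifelse(frac == 0, 0,    ifelse(frac == 1, Inf, frac))

data.frame(frac, censored, lower, upper)
```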
Multilevel factors in xdata require some interpretation. If you have not worked with multilevel factors, refer to the R help page for factor. The interpretation of coefficients for multilevel factors depends on the reference level used to construct a contrasts matrix. Standard models in R assign contrasts that may not assume the reference level that is desired. Moreover, results may depend on the order of observations and variables in the data.
In xdata the variable soil is a multilevel factor, which includes soil types that are both common and have potentially strong effects. Here are the first few rows of xdata:
I used the name reference for a soil type to aggregate types that are rare. Factor levels that rarely occur cannot be estimated in the model.
The R function relevel allows definition of a reference level. In this case I want to compare levels to the reference soil type reference:
xdata$soil <- relevel(xdata$soil,'reference')
To avoid confusion, contrasts can be inspected as output$modelSummary$contrasts. If the row for the reference class is all zeros and the rows for other classes contain zeros and ones, then the intercept represents the reference class.
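The effect of relevel on the contrasts can be seen with a toy factor in base R. The level names other than reference are invented for illustration, and the default treatment contrasts of R are assumed.

```r
# Toy factor; the level names are illustrative, apart from 'reference'
soil <- factor(c("Ultisol", "reference", "Entisol", "reference", "Ultisol"))
levels(soil)                      # default order is sorted, not chosen by the user

soil <- relevel(soil, ref = "reference")
levels(soil)                      # 'reference' is now the first level

contrasts(soil)                   # the row for 'reference' is all zeros
X <- model.matrix(~ soil)         # intercept plus one column per non-reference level
X
```

Rows of X for the reference soil have zeros in every soil column, so their mean is carried by the intercept; the other soil coefficients are differences from that reference.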
Here is an analysis of the data, with 20 holdout plots. Predictors in xdata are winter temperature (temp), slope (u1), aspect (u2, u3), local moisture, climatic moisture deficit, and soil. Slope and aspect are represented by three variables (Clark 1990),
\[[u_{i,1}, u_{i,2}, u_{i,3}]' = [\sin(slope_{i}), \sin(slope_{i})\sin(aspect_{i}), \sin(slope_{i})\cos(aspect_{i})]'\]
As discussed above, the variable soil is a multilevel factor. Because the slope and aspect variables are products (interactions), I do not standardize them; they are listed in notStandard:
ml <- list(ng = 3000, burnin = 500, typeNames = tTypes, holdoutN = 20,
censor=censor, notStandard = c('u1','u2','u3'))
out <- gjamGibbs(~ temp + stdage + moisture*deficit + deficit*soil,
xdata = xdata, ydata = u, modelList = ml)
tnames <- colnames(u)
specColor <- rep('black', M) # highlight types
wo <- which(tnames %in% c("leafN","leafP","SLA") ) # foliar traits
wf <- grep("leaf",tnames) # leaf habit
wc <- which(tnames %in% c("woodSG","diffuse","ring") ) # wood anatomy
specColor[wc] <- 'brown'
specColor[wf] <- 'darkblue'
specColor[wo] <- 'darkgreen'
pl <- list(GRIDPLOTS = TRUE, plotAllY = T, specColor = specColor,
SMALLPLOTS = F, sigOnly=F, ncluster = 3)
fit <- gjamPlot(output = out, plotPars = pl)
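The slope and aspect predictors u1, u2, and u3 used in this analysis follow directly from their definition. A minimal sketch, assuming slope and aspect are supplied in radians (the values below are invented, and the variables in xdata were computed in advance):

```r
# Slope and aspect in radians for three illustrative plots
slope  <- c(0, 0.2, 0.5)
aspect <- c(0, pi / 2, pi)

u1 <- sin(slope)                  # slope term
u2 <- sin(slope) * sin(aspect)    # slope-by-aspect product, sine component
u3 <- sin(slope) * cos(aspect)    # slope-by-aspect product, cosine component
cbind(u1, u2, u3)
```

A flat plot (slope of zero) has all three values equal to zero, regardless of aspect.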
The model fit is interpreted in the same way as other gjam analyses. Note that specColor is used to highlight different types of traits in the posterior plots for values in coefficient matrix \(\mathbf{A}\). Parameter estimates are contained in modelSummary:
out$modelSummary$betaMu # Q by M coefficient matrix alpha
out$modelSummary$betaSe # Q by M coefficient std errors
out$modelSummary$sigMu # M by M covariance matrix omega
out$modelSummary$sigSe # M by M covariance std errors
The output list contains a large number of diagnostics, and output$modelSummary holds the parameter summaries. Both are described in the help pages.
The object fit generated by gjamPlot holds coefficients that are summarized in a table:
fit$betaEstimates[1:5,] # Q by M coefficient matrix alpha
The objects in out that contain the word traits are empty, because gjam does not know that responses are traits. These objects are used when traits are modeled as predictive distributions, discussed next.
Consider the interactions and indirect effects for this model. If there are no interactions in the formula passed to gjamGibbs, then there will be no interactions to estimate with the function gjamIIE (there will still be indirect effects, discussed below). If there are interactions in the formula, I must specify the values for main effects that are involved in these interactions to be used for estimating their effects on predictions. For example, consider a model containing the interaction between predictors \(q\) and \(q'\),
\[E[y_{s}] = \cdots + \beta_{q,s}x_{q} + \beta_{q',s}x_{q'} + \beta_{qq',s}x_{q}x_{q'} + \cdots\]
The ‘effect’ of predictor \(x_{q}\) on \(y_{s}\) is the derivative
\[\frac{dy_{s}}{dx_{q}} = \beta_{q,s} + \beta_{qq',s}x_{q'}\]
which depends not on \(x_{q}\), but rather on \(x_{q'}\). So if I want to know how interactions affect the response, I have to decide on values for all of the predictors that are involved in interactions. These values are passed to gjamIIE in xvector. The default has sdScaleX = F, which means that effects can be compared on the basis of variation in \(\mathbf{X}\).
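A small numerical example shows why the value of \(x_{q'}\) must be chosen. The coefficients below are invented for illustration and are not estimates from this model.

```r
# Illustrative coefficients for one response (not estimates from the model above)
b_q   <- 0.4    # main effect of x_q
b_qp  <- -0.2   # main effect of x_q'
b_int <- 0.3    # interaction coefficient for x_q * x_q'

# Effect of x_q: derivative of the mean with respect to x_q, evaluated at x_q'
effect_q <- function(xqp) b_q + b_int * xqp

effect_q(-1)   # x_q' one standard deviation below its mean
effect_q(1)    # x_q' one standard deviation above its mean

# Check against a finite difference of the mean function
mu <- function(xq, xqp) b_q * xq + b_qp * xqp + b_int * xq * xqp
fd <- (mu(0.5 + 1e-6, 1) - mu(0.5, 1)) / 1e-6
fd
```

With these values the effect of \(x_{q}\) is 0.1 when \(x_{q'} = -1\) and 0.7 when \(x_{q'} = 1\): the same predictor can matter little or a lot depending on the variable it interacts with.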
In this example interactions involve moisture, deficit, and the multilevel factor soil, as specified in the formula passed to gjamGibbs. The first row of the design matrix is used, with moisture and deficit set to -1 or +1 standard deviation to compare dry and wet sites in a dry climate:
xdrydry <- xwetdry <- out$x[1,]
xdrydry['moisture'] <- xdrydry['deficit'] <- -1
xwetdry['moisture'] <- 1
xwetdry['deficit'] <- -1
The first observation is from the reference soil level reference, so all other soil classes are zero. Here is a plot of main effects and interactions for deciduous and evergreen traits:
par(mfrow=c(2,2), bty='n', oma = c(0,0,0,0),
mar = c(3,2,2,1), tcl = -0.5, mgp = c(3,1,0), family='')
fit1 <- gjamIIE(output = out, xvector = xdrydry)
fit2 <- gjamIIE(output = out, xvector = xwetdry)
gjamIIEplot(fit1, response = 'leafbroaddeciduous',
effectMu = c('main','int'),
effectSd = c('main','int'), legLoc = 'bottomleft',
ylim=c(-.31,.3))
title('deciduous')
gjamIIEplot(fit1, response = 'leafneedleevergreen',
effectMu = c('main','int'),
effectSd = c('main','int'), legLoc = 'bottomleft',
ylim=c(-.3,.3))
title('evergreen')
gjamIIEplot(fit2, response = 'leafbroaddeciduous',
effectMu = c('main','int'),
effectSd = c('main','int'), legLoc = 'bottomleft',
ylim=c(-.3,.3))
gjamIIEplot(fit2, response = 'leafneedleevergreen',
effectMu = c('main','int'),
effectSd = c('main','int'), legLoc = 'bottomleft',
ylim=c(-.3,.3))
The main effects plotted in the graphs do not depend on the values in xvector. Although this observation is taken from the reference soil, the plot shows the main effects that would be obtained if it were on the different soils included in the model. The interactions show how the effect of each predictor is modified by interactions with other variables. Again, the interactions from each predictor do not depend on values for the predictor itself, but rather on the other variables with which it interacts. For example, the interaction effect of soilUltKan on the broaddeciduous trait is positive on dry sites in dry climates (top left). Combined with a negative main effect, this means that deciduous trees tend to be more abundant on moist sites in this soil type. Its main effect on leafneedleevergreen is positive, but less so on moist sites in dry climates (bottom right).
The indirect effects come from the effects of responses. This example shows indirect effects for foliar N and P that come through broaddeciduous leaf habit:
xvector <- out$x[1,]
par(mfrow=c(2,1), bty='n', oma = c(0,0,0,0),
mar = c(3,2,2,1), tcl = -0.5, mgp = c(3,1,0), family='')
omitY <- colnames(u)[colnames(u) != 'leafbroaddeciduous'] # omit all but deciduous
fit <- gjamIIE(out, xvector, omitY = omitY) # pass omitY so it is actually applied
gjamIIEplot(fit, response = 'leafP', effectMu = c('main','ind'),
effectSd = c('main','ind'), legLoc = 'topright',
ylim=c(-.6,.6))
title('foliar P')
gjamIIEplot(fit, response = 'leafN', effectMu = c('main','ind'),
effectSd = c('main','ind'), legLoc = 'bottomright',
ylim=c(-.6,.6))
title('foliar N')
There will always be indirect effects, because they come through the covariance matrix.
The predictive trait model (PTM) fits species abundance data, then predicts traits. This approach has a number of advantages over the TRM, discussed in Clark (2016). The response is the \(n \times S\) matrix \(\mathbf{Y}\), which could be counts, biomass, and so forth. On the latent scale the observation is represented by a composition vector,
\[E[\mathbf{y}_{i}] = \mathbf{B}'\mathbf{x}_{i}\]
\[\mathbf{w}_{i} \sim MVN(\mathbf{B}'\mathbf{x}_{i},\boldsymbol{\Sigma})\]
where \(\mathbf{B}\) is the \(Q \times S\) matrix of coefficients, and \(\boldsymbol{\Sigma}\) is the \(S \times S\) residual covariance. A predictive distribution on the trait scale is obtained as a variable change,
\[\mathbf{A} = \mathbf{B}\mathbf{T}\]
\[\boldsymbol{\Omega} = \mathbf{T}'\boldsymbol{\Sigma}\mathbf{T}\]
\[\mathbf{u}_{i} = \mathbf{T}'\mathbf{w}_{i}\]
where \(\mathbf{T}\) is the \(S \times M\) matrix of trait values for each species, \(\mathbf{A}\) is the \(Q \times M\) matrix of coefficients, and \(\boldsymbol{\Omega}\) is the \(M \times M\) residual covariance (Fig. 2).
Figure 2. The predictive trait model fits species data and predicts traits using the species-by-trait matrix T, contained in the object specByTrait. The white boxes are fitted, with trait matrix U, and coefficient matrix \(\boldsymbol{\alpha'}\) obtained by variable change.
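The variable change is ordinary matrix algebra, which can be verified on random matrices. The sketch below uses invented sizes and values in base R; in gjam the same change of variables is applied to the fitted parameters.

```r
set.seed(2)
Q <- 3; S <- 5; M <- 2   # illustrative sizes

B     <- matrix(rnorm(Q * S), Q, S)          # Q x S species coefficients
Tmat  <- matrix(rnorm(S * M), S, M)          # S x M species-by-trait matrix T
L     <- matrix(rnorm(S * S), S, S)
Sigma <- crossprod(L) + diag(S)              # S x S positive-definite covariance

A     <- B %*% Tmat                          # Q x M trait coefficients
Omega <- t(Tmat) %*% Sigma %*% Tmat          # M x M trait covariance

# For one observation, the trait-scale mean is A'x = T'(B'x)
x <- c(1, rnorm(Q - 1))
meanW <- t(B) %*% x                          # length-S latent mean
meanU <- t(Tmat) %*% meanW                   # length-M trait mean
all.equal(drop(meanU), drop(t(A) %*% x))
```

The identity holds because \(\mathbf{T}'\mathbf{B}'\mathbf{x}_{i} = (\mathbf{B}\mathbf{T})'\mathbf{x}_{i} = \mathbf{A}'\mathbf{x}_{i}\), and the covariance of \(\mathbf{T}'\mathbf{w}_{i}\) is \(\mathbf{T}'\boldsymbol{\Sigma}\mathbf{T}\).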
The PTM begins by fitting pbys, followed by predicting plotByTrait. This requires a traitList, which defines the objects needed for prediction. The species are weights, so they should be modeled as composition data, either 'FC' (rows sum to 1) or 'CC'. Here the model is fitted with dimension reduction:
tl <- list(plotByTrait = u, traitTypes = tTypes, specByTrait = specByTrait)
rl <- list(r = 8, N = 20)
ml <- list(ng = 1000, burnin = 200, typeNames = 'CC', holdoutN = 20,
traitList = tl, reductList = rl)
out <- gjamGibbs(~ temp + stdage + deficit*soil, xdata = xdata,
ydata = pbys, modelList = ml)
S <- nrow(specByTrait)
specColor <- rep('black',S)
wr <- which(specByTrait[,'ring'] == 1) # ring porous
wb <- which(specByTrait[,'leafneedleevergreen'] == 1) # evergreen
ws <- which(specByTrait[,'shade'] >= 4) # shade tolerant
specColor[wr] <- 'brown'
specColor[ws] <- 'black'
specColor[wb] <- 'darkgreen'
par(family = '')
pl <- list(width=4, height=4, corLines=F, SMALLPLOTS=F,GRIDPLOTS=T,
specColor = specColor, ncluster = 6)
fit <- gjamPlot(output = out, pl)
Output is interpreted as previously, now with coefficients \(\mathbf{B}\) and covariance \(\boldsymbol{\Sigma}\). gjamPlot generates an additional plot with trait predictions. Parameter values are here:
out$modelSummary$betaTraitMu # Q by M coefficient matrix alpha
out$modelSummary$betaTraitSe # Q by M coefficient std errors
out$modelSummary$sigmaTraitMu # M by M covariance matrix omega
out$modelSummary$sigmaTraitSe # M by M covariance std errors
Trait predictive distributions are summarized here:
out$modelSummary$tMu[1:5,] # n by M predictive means
out$modelSummary$tSd[1:5,] # n by M predictive std errors
The groupings of species in terms of their similar responses to the environment (the ematrix) are here, showing only the 4 most frequent species in each of the ncluster = 6 groups:
fit$eComs[,1:4]
Additional quantities can be predicted from the output using the MCMC output in the list out$chains.
When using gjam in predictive trait mode, remember the following:
- typeNames for ydata should be composition, either CC or FC
- nrow(plotByTrait) must equal nrow(ydata)
- ncol(plotByTrait) must equal length(traitTypes)
- rownames(specByTrait) must match colnames(ydata)
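These requirements can be checked before fitting. The helper below is hypothetical (checkTraitInputs is not part of gjam), and it assumes that "match" means identical names in the same order.

```r
# Hypothetical helper (not part of gjam): checks the dimension rules listed above
checkTraitInputs <- function(ydata, plotByTrait, traitTypes, specByTrait) {
  stopifnot(
    nrow(plotByTrait) == nrow(ydata),
    ncol(plotByTrait) == length(traitTypes),
    !is.null(rownames(specByTrait)),
    identical(rownames(specByTrait), colnames(ydata))  # assumes same order
  )
  invisible(TRUE)
}

# Toy inputs with made-up names and sizes
ydata <- matrix(1, 4, 3, dimnames = list(NULL, c("spA", "spB", "spC")))
specByTrait <- matrix(0, 3, 2, dimnames = list(colnames(ydata), c("t1", "t2")))
plotByTrait <- matrix(0, 4, 2, dimnames = list(NULL, c("t1", "t2")))
traitTypes  <- c("CON", "CON")

checkTraitInputs(ydata, plotByTrait, traitTypes, specByTrait)
```

The call stops with an error naming the first rule that fails, for example when the columns of ydata are in a different order from the rows of specByTrait.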
I thank Benedict Bachelot for review of the code.
Clark, J.S. 2016. Why species tell us more about traits than traits tell us about species: Predictive models. Ecology, 97, 1979-1993.
Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecological Monographs, in press.