*Consider again Petrus’s hunting spider data of Exercise 15.2. He
established (e.g. in Code Box 15.7) that different species respond
differently to environmental gradients, especially so for soil dryness,
herb layer and reflectance. He would like to know what the main types of
environmental response were, across species.*

*What approach should he use to answer this question?*

He wants to understand *how* taxa vary in environmental
response, so he could use a Species Archetype Model to classify species
according to type of environmental response.

Run the below lines to install `ecomix`

:

```
library(devtools)
install_github("skiptoniam/ecomix")
```

Code to fit `ecomix`

will only be run if it is installed
(the if statement is included because CRAN policies require vignettes to
be able to be run only using packages that can be installed from
CRAN)

```
= requireNamespace("ecomix", quietly = TRUE)
is_ecomix_installed library(mvabund)
data(spider)
=spider$x
SpiderDF$abund=as.matrix(spider$abund)
SpiderDF= abund ~ soil.dry + bare.sand + fallen.leaves + moss +
spiderFormula + reflection
herb.layer if(is_ecomix_installed)
{= ecomix::species_mix(spiderFormula, data=SpiderDF, family="negative.binomial", nArchetypes=2, control=list(init_method='kmeans',ecm_refit=5, ecm_steps=2) )
ft_Mix coef(ft_Mix)$beta
}#> SAM modelling
#> There are 2 archetypes to group the species into
#> There are 28 site observations for 12 species
#> The model for the archetype (grouping) is ~soil.dry + bare.sand + fallen.leaves + moss + herb.layer + reflection
#> The model for the species is ~1
#> You are implementing a negative.binomial Species Archetype Model.
#> Using ECM algorithm to find starting values; using 5 refits
#> ECM restart 1 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -1041.731 | Ratio loglik 0
#> Iteration: 2 | New loglik -834.997 | Ratio loglik 0.801548
#> ECM restart 2 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -1041.731 | Ratio loglik 0
#> Iteration: 2 | New loglik -834.997 | Ratio loglik 0.801548
#> ECM restart 3 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -1041.731 | Ratio loglik 0
#> Iteration: 2 | New loglik -834.997 | Ratio loglik 0.801548
#> ECM restart 4 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -1041.731 | Ratio loglik 0
#> Iteration: 2 | New loglik -834.997 | Ratio loglik 0.801548
#> ECM restart 5 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -1041.731 | Ratio loglik 0
#> Iteration: 2 | New loglik -834.997 | Ratio loglik 0.801548
#> initial value 834.997322
#> iter 10 value 822.933991
#> iter 20 value 818.054221
#> iter 30 value 816.602567
#> iter 40 value 812.844421
#> iter 50 value 793.214165
#> iter 60 value 791.057454
#> iter 70 value 786.683339
#> iter 80 value 778.413169
#> iter 90 value 777.956005
#> final value 777.933344
#> converged
#> soil.dry bare.sand fallen.leaves moss herb.layer reflection
#> Archetype1 1.5629184 -0.05470042 -0.2784744 -0.1449658 0.7081518 -0.4730736
#> Archetype2 -0.1069941 0.22420728 -0.2855582 0.4186874 0.5894873 0.4759734
```

*Which predictors vary the most across archetypes (hence across
taxa)? Is this what you saw in Code Box 15.7?*

`soil.dry`

had the biggest change, jumping from
`2.15`

to `0.10`

. The next biggest changes were in
`herb.layer`

and `reflection`

. This is pretty much
what was happening in Code Box 15.7 as well.

```
if(is_ecomix_installed)
plot(ft_Mix, type="link")
#> Infinite residuals removed from residual plots: 82 in total.
```

*Do model assumptions seem reasonable here?*

This generally looks good, no fan-shape in residual vs fits plot, points close(ish) to a straight line on normal quanitle plot.

*Consider again Anthony’s revegetation study. We would like to
better characterise how different invertebrate orders respond to the
revegetation treatment. Fit a Species Archetype Model with two
archetypes.*

```
library(ecostats)
data(reveg)
=data.frame(reveg)
revegDF$abundMat=as.matrix(reveg$abund)
revegDFif(is_ecomix_installed)
{= ecomix::species_mix(abundMat~treatment, data=revegDF, family="negative.binomial", nArchetypes=2, control=list(init_method='kmeans',ecm_refit=5, ecm_steps=2) )
ft_revegMix coef(ft_revegMix)$beta
}#> SAM modelling
#> There are 2 archetypes to group the species into
#> There are 10 site observations for 24 species
#> The model for the archetype (grouping) is ~treatment
#> The model for the species is ~1
#> You are implementing a negative.binomial Species Archetype Model.
#> Using ECM algorithm to find starting values; using 5 refits
#> ECM restart 1 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -932.547 | Ratio loglik 0
#> Iteration: 2 | New loglik -745.016 | Ratio loglik 0.798905
#> ECM restart 2 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -932.547 | Ratio loglik 0
#> Iteration: 2 | New loglik -745.016 | Ratio loglik 0.798905
#> ECM restart 3 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -932.547 | Ratio loglik 0
#> Iteration: 2 | New loglik -745.016 | Ratio loglik 0.798905
#> ECM restart 4 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -932.547 | Ratio loglik 0
#> Iteration: 2 | New loglik -745.016 | Ratio loglik 0.798905
#> ECM restart 5 of 5
#> Initialising starting values
#> Initial groups parameter estimates by K-means clustering
#> Iteration: 1 | New loglik -932.547 | Ratio loglik 0
#> Iteration: 2 | New loglik -745.016 | Ratio loglik 0.798905
#> initial value 745.015688
#> final value 745.015688
#> converged
#> treatmentImpact
#> Archetype1 5.350399
#> Archetype2 1.441362
```

*Explain what the main two types of response to revegetation were,
across invertebrate taxa.*

Oh that’s odd – the estimated trend is almost identical for these two groups.

*Look at the posterior probabilities for each Order, identify a
few taxa that characterise each response type, and plot the raw data for
these taxa. Does the Species Archetype Model seem to capture the main
trends in response to revegetation treatment?*

This doesn’t seem worthwhile when there are no differences in response across these archetypes!!

*This code chunk has not been run, as it takes a minute or so to
complete*

```
if(is_ecomix_installed)
{=rep(2:6,3)
nClust= rep(NA, length(nClust))
bics for(iClust in 1:length(nClust))
{= ecomix::species_mix(spiderFormula, data=SpiderDF,
fti_Mix family="negative.binomial", nArchetypes=nClust[iClust],
control=list(init_method='kmeans',ecm_refit=5, ecm_steps=2))
= BIC(fti_Mix)
bics[iClust]
}plot(bics~nClust, ylab="BIC", xlab="# archetypes")
}
```

*How many archetypes would you use?*

I guess I’d go for four archetypes, that looks like the winner here.

*Note that repeat runs didn’t always give the same BIC. Why did
this happen?*

As in Maths Box 16.1, mixture model likelihoods can be messy with multiple maxima. Hence it’s a good idea to run multiple times and keep the best solution.

*Consider again a Species Archetype Model for Anthony’s
revegetation data. In Exercise 16.2 we assumed there were only two
archetypes. How many archetypes should be used for Anthony’s data?
Answer this question using the model selection technique of your
choice.*

OK I’m going for BIC as above.

*This code chunk has not been run, as it takes a minute or so to
complete*

```
if(is_ecomix_installed)
{=rep(1:5,3)
nClust= rep(NA, length(nClust))
bics for(iClust in 1:length(nClust))
{= ecomix::species_mix(abundMat~treatment, data=revegDF,
fti_Mix family="negative.binomial", nArchetypes=nClust[iClust])
= BIC(fti_Mix)
bics[iClust]
}plot(bics~nClust, ylab="BIC", xlab="# archetypes")
}
```

Well it looks like there should be two archetypes (presumably, one of these is increasing with revegetation, the other not).

*Consider again Petrus’s hunting spider data of Exercise 15.2… He
wants to know the extent to which species traits explain interspecific
variation in environmental response. What approach should he use to
answer this question?*

He wants to know *why* species vary in their environmental
response, so a good idea would be to include predictors on species
(species traits) and to use a fourth corner model to understand the
extent to which interspecific variation in environmental response can be
explained by these traits.

```
library(gllvm)
#>
#> Attaching package: 'gllvm'
#> The following objects are masked from 'package:VGAM':
#>
#> AICc, nobs, predict, vcov
#> The following objects are masked from 'package:stats4':
#>
#> nobs, vcov
#> The following object is masked from 'package:vegan':
#>
#> ordiplot
#> The following object is masked from 'package:mvabund':
#>
#> coefplot
#> The following objects are masked from 'package:stats':
#>
#> nobs, predict, simulate, vcov
data(spider)
= spider$x[,c("soil.dry","herb.layer")]
X = gllvm(spider$abund, X, spider$trait, randomX=~soil.dry+herb.layer, family="negative.binomial")
ft_trait logLik(ft_trait)
#> 'log Lik.' -711.2906 (df=60)
library(lattice)
= max( abs(ft_trait$fourth.corner) )
a = colorRampPalette(c("blue","white","red"))
colort .4th = levelplot(ft_trait$fourth.corner, col.regions=colort(100),
plotat=seq(-a, a, length=100), scales = list( x=list(rot = 45)) )
print(plot.4th)
```

`coefplot(ft_trait)`

*Heloise collected counts of 41 species of ants at 30 sites across
south-eastern Australia, stored as antTraits in the
mvabund package. She also recorded five environmental
variables at each site, and five functional traits for each species. She
would like to know if these traits explain why ant species differ in
environmental response. Fit a fourth corner model to this
dataset.*

```
data(antTraits)
= as.matrix(antTraits$env) envMat
```

*(The below code chunk takes about a minute to run.)*

```
= gllvm(antTraits$abund, envMat, antTraits$traits, randomX=~envMat, family="negative.binomial", starting.val="zero")
ft_anttrait logLik(ft_anttrait)
#> 'log Lik.' -1919.352 (df=223)
library(lattice)
= max( abs(ft_anttrait$fourth.corner) )
a = colorRampPalette(c("blue","white","red"))
colort = levelplot(t(as.matrix(ft_anttrait$fourth.corner)),
plot_4th col.regions=colort(100), at=seq(-a, a, length=100),
scales = list( x= list(rot = 45)) )
print(plot_4th)
```

`coefplot(ft_anttrait)`

This doesn’t always produce a `coefplot`

because of
infinite values – this is symptomatic of non-convergence. Make sure you
end up with a `logLik`

value of about `-2300`

or
better, if this doesn’t happen then re-run. I ended up using
`starting.val="zero"`

which usually gave better fits (it
doesn’t always but did this time around).

*What are the key traits that capture why (and how) ants differ in
environment response?*

A big one seems to be `Canopy.cover:polymorphism`

.
`Canopy.cover:Pilosity`

also seemed to be very negative in
most runs, but with a fairly large standard error.

*Don’t forget to check your P’s and Q’s!*

*The below code chunk has not been evaluated because it is a
function of the previous one*

```
par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,1,1))
plot(ft_anttrait, which=1:2)
```

This looks good to me. The points on the normal quantile plot don’t all stay in the envelope (which is pointwise not global, so is tighter than it should be), but the violations we have are not of the alarming kind – residuals are smaller in magnitude than expected, rather than too large. This is often symptomatic of overfitting rather than lack-of-fit – the model does have a lot of parameters in it!

```
= dim(spider$abund)[2]
nVars = spider$trait
newTraits # set factors not of interest here to be a constant value
$length= mean(spider$trait$length) #set length to its mean
newTraits$colour=factor(rep(levels(spider$trait$colour)[1],nVars),
newTraitslevels=levels(spider$trait$colour)) #set to first level of factor
# set starting rows of ’marks’ to take all possible values
= nlevels(spider$trait$marks)
nMarks $marks[1:nMarks]=levels(spider$trait$marks)
newTraits# create a new env dataset where the only thing that varies is soil:
= spider$x[1:2,c("soil.dry","herb.layer")]
newEnv "soil.dry"]=range(scale(spider$x[,"soil.dry"]))
newEnv[,"herb.layer"]=0
newEnv[,#make predictions and plot:
= predict(ft_trait,newX=newEnv,newTR=newTraits,type="response", level=0)
newPreds matplot(newEnv[,1], newPreds[,1:nMarks],type="l", log="y")
legend("topright",levels(newTraits$marks),lty=1:nMarks,col=1:nMarks)
```

*Recall that fourth corner model you fitted to Heloise’s data in
Exercise 16.5. For an interaction of your choice, construct a plot to
visualise how the environment-abundance association changes for
different values of the functional trait.*

I’m going with `Canopy.cover:Polymorphism`

, as the one
with the biggest (and only significant) interaction term:

*The below code chunk has not been evaluated because it is a
function of the previous one*

```
= dim(antTraits$abund)[2]
nVars = antTraits$traits
newTraits #set length vars to their mean
$Femur.length= mean(antTraits$traits$Femur.length)
newTraits$Webers.length= mean(antTraits$traits$Webers.length)
newTraits# set factors not of interest here to be a constant value
$No.spines=rep(0,nVars)
newTraits$Pilosity = factor(rep(levels(antTraits$traits$Pilosity)[1],nVars),
newTraits# set first few levels of Polymorphism to all possible values
levels=levels(antTraits$traits$Pilosity)) #set to first level of factor
= nlevels(antTraits$traits$Polymorphism)
nPoly $Polymorphism[1:nPoly]=levels(antTraits$traits$Polymorphism)
newTraits# create a new env dataset where the only thing that varies is Canopy.cover:
= antTraits$env
newEnv $Canopy.cover=range(scale(newEnv$Canopy.cover))
newEnv$Bare.ground = mean(newEnv$Bare.ground)
newEnv$Shrub.cover = mean(newEnv$Shrub.cover)
newEnv$Volume.lying.CWD = mean(newEnv$Volume.lying.CWD)
newEnv$Feral.mammal.dung = mean(newEnv$Feral.mammal.dung)
newEnv=as.matrix(newEnv)
newEnv#make predictions and plot:
= predict(ft_anttrait,newX=newEnv,newTR=newTraits,type="response", level=0)
newPreds matplot(newEnv[,"Canopy.cover"], newPreds[,1:nPoly],type="l", ylab="Mean abundance", xlab="Canopy cover", log="y")
legend("topright",levels(newTraits$Polymorphism),lty=1:nPoly,col=1:nPoly)
```

We can see there that when there is no polymorphism, abundance is relatively high with low canopy cover.

```
= gllvm(spider$abund, X, family="negative.binomial")
ft_spp = gllvm(spider$abund, X, spider$trait, family="negative.binomial")
ft_trait = gllvm(spider$abund, X, spider$trait, family="negative.binomial", formula=~soil.dry+herb.layer)
ft_main = anova(ft_main, ft_trait, ft_spp)
an_spider4th #> Model 1 : ~ soil.dry + herb.layer
#> Model 2 : y ~ soil.dry + herb.layer + (soil.dry + herb.layer):(length + colour + marks)
#> Model 3 : y ~ X
an_spider4th#> Resid.Df D Df.diff P.value
#> 1 287 0.00000 0
#> 2 279 40.12996 8 3.02999e-06
#> 3 265 76.84430 14 1.08408e-10
$D[2]/sum(an_spider4th$D)
an_spider4th#> [1] 0.3430666
```

*Consider again Heloise’s ant data. What proportion of the
variation in environmental response is explained by the measured species
traits?*

```
= gllvm(antTraits$abund, envMat, family="negative.binomial")
ft_antspp = gllvm(antTraits$abund, envMat, antTraits$traits, family="negative.binomial")
ft_anttrait =as.formula(paste("~", paste(colnames(envMat),collapse="+")))
mainFormula= gllvm(antTraits$abund, envMat, antTraits$traits, family="negative.binomial", formula=mainFormula)
ft_antmain = anova(ft_antmain, ft_anttrait, ft_antspp)
an_ant4th #> Warning in anova.gllvm(ft_antmain, ft_anttrait, ft_antspp): This test was not designed for tests with a df.diff larger than 20 so the P-value should be treated as approximate.
#> Model 1 : ~ Bare.ground + Canopy.cover + Shrub.cover + Volume.lying.CWD + Feral.mammal.dung
#> Model 2 : y ~ Bare.ground + Canopy.cover + Shrub.cover + Volume.lying.CWD + Feral.mammal.dung + (Bare.ground + Canopy.cover + Shrub.cover + Volume.lying.CWD + Feral.mammal.dung):(Femur.length + No.spines + Pilosity + Polymorphism + Webers.length)
#> Model 3 : y ~ X
an_ant4th#> Resid.Df D Df.diff P.value
#> 1 1062 0.00000 0
#> 2 1022 75.73794 40 0.000548758
#> 3 862 271.85048 160 8.23701e-08
$D[2]/sum(an_ant4th$D)
an_ant4th#> [1] 0.2178955
```

So it looks like about a quarter of interspecific variation can be explained by traits, which is not super, especially considering that the trait model uses a quarter as many parameters. This suggests we are not really onto a winner here…

#remove this chunk once gllvm has been updated on CRAN: