It is often the case that a factor only makes sense in a subset of a dataset (i.e. for some observations a factor may simply not be meaningful), or that, with observational datasets, there are no observations in some levels of an interaction term. There are also cases where a random effects grouping factor is only applicable in a subset of the data, and it is desirable to model the noise introduced by the repeated measures on the group members within the subset of the data where the repeated measures exist. The nauf package allows unordered factors and random effects grouping factors to be coded as NA in the subsets of the data where they are not applicable or otherwise not contrastive. Sum contrasts are used for all unordered factors (using named_contr_sum in the standardize package), and then NA values are set to 0. This allows all of the data to be modeled together without creating collinearity or making the output difficult to interpret. It is highly recommended that regression variables be put on the same scale with the standardize function in the standardize package prior to using nauf functions (though this is not required for the functions to work). The “Using the standardize package” vignette also provides useful information on the difference between unordered and ordered factors.
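The coding described above can be pictured with a small base R sketch. It does not call named_contr_sum or any nauf function; the toy factor f and the way the design column is built are assumptions made purely for illustration of "sum contrasts, then NA set to 0".

```r
# Toy unordered factor that is not applicable (NA) for some observations.
f <- factor(c("Voiced", "Voiceless", NA, "Voiced", NA))

# Sum contrasts from base R, with columns named after the levels they code
# (imitating the naming idea only, not calling named_contr_sum).
k <- nlevels(f)
cmat <- contr.sum(k)
dimnames(cmat) <- list(levels(f), levels(f)[-k])

# Start from an all-zero design block, then fill in only the rows where the
# factor is observed; rows where f is NA therefore stay at 0.
x <- matrix(0, nrow = length(f), ncol = ncol(cmat),
            dimnames = list(NULL, colnames(cmat)))
ok <- !is.na(f)
x[ok, ] <- cmat[as.integer(f[ok]), , drop = FALSE]
x
```

Because the NA rows contribute 0 on every contrast column, the coefficient for the factor has no influence on the fitted values of those observations.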
The plosives dataset (Eager, 2017) included in the nauf package contains measures of plosive strength for instances of intervocalic Spanish /p/, /t/, /k/, /b/, /d/ and /g/ from speakers from three dialects. For this first example we will focus on the voiceless plosives /ptk/ from the 30 speakers in the Cuzco dialect. The dataset contains a combination of experimental data (with exactly repeated measures on items from a read speech task) and observational data (without exactly repeated measures; from interviews with the same speakers). For the spontaneous speech (as indicated by the logical variable spont) the item factor is coded as NA.
library(nauf)
summary(plosives)
#> cdur vdur vpct intdiff
#> Min. : 0.0 Min. : 0.00 Min. :0.0000 Min. : 0.000
#> 1st Qu.: 93.0 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 5.951
#> Median :127.0 Median : 0.00 Median :0.0000 Median :22.452
#> Mean :123.9 Mean : 39.21 Mean :0.2437 Mean :22.201
#> 3rd Qu.:161.0 3rd Qu.: 81.00 3rd Qu.:0.5213 3rd Qu.:36.277
#> Max. :298.0 Max. :212.00 Max. :0.9079 Max. :73.952
#>
#> intvel voicing place stress
#> Min. :0.0000 Voiced :2694 Bilabial:1752 Post-Tonic:1629
#> 1st Qu.:0.1974 Voiceless:2587 Dental :1862 Tonic :2007
#> Median :0.7440 Velar :1667 Unstressed:1645
#> Mean :0.8341
#> 3rd Qu.:1.2870
#> Max. :5.5594
#>
#> prevowel posvowel wordpos wordfreq speechrate
#> a:1675 a:1942 Initial:1166 Min. : 1 Min. : 1.000
#> e:1237 e: 900 Medial :4115 1st Qu.: 1057 1st Qu.: 4.115
#> i:1280 i: 782 Median : 4579 Median : 5.000
#> o: 708 o:1217 Mean : 20189 Mean : 5.236
#> u: 381 u: 440 3rd Qu.: 20368 3rd Qu.: 6.000
#> Max. :247340 Max. :10.000
#>
#> spont dialect sex age
#> Mode :logical Cuzco :2872 Female:2599 Older :1362
#> FALSE:2019 Lima : 862 Male :2682 Younger:3919
#> TRUE :3262 Valladolid:1547
#> NA's :0
#>
#>
#>
#> ed ling speaker item
#> Secondary :1445 Bilingual :1379 s34 : 133 i01 : 38
#> University:3836 Monolingual:3902 s09 : 124 i02 : 38
#> s32 : 124 i03 : 38
#> s22 : 120 i04 : 38
#> s38 : 116 i05 : 38
#> s25 : 114 (Other):1829
#> (Other):4550 NA's :3262
dat <- droplevels(subset(plosives, dialect == "Cuzco" & voicing == "Voiceless"))
item_nas <- is.na(dat$item)
xtabs(~ item_nas + dat$spont)
#> dat$spont
#> item_nas FALSE TRUE
#> FALSE 795 0
#> TRUE 0 620
For our example, we want to model the voiceless duration of the plosives as a function of place of articulation, stress, and whether or not the speech was spontaneous. When modeling the read speech data (spont = FALSE), we want to account for noise introduced by the repeated measures on item, but we can’t apply this random effects structure to the interview data (spont = TRUE). In addition to this, we want to account for the noise introduced by the individual speakers. Rather than leaving out the item effects to analyze all the data together, or keeping the item effects and analyzing the read and spontaneous speech separately, we can model both subsets together and have the item effects apply only within the read speech data using nauf. We just need to have item coded as NA when it is not applicable, which it already is. Then the nauf_lmer function takes care of the rest.
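One way to picture a grouping factor that applies only in part of the data is through the indicator matrix for its random intercepts. The base R sketch below is an assumption about the general idea, not a description of how nauf_lmer is implemented, and the toy data frame is hypothetical.

```r
# Toy data: items exist only for read speech (spont = FALSE).
toy <- data.frame(
  spont = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  item  = factor(c("i01", "i02", "i01", NA, NA))
)

# One indicator column per item. Rows where item is NA are left all zero, so
# no item intercept is added to the fitted values of those observations.
Z <- matrix(0, nrow(toy), nlevels(toy$item),
            dimnames = list(NULL, levels(toy$item)))
ok <- !is.na(toy$item)
Z[cbind(which(ok), as.integer(toy$item[ok]))] <- 1
Z

# 1 where an item effect applies, 0 where it does not.
rowSums(Z)
```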
sdat <- standardize(vdur ~ spont + place + stress +
  (1 + spont + place + stress | speaker) + (1 | item), dat)
mod <- nauf_lmer(sdat$formula, sdat$data)
summary(mod)
#> Linear mixed model fit by REML ['nauf.lmerMod']
#> Formula: vdur ~ spont + place + stress + (1 + spont + place + stress |
#> speaker) + (1 | item)
#> Data: sdat$data
#>
#> REML criterion at convergence: 3543
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -5.1474 -0.5000 0.0235 0.5397 4.6395
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> speaker (Intercept) 0.215923 0.46468
#> spontTRUE 0.002052 0.04530 -0.09
#> placeBilabial 0.016284 0.12761 -0.44 0.49
#> placeDental 0.008709 0.09332 0.01 -0.93 -0.14
#> stressPost-Tonic 0.024526 0.15661 0.35 -0.92 -0.78 0.73
#> stressTonic 0.020310 0.14251 -0.41 0.51 1.00 -0.16 -0.79
#> item (Intercept) 0.109717 0.33123
#> Residual 0.621677 0.78846
#> Number of obs: 1415, groups: speaker, 30; item, 27
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) -0.011606 0.093282 -0.124
#> spontTRUE -0.182422 0.039642 -4.602
#> placeBilabial -0.014944 0.048795 -0.306
#> placeDental 0.001387 0.044381 0.031
#> stressPost-Tonic 0.098136 0.054977 1.785
#> stressTonic 0.143076 0.047996 2.981
#>
#> Correlation of Fixed Effects:
#> (Intr) spTRUE plcBlb plcDnt strP-T
#> spontTRUE -0.266
#> placeBilabl -0.176 0.080
#> placeDental -0.008 -0.102 -0.450
#> strssPst-Tn 0.206 -0.005 -0.162 0.058
#> stressTonic -0.235 -0.021 0.292 -0.021 -0.613
This way, we are making use of all of the information in the dataset. We can obtain a more principled statistical test for the spont factor, and also get better information about the other fixed effects and the individual speakers, since the same random effects for speaker apply in both subsets. We can obtain Type III tests using anova (here with method = "S" to request the Satterthwaite approximation for the denominator degrees of freedom in the F tests).
anova(mod, method = "S")
#> Mixed Model Anova Table (Type III tests, S-method)
#>
#> Model: vdur ~ spont + place + stress + (1 + spont + place + stress |
#> Model: speaker) + (1 | item)
#> Data: $
#> Data: sdat
#> Data: data
#> num Df den Df F Pr(>F)
#> spont 1 36.900 21.176 4.817e-05 ***
#> place 2 80.249 0.054 0.9475
#> stress 2 74.356 14.893 3.635e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that stress is significant, and can then get predicted marginal means (often called least-squares means) and pairwise comparisons for its levels using nauf_ref.grid to create a reference grid, and then calling nauf_pmmeans.
rg <- nauf_ref.grid(mod)
nauf_pmmeans(rg, "stress", pairwise = TRUE)
#>
#> Predicted marginal means for 'stress'
#>
#> Factors averaged over: 'spont' 'place'
#>
#> $pmmeans
#> stress pmmean SE df lower.CL upper.CL
#> Post-Tonic 0.08653011 0.11764046 42.24 -0.15083790 0.32389813
#> Tonic 0.13146972 0.09434706 41.10 -0.05905417 0.32199361
#> Unstressed -0.25281748 0.10383491 41.03 -0.46251171 -0.04312325
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Post-Tonic - Tonic -0.04493961 0.09252431 59.26 -0.486 0.8783
#> Post-Tonic - Unstressed 0.33934759 0.08901834 87.70 3.812 0.0007
#> Tonic - Unstressed 0.38428720 0.07594482 82.73 5.060 <.0001
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates
The fricatives dataset (Hualde and Prieto, 2014) included in the nauf package contains measures of duration and voicing for intervocalic alveolar fricatives in Spanish and Catalan. Spanish has only one such fricative, /s/, which can occur in any position in the word (initial, medial, or final). In Catalan, the situation is much more complicated. At the beginning of a word and in the middle of a word, both /s/ (underlyingly voiceless) and /z/ (underlyingly voiced) can occur (though word-initial /z/ is rare, and does not occur in the dataset). In word-final position, there is no contrast between /s/ and /z/ (labeled /S/, underlyingly neutralized), with the voicing of the fricative determined by the following sound. Because all of the fricatives in the dataset are intervocalic, /S/ ought to be like /z/ (according to traditional descriptions of Catalan), but may be shorter and possibly less voiced. That is, in the dataset, we have the following set of unique values.
dat <- fricatives
u <- unique(dat[, c("lang", "wordpos", "uvoi")])
u <- u[order(u$lang, u$wordpos, u$uvoi), ]
rownames(u) <- NULL
u
#> lang wordpos uvoi
#> 1 Catalan Final Neutralized
#> 2 Catalan Initial Voiceless
#> 3 Catalan Medial Voiced
#> 4 Catalan Medial Voiceless
#> 5 Spanish Final Voiceless
#> 6 Spanish Initial Voiceless
#> 7 Spanish Medial Voiceless
This raises a problem for the regression analysis, since underlying voicing uvoi is only contrastive within a subset of the data (specifically, when lang = "Catalan" and wordpos = "Medial"), and everywhere else is uniquely determined by lang and wordpos. Ideally, we would like to be able to have uvoi apply only where it is contrastive, and to be able to include random slopes for uvoi for the Catalan speakers, but not the Spanish speakers. Using traditional methods won’t help us here. If we were to take the full interaction of the three factors as a new factor with 7 levels, we couldn’t include slopes. If we run separate regressions on different subsets of the data, then we will have to limit the comparisons we can make, and won’t be making use of all of the information the data is providing us. Note also that this issue has nothing to do with the way the data were collected; the nature of the languages and the research questions creates these imbalances. With nauf, we can solve this problem by coding uvoi as NA in the subsets of the data where it is not contrastive. That is, we can code the unique possible combinations as:
u$uvoi[!(u$lang == "Catalan" & u$wordpos == "Medial")] <- NA
u
#> lang wordpos uvoi
#> 1 Catalan Final <NA>
#> 2 Catalan Initial <NA>
#> 3 Catalan Medial Voiced
#> 4 Catalan Medial Voiceless
#> 5 Spanish Final <NA>
#> 6 Spanish Initial <NA>
#> 7 Spanish Medial <NA>
dat$uvoi[!(dat$lang == "Catalan" & dat$wordpos == "Medial")] <- NA
In this way, we are recognizing that word-final Catalan fricatives are limited to one value for uvoi, as are all Spanish fricatives, etc. The meaning of NA in this table is not always the same, so we have to keep track of it, and we will definitely want to include an interaction between lang and wordpos since, for example, “Catalan:Final:NA” means neutralized /S/ while “Spanish:Final:NA” means voiceless /s/. As for being able to include uvoi slopes for Catalan speakers and not Spanish speakers, we can create two new grouping factors, one for each language, where the factor is the speaker identifier or NA based on the language the observation comes from:
dat$c_speaker <- dat$s_speaker <- dat$speaker
dat$c_speaker[dat$lang != "Catalan"] <- NA
dat$s_speaker[dat$lang != "Spanish"] <- NA
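When several grouping factors have to be split this way, the pattern can be wrapped in a small function. The helper split_group below is hypothetical (it is not part of nauf or standardize), and dropping unused levels is a choice made for this sketch rather than something the code above does.

```r
# Hypothetical helper: keep `group` where `keep` is TRUE, set it to NA
# elsewhere (including where `keep` itself is NA), and drop unused levels.
split_group <- function(group, keep) {
  group <- as.factor(group)
  group[!keep | is.na(keep)] <- NA
  droplevels(group)
}

# Toy data with two speakers per language.
toy <- data.frame(
  lang    = c("Catalan", "Catalan", "Spanish", "Spanish"),
  speaker = factor(c("s1", "s2", "s3", "s4"))
)
toy$c_speaker <- split_group(toy$speaker, toy$lang == "Catalan")
toy$s_speaker <- split_group(toy$speaker, toy$lang == "Spanish")
toy
```

Each observation then has a non-NA value in exactly one of the two new factors, which is what allows a different random effects structure per language.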
With these NA values, we can then fit a model with nauf_lmer to predict the percentage of the fricatives that is voiced based on lang, wordpos, and uvoi:
s.pvoi <- standardize(pvoi ~ lang * wordpos + uvoi +
  (1 + wordpos + uvoi | c_speaker) + (1 + wordpos | s_speaker),
  dat)
m.pvoi <- nauf_lmer(s.pvoi$formula, s.pvoi$data)
summary(m.pvoi)
#> Linear mixed model fit by REML ['nauf.lmerMod']
#> Formula:
#> pvoi ~ lang * wordpos + uvoi + (1 + wordpos + uvoi | c_speaker) +
#> (1 + wordpos | s_speaker)
#> Data: s.pvoi$data
#>
#> REML criterion at convergence: 3584.7
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.4721 -0.6148 0.0511 0.5195 3.1344
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> c_speaker (Intercept) 0.154115 0.39258
#> wordposFinal 0.042090 0.20516 -0.67
#> wordposInitial 0.034222 0.18499 0.36 -0.91
#> uvoiVoiced 0.028117 0.16768 -0.21 0.76 -0.73
#> s_speaker (Intercept) 0.123763 0.35180
#> wordposFinal 0.011092 0.10532 0.92
#> wordposInitial 0.006788 0.08239 -0.38 0.01
#> Residual 0.482529 0.69464
#> Number of obs: 1622, groups: c_speaker, 26; s_speaker, 16
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 0.02215 0.06339 0.349
#> langCatalan 0.28058 0.06339 4.426
#> wordposFinal 0.46883 0.04183 11.208
#> wordposInitial -0.40194 0.03791 -10.603
#> uvoiVoiced 0.64890 0.04716 13.761
#> langCatalan:wordposFinal 0.28252 0.04183 6.754
#> langCatalan:wordposInitial -0.39017 0.03791 -10.292
#>
#> Correlation of Fixed Effects:
#> (Intr) lngCtl wrdpsF wrdpsI uvoVcd lngC:F
#> langCatalan -0.150
#> wordposFinl 0.105 -0.416
#> wordposIntl 0.045 0.155 -0.669
#> uvoiVoiced -0.098 -0.098 0.282 -0.236
#> lngCtln:wrF -0.416 0.105 0.076 -0.184 0.282
#> lngCtln:wrI 0.155 0.045 -0.184 0.057 -0.236 -0.669
To understand how nauf treats the NA values in uvoi, we call nauf_contrasts:
nauf_contrasts(m.pvoi)
#> $lang
#> Catalan
#> Catalan 1
#> Spanish -1
#>
#> $wordpos
#> Final Initial
#> Final 1 0
#> Initial 0 1
#> Medial -1 -1
#>
#> $uvoi
#> Voiced
#> Voiced 1
#> Voiceless -1
#> <NA> 0
The function returns a list with an element for each of the three factors. Sum contrasts have been applied for all three with named_contr_sum from the standardize package, but for uvoi there is an additional row indicating that a value of 0 is used when it is NA. In this way, the uvoiVoiced coefficient in the summary never contributes to the fitted value for any observation that belongs to a subset where underlying voicing is not contrastive, and its estimate represents half the difference between voiced and voiceless Catalan word-medial observations. We can run an anova just as we did in the Cuzco example (though in this case it is not as useful since we are interested in making specific comparisons), and also create a reference grid:
anova(m.pvoi, method = "S")
#> Mixed Model Anova Table (Type III tests, S-method)
#>
#> Model: pvoi ~ lang * wordpos + uvoi + (1 + wordpos + uvoi | c_speaker) +
#> Model: (1 + wordpos | s_speaker)
#> Data: $
#> Data: s.pvoi
#> Data: data
#> num Df den Df F Pr(>F)
#> lang 1 29.382 19.591 0.0001215 ***
#> wordpos 2 30.293 71.535 3.349e-12 ***
#> uvoi 1 20.331 189.361 9.136e-12 ***
#> lang:wordpos 2 30.293 52.983 1.285e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rg.pvoi <- nauf_ref.grid(m.pvoi)
We can now use the nauf_pmmeans function’s subset and na_as_level arguments to test hypotheses in the subset of the reference grid where comparisons are valid. For example, if we want to test whether Spanish fricatives are more voiced than Catalan fricatives, the only just comparison we can make is for word-initial and word-medial /s/. To do this, we provide nauf_pmmeans with a list of valid groups, with each group specified as a named list of unordered factor levels:
nauf_pmmeans(rg.pvoi, "lang", pairwise = TRUE,
  subset = list(
    list(lang = "Catalan", wordpos = "Initial", uvoi = NA),
    list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless"),
    list(lang = "Spanish", wordpos = c("Initial", "Medial"), uvoi = NA)
  )
)
#>
#> Predicted marginal means for 'lang'
#> Note: 'lang' is included in higher order interaction(s) 'lang:wordpos'
#>
#> Factors conditioned on: 'lang' 'wordpos' 'uvoi'
#>
#> See the 'subset' element of the 'nauf.specs' attribute for subsetted groups.
#>
#> $pmmeans
#> lang pmmean SE df lower.CL upper.CL
#> Catalan -0.3973943 0.10546712 24.10 -0.6150206 -0.1797680
#> Spanish -0.3515808 0.08533869 12.07 -0.5374041 -0.1657574
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Catalan - Spanish -0.04581351 0.1356687 35.55 -0.338 0.7376
Each of the lists specified in the subset list represents a set of valid combinations of factor levels where language is truly contrastive. Any row in the reference grid which belongs to at least one of the specified groups is kept and the others are ignored. So the Catalan estimate is an average of Catalan:Initial:NA and Catalan:Medial:Voiceless, and the Spanish estimate is an average of Spanish:Initial:NA and Spanish:Medial:NA. To test the effect of word position in Spanish, we can call:
nauf_pmmeans(rg.pvoi, "wordpos", pairwise = TRUE,
  subset = list(lang = "Spanish", uvoi = NA)
)
#>
#> Predicted marginal means for 'wordpos'
#> Note: 'wordpos' is included in higher order interaction(s) 'lang:wordpos'
#>
#> Factors conditioned on: 'lang' 'uvoi'
#>
#> See the 'subset' element of the 'nauf.specs' attribute for subsetted groups.
#>
#> $pmmeans
#> wordpos pmmean SE df lower.CL upper.CL
#> Final -0.07211647 0.13419238 10.59 -0.3688850 0.22465208
#> Initial -0.27019498 0.10439365 10.34 -0.5017668 -0.03862313
#> Medial -0.43296656 0.08733771 13.14 -0.6214424 -0.24449071
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Final - Initial 0.1980785 0.09495946 30.94 2.086 0.1093
#> Final - Medial 0.3608501 0.09744109 10.11 3.703 0.0102
#> Initial - Medial 0.1627716 0.08900031 14.78 1.829 0.1945
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates
In this call, we are only specifying one group in subset, so we don’t need to double-list it (i.e. nauf will understand list(lang = "Spanish", uvoi = NA) as list(list(lang = "Spanish", uvoi = NA))). In Catalan, while fricatives occur in all three word positions, word position is only fully contrastive (i.e. its effect can only be truly isolated) for word-initial vs. word-medial /s/:
nauf_pmmeans(rg.pvoi, "wordpos", pairwise = TRUE,
  subset = list(
    list(lang = "Catalan", wordpos = "Initial", uvoi = NA),
    list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless")
  )
)
#>
#> Predicted marginal means for 'wordpos'
#> Note: 'wordpos' is included in higher order interaction(s) 'lang:wordpos'
#>
#> Factors conditioned on: 'lang' 'wordpos' 'uvoi'
#>
#> See the 'subset' element of the 'nauf.specs' attribute for subsetted groups.
#>
#> $pmmeans
#> wordpos pmmean SE df lower.CL upper.CL
#> Medial -0.3054125 0.1163456 23.48 -0.5458214 -0.06500364
#> Initial -0.4893760 0.1086127 23.75 -0.7136642 -0.26508781
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Medial - Initial 0.1839635 0.07856695 9.85 2.341 0.0416
Similarly, voicing is only contrastive for Catalan medial fricatives, and so we can call:
nauf_pmmeans(rg.pvoi, "uvoi", pairwise = TRUE,
  subset = list(lang = "Catalan", wordpos = "Medial")
)
#>
#> Predicted marginal means for 'uvoi'
#> NA not considered a level for: 'uvoi'
#>
#> Factors conditioned on: 'lang' 'wordpos'
#>
#> See the 'subset' element of the 'nauf.specs' attribute for subsetted groups.
#>
#> $pmmeans
#> uvoi pmmean SE df lower.CL upper.CL
#> Voiced 0.9923974 0.1005598 25.74 0.7855905 1.19920437
#> Voiceless -0.3054125 0.1163456 23.48 -0.5458214 -0.06500364
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Voiced - Voiceless 1.29781 0.09431177 20.33 13.761 <.0001
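The two predicted marginal means above can be reproduced by hand from the fixed-effect estimates printed by summary(m.pvoi), which makes the role of the sum contrasts concrete. The sketch below only does arithmetic on those printed (rounded) estimates; the names in the vector b are labels chosen for this example.

```r
# Fixed-effect estimates copied from the summary(m.pvoi) output.
b <- c(int = 0.02215, langCatalan = 0.28058,
       wordposFinal = 0.46883, wordposInitial = -0.40194,
       uvoiVoiced = 0.64890,
       cat_final = 0.28252, cat_initial = -0.39017)

# Under sum contrasts, Medial (the last level) is coded -1 on both wordpos
# columns, so its effect is minus the sum of the other two. The same holds
# for the langCatalan:wordpos interaction, since Catalan is coded +1.
medial     <- -(b[["wordposFinal"]] + b[["wordposInitial"]])
cat_medial <- -(b[["cat_final"]] + b[["cat_initial"]])

# Catalan word-medial cell before adding the uvoi effect.
cell <- b[["int"]] + b[["langCatalan"]] + medial + cat_medial

voiced    <- cell + b[["uvoiVoiced"]]   # uvoi coded +1
voiceless <- cell - b[["uvoiVoiced"]]   # uvoi coded -1

round(c(voiced = voiced, voiceless = voiceless,
        difference = voiced - voiceless), 4)
```

Up to rounding of the printed coefficients, the results agree with the pmmean column and the contrast estimate in the output above, and the difference is exactly twice the uvoiVoiced coefficient.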
For Catalan word-final neutralized /S/, we may be interested in making a comparison to see whether it is somewhere in between /s/ and /z/ in terms of voicing, or not significantly different from /z/. This case requires the additional argument na_as_level to be specified since we want an estimate for a group where uvoi is NA (the default is that estimates are not generated for any group that has NA’s in the estimate table):
nauf_pmmeans(rg.pvoi, c("wordpos", "uvoi"), pairwise = TRUE,
  subset = list(
    list(lang = "Catalan", wordpos = "Medial", uvoi = c("Voiced", "Voiceless")),
    list(lang = "Catalan", wordpos = "Final", uvoi = NA)
  ),
  na_as_level = "uvoi"
)
#>
#> Predicted marginal means for 'wordpos:uvoi'
#> NA considered a level for: 'uvoi'
#> Note: The interaction term 'wordpos:uvoi' is not in the model.
#>
#> Factors conditioned on: 'lang' 'wordpos' 'uvoi'
#>
#> See the 'subset' element of the 'nauf.specs' attribute for subsetted groups.
#>
#> $pmmeans
#> wordpos uvoi pmmean SE df lower.CL upper.CL
#> Medial Voiced 0.9923974 0.10055984 25.74 0.7855905 1.19920437
#> Medial Voiceless -0.3054125 0.11634557 23.48 -0.5458214 -0.06500364
#> Final NA 1.0540873 0.08543597 42.33 0.8817105 1.22646406
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio
#> Medial,Voiced - Medial,Voiceless 1.29780996 0.09431177 20.33 13.761
#> Medial,Voiced - Final,NA -0.06168987 0.08207608 62.53 -0.752
#> Medial,Voiceless - Final,NA -1.35949984 0.11077688 22.84 -12.272
#> p.value
#> <.0001
#> 0.7338
#> <.0001
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates
As a final example, we could test if Spanish word-final /s/ (being more voiced than Spanish word-medial /s/ above) is as voiced as Catalan word-final /S/:
nauf_pmmeans(rg.pvoi, "lang", pairwise = TRUE,
  subset = list(wordpos = "Final", uvoi = NA)
)
#>
#> Predicted marginal means for 'lang'
#> Note: 'lang' is included in higher order interaction(s) 'lang:wordpos'
#>
#> Factors conditioned on: 'wordpos' 'uvoi'
#>
#> See the 'subset' element of the 'nauf.specs' attribute for subsetted groups.
#>
#> $pmmeans
#> lang pmmean SE df lower.CL upper.CL
#> Catalan 1.05408730 0.08543597 42.33 0.8817105 1.2264641
#> Spanish -0.07211647 0.13419238 10.59 -0.3688850 0.2246521
#>
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Catalan - Spanish 1.126204 0.1590814 20.09 7.079 <.0001
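The subset arguments used in the calls above all follow the same rule: a row of the reference grid is kept if it matches at least one of the listed groups. A rough base R sketch of that rule follows. The functions in_group and in_subset are hypothetical helpers written for this illustration, and the toy grid is simply every combination of the three factors, which may differ from what nauf_ref.grid actually stores.

```r
# Toy grid: every combination of the three factors, with NA as a uvoi value.
grid <- expand.grid(
  lang    = c("Catalan", "Spanish"),
  wordpos = c("Final", "Initial", "Medial"),
  uvoi    = c("Voiced", "Voiceless", NA),
  stringsAsFactors = FALSE
)

# TRUE for rows matching every factor named in one group. %in% treats NA as
# a matchable value, so uvoi = NA selects the rows where uvoi is NA.
in_group <- function(grid, group) {
  hits <- lapply(names(group), function(f) grid[[f]] %in% group[[f]])
  Reduce(`&`, hits)
}

# TRUE for rows matching at least one of the groups.
in_subset <- function(grid, groups) {
  Reduce(`|`, lapply(groups, in_group, grid = grid))
}

# The groups from the first language comparison.
groups <- list(
  list(lang = "Catalan", wordpos = "Initial", uvoi = NA),
  list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless"),
  list(lang = "Spanish", wordpos = c("Initial", "Medial"), uvoi = NA)
)
kept <- grid[in_subset(grid, groups), ]
kept
```

The four rows that remain are the two Catalan and two Spanish combinations that were averaged in that comparison.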
In this way, we are able to use all of the information in the dataset when fitting the model, account for the uncertainty introduced by repeated measures on the subjects, assign different random effects to the two languages, and then test a variety of hypotheses we might have about the effects of the factors within the subsets of the data where the comparisons make sense.
There are two situations in which unordered factors will need more than one set of contrasts: (1) when an unordered factor with NA values interacts with another unordered factor, and some levels are collinear with NA; and (2) when an unordered factor is included as a slope for a random effects grouping factor that has NA values, but only a subset of the levels for the slope factor occur when the grouping factor is applicable. Both of these situations occur when we consider all three dialects in the plosives dataset jointly (here we will look at the voiceless subset):
dat <- subset(plosives, voicing == "Voiceless")
xtabs(~ dialect + spont, dat)
#> spont
#> dialect FALSE TRUE
#> Cuzco 795 620
#> Lima 215 206
#> Valladolid 0 751
The data for Cuzco and Lima consist of two subsets: read speech and spontaneous speech (with the read speech task being the same for both dialects). The Valladolid data, however, come from a different corpus consisting only of spontaneous speech, with the measurements taken in the same way as for Cuzco and Lima. While it would of course be ideal to have read speech for the Valladolid speakers as well, this doesn’t mean that we need to split up the data and run multiple regressions. We can simply code spont as NA for Valladolid, split the speaker random effects by dialect, and only include a spont slope for Cuzco and Lima:
dat$spont[dat$dialect == "Valladolid"] <- NA
dat$c_speaker <- dat$l_speaker <- dat$v_speaker <- dat$speaker
dat$c_speaker[dat$dialect != "Cuzco"] <- NA
dat$l_speaker[dat$dialect != "Lima"] <- NA
dat$v_speaker[dat$dialect != "Valladolid"] <- NA
sdat <- standardize(cdur ~ spont * dialect +
  (1 + spont | c_speaker) + (1 + spont | l_speaker) + (1 | v_speaker) +
  (1 + dialect | item),
  dat)
mod <- nauf_lmer(sdat$formula, sdat$data)
summary(mod)
#> Linear mixed model fit by REML ['nauf.lmerMod']
#> Formula: cdur ~ spont * dialect + (1 + spont | c_speaker) + (1 + spont |
#> l_speaker) + (1 | v_speaker) + (1 + dialect | item)
#> Data: sdat$data
#>
#> REML criterion at convergence: 6242.3
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -3.0413 -0.6728 -0.1214 0.5382 4.5849
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> c_speaker (Intercept) 0.0846787 0.29100
#> spontTRUE 0.0065212 0.08075 -0.53
#> item (Intercept) 0.2805902 0.52971
#> dialect.c2.Cuzco 0.0009203 0.03034 -1.00
#> v_speaker (Intercept) 0.0633240 0.25164
#> l_speaker (Intercept) 0.0428048 0.20689
#> spontTRUE 0.0158277 0.12581 -1.00
#> Residual 0.6030539 0.77657
#> Number of obs: 2587, groups:
#> c_speaker, 30; item, 27; v_speaker, 18; l_speaker, 8
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) -0.180515 0.052463 -3.441
#> spontTRUE -0.152609 0.060147 -2.537
#> dialectCuzco 0.574724 0.053716 10.699
#> dialectLima -0.203724 0.065241 -3.123
#> spontTRUE:dialect.c2.Cuzco -0.004251 0.032064 -0.133
#>
#> Correlation of Fixed Effects:
#> (Intr) spTRUE dlctCz dlctLm
#> spontTRUE -0.738
#> dialectCuzc -0.011 -0.092
#> dialectLima 0.392 -0.517 -0.421
#> spTRUE:.2.C 0.343 -0.434 -0.356 0.569
In this way, speaker effects are accounted for in each dialect, and within the spont = FALSE subset, item effects are additionally accounted for. However, note that for the interaction term spont:dialect, .c2. appears before Cuzco, and there is only one coefficient for the interaction term rather than two as we would normally expect. The same goes for the item slope for dialect. This is because using the main effect contrasts for dialect in the spont:dialect interaction term and item dialect slope would lead to collinearity issues (as explained in detail in the help page for the nauf_contrasts function). The nauf package automatically recognizes when this happens, and creates additional sets of contrasts which it uses only when it needs to. To see how the contrasts are coded, we can call nauf_contrasts:
nauf_contrasts(mod)
#> $spont
#> TRUE
#> TRUE 1
#> FALSE -1
#> <NA> 0
#>
#> $dialect
#> Cuzco Lima .c2.Cuzco
#> Cuzco 1 0 1
#> Lima 0 1 -1
#> Valladolid -1 -1 0
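The collinearity that the second set of dialect contrasts avoids can be checked with a small base R sketch. It builds the fixed-effects design by hand for the five dialect-by-spont cells that occur in the data, using the contrast values printed above. The object names are assumptions for this illustration, and nauf's own construction of the model matrix may differ in detail.

```r
# One row per cell that occurs: spont is TRUE = 1, FALSE = -1, NA = 0.
cells <- data.frame(
  dialect = c("Cuzco", "Cuzco", "Lima", "Lima", "Valladolid"),
  spont   = c(1, -1, 1, -1, 0),
  stringsAsFactors = FALSE
)

# Main-effect contrasts and the additional .c2. contrast for dialect.
main <- rbind(Cuzco = c(1, 0), Lima = c(0, 1), Valladolid = c(-1, -1))
c2   <- c(Cuzco = 1, Lima = -1, Valladolid = 0)
d    <- main[cells$dialect, , drop = FALSE]

# Interaction built from the main-effect contrasts: the two interaction
# columns sum to the spont column, so the design is rank deficient.
X_main <- cbind(1, cells$spont, d, cells$spont * d)

# Interaction built from the single .c2. contrast: full column rank.
X_c2 <- cbind(1, cells$spont, d, cells$spont * c2[cells$dialect])

c(cols = ncol(X_main), rank = qr(X_main)$rank)
c(cols = ncol(X_c2), rank = qr(X_c2)$rank)
```

With only five distinct cells there is room for five fixed-effect columns, which is why a single interaction coefficient appears in the summary.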
The nauf package allows unordered factors and random effects grouping factors to be used even when they are only applicable within a subset of a dataset. The user only needs to code them as NA when they are not applicable/contrastive, and nauf regression fitting functions take care of the rest. Different random effects structures for the same grouping variable can be fit in different subsets by creating new factors from the grouping factor, and setting them to NA in the appropriate subsets. The nauf_pmmeans function can then be used to test hypotheses conditioning on subsets of the data where a just comparison can be made.
If you use the nauf package in a publication, please cite:
Eager, Christopher D. (2017). nauf: Regression with NA Values in Unordered Factors. R package version 1.0.1. https://CRAN.R-project.org/package=nauf
If you analyze the plosives dataset in a publication, please cite:
Eager, Christopher D. (2017). Contrast preservation and constraints on individual phonetic variation. Doctoral thesis. University of Illinois at Urbana-Champaign.
If you analyze the fricatives dataset in a publication, please cite:
Hualde, J. I., & Prieto, P. (2014). Lenition of intervocalic alveolar fricatives in Catalan and Spanish. Phonetica, 71(2), 109-127.