This vignette outlines the procedures used to estimate multivariate
sex differences in personality traits using regularized logistic
regression and the gender diagnosticity approach with the
*multid* package.

This approach can be used to measure multivariate differences between any two groups, but in this example, we examine differences between females and males in their “answers to the Big Five Personality Test, constructed with items from the International Personality Item Pool” (which can be obtained from https://openpsychometrics.org/).

The regularized method enables the examination of group-membership probabilities and their distributions across individuals. In the context of statistical predictions of sex, these distributions are an updated version of the gender-typicality distributions used in the gender diagnosticity methodology (Lippa & Connelly, 1990).

For more information and empirical examples of how multivariate sex difference estimation and gender diagnosticity methodology can be integrated, please see Ilmarinen, Vainikainen, and Lönnqvist (2022) and Lönnqvist and Ilmarinen (2021).

You can install the released version of *multid* from CRAN with:

`install.packages("multid")`

You can install the development version from GitHub with:

```
install.packages("devtools")
::install_github("vjilmari/multid") devtools
```

Load multid and other packages needed for the vignette

```
library(rio)
library(multid)
library(dplyr)
library(overlapping)
```

This vignette uses openly available data from the “Big Five Personality Test,” which is constructed with items from the International Personality Item Pool. The dataset can be downloaded directly from https://openpsychometrics.org (by approach presented here https://stackoverflow.com/a/66351588/8995170), and there is no need to download it to your drive in order to reproduce the analyses described in this vignette.

```
# create a temporary directory
<- tempdir()
td
# create a temporary file
<- tempfile(tmpdir=td, fileext=".zip")
tf
# download file from internet into temporary location
download.file("http://openpsychometrics.org/_rawdata/BIG5.zip", tf)
# list zip archive
<- unzip(tf, list=TRUE)
file_names
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
<- import(file.path(td, "BIG5/data.csv"))
dat
# delete the files and directories
unlink(td)
```

```
# save names of personality items to a vector
<-names(dat)[which(names(dat)=="E1"):
per.itemswhich(names(dat)=="O10")]
# code item response 0 (zero) to not available (NA) on all these items
<-
dat[,per.items]sapply(dat[,per.items],function(x){ifelse(x==0,NA,x)})
# check that the numerical range makes sense
range(dat[,per.items],na.rm=T)
```

`#> [1] 1 5`

```
# calculate sum scores for Big Five (reverse coded items subtracted from 6)
# see codebook from https://openpsychometrics.org for item content and keys
$N<-rowMeans(
datcbind(dat[,c("N1","N3","N5","N6","N7","N8","N9","N10")],
6-dat[,c("N2","N4")]),na.rm=T)
$E<-rowMeans(
datcbind(dat[,c("E1","E3","E5","E7","E9")],
6-dat[,c("E2","E4","E6","E8","E10")]),na.rm=T)
$O<-rowMeans(
datcbind(dat[,c("O1","O3","O5","O7","O8","O9","O10")],
6-dat[,c("O2","O4","O6")]),na.rm=T)
$A<-rowMeans(
datcbind(dat[,c("A2","A4","A6","A8","A9","A10")],
6-dat[,c("A1","A3","A5","A7")]),na.rm=T)
$C<-rowMeans(
datcbind(dat[,c("C1","C3","C5","C7","C9","C10")],
6-dat[,c("C2","C4","C6","C8")]),na.rm=T)
```

For the purposes of this example, we will only analyze responses from the United States to examine sex differences. Additionally, we will only include participants who identified as male (gender=1) or female (gender=2), and who did not have any missing responses.

```
<-
US.datna.omit(dat[dat$country=="US" & (dat$gender==1 | dat$gender==2),
c("gender",per.items,"N","E","O","A","C")])
# code categorical gender variables
$gender.cat<-
US.datifelse(US.dat$gender==1,"Male","Female")
```

This section demonstrates several different ways to estimate sex differences in personality. We begin with typical univariate differences in the broad Big Five domains, followed by multivariate differences in both domains and using personality items as a multivariate set. We will pay special attention to regularized methods because this approach enables us to examine both sex differences and the underlying distributions that give rise to these mean differences.

The **d_pooled_sd** function is used to calculate
Cohen’s *d* with a pooled standard deviation across the male and
female groups for each Big Five trait separately. The function takes
several arguments, including the analyzed Big Five trait
(**var**), sex as the binary grouping variable
(**group.var**), the values that represent these groups
(**group.values**), and a logical value indicating whether
statistical inference is desired (**infer**). All arguments
except for statistical inference are also included in the multivariate
function **D_regularized**, which is presented further
below.

```
<-d_pooled_sd(data = US.dat,var = "N",
d.Ngroup.var = "gender.cat",group.values = c("Male","Female"),
infer=T)
<-d_pooled_sd(data = US.dat,var = "E",
d.Egroup.var = "gender.cat",group.values = c("Male","Female"),
infer=T)
<-d_pooled_sd(data = US.dat,var = "O",
d.Ogroup.var = "gender.cat",group.values = c("Male","Female"),
infer=T)
<-d_pooled_sd(data = US.dat,var = "A",
d.Agroup.var = "gender.cat",group.values = c("Male","Female"),
infer=T)
<-d_pooled_sd(data = US.dat,var = "C",
d.Cgroup.var = "gender.cat",group.values = c("Male","Female"),
infer=T)
# compile to a table
<-rbind(d.N,d.E,d.O,d.A,d.C)
uni.drownames(uni.d)<-c("N","E","O","A","C")
round(uni.d,2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> N 2945 5739 2.79 3.16 0.88 0.86 0.87 -0.37 -0.42
#> E 2945 5739 3.00 3.06 0.96 0.96 0.96 -0.07 -0.07
#> O 2945 5739 4.07 3.88 0.58 0.63 0.62 0.18 0.30
#> A 2945 5739 3.68 4.03 0.75 0.68 0.71 -0.35 -0.50
#> C 2945 5739 3.38 3.44 0.72 0.75 0.74 -0.07 -0.09
#> t_stat df p
#> N -18.45 5845.38 0
#> E -3.09 5919.79 0
#> O 13.44 6458.37 0
#> A -21.33 5455.81 0
#> C -3.95 6119.69 0
```

The column labeled “D” indicates Cohen’s *d*. Females score
higher in Neuroticism (N; *d* = -0.42) and Agreeableness (A;
*d* = -0.50), while males score higher in Openness (O; *d*
= 0.30). Differences in Extraversion (E) and Conscientiousness (C) are
statistically significant (*p* < .001), but their magnitudes
are negligible (*d* < |.10|).

The average absolute difference across the Big Five traits is
therefore *d* = 0.27. However, this average absolute difference
does not provide information about the overall distance between males
and females. For examining questions regarding the overall distance in a
certain domain, such as personality, one should use multivariate
methods

Other columns in the output are:

n.Male and n.Female = sample sizes

m.Male and m.Female = mean levels

sd.Male and sd.Female = standard deviations

pooled.sd = pooled standard deviation

diff = the raw (non-standardized) difference

t_stat = t statistic

df = Welch modified degrees of freedom

p = p-value

Although the univariate differences presented above estimate sex
differences in each broad Big Five trait separately, these single
Cohen’s *d* estimates do not provide a clear picture of the
overall distance between males and females. To understand overall
distances, multivariate methods that use multiple variables and account
for their covariability (intercorrelations) should be used. These
methods provide *D* values, which can be interpreted in the same
way as Cohen’s *d*, that is, as standardized distances between
males and females.

Mahalanobis’ *D* is the standardized distance between the male
and female group centroids in multivariate space. To calculate
Mahalanobis’ *D*, you need the correlation matrix of the
multivariate set and a vector of standardized mean differences. The
ordering of the variables should be the same in both inputs.

```
# let's use the previously obtained univariate d-values
<-uni.d[,"D"]
d_vector# calculate intercorrelations
<-cor(US.dat[,c("N","E","O","A","C")])
cor_mat# test that the ordering of these variables matches
names(d_vector)==colnames(cor_mat)
```

`#> [1] TRUE TRUE TRUE TRUE TRUE`

```
# print the correlation matrix
round(cor_mat,2)
```

```
#> N E O A C
#> N 1.00 -0.25 -0.10 -0.10 -0.26
#> E -0.25 1.00 0.15 0.31 0.10
#> O -0.10 0.15 1.00 0.10 0.05
#> A -0.10 0.31 0.10 1.00 0.17
#> C -0.26 0.10 0.05 0.17 1.00
```

```
# Calculate D
<-sqrt(t(d_vector) %*% solve(cor_mat) %*% d_vector)
D_maha D_maha
```

```
#> [,1]
#> [1,] 0.7681502
```

Mahalanobis’ *D* is, of course, larger than any of the
univariate *d*’s for single variables, here the standardized
distance between the group centroids of males and females in the
multivariate space is *D* = 0.77

In the D_regularized function, a predictive approach is used to
calculate the regularized variant of *D*. This approach involves
predicting the binary sex variable using a multivariate set of variables
with regularized logistic regression. The predicted values for each
individual represent the logits of probabilities of being male or
female, and are hereafter referred to as femininity-masculinity scores
(FM).

FM scores can be used in the same way as any other variable to calculate standardized distances between the mean FM scores of males and females with pooled standard deviations. In addition to estimating mean differences, there are other informative features of the multivariate/FM distributions.

The **D_regularized** function uses separate partitions
of the original data for regulation (i.e., training) and prediction
(i.e., testing). During regulation, the beta-coefficients for variables
in the multivariate set are penalized through a specified function, such
as a elastic net penalty with 10-fold cross-validation (default). The
resulting coefficient weights are then used in the prediction part of
the data as measures of FM for each individual. The arguments
**out** (a logical value indicating whether predictions are
made out-of-bag) and **size** (the size of the training
partition of the data, equal for males and females) are used to specify
that a partition is required. The multivariate set is supplied with the
**mv.vars** argument.

It is also possible to request **pcc** (probability of
correctly classified), **auc** (area under the receiver
operating characteristics), and frequencies indicating the proportions
of the sample that fall within certain cutoffs of predicted
probabilities from 0 to 1 for each group (**pred.prob**) by
using these arguments within **D_regularized** call.

```
# set seed number for reproducibility
set.seed(37127)
# decide the size of the regularization (training data fold)
# here, half of the smaller of the male and female groups is used
=round(min(table(US.dat$gender.cat))/2)
size
# run D_regularized
<-
D_outD_regularized(data=US.dat,
mv.vars=c("N","E","O","A","C"),
group.var="gender.cat",
group.values = c("Male","Female"),
out = T,
size=size,
pcc = T,
auc = T,
pred.prob = T)
# print the output
round(D_out$D,2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> [1,] 1473 4267 0.34 -0.3 0.86 0.79 0.81 0.64 0.79
#> pcc.Male pcc.Female pcc.total auc
#> [1,] 0.63 0.67 0.66 0.71
```

The output shows that the sample sizes are smaller compared to the
univariate estimates because they only reflect the samples in the
prediction/testing dataset. The standardized multivariate difference
between males and females is estimated at *D* = 0.79, which is
very similar in magnitude to the multivariate sex difference indexed by
Mahalanobis’ *D*.

There are additional features in the output that may be of interest.
Similar to univariate *d* estimates, descriptive statistics for
both groups on the FM-score metric are provided, as well as an estimate
of the pooled standard deviation used to calculate D. The probability of
correctly classified individuals (**pcc**) and the area
under the receiver operating characteristics (**auc**) are
also included, which are commonly used indices of predictive
performance. In this example, they indicate the performance of the broad
Big Five domains in predicting sex. The predicted probabilities of being
male across certain cutoffs (defaults were used in this example) for
both groups can be obtained from the **P.table** part of
the output.

`round(100*D_out$P.table,1)`

```
#>
#> [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#> Female 7.2 39.0 37.0 14.1 2.7
#> Male 1.2 18.1 37.1 32.6 11.0
```

The table shows that majority of males are nevertheless not very clearly classified as males based on the Big Five personality traits (11% in .80-1 range), and same is true for classifying females as females (7% in 0-0.20 range).

To examine the regularized beta weights obtained during the training
phase, the **coefficients** function can be called on the
**cv.mod** object of the output. The **s**
argument specifies the value of the penalty parameter lambda used to
calculate predictions. The default value of “lambda.min” is used to
obtain the values where the mean cross-validated error is minimized,
while “lambda.1se” uses the largest value of lambda such that the error
is within one standard error of the minimum. It is also possible to plot
the cross-validation curve to visualize how “lambda.min” and
“lambda.1se” were obtained. For more information on regularized
regression conducted with the glmnet package, please refer to https://glmnet.stanford.edu/articles/glmnet.html.

```
coefficients(D_out$cv.mod,
s = "lambda.min")
```

```
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 3.37297381
#> N -0.54865910
#> E -0.02070834
#> O 0.64983681
#> A -0.88854735
#> C -0.23751055
```

```
coefficients(D_out$cv.mod,
s = "lambda.1se")
```

```
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 2.3670214
#> N -0.3900418
#> E .
#> O 0.4626928
#> A -0.6930196
#> C -0.1053781
```

`plot(D_out$cv.mod)`

Differences and similarities are two sides of the same coin. In the
case of sex differences, visual inspection of the distributions of
FM-scores can reveal both. Specifically, by assessing the overlap
between the distributions of FM-scores of males and females, we can
observe differences between the two groups. However, we can also
identify similarities by examining the degree of
**overlap** between the distributions.

```
# Obtain the D-estimate
<-unname(D_out$D[,"D"])
D
# Overlap relative to a single distribution (OVL)
<-2*pnorm((-D/2))
OVL
# Proportion of overlap relative to the joint distribution
<-OVL/(2-OVL)
OVL2
# non-parametric overlap with overlapping package
<-
np.overlapoverlap(x = list(D_out$pred.dat[
$pred.dat$group=="Male","pred"],
D_out$pred.dat[
D_out$pred.dat$group=="Female","pred"]),
D_outplot=T)
```

```
# this corresponds to Proportion of overlap relative to the joint distribution (OVL2)
<-unname(np.overlap$OV)
np.OVL2
# from which Proportion of overlap relative to a single distribution (OVL) is approximated at
<-(2*np.OVL2)/(1+np.OVL2)
np.OVL
# comparison of different overlap-estimates
round(cbind(OVL,np.OVL,OVL2,np.OVL2),2)
```

```
#> OVL np.OVL OVL2 np.OVL2
#> [1,] 0.69 0.82 0.53 0.7
```

In this example, the non-parametric overlap (np) estimates to these logistic distributions are almost identical to the parametric overlap estimates.

To obtain confidence intervals for *D*-estimates using
regularized techniques, a bootstrap approach can be employed. This
involves estimating *D* with multiple bootstrap samples drawn
with replacement from the original sample and using the distribution of
these *D* values to construct a percentile confidence interval
around the estimate. In this example, only 10 redraws with replacement
are used, but in practice, at least 1000 or 5000 redraws should be used
for more accurate results.

```
# number of redraws
=10
reps
# placeholder for D-estimates
<-rep(NA,reps)
D_out_boot
# repeat the procedure to obtain a distribution
<-Sys.time()
t1
for (i in 1:reps){
# draw replaced samples with the same size as the original data
<-sample_n(US.dat,size=nrow(US.dat),replace=T)
boot.dat
# run D_regularized for each sample
<-
D_out_boot[i]D_regularized(data=boot.dat,
mv.vars=c("N","E","O","A","C"),
group.var="gender.cat",
group.values = c("Male","Female"),
out = T,
size=size)$D[,"D"]
}<-Sys.time()
t2-t1 t2
```

`#> Time difference of 5.830213 secs`

```
# print 95% CI percentile interval
round(quantile(D_out_boot,c(.025,.975)),2)
```

```
#> 2.5% 97.5%
#> 0.77 0.83
```

After obtaining the bootstrap confidence interval, it can be reported
that the multivariate sex difference *D* = 0.79, 95% CI [0.77,
0.83].

Recently, there have been numerous studies in the field of personality psychology emphasizing the significance of narrow personality traits in predicting criteria, beyond the commonly used broad trait domains such as the Big Five (e.g., Seeboth & Mõttus, 2018). These narrow personality traits are usually either facets or nuances that are commonly modeled as being part of a certain Big Five trait, although their unique predictive utility suggests that they have trait-like properties (stability, cross-informant agreement etc.) that go beyond the broad Big Five domains (e.g., Mõttus et al., 2019). Consequently, narrower traits can be regarded as distinct traits. Thus, it is interesting to investigate whether using item scores as the variable set in multivariate estimation would provide better predictions of sex and different estimates of sex differences (see Ilmarinen et al., 2022).

However, due to the increased number of variables (from five to fifty
in this dataset), regularization methods and different data partitions
for training and testing are highly recommended to prevent overfitting
of the data. For illustrative purposes, Mahalanobis’ *D* is also
calculated below.

```
# place-holder vector for univariate Cohen d's for items
<-rep(NA,length(per.items))
d_vector_items
# loop through the items and obtain d for each item
for (i in 1:length(per.items)){
<-
d_vector_items[i]d_pooled_sd(data=US.dat,
var=per.items[i],
group.var="gender.cat",
group.values=c("Male","Female"))[,"D"]
}
<-cor(US.dat[,per.items])
cor_mat_items
# Calculate D
<-
D_maha_itemssqrt(t(d_vector_items) %*% solve(cor_mat_items) %*% d_vector_items)
D_maha_items
```

```
#> [,1]
#> [1,] 0.9212846
```

Item-based Mahalanobis’ *D* is somewhat larger than *D*
(Mahalanobis’ or regularized) based on Big Five domains. The
standardized distance between the group centroids of males and females
in a space comprised of personality items is *D* = 0.92

It is straightforward to use D_regularized -function even with large variable sets (especially when the variable names are save in a vector first; here “per.items”). Separate partitions of data are again used for regularization and prediction. Domain-based results are printed for comparison.

```
<-
D_items_outD_regularized(data=US.dat,
mv.vars=per.items,
group.var="gender.cat",
group.values = c("Male","Female"),
out = T,
size=size,pcc = T,auc = T,pred.prob = T)
# Item-based D
round(D_items_out$D,2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> [1,] 1473 4267 0.46 -0.4 0.9 0.86 0.87 0.86 0.99
#> pcc.Male pcc.Female pcc.total auc
#> [1,] 0.7 0.69 0.69 0.76
```

```
# Big Five domain-based D
round(D_out$D,2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> [1,] 1473 4267 0.34 -0.3 0.86 0.79 0.81 0.64 0.79
#> pcc.Male pcc.Female pcc.total auc
#> [1,] 0.63 0.67 0.66 0.71
```

With items as the multivariate set, *D* = 0.99 is slightly
larger than with Big Five trait domains, *D* = 0.79.

Differences in predictive performance can also be illustrated from comparing the predicted probabilities.

`round(100*D_items_out$P.table,1)`

```
#>
#> [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#> Female 11.6 38.5 33.4 14.3 2.2
#> Male 1.6 15.1 32.9 35.4 15.0
```

`round(100*D_out$P.table,1)`

```
#>
#> [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#> Female 7.2 39.0 37.0 14.1 2.7
#> Male 1.2 18.1 37.1 32.6 11.0
```

The analysis reveals that the distribution of the FM-scores shows more individuals in the extremes. However, despite this, the overall predictive performance based on the pcc and auc values is not substantially improved. It is noteworthy that over 30% of the data falls in the 40%-60% range, indicating that their personality profiles are rather indifferent in regards to sex.

Overlaps give similar information but in another format:

```
# Obtain the D-estimate
<-unname(D_items_out$D[,"D"])
D_items
# Overlap relative to a single distribution (OVL)
<-2*pnorm((-D_items/2))
OVL_items
# Proportion of overlap relative to the joint distribution
<-OVL_items/(2-OVL_items)
OVL2_items
# non-parametric overlap
<-
np.overlap_itemsoverlap(x = list(D_items_out$pred.dat[
$pred.dat$group=="Male","pred"],
D_items_out$pred.dat[
D_items_out$pred.dat$group=="Female","pred"]),
D_items_outplot=T)
```

```
# this corresponds to Proportion of overlap relative to the joint distribution (OVL2)
<-unname(np.overlap_items$OV)
np.OVL2_items
# from which Proportion of overlap relative to a single distribution (OVL) is approximated at
<-(2*np.OVL2_items)/(1+np.OVL2_items)
np.OVL_items
# compare the different overlap-estimates
round(cbind(OVL_items,np.OVL_items,
2) OVL2_items,np.OVL2_items),
```

```
#> OVL_items np.OVL_items OVL2_items np.OVL2_items
#> [1,] 0.62 0.77 0.45 0.63
```

```
# print Big Five domain based overlaps for comparison
round(cbind(OVL,np.OVL,OVL2,np.OVL2),2)
```

```
#> OVL np.OVL OVL2 np.OVL2
#> [1,] 0.69 0.82 0.53 0.7
```

The overlap between FM-distributions of males and females is somewhat smaller when individual items are used as the multivariate set rather than broad Big Five domains.

`sessionInfo()`

```
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=Finnish_Finland.utf8
#> [3] LC_MONETARY=Finnish_Finland.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Finnish_Finland.utf8
#>
#> time zone: Europe/Helsinki
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] overlapping_2.1 testthat_3.1.8 ggplot2_3.4.2 dplyr_1.1.2
#> [5] multid_0.8.0 rio_0.5.29
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.6 utf8_1.2.3 generics_0.1.3 shape_1.4.6
#> [5] lattice_0.21-8 stringi_1.7.12 pROC_1.18.2 hms_1.1.3
#> [9] digest_0.6.31 magrittr_2.0.3 evaluate_0.21 grid_4.3.0
#> [13] iterators_1.0.14 fastmap_1.1.1 plyr_1.8.8 foreach_1.5.2
#> [17] Matrix_1.5-4 glmnet_4.1-7 cellranger_1.1.0 jsonlite_1.8.4
#> [21] zip_2.3.0 brio_1.1.3 survival_3.5-5 fansi_1.0.4
#> [25] scales_1.2.1 codetools_0.2-19 jquerylib_0.1.4 cli_3.6.1
#> [29] rlang_1.1.1 splines_4.3.0 munsell_0.5.0 withr_2.5.0
#> [33] cachem_1.0.8 yaml_2.3.7 tools_4.3.0 colorspace_2.1-0
#> [37] forcats_1.0.0 curl_5.0.0 vctrs_0.6.2 R6_2.5.1
#> [41] lifecycle_1.0.3 foreign_0.8-84 pkgconfig_2.0.3 pillar_1.9.0
#> [45] bslib_0.4.2 openxlsx_4.2.5.2 gtable_0.3.3 data.table_1.14.8
#> [49] glue_1.6.2 Rcpp_1.0.10 highr_0.10 haven_2.5.2
#> [53] xfun_0.39 tibble_3.2.1 tidyselect_1.2.0 rstudioapi_0.14
#> [57] knitr_1.42 farver_2.1.1 htmltools_0.5.5 labeling_0.4.2
#> [61] rmarkdown_2.21 compiler_4.3.0 readxl_1.4.2
```