Getting started with ClustVarLV

Evelyne Vigneau

2022-05-28


The ClustVarLV package is dedicated to the CLV method for the Clustering of Variables Around Latent Variables (Vigneau & Qannari, 2003; Vigneau, Chen & Qannari, 2015).

In the presence of missing data, clustering and local imputation are performed simultaneously (Vigneau, 2018).

library(ClustVarLV)

For illustration, we consider the “apples_sh” dataset, which includes the sensory characterization of 12 varieties of apples and the consumers’ preferences for them (Daillant-Spinnler et al., 1996).

data(apples_sh)
# 43 sensory attributes of 12 varieties of apple from the southern hemisphere
senso <- apples_sh$senso
# Liking scores given by 60 consumers for each of the 12 varieties of apple
pref <- apples_sh$pref

Clustering of the sensory attributes

The aim is to find groups of sensory attributes that are correlated, or anti-correlated, with one another. Here “directional” groups are sought. Each group is associated with a latent component, which makes it possible to identify the underlying sensory dimensions.
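To make the idea of a directional latent component concrete, here is a minimal base-R sketch on simulated data (not the apples data). It assumes that the latent component of a directional group can be taken as the first principal component of the auto-scaled variables of that group. The objects z, X, Xs, pc1 and r are illustrative names only.

```r
# Toy data: 12 observations, 3 variables driven by a common factor z.
set.seed(1)
z <- rnorm(12)
X <- cbind(a = z + rnorm(12, sd = 0.2),
           b = -z + rnorm(12, sd = 0.2),   # anti-correlated with a
           c = z + rnorm(12, sd = 0.2))
Xs <- scale(X)                             # auto-scaling, as with sX = TRUE

# Assumption: latent component = first principal component of the group.
pc1 <- prcomp(Xs)$x[, 1]

# Correlation of each variable with the latent component.
r <- cor(Xs, pc1)
round(r, 2)
```

All three variables are strongly correlated with the component in absolute value, while a and b carry opposite signs. The overall sign of a principal component is arbitrary, so only the relative signs are meaningful.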

resclv_senso <- CLV(X = senso, method = "directional", sX = TRUE)
# The option sX = TRUE means that each attribute is auto-scaled (standard deviation = 1)
# Dendrogram of the CLV hierarchical clustering algorithm:
plot(resclv_senso, "dendrogram")

# Graph of the variation of the clustering criterion
plot(resclv_senso,"delta")

The graph of the variation of the clustering criterion between a partition into K clusters and a partition into (K-1) clusters (after consolidation) is useful for determining the number of clusters to be retained. Because the criterion clearly jumps when passing from 4 to 3 groups, a partition into 4 groups is retained.

# Summary of the CLV results for a partition into 4 groups
summary(resclv_senso,K=4)
## $number
## clusters
##  1  2  3  4 
## 12 14 12  5 
## 
## $prop_within
##      Group.1 Group.2 Group.3 Group.4
## [1,]  0.8355  0.7337   0.734  0.7289
## 
## $prop_tot
## [1] 0.7616
## 
## $groups
## $groups[[1]]
##         cor in group  |cor|next group
## iogreen         0.98             0.74
## ioredap        -0.97             0.80
## ioacids         0.96             0.74
## iounrip         0.96             0.68
## iocooka         0.96             0.81
## iagreen         0.92             0.60
## ioplums        -0.90             0.75
## iograss         0.89             0.72
## iayelow        -0.89             0.63
## iagreli         0.89             0.55
## iosweet        -0.86             0.79
## iawhite         0.76             0.60
## 
## $groups[[2]]
##         cor in group  |cor|next group
## asgreen         0.94             0.80
## flgreen         0.93             0.81
## flredap        -0.93             0.88
## flunrip         0.93             0.64
## asredap        -0.93             0.83
## asastri         0.92             0.58
## assweet        -0.90             0.63
## flsweet        -0.86             0.60
## flacids         0.86             0.62
## flgrass         0.85             0.82
## flplumc        -0.83             0.66
## asacids         0.83             0.56
## flpdrop        -0.76             0.59
## iodampt         0.33             0.27
## 
## $groups[[3]]
##         cor in group  |cor|next group
## txcrisp         0.97             0.54
## txjuicy         0.94             0.58
## txspong        -0.94             0.53
## fbhardn         0.93             0.55
## iajuicy         0.90             0.52
## flfresh         0.90             0.66
## iapulpy        -0.83             0.64
## flpearl        -0.81             0.77
## iatrans         0.79             0.66
## fbjuicy         0.79             0.53
## txslowb         0.78             0.45
## flsoapy        -0.64             0.53
## 
## $groups[[4]]
##         cor in group  |cor|next group
## asbitte         0.95             0.34
## flbitte         0.90             0.41
## flcoxli        -0.84             0.54
## flofffl         0.81             0.29
## flwater         0.75             0.31
## 
## 
## $set_aside
## NULL
## 
## $cormatrix
##       Comp1 Comp2 Comp3 Comp4
## Comp1  1.00  0.76  0.43  0.43
## Comp2  0.76  1.00  0.67  0.19
## Comp3  0.43  0.67  1.00  0.01
## Comp4  0.43  0.19  0.01  1.00
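As a quick consistency check on the printed summary, the value of $prop_tot appears to be the size-weighted average of the $prop_within values. This is an observation on the numbers shown above, not a statement about how the package computes them. The variable names below are illustrative.

```r
# Group sizes and within-group proportions copied from the summary (K = 4).
sizes       <- c(12, 14, 12, 5)
prop_within <- c(0.8355, 0.7337, 0.7340, 0.7289)

# Size-weighted average of the within-group proportions.
prop_tot_check <- sum(sizes * prop_within) / sum(sizes)
round(prop_tot_check, 4)   # 0.7616, the printed $prop_tot
```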

The function plot_var() displays the groups of variables in a two-dimensional space obtained by Principal Component Analysis. Several options are available: choosing the axes, adding labels, producing a plot with symbols instead of colours, and drawing either a single plot or one plot per group of variables.

# Representation of the group membership for a partition into 4 groups
plot_var(resclv_senso, K = 4, label = TRUE, cex.lab = 0.8)

or

plot_var(resclv_senso, K = 4, beside = TRUE)

Additional functions:

# Extract the group membership of each variable
get_partition(resclv_senso,K=4,type="vector")
# or 
get_partition(resclv_senso,K=4,type="matrix")

# Extract the group latent variables 
get_comp(resclv_senso,K=4)

Clustering of the consumers’ preference data

The aim is to find segments of consumers. Herein “local” groups are sought. Each group latent variable represents a synthetic direction of preference. If, simultaneously, the aim is to explain these directions of preference by means of the sensory attributes of the products, the sensory data has to be included as external data.

res.segext<- CLV(X = pref, Xr = senso, method = "local", sX=TRUE, sXr = TRUE)

print(res.segext)
plot(res.segext,"dendrogram")

plot(res.segext,"delta") 

Two or three segments may be explored. To compare the partitions into two and three segments:

table(get_partition(res.segext,K=2),get_partition(res.segext,K=3))
##    
##      1  2  3
##   1 12 28  0
##   2  2  0 18

Since each latent variable is a linear combination of the external (sensory) variables, it is possible to extract the associated loadings:

get_loading(res.segext,K=3)
## [[1]]
##           [,1]
## X1  0.07142857
## X2  0.07142857
## X6  0.07142857
## X7  0.07142857
## X8  0.07142857
## X28 0.07142857
## X34 0.07142857
## X40 0.07142857
## X42 0.07142857
## X48 0.07142857
## X49 0.07142857
## X53 0.07142857
## X54 0.07142857
## X58 0.07142857
## 
## [[2]]
##           [,1]
## X3  0.03571429
## X5  0.03571429
## X9  0.03571429
## X12 0.03571429
## X13 0.03571429
## X14 0.03571429
## X15 0.03571429
## X17 0.03571429
## X20 0.03571429
## X23 0.03571429
## X25 0.03571429
## X26 0.03571429
## X27 0.03571429
## X29 0.03571429
## X30 0.03571429
## X31 0.03571429
## X35 0.03571429
## X36 0.03571429
## X37 0.03571429
## X38 0.03571429
## X39 0.03571429
## X41 0.03571429
## X44 0.03571429
## X46 0.03571429
## X51 0.03571429
## X55 0.03571429
## X59 0.03571429
## X60 0.03571429
## 
## [[3]]
##           [,1]
## X4  0.05555556
## X10 0.05555556
## X11 0.05555556
## X16 0.05555556
## X18 0.05555556
## X19 0.05555556
## X21 0.05555556
## X22 0.05555556
## X24 0.05555556
## X32 0.05555556
## X33 0.05555556
## X43 0.05555556
## X45 0.05555556
## X47 0.05555556
## X50 0.05555556
## X52 0.05555556
## X56 0.05555556
## X57 0.05555556
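A short arithmetic check helps when reading this output: within each segment the printed loadings are all equal, and their value coincides with one over the number of consumers in the segment (14, 28 and 18). Equal weights of this kind are what a plain average of the segment's variables would give. This is an observation on the printed numbers, offered as a sketch, not a description of the package internals.

```r
# Segment sizes for K = 3 (number of rows in each block of the output above).
sizes <- c(14, 28, 18)
inv_sizes <- round(1 / sizes, 8)
inv_sizes   # 0.07142857 0.03571429 0.05555556

# Equal weights 1/p applied to p columns reproduce the row means (toy data).
set.seed(2)
Xtoy <- scale(matrix(rnorm(12 * 4), nrow = 12))
w <- rep(1 / ncol(Xtoy), ncol(Xtoy))
same_as_mean <- isTRUE(all.equal(as.vector(Xtoy %*% w), rowMeans(Xtoy)))
same_as_mean
```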

Using the CLV_kmeans function

This procedure is less time-consuming when the number of variables is large. The number of clusters must be fixed in advance (e.g. 3).

The algorithm can be initialized at random, “nstart” times:

res.clvkm.rd<-CLV_kmeans(X = pref, Xr = senso, method = "local", sX=TRUE,
                         sXr = TRUE, clust=3, nstart=100)

or the initialization can be defined by the user, for instance on the basis of the clusters obtained by cutting the CLV dendrogram to get 3 clusters:

res.clvkm.hc<-CLV_kmeans(X = pref, Xr = senso, method = "local", sX=TRUE,
                        sXr = TRUE, clust=res.segext[[3]]$clusters[1,])

It is possible to compare the partitions according to the procedure used:

table(get_partition(res.segext,K=3),get_partition(res.clvkm.hc,K=3)) 
##    
##      1  2  3
##   1 14  0  0
##   2  0 28  0
##   3  0  0 18

In this case, the CLV solution is the same as the CLV_kmeans solution initialized with the partition obtained by cutting the dendrogram.

table(get_partition(res.segext,K=3),get_partition(res.clvkm.rd,K=3)) 
##    
##      1  2  3
##   1 13  0  1
##   2  0  0 28
##   3  1 17  0

The two partitions are very close: up to a relabelling of the clusters, only 2 of the 60 consumers are assigned differently.
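Because cluster labels are arbitrary, agreement between two partitions has to be read up to a permutation of the labels. The following base-R sketch does this for the cross-table printed above by trying all six matchings of three labels. The object names tab, perms, agree and best are illustrative.

```r
# Cross-table copied from the output above (rows: CLV, columns: CLV_kmeans).
tab <- matrix(c(13,  0,  1,
                 0,  0, 28,
                 1, 17,  0), nrow = 3, byrow = TRUE)

# All 6 ways of matching the 3 row labels to the 3 column labels.
perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Number of consumers falling in the matched cells, for each matching.
agree <- apply(perms, 1, function(p) sum(tab[cbind(1:3, p)]))

best <- perms[which.max(agree), ]   # column label matched to rows 1, 2, 3
best                                # 1 3 2
max(agree)                          # 58 consumers classified identically
sum(tab) - max(agree)               # 2 consumers classified differently
```

Enumerating permutations is only practical for a small number of clusters, which is the case here.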

Clustering while setting aside atypical or noisy variables

This functionality is available with the CLV_kmeans procedure. See Vigneau, Qannari, Navez & Cottet (2016) and Vigneau & Chen (2016) for theoretical details.

For the sensory data, applying the “kplusone” strategy (as shown below) makes it possible to identify a spurious attribute (here iodampt) and set it aside in a group “G0”.

clvkm_senso_kpone<-CLV_kmeans(X = senso, method = "directional",sX=TRUE, clust=4, strategy="kplusone",rho=0.5)
get_partition(clvkm_senso_kpone,type="matrix")
##         G.0 G.1 G.2 G.3 G.4
## iosweet   0   0   1   0   0
## ioacids   0   0   1   0   0
## iogreen   0   0   1   0   0
## ioredap   0   0   1   0   0
## iograss   0   0   1   0   0
## iounrip   0   0   1   0   0
## iocooka   0   0   1   0   0
## iodampt   1   0   0   0   0
## ioplums   0   0   1   0   0
## iawhite   0   0   1   0   0
## iagreen   0   0   1   0   0
## iayelow   0   0   1   0   0
## iagreli   0   0   1   0   0
## iajuicy   0   0   0   0   1
## iatrans   0   0   0   0   1
## iapulpy   0   0   0   0   1
## fbjuicy   0   0   0   0   1
## fbhardn   0   0   0   0   1
## txcrisp   0   0   0   0   1
## txjuicy   0   0   0   0   1
## txslowb   0   0   0   0   1
## txspong   0   0   0   0   1
## flgreen   0   1   0   0   0
## flredap   0   1   0   0   0
## flsweet   0   1   0   0   0
## flacids   0   1   0   0   0
## flbitte   0   0   0   1   0
## flgrass   0   1   0   0   0
## flfresh   0   0   0   0   1
## flpdrop   0   1   0   0   0
## flwater   0   0   0   1   0
## flofffl   0   0   0   1   0
## flplumc   0   1   0   0   0
## flunrip   0   1   0   0   0
## flcoxli   0   0   0   1   0
## flpearl   0   0   0   0   1
## flsoapy   0   0   0   0   1
## assweet   0   1   0   0   0
## asacids   0   1   0   0   0
## asbitte   0   0   0   1   0
## asgreen   0   1   0   0   0
## asredap   0   1   0   0   0
## asastri   0   1   0   0   0
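The role of the threshold rho can be illustrated with a simplified, one-pass rule. The assumption here, which is not taken from the package code, is that a variable is set aside when its absolute correlation with every group latent variable is below rho. The correlations are copied from the summary of the hierarchical solution shown earlier (K = 4), so this is only an illustration and not the iterative CLV_kmeans algorithm.

```r
rho <- 0.5

# |cor| with the variable's own group and with the next closest group,
# read from the summary output for a few attributes.
cors <- matrix(c(0.33, 0.27,
                 0.76, 0.60,
                 0.64, 0.53,
                 0.75, 0.31),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("iodampt", "iawhite", "flsoapy", "flwater"),
                               c("own_group", "next_group")))

# Assumed rule: set aside when the largest |cor| is below rho.
set_aside <- apply(abs(cors), 1, max) < rho
names(which(set_aside))   # "iodampt"
```

Under this reading, iodampt is the only attribute among the four whose strongest correlation (0.33) falls below rho = 0.5, which is consistent with it being the one placed in G.0 above.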

For the consumers’ liking data, varying the parameter “rho” associated with the “kplusone” strategy sets aside a larger or smaller proportion of consumers:

sizG0<-NULL
for (r in seq(0,1,0.1)) {
  res<-CLV_kmeans(X = pref, method = "local", sX=TRUE, clust=3, nstart=20, strategy="kplusone",rho=r)
  sizG0<-c(sizG0,sum(get_partition(res)==0))
}
plot(seq(0,1,0.1),sizG0,type="b",xlab="rho",ylab="# var in noise cluster")

With rho=0.4, 8 out of 60 consumers are assigned to the noise cluster. They are highlighted in gray when using the “plot_var” function.

plot_var(CLV_kmeans(X = pref, method = "local", sX=TRUE, clust=3, nstart=20, strategy="kplusone",rho=0.4))

References

Daillant-Spinnler B., MacFie H.J.H., Beyts P., Hedderley D. (1996). Relationships between perceived sensory properties and major preference directions of 12 varieties of apples from the southern hemisphere. Food Quality and Preference, 7(2), 113-126.

Vigneau E., Qannari E.M. (2003). Clustering of variables around latent components. Comm. Stat, 32(4), 1131-1150.

Vigneau E., Chen M., Qannari E.M. (2015). ClustVarLV: An R Package for the clustering of Variables around Latent Variables. The R Journal, 7(2), 134-148.

Vigneau E., Qannari E. M., Navez B., Cottet V. (2016). Segmentation of consumers in preference studies while setting aside atypical or irrelevant consumers. Food Quality and Preference, 47, 54-63.

Vigneau E., Chen M. (2016). Dimensionality reduction by clustering of variables while setting aside atypical variables. Electronic Journal of Applied Statistical Analysis, 9(1), 134-153.

Vigneau E. (2018). Segmentation of a panel of consumers with missing data. Food Quality and Preference, 67, 10-17.

Vigneau E. (2021). Clustering of Variables for Enhanced Interpretability of Predictive Models. Informatica, 45, 507-516.