Clustering of unknown subpopulations in admixture models

Xavier Milhaud

library(admix)

The clustering of populations following admixture models is, for now, based on the K-sample test theory (see (Milhaud and Vandekerkhove 2023). Consider \(K\) samples. For \(i=1,...,K\), sample \(X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})\) follows \[L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.\]

We still use IBM approach to perform pairwise hypothesis testing. The idea is to adapt the K-sample test procedure to obtain a data-driven method that cluster the \(K\) populations into \(N\) subgroups, characterized by a common unknown mixture component. The advantages of such an approach is twofold:

This clustering technique thus allows to cluster unobserved subpopulations instead of individuals. We call this algorithm the K-sample 2-component mixture clustering (K2MC).

Algorithm

We now detail the steps of the algorithm.

  1. Initialization: create the first cluster to be filled, i.e. \(c = 1\). By convention, \(S_0=\emptyset\).
  2. Select \(\{x,y\}={\rm argmin}\{d_n(i,j); i \neq j \in S \setminus \bigcup_{k=1}^c S_{k-1}\}\).
  3. Test \(H_0\) between \(x\) and \(y\).
If $H_0$ is not rejected then $S_1 = \{x,y\}$,\\
Else $S_1 = \{x\}$, $S_{c+1} = \{y\}$ and then $c=c+1$.
  1. While \(S\setminus \bigcup_{k=1}^c S_k = \emptyset\) do
Select $u={\rm argmin}\{d(i,j); i\in S_c, j\in S\setminus \bigcup_{k=1}^c S_k\}$;
Test $H_0$ the simultaneous equality of all the $f_j$, $j\in S_c$ :\\
  If $H_0$ not rejected, then put $S_c=S_c\bigcup \{u\}$;\\
  Else $S_{c+1} = \{u\}$ and $c = c+1$.

Applications

On \(\mathbb{R}^+\)

We present a case study with 5 populations to cluster on \(\mathbb{R}^+\), with Gamma-Exponential and Gamma-Gamma mixtures.

## Simulate data (chosen parameters indicate 3 clusters (populations (1,3), (2,5) and 4)!):
list.comp <- list(f1 = "gamma", g1 = "exp",
                  f2 = "gamma", g2 = "exp",
                  f3 = "gamma", g3 = "gamma",
                  f4 = "exp", g4 = "exp",
                  f5 = "gamma", g5 = "exp")
list.param <- list(f1 = list(shape = 16, rate = 4), g1 = list(rate = 1/3.5),
                   f2 = list(shape = 14, rate = 2), g2 = list(rate = 1/5),
                   f3 = list(shape = 16, rate = 4), g3 = list(shape = 12, rate = 2),
                   f4 = list(rate = 1/2), g4 = list(rate = 1/7),
                   f5 = list(shape = 14, rate = 2), g5 = list(rate = 1/6))
A.sim <- rsimmix(n=6000, unknownComp_weight=0.8, comp.dist = list(list.comp$f1,list.comp$g1),
                 comp.param = list(list.param$f1, list.param$g1))$mixt.data
B.sim <- rsimmix(n=6000, unknownComp_weight=0.5, comp.dist = list(list.comp$f2,list.comp$g2),
                 comp.param = list(list.param$f2, list.param$g2))$mixt.data
C.sim <- rsimmix(n=6000, unknownComp_weight=0.65, comp.dist = list(list.comp$f3,list.comp$g3),
                 comp.param = list(list.param$f3, list.param$g3))$mixt.data
D.sim <- rsimmix(n=6000, unknownComp_weight=0.75, comp.dist = list(list.comp$f4,list.comp$g4),
                 comp.param = list(list.param$f4, list.param$g4))$mixt.data
E.sim <- rsimmix(n=6000, unknownComp_weight=0.55, comp.dist = list(list.comp$f5,list.comp$g5),
                 comp.param = list(list.param$f5, list.param$g5))$mixt.data
## Look for the clusters:
list.comp <- list(f1 = NULL, g1 = "exp",
                  f2 = NULL, g2 = "exp",
                  f3 = NULL, g3 = "gamma",
                  f4 = NULL, g4 = "exp",
                  f5 = NULL, g5 = "exp")
list.param <- list(f1 = NULL, g1 = list(rate = 1/3.5),
                   f2 = NULL, g2 = list(rate = 1/5),
                   f3 = NULL, g3 = list(shape = 12, rate = 2),
                   f4 = NULL, g4 = list(rate = 1/7),
                   f5 = NULL, g5 = list(rate = 1/6))
clusters <- admix_clustering(samples = list(A.sim,B.sim,C.sim,D.sim,E.sim), n_sim_tab = 10,
                             comp.dist=list.comp, comp.param=list.param, parallel=FALSE, n_cpu=2)
#> 
  |                                                        
  |                                                  |   0%
  |                                                        
  |====                                              |   8%
  |                                                        
  |==============================                    |  60%
  |                                                        
  |========================================          |  80%
  |                                                        
  |==================================================| 100%
clusters
#> Call:
#> admix_clustering(samples = list(A.sim, B.sim, C.sim, D.sim, E.sim), 
#>     n_sim_tab = 10, comp.dist = list.comp, comp.param = list.param, 
#>     parallel = FALSE, n_cpu = 2)
#> 
#> The number of populations/samples under study is 5.
#> The level of the underlying k-sample testing procedure is set to 5%.
#> 
#> The number of detected clusters in these populations equals 3.
#> The p-values of the k-sample tests (showing when to close the clusters (i.e. p-value < 0.05) equal: 0.333, 0, 0, 0.111.
#> 
#> The list of clusters with populations belonging to them (in numeric format, i.e. inside c()) :
#>    - Cluster #1: vector of populations c(1, 3)
#>   - Cluster #2: vector of populations 4
#>   - Cluster #3: vector of populations c(2, 5)
#> 
#> The list of estimated weights for the unknown component distributions in each detected cluster
#>       (in the same format and order as listed populations for clusters just above) :
#>    - estimated weights of the unknown component distributions for cluster  1 :  c(0.809178290213258, 0.631001196480248)
#>   - estimated weights of the unknown component distributions for cluster  2 :  numeric(0)
#>   - estimated weights of the unknown component distributions for cluster  3 :  c(0.523639573546502, 0.58440106951725)
#> 
#> The matrix giving the distances between populations, used in the clustering procedure through the k-sample tests:
#>             [,1]       [,2]        [,3]      [,4]       [,5]
#> [1,]  0.00000000 29.9669866  0.02926702  7.540128 45.3573537
#> [2,] 29.96698661  0.0000000 19.15790613  4.827311  0.1379936
#> [3,]  0.02926702 19.1579061  0.00000000 69.499273 20.9549883
#> [4,]  7.54012842  4.8273114 69.49927347  0.000000  3.1603965
#> [5,] 45.35735370  0.1379936 20.95498826  3.160397  0.0000000
Milhaud, Pommeret, X., and P. Vandekerkhove. 2023. “Contamination-Source Based k-Sample Clustering.” https://hal.science/hal-04129130.