This is a copy of the vignette of package `parallelpam`

. It is included here since parallelpam is underlying this package and you will need to know how does it work to process the single cell data using the PAM algorithm. But you must NOT load package `parallelpam`

. All the functions detailed below are already included into `scellpam`

and are available just by loading it with `library(scellpam)`

.

`library(scellpam)`

The package `parallelpam`

(Domingo (2022b)) is meant as a way to apply the PAM algorithm to quite (or very) big sets of data, such as the results of single-cell sequencing, but can be generally used for any type of data, as long as a distance/dissimilarity matrix can be calculated.

Differently to other packages, its main strength is its ability to perform clustering based on Partitioning Around Medoids (PAM) using a large number of items and doing it in parallel. Memory and speed limitations are reduced by extensive use of C++ programming which allows use of alternative data types (for instance, float vs. double to represent distance/dissimilarity), intermediate disk-storage to avoid memory copy operations whenever possible and use of threads.

Both phases of PAM (initialization with BUILD and optimization) have been parallelized so it you have a multi-core machine, many threads will be launched with a great acceleration of the calculations. This is done automatically, even you are allowed to choose the number of threads yourself for comparison purposes or to allow your machine to execute other things simultaneously.

Also, calculation of the matrix of distances/dissimilarities from the initial data and calculation of silhouette of the resulting classification are available, too, and are calculated in parallel in a multi-core machine.

The data are stored in memory reading them from binary files created with the package `jmatrix`

(Domingo (2022a)). To be familiar with them please read the vignette called `jmatrixsc`

which is included with this package.

WARNING: you must NOT load neither `jmatrix`

nor `parallelpam`

explicitly. Indeed, you do not need even to install them. All their functions have been included here, too, so doing `library(scellpam)`

is enough.

First of all, the package can show quite informative (but sometimes verbose) messages in the console. To turn on/off such messages you can use.

```
# Initially, state of debug is FALSE. Turn it on exclusively for
# the parallelpam part with
ScellpamSetDebug(FALSE,debparpam=TRUE)
#> Debugging for parallelpam inside scellpam package set to ON.
# There is another parameter, debjmat, to turn on messages about
# binary matrix creation/manipulation. By default is FALSE but turn it on
# if you like with
# ScellpamSetDebug(FALSE,debparpam=TRUE,debjmat=TRUE)
```

The first step is to load raw data coming from external sources like the main formats used in single cell experiments which should have been stored as a binary matrix file in `jmatrix`

format. Since this is a separate package, and for purposes of illustration, we will create an artificial matrix for a small problem that fits in `R`

memory with 5000 vectors with 500 dimensions each. Then we will calculate the dissimilarity matrix and finally we will apply PAM algorithm to it.

```
# Create the matrix with row names V1 to V5000 and column names d1 to d500
<-5000
nvec<-500
ndim<-matrix(runif(nvec*ndim),nrow=nvec)
Prownames(P)<-paste0("V",1:nvec)
colnames(P)<-paste0("d",1:ndim)
# Write it to disk as a binary file in jmatrix format.
# Please, see vignette jmatrixsc.
JWriteBin(P,"datatest.bin",dtype="float",dmtype="full",
comment="Synthetic problem data to test PAM")
```

For your real problem, the input format can be a .csv file. See function `CsvToJMat`

in package `scellpam`

(Domingo (2022c)).

To know details about the generated files do

```
JMatInfo("datatest.bin")
#> File: datatest.bin
#> Matrix type: FullMatrix
#> Number of elements: 2500000
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 5000
#> Number of columns: 500
#> Metadata: Stored names of rows and columns.
#> Metadata comment: "Synthetic problem data to test PAM"
```

This is the most computationally intensive part of the process (particularly, for samples with a high number of points and/or high dimensionalty) and therefore has been programmed in parallel, taking advantage of the multiple cores of the machine, if available. The funcion is called `CalcAndWriteDissimilarityMatrix`

. Its input and output files (first and second parameters) are of course compulsory. Input file can be a sparse of full binary `jmatrix`

(but obviously, not a symmetric matrix).

WARNING: notice that the vectors to calculate dissimilarities amongst them MUST be stored BY ROWS. This is due to efficiency reasons.

Output dissimilarity matrix will always be a binary symmetric (obviously square) matrix with a number of rows (and columns) equal to the number of rows (in this case, vectors) of the input file. The type of distance/dissimilarity can be `L1`

(Manhattan distance), `L2`

(Euclidean distance) or ‘Pearson’ (Pearson dissimilarity coefficient). The resulting matrix stores only names for the rows, which are the names of the vectors stored as rows in file `datatest.bin`

. If the number of vectors is N, only \(N(N+1)/2\) dissimilarity values are really stored.

A note on the number of threads, valid also for other algorithms that will be explained later:

Possible values for the number of threads are:

`-1`

(or any negative number) to indicate you do not want to use threads (strictly sequential computation).`0`

to allow the program to choose the number of threads according to the problem size and the number of available cores.Any positive number to force the use of such number of threads.

Choosing explicitly a number of threads bigger than the number of available cores is allowed, but discouraged and the program emits a warning about it.

With respect to option `0`

, for small problems (in this case, less than 1000 vectors) the function makes the choice of not using threads, since the overhead of opening and waiting termination is not worth. For bigger problems the number of chosen threads is the number of available cores, or twice this numer if the processor is capable of hyperthreading. Nevertheless, this choice may not be the best, depending on your machine, possibly due (I guess) to the memory access confilcts created by the need of keep processor cache coherence. You may have to do some trials with your data in your machine.

Now, let us try it with this small dataset.

```
CalcAndWriteDissimilarityMatrix("datatest.bin","datatestL2.bin",
distype="L2",restype="float",
comment="L2 distance for vectors in
jmatrix file datatest.bin",nthreads=0)
#> Input matrix is a full matrix with elements of type 'float' and size (5000,500)
#> Read full matrix from file datatest.bin. Its size is [5000 x 500] and it uses 2500000 elements of 4 bytes each with accounts for 9.53674 MBytes.
#> Creating dissimilarity matrix of size (5000x5000)
#> Loading required package: memuse
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 251003456 Kib.
#> That seems OK.
#> Launching up to 128 simultaneous threads.
#> End of dissimilarity matrix calculation (parallel version). Elapsed time: 0.313559 s
#> Output binary file datatestL2.bin written.
```

The resulting matrix is stored as a binary symmetric matrix of float values, as we can check.

```
JMatInfo("datatestL2.bin")
#> File: datatestL2.bin
#> Matrix type: SymmetricMatrix
#> Number of elements: 25000000 (12502500 really stored)
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 5000
#> Number of columns: 5000
#> Metadata: Stored only names of rows.
#> Metadata comment: "L2 distance for vectors in
#> jmatrix file datatest.bin"
```

The last step is to take the previously calculated matrix and apply the Partitioning Around Medoids classifier. Function is `ApplyPAM`

. First parameter (name of the file containing the dissimilarity matrix in `jmatrix`

format) and second parameter (`k`

, number of medoids) are compulsory. The names and default values for the rest of parameters are as in this example.

```
=ApplyPAM("datatestL2.bin",k=5,init_method="BUILD",max_iter=1000,nthreads=0)
L#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 250996164 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting BUILD initialization method, parallel version with 128 threads.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 1400. TD=8.777583
#> Looking for medoid 2. Medoid 2 found. Point 2531. 2115 reassigned points. TD=8.687799
#> Looking for medoid 3. Medoid 3 found. Point 1385. 1444 reassigned points. TD=8.638450
#> Looking for medoid 4. Medoid 4 found. Point 3147. 1073 reassigned points. TD=8.605658
#> Looking for medoid 5. Medoid 5 found. Point 2778. 852 reassigned points. TD=8.581519
#> Current TD: 8.581519
#> BUILD initialization method (parallel version) finished. Elapsed time: 0.0264367 s
#> Starting improved FastPAM1 method in parallel implementation with 128 threads.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Exiting, since DeltaTDst is 0.000000. Final value of TD is 8.581515
#> Optimization method (parallel version) finished. Elapsed time: 0.00582247 s
#> Time summary (parallel implementation with 128 threads).
#> Initalization: 0.0264367 s (method BUILD).
#> Optimization: 0.00582247 s in 0 iterations.
#> Total time: 0.0322592 s (0 minutes, 0.0322592 seconds).
```

Parameters `init_method`

(and another optional parameter, `initial_med`

) deserve special comment. The first is the method to initialize the medoids. Its possible values are `BUILD`

, `LAB`

and `PREV`

. The rest of the algorithm make medoid swapping between the points of the initial set made with `BUILD`

or `LAB`

and the rest of points until no swap can reduce the objective function, which is the sum of distances of each point to its closest medoid. But this may fall (and indeed falls) in local minima. If you initialize with `BUILD`

or `LAB`

the optional parameter `initial_med`

cannot be used.

The initialization methods `BUILD`

and `LAB`

are described in the paper from Schubert at al. (Schubert and Rousseeuw (2019)). `BUILD`

is deterministic. `LAB`

uses a sample of the total points to initialize. Obviously, you can run `LAB`

to get different initializations and compare the results.

The returned object is a list with two fields: `med`

and `clasif`

. This will be explained later.

From now on, typical calls to obtain only the initial medoids would be

```
=ApplyPAM("datatestL2.bin",k=5,init_method="BUILD",max_iter=0)
Lbuild#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 250984712 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting BUILD initialization method, parallel version with 128 threads.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 1400. TD=8.777583
#> Looking for medoid 2. Medoid 2 found. Point 2531. 2115 reassigned points. TD=8.687799
#> Looking for medoid 3. Medoid 3 found. Point 1385. 1444 reassigned points. TD=8.638450
#> Looking for medoid 4. Medoid 4 found. Point 3147. 1073 reassigned points. TD=8.605658
#> Looking for medoid 5. Medoid 5 found. Point 2778. 852 reassigned points. TD=8.581519
#> Current TD: 8.581519
#> BUILD initialization method (parallel version) finished. Elapsed time: 0.0269161 s
#> Time summary (parallel implementation with 128 threads).
#> Initalization: 0.0269161 s (method BUILD).
#> Optimization: 0 s in 0 iterations.
#> Total time: 0.0269161 s (0 minutes, 0.0269161 seconds).
=ApplyPAM("datatestL2.bin",k=5,init_method="LAB",max_iter=0)
Llab1#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 250975148 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting LAB initialization method, serial version.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 1762. TD=8.897045
#> Looking for medoid 2. Medoid 2 found. Point 2534. 2837 reassigned points. TD=8.770867
#> Looking for medoid 3. Medoid 3 found. Point 3799. 1817 reassigned points. TD=8.707559
#> Looking for medoid 4. Medoid 4 found. Point 3147. 1496 reassigned points. TD=8.659385
#> Looking for medoid 5. Medoid 5 found. Point 509. 1068 reassigned points. TD=8.629243
#> Current TD: 8.629243
#> LAB initialization method (serial version) finished. Elapsed time: 0.00198886 s
#> Time summary (parallel implementation with 128 threads).
#> Initalization: 0.00198886 s (method LAB).
#> Optimization: 0 s in 0 iterations.
#> Total time: 0.00198886 s (0 minutes, 0.00198886 seconds).
=ApplyPAM("datatestL2.bin",k=5,init_method="LAB",max_iter=0)
Llab2#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 250977532 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting LAB initialization method, serial version.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 159. TD=8.866509
#> Looking for medoid 2. Medoid 2 found. Point 2493. 2708 reassigned points. TD=8.737890
#> Looking for medoid 3. Medoid 3 found. Point 418. 1162 reassigned points. TD=8.701546
#> Looking for medoid 4. Medoid 4 found. Point 2275. 1374 reassigned points. TD=8.658340
#> Looking for medoid 5. Medoid 5 found. Point 2548. 843 reassigned points. TD=8.635493
#> Current TD: 8.635493
#> LAB initialization method (serial version) finished. Elapsed time: 0.00207224 s
#> Time summary (parallel implementation with 128 threads).
#> Initalization: 0.00207224 s (method LAB).
#> Optimization: 0 s in 0 iterations.
#> Total time: 0.00207224 s (0 minutes, 0.00207224 seconds).
```

As it can be seen, to get and compare different initializations you must set the parameter `max_iter`

to the value `0`

. In this case no iterations of objective function reduction are performed, and the returned object contains the initial medoids and the classification induced by them. Notice that even looking equal, the results of the latter two calls are different since `LAB`

initializes with a random component (the sample to choose initial medoids is chosen randomly).

You can check that the medoids, stored in `Llab1$med`

and `Llab2$med`

(see more on this below) are in general, different.

Now, these results can be used to initialize PAM if you find that any of them contains a specially good set of medoids. This is the role of method `PREV`

that we have mentioned before. A typical call would be

```
=ApplyPAM("datatestL2.bin",k=5,init_method="PREV",initial_med=Llab2$med)
Llab2Final#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 250976932 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting improved FastPAM1 method in parallel implementation with 128 threads.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 3 (point 418) swapped with point 1401; TD-change=-0.032879; TD=8.602636. 1595 reassigned points.
#> Iteration 1. Medoid at place 5 (point 2548) swapped with point 3942; TD-change=-0.012228; TD=8.590408. 1193 reassigned points.
#> Iteration 2. Medoid at place 1 (point 160) swapped with point 1385; TD-change=-0.004368; TD=8.586040. 1198 reassigned points.
#> Iteration 3. Medoid at place 4 (point 2275) swapped with point 2531; TD-change=-0.004694; TD=8.581347. 1167 reassigned points.
#> Iteration 4. Exiting, since DeltaTDst is 0.000000. Final value of TD is 8.581347
#> Optimization method (parallel version) finished. Elapsed time: 0.0289524 s
#> Time summary (parallel implementation with 128 threads).
#> Initalization: 0 s (method PREV).
#> Optimization: 0.0289524 s in 3 iterations (0.00965079 seconds/iteration).
#> Total time: 0.0289524 s (0 minutes, 0.0289524 seconds).
```

The initial set of medoids is taken from the object returned by the former calls.

With respect to that object, as we said it is a list with two vectors. The first one, `L$med`

, has as many components as requested medoids and the second, `L$clasif`

, has as many components as instances.

Medoids are expressed in `L$med`

by its number in the array of vectors (row number in the dissimilarity matrix) starting at 1 (`R`

convention).

`L$clasif`

contains the number of the medoid (i.e.: the cluster) to which each instance has been assigned, according to their order in `L$med`

(also from 1).

This means that if `L$clasif[p]`

is `m`

, the point `p`

belongs to the class grouped around medoid `L$med[m]`

. Let us see it.

```
# Which are the indexes of the points chosen as medoids?
$med
L#> V1401 V2531 V1385 V3147 V2778
#> 1401 2531 1385 3147 2778
#
# In which class has point 147 been classified?
$clasif[147]
L#> V147
#> 1
#
# And which is the index (row in the dissimilarity matrix) of
# the medoid closest to point 147?
$med[L$clasif[147]]
L#> V1401
#> 1401
```

In this way, values in `L$clasif`

are between 1 and the number of medoids, as we can see

```
min(L$clasif)
#> [1] 1
max(L$clasif)
#> [1] 5
```

They can be used as factors.

It is interesting to filter points based on the degree in which they belong to a cluster. Indeed, cluster refinement can be done getting rid of points far away from any cluster center, or which are at a similar distance of two or more of them.

This is characterized by the silhouette of each point. Three functions deal with this: `CalculateSilhouette`

, `FilterBySilhouetteQuantile`

and `FilterBySilhouetteThreshold`

.

```
=CalculateSilhouette(Llab2$clasif,"datatestL2.bin",nthreads=0)
S#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 250976180 Kib.
#> That seems OK.
#> Calculating silhouette (parallel version) with 128 threads.
#> 5000 points classified in 5 classes.
#> Finished parallel implementation of silhouette (including dissimilarity matrix load). Elapsed time: 0.0316314 s
```

The parameters to function `CalculateSilhouette`

are the array of class membership, as returned by `ApplyPAM`

in its `clasif`

field, and the file with the matrix of dissimilarities.

A parallel implementation has been programmed, being nthreads as explained before.

Silhouette is a number in \([-1,1]\); the higher its value, the most centered a point is in its cluster.

The returned object `S`

is a numeric vector with the value of the silhouette for each point, which will be a named vector if the classification vector was named.

This vector can be converted to an object of the class `cluster::silhouette`

with the function `NumSilToClusterSil`

(which needs the vector of classifications, too). This is done so that, if you load the package `cluster`

(Maechler et al. (2022)), plot will generate the kind of silhouette plots included in such package.

```
<- NumSilToClusterSil(Llab2$clasif,S)
Sclus library(cluster)
plot(Sclus)
```

Probably the plot does not look very nice with this random data which yields a very bad clustering (since the points are not, by its own nature, organized in clusters) but with real data you should see significant things (see package `scellpam`

(Domingo (2022c))).

Once the silhouette is calculated we can filter it by quantile or by threshold. All points under this quantile or threshold will be discarded, except if they are medoids. Parameters are:

The silhouette, as returned by CalculateSilhouette.

The list of medoids/clasif, as returned by ApplyPAM.

The file with matrix of counts for the whole set of cells, which was our first input.

The file that will contain the matrix of counts of the remaining cells.

The file with the dissimilarity matrix for the whole set, as calculated by

`CalcAndWriteDissimilarityMatrix`

.The file that will contain the dissimilarity for the remaining cells.

And (depending on the function used) the quantile in \([0,1]\) or the silhouette threshold in \([-1,1]\).

As an example,

```
=FilterBySilhouetteQuantile(S,Llab2,"datatest.bin",
Lfilt"datatestFilt.bin","datatestL2.bin",
"datatestL2Filt.bin",0.2)
```

If the original matrix contained row and column names, the column names are copied and the row names are transported for those rows that remain. The same happens with respect to rows of the dissimilarity matrix.

Notice that the new dissimilarity matrix could have been calculated from the matrix of filtered counts with `CalcAndWriteDissimilarityMatrix`

but creating it here, simply getting rid of the filtered rows and columns is much faster.

Also, if a medoid is below the silhouette quantile, it will not be filtered out, but a warning message will be shown, since this is a strange situation that may indicate that some of your clusters are not real but artifacts due to a few outliers that are close to each other.

But remember that this was the result of the first step of the PAM algorithm, so probably you will want to make them iterate.

```
=ApplyPAM("datatestL2Filt.bin",k=length(Lfilt$med),init_method="PREV",initial_med=Lfilt$med)
Lfinal#> Reading symmetric distance/dissimilarity matrix datatestL2Filt.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 31242 KiB, which is 0.01% of the available memory, which is 250898488 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting improved FastPAM1 method in parallel implementation with 128 threads.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 3 (point 345) swapped with point 1116; TD-change=-0.025873; TD=8.593267. 1213 reassigned points.
#> Iteration 1. Medoid at place 5 (point 2029) swapped with point 2014; TD-change=-0.006921; TD=8.586347. 959 reassigned points.
#> Iteration 2. Medoid at place 1 (point 143) swapped with point 2215; TD-change=-0.003715; TD=8.582632. 867 reassigned points.
#> Iteration 3. Medoid at place 4 (point 1817) swapped with point 2516; TD-change=-0.001868; TD=8.580764. 961 reassigned points.
#> Iteration 4. Exiting, since DeltaTDst is 0.000000. Final value of TD is 8.580764
#> Optimization method (parallel version) finished. Elapsed time: 0.0262346 s
#> Time summary (parallel implementation with 128 threads).
#> Initalization: 0 s (method PREV).
#> Optimization: 0.0262346 s in 3 iterations (0.00874485 seconds/iteration).
#> Total time: 0.0262346 s (0 minutes, 0.0262346 seconds).
```

Of course, we might have used simply 5 as number of medoids, `k`

, since this does not change by filtering, but this is to emphasize the fact that `ApplyPAM`

with method `PREV`

requires both parameters to be consistent.

Domingo, Juan. 2022a. *Jmatrix: Read from/Write in Disks Matrices with Any Data Type in a Binary Format*.

———. 2022b. *Parallelpam: Applies the Partitioning-Around-Medoids (PAM) Clustering Algortihm to Big Sets of Data Using Parallel Implementation, If Several Cores Are Available*.

———. 2022c. *Scellpam: Applying Partitioning Around Medoids to Single Cell with High Number of Cells*.

Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2022. *Cluster: Cluster Analysis Basics and Extensions*. https://CRAN.R-project.org/package=cluster.

Schubert, Erich, and Peter J. Rousseeuw. 2019. “Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms.” In *Similarity Search and Applications*, 171–87. Springer International Publishing. https://doi.org/10.1007/978-3-030-32047-8_16.