Prerequisites

This vignette provides a manual for the analysis of single-cell RNA-seq data with three different methods:

RaceID3 (J. S. Herman, Sagar, and Grun 2018): pre-processing, clustering and outlier identification
StemID2 (J. S. Herman, Sagar, and Grun 2018): lineage tree inference (dependend on a prior RaceID3 analysis)
VarID2 (Rosalez-Alvarez et al. 2023): pre-processing, clustering, quantification of technical and biological noise, pseudotime analysis (optional integration with RaceID3)

RaceID (Grun et al. 2015) (the current version is RaceID3 (J. S. Herman, Sagar, and Grun 2018)) is a method for cell type identification from single-cell RNA-seq data by unsupervised learning. An initial clustering is followed by outlier identification based on a backgorund model of combined technical and biological variability in single-cell RNA-seq data with unique molecular identifiers.

StemID (Grun et al. 2016) (the current version is StemID2 (J. S. Herman, Sagar, and Grun 2018)) permits inference of a lineage tree based on clusters identified by RaceID. The current version of RaceID (RaceID3) and StemID (StemID2) are published together with the FateID algorithm (J. S. Herman, Sagar, and Grun 2018).

The current version of the RaceID package implements additional improvements in rare cell type detection, offers batch effect removal utilities, optional imputing of gene expression, and substantially decreases runtime as well as memory usage.

VarID (Grun 2020) (the current version is VarID2 (Rosalez-Alvarez et al. 2023)) is implemented as part of the RaceID package. VarID2 allows the quantification of local gene expression variability and the deconvolution into technical and biological noise components, and provides an alternative clustering approach based on community detection. VarID2 analysis is independent of RaceID but can be fully integrated into RaceID objects.

RaceID is integrated with the FateID method for the prediction of cell fate probabilities (J. S. Herman, Sagar, and Grun 2018), which is available as a separate package.

RaceID

Example RaceID analysis

Input to RaceID is a matrix of raw expression values (unique molecular identifiers with or without Poisson correction (Grun, Kester, and Oudenaarden 2014)) with cells as column and genes as rows. This matrix can be provided as a matrix object, a data.frame or a sparse matrix produced by the Matrix package.

The RaceID package comes with sample data intestinalData for murine intestinal epethelial cells stored in sparse matrix format. The dataset was published previously (Grun et al. 2016).

RaceID and StemID functions have various input and output parameters, which are explained in detail on the help pages. Here, we mostly use default parameters, which represent a good choice for common datasets.

To start the analysis, a RaceID single-cell sequencing (SCseq) object is initialized with a count matrix.

library(RaceID)
sc <- SCseq(intestinalData)

The first step is the application of filtering for the purpose of quality control. Cells with a relatively low total number of transcripts are discarded.

sc <- filterdata(sc,mintotal=2000)

In this example, we filter out cells with <2,000 transcipts. The function allows further filtering of genes by choosing the input parameters minexpr and minnumber, i.e. only genes with at least minexpr transcripts in at least minnumber cells are retained. The filtered and normalized expression matrix (normalized to the minimum total transcript count across all cells retained after filtering) can be retrieved by the getfdata function:

fdata <- getfdata(sc)

The filterdata function allows the detection and removal of batch effects by different methods as outlined below in addtional examples. Alternatively, individual genes or groups of genes correlating to specific genes can be removed with the FGenes and CGenes arguments. This frequently allows a minimally invasive removal of batch effects or of particularly highly expressed genes with an unwanted dominating effect on the clustering.

Next, the distance matrix is computed as the input for clustering and outlier identification. This can be done with or without imputing gene expression from nearest neighbours (see example below for imputing). RaceID offers various alternative metric, e.g. spearman, kendall, and euclidean, as well as measure of proportionality (rho and phi from the propr package).

sc <- compdist(sc,metric="pearson")

This function computes the distance matrix based on highly variable genes by default. If all genes should be used, then the parameter FSelect needs to be set to FALSE. On this distance matrix clustering is performed:

sc <- clustexp(sc)

To infer the initial cluster number, this function computes the average within-cluster dispersion up to a number of clusters specified by the clustnr arguments, which equals 30 by default. If more major populations are expected, this parameter needs to be set to higher values which will increase the run time. The initial cluster number only serves as a rough estimate, since the subsequent outlier identification will infer additional clusters. The internal inference of the cluster number and the evaluation of cluster stability by the computation of Jaccard's similarity is done on all cells by default. For large datasets it is reasonable to subsample a limited number of cells, by setting the samp argument, e.g., to 1,000. In this case, the cluster number is inferred based on this smaller subset and Jacccard's similarity is not computed by bootstrapping but by sub-sampling. For k-medoids clustering, subsetting will imply almost deterministic clustering partitions, if the sample size approaches the size of the dataset. Therefore, samp should be signicantly smaller then the size of the dataset. Otherwise, bootstrapping is better for assessing the cluster stability. Subsampling can be expected to give a good estimate of the number of major clusters. Additional small clusters which might have been missed by the sampling can be reintroduces during the outlier identification step.

The inferred cluster number can be inspected in a saturation plot, which shows the decrease of the average within-cluster dispersion with increasing cluster number. If this decrease becomes constant, saturation is reached. The automatically chosen cluster number is detected such that the decrease is equal to the decrease upon further increasing the cluster number within the error bars:

plotsaturation(sc,disp=FALSE)

The average within-cluster dispersion can also by plotted:

plotsaturation(sc,disp=TRUE)

The cluster stability as assessed by Jaccard's similarity should also be inspected:

plotjaccard(sc)

In this example, the automated criterion overestimated the cluster number leading to instability as indicated by low Jaccard's similarity. Based on visual inspection of the average within-cluster dispersion as a function of the cluster number, we manually set the cluster number to 7 without recomputing the saturation behaviour.

sc <- clustexp(sc,cln=7,sat=FALSE)

This function perform k-medoids clustering by default. K-means clustering or hierarchical clustering can be chosen with the FUNcluster argument. For very large datasets, hierarchical clustering leads to significantly smaller run time.

Subsequently, outliers in the initial k-medoids clusters are identified based on an internally computed background model for the expected gene expression variability and the assumption that transcript counts follow a negative binomial distribution defined by the mean and the variance of the expression of each gene in a cluster. Outlier genes will be in the tail of this distribution at a p-value defined by the probthr parameter (1e-3 by default), and outlier cells require the presence of a number of outlier genes defined by the outlg parameter (2 by default).

sc <- findoutliers(sc)

In contrast to previous versions, outlier genes are inferred from non-normalized transcript counts, which follow a negative binomial distribution modeling the joint technical and biological variability. The assumption of a negative binomial distribution was demonstrated for raw transcript (UMI) count data, but is not strictly valid for normalized expression values (Grun, Kester, and Oudenaarden 2014). Hence, RaceID does not require normalization, since the correlation-based metric for the computation of the distance object is also independent of the normalization. Normalizaion is only relevant when using, e.g., the euclidean metric for the derivation of the distance matrix. RaceID applies a simple size normalization for data representation and follow-up analyses.

The background noise model can be inspected:

plotbackground(sc)

The number of outliers as a function of the p-value can be plotted:

plotsensitivity(sc)

Another way of checking the presence of outliers is the inspection of a barplot of p-values across all cells in each cluster:

plotoutlierprobs(sc)

A heatmap of cell-to-cell distances grouped be the final clusters inferred based on the original clusters and the outliers allows visual inspection of the clusters:

clustheatmap(sc)

This function is not recommended for very large datasets, since it produces similarly large plotting output.

The best way of visualising the detetcted cell types is plotting cells and clusters in a two-dimensional reduction representaion. RaceID can compute a t-SNE map

sc <- comptsne(sc)

or a k-nearest neighbour graph layout utilizing the Fruchterman-Rheingold algorithm:

sc <- compfr(sc,knn=10)

In this example, the number of nearest neighbours was chosen to be 10. In general, different values for knn should be tested to find an ideal layout.

RaceID further allows computation of a umap dimensional reduction representation:

sc <- compumap(sc)

Parameters for umap can be given as additional argument (see help function).

The t-SNE map can be plotted by

plotmap(sc)

The Fruchterman-Rheingold layout can be plotted by

plotmap(sc,fr=TRUE)

The umap representation can be plotted by

plotmap(sc,um=TRUE)

Maps can be changed for both t-SNE and Fruchterman-Rheingold algorithm by initializing the rseed argument of the comptsne or compfr function with a random number.

The dimensional reduction maps can be inspected, e.g., for the localization of (a subset of) samples included in the analysis:

types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
subset <- types[grep("IV|V",types)]
plotsymbolsmap(sc,types,subset=subset,cex=1)

Expression of a gene of interest or aggregated expression for a group of genes can be highlighted in the dimensional reduction representation:

plotexpmap(sc,"Lyz1",logsc=TRUE,cex=1)

g <- c("Apoa1", "Apoa1bp", "Apoa2", "Apoa4", "Apoa5")
plotexpmap(sc,g,n="Apoa genes",logsc=TRUE,cex=1)

It is also possible to highight expression only for a subset of cells, e.g. a particular batch or sample:

sample <- colnames(sc@ndata)[grep("^I5d",colnames(sc@ndata))]
plotexpmap(sc,"Lyz1",cells=sample,logsc=TRUE,cex=1)

For the murine intestinal example data, inspetion of known marker genes suggests that cluster 2 and 3 correspnd to Lgr5-expressing intestinal stem cells. Cluster 2 is proliferative as indicated by up-regulation of Mki67, Cluster 1 comprises transiently amplifying progenitors biased towards absorptive entorytes in cluster 4 marked by Apolipoproteins. Cluster 7 comprises Lysozyme-expressing Paneth cells while Mucus-producing goblet cells constitute clusters 6 and 10.

With the help of known marker genes, e.g., Chgb for enteroendocrine cells, Muc2, Agr2, and Clca3 for goblet cells, and Defa20 for Paneth cells, rare cell types among the detected RaceID clusters can be annotated:

genes <- c("Lyz1","Defa20","Agr2","Clca3","Muc2","Chgb","Neurog3","Apoa1","Aldob","Lgr5","Clca4","Mki67","Pcna")
plotmarkergenes(sc,genes=genes)

To inspect clusters in more detail, differentially expressed genes can be inferred by an internal approach akin to DESeq2 (Love, Huber, and Anders 2014) but with a dispersion estimated globally from the background model of RaceID.

For instance, to obtain differentially expressed genes within cluster 9 compared to all other cells:

d  <- clustdiffgenes(sc,4,pvalue=.01)
dg <- d$dg
head(dg,25)

##          mean.ncl   mean.cl        fc           pv         padj
## Apoa1   0.6362058 21.398348 33.634315 0.000000e+00 0.000000e+00
## Apoa4   0.5702229 13.588646 23.830412 0.000000e+00 0.000000e+00
## Fabp1   0.7444607 21.348669 28.676691 0.000000e+00 0.000000e+00
## Fabp2   1.6023268 27.105795 16.916521 0.000000e+00 0.000000e+00
## Reg1    0.2360083  6.856378 29.051431 0.000000e+00 0.000000e+00
## Reg3b   0.4229290  7.360573 17.403803 9.665345e-88 3.371594e-85
## Gstm3   0.4901265  7.645545 15.599126 2.408636e-85 7.201821e-83
## Rbp2    0.1837525  4.865970 26.481103 1.340014e-73 3.505811e-71
## Reg3g   0.5903128  6.381428 10.810248 4.488428e-63 1.043809e-60
## Plac8   2.2049822 12.625809  5.726037 1.040516e-56 2.177800e-54
## Spink3  0.1808018  3.908387 21.616969 6.661021e-56 1.267411e-53
## Sis     0.2079850  4.014474 19.301750 3.388549e-55 5.910194e-53
## Anpep   0.1337270  3.602236 26.937242 4.383642e-55 7.057664e-53
## Guca2b  0.4864784  5.159079 10.604950 1.758228e-54 2.628551e-52
## Slc5a1  0.3030379  4.362130 14.394669 9.178274e-54 1.280675e-51
## Ces2e   0.8445971  6.651895  7.875820 9.650062e-53 1.262349e-50
## Apoc2   0.2539176  4.025073 15.851887 4.568302e-51 5.391473e-49
## Aldob   1.4071637  8.711907  6.191111 4.636718e-51 5.391473e-49
## Aldh1a1 0.1080674  2.965912 27.445004 4.730617e-47 5.211148e-45
## Crip1   1.3463788  7.794775  5.789437 4.182806e-45 4.377307e-43
## Mep1b   0.1180375  2.923125 24.764369 7.826703e-45 7.800614e-43
## Mttp    0.3015435  3.758237 12.463333 4.122038e-43 3.921557e-41
## Cyp3a11 0.1544745  3.071999 19.886768 6.232458e-43 5.671537e-41
## Clec2h  0.1374732  2.930410 21.316236 1.195432e-42 1.042516e-40
## Ckmt1   0.7801578  5.336224  6.839929 1.340511e-42 1.122276e-40

The differentially expressed genes (in this example only the up-regulated ones with a fold change >1) can be plottted in a heatmap, which can highlight the clusters and samples of origin:

types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
genes <- head(rownames(dg)[dg$fc>1],10)
plotmarkergenes(sc,genes,samples=types)

The heatmap can also be ordered by cell names (i.e. by batch or sample) by setting order.cells to TRUE. With the input parameters cl and cells, the heatmap can be restricted to a subset of clusters or cells, respectively.

plotmarkergenes(sc,genes,cl=c(2,3,1,4),samples=types,order.cells=TRUE)

In this example, no inter-sample differences are apparent and all samples contribute to each cluster.

It is also possible to plot gene expression across clusters or samples using a dot plot, where the diameter represents the fraction of cells in a sample or cluster expressing a gene, and the color indicates the expression level or z-score.

For example, the following commmand plots expression z-scores across clusters 2, 3, 1 and 4:

fractDotPlot(sc, genes, cluster=c(2,3,1,4), zsc=TRUE)

Expression on a log2 scale across samples I, II, and III can be plotted by

samples <- sub("(\\d.+)$","", colnames(sc@ndata))
fractDotPlot(sc, genes, samples=samples, subset=c("I","II","III"), logscale=TRUE)

A differential gene expression analysis between two defined sets of cell, e.g., two (groups of) clusters can be performed:

A <- names(sc@cpart)[sc@cpart %in% c(2,3)]
B <- names(sc@cpart)[sc@cpart %in% c(4)]
x <- diffexpnb(getfdata(sc,n=c(A,B)), A=A, B=B )
plotdiffgenesnb(x,pthr=.05,lthr=.5,mthr=-1,Aname="Cl.2,3",Bname="Cl.4",show_names=TRUE,padj=TRUE)

See the paragraphs below for additional options of RaceID analyses and parameter choices ideal for analysing large datasets.

StemID

Example StemID analysis: projection mode

StemID is an algorithm for the inference of lineage trees and differentiation trajectories based on pseudo-temporal ordering of single-cell transcriptomes. It utilizies the clusters predicted by RaceID and thus requires to run RaceID first. The algorithm was originally published along with RaceID2 (Grun et al. 2016) and the improved current version StemID2 was published later (J. S. Herman, Sagar, and Grun 2018). In a nutshell, StemID infers links between clusters which are more populated with intermediate single-cell transcriptomes than expected by chance. To assign cells to inter-cluster links, two fundamentally different strategies are available (see nmode argument below). The first strategy considers the projection of a vector connecting a cluster medoid to a member cell of the same cluster onto the links from the medoid of its cluster to the medoids of all other clusters. The longest projection identifies the link this cell is assigned to and defines the projection coordinate. The second (nearest neighbour) mode identifies for a cell in a given cluster the number of k nearest neighbours in each other cluster and assigns the cell to the link with the cluster where the average distance to these k nearest neighbours is minimized. The coordinate on the link is inferred as in the first mode. A faster approximate version of the first mode is also implemented.

As a first step, a lineage tree object for the StemID analysis needs to be initialized with an SCseq object obtained from a RaceID analysis:

ltr <- Ltree(sc)

Next, the transcriptome entropy of cell needs to be calculated. This is used by the StemID algorithm for the prediction of the stem cell type, based on maximum transcriptome entropy and maximum number of links to other clusters.

ltr <- compentropy(ltr)

In the subsequent step, cells are projected onto inter-cluster links. Cells are assigned to a link based on minimum distance to k nearest neighbours (nmode=TRUE) or based on the maximum projection coordinate (nmode=FALSE). Only clusters with >cthr cells are included in the analysis. If fr=TRUE then the Fruchterman-Rheingold layout will be used for representation of the inferred lineage tree, and if um=TRUE a UMAP representation will be used. Otherwise, representation will be done in t-SNE space. The computation of the lineage graph is independent of the dimensional reduction method which is only used for visualization.

ltr <- projcells(ltr,cthr=5,nmode=FALSE)

If projections are used for link determenation (nmode=FALSE), the derivation of link significance is done by comparing to the link population after randomizing cell positions within the boundaries imposed by the gene expression manifold. This is done by bootstrapping using 500 randomizations by default. More randomizations are possible, but obviously linearly increase runtime.

ltr <- projback(ltr,pdishuf=100)

Based on this information, a lineage graph is computed to approximate the lineage tree (a tree structure is not strictly imposed).

ltr <- lineagegraph(ltr)

Finally, link p-values are computed and a threshold pthr is applied on the p-values:

ltr <- comppvalue(ltr,pthr=0.1)

The resulting graph can be plotted, overlaid with a dimensional reduction representation (Fruchterman-Rheingold or t-SNE, see projcells). To retain only the more populated links, a cutoff scthr on the linkscore can be applied, e.g. 0.2:

plotgraph(ltr,scthr=0.2,showCells=FALSE,showMap=TRUE)

To predict the stem cell, the StemID score can be computed and visualized:

x <- compscore(ltr,scthr=0.2)

StemID offers a number of plotting functions to inspect the results. RaceID performs clustering using Pearson's correlation as a metric by default. The StemID projections require a Euclidean space and thus an initial embedding into a high-dimensional space is performed by classical multidimensional scaling. To inspect how well cell-to-cell distances are preserved, a histogram of the log-ratios between the original and transformed distances can be plotted:

plotdistanceratio(ltr)

The StemID prediction can be compared to a minimal spanning tree of the cluster medoids:

plotspantree(ltr)

The cell projections onto the links can be directly compared with this minimal spanning tree:

plotspantree(ltr,projections=TRUE)

All linkscores and fold enrichments of the population on a link can be plotted as heatmaps:

plotlinkscore(ltr)

projenrichment(ltr)

The (z-scores of the) projections of all cells from a given cluster across all links can be plotted as heatmap, e.g. for cluster 3:

x <- getproj(ltr,i=3)

All cells on two different branches originating from the same cluster, e.g. cluster 3 cells on the links to cluster 1 and 8, can be extracted for the computation of differentially expressed genes:

x <- branchcells(ltr,list("1.3","3.8"))
head(x$diffgenes$z)

##   Prss32     Nip7    Dhrs4    Srsf3   Tm4sf5  Tsc22d1 
## 3.224859 2.814688 2.368598 2.335748 2.200576 2.187001

The cells on the two branches can be plotted as additional clusters in the dimensional reduction representation:

plotmap(x$scl,fr=TRUE)

Example StemID analysis: nearest-neighbour mode

Since the randomizations of cell positions for the derivation of link significance require long computation time, and the projection-based method leads to some weak links which are potentially false positives (and can be filtered out based on linkscore), the nearest-neighbour-based method has now been selected to be the default mode of StemID. This method is more robust and fast even on large datasets. The downside is that it will miss some weak links, i.e. lead to more false negatives in comparison to the projection mode.

First, a lineage tree object needs to be initialized followed by the calculation of the transcriptome entropy of each cell.

ltr <- Ltree(sc)
ltr <- compentropy(ltr)

Next, cell projection are calculated with the parameter nmode=TRUE, which is also the default value:

ltr <- projcells(ltr,cthr=5,nmode=TRUE,knn=3)

The knn parameter determines how many nearest neighbours are considered in each cluster for determining the link assignment: the distance two each cluster is calculated as the average across the distance of a cell to the knn nearest neighbours within each other cluster, and the cell is assigned to the link with the cluster minimizing this distance. Now, the lineage tree is inferred and the p-values for the links are calculated based on a binomial model:

ltr <- lineagegraph(ltr)
ltr <- comppvalue(ltr,pthr=0.05)

The resulting lineage graph can be inspected and reveals the expected trajectories connecting the stem cells (cluster 2 and 3 of cycling and quiescent cells, respectively) to enterocytes (cluster 4) via transiently amplifying progenitors (cluster 1), to Paneth cells (cluster 7), and to goblet cells (cluster 6). The StemID score suggests stem cell identity for clusters 2 and 3:

plotgraph(ltr,showCells=FALSE,showMap=TRUE)

x <- compscore(ltr)

RaceID/StemID Options

Batch effect removal

RaceID offers the possibility of batch correction utilizing an internal method or the published mnnCorrect function from the batchelor package (Haghverdi et al. 2018). The batchelor package needs to be separately installed from bioconductor if this mode is desired. In order to apply batch effect removal, a list with a vector of cell ids for each batch needs to be defined, e.g.:

n <- colnames(intestinalData)
b <- list(n[grep("^I5",n)],n[grep("^II5",n)],n[grep("^III5",n)],n[grep("^IV5",n)],n[grep("^V5",n)])

This list is provided as input to the filterdata function, and the bmode argument is initialized with the desired method, i.e. mnnCorrect or RaceID. The latter method simply compares the local neigbourhood, i.e. the set of k nearest neighbours, for each cell between two batches and identifies the neighbourhood of the two batches with the smallest average distance. A differential gene expression analysis between the closest neighbourhoods of two batches yields batch associated genes. The next batch is then compared in the same way to the merged dataset of the previous batches. Batches are compared and successively merged according to the order they are provided in b. An additional input parameter knn controls the number of nearest neighbours, i.e. the size of the neighbourhood.

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000,LBatch=b,bmode="RaceID",knn=10)

The filterdata function will identify all batch-associated genes, which are stored in the filterpar$BGenes slot of the SCseq object. All genes that correlate to a batch gene are removed for the computation of a distance object. This is a minimally invasive strategy in comparison to mnnCorrect, which works well if batches are very similar, such as datasets produced from the same material using the same single-cell RNA-seq technology.

Imputing of gene expression

RaceID also offers optional imputing of gene expression, which can be useful if gene expression differences between cell types or states are governed only by lowly expressed genes and are difficult to resolve by clustering based on raw counts. If imputing is desired, the knn argument needs to be initialized with the number of nearest neighbours used for imputing gene expression :

sc <- compdist(sc,knn=5,metric="pearson")

Now, for each cell the knn nearest neighbours are used to infer a local negative binomial for each gene, utilizing a weighted mean expression and the internal RaceID noise model to obtain the corresponding negative binomial. The weights are derived by quadratic programming, computing the expression vector of a cell as a linear combination of its knn nearest neighbours. The cell itself contributes with the same weight as the aggregated weights of the nearest neighbours to the weighted mean expression. With the help of this negative binomial the tail probability of each gene is derived across all knn nearest neighbours. The geometric means of these tail probabilities are finally applied as a weights for each nearest neighbours in the calculation of the imputed gene expression. This strategy ensures that gene expression can only be learned from nearest neighbours following the same transcript count distributions.

After this, all steps remain the same. Imputing often helps to improve cluster discrimination and stability. Importantly, distances derived from imputed gene expression are only used for clustering. The outlier identification relies on unimputed gene expression, and hence can correct spurious clusters produced from imputed values.

sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- compfr(sc)
sc <- comptsne(sc)
plotmap(sc,fr=TRUE)

If batch effect removal has been applied, the remaining batch effect can be checked by plotting symbols representig the sample of origin:

types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
plotsymbolsmap(sc,types,fr=TRUE)

Ideally, all samples should intermingle in each clusters. Imputed gene expression can be plotted by setting the imputed argument to TRUE. Otherwise, unimputed values are shown.

plotexpmap(sc,"Mki67",imputed=TRUE,fr=TRUE)
plotmarkergenes(sc,c("Clca4","Mki67","Defa24","Defa20","Agr2","Apoa1"),imputed=TRUE,samples=types)

An expression matrix with imputed expression values can be extracted for further analysis:

k <- imputeexp(sc)

Removing variability by regression

RaceID can also regress out variability associated with particular sources such as batches or cell cycle. If batch effect removal has been done by the filterdata function with bmode="RaceID" then this function can further regress out residual variability remaining after batch associated genes have been discarded for the distance computation. In the case, the argument Batch has to be set to TRUE and vars can be left empty if no further sources of variability should be regressed out. Batch effects can also be regressed out directly without prior removal using the filterdata function:

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
vars <- data.frame(row.names=colnames(intestinalData),batch=sub("(\\_\\d+)$","",colnames(intestinalData)))
sc   <- varRegression(sc,vars)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)

However, regression also leads to loss of biological variation and this step is only recommended if variability associated with a particular variable is a strong confounding factor and cannot be removed by other means.

Prior dimensional reduction and removal of variability by PCA/ICA

After running the filterdata function, a prior dimensional reduction using PCA or ICA can be performed using the CCcorrect function. This function can also be provided with a list vset of sets of genes, and principal components with loadings enriched in any of these sets will be discarded. Another options is to provide a list CGenes of genes, and sets to be tested for enrichment in each component are derived as the groups of all genes correlating to a component in FGenes. The CCcorrect function predicts a dimension for the prior dimensional reduction based on an ellbow function of the explained variability as a function of the number of components. This can be inspected by the plotdimsat function and manually adjusted:

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,dimR=TRUE)
plotdimsat(sc)
plotdimsat(sc,change=FALSE)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,nComp=9)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)

Rerunning CCcorrect requires to run filterdata first, because otherwise the dimensional reduction scores in the dimRed slot will be subject to a second dimensional reduction, which is not desired. All sub-sequent steps remain unaltered.

Inferring stable clusters by random forests analysis

RaceID provides the option to run a random forests based reclassifiction, in order to obtain a stable clustering partition. This can be done on the final clustering after running findoutliers:

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- rfcorrect(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)

However, this is normally not required, since the improved outlier detection of the current version leads to stable clusters which do not change substantially after applying this function. Running rfcorrect takes very long for large datasets and should be omitted in this case.

Parameters: Large Datasets

For large dataset (>25k cells), we recommend using the VarID2 approach (see below) for density based clustering which does not require computation and storage of a huge distance matrix. For datasets below this size we recommend running RaceID/StemID with specific parameter settings. Since the correlation metric does not require dataset normalization, this approach has certained advantages compared to VarID2. However, the latter can also be run with a correlation matrix. In the following, we discuss a few paramters critical for the runtime of RaceID/StemID on large datasets.

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc)

Preferentially, clustering should be done by FUNcluster="kmedoids" but "hclust" often gives similar results and is significantly faster.

sc <- clustexp(sc,samp=1000,FUNcluster="hclust")

For the determination of the cluster number and the inference of Jaccard's similarity, sub-sampling should be applied by setting the subset size samp to an integer number. A good choice could be 10-25% of the total number of cells. For large datasets this should be sufficient to discriminate the most abundant cluster, and additional smaller clusters will automatically be identified in the next step using the findoutliers function. It is important that the clustnr argument is much larger then the expected number of clusters. If stability analysis is not wanted, one can set bootnr=1. If clustering is re-run with a specific cluster number, e.g. cln=12, then the saturation criterion should be disabled by setting sat=FALSE.

If the cluster granularity is too fine or too coarse, the probthr argment can be decreased or increased, respectively, e.g.:

sc <- findoutliers(sc,probthr=1e-4)

It is adviced to change the perplexity argument to larger values, when computing the t-SNE map for large datasets, e.g. perplexity=200. However, a large perplexity will return an error for small datasets.

sc <- comptsne(sc,perplexity=100)
plotmap(sc)

The Fruchterman-Rheingold layout critically depends on the number knn of nearest neighbours, and different values should be tested:

sc <- compfr(sc,knn=10)
plotmap(sc,fr=TRUE)

For a sub-sequent StemID analysis of very large datasets, the nearest-neighbour mode (see above) will help to reduce the runtime.

Inspecting pseudo-temporal gene expression changes

To inspect pseudotemporal expression profiles, functions provided by the FateID package can be utilized. First, the trajectory needs to be defined based on a sequence of clusters. This sequence should ideally correspond to a trajectory predicted by StemID2, i.e. the clusters should be connected by a series of significant links. However, non-linked clusters can also be included. A pseudo-temporally ordered vector of cells along a StemID trajectory can be extracted with the cellsfromtree function from an Ltree object:

n <- cellsfromtree(ltr,c(2,1,4))

The list n contains a vector n$f of cell ids ordered by their projection coordinate along the trajectory reflecting differentiation pseudotime. This vector and a filtered or unfiltered expression matrix can be used as input for pseudo-temporal gene expression analysis. The filtered expression data used for RaceID3 can be extracted with the getfdata function:

x <- getfdata(ltr@sc)

Additional filtering and subsetting of the gene expression matrix for cells on the trajectory, n$f, is done in the next step, utilizing functions from the FateID package:

library(FateID)
fs  <- filterset(x,n=n$f)

The filterset function can be used to eliminate lowly expressed genes on the trajectory from the subsequent analysis and has two additional arguments to discard genes, which are not expressed at a level of minexpr or higher in at least minnumber of cells. The function returns a filtered expression data frame with genes as rows and cells as columns in the same order as in the input vector n. In the next step, a self-organizing map (SOM) of the pseudo-temporal expression profiles is computed:

s1d <- getsom(fs,nb=100,alpha=.5)

This map provides a grouping of similar expression profiles into modules. The first input argument is again an expression data frame. In this case, we use the filtered expression table generated by the filterset function to retain only genes that are expressed on the trajectory under consideration. Pseudo-temporal expression profiles along the differentiation trajectory of interest are computed after smoothing by local regression with smoothing parameter alpha.

This function returns a list of the following three components, i. e. a som object returned by the function som of the package som, a data frame x with smoothed and normalized expression profiles, and a data frame zs of z-score transformed pseudo-temporal expression profiles.

The SOM is then processed by another function to group the nodes of the SOM into larger modules and to produce additional z-score transformed and binned expression data frames for display:

ps  <- procsom(s1d,corthr=.85,minsom=3)

The first input argument is given by the SOM computed by the function getsom. The function has two additional input parameters to control the grouping of the SOM nodes into larger modules. The parameter corthr defines a correlation threshold. If the correlation of the z-scores of the average normalized pseudo-temporal expression profiles of neighboring nodes in the SOM exceeds this threshold, genes of the neighboring nodes are merged into a larger module. Only modules with at least minsom genes are kept. The function returns a list of various data frames with normalized, z-score-transformed, or binned expression along with the assignment of genes to modules of the SOM (see man pages for details).

The output of the processed SOM can be plotted using the plotheatmap function. First, in order to highlight the clustering partition y the same color scheme as in the SCseq object can be used:

y    <- ltr@sc@cpart[n$f]
fcol <- ltr@sc@fcol

Now, the different output data frames of the procsom function can be plotted.

Plot average z-score for all modules derived from the SOM:

plotheatmap(ps$nodes.z,xpart=y,xcol=fcol,ypart=unique(ps$nodes),xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

Plot z-score profile of each gene ordered by SOM modules:

plotheatmap(ps$all.z,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

Plot normalized expression profile of each gene ordered by SOM modules:

plotheatmap(ps$all.e,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

Plot binarized expression profile of each gene (z-score < -1, -1 < z-score < 1, z-score > 1):

plotheatmap(ps$all.b,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

In order to inspect genes within individual modules of the SOM, these genes can be extracted given the number of the module. The module numbers are contained in the return value nodes of the procsom function and can be extracted, e. g. for module number 24:

g <- names(ps$nodes)[ps$nodes == 24]

The average pseudo-temporal expression profile of this group can be plotted by the function plotexpression:

plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=NULL)

In the same way it is possible to plot expression profiles of individual genes, e.g.:

plotexpression(fs,y,"Clca4",n$f,col=fcol,cluster=FALSE,alpha=.5,types=NULL)

It is also possible to highlight the data points as specific symbols, for example reflecting batches, by using the types argument:

plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=sub("\\_\\d+","",n$f))

VarID2

Example VarID2 analysis

VarID2 is a novel algorithm utilizing a k nearest neighbour graph derived from transcriptome similarities (Rosalez-Alvarez et al. 2023). The key idea is the inference of locally homogenous neighbourhoods of cells in cell state space, i.e., all k nearest neighbours follow the same transcript count distributions as the cell to which they are nearest neighbours ("central" cell). The computation starts with k nearest neighbours and infers a negative binomial distribution for the transcript counts of each gene, derived from the mean-variance dependence computed locally across a central cell and its k nearest neighbours. The mean expression is derived from a weighted average across the k nearest neighbours, with weights inferred using quadratic programming, representing the central cell as a linear combination of its neighbours. The contribution of the central cell itself can be controlled by a tunable scaling parameter alpha. Typical values of alpha should be in the range of 0 to 10 (default 1). If alpha is NULL, then it is inferred by a local optimization, i.e., it is minimized under the constraint that the gene expression in the central cell does not deviate more then one standard deviation from the predicted weigthed mean across its k nearest neighbours, where the standard deviation is calculated from the predicted mean using the local background model (the average dependence of the variance on the mean expression). With the negative binomial distribution inferred from the local means and mean-variance dependence, the probabilities of observed transcript counts for all genes can be computed in each neighbour, and, after multiple testing correction, the geometric mean across the nb (deafult 3) genes with the lowest probability serves as a proxy for the link probability (i.e., the probability of neighbours being in the same state). If this probability falls below a user-defined threshold, the link to this neighbour is pruned After this procedure is applied to the nearest neighbours of each cell, a pruned k nearest neighbour graph with homogenous neighbourhoods in cell state space is obtained, in which the remaining links indicate that neighbours follow the same transcript count distribution for all genes. These local neighbourhoods can be used to compute local properties such as gene expression variability, which in turn can be correlated to cell state or predicted fate, derived, e.g., from FateID analysis, respectively.

Although VarID2 analysis can be performed independently of the RaceID data object, an integration with a RaceID analysis permits convenient visualization and processing of the results. For this reason, we here demonstrate an integrated pipeline, but explain at each step, how computation can be performed given minimal input data, i.e., a transcript count matrix.

The VarID2 method requires unique molecular identifier (UMI) counts, which have been shown to follow a negative binomial distribution, in order for the background model to be valid.

Given such a UMI matrix with cells as columns and genes as rows, we could first initialize a RaceID object, and perform cell and gene filtering to retain only expressed genes and high quality cells.

We here demonstrate the functionalities of VarID2 on our single-cell transcriptome data of intestinal epithelial cells (Grun et al. 2016).

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=1000,FGenes=grep("^Gm\\d",rownames(intestinalData),value=TRUE),CGenes=grep("^(mt|Rp(l|s))",rownames(intestinalData),value=TRUE))

In this example, all cells with a raw UMI count <1,000 are discarded, and all gene IDs starting with Gm (genes without official gene symbol, uncharacterized loci) are removed. Moreover, all mitochondrial and ribosomal sub-unit encoding genes as well as all genes correlating with these classes are filtered out. Although VarID2 analysis can be performed with any given distance matrix, the preferred (default) mode does not require a distance matrix, and relies on k-nearest neighbor analysis in PCA space. For large datasets, the computation of a distance matrix is prohibitive due to a memory requirement scaling with N^2 if N is the number of cells. In this case, the pruneKnn function can be executed with large=TRUE (default setting). Here, the input expression matrix is first optionally corrected for variability associated with total transcript count per cell by a negative binomial regression (if regNB=TRUE as by default; if regNB=FALSE, transcript counts in each cell are normalized to one, multipled by the minimal total transcript count across all cells, followed by adding a pseudocount of 0.1 and taking the logarithm). The mean dependence of the regression parameters is smoothed, and Pearson residual are computed. The matrix of Pearson residuals (default setting), or, alternatively, the raw count matrix then undergoes a dimensional reduction by a principle component analysis (PCA), including the top pcaComp principle components (PCs). The number of top PCs used for dimensional reduction is determined by an elbow criterion (requiring that the change in explained variability upon further increasing the number of PCs has become linear). However, the number of PCs can also be adjusted manually (parameter pcaComp). This reduced PC matrix is subsequently used for a fast k nearest neighbour search by the get.knn function from the FNN package based on euclidean distances. The algorithm parameter of the pruneKnn function enables the selection of the search algorithm to apply. In general, it is recommended to run VarID2 with large=TRUE and regNB=TRUE (default settings). A RaceID object can be utilized to facilitate the analysis, but in this mode the compdist function does not need to be executed:

expData  <- getExpData(sc)
res      <- pruneKnn(expData,no_cores=1)

The pruneKnn function allows normalization by a full negative binomial regression (akin to (Hafemeister and Satija 2019)), or, alternatively, by analyical Pearson residuals as proposed by Lause et al. (Lause, Berens, and Kobak 2020) (default). In the latter case, the dispersion parameter can either be inferred by maximum likelihood, or approximated as $\theta$=10 (default). The function has a number of optional parameters controlling the outcome of the this step. For instance, genes allows limiting the analysis to a subset of genes, knn defines the number of nearest neighbours, alpha is the scale parameter for the central cell, and FSelect allows internal feature selection. The background model is calculated internally based on the local mean-variance relation in the same way as done in RaceID and feature selection follows the same logic. The function further permits multithreading and the number of cores can be defines using the no_cores argument. If this is set to NULL, the function detects the number of available cores and uses this number minus two to leave resources for other tasks running on the same machine. Calling help for this function explains all parameters in detail (as for all other functions).

The coefficients of the negative binomial regression (or the coefficients of the analytical model, if offsetModel=TRUE) can be inspected. Intercept:

plotRegNB(expData,res,"(Intercept)")

Coefficient $\beta$ of total log UMI count:

plotRegNB(expData,res,"beta")

Dispersion parameter $\theta$:

plotRegNB(expData,res,"theta")

In this example, the analytical model was used (no regression was calculated) and $\theta$=10 was used (Lause, Berens, and Kobak 2020).

To check if the mean-dependence of the variance has been successfully removed by the normalization, the Pearson residuals can be plotted as a function of the mean:

plotPearsonRes(res,log=TRUE,xlim=c(-.1,.2))

The number of PCs selected based on the elbow criterion can be inspected by plotting the percentage variability explained by each principle component:

plotPC(res)

Alternatively, the difference between the percentage variability explained by PC i and PC i + 1 can be plotted on a logarithmic scale:

plotPC(res,logDiff=TRUE)

The pruned k nearest neighbour information can be used as input for graph inference, Louvain clustering, and the derivation of a Fruchterman-Rheingold graph layout:

cl <- graphCluster(res,pvalue=0.01)

Alternatively, community detection can be performed with the Leiden algorithm at variying resolution. However, the leiden package requires preinstallation of python libraries (https://cran.r-project.org/package=leiden). This can be done, e.g., by using python reticulate.

install.packages("reticulate")
reticulate::use_python("/usr/bin/python3", required=TRUE)
#confirm that leiden, igraph, and python are available (should return TRUE).
reticulate::py_module_available("leidenalg") && reticulate::py_module_available("igraph")
reticulate::py_available()

With this graphCluster can be run using leiden clustering at defined resolution:

cl <- graphCluster(res,pvalue=0.01,use.leiden=TRUE,leiden.resolution=1.5)

The function returns a list of three components: a vector partition with a clustering partition, a data.frame fr containing a Fruchterman-Rheingold layout computed from the pruned k nearest neighbour graph, and an integer number residual.cluster. If clusters with less than min.size (input parameter to graphCluster) were obtained, all these clusters are aggregated into the cluster with the largest cluster number in the partition. If this aggregation was done, the cluster nuber is stored in residual.cluster (which is NULL otherwise). By setting the use.weights argument of graphCluster to TRUE (default), clustering can be performed on the pruned knn graph with weighted links, where weights correspond to the link probabilities. This may lead to the emergence of too many clusters (which could happen, e.g., due to variability in the link probabilities when integrating different experimental batches). In this case, the use.weights parameter can be set to FALSE, and clustering is performed on a pruned knn graph with equal weights for all links. The pvalue parameter determines the probability cutoff for link pruning, i.e., links with probabilities lower than pvalue are removed.

To visualize the clustering with the help of RaceID, the SCseq object can be updated with the results:

sc <- updateSC(sc,res=res,cl=cl)

This function call will initialize the cpart and the cluster$kpart slots with the clustering partition in cl$partition, and the fr slot with the Fruchterman-Rheingold layout.

The clustering can readily be inspected in the Fruchterman-Rheingold layout:

plotmap(sc,fr=TRUE)

After computing a t-SNE map, the clustering can be also be visualized in t-SNE space:

sc <- comptsne(sc,perplexity=50)
plotmap(sc)

Alternatively, a umap representation can be computed for visualization:

sc <- compumap(sc,min_dist=0.5)
plotmap(sc,um=TRUE)

The pruned k nearest neighbour graph allows the derivation of transition probabilities between clusters, based on the remaining links connecting two clusters (at a minimum probability given by pvalue) and the inferred probabilities of these links:

probs <- transitionProbs(res,cl,pvalue=0.01)

If RaceID is used for visualization and has been updated with the clustering using the updateSC function (see above), the transition probabilities can be visualized in the dimensional reduction representation (fr=TRUE for Fruchterman-Rheingold or um=TRUE for umap representation, default is t-SNE):

plotTrProbs(sc,probs,um=TRUE)

The argument prthr allows discarding all links with a transition probability <= prthr and the th cthr argument sets a cluster size threshold. Only clusters with > cthr cells are shown.

The Individual neighbourhood of a cell, e.g, cell i=20, can be inspected with the inspectKnn function, which returns a list object with information on gene expression and outlier p-values across all original k nearest neighbours.

If the function is called with plotSymbol=TRUE, a dimensional reduction representation is returned (um=TRUE for umap), highlighting nearest neighbours and outliers which are subject to pruning:

nn <- inspectKNN(20,expData,res,cl,object=sc,pvalue=0.01,plotSymbol=TRUE,um=TRUE,cex=1)

One element of the returned list object is an outlier p-value across all k nearest neighbours:

head(nn$pv.neighbours)

##           I5d_25 I5d_31 V5d_29 I5d_7 II5d_16 V5d_55 V5d_11       I5d_13 II5d_3
## Dmbt1          1      1      1     1       1      1      1 4.203964e-10      1
## Lgals2         1      1      1     1       1      1      1 1.001882e-01      1
## Cox5b          1      1      1     1       1      1      1 1.000000e+00      1
## Elf3           1      1      1     1       1      1      1 1.000000e+00      1
## Serpinb6a      1      1      1     1       1      1      1 1.000000e+00      1
## Ndufs8         1      1      1     1       1      1      1 1.000000e+00      1
##           III5d_2 II5d_43 I5d_11 IV5d_16 V5d_44 V5d_48 V5d_26 I5d_71 I5d_9
## Dmbt1           1       1      1       1      1      1      1      1     1
## Lgals2          1       1      1       1      1      1      1      1     1
## Cox5b           1       1      1       1      1      1      1      1     1
## Elf3            1       1      1       1      1      1      1      1     1
## Serpinb6a       1       1      1       1      1      1      1      1     1
## Ndufs8          1       1      1       1      1      1      1      1     1
##             I5d_61 III5d_67   II5d_44     III5d_91 II5d_72 V5d_16       V5d_96
## Dmbt1     0.391412        1 0.0310523 0.0008214995       1      1 1.577005e-11
## Lgals2    1.000000        1 1.0000000 0.1276183067       1      1 6.164142e-02
## Cox5b     1.000000        1 1.0000000 1.0000000000       1      1 1.000000e+00
## Elf3      1.000000        1 1.0000000 1.0000000000       1      1 1.000000e+00
## Serpinb6a 1.000000        1 1.0000000 1.0000000000       1      1 1.000000e+00
## Ndufs8    1.000000        1 1.0000000 1.0000000000       1      1 1.000000e+00
##               III5d_87
## Dmbt1     0.0002871395
## Lgals2    1.0000000000
## Cox5b     1.0000000000
## Elf3      1.0000000000
## Serpinb6a 1.0000000000
## Ndufs8    1.0000000000

Another element contains the corresponding normalized gene expression counts:

head(nn$expr.neighbours)

##             I5d_25   I5d_31   V5d_29    I5d_7  II5d_16    V5d_55   V5d_11
## Dmbt1     0.000000 0.000000 1.001958 0.000000 1.001958 16.521861 2.007853
## Lgals2    4.031579 6.071431 8.127667 6.071431 5.049473 20.824484 6.071431
## Cox5b     3.017717 5.049473 8.127667 4.031579 3.017717 14.397368 3.017717
## Elf3      0.000000 3.017717 4.031579 0.000000 1.001958  0.000000 0.000000
## Serpinb6a 0.000000 2.007853 0.000000 0.000000 2.007853  2.007853 0.000000
## Ndufs8    0.000000 1.001958 1.001958 0.000000 0.000000  0.000000 0.000000
##              I5d_13    II5d_3  III5d_2   II5d_43    I5d_11   IV5d_16    V5d_44
## Dmbt1     58.126707 15.457411 0.000000  6.071431  7.097484 12.290360  2.007853
## Lgals2    49.489744 16.521861 0.000000 20.824484 13.341696 10.200553 12.290360
## Cox5b     11.243324  9.162012 0.000000 11.243324  8.127667  6.071431  3.017717
## Elf3       2.007853  4.031579 2.007853  3.017717  5.049473  3.017717  1.001958
## Serpinb6a  0.000000  3.017717 0.000000  0.000000  2.007853  0.000000  1.001958
## Ndufs8     2.007853  0.000000 1.001958  4.031579  2.007853  0.000000  2.007853
##             V5d_48   V5d_26   I5d_71    I5d_9    I5d_61 III5d_67   II5d_44
## Dmbt1     7.097484 2.007853 0.000000 2.007853 18.664133 0.000000 24.099582
## Lgals2    8.127667 6.071431 5.049473 6.071431 25.200659 1.001958 24.099582
## Cox5b     5.049473 0.000000 7.097484 6.071431  9.162012 2.007853  6.071431
## Elf3      2.007853 1.001958 1.001958 0.000000  8.127667 3.017717  3.017717
## Serpinb6a 0.000000 0.000000 1.001958 1.001958  1.001958 0.000000  2.007853
## Ndufs8    1.001958 0.000000 0.000000 0.000000  2.007853 0.000000  3.017717
##            III5d_91   II5d_72   V5d_16    V5d_96  III5d_87        mu        sd
## Dmbt1     30.778221  6.071431 0.000000 64.479391 33.043723 2.5055710 2.1985583
## Lgals2    48.279339 14.397368 1.001958 50.705898 18.664133 7.9128263 5.4987806
## Cox5b     13.341696  8.127667 2.007853 27.417123  7.097484 4.9477707 3.7253325
## Elf3       4.031579  2.007853 0.000000 10.200553  1.001958 1.0414053 1.1874937
## Serpinb6a  2.007853  1.001958 0.000000  6.071431  1.001958 0.4475373 0.7031934
## Ndufs8     3.017717  4.031579 1.001958  7.097484  2.007853 0.7428016 0.9551797
##                     pv
## Dmbt1     1.577005e-11
## Lgals2    6.164142e-02
## Cox5b     1.000000e+00
## Elf3      1.000000e+00
## Serpinb6a 1.000000e+00
## Ndufs8    1.000000e+00

If the same function is executed with plotSymbol=FALSE, the local mean-variance dependence is plotted along with a loess-regression, a second order polynomial fit (used for pruning), and the background model of the local variability (used for biological noise inference, see below):

nn <- inspectKNN(20,expData,res,cl,object=sc,pvalue=0.01,plotSymbol=FALSE)

Alternatively, the coefficient of variation (CV) can be plotted with the same model fits:

nn <- inspectKNN(20,expData,res,cl,object=sc,pvalue=0.01,plotSymbol=FALSE,cv=TRUE)

The plot also highlights the local variability associated with cell-to-cell variability of total transcript counts, as calculated directly from the mean and variance of total transcript counts (turquoise) or from a local fit of a gamma distribution (orange).

The inference of homogenous local neighbourhoods, given by a cell and its remaining nearest neighbours after pruning, permits the computation of local quantities across such sets of similar cells. To investigate the dynamics of gene expression noise across cell states, we derive local estimates of gene expression variability. The systematic dependence of gene expression variance on the mean expression is a central challenge for the derivation of differential variability between cell states. VarID2 solves this problem by deconvoluting binomial sampling noise, cell-to-cell variability in total transcript counts (as a consequence of cell-to-cell variability in sequencing efficiency, but also due to global biological variability in total RNA content, e.g., during different phases of the cell cycle), and residual variability, which we address as biological noise. This deconvolution is achieved by fitting a Gamma distribution to the local disrtibution of total transcript counts, yielding a technical noise paramater rt for each central cell, followed by a maximum a posterior inference of the reidual variability, i.e., biological noise, epsilon for each gene in every neighbourhood. The local mean expression mu within each neighbourhood is computed as the average of non-normalized transcript count across pruned local neighbourhoods.

The noise inference can be done on a transcript count matrix, filtered by expression, e.g., retaining only cells with a minimum of 5 UMI counts in at least 20 cells:

x <- getFilteredCounts(sc,minexpr=5,minnumber=20)
noise <- compTBNoise(res,x,pvalue=0.01,gamma = 0.5,no_cores=1)

The most important argument of the compTBNoise is the gamma parameter which determines the strength of the prior for the suppression of spurious inflation of noise levels at low expression. The p-value argument again determines the link probability threshold for pruning. For neighbourhoods with zero transcript count noise estimates are replaced by NA in noise$epsilon.

The prior parameter gamma should be chosen such that the correlation between mean noise across cells and total UMI count per cell is minimal. This dependence can be analysed with the function plotUMINoise:

plotUMINoise(sc,noise,log.scale=TRUE)

If a positive correlation is observed, gamma should be increased in order to weaken the prior. If the correlation is negative, gamma should be decreased in order to increase the strength of the prior.

The RaceID SCseq object can be updated with the biological noise estimates:

sc <- updateSC(sc,res=res,cl=cl,noise=noise)

This initializes a slot noise of the SCseq object with a gene-by-cell matrix of the biological noise estimates noise$epsilon. The noise levels of a given gene (or group of genes), can then be visualized in the same way as gene expression with the help of RaceID functions.

Gene expression (on logarithmic scale):

plotexpmap(sc,"Clca4",logsc=TRUE,um=TRUE,cex=1)

Biological noise (on logarithmic scale):

plotexpmap(sc,"Clca4",logsc=TRUE,um=TRUE,noise=TRUE,cex=1)

This plot indicates that cluster 2 comprises Clca4+ intestinal stem cells.

A scatter plot of noise versus gene expression allows inspection of the dependence between noise and gene expression:

plotExpNoise("Clca4",sc,noise,norm=TRUE,log="xy")

The biological noise can also be plotted in a cluster heatmap and directly compared to gene expression. After deriving a gene expression heatmap, the order of genes stored in ph (returned by the pheatmap function of the pheatmap package), can be applied to the noise heatmap in order to permit direct comparability:

genes <- c("Lyz1","Agr2","Clca3","Apoa1","Aldob","Clca4","Mki67","Pcna")
ph <- plotmarkergenes(sc,genes=genes,noise=FALSE)

plotmarkergenes(sc,genes=genes[ph$tree_row$order],noise=TRUE,cluster_rows=FALSE)

Based on marker gene expression, clusters 1, 5, and 6, can be annotated as enterocytes, goblet and Paneth cells. The gene expression can also be visualized in a dot plot highlighting the fraction of cells in a cluster expressing a gene (size of dots) together with the average expresson level (dot color):

fractDotPlot(sc, genes, zsc=TRUE)

With the clustering derived by graphCluster and the noise object returned by the compTBNoise function, genes with differential noise levels in a cluster (or a set of clusters) set versus a background set bgr (by default all clusters not in set) can be inferred:

ngenes <- diffNoisyGenesTB(noise,cl,set=1,no_cores=1)
head(ngenes)

##           mu.set     mu.bgr     mu.all   eps.set    eps.bgr    eps.all   log2FC
## Alpi    2.621820  0.4346426  0.7970891 0.7578040 0.09111629 0.20159597 2.166197
## Rbp2    3.302158  0.3872477  0.8702899 0.4666415 0.03726573 0.10841943 2.045465
## Clca4   2.183234  8.8030572  7.7060579 1.1240327 0.20844378 0.36016995 1.988563
## Rpl3   13.834292 22.1688750 20.7877156 0.3703321 0.02425532 0.08160519 1.920372
## Clec2h  1.972938  0.2797050  0.5602980 0.5456636 0.07200037 0.15049313 1.908371
## Ces2a   2.067663  0.4751257  0.7390318 0.3474690 0.03352372 0.08554895 1.744692
##              pvalue
## Alpi   1.775718e-17
## Rbp2   4.309437e-15
## Clca4  5.774251e-08
## Rpl3   8.000818e-14
## Clec2h 1.822229e-14
## Ces2a  1.852172e-16

The output lists average (non-normalized) expression mu and noise levels eps for clusters in set and bgr as well as the log2 fold change in noise levels, log2FC, between set and bgr and a multiple testing-corrected p-value (Bonferroni correction, multiplied by the maximal number knn of nearest neighbours).

This functions applies a Wilcoxon test to compare variability in a cluster (or set of clusters) versus the remaining clusters or a given background set of clusters. To avoid spurious signals with small effect size, a uniform random variable (by default, from [0:ps] with ps=0.1) is added to the noise estimates.

The top varaible genes can be plotted in a heatmap:

genes <- head(rownames(ngenes),50)
ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=FALSE,zsc=TRUE)

One can also cluster the cells and/or genes in the heatmap based on the noise quantification:

ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=TRUE,cluster_cols=TRUE)

The gene expression itself can be represented with the same order of cells and genes by utilizing the ph output:

plotmarkergenes(sc,genes=ph$tree_row$labels[ ph$tree_row$order ],noise=FALSE,cells=ph$tree_col$labels[ ph$tree_col$order ], order.cells=TRUE,cluster_rows=FALSE)

It is also possible to order genes by their noise levels across a (set of) cluster or the entire dataset, if the cl argument is omitted:

mgenes <- maxNoisyGenesTB(noise,cl=cl,set=3)
head(mgenes)

##     Reg3b   Tmem38b    Hspa1b    Hspa1a     H2afx     Reg3g 
## 1.5410659 1.0669050 1.0629981 0.9270232 0.8132892 0.7873700

plotmarkergenes(sc,genes=head(names(mgenes),50),noise=TRUE)

Differential noise between clusters in set and bgr can also be visualized as MA plot:

ngenes <- diffNoisyGenesTB(noise, cl, set=1, bgr=c(2,4))
plotDiffNoise(ngenes)

For comparison, differentially expressed genes between the same groups can be inferred and visualized in a similar manner:

dgenes <- clustdiffgenes(sc,1,bgr=c(2,4),pvalue=0.01)
plotdiffgenesnb(dgenes,xlim=c(-6,3))

The distribution of gene expression (for individual genes or groups of genes) across a set of clusters can be visualized as violin plots:

violinMarkerPlot(c("Mki67","Pcna"),sc,set=c(2,3,1))

Similarly, the distribution of gene expression noise can be plotted:

violinMarkerPlot(c("Mki67","Pcna"),sc,noise,set=c(2,3,1))

To further investigate transcriptome variability and related quantities, the function quantKnn allows computation of average noise levels across cells, cell-to-cell transcriptome correlation, and total UMI counts.

qn <- quantKnn(res, noise, sc, pvalue = 0.01, minN = 5, no_cores = 1)

These quantities can be inspected in the dimensional reduction representation, or as boxplots across cluster, with the level of a selected reference cluster highlighted as horizontal line. For instance, the Lgr5+ stem cell cluster 2 can be used as a reference:

StemCluster <- 2

Average biological noise across all genes:

plotQuantMap(qn,"noise.av",sc,um=TRUE,ceil=.6,cex=1)

plotQuantMap(qn,"noise.av",sc,box=TRUE,cluster=StemCluster)

Avergae Spearman's transcriptome correlation across all cells within pruned k nearest neighbourhoods:

plotQuantMap(qn,"local.corr",sc,um=TRUE,logsc=TRUE,cex=1)

plotQuantMap(qn,"local.corr",sc,box=TRUE,logsc=TRUE,cluster=StemCluster)

Total UMI count averaged across cells within pruned k nearest neighbourhoods:

plotQuantMap(qn,"umi",sc,um=TRUE,logsc=TRUE,cex=1)

plotQuantMap(qn,"umi",sc,box=TRUE,logsc=TRUE,cluster=StemCluster)

Dependencies between different quantities can be inspected in a scatter plot, highlighting a cluster of interest:

For instance, the dependency between biological noise and UMI counts:

plotQQ(qn,"umi","noise.av",sc,cluster=StemCluster,log="yx",cex=1)

A residual dependency of the noise estimate on the total UMI count may indicate that the prior is either not strong enough to suppress noise inflation at low expression (negative slope) or too strong (positive slope). In this case, the parameter gamma of the function compTBNoise should be increased or decreased, respectively.

Dependency between transcriptome correlation and average biological noise:

plotQQ(qn,"local.corr","noise.av",sc,cluster=StemCluster,log="xy",cex=1)

To study dynamics of gene expression and biological noise during differentiation, VarID2 facilitates pseudo-time analysis. Selection of a linear trajectry connecting a defined set of clusters could be based on the transition probabilities obtained from VarID2.

plotTrProbs(sc,probs,um=TRUE)

For examples, clusters 1, 5, and 3, represent the differentiation trajectory from Lgr5+ stem cell towards Apoa1+ enterocytes.

plotexpmap(sc,"Apoa1",um=TRUE,cex=1,logsc=TRUE)

To order cells pseudo-temporally, VarID2 offers a wrapper function pseudoTime, which calls the Slingshot method (Street et al. 2018) on a desired dimensional reduction, which can be a subset from the precomputed dimensional reductions of the RaceID object (sc@fr, sc@tsne, or sc@umap, given by the map parameter), or a dimensional reduction (t-SNE or UMAP) computed for the subset of cells on the trajectory. This set is assumed to be ordered along the trajectory, with the initial cluster as first entry. This function is not designed to recover branched trajetories; single lineages could be defined based on the output of the plotTrPtobs function as an ordered set of clusters. pseudoTime will always fit a single trajectory to the clusters specified in set according to the indicated order. The slingsot package needs to be installed from Bioconductor. If the package is not installed, the function falls back to principal curve inference using the princurve package.

# ordered set of clusters on the trajectory
set <- c(2,3,1)
# if slingshot is available, run with useSlingshot=TRUE (default)
pt <- pseudoTime(sc,m="umap",set=set,useSlingshot=FALSE)

Inferred pseudo-time can be highlighted in the inferred dimensional reduction representation:

plotPT(pt,sc,clusters=FALSE)

The trajectory can also be overlaid with the cluster annotation:

plotPT(pt,sc,clusters=TRUE,lineages=TRUE)

With the derived pseudo-time a filtered count matrix with cells ordered by pseudo-time can be obtained:

fs <- extractCounts(sc,minexpr=5,minnumber=20,pt=pt)

Using methods from the FateID package, pseudo-time expression profiles can be derived by self-organizing maps (SOM) and grouped into modules:

library(FateID)
s1d   <- getsom(fs,nb=50,alpha=1)
ps    <- procsom(s1d,corthr=.85,minsom=0)

part  <- pt$part
ord   <- pt$ord

plotheatmap(ps$all.z, xpart=part[ord], xcol=sc@fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=TRUE)

Individual pseudo-time expression profiles can be plotted for genes of interest, e.g.,

plotexpression(fs,y=part,g="Apoa1",n=ord,col=sc@fcol,cex=1,alpha=1)

It is also possible to plot profiles for a group of genes into the same window:

genes <- c("Mki67","Pcna","Apoa1")
plotexpressionProfile(fs,y=part,g=genes,n=ord,alpha=1,col=rainbow(length(genes)),lwd=2)

From the SOM all genes within a given module, e.g., module 1, can be extracted:

genes <- getNode(ps,1)
plotexpressionProfile(fs,y=part,g=head(genes,10),n=ord,alpha=1,lwd=2)

Similarly, gene modules with correlated noise profiles can be derived:

fsn    <- extractCounts(sc,minexpr=5,minnumber=20,pt=pt,noise=TRUE)
s1dn   <- getsom(fsn,nb=25,alpha=1)
psn    <- procsom(s1dn,corthr=.85,minsom=0)

plotheatmap(psn$all.z, xpart=part[ord], xcol=sc@fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=TRUE)

Noise profiles of individual genes or groups of genes:

plotexpression(fsn,y=part,g="Apoa1",n=ord, col=sc@fcol,cex=1,alpha=1,ylab="Noise")

genes <- c("Mki67","Pcna","Apoa1")
plotexpressionProfile(fsn,y=part,g=genes,n=ord,alpha=1,col=rainbow(length(genes)),lwd=2,ylab="Noise")

genes <- getNode(psn,1)
plotexpressionProfile(fsn,y=part,g=head(genes,10),n=ord,alpha=1,lwd=2,ylab="Noise")

For further details on plotting functions for pseudo-time analysis see vignette("FateID"). Gene expression and noise modules can now be co-analyzed to characterize the inter-dependence of gene expression and noise dynamics during differentiation.

Removing batch effects and other confounding factors with VarID2

VarID2 offers the possibility to remove batch effects and other confounding sources of variability. To achieve this, batch variables and other latent variables can be included as dependent variables in the negative binomial regression, utilized in the pruneKnn function if the parameter regNB is set to TRUE. Alternatively, batch effect correction can ben done by the harmony package, if the bmethod parameter of pruneKnn is set to "harmony". These batch correction options only exist if large is set to TRUE. The batch parameter is given by a categorical vector of batch names.

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=1000,FGenes=grep("^Gm\\d",rownames(intestinalData),value=TRUE),CGenes=grep("^(mt|Rp(l|s))",rownames(intestinalData),value=TRUE))
expData  <- getExpData(sc)

batch <- sub("5d.+","",colnames(expData))
names(batch) <- colnames(expData)
head(batch)

require(Matrix)
S_score   <- colMeans(sc@ndata[intersect(cc_genes$s,rownames(sc@ndata)),])
G2M_score <- colMeans(sc@ndata[intersect(cc_genes$g2m,rownames(sc@ndata)),])
regVar <- data.frame(S_score=S_score, G2M_score=G2M_score)
rownames(regVar) <- colnames(expData)

res   <- pruneKnn(expData,no_cores=1,batch=batch,regVar=regVar)
cl    <- graphCluster(res,pvalue=0.01)
probs <- transitionProbs(res,cl)
x <- getFilteredCounts(sc,minexpr=5,minnumber=5)
noise <- compTBNoise(res,x,pvalue=0.01,no_cores=1) 
sc <- updateSC(sc,res=res,cl=cl,noise=noise)
sc <- compumap(sc,min_dist=0.5)
sc <- comptsne(sc,perplexity=50)

plotmap(sc,cex=1)
plotmap(sc,fr=TRUE,cex=1)
plotmap(sc,um=TRUE,cex=1)
plotsymbolsmap(sc,batch,um=TRUE,cex=1)
plotexpmap(sc,"Mki67",um=TRUE,cex=1,log=TRUE)
plotexpmap(sc,"Pcna",um=TRUE,cex=1,log=TRUE)

To regress out additional latent variables associated with distinct phases of the cell cycle, the parameter regVar was initialized with a data.frame of cell cycle S phase and G2/M phase scores (aggregated normalized gene expression of cell cycle phase markers). Column names of regVar indicate the latent variable names, and row names correspond to cell IDs, i.e. column names of expData. Interaction terms between the batch variable and the UMI count or the other latent variables in regVar are included.

See previous paragraphs for more advice on how to explore the output data.

Integrating Seurat with VarID2/RaceID3

Seurat object can be converted into an SCseq object for VarID2 analysis, using the function Seurat2SCseq. This function expects a class Seurat object from the package as input and converts this into a SCseq object. The function transfers the counts, initializes ndata and fdata without further filtering, transfers the PCA cell embeddings from the Seurat object to dimRed, transfers the clustering partition, and umap and tsne dimensional reduction (if available). CAUTION: Cluster numbers in RaceID start at 1 by default. Hence, all Seurat cluster numbers are shifted by 1.

library(Seurat)
library(RaceID)

Se <- CreateSeuratObject(counts = intestinalData, project = "intestine", min.cells = 3, min.features = 200)
Se <- NormalizeData(Se, normalization.method = "LogNormalize", scale.factor = 10000)
Se <- FindVariableFeatures(Se, selection.method = "vst", nfeatures = 2000)

all.genes <- rownames(Se)
Se <- ScaleData(Se, features = all.genes)
Se <- RunPCA(Se, features = VariableFeatures(object = Se))
Se <- RunUMAP(Se, dims = 1:30, verbose = FALSE)
Se <- RunTSNE(Se, dims = 1:30, verbose = FALSE)
Se <- FindNeighbors(Se, dims = 1:10)
Se <- FindClusters(Se, resolution = 0.5)
DimPlot(Se, label = TRUE) + NoLegend()

res <- pruneKnn(Se)
## without pruning (fast)
## res <- pruneKnn(Se,do.prune=FALSE,no_cores=1)
sc  <- Seurat2SCseq(Se)
plotmap(sc,um=TRUE)

noise <- compTBNoise(res,getExpData(sc),no_cores=1)
sc <- updateSC(sc,noise=noise)
plotexpmap(sc,"Clca4",um=TRUE,cex=1,log=TRUE)
plotexpmap(sc,"Clca4",um=TRUE,noise=TRUE,cex=1)

See previous paragraphs for more advice on how to explore the output data.

FAQ

For bug reports and any questions related to RaceID and StemID please email directly to link.

How can I perform RaceID/VarID2 analysis on 10x Genomics Chromium Single Cell RNA-Seq data?

Processed count data from the Cell Ranger pipeline (or equivalent formats) can be read from the working directory (or any other path) and transformed into a sparse count matrix, which can be loaded into an SCseq object. This requires the Matrix package, which is installed as a dependency together with RaceID. The following chunk of code can be used:

require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]

sc <- SCseq(x)

How should I analyze large datasets (>25k cells)?

Datasets with >25k cells should be analyzed with VarID2 instead of RaceID, which is more suitable for large datasets and does not require a distance matrix. The memory requirement of a distance matrix scales with N^2 as a function of the number of cells N. VarID2 should be run with with parameters large=TRUE and regNB=TRUE of the pruneKnn function in order to achieve optimal performance (default parameters). To increase processing speed VarID2 can be run with multiple cores. However, each core requires additional memory, and with too many cores VarID2 will run out of memory. To reduce computation time, the number of genes used for the negative binomial regression of total UMI count per cell can be limited to, e.g., 2000 by setting the ngenes parameter.

Example for mouse data, including filtering of mitochondrial genes, ribosomal genes, and Gm* genes (and genes correlating to these groups):

require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]

sc <- SCseq(x)
sc <- filterdata(sc,mintotal=1000,CGenes=rownames(x)[grep("^(mt|Rp(l|s)|Gm\\d)",rownames(x))])
expData <- getExpData(sc)
res   <- pruneKnn(expData,no_cores=5)
cl    <- graphCluster(res,pvalue=0.01)
probs <- transitionProbs(res,cl)

## compute noise from corrected variance
noise <- compTBNoise(res,expData,pvalue=0.01,no_cores=5) 
sc <- updateSC(sc,res=res,cl=cl,noise=noise)

sc <- comptsne(sc)
sc <- compumap(sc)

What can I do if too many clusters were detected in my dataset?

This could be a consequence of technical noise, e.g., if experimental batches with strong differences in total UMI counts were co-analyzed. In such cases the background models of RaceID or VarID2 could indicate the presence of many spurious outliers. In RaceID, the number of outliers can be reduced by setting the the probthr parameter of the findoutliers function to values closer to zero. This is the probability threshold for outlier detection. In VarID2 the pvalue parameter of the graphCluster function can be set to lower values. This is the probability threshold for pruning links of the knn network. In the extreme case that no outliers (or pruning) are desired, probthr (or pvalue) should be set to zero. Moreover, the use.weights parameter of the graphCluster function can be set to FALSE; in this case, Louvain clustering is performed on a (pruned) knn-network with equal weights for all links. By default, links are weighted by the link probability, and if strong technical differences lead to low link probabilities, this could be the reason for the emergence of too many clusters.

References

Grun, D. 2020. “Revealing dynamics of gene expression variability in cell state space.” Nat. Methods 17 (1): 45–49.

Grun, D., L. Kester, and A. van Oudenaarden. 2014. “Validation of noise models for single-cell transcriptomics.” Nat. Methods 11 (6): 637–40.

Grun, D., A. Lyubimova, L. Kester, K. Wiebrands, O. Basak, N. Sasaki, H. Clevers, and A. van Oudenaarden. 2015. “Single-cell messenger RNA sequencing reveals rare intestinal cell types.” Nature 525 (7568): 251–55.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2): 266–77.

Hafemeister, C., and R. Satija. 2019. “Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression.” Genome Biol. 20 (1): 296.

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnology 36 (5): 421–27.

Herman, J. S., Sagar, and D. Grun. 2018. “FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data.” Nat. Methods 15 (5): 379–86.

Lause, J., P. Berens, and D. Kobak. 2020. “Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data.” Genome Biol. 22 (1): 258.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.

Rosalez-Alvarez, R.E., J. Rettkowski, J.S. Herman, G. Dumbovic, N. Cabezas-Wallscheid, and D. Grun. 2023. “Gene expression noise dynamics unveil functional heterogeneity of ageing hematopoietic stem cells.” Genome Biol. 24 (1): 148.

Street, K., D. Risso, R.B. Fletcher, D. Das, J. Ngai, N. Yosef, E. Purdom, and S. Dudoit. 2018. “Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics.” BMC Genomics 19 (June): 477.

RaceID/StemID/VarID reference manual

Dominic Grun

2024-04-03