Single-cell Sorter

Hongyu Guo

2021-03-16

The scSorter package implements the semi-supervised cell type assignment algorithm described in "scSorter: assigning cells to known cell types according to known marker genes". This algorithm assigns cells to known cell types, assuming that the identities of marker genes are given but the exact expression levels of marker genes are unavailable. This vignette will illustrate the steps to apply the algorithm to single-cell RNA sequencing (scRNA-seq) data.

Section 1 - Preliminaries

scSorter takes as input data the expression matrix from single-cell RNA sequencing and the annotation file that specifies the names of marker genes for each cell type of interest. scSorter assumes the input expression data have been properly normalized for the library size and have been properly transformed (e.g. log-transformation) to stabilize the variance.

To start the analysis, load scSorter package:

library(scSorter)

Real Data Example - TM pancreas

In this vignette, we will use the data created by the Tabula Muris Consortium (Tabula Muris Consortium and others 2018) to illustrate the procedure for scSorter to assign cells to known cell types. We follow the analysis presented in our paper and focus on pancreas tissue from which 1,564 cells with valid cell type annotation are available. The cells are from: pancreatic A (390 cells), pancreatic B (449 cells), pancreatic D (140 cells), pancreatic PP (73 cells), pancreatic acinar (182 cells), pancreatic ductal (161 cells), pancreatic stellate (49 cells), endothelial (66 cells) and immune (54 cells). Marker genes for these cell types are extracted from the original study.

Examing the data

We first load the data to ensure they are in the correct formats.

load(url('https://github.com/hyguo2/scSorter/blob/master/inst/extdata/TMpancreas.RData?raw=true'))

The expression matrix expr should represent genes by rows and cells by columns.

expr[1:5, 1:5]
#>               A1.MAA000884.3_10_M.1.1 A1.MAA001862.3_39_F.1.1
#> 0610005C13Rik                       0                       0
#> 0610007C21Rik                     168                     162
#> 0610007L01Rik                       1                       2
#> 0610007N19Rik                       0                       0
#> 0610007P08Rik                       0                       0
#>               A10.MAA000884.3_10_M.1.1 A10.MAA001857.3_38_F.1.1
#> 0610005C13Rik                        0                        0
#> 0610007C21Rik                      121                       18
#> 0610007L01Rik                       44                      170
#> 0610007N19Rik                        7                       22
#> 0610007P08Rik                        2                        0
#>               A11.MAA000577.3_8_M.1.1
#> 0610005C13Rik                       0
#> 0610007C21Rik                     373
#> 0610007L01Rik                       0
#> 0610007N19Rik                       0
#> 0610007P08Rik                       0
dim(expr)
#> [1] 23433  1564

Next, we check the annotation file.

head(anno)
#>                 Type Marker Weight
#> 1  Endothelial_cells Pecam1      2
#> 2  Endothelial_cells   Cdh5      2
#> 3  Endothelial_cells    Kdr      2
#> 4             Immune  Ptprc      2
#> 5 Pancreatic_A_cells    Gcg      2
#> 6 Pancreatic_A_cells   Mafb      2

The annotation file could be stored in a matrix or a data frame. Two columns, "Type" and "Marker", are mandatory as they record the names of marker genes for each cell type of interest and will guide scSorter to assign cells to each cell type. The third column "Weight" is optional. As explained in our paper, weights could be assigned to each marker gene to represent their relative importance during the cell type assignment. A larger weight reflects more confidence of a gene being a marker gene of the corresponding cell type. If such knowledge is not available, we recommend simply choose a constant value for all marker genes by setting the default_weight option in the scSorter main function and the "Weight" column could therefore be omitted.

Now that the formats of these data are correct, we could proceed to the next step.

Section 2 - Preprocessing the data

scSorter does not need all genes from the original expression matrix to conduct cell type assignment. Instead, it uses only marker genes and a selected group of highly variable genes other than marker genes to conduct the analysis. scSorter also requires the input expression data to be properly normalized for the library size and properly transformed to stabilize the variance. Here is a simple example of how this can be done:

Choose highly variable genes.

topgenes = xfindvariable_genes(expr, ngenes = 2000)

Normalize the input data.

expr = xnormalize_scData(expr)

We also filter out genes with non-zero expression in less than 10% of total cells.

topgene_filter = rowSums(as.matrix(expr)[topgenes, ]!=0) > ncol(expr)*.1
topgenes = topgenes[topgene_filter]

At last, we subset the preprocessed expression data. Now, we are ready to run scSorter.

picked_genes = unique(c(anno$Marker, topgenes))
expr = expr[rownames(expr) %in% picked_genes, ]

The above code is only for showing the concept. In practice, there are plenty of packages that could preprocess the data. Here, we use the package Seurat to illustrate this step (please remove the # mark in the code when you run it).

First, we normalize and transform the data with NormalizeData() function.

#library(Seurat)
#expr_obj = CreateSeuratObject(expr)
#expr_obj <- NormalizeData(expr_obj, normalization.method = "LogNormalize", scale.factor = 10000, verbose = F)

Then, we choose highly variable genes by FindVariableFeatures() function. We also filter out genes with non-zero expression in less than 10% of total cells.

#expr_obj <- FindVariableFeatures(expr_obj, selection.method = "vst", nfeatures = 2000, verbose = F)
#topgenes <- head(VariableFeatures(expr_obj), 2000)

#expr = GetAssayData(expr_obj)
#topgene_filter = rowSums(as.matrix(expr)[topgenes, ]!=0) > ncol(expr)*.1
#topgenes = topgenes[topgene_filter]

At last, we subset the preprocessed expression data. Now, we are ready to run scSorter.

#picked_genes = unique(c(anno$Marker, topgenes))
#expr = expr[rownames(expr) %in% picked_genes, ]

Section 3 - Running scSorter

The scSorter function requires the preprocessed data as input. As mentioned in Section 1, if the "Weight" column is omitted in the annotation file, a single weight should be specified for marker genes by using the default_weight option. Otherwise, the algorithm will use the weights from the annotation file to conduct the analysis.

rts <- scSorter(expr, anno)

Viewing Results

The cell type assignment results are stored in the Pred_Type vector.

print(table(rts$Pred_Type))
#> 
#>         Endothelial_cells                    Immune        Pancreatic_A_cells 
#>                        63                        41                       416 
#>   Pancreatic_Acinar_cells        Pancreatic_B_cells        Pancreatic_D_cells 
#>                       185                       457                       116 
#>   Pancreatic_Ductal_cells       Pancreatic_PP_cells Pancreatic_Stellate_cells 
#>                       162                        73                        33 
#>                   Unknown 
#>                        18

The misclassification rate is:

mis_rate = 1 - mean(rts$Pred_Type == true_type)
round(mis_rate, 4)
#> [1] 0.055

The confusion matrix is:

table(true_type, rts$Pred_Type)
#>                            
#> true_type                   Endothelial_cells Immune Pancreatic_A_cells
#>   Endothelial_cells                        63      0                  0
#>   Immune                                    0     41                  6
#>   Pancreatic_A_cells                        0      0                384
#>   Pancreatic_Acinar_cells                   0      0                  0
#>   Pancreatic_B_cells                        0      0                  0
#>   Pancreatic_D_cells                        0      0                 16
#>   Pancreatic_Ductal_cells                   0      0                  0
#>   Pancreatic_PP_cells                       0      0                  9
#>   Pancreatic_Stellate_cells                 0      0                  1
#>                            
#> true_type                   Pancreatic_Acinar_cells Pancreatic_B_cells
#>   Endothelial_cells                               2                  1
#>   Immune                                          3                  0
#>   Pancreatic_A_cells                              0                  1
#>   Pancreatic_Acinar_cells                       176                  2
#>   Pancreatic_B_cells                              0                446
#>   Pancreatic_D_cells                              0                  6
#>   Pancreatic_Ductal_cells                         1                  0
#>   Pancreatic_PP_cells                             0                  1
#>   Pancreatic_Stellate_cells                       3                  0
#>                            
#> true_type                   Pancreatic_D_cells Pancreatic_Ductal_cells
#>   Endothelial_cells                          0                       0
#>   Immune                                     0                       0
#>   Pancreatic_A_cells                         0                       0
#>   Pancreatic_Acinar_cells                    0                       0
#>   Pancreatic_B_cells                         0                       0
#>   Pancreatic_D_cells                       115                       0
#>   Pancreatic_Ductal_cells                    1                     159
#>   Pancreatic_PP_cells                        0                       0
#>   Pancreatic_Stellate_cells                  0                       3
#>                            
#> true_type                   Pancreatic_PP_cells Pancreatic_Stellate_cells
#>   Endothelial_cells                           0                         0
#>   Immune                                      0                         0
#>   Pancreatic_A_cells                          5                         0
#>   Pancreatic_Acinar_cells                     1                         0
#>   Pancreatic_B_cells                          3                         0
#>   Pancreatic_D_cells                          3                         0
#>   Pancreatic_Ductal_cells                     0                         0
#>   Pancreatic_PP_cells                        61                         0
#>   Pancreatic_Stellate_cells                   0                        33
#>                            
#> true_type                   Unknown
#>   Endothelial_cells               0
#>   Immune                          4
#>   Pancreatic_A_cells              0
#>   Pancreatic_Acinar_cells         3
#>   Pancreatic_B_cells              0
#>   Pancreatic_D_cells              0
#>   Pancreatic_Ductal_cells         0
#>   Pancreatic_PP_cells             2
#>   Pancreatic_Stellate_cells       9

References

Tabula Muris Consortium and others. 2018. “Single-Cell Transcriptomics of 20 Mouse Organs Creates a Tabula Muris.” Nature 562 (October). Genome Research: 367–72. doi:10.1038/s41586-018-0590-4.