Single cell analysis is a powerful method that allows for the deconvolution of the effect of treatments on complex populations containing different cell types, that may or may not respond to specific treatments. Depending on the technology used, the analytes can be genes, transcripts, proteins or metabolites. Using mass cytometry, bodenmiller et al measured the level of 9 proteins and 14 post-translational modifications. After using signal intensity from the 9 proteins (so called phenotypic markers) to define 14 sub-populations, they monitored the effect of several treatments using the 14 post-translational modifications.
Modeling and visualization of these type of data is challenging: the large number of events measured combined to the complexity of each samples is making the modeling complex, while the high dimensionality of the data precludes the use of standard visualizations.
The goal of this package is to enable the development of new methods by providing a curated set of data for testing and benchmarking.
For details on data acquisition please refer to Bodenmiller et al Nat Biotech 2012. Briefly, after treatment cells where profiled using a CyTOF, dead cells and debris were excluded and live cells were assigned to 1 of the 14 sub-populations using signal intensity from 9 phenotypic markers.
Samples corresponding to untreated cells, stimulated with BCR/FcR-XL, PMA/Ionomycin or vanadate or unstimulated, were downloaded from CytoBank as FCS files. Data was extracted and normalized using the
arcsinh function with a cofactor of 5.
We begin by assembling a full dataset from the
bodenmiller package, before filtering down to T cells:
data(refPhenoMat) data(refFuncMat) data(refAnnots) ref.df <- data.frame(refAnnots, refPhenoMat, refFuncMat) data(untreatedPhenoMat) data(untreatedFuncMat) data(untreatedAnnots) untreated.df <- bind_rows(ref.df %>% mutate(Treatment='unstimulated', Source=as.character(Source), Cells=as.character(Cells)), data.frame(untreatedAnnots, untreatedPhenoMat, untreatedFuncMat) %>% mutate(Treatment=as.character(Treatment), Source=as.character(Source), Cells=as.character(Cells))) %>% mutate(Treatment=factor(Treatment), Treatment=relevel(Treatment,'unstimulated'), Cells=factor(Cells)) btcells.df <- untreated.df %>% filter(Cells %in% c('cd8+','igm+')) %>% mutate(Cells=droplevels(Cells)) %>% group_by(Cells,Treatment) %>% mutate(cellID=seq(length(Cells))) %>% unite('cellID',one_of(c('Treatment','Cells','cellID')),sep = '_',remove = FALSE)
We end up with 19807 cells to analyse, broken down by stimulation condition as follows:
## # A tibble: 8 x 5 ## # Groups: Cells, Treatment  ## Cells Treatment n Total Fraction ## <fct> <fct> <int> <int> <dbl> ## 1 cd8+ unstimulated 5068 15792 32.1 ## 2 cd8+ BCR/FcR-XL 5098 15252 33.4 ## 3 cd8+ PMA/Ionomycin 5122 14576 35.1 ## 4 cd8+ vanadate 1200 6316 19 ## 5 igm+ unstimulated 1069 15792 6.8 ## 6 igm+ BCR/FcR-XL 863 15252 5.7 ## 7 igm+ PMA/Ionomycin 992 14576 6.8 ## 8 igm+ vanadate 395 6316 6.3
The Vanadate condition seems to contain less cells, and the fraction of CD4+ T cells is roughly 25% of the unstimulated sample. There is also a drop in the fraction of CD8+ T cells.
To simplify processing we will sample 1000 cells from each condition, with replacement where appropriate:
We used fan plots to visualize the phenotypic and functional profiles of CD8+ T cells and IgM+ B cells. As expected there is no change in phenotype while stimulation is changing the functional profile of each population.
btcells.df %>% gather('Channel','value', one_of(colnames(refPhenoMat),colnames(refFuncMat))) %>% filter(Channel %in% colnames(refPhenoMat)) %>% ggplot(aes(x=Channel,y=value))+ geom_fan()+ facet_grid(Treatment~Cells)+ theme_light(base_size = 16)+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
btcells.df %>% gather('Channel','value', one_of(colnames(refPhenoMat),colnames(refFuncMat))) %>% filter(Channel %in% colnames(refFuncMat)) %>% ggplot(aes(x=Channel,y=value))+ geom_fan()+ facet_grid(Treatment~Cells)+ theme_light(base_size = 16)+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
We select channels for analysis where 20% of cells from any treatment, any cell type, have a recorded intensity of at least 1 unit:
The following 20 were selected out of 23:
We start by projecting cells using the classical Radviz algorithm:
## Selected optimization function: in.da ## Starting performance: -102.1284 ## 0 Current performance: -104.6188 ## 1 Current performance: -105.0747 ## 2 Current performance: -107.0209 ## 3 Current performance: -107.0209 ## 4 Current performance: -107.995 ## 5 Current performance: -108.2137 ## 6 Current performance: -108.7123 ## 7 Current performance: -108.7661 ## 8 Current performance: -108.7724 ## 9 Current performance: -108.9197 ## 10 Current performance: -109.0426 ## 11 Current performance: -109.0426 ## 12 Current performance: -109.0968 ## 13 Current performance: -109.0968 ## 14 Current performance: -109.0968 ## 15 Current performance: -109.0968 ## 16 Current performance: -109.0968 ## 17 Current performance: -109.0968 ## Execution stopped after 17 iterations: no better solution found in the last 5 iterations
Next we visualize the projection, coloring individual cells by their treatment of origin:
Because of the spring paradigm used by Radviz the cells are concentrated to the center of the plot. Using the
rescalePlot function we can zoom inside the circle :
We see a clear effect of Vanadate, and 2 separate groups of points. We confirm with the next plot that those points actually correspond to B and T cells:
One can facet the display by population to visualize the effects of stimulation:
We can now visualize the effects of Vanadate and Ionomycin stimulations compared to unstimulated, but the contributions of individual channels are unclear and the effects of BCR/FcR-XL on B cells are unclear.
Moreover, the visualization depends on the optimized order of channels, which in turn depends on the relative amount of the different conditions. To address these challenges we implemented the Freeviz algorithm described in the next section.
Freeviz will optimize the order of channels as well as their weights based on predefined classes in the data. In that example we use stimulation and cell type :
##  "# iters: 21"
The final projection is shown below, with cells colored by stimulation:
Compared to Radviz, where the contribution of each channel is fixed through its position on the circle, in Freeviz the channels with the largest contribution to the difference between classes are the furthest away from the center of the projection. It is therefore easier to differentiate B cells from T cells, and to identify the channels that are affected the most by stimulation.
Freeviz plots can also be rescaled : if after rescaling the points extend beyond a particular anchor, the exact contribution of this anchor to the projection is lost except for the direction. The function will issue a warning whenever this occurs.
## Warning in rescalePlot(btcells.fv, fraction = 0.5): After rescaling the following anchors will be below the maximum range of points: ## pp38, pStat3, pZap70, pBtk, pStat5, pStat1, pErk, CD33, pSHP2, pLat, pAkt, pNFkB
After rescaling it is clear that a subset of B and T cells is not responding to the treatment:
In any case, it is possible to filter out the channels that have low influence on the projection:
We compute the cosine distance between cells based on all available channels:
btcells.dist <- as.matrix(btcells.df[,btcells.channels]) rownames(btcells.dist) <- btcells.df$cellID btcells.dist <- btcells.dist%*% t(btcells.dist) btcells.dist <- btcells.dist/(sqrt(diag(btcells.dist)) %*% t(sqrt(diag(btcells.dist)))) btcells.dist[btcells.dist>1] <- 1 btcells.dist[btcells.dist<0] <- 0 btcells.dist <- 2*acos(btcells.dist)/pi diag(btcells.dist) <- NA # avoid self loops
Next we build an adjacency matrix, selecting the 63 nearest neighbors foreach cell:
For 4000 vertices (cells) 250090 edges are recovered. We can compare the distribution of distances overall with that of selected nearest neighbors:
bind_rows(data.frame(value=btcells.dist[sample(1000)], Type='Overall', stringsAsFactors = FALSE), data.frame(value=btcells.adj[btcells.adj!=0][sample(1000)], Type='Nearest Neighbors', stringsAsFactors = FALSE)) %>% mutate(Type=factor(Type), Type=relevel(Type,'Overall')) %>% filter(!is.na(value)) %>% ggplot(aes(x=value))+ geom_histogram(aes(fill=Type), position = 'identity', bins=50, alpha=0.5)+ theme_light(base_size=16)
We enriched for cells that are closer together than the average population.
Next we turn the distances into weights using a gaussian kernel:
Next we build a kNN graph based on weights:
We use the Louvain algorithm to identify groups in the weighted T cells graph:
6 communities were identified.
Each community captures a specific population and a specific functional state, as illustrated in the following tables:
## # A tibble: 6 x 3 ## # Groups: Group  ## Group `cd8+` `igm+` ## <fct> <dbl> <dbl> ## 1 1 0.999 0.000880 ## 2 2 1 NA ## 3 3 0.996 0.00405 ## 4 4 NA 1 ## 5 5 0.00118 0.999 ## 6 6 NA 1
And with respect to treatment:
## # A tibble: 6 x 5 ## # Groups: Group  ## Group unstimulated `BCR/FcR-XL` `PMA/Ionomycin` vanadate ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 1 0.408 0.412 0.125 0.0546 ## 2 2 0.00806 0.0349 0.952 0.00538 ## 3 3 0.0709 0.0385 0.0101 0.881 ## 4 4 0.0298 0.507 0.463 NA ## 5 5 0.536 0.166 0.203 0.0952 ## 6 6 0.0474 0.00451 NA 0.948
Compared to Freeviz, where anchors are optimized based on classes, in Graphviz the anchors are optimized after the structure of the graph itself, similar to a force-directed layout with the context provided by anchors.
As with Radviz and Freeviz we start with optimizing the anchors:
##  "# iters: 24"
We can then plot the graph in the context of the selected channels, colored by community :
This plot is to be compared to the classical force directed layout used to visualize weighted graphs:
Points can be colored by cell assignment:
Confirming previous observations made with fan plots. Compared to classical visualization, one can quickly identify channels associated with specific communities.