Tutorial 2: Using metadata_ondisc_matrix and multimodal_ondisc_matrix

Tutorial 1 covered ondisc_matrix, the core class implemented by ondisc. Thus tutorial covers metadata_ondisc_matrix and multimodal_ondisc_matrix, two additional classes provided by the package. metadata_ondisc_matrix stores cell-specific and feature-specific covariate matrices alongside the expression matrix, and multimodal_ondisc_matrix stores multiple metadata_ondisc_matrices representing different cellular modalities. Together, metadata_ondisc_matrix and multimodal_ondisc_matrix facilitate feature selection, quality control, and other common single-cell data preprocessing tasks.

We begin by loading the package.

library(ondisc)

The metadata_ondisc_matrix class

A metadata_ondisc_matrix object consists of three components: (i) an ondisc_matrix representing the expression data, (ii) a data frame storing the cell-specific covariates, and (iii) a data frame storing the feature-specific covariates. The easiest way to initialize a metadata_ondisc_matrix is by calling create_ondisc_matrix_from_mtx on an mtx file and associated metadata files, setting the optional parameter return_metadata_ondisc_matrix to TRUE. Below, we reproduce the example from Tutorial 1, this time returning a metadata_ondisc_matrix instead of a list.

# Set paths to the .mtx and .tsv files
raw_data_dir <- system.file("extdata", package = "ondisc")
mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")

# Specify directory in which to store the .h5 file
temp_dir <- tempdir()

# Initialize metadata_ondisc_matrix
expressions <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                              barcodes_fp = barcodes_fp,
                                              features_fp = features_fp,
                                              on_disk_dir = temp_dir,
                                              return_metadata_ondisc_matrix = TRUE)
#> 
|========                                                                 | 11%
|=================                                                        | 23%
|==========================                                               | 36%
|====================================                                     | 48%
|=============================================                            | 61%
|======================================================                   | 73%
|===============================================================          | 86%
|=========================================================================| 98%
|=========================================================================| 100%
#> 
|========                                                                 | 11%
|=================                                                        | 23%
|==========================                                               | 36%
|====================================                                     | 48%
|=============================================                            | 61%
|======================================================                   | 73%
|===============================================================          | 86%
|=========================================================================| 98%
|=========================================================================| 100%
#> Writing CSC data.
#> Writing CSR data.

The variable expressions is an object of class metadata_ondisc_matrix; expressions contains the fields ondisc_matrix, cell_covariates, and feature_covariates.

# Print the variable
expressions
#> A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.

We alternately can initialize a metadata_ondisc_matrix by calling the constructor function of the metadata_ondisc_matrix class; see documentation (via ?metadata_ondisc_matrix) for details.

The multimodal_ondisc_matrix class

The multimodal_ondisc_matrix class is used to represent multimodal data. multimodal_ondisc_matrix objects have two fields: (i) a named list of metadata_ondisc_matrices representing different modalities, and (ii) a global (i.e., cross-modality) cell-specific covariate matrix. The ondisc package ships with example CRISPR perturbation data, which we use to initialize a new perturbation modality via a call to create_ondisc_matrix_from_mtx.

# Set paths to the perturbation .mtx and .tsv files
mtx_fp <- paste0(raw_data_dir, "/perturbation.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/guides.tsv")

# Initialize metadata_ondisc_matrix
perturbations <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                               barcodes_fp = barcodes_fp,
                                               features_fp = features_fp,
                                               on_disk_dir = temp_dir,
                                               return_metadata_ondisc_matrix = TRUE)
#> 
#> Writing CSC data.
#> Writing CSR data.

Like expressions, the variable perturbations is an object of class metadata_ondisc_matrix. However, because perturbations represents logical perturabtion data instead of integer gene expression data, the cell-specific and feature-specific covariates of perturbations differ from those of expressions.

# These matrices have different columns
head(expressions@cell_covariates)
#>   n_nonzero n_umis     p_mito
#> 1        43    214 0.04672897
#> 2        26    169 0.00000000
#> 3        22    116 0.05172414
#> 4        37    258 0.08139535
#> 5        36    224 0.08035714
#> 6        31    147 0.07482993
head(perturbations@cell_covariates)
#>   n_nonzero
#> 1        21
#> 2        11
#> 3        17
#> 4        14
#> 5         7
#> 6        11

The expressions and perturbations data are multimodal – they are collected from the same set of cells. We can create a multimodal_ondisc_matrix by passing a named list of metadata_ondisc_matrix objects – in this case, expressions and perturbations – to the constructor function of the multimodal_ondisc_matrix class.

modality_list <- list(expressions = expressions, perturbations = perturbations)
crispr_experiment <- multimodal_ondisc_matrix(modality_list)

The variable crispr_experiment is an object of class multimodal_ondisc_matrix. The column names of the global covariate matrix are derived from the names of the modalities.

# print variable
crispr_experiment
#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 250 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero.
#>  A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.

# show the global covariate matrix
head(crispr_experiment@global_cell_covariates)
#>   expressions_n_nonzero expressions_n_umis expressions_p_mito
#> 1                    43                214         0.04672897
#> 2                    26                169         0.00000000
#> 3                    22                116         0.05172414
#> 4                    37                258         0.08139535
#> 5                    36                224         0.08035714
#> 6                    31                147         0.07482993
#>   perturbations_n_nonzero
#> 1                      21
#> 2                      11
#> 3                      17
#> 4                      14
#> 5                       7
#> 6                      11

The figure below summarizes the relationship between ondisc_matrix, metadata_ondisc_matrix, and multimodal_ondisc_matrix.

**Figure**: Classes provided by the package. a) `ondisc_matrix`, b) `metadata_ondisc_matrix`, c) `multimodal_ondisc_matrix`

Figure: Classes provided by the package. a) ondisc_matrix, b) metadata_ondisc_matrix, c) multimodal_ondisc_matrix

Querying basic information

We can use the functions get_feature_ids, get_feature_names, and get_cell_barcodes to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of a metadata_ondisc_matrix or a multimodal_ondisc_matrix. get_feature_ids and get_feature_names return a list when called on a multimodal_ondisc_matrix, as the different modalities contain different features.

# metadata_ondisc_matrix
cell_barcodes <- get_cell_barcodes(expressions)
feature_ids <- get_feature_ids(expressions)
feature_names <- get_feature_names(expressions)

# multimodal_ondisc_matrix
cell_barcodes <- get_cell_barcodes(crispr_experiment)
feature_ids <- get_feature_ids(crispr_experiment)

We likewise can use dim, nrow, and ncol to query the dimension, number of rows, and number of columns of a metadata_ondisc_matrix or multimodal_ondisc_matrix. dim and nrow again return lists when called on a multimodal_ondisc_matrix.

# metadata_ondisc_matrix
dim(expressions)
#> [1] 300 900
nrow(expressions)
#> [1] 300
ncol(expressions)
#> [1] 900

# multimodal_ondisc_matrix
dim(crispr_experiment)
#> $expressions
#> [1] 300 900
#> 
#> $perturbations
#> [1] 250 900
nrow(crispr_experiment)
#> $expressions
#> [1] 300
#> 
#> $perturbations
#> [1] 250
ncol(crispr_experiment)
#> [1] 900

Subsetting

Similar to ondisc_matrices, metadata_ondisc_matrices and multimodal_ondisc_matrices can be subset using the [ operator. metadata_ondisc_matrices can be subset either by feature or cell, while multimodal_ondisc_matrices can be subset by cell only.

# metadata_ondisc_matrix
# keep cells 100 - 150
expressions_sub <- expressions[,100:150]
# keep genes ENSG00000188305, ENSG00000257284, ENSG00000251655
expressions_sub <- expressions[c("ENSG00000188305", "ENSG00000257284", "ENSG00000251655"),]

# multimodal_ondisc_matrix
# keep all cells except 1 - 100
crispr_experiment_sub <- crispr_experiment[,-c(1:100)]

As with ondisc_matrices, the original objects remain unchanged.

expressions
#> A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
crispr_experiment
#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 250 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero.
#>  A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.

Notes and tips