library(coreheat)
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which, which.max, which.min
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#>
Create correlation heatmaps from a numeric matrix. Ensembl Gene ID row names can be converted to Gene Symbols using, e.g., BioMart. Optionally, data can be clustered and filtered by correlation, tree cutting and/or number of missing values. Genes of interest can be highlighted in the plot and correlation significance be indicated by asterisks encoding corresponding P-Values. Plot dimensions and label measures are adjusted automatically by default. The plot features rely on the heatmap.n2() function in the ‘heatmapFlex’ package.
install.packages("coreheat")
install.packages("devtools")
::install_github("vfey/coreheat") devtools
Generate a random 10x10 matrix with two distinct sets and plot it with default settings without ID conversion since the IDs are made up.
set.seed(1234)
<- matrix(c(rnorm(100, mean = 1), rnorm(100, mean = -1)), nrow = 20)
mat rownames(mat) <- paste0("gene-", 1:20)
colnames(mat) <- paste0(c("A", "B"), rep(1:5, 2))
cormap2(mat, convert = FALSE, main="Random matrix")
A dataset consisting of 333 patients with primary prostate cancer was downloaded from The Cancer Genome Atlas (TCGA) via the cBioPortal for Cancer Genomics and two subsets were created, PrCaTCGASample.txt containing genes commonly associated with cancer, and PrCaTCGASample_500.txt containing 500 genes selected at random.
Function is used to convert Ensembl Gene IDs to HGNC Symbols.
I the following examples, we use the smaller of the two to demonstrate the basic functionality of the package.
26 genes commonly associated with cancer in general were selected to provide a small demonstration data set.
## Read data and prepare input data frame
<- system.file("extdata", "PrCaTCGASample.txt", package = "coreheat", mustWork = TRUE)
fl <- read.delim(fl, stringsAsFactors=FALSE)
dat0 <- data.frame(dat0[, grep("TCGA", names(dat0))], row.names=dat0$ensembl_gene_id)
dat1 cormap2(dat1, main="TCGA data frame + ID conversion")
Separately supplied IDs are used with a matrix created from the data frame of the previous example and genes of interest are highlighted inside the heatmap to emphasise individual pair-wise correlations.
<- as.matrix(dat0[, grep("TCGA", names(dat0))])
dat2 <- dat0$hgnc_symbol
sym cormap2(dat1, convert=FALSE, lab=sym, genes2highl=c("GNAS","NCOR1","AR", "ATM"),
main="TCGA matrix + custom labels")
The ExpressionSet object is created using the numeric data frame created above from the small TCGA dataset.
<- Biobase::ExpressionSet(as.matrix(dat1))
expr cormap2(expr, add.sig=TRUE, autoadj=FALSE, cex=1, main="TCGA ExpressionSet object + ID conversion")
500 genes were selected at random to demonstrate handling of larger datasets. We will first plot the entire matrix and then use the filtering capabilities of cormap2
to identify possibly co-expressed genes. ##### Full map
<- system.file("extdata", "PrCaTCGASample_500.txt", package = "coreheat", mustWork = TRUE)
fl <- read.delim(fl, stringsAsFactors=FALSE)
dat0_500 <- data.frame(dat0_500[, grep("TCGA", names(dat0_500))], row.names=dat0_500$ensembl_gene_id)
dat1_500 cormap2(dat1_500, biomart = TRUE, main="TCGA 500 genes")
Dendrogram branches are cut using function cutreeStatic
from the WGCNA
package. This will reduce the number of branches and, in consequence, the size of the correlation map, while retaining closely correlated values.
Note that getting the values for cut.thr
and cut-size
right can take a bit trial and error.
The number of rows left after filtering is added automatically if cutting is enabled. To remove that sub-title set postfix
to FALSE
.
We also highlight a few genes that all appear to have a high correlation to see if we find them again in the next plot.
<- cormap2(dat1_500, cut.thr = 1.5, cut.size = 40, biomart = TRUE,
cm genes2highl = c("RPLP1", "NOP53", "RPL13", "RPS17", "RPL11", "RPS24", "RPL30", "EEF1D", "RPL27",
"SNHG29", "HSPA1A", "HSPA1B", "DNAJB1", "BHLHE40", "NR4A1", "JUND", "HLA-A", "HLA-B",
"BTF3", "EEF1A1P9"),
main="TCGA 500 genes", verbose = TRUE)
#> @ Calculating correlation matrix...
#> Computing correlations...
#> Calculating P-Values...
#> Getting NAs...
#> Getting counts...
#> Getting P-Values...
#> @ Clustering and filtering correlation matrix...
#> Input is a list with correlation matrix and pvalues. Assuming non-clustered matrix.
#> Filtering NAs...
#> Getting distance matrix and hierarchical clustering...
#> Cutting the dendrogram at 1.5...
#>
#> Isolated tree section with 185 genes
#> Recalculating distance matrix and hierarchical clustering...
#> Ordering according to hierarchical clustering...
#> @ Formatting output matrix...
#> Checking input IDs:
#> Attempting to convert input matrix row names...
#> Using BioMart: 'ENSEMBL_MART_ENSEMBL'
#> Input ID type is 'ensembl_gene_id'
#> Information requested: 'hgnc_symbol'...
#> Merging input IDs and converted IDs...
#> done
#> @ Generating heatmap...
#> Ordering Y-axis...
This effectively zooms into the upper right corner of the large heatmap. In the next step we reduce the branch height to get regions with even shorter distances.
To start where we left off we use the correlation matrix produced in the previous step. We set the cut threshold quite low to get the shortest distances, only, and the number of objects per branch (needed to be considered a cluster) to a small value, too, to also find smaller regions.
<- cormap2(cormat = cm, cut.thr = 1.2, cut.size = 20, biomart = TRUE,
cm2 genes2highl = c("RPLP1", "NOP53", "RPL13", "RPS17", "RPL11", "RPS24", "RPL30", "EEF1D", "RPL27",
"SNHG29", "HSPA1A", "HSPA1B", "DNAJB1", "BHLHE40", "NR4A1", "JUND", "HLA-A", "HLA-B",
"BTF3", "EEF1A1P9"),
main="TCGA 500 genes", cex = 0.9, verbose = TRUE)
#> Input is a complete list with correlation matrix and P-Values.
#> @ Clustering and filtering correlation matrix...
#> Input is a list with correlation matrix and pvalues. Assuming non-clustered matrix.
#> Filtering NAs...
#> Getting distance matrix and hierarchical clustering...
#> Cutting the dendrogram at 1.2...
#> Isolated tree section with 75 genes
#> Recalculating distance matrix and hierarchical clustering...
#> Ordering according to hierarchical clustering...
#> @ Formatting output matrix...
#> Checking input IDs:
#> Attempting to convert input matrix row names...
#> Using BioMart: 'ENSEMBL_MART_ENSEMBL'
#> Input ID type is 'ensembl_gene_id'
#> Information requested: 'hgnc_symbol'...
#> Merging input IDs and converted IDs...
#> done
#> @ Generating heatmap...
#> Ordering Y-axis...
The genes we highlighted in the large map are found here, too. Filtering by branch height reduced the matrix further “picking” regions with similarly correlating values, in our case the bottom left block of positively correlated genes.
But what if we want to filter for only highly correlating values? The package offers two approaches to pick “clusters” of values with a correlation above a certain threshold, a sliding window approach where the map is searched for the threshold and then the window is expanded as long as the threshold is met, and a simple margin approach where the threshold must be met in a given percentage of the rows.
Sliding window The user can set the size of the window, the correlation threshold and the “number” of the cluster to be picked. The cluster “number” here is a subjective value chosen after inspecting the source heatmap by eye and starting to count from the bottom left. For example, the first cluster with a correlation above 0.6 should give us the darker red block in the bottom left corner of the first map (object cm
). Hence we set the window size to 5 and ask the function to highlight all the genes it finds from the same list as above. We add significance asterisks to see how good our correlation is.
<- cormap2(cormat = cm, cor.thr = 0.6, cor.window = 5, cor.cluster = 1, biomart = TRUE,
cm3 genes2highl = c("RPLP1", "NOP53", "RPL13", "RPS17", "RPL11", "RPS24", "RPL30", "EEF1D", "RPL27",
"SNHG29", "HSPA1A", "HSPA1B", "DNAJB1", "BHLHE40", "NR4A1", "JUND", "HLA-A", "HLA-B",
"BTF3", "EEF1A1P9"),
add.sig = TRUE, cex = 2, main="TCGA 500 genes", verbose = TRUE)
#> Input is a complete list with correlation matrix and P-Values.
#> @ Clustering and filtering correlation matrix...
#> Input is a list with correlation matrix and pvalues. Assuming non-clustered matrix.
#> Filtering NAs...
#> Getting distance matrix and hierarchical clustering...
#> Ordering according to hierarchical clustering...
#> Filtering by correlation...
#> Filtering by correlation threshold in a window of 5
#> Cycle 2 -> Found 1. cluster of correlating values (cor. threshold 0.6)
#> -> Expanding window...
#> -> done.
#> Filtered 170 rows not meeting correlation threshold (0.6 in 50% of the columns)
#> @ Formatting output matrix...
#> Checking input IDs:
#> Attempting to convert input matrix row names...
#> Using BioMart: 'ENSEMBL_MART_ENSEMBL'
#> Input ID type is 'ensembl_gene_id'
#> Information requested: 'hgnc_symbol'...
#> Merging input IDs and converted IDs...
#> done
#> @ Generating heatmap...
#> Ordering Y-axis...
This approach picked the block we expected.
To find negative or positive correlations, the function offers the possibility to filter by correlation threshold and a margin of rows in which the threshold is met. We start with the first matrix above and set the margin to 0.1 which corresponds to 10% of the rows. Note that this will pick all possible rows, not only adjacent ones. Positive correlation
<- cormap2(cormat = cm, cor.thr = 0.6, cor.mar = 0.3, biomart = TRUE,
cm3 genes2highl = c("RPLP1", "NOP53", "RPL13", "RPS17", "RPL11", "RPS24", "RPL30", "EEF1D", "RPL27",
"SNHG29", "HSPA1A", "HSPA1B", "DNAJB1", "BHLHE40", "NR4A1", "JUND", "HLA-A",
"HLA-B"),
main="TCGA 500 genes", cex = 1, verbose = T)
#> Input is a complete list with correlation matrix and P-Values.
#> @ Clustering and filtering correlation matrix...
#> Input is a list with correlation matrix and pvalues. Assuming non-clustered matrix.
#> Filtering NAs...
#> Getting distance matrix and hierarchical clustering...
#> Ordering according to hierarchical clustering...
#> Filtering by correlation...
#> Filtering by correlation threshold with a margin of 30%
#> Filtered 120 rows not meeting correlation threshold (0.6 in 30% of the columns)
#> @ Formatting output matrix...
#> Checking input IDs:
#> Attempting to convert input matrix row names...
#> Using BioMart: 'ENSEMBL_MART_ENSEMBL'
#> Input ID type is 'ensembl_gene_id'
#> Information requested: 'hgnc_symbol'...
#> Merging input IDs and converted IDs...
#> done
#> @ Generating heatmap...
#> Ordering Y-axis...
This also extracted a region from the positively correlated region at the bottom left of the original matrix but only the genes correlating above the threshold.
Negative correlation Since there are less and less extreme negative correlations in the matrix we have to set the thresholds to very lax values.
<- cormap2(cormat = cm, cor.thr = -0.2, cor.mar = 0.2, biomart = F,
cm3 genes2highl = c("RPLP1", "NOP53", "RPL13", "RPS17", "RPL11", "RPS24", "RPL30", "EEF1D", "RPL27",
"SNHG29", "HSPA1A", "HSPA1B", "DNAJB1", "BHLHE40", "NR4A1", "JUND", "HLA-A",
"HLA-B"),
add.sig = TRUE, cex = 2, main="TCGA 500 genes")
The function cormap_filt
splits (cuts) a dendrogram of a hierarchically clustered distance matrix at a given threshold dividing it into larger or smaller “sub-clusters”. Correlation P-Values are converted to represent significance as a sub-cluster-wise signal metric used for filtering. Optionally, up to 3 plots are produced, the third one being a filtered heatmap based on significance and three height cutting.
cormap_filt
is a convenience function that accepts an ExpressionSet, a data frame or a numeric matrix as input. Note, that it will not work on a correlation matrix, as it needs the P-Values calculated internally for filtering.
cormap_filt(dat1_500, main = "TCGA 500 genes", cut.thr = 21, p.thr = 0.01, cex.filt = 0.8)