library(cinaR)
data("atac_seq_consensus_bm")
Bed formatted consensus matrix (chr, start, end and samples)
dim(bed)
## [1] 1000 25
# bed formatted file
head(bed[,1:4])
## Chr Start End B6-18mo-M-BM-47-GT18-01783
## 52834 chr5 24841478 24845196 1592
## 29780 chr17 8162955 8164380 109
## 67290 chr8 40577584 40578029 72
## 51295 chr4 145277698 145278483 110
## 4267 chr1 180808752 180815472 2452
## 45102 chr3 88732151 88732652 49
Create the contrasts you want to compare. Here we create contrasts for 22 mouse samples from two strains (B6 and NZO).
# create the contrast vector which will be compared.
contrasts <- c("B6", "B6", "B6", "B6", "B6", "NZO", "NZO", "NZO", "NZO", "NZO", "NZO",
               "B6", "B6", "B6", "B6", "B6", "NZO", "NZO", "NZO", "NZO", "NZO", "NZO")
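The contrast vector needs one entry per sample column, in the same order as the columns of the consensus matrix. The sketch below checks this on a small stand-in data frame (toy_bed and toy_contrasts are hypothetical names, not part of the package), assuming the first three columns hold Chr, Start and End as in the table above.

```r
# Toy stand-in for `bed`: 3 coordinate columns followed by 4 sample columns.
toy_bed <- data.frame(
  Chr   = c("chr1", "chr2"),
  Start = c(100L, 500L),
  End   = c(200L, 650L),
  S1 = c(10, 3), S2 = c(12, 5), S3 = c(40, 1), S4 = c(38, 2)
)
toy_contrasts <- c("B6", "B6", "NZO", "NZO")

# Number of sample columns = total columns minus the 3 coordinate columns.
n_samples <- ncol(toy_bed) - 3

# Stop early if the contrast vector does not have one entry per sample.
stopifnot(length(toy_contrasts) == n_samples)

# Group sizes per contrast level.
table(toy_contrasts)
```

For the example data, the same check would compare length(contrasts) against ncol(bed) - 3, i.e. 22.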
The cinaR function directly computes the differentially accessible peaks.
# If reference genome is not set hg38 will be used!
results <- cinaR(bed, contrasts, reference.genome = "mm10")
## >> preparing features information... 2020-11-09 18:39:34
## >> identifying nearest features... 2020-11-09 18:39:35
## >> calculating distance from peak to TSS... 2020-11-09 18:39:36
## >> assigning genomic annotation... 2020-11-09 18:39:36
## >> assigning chromosome lengths 2020-11-09 18:39:57
## >> done... 2020-11-09 18:39:57
Now, you can access differential accessibility (DA) and enrichment results.
names(results)
## [1] "DA.results" "Enrichment.Results"
Inside DA.results, you have the consensus peaks (cp) and differentially accessible (DA) peaks. If batch correction was run, then cp will be a batch-corrected consensus matrix; otherwise it is the filtered and normalized version of the original consensus peaks you provided.
names(results$DA.results)
## [1] "cp" "DA.peaks"
cinaR provides a lot of information for each peak, such as adjusted p-values, log fold-changes, gene names, etc.:
colnames(results$DA.results$DA.peaks$B6_NZO)
## [1] "Row.names" "seqnames" "start" "end"
## [5] "width" "strand" "annotation" "geneChr"
## [9] "geneStart" "geneEnd" "geneLength" "geneStrand"
## [13] "geneId" "transcriptId" "distanceToTSS" "gene_name"
## [17] "logFC" "FDR"
Here is an overview of those DA peaks:
head(results$DA.results$DA.peaks$B6_NZO[,1:5])
## Row.names seqnames start end width
## 1 chr10_105840598_105842176 chr10 105840598 105842176 1579
## 2 chr10_59950325_59952673 chr10 59950325 59952673 2349
## 3 chr10_63176490_63176839 chr10 63176490 63176839 350
## 4 chr10_77220928_77221910 chr10 77220928 77221910 983
## 5 chr10_79751429_79751786 chr10 79751429 79751786 358
## 6 chr10_86021157_86023861 chr10 86021157 86023861 2705
Since the comparison is B6_NZO, positive fold-changes mean the peaks are more accessible in B6 compared to NZO, and vice versa for negative values!
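To make the sign convention concrete, here is a minimal sketch that labels peaks by direction. It uses a made-up table (toy_da, hypothetical values) that only mimics the logFC and FDR columns listed above; it is not output from cinaR.

```r
# Hypothetical DA table with only the columns needed for this illustration.
toy_da <- data.frame(
  Row.names = c("peak_a", "peak_b", "peak_c"),
  logFC     = c(1.8, -0.9, 2.3),
  FDR       = c(0.01, 0.03, 0.20)
)

# For a B6_NZO comparison: positive logFC means more accessible in B6,
# negative logFC means more accessible in NZO.
toy_da$direction <- ifelse(toy_da$logFC > 0,
                           "more accessible in B6",
                           "more accessible in NZO")

# Count peaks per direction.
table(toy_da$direction)
```

The same ifelse() call could be applied to results$DA.results$DA.peaks$B6_NZO, which carries a logFC column.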
Here is a brief overview of the enrichment analysis results:
head(results$Enrichment.Results$B6_NZO[,c("module.name","overlapping.genes", "adj.p")])
## module.name overlapping.genes adj.p
## 1 Myeloid lineage 1 TFEB,FBXL5,PLXNC1,GM2A,AGTPBP1,CTSB 0.05914491
## 2 U_metabolism/replication SLC2A6,GM2A,CTSB,PECAM1 0.05914491
## 3 U_mitochondrial proteins PIK3R1,PAQR3,UBE3A,MAP4K4,PTPRC 0.32816305
## 4 U_proteasome/ubiquitin cx PIK3R1,IREB2,PTPRC 0.39112517
## 5 U_Immunity/cytoskeleton RPS6,RPS19 0.66488512
## 6 Myeloid lineage 2 RNF157,MTUS1 0.66488512
You can easily get the PCA plots of the samples:
pca_plot(results, contrasts, show.names = F)
You can overlay different information onto PCA plots as well!
# Overlaid information
overlaid.info <- c("B6-18mo", "B6-18mo", "B6-18mo", "B6-18mo", "B6-18mo",
                   "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo",
                   "B6-3mo", "B6-3mo", "B6-3mo", "B6-3mo", "B6-3mo",
                   "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo")
# Sample IDs
sample.names <- c("S01783", "S01780", "S01781", "S01778", "S01779",
                  "S03804", "S03805", "S03806", "S03807", "S03808",
                  "S03809", "S04678", "S04679", "S04680", "S04681",
                  "S04682", "S10918", "S10916", "S10919", "S10921",
                  "S10917", "S10920")
pca_plot(results, overlaid.info, sample.names)
You can also plot the 100 most variable peaks for all samples:
heatmap_plot(results)
You can also set the number of peaks used in these plots and pass additional arguments to the pheatmap function. For more information, check out ?pheatmap.
heatmap_plot(results, heatmap.peak.count = 200, cluster_cols = F)
You can plot your enrichment results using:
dot_plot(results)
## Warning: Removed 54 rows containing missing values (geom_point).
If it gets too crowded, you can get rid of the irrelevant pathways as well:
dot_plot(results, filter.pathways = T)
Note that you can further refine the resolution of the contrasts. For instance, this is also a valid vector:
contrasts <- sapply(strsplit(colnames(bed), split = "-", fixed = T),
                    function(x){paste(x[1:4], collapse = "-")})[4:25]
unique(contrasts)
## [1] "B6-18mo-M-BM" "B6-18mo-F-BM" "NZO-18mo-F-BM" "NZO-18mo-M-BM"
## [5] "B6-3mo-F-BM" "B6-3mo-M-BM" "NZO-3mo-F-BM" "NZO-3mo-M-BM"
In this case, every group will be compared with every other group, which results in 28 different comparisons.
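The count of 28 follows from choosing 2 groups out of 8. A short base R sketch of that arithmetic (the variable names groups and group_pairs are illustrative only):

```r
# The eight unique contrast levels printed above.
groups <- c("B6-18mo-M-BM", "B6-18mo-F-BM", "NZO-18mo-F-BM", "NZO-18mo-M-BM",
            "B6-3mo-F-BM", "B6-3mo-M-BM", "NZO-3mo-F-BM", "NZO-3mo-M-BM")

# Enumerate every unordered pair of groups; each column is one pair.
group_pairs <- combn(groups, 2)

# Number of pairwise comparisons, counted two equivalent ways.
ncol(group_pairs)
choose(length(groups), 2)
```

Both expressions evaluate to 28, so the number of comparisons grows quadratically as the contrasts are split more finely.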
You can run the enrichment analyses with a custom geneset:
cinaR(..., geneset = new_geneset)
geneset must be a .gmt formatted symbol file. You can download different genesets from this site. You can use the read.gmt function from the qusage package to read genesets into your current environment.
Also, you can familiarize yourself with the format by checking out:
# default geneset to be used
data("VP2008")
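For readers unfamiliar with the .gmt layout: each line holds a set name, a description field, and then the gene symbols, all separated by tabs. The sketch below parses two made-up lines with base R into a named list of character vectors. The module names and the toy_geneset object are hypothetical, and whether this exact structure matches what read.gmt returns in your installed version is an assumption worth checking against its documentation.

```r
# Two made-up .gmt lines: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_lines <- c(
  "Toy module A\tNA\tTFEB\tGM2A\tCTSB",
  "Toy module B\tNA\tRPS6\tRPS19"
)

# Split each line on tabs.
fields <- strsplit(gmt_lines, "\t", fixed = TRUE)

# Drop the name and description fields to keep only the gene symbols.
toy_geneset <- lapply(fields, function(x) x[-(1:2)])

# Use the first field of each line as the set name.
names(toy_geneset) <- vapply(fields, `[`, character(1), 1)

str(toy_geneset)
```

In practice you would read a real file rather than build the lines by hand; the point here is only the shape of the data.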
For now, cinaR supports 3 genomes for human and mouse models:
hg38
hg19
mm10
You can set it using the reference.genome argument.
If you suspect your data have unknown batch effects, you can use:
cinaR(..., batch.correction = T)
This option will run Surrogate Variable Analysis (SVA) and try to adjust your data for unknown batch effects. If, however, you already know the batches of the samples, you can simply set the batch.information argument as well:
# batch information should be a numeric vector whose
# length equals the number of samples.
cinaR(..., batch.correction = T, batch.information = c(rep(0, 11), rep(1,11)))
Reminder: our example data has 22 samples.
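If batch membership is recorded as text labels rather than numbers, one way to obtain a numeric vector of the right length is to convert through a factor. This is a minimal sketch: the labels "run1" and "run2" are invented for illustration, and whether cinaR treats integer and double codes identically is an assumption.

```r
# Hypothetical batch labels for 22 samples: first 11 in one run, last 11 in another.
batch_labels <- c(rep("run1", 11), rep("run2", 11))

# Convert labels to 0-based integer codes (factor levels are sorted alphabetically).
batch_info <- as.integer(factor(batch_labels)) - 1L

# The vector must have one entry per sample.
stopifnot(length(batch_info) == 22)

# Same codes as writing c(rep(0, 11), rep(1, 11)) by hand.
all(batch_info == c(rep(0, 11), rep(1, 11)))
```

The resulting batch_info could then be passed as the batch.information argument.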
Sometimes, one might want to add additional covariates to adjust the design matrix further, which affects the down-stream analyses. For instance, ages or sexes of the samples could be additional covariates. To account for those:
# Ages of the samples may not be of biological interest but should be accounted for!
cinaR(..., additional.covariates = c(rep(18, 11), rep(3, 11)))
# More than one covariate, for instance sex and age
sex.info <- c("M", "F", "M", "F", "F", "F", "F", "F", "M", "M", "M",
              "F", "F", "M", "M", "M", "F", "F", "M", "M", "F", "M")
age.info <- c(rep(18, 11), rep(3, 11))
covs <- data.frame(Sex = sex.info, Age = age.info)
cinaR(..., additional.covariates = covs)
Setting save.DA.peaks = TRUE in the cinaR function will create a DApeaks.xlsx file in the current directory. This file includes all the comparisons in different tabs. Additionally, you can set the path/name of the file using the DA.peaks.path argument after setting save.DA.peaks = TRUE.
For instance,
results <- cinaR(..., save.DA.peaks = T, DA.peaks.path = "./Peaks_mice.xlsx")
will create an Excel file named Peaks_mice.xlsx in the current directory.
Currently, cinaR supports 4 different algorithms for differential analysis. If not set, it uses edgeR. You can change the algorithm by setting the DA.choice argument. For more information, see ?cinaR.
# new FDR threshold for DA peaks
results <- cinaR(..., DA.fdr.threshold = 0.1)
# filters out pathways
results <- cinaR(..., enrichment.FDR.cutoff = 0.1)
# does not run enrichment pipeline
results <- cinaR(..., run.enrichment = FALSE)
# creates the pie chart from the ChIPseeker package
results <- cinaR(..., show.annotation.pie = TRUE)
# change cut-off value for dot plots
dot_plot(..., fdr.cutoff = 0.05)
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47.
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.