Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. The original algorithm is detailed in Subramanian, Tamayo, et al., and Java implementations are available through the Broad Institute.
The liger package provides a lightweight R implementation of this enrichment test on a list of values. Given a list of values, such as p-values or log-fold changes derived from differential expression analysis or other analyses comparing biological states, this package lets you test a priori defined gene sets for enrichment, which helps in interpreting genes with highly significant p-values or large fold changes.
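To make the idea behind the test concrete, here is a minimal base-R sketch of a running-sum enrichment score in the spirit of the statistic described by Subramanian, Tamayo, et al. The function `enrichment_score`, its `power` argument, and the toy data are hypothetical and written only for illustration. This is not liger's implementation, which may differ in weighting and other details.

```r
# Hypothetical helper (not part of liger): running-sum enrichment score.
# Walk down the ranked list, stepping up at genes in the set and down otherwise.
enrichment_score <- function(values, geneset, power = 1) {
  ranked <- sort(values, decreasing = TRUE)       # highest values first
  in_set <- names(ranked) %in% geneset            # TRUE where a ranked gene is in the set
  stopifnot(any(in_set), !all(in_set))            # need both hits and misses
  hit_weights <- abs(ranked)^power * in_set       # hits weighted by magnitude of their value
  step_up <- hit_weights / sum(hit_weights)       # increments at hits sum to 1
  step_down <- (!in_set) / sum(!in_set)           # decrements at misses sum to 1
  running <- cumsum(step_up - step_down)          # running-sum statistic
  unname(running[which.max(abs(running))])        # largest deviation from zero, sign kept
}

# Toy data (assumption): 200 genes, with the first 10 shifted upwards
set.seed(1)
toy_vals <- setNames(rnorm(200), paste0("gene", 1:200))
toy_set <- paste0("gene", 1:10)
toy_vals[toy_set] <- toy_vals[toy_set] + 5

es_shifted <- enrichment_score(toy_vals, toy_set)
es_random <- enrichment_score(toy_vals, paste0("gene", 101:110))
```

A score close to 1 means the set's genes cluster at the top of the ranking, and a score close to -1 means they cluster at the bottom. Comparing `es_shifted` with `es_random` shows how the score responds to a shifted set versus an arbitrary one.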
Consider a simulated example dataset.
library(liger)
# load gene set
data("org.Hs.GO2Symbol.list")
# get universe
universe <- unique(unlist(org.Hs.GO2Symbol.list))
# get a gene set
gs <- org.Hs.GO2Symbol.list[[1]]
# fake dummy example where everything in gene set is perfectly enriched
vals <- rnorm(length(universe), 0, 10)
names(vals) <- universe
vals[gs] <- rnorm(length(gs), 100, 10)
head(vals) # look at vals
##      AKT3   C10orf2      DNA2      LIG3     MEF2A     MGME1
## 113.35871 114.92099 110.77896  99.63533  88.81057  99.27040
Here, vals can be seen as representing a list of log-fold changes derived from differential expression analysis on samples in two biological states. We want to interpret the set of differentially expressed genes with high positive fold changes using gene set enrichment analysis.
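For real data, a named vector like this would come from an actual comparison between samples. The following base-R sketch shows one simple way such a vector could be built, as a difference of group means on the log2 scale. The expression matrix, sample labels, and object names (`expr`, `state`, `lfc`) are all simulated or hypothetical, and a dedicated differential expression package would normally be used instead.

```r
# Simulated log2 expression matrix (assumption): 50 genes x 6 samples, 3 per state
set.seed(42)
genes <- paste0("gene", 1:50)
expr <- matrix(rnorm(50 * 6, mean = 8, sd = 1), nrow = 50,
               dimnames = list(genes, paste0("s", 1:6)))
state <- factor(rep(c("A", "B"), each = 3))

# log2 fold change of state B relative to state A:
# on the log2 scale this is the difference of the group means
lfc <- rowMeans(expr[, state == "B"]) - rowMeans(expr[, state == "A"])

# lfc is a numeric vector named by gene, the same shape as vals
head(sort(lfc, decreasing = TRUE))
```

The important property is that the vector's names are gene identifiers of the same type as those in the gene sets, since membership is matched by name.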
To test for enrichment of a particular gene set:
names(org.Hs.GO2Symbol.list)[[1]]
## [1] "GO:0000002"
gs # look at gs
## [1] "AKT3" "C10orf2" "DNA2" "LIG3" "MEF2A" "MGME1"
## [7] "MPV17" "OPA1" "PID1" "PRIMPOL" "SLC25A33" "SLC25A36"
## [13] "SLC25A4" "STOML2" "TYMP"
gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=500)
## [1] 0.002
In this simulation, we created vals such that gs was obviously enriched. And indeed, we see that this gene set exhibits significant enrichment.
Now to test for enrichment of another gene set:
gs.new <- org.Hs.GO2Symbol.list[[2]]
names(org.Hs.GO2Symbol.list)[[2]]
## [1] "GO:0000003"
head(gs.new) # look at gs.new
## [1] "ACE" "ACR" "ADAM2" "ADAM20" "ADAM21" "ADAM28"
gsea(values=vals, geneset=gs.new, mc.cores=1, n.rand=500)
## [1] 0.29
In this simulation, we created vals such that gs.new was obviously not enriched. And indeed, we see that this gene set does not exhibit significant enrichment.
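The n.rand argument in the calls above presumably sets the number of randomizations used to assess significance. As a rough illustration of how an empirical p-value can be obtained this way, the sketch below compares a gene set's mean value against the means of randomly drawn sets of the same size. The helper `perm_pvalue`, the choice of the mean as the statistic, and the (k + 1) / (n + 1) estimator are assumptions made for this example. liger's actual statistic and null model may differ.

```r
# Hypothetical helper (not part of liger): one-sided empirical p-value
# for "the gene set has unusually high values" using random gene sets.
perm_pvalue <- function(values, geneset, n.rand = 500) {
  in_set <- names(values) %in% geneset
  observed <- mean(values[in_set])
  # null distribution: mean value of random gene sets of the same size
  null <- replicate(n.rand, mean(sample(values, sum(in_set))))
  # add 1 to numerator and denominator so the estimate is never exactly 0
  (sum(null >= observed) + 1) / (n.rand + 1)
}

# Toy data (assumption) mirroring the simulation above on a smaller universe
set.seed(1)
toy_vals <- setNames(rnorm(1000, 0, 10), paste0("gene", 1:1000))
enriched <- paste0("gene", 1:15)
toy_vals[enriched] <- rnorm(15, 100, 10)

p_enriched <- perm_pvalue(toy_vals, enriched)
p_random <- perm_pvalue(toy_vals, paste0("gene", 501:515))
```

With this estimator, the smallest attainable p-value is 1 / (n.rand + 1), so a larger n.rand is needed to resolve very small p-values.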
If we simulate a more ambiguous case:
# add some noise
vals[sample(1:length(universe), 1000)] <- rnorm(1000, 100, 10)
# test previously perfectly enriched gene set again
gs <- org.Hs.GO2Symbol.list[[1]]
gsea(values=vals, geneset=gs, mc.cores=1, n.rand=500)
## [1] 0.038
As expected, the added noise weakens the enrichment signal: the p-value for this gene set rises from 0.002 to 0.038.
We can also test a number of gene sets:
bulk.gsea(values=vals, set.list=org.Hs.GO2Symbol.list[1:5], mc.cores=1, n.rand=500)
##                  p.val q.val     sscore      edge
## GO:0000002 0.003992016 0.002  1.9753755 77.613048
## GO:0000003 0.305389222 0.396 -0.7174219 96.462827
## GO:0000012 0.001996008 0.058 -1.4222089  9.559664
## GO:0000014 0.089820359 0.164 -1.0359759 20.365777
To save on computation time, we can also assess significance iteratively:
iterative.bulk.gsea(values=vals, set.list=org.Hs.GO2Symbol.list[1:5], mc.cores=1, n.rand=500)
## initial: [5e+02 - 2] done
##                  p.val       q.val     sscore      edge
## GO:0000002 0.003992016 0.007984032  1.9753755 77.613048
## GO:0000003 0.305389222 0.305389222 -0.7174219 96.462827
## GO:0000012 0.001996008 0.007984032 -1.4222089  9.559664
## GO:0000014 0.089820359 0.119760479 -1.0359759 20.365777
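The q.val column in the iterative output above is numerically consistent with a Benjamini-Hochberg adjustment of the p.val column across the four reported gene sets. This is an observation from the printed numbers, not a statement about liger's internals. The check below uses only base R's `p.adjust`, with the p-values copied from the output.

```r
# p-values copied from the iterative.bulk.gsea output above
p <- c("GO:0000002" = 0.003992016,
       "GO:0000003" = 0.305389222,
       "GO:0000012" = 0.001996008,
       "GO:0000014" = 0.089820359)

# Benjamini-Hochberg adjustment across the four tests
q <- p.adjust(p, method = "BH")

# compare with the q.val column printed above
printed_q <- c(0.007984032, 0.305389222, 0.007984032, 0.119760479)
all.equal(unname(q), printed_q, tolerance = 1e-6)
```

Because the adjustment depends on how many tests are included, q-values from runs on different numbers of gene sets are not directly comparable.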
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X Yosemite 10.10.5
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] liger_0.1
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 parallel_3.3.3 tools_3.3.3
## [4] Rcpp_0.12.14 stringi_1.1.6 highr_0.6
## [7] knitr_1.17 stringr_1.2.0 matrixStats_0.52.2
## [10] evaluate_0.10.1