Hai Fang, Bogdan Knezevic, Katie L Burnham, Julian C Knight
Wellcome Trust Centre for Human Genetics, University of Oxford, UK
We introduce an R package called XGR. This package is designed to make a user-defined gene or SNP list more interpretable by comprehensively utilising ontology and network information to reveal relationships and enhance opportunities for biological discovery. XGR is unique in supporting a broad range of ontologies (including knowledge of biological and molecular functions, pathways, diseases and phenotypes, in both human and mouse) and different types of networks (including functional, physical and pathway interactions). After going through this user manual (particularly the Applications section, which includes demos with published data), you will be able to: 1) perform enrichment analysis using either built-in or custom ontologies, 2) calculate semantic similarity between genes (or between SNPs) based on their ontology annotation profiles, and 3) identify a gene subnetwork given your query list of (significant) genes or SNPs. End-users who are unfamiliar with R can instead use our user-friendly web app.
We assume that R, a language and environment for statistical computing and graphics, has been installed. Installing the XGR package itself (now hosted on GitHub) takes two steps.
First, install the dependent packages:
source("http://bioconductor.org/biocLite.R")
biocLite(c("devtools","dnet","RCircos","GenomicRanges","ggplot2","ggbio"), siteRepos=c("http://cran.r-project.org"))
Second, install the XGR package:
library(devtools)
install_github(c("hfang-bristol/XGR"), dependencies=T)
The functions in the package XGR are categorised into five groups according to the tasks they complete. They are summarised below.
Enrichment functions perform enrichment analysis using one of several statistical tests (Fisher’s exact test, the hypergeometric test or the binomial test). The test estimates the significance of the overlap between, for example, an input group of genes and the group of genes annotated by an ontology term. By default, all annotatable genes are used as the test background, but the user can specify a different background. If the ontology terms are organised in a tree-like structure, this structure can also be taken into account.
xEnricherGenes: conducts gene-based enrichment analysis given a list of genes and the ontology in query. It supports two types of ontologies: 1) structured ontologies including Gene Ontology (Ashburner et al. 2000), Disease Ontology (Schriml et al. 2012), and Phenotype Ontologies in human and mouse (Köhler et al. 2013; C. L. Smith and Eppig 2009), and 2) non-structured ontologies/categories; for example, a collection of pathways, gene expression signatures, transcription factor targets, and gene druggable categories.
xEnricherSNPs: conducts SNP-based enrichment analysis using GWAS Catalog traits mapped to Experimental Factor Ontology (Welter et al. 2014). Additional SNPs that are in linkage disequilibrium (LD) with the input SNPs can also be included in the enrichment analysis.
xEnricherYours: conducts custom enrichment analysis given an entity file and an annotation file.
xEnrichViewer: views enrichment results as a data frame, which is also convenient for saving the results to a file.
xEnricher: acts as a template for enrichment analysis. It is an internal function upon which the high-level functions (ie xEnricherGenes, xEnricherSNPs and xEnricherYours) rely.
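The overlap test behind the enrichment functions can be illustrated in base R. This is a minimal sketch with made-up counts, not the package's own implementation; the variable names are hypothetical, and the z-score shown is the usual standardised overlap under the hypergeometric null, which may differ in detail from what XGR reports.

```r
# Toy counts (hypothetical, not taken from any XGR dataset)
n_background <- 1000  # genes in the test background
n_annotated  <- 50    # background genes annotated to the ontology term
n_input      <- 100   # genes in the input list
n_overlap    <- 15    # input genes that are annotated to the term

# Hypergeometric test: probability of an overlap of at least n_overlap
p_hyper <- phyper(n_overlap - 1, n_annotated, n_background - n_annotated,
                  n_input, lower.tail = FALSE)

# One-sided Fisher's exact test on the equivalent 2x2 table
# rows: in input / not in input; columns: annotated / not annotated
tab <- matrix(c(n_overlap,               n_input - n_overlap,
                n_annotated - n_overlap,
                n_background - n_annotated - n_input + n_overlap),
              nrow = 2, byrow = TRUE)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Z-score of the observed overlap under the hypergeometric null
expected <- n_input * n_annotated / n_background
variance <- expected * (1 - n_annotated / n_background) *
  (n_background - n_input) / (n_background - 1)
zscore <- (n_overlap - expected) / sqrt(variance)
```

In this setting the one-sided Fisher's exact test and the hypergeometric upper tail give the same p-value, which is why the two are often mentioned together.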
Similarity functions conduct similarity analysis by calculating semantic similarity, a type of comparison that assesses the degree of relatedness between two entities (eg genes) based on their annotation profiles (by ontology terms). To do so, the information content (IC) of a term is first defined to measure how informative the term is when used for annotating genes: -log10(frequency of genes annotated to this term). Similarity between two terms is then measured based on IC, usually at the most informative common ancestor (MICA). Finally, similarity between two entities (eg genes) is derived from pairwise term similarity using best-matching based methods: average, maximum, and complete.
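The three steps of semantic similarity (IC, term similarity at the MICA, and best-matching gene similarity) can be traced on a toy ontology. The sketch below is illustrative only: the ontology, the gene annotations and the helper names `term_sim` and `gene_sim` are hypothetical, and the best-matching average shown is one common formulation that may not match the package's definition exactly.

```r
# Toy ontology (hypothetical): each term mapped to itself plus all its ancestors
ancestors <- list(
  root = "root",
  A    = c("A", "root"),
  B    = c("B", "root"),
  A1   = c("A1", "A", "root"),
  A2   = c("A2", "A", "root")
)
# Toy direct annotations of genes to terms (hypothetical)
direct <- list(g1 = "A1", g2 = c("A2", "B"), g3 = "B", g4 = c("A1", "A2"))

# Step 1: propagate annotations to ancestors, then compute IC per term
propagated <- lapply(direct, function(terms) unique(unlist(ancestors[terms])))
n_genes <- sapply(names(ancestors), function(term) {
  sum(sapply(propagated, function(ts) term %in% ts))
})
ic <- -log10(n_genes / length(direct))

# Step 2: term similarity as the IC of the most informative common ancestor
term_sim <- function(t1, t2) {
  common <- intersect(ancestors[[t1]], ancestors[[t2]])
  max(ic[common])
}

# Step 3: gene similarity by best-matching average over pairwise term similarity
gene_sim <- function(g1, g2) {
  m <- outer(direct[[g1]], direct[[g2]], Vectorize(term_sim))
  (mean(apply(m, 1, max)) + mean(apply(m, 2, max))) / 2
}

sim_g1_g4 <- gene_sim("g1", "g4")  # share the specific term A1
sim_g1_g3 <- gene_sim("g1", "g3")  # share only the root
```

Genes that share only the root get a similarity of zero, because a term annotating every gene carries no information.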
Network functions identify a gene subnetwork from a gene interaction network using node/gene significance information. This information can be provided either directly (eg user-defined genes with a significance level such as p-values or FDR) or indirectly (eg genes near user-defined SNPs with a significance level such as GWAS-reported p-values). The algorithm that searches a gene interaction network, with nodes labelled with this information, for a maximum-scoring gene subnetwork has been reported in our previous publication (Fang and Gough 2014). In brief: 1) score transformation, that is, given a threshold of tolerable p-value, nodes with p-values below this threshold (nodes of interest) are scored positively and nodes with p-values above it (intolerable) are scored negatively; 2) subnetwork identification, that is, finding an interconnected gene subnetwork enriched with positive-score nodes while allowing a few negative-score nodes as linkers; and 3) controlling the subnetwork size, that is, an iterative procedure that fine-tunes the tolerable threshold to identify a gene subnetwork with a desired number of nodes.
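The score transformation in step 1, and the way the threshold drives subnetwork size in step 3, can be illustrated with a deliberately simplified scoring rule. The log-ratio below is an assumption chosen only because it is positive exactly when a p-value falls below the threshold; it is not claimed to be the formula used in the cited publication, and the gene names and p-values are made up.

```r
# Toy node p-values (hypothetical gene names and values)
pval <- c(geneA = 1e-8, geneB = 1e-3, geneC = 0.04, geneD = 0.5)

# Simplified score transformation (an assumption, for illustration only):
# positive when pval < threshold, negative when pval > threshold
score_nodes <- function(pval, threshold) log10(threshold / pval)

scores <- score_nodes(pval, threshold = 0.01)

# Tightening the threshold leaves fewer positive-score nodes, which is the
# lever an iterative procedure can use to reach a desired subnetwork size
thresholds <- c(1e-1, 1e-2, 1e-5)
n_positive <- sapply(thresholds, function(th) sum(score_nodes(pval, th) > 0))
```

The subnetwork search itself (step 2) is not sketched here; it operates on these node scores together with the network edges.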
xSubneterGenes: takes as input a list of user-defined genes with their significance level (p-values), superposes these genes onto a gene interaction network, and outputs a maximum-scoring gene subnetwork that contains as many of the most significant (highly scored) genes as possible, together with a few less significant (lowly scored) genes as linkers.
xSubneterSNPs: identifies a gene subnetwork that is likely modulated by input SNPs and/or their linkage disequilibrium (LD) SNPs, in two major steps. The first step is to define and score nearby genes that are located within a distance window of the input and/or LD SNPs. The second step is to use xSubneterGenes to identify a maximum-scoring gene subnetwork.
Infrastructure functions provide the underlying support on which the other functions depend, including built-in data loading, ontology annotation propagation, calculation of term-term semantic similarity, and graph conversion and visualisation.
xRDataLoader: serves as a hub for loading built-in data about genes, SNPs, ontologies and annotations.
xDAGanno: propagates annotations to the ontology root according to the true-path rule.
xDAGsim: calculates semantic similarity between terms, and returns a network with nodes for terms and edges for pair-wise semantic similarity between them.
xConverter: converts an object between graph classes.
xCircos: visualises the semantic similarity between genes (or SNPs) by the colour of links in a circos plot.
xVisNet: visualises the graph in different layouts.
Auxiliary functions provide supplementary support during package development, such as code debugging and documentation creation.
An essential step of data analysis is making sense of a gene (or SNP) list in a biologically meaningful way. Genes (and/or SNPs) may be identified from differential expression analysis, eQTL mapping and GWAS. In this section, we showcase the applications using several published datasets. Users are encouraged to adapt the provided code to analyse their own datasets.
First of all, load the package XGR:
library(XGR)
# the following packages are needed for visualisation
library(GenomicRanges)
library(RCircos)
The first dataset we use is based on expression data in monocytes (Fairfax et al. 2014). This dataset, JKscience_TS1A, involves 228 individuals with expression data under four conditions: in the naive state (Naive), after 2-hour LPS (LPS2), after 24-hour LPS (LPS24), and after 24-hour exposure to interferon gamma (IFN). Differential expression analysis was performed using the limma package to identify genes that are differentially expressed between two conditions.
Extract genes that are significantly induced by interferon gamma as compared to the naive state
# Load differential expression analysis results
res <- xRDataLoader(RData.customised='JKscience_TS1A')
# Create a data frame for genes significantly induced by IFN
flag <- res$logFC_INF24_Naive < 0 & res$fdr_INF24_Naive < 0.01
df <- res[flag, c('Symbol','logFC_INF24_Naive','fdr_INF24_Naive')]
The first 5 rows of the data frame df are shown below. The column logFC_INF24_Naive gives the log2-transformed fold change, that is, naive expression divided by IFN expression (so induced genes have negative values).
Symbol | logFC_INF24_Naive | fdr_INF24_Naive |
---|---|---|
RERE | -1.24 | 8.41e-187 |
PAPD4 | -0.06 | 6.13e-03 |
F3 | -0.17 | 6.12e-05 |
LIN52 | -0.04 | 8.08e-03 |
CD558651 | -0.05 | 9.95e-05 |
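The sign convention is easy to get backwards, so here is a small self-contained sketch of the same filter on made-up values. The column names follow the example above; the gene symbols, the numbers and the `fold_induction` column are hypothetical.

```r
# Toy results table (hypothetical values; column names follow the example above)
toy <- data.frame(
  Symbol            = c("G1", "G2", "G3", "G4"),
  logFC_INF24_Naive = c(-1.5, 0.8, -0.3, -2.0),
  fdr_INF24_Naive   = c(1e-10, 1e-5, 0.2, 0.005),
  stringsAsFactors  = FALSE
)

# A negative log2(naive / IFN) means higher expression under IFN, i.e. induced
induced <- toy$logFC_INF24_Naive < 0 & toy$fdr_INF24_Naive < 0.01
toy_df <- toy[induced, ]

# Fold induction on the natural scale: IFN expression divided by naive expression
toy_df$fold_induction <- 2^(-toy_df$logFC_INF24_Naive)
```

Here G2 is dropped because it is repressed (positive value) and G3 because it does not pass the FDR cut-off.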
Enrichment analysis at the gene level can be done by choosing one of the ontologies currently supported:
Code | Ontology | Category |
---|---|---|
DO | Disease Ontology | Disease |
GOMF | Gene Ontology Molecular Function | Function |
GOBP | Gene Ontology Biological Process | Function |
GOCC | Gene Ontology Cellular Component | Function |
HPPA | Human Phenotype Phenotypic Abnormality | Phenotype |
HPMI | Human Phenotype Mode of Inheritance | Phenotype |
HPCM | Human Phenotype Clinical Modifier | Phenotype |
HPMA | Human Phenotype Mortality Aging | Phenotype |
MP | Mammalian/Mouse Phenotype | Phenotype |
DGIdb | DGI druggable gene categories | Druggable |
SF | SCOP domain superfamilies | Domain |
PS2 | phylostratific age information (our ancestors) | Evolution |
MsigdbH | Hallmark gene sets | MsigDB |
MsigdbC1 | Chromosome and cytogenetic band positional gene sets | MsigDB |
MsigdbC2CGP | Chemical and genetic perturbation gene sets | MsigDB |
MsigdbC2CPall | All pathway gene sets | MsigDB |
MsigdbC2CP | Canonical pathway gene sets | MsigDB |
MsigdbC2KEGG | KEGG pathway gene sets | MsigDB |
MsigdbC2REACTOME | Reactome pathway gene sets | MsigDB |
MsigdbC2BIOCARTA | BioCarta pathway gene sets | MsigDB |
MsigdbC3TFT | Transcription factor target gene sets | MsigDB |
MsigdbC3MIR | microRNA target gene sets | MsigDB |
MsigdbC4CGN | Cancer gene neighborhood gene sets | MsigDB |
MsigdbC4CM | Cancer module gene sets | MsigDB |
MsigdbC5BP | GO biological process gene sets | MsigDB |
MsigdbC5MF | GO molecular function gene sets | MsigDB |
MsigdbC5CC | GO cellular component gene sets | MsigDB |
MsigdbC6 | Oncogenic signature gene sets | MsigDB |
MsigdbC7 | Immunologic signature gene sets | MsigDB |
Optionally, the test background can be provided by the user; by default, all annotatable genes are used. In this example, all genes tested in the differential expression analysis are used as the test background.
background <- res$Symbol
Using a collection of canonical pathways
data <- df$Symbol
eTerm <- xEnricherGenes(data=data, background=background, ontology="MsigdbC2CPall")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'Pathway_enrichments.txt'
res_PW <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_PW), res_PW)
write.table(output, file="Pathway_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 10 significant pathways are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
Immune System | 639 | 341 | 6.80 | 5.0e-12 | 2.9e-09 |
Interferon gamma signaling | 47 | 41 | 6.47 | 5.0e-12 | 2.9e-09 |
Interferon Signaling | 120 | 84 | 6.53 | 2.2e-11 | 7.9e-09 |
Interferon alpha/beta signaling | 45 | 38 | 5.95 | 2.6e-10 | 6.9e-08 |
SCF(Skp2)-mediated degradation of p27/p21 | 41 | 34 | 5.47 | 5.4e-09 | 1.1e-06 |
Vif-mediated degradation of APOBEC3G | 35 | 30 | 5.39 | 6.3e-09 | 1.1e-06 |
p53-Independent G1/S DNA damage checkpoint | 34 | 29 | 5.26 | 1.4e-08 | 2.1e-06 |
Cytokine Signaling in Immune system | 207 | 123 | 5.50 | 1.8e-08 | 2.3e-06 |
ER-Phagosome pathway | 39 | 32 | 5.23 | 2.4e-08 | 2.8e-06 |
Proteasome | 33 | 28 | 5.13 | 3.0e-08 | 2.9e-06 |
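The adjp column reports p-values adjusted for multiple testing across all terms tested. The sketch below assumes a Benjamini-Hochberg adjustment, which is an assumption made here for illustration rather than something this manual states; the term names and p-values are made up.

```r
# Toy nominal p-values for five terms (hypothetical)
pvalue <- c(termA = 1e-6, termB = 2e-4, termC = 0.01, termD = 0.03, termE = 0.5)

# Benjamini-Hochberg adjustment (assumed method, see the note above)
adjp <- p.adjust(pvalue, method = "BH")

# Terms that remain significant at an adjusted threshold of 0.05
significant_terms <- names(adjp)[adjp < 0.05]
```

Because the adjustment depends on how many terms were tested, adjusted values from different ontologies are not directly comparable.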
Using Disease Ontology (DO)
data <- df$Symbol
eTerm <- xEnricherGenes(data=data, background=background, ontology="DO")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'DO_enrichments.txt'
res_DO <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=T)
output <- data.frame(term=rownames(res_DO), res_DO)
write.table(output, file="DO_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 5 significant terms are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
viral infectious disease | 528 | 270 | 5.35 | 0.0e+00 | 3.1e-05 |
disease by infectious agent | 863 | 408 | 4.59 | 2.1e-06 | 7.3e-04 |
influenza | 77 | 50 | 4.41 | 3.7e-06 | 8.8e-04 |
autoimmune disease of endocrine system | 56 | 36 | 3.65 | 8.9e-05 | 1.6e-02 |
measles | 26 | 19 | 3.39 | 1.7e-04 | 2.4e-02 |
Using Mammalian Phenotype (MP)
data <- df$Symbol
eTerm <- xEnricherGenes(data=data, background=background, ontology="MP")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'MP_enrichments.txt'
res_MP <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_MP), res_MP)
write.table(output, file="MP_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 10 significant terms are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
increased susceptibility to infection | 324 | 172 | 4.53 | 2.7e-06 | 0.0071 |
altered susceptibility to infection | 386 | 200 | 4.45 | 4.0e-06 | 0.0071 |
abnormal response to infection | 417 | 213 | 4.32 | 7.2e-06 | 0.0085 |
abnormal antigen presentation | 54 | 37 | 4.11 | 1.2e-05 | 0.0110 |
abnormal dendritic cell antigen presentation | 30 | 23 | 3.96 | 1.5e-05 | 0.0110 |
increased mature B cell number | 113 | 67 | 3.97 | 2.9e-05 | 0.0170 |
altered susceptibility to viral infection | 116 | 68 | 3.87 | 4.3e-05 | 0.0220 |
decreased B cell proliferation | 93 | 56 | 3.77 | 6.1e-05 | 0.0240 |
abnormal pilomotor reflex | 11 | 10 | 3.36 | 5.6e-05 | 0.0240 |
abnormal immunoglobulin level | 351 | 177 | 3.67 | 1.1e-04 | 0.0360 |
Network analysis at the gene level requires choosing a pre-defined gene network as the whole network (called whole-network). Two sources of whole-network information are supported: the STRING database (Szklarczyk et al. 2015) and the Pathway Commons database (Cerami et al. 2011). STRING is a meta-integration of undirected interactions from a functional aspect, while Pathway Commons mainly contains both undirected and directed interactions from a physical/pathway aspect. Both provide a measure of confidence in each interaction. The user can therefore choose interactions of varying quality as well as of different types:
Code | Interaction | Database |
---|---|---|
STRING_highest | Functional interactions (with highest confidence scores>=900) | STRING |
STRING_high | Functional interactions (with high confidence scores>=700) | STRING |
STRING_medium | Functional interactions (with medium confidence scores>=400) | STRING |
PCommonsUN_high | Physical/undirected interactions (with references & >=2 sources) | Pathway Commons |
PCommonsUN_medium | Physical/undirected interactions (with references & >=1 sources) | Pathway Commons |
PCommonsDN_high | Pathway/directed interactions (with references & >=2 sources) | Pathway Commons |
PCommonsDN_medium | Pathway/directed interactions (with references & >=1 sources) | Pathway Commons |
For the pathway-merged directed interactions, the user can also choose a network from an individual source:
Code | Interaction | Database |
---|---|---|
PCommonsDN_Reactome | Pathway/directed interactions (only from Reactome) | Pathway Commons |
PCommonsDN_KEGG | Pathway/directed interactions (only from KEGG) | Pathway Commons |
PCommonsDN_HumanCyc | Pathway/directed interactions (only from HumanCyc) | Pathway Commons |
PCommonsDN_PID | Pathway/directed interactions (only from PID) | Pathway Commons |
PCommonsDN_PANTHER | Pathway/directed interactions (only from PANTHER) | Pathway Commons |
PCommonsDN_ReconX | Pathway/directed interactions (only from ReconX) | Pathway Commons |
PCommonsDN_PhosphoSite | Pathway/directed interactions (only from PhosphoSite) | Pathway Commons |
PCommonsDN_CTD | Pathway/directed interactions (only from CTD) | Pathway Commons |
In this subsection, we demonstrate how to identify a gene network from a pre-defined whole-network, given an input list of genes with significance information; in this case, the gene network induced by interferon gamma (IFN).
# find maximum-scoring gene subnetwork with the desired node number=75
data <- df[,c("Symbol","fdr_INF24_Naive")]
subnet <- xSubneterGenes(data=data, network="STRING_high", subnet.size=75)
The identified gene network with nodes colored according to FDR is shown below:
pattern <- -log10(as.numeric(V(subnet)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet, pattern=pattern, glayout=layout_(subnet, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)
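The four lines that prepare `pattern`, `vmin` and `vmax` recur in every network plot in this manual, so it may help to see what they do on a small vector. The sketch below uses made-up significance values; reading the 75% quantile as a cap that stops a few extreme values from dominating the colour scale is an interpretation, not something the manual states, and the `clipped` vector is added only to show that effect.

```r
# Toy significance values (hypothetical), including a zero that would give Inf
significance <- c(0, 1e-12, 1e-5, 0.003, 0.02, 0.05)

# Convert to -log10 so that larger values mean more significant
pattern <- -log10(significance)

# Replace Inf (from a significance of exactly 0) with the largest finite value
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])

# Colour-scale limits: upper limit from the 75% quantile, lower from the minimum
vmax <- ceiling(stats::quantile(pattern, 0.75, names = FALSE))
vmin <- floor(min(pattern))

# Values above vmax would all share the top colour of the scale
clipped <- pmin(pattern, vmax)
```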
The second dataset we use is based on an immune-stimulated eQTL mapping study in monocytes (Fairfax et al. 2014). In this study, eQTLs were defined as SNPs showing significant association with gene expression under four conditions: in the naive state (Naive), after 2-hour LPS (LPS2), after 24-hour LPS (LPS24), and after exposure to interferon gamma (IFN). An eQTL is associated with gene expression in either a cis- or a trans-acting manner, and is accordingly called a cis-eQTL or a trans-eQTL. Genes whose expression is modulated by eQTLs are called eGenes. The significant cis-eQTLs are stored in JKscience_TS2A.
Extract cis-eQTLs induced after 24-hour exposure to interferon gamma
# Load cis-eQTL mapping results
cis <- xRDataLoader(RData.customised='JKscience_TS2A')
# Create a data frame for cis-eQTLs significantly induced by IFN
ind <- which(cis$IFN_t > 0 & cis$IFN_fdr < 0.05)
df_cis <- cis[ind, c('variant','Symbol','IFN_t','IFN_fdr')]
The first 5 rows of the data frame df_cis are shown below:
variant | Symbol | IFN_t | IFN_fdr |
---|---|---|---|
rs10002954 | COQ2 | 6.961246 | 1.870000e-08 |
rs10005233 | SNCA | 7.789319 | 1.960000e-10 |
rs10005233 | SNCA | 4.501832 | 1.513695e-03 |
rs10006380 | NUP54 | 4.101834 | 6.318482e-03 |
rs10006851 | UBA6 | 5.132774 | 1.215260e-04 |
Only for input SNPs reported in GWAS Catalog traits mapped to Experimental Factor Ontology (EFO)
data <- df_cis$variant
eTerm <- xEnricherSNPs(data=data, ontology="EF")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'EF_enrichments.txt'
res_EF <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=T)
output <- data.frame(term=rownames(res_EF), res_EF)
write.table(output, file="EF_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 10 terms are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
Vitiligo | 38 | 4 | 7.75 | 3.7e-06 | 0.00043 |
ulcerative colitis | 130 | 6 | 5.79 | 1.7e-05 | 0.00100 |
autoimmune disease | 1233 | 18 | 3.85 | 2.2e-04 | 0.00850 |
lipoprotein measurement | 1313 | 18 | 3.55 | 4.9e-04 | 0.01300 |
immune system disease | 1545 | 20 | 3.48 | 5.5e-04 | 0.01300 |
coronary heart disease | 123 | 4 | 3.71 | 1.1e-03 | 0.02000 |
inflammatory bowel disease | 420 | 8 | 3.36 | 1.3e-03 | 0.02100 |
multiple sclerosis | 211 | 5 | 3.23 | 2.1e-03 | 0.02400 |
lipid or lipoprotein measurement | 1583 | 19 | 3.03 | 1.9e-03 | 0.02400 |
lipid measurement | 1485 | 18 | 2.99 | 2.1e-03 | 0.02400 |
For input SNPs plus their LD SNPs (based on European populations) reported in GWAS Catalog traits mapped to EFO
data <- df_cis$variant
eTerm <- xEnricherSNPs(data=data, ontology="EF", include.LD="EUR", LD.r2=0.8)
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'EF_LD_enrichments.txt'
res_EF_LD <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=T)
output <- data.frame(term=rownames(res_EF_LD), res_EF_LD)
write.table(output, file="EF_LD_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 10 significant terms are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
inflammatory bowel disease | 420 | 40 | 8.21 | 1.0e-11 | 1.7e-09 |
immune system disease | 1545 | 91 | 7.35 | 1.6e-11 | 1.7e-09 |
Parkinson’s disease | 119 | 20 | 9.09 | 2.7e-11 | 1.9e-09 |
autoimmune disease | 1233 | 76 | 7.10 | 1.1e-10 | 5.6e-09 |
ulcerative colitis | 130 | 17 | 6.96 | 3.8e-08 | 1.6e-06 |
digestive system disease | 868 | 53 | 5.79 | 9.6e-08 | 3.4e-06 |
psoriasis | 91 | 12 | 5.88 | 2.1e-06 | 6.3e-05 |
blood metabolite measurement | 221 | 19 | 5.10 | 6.9e-06 | 1.8e-04 |
lentiform nucleus measurement | 14 | 4 | 5.74 | 3.2e-05 | 7.5e-04 |
lipoprotein-associated phospholipase A(2) measurement | 26 | 5 | 4.98 | 8.0e-05 | 1.7e-03 |
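Conceptually, including LD SNPs means expanding the input list with every SNP whose r2 with an input SNP reaches the chosen threshold before the enrichment test is run. The following sketch shows that expansion on a made-up LD table; the SNP identifiers, the r2 values and the helper `expand_with_ld` are hypothetical and do not reflect how the package stores LD data.

```r
# Toy LD table (hypothetical SNP identifiers and r2 values)
ld <- data.frame(
  lead   = c("rs1", "rs1", "rs2", "rs3"),
  ld_snp = c("rs10", "rs11", "rs20", "rs30"),
  r2     = c(0.95, 0.60, 0.85, 0.79),
  stringsAsFactors = FALSE
)
input_snps <- c("rs1", "rs2")

# Add every SNP in LD (r2 >= r2_min) with at least one input SNP
expand_with_ld <- function(snps, ld, r2_min = 0.8) {
  partners <- ld$ld_snp[ld$lead %in% snps & ld$r2 >= r2_min]
  union(snps, partners)
}

expanded <- expand_with_ld(input_snps, ld, r2_min = 0.8)
```

A larger expanded list increases the chance of overlapping GWAS Catalog SNPs, which is consistent with the higher nOverlap values in the table above compared with the input-only analysis.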
Only for input SNPs
data <- df_cis$variant
ig_SNP <- xSocialiserSNPs(data=data, ontology="EF")
# save similarity results to the file 'EF_similarity.txt'
output <- igraph::get.data.frame(ig_SNP, what="edges")
write.table(output, file="EF_similarity.txt", sep="\t", row.names=F)
Circos plot of the most similar edges is shown below:
xCircos(g=ig_SNP, entity="SNP", verbose=F)
For input SNPs plus their LD SNPs
data <- df_cis$variant
ig_SNP_LD <- xSocialiserSNPs(data=data, ontology="EF", include.LD="EUR", LD.r2=0.8)
# save similarity results to the file 'EF_LD_similarity.txt'
output <- igraph::get.data.frame(ig_SNP_LD, what="edges")
write.table(output, file="EF_LD_similarity.txt", sep="\t", row.names=F)
Circos plot of the most similar edges is shown below:
xCircos(g=ig_SNP_LD, entity="SNP", verbose=F)
In this section, we demonstrate how to identify a gene network from a pre-defined whole-network (see above), given an input list of SNPs with significance information; in this case, the gene network likely modulated by IFN-induced cis-eQTLs.
# find maximum-scoring gene subnetwork with the desired node number=60
data <- df_cis[,c("variant","IFN_fdr")]
subnet_SNP <- xSubneterSNPs(data=data, network="STRING_high", distance.max=200000, seed.genes=T, subnet.significance=1e-1, subnet.size=60)
The identified gene network with nodes colored according to scores is shown below:
pattern <- -log10(as.numeric(V(subnet_SNP)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet_SNP, pattern=pattern, glayout=layout_(subnet_SNP, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)
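The first step of the SNP-based analysis, defining and scoring nearby genes, can be pictured with a toy example on a single chromosome. The sketch below is a simplification under stated assumptions: each gene within the distance window of at least one SNP inherits the smallest p-value among those SNPs, which is an illustrative rule rather than the package's documented scoring. All names, positions and p-values are made up; `distance.max` mirrors the argument name used above.

```r
# Toy SNPs and genes on one chromosome (hypothetical names, positions, p-values)
snps <- data.frame(snp  = c("rs1", "rs2", "rs3"),
                   pos  = c(100000, 450000, 900000),
                   pval = c(1e-6, 1e-3, 1e-9),
                   stringsAsFactors = FALSE)
genes <- data.frame(gene  = c("G1", "G2", "G3"),
                    start = c(150000, 400000, 1500000),
                    end   = c(200000, 420000, 1600000),
                    stringsAsFactors = FALSE)

# Distance from a position to a gene body (0 if the position lies inside it)
dist_to_gene <- function(pos, start, end) pmax(start - pos, pos - end, 0)

# Score each gene by the most significant SNP within the distance window
score_nearby_genes <- function(snps, genes, distance.max = 200000) {
  best <- sapply(seq_len(nrow(genes)), function(i) {
    d <- dist_to_gene(snps$pos, genes$start[i], genes$end[i])
    near <- d <= distance.max
    if (any(near)) min(snps$pval[near]) else NA_real_
  })
  out <- data.frame(gene = genes$gene, pval = best, stringsAsFactors = FALSE)
  out[!is.na(out$pval), ]  # keep only genes with at least one nearby SNP
}

nearby <- score_nearby_genes(snps, genes)
```

The resulting gene-level p-values play the role of the gene list that the second step passes to the subnetwork search.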
The third dataset, ImmunoBase, consists of GWAS lead SNPs associated with immunologically related human diseases, obtained from ImmunoBase.
ImmunoBase <- xRDataLoader(RData.customised='ImmunoBase')
# get info about diseases
disease_list <- lapply(ImmunoBase, function(x) x$disease)
# get the number of disease associated variants/SNPs
variants_list <- lapply(ImmunoBase, function(x) length(names(x$variants)))
# get the number of genes that are located within 500kb distance window of SNPs
genes_list <- lapply(ImmunoBase, function(x) length(x$genes_variants))
# create a data frame
df_ib <- data.frame(Code=names(ImmunoBase), Disease=unlist(disease_list), num_SNPs=unlist(variants_list), num_nearby_genes=unlist(genes_list), stringsAsFactors=F)
A summary of diseases, GWAS lead SNPs and their nearby genes:
Code | Disease | num_SNPs | num_nearby_genes |
---|---|---|---|
AA | Alopecia Areata (AA) | 31 | 203 |
AS | Ankylosing Spondylitis (AS) | 38 | 437 |
ATD | Autoimmune Thyroid Disease (ATD) | 37 | 142 |
CEL | Celiac Disease (CEL) | 197 | 647 |
CRO | Crohn’s Disease (CRO) | 257 | 1861 |
IBD | Inflammatory Bowel Disease (IBD) | 194 | 2205 |
IGE | IgE and Allergic Sensitization (IGE) | 11 | 107 |
JIA | Juvenile Idiopathic Arthritis (JIA) | 31 | 340 |
MS | Multiple Sclerosis (MS) | 317 | 1608 |
NAR | Narcolepsy (NAR) | 10 | 88 |
PBC | Primary Biliary Cirrhosis (PBC) | 157 | 606 |
PSC | Primary Sclerosing Cholangitis (PSC) | 15 | 169 |
PSO | Psoriasis (PSO) | 87 | 533 |
RA | Rheumatoid Arthritis (RA) | 255 | 979 |
SJO | Sjogren Syndrome (SJO) | 13 | 104 |
SLE | Systemic Lupus Erythematosus (SLE) | 87 | 669 |
SSC | Systemic Scleroderma (SSC) | 8 | 55 |
T1D | Type 1 Diabetes (T1D) | 310 | 1003 |
UC | Ulcerative Colitis (UC) | 181 | 1618 |
VIT | Vitiligo (VIT) | 39 | 308 |
In the following subsections, we focus on two diseases, Crohn’s Disease (CRO) and Celiac Disease (CEL), identifying putative gene networks that are likely modulated by their corresponding lead (or LD) SNPs. The other diseases can be analysed in the same way.
# get SNPs reported in CRO GWAS and their significance info (p-values)
gr <- ImmunoBase$CRO$variants
data <- as.matrix(mcols(gr)[, c('Variant','Pvalue')])
# find maximum-scoring gene subnetwork with the desired node number=30
subnet_CRO <- xSubneterSNPs(data=data, network="STRING_high", distance.max=500000, seed.genes=T, subnet.size=30)
The identified gene network with nodes colored according to scores (the higher the more significant) is shown below:
pattern <- -log10(as.numeric(V(subnet_CRO)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet_CRO, pattern=pattern, glayout=layout_(subnet_CRO, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)
Identify pathways enriched with genes in the identified network
data <- V(subnet_CRO)$name
eTerm <- xEnricherGenes(data=data, ontology="MsigdbC2CPall")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'CRO_Pathway_enrichments.txt'
res_CRO_PW <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_CRO_PW), res_CRO_PW)
write.table(output, file="CRO_Pathway_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 10 significant pathways are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
IL27-mediated signaling events | 26 | 5 | 16.50 | 2.1e-10 | 1.1e-08 |
GMCSF-mediated signaling events | 37 | 5 | 13.80 | 2.0e-09 | 3.5e-08 |
IL23-mediated signaling events | 37 | 5 | 13.80 | 2.0e-09 | 3.5e-08 |
IL12-mediated signaling events | 63 | 5 | 10.40 | 5.6e-08 | 6.9e-07 |
IL4-mediated signaling events | 65 | 5 | 10.20 | 6.8e-08 | 6.9e-07 |
Cytokine-cytokine receptor interaction | 266 | 8 | 7.56 | 1.6e-07 | 1.3e-06 |
IFN-gamma pathway | 40 | 4 | 10.50 | 2.0e-07 | 1.5e-06 |
NO2-dependent IL 12 Pathway in NK cells | 17 | 3 | 12.20 | 2.6e-07 | 1.6e-06 |
ErbB2/ErbB3 signaling events | 44 | 4 | 9.96 | 3.3e-07 | 1.9e-06 |
IL6-mediated signaling events | 47 | 4 | 9.62 | 4.6e-07 | 2.3e-06 |
# get SNPs reported in CEL GWAS and their significance info (p-values)
gr <- ImmunoBase$CEL$variants
data <- as.matrix(mcols(gr)[, c('Variant','Pvalue')])
# find maximum-scoring gene subnetwork with the desired node number=30
subnet_CEL <- xSubneterSNPs(data=data, network="STRING_high", distance.max=500000, seed.genes=T, subnet.size=30)
The identified gene network with nodes colored according to scores (the higher the more significant) is shown below:
pattern <- -log10(as.numeric(V(subnet_CEL)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet_CEL, pattern=pattern, glayout=layout_(subnet_CEL, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)
Identify pathways enriched with genes in the identified network
data <- V(subnet_CEL)$name
eTerm <- xEnricherGenes(data=data, ontology="MsigdbC2CPall")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'CEL_Pathway_enrichments.txt'
res_CEL_PW <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_CEL_PW), res_CEL_PW)
write.table(output, file="CEL_Pathway_enrichments.txt", sep="\t", row.names=F)
Enrichment results for the top 10 significant pathways are shown below:
name | nAnno | nOverlap | zscore | pvalue | adjp |
---|---|---|---|---|---|
Selective expression of chemokine receptors during T-cell polarization | 29 | 9 | 31.1 | 9.0e-20 | 4.7e-18 |
Cytokine-cytokine receptor interaction | 266 | 12 | 13.1 | 5.2e-14 | 1.4e-12 |
The Co-Stimulatory Signal During T-cell Activation | 21 | 5 | 20.2 | 1.5e-11 | 2.7e-10 |
Chemokine receptors bind chemokines | 56 | 6 | 14.7 | 1.3e-10 | 1.8e-09 |
Th1/Th2 Differentiation | 19 | 4 | 17.0 | 1.4e-09 | 1.3e-08 |
Chemokine signaling pathway | 190 | 8 | 10.3 | 1.3e-09 | 1.3e-08 |
Costimulation by the CD28 family | 62 | 5 | 11.5 | 1.6e-08 | 1.2e-07 |
IL12-mediated signaling events | 63 | 5 | 11.4 | 1.7e-08 | 1.2e-07 |
IL12 signaling mediated by STAT4 | 33 | 4 | 12.8 | 2.8e-08 | 1.7e-07 |
G alpha (i) signalling events | 193 | 7 | 8.8 | 3.7e-08 | 2.1e-07 |
Here is the output of sessionInfo() on the system on which this user manual was built:
> R version 3.2.4 (2016-03-10)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: OS X 10.11.4 (El Capitan)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats4 parallel stats graphics grDevices utils datasets
> [8] methods base
>
> other attached packages:
> [1] BiocInstaller_1.18.5 stringr_1.0.0 rmarkdown_0.9.5
> [4] RCircos_1.1.3 GenomicRanges_1.20.8 GenomeInfoDb_1.4.3
> [7] IRanges_2.2.9 S4Vectors_0.6.6 BiocGenerics_0.14.0
> [10] XGR_1.0.0 dnet_1.0.8 supraHex_1.7.3
> [13] hexbin_1.27.1 igraph_1.0.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.28.0 foreach_1.4.3
> [3] splines_3.2.4 Formula_1.2-1
> [5] highr_0.5.1 latticeExtra_0.6-28
> [7] RBGL_1.44.0 BSgenome_1.36.3
> [9] Rsamtools_1.20.4 yaml_2.1.13
> [11] RSQLite_1.0.0 lattice_0.20-33
> [13] biovizBase_1.16.0 digest_0.6.9
> [15] RColorBrewer_1.1-2 XVector_0.8.0
> [17] colorspace_1.2-6 ggbio_1.16.1
> [19] htmltools_0.3 Matrix_1.2-4
> [21] plyr_1.8.3 OrganismDbi_1.10.0
> [23] XML_3.98-1.3 biomaRt_2.24.0
> [25] zlibbioc_1.14.0 scales_0.4.0
> [27] BiocParallel_1.2.18 ggplot2_2.1.0
> [29] GenomicFeatures_1.20.1 nnet_7.3-12
> [31] survival_2.38-3 magrittr_1.5
> [33] evaluate_0.8.3 GGally_1.0.1
> [35] nlme_3.1-125 MASS_7.3-45
> [37] foreign_0.8-66 graph_1.46.0
> [39] tools_3.2.4 doMC_1.3.4
> [41] formatR_1.3 munsell_0.4.3
> [43] cluster_2.0.3 AnnotationDbi_1.30.1
> [45] lambda.r_1.1.7 Biostrings_2.36.1
> [47] compiler_3.2.4 futile.logger_1.4.1
> [49] grid_3.2.4 RCurl_1.95-4.7
> [51] iterators_1.0.8 dichromat_2.0-0
> [53] VariantAnnotation_1.14.13 bitops_1.0-6
> [55] codetools_0.2-14 gtable_0.2.0
> [57] DBI_0.3.1 reshape_0.8.5
> [59] reshape2_1.4.1 GenomicAlignments_1.4.1
> [61] gridExtra_2.2.1 knitr_1.12.3
> [63] rtracklayer_1.28.6 Hmisc_3.17-2
> [65] futile.options_1.0.0 Rgraphviz_2.12.0
> [67] ape_3.4 stringi_1.0-1
> [69] Rcpp_0.12.4 rpart_4.1-10
> [71] acepack_1.3-3.3
Ashburner, M, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, et al. 2000. “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.” Nat Genet 25 (1): 25–29. doi:10.1038/75556.
Cerami, E. G., B. E. Gross, E. Demir, I. Rodchenkov, O. Babur, N. Anwar, N. Schultz, G. D. Bader, and C. Sander. 2011. “Pathway Commons, a web resource for biological pathway data.” Nucleic Acids Research 39 (Database): D685–D690. doi:10.1093/nar/gkq1039.
Fairfax, Benjamin P, Peter Humburg, Seiko Makino, Vivek Naranbhai, Daniel Wong, Evelyn Lau, Luke Jostins, et al. 2014. “Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression.” Science (New York, N.Y.) 343 (March): 1246949. doi:10.1126/science.1246949.
Fang, Hai, and Julian Gough. 2014. “The ’dnet’ approach promotes emerging research on cancer patient survival.” Genome Medicine 6 (8): 64. doi:10.1186/s13073-014-0064-8.
Köhler, Sebastian, Sandra C Doelken, Christopher J Mungall, Sebastian Bauer, Helen V Firth, Isabelle Bailleul-Forestier, Graeme C M Black, et al. 2013. “The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data.” Nucleic Acids Research, November, 1–9. doi:10.1093/nar/gkt1026.
Schriml, L M, C Arze, S Nadendla, Y W Chang, M Mazaitis, V Felix, G Feng, and W A Kibbe. 2012. “Disease Ontology: a backbone for disease semantic integration.” Nucleic Acids Res 40 (Database issue): D940–6. doi:10.1093/nar/gkr972.
Smith, C L, and J T Eppig. 2009. “The Mammalian Phenotype Ontology: enabling robust annotation and comparative analysis.” Wiley Interdiscip Rev Syst Biol Med 1 (3): 390–99. doi:10.1002/wsbm.44.
Szklarczyk, Damian, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, et al. 2015. “STRING v10: protein-protein interaction networks, integrated over the tree of life.” Nucleic Acids Res 43 (Database): D447–D452. doi:10.1093/nar/gku1003.
Welter, Danielle, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, et al. 2014. “The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.” Nucleic Acids Research 42 (D1): 1001–6. doi:10.1093/nar/gkt1229.