1 Summary

We introduce an R package called XGR. This package is designed to make a user-defined gene or SNP list more interpretable by comprehensively utilising ontology and network information to reveal relationships and enhance opportunities for biological discovery. XGR is unique in supporting a broad range of ontologies (including knowledge of biological and molecular functions, pathways, diseases and phenotypes, in both human and mouse) and different types of networks (including functional, physical and pathway interactions). After going through this user manual (particularly the Applications section, which includes demos with published data), you will be able to: 1) perform enrichment analysis using either built-in or custom ontologies, 2) calculate semantic similarity between genes (or between SNPs) based on their ontology annotation profiles, and 3) identify a gene subnetwork given your query list of (significant) genes or SNPs. End-users who are unfamiliar with R can instead use our user-friendly web app.

2 Installation

We assume that R, a language and environment for statistical computing and graphics, has already been installed. Installing the XGR package itself (now hosted on GitHub) takes two steps:

  • First, install the dependent packages:
source("http://bioconductor.org/biocLite.R")
biocLite(c("devtools","dnet","RCircos","GenomicRanges","ggplot2","ggbio"), siteRepos=c("http://cran.r-project.org"))
  • Second, install the XGR package:
library(devtools)
install_github(c("hfang-bristol/XGR"), dependencies=T)

3 Functionality

The functions in the package XGR are categorised into five groups according to the tasks they complete. They are summarised below.

3.1 Enrichment functions

Enrichment functions perform enrichment analysis using one of several statistical tests (Fisher’s exact test, the hypergeometric test or the binomial test). The test estimates the significance of the overlap between, for example, an input group of genes and the group of genes annotated by an ontology term. By default, all annotatable genes are used as the test background, but the user can specify a different background. If ontology terms are organised in a tree-like structure, this structure can also be taken into account.
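The overlap test described above can be sketched in base R. This is a minimal illustration with made-up gene identifiers, not the package's own implementation; it only shows that the hypergeometric test and the one-sided Fisher's exact test answer the same question.

```r
# Toy data: all identifiers are hypothetical
background  <- paste0("G", 1:20)                        # test background (universe)
term_genes  <- paste0("G", 1:6)                         # genes annotated by one ontology term
input_genes <- c("G1", "G2", "G3", "G4", "G10", "G15")  # user-defined gene list

n_bg      <- length(background)
n_term    <- length(term_genes)
n_input   <- length(input_genes)
n_overlap <- length(intersect(input_genes, term_genes))

# Hypergeometric test: probability of seeing at least n_overlap annotated genes
# when drawing n_input genes from the background
p_hyper <- phyper(n_overlap - 1, n_term, n_bg - n_term, n_input, lower.tail = FALSE)

# The same question posed as a one-sided Fisher's exact test on a 2x2 table
tab <- rbind(
  in_term     = c(in_input = n_overlap, not_in_input = n_term - n_overlap),
  not_in_term = c(n_input - n_overlap, n_bg - n_term - n_input + n_overlap)
)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

When many terms are tested, the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust(p, method="BH").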

3.1.1 xEnricherGenes

xEnricherGenes: conducts gene-based enrichment analysis given a list of genes and the ontology in query. It supports two types of ontologies: 1) structured ontologies including Gene Ontology (Ashburner et al. 2000), Disease Ontology (Schriml et al. 2012), and Phenotype Ontologies in human and mouse (Köhler et al. 2013; C. L. Smith and Eppig 2009), and 2) non-structured ontologies/categories; for example, a collection of pathways, gene expression signatures, transcription factor targets, and gene druggable categories.

3.1.2 xEnricherSNPs

xEnricherSNPs: conducts SNP-based enrichment analysis using GWAS Catalog traits mapped to Experimental Factor Ontology (Welter et al. 2014). Additional SNPs that are in linkage disequilibrium (LD) with the input SNPs can also be included in the enrichment analysis.
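To illustrate what including LD SNPs means, the following base R sketch expands an input SNP list using a pairwise LD table. The table, its column names and the r2 values are hypothetical; the built-in LD data used by the package may be organised differently.

```r
# Hypothetical pairwise LD table (identifiers and r2 values are made up)
ld <- data.frame(
  snp     = c("rs1", "rs1", "rs2"),
  partner = c("rs10", "rs11", "rs20"),
  r2      = c(0.95, 0.60, 0.85),
  stringsAsFactors = FALSE
)
input_snps <- c("rs1", "rs3")

# Keep partners of input SNPs whose r2 reaches the cutoff
r2_cutoff <- 0.8
ld_snps <- ld$partner[ld$snp %in% input_snps & ld$r2 >= r2_cutoff]

# Input SNPs plus their LD SNPs, without duplicates
expanded_snps <- union(input_snps, ld_snps)
```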

3.1.3 xEnricherYours

xEnricherYours: conducts enrichment analysis on user-supplied data, given an entity file and an annotation file.

3.1.4 xEnrichViewer

xEnrichViewer: returns enrichment results as a data frame, which is convenient both for viewing and for saving to a file.

3.1.5 xEnricher

xEnricher: acts as a template for enrichment analysis. It is an internal function upon which high-level functions (ie xEnricherGenes, xEnricherSNPs and xEnricherYours) rely.

3.2 Similarity functions

Similarity functions conduct similarity analysis by calculating semantic similarity, a type of comparison that assesses the degree of relatedness between two entities (eg genes) based on their annotation profiles (by ontology terms). To do so, the information content (IC) of a term is first defined to measure how informative the term is when used for annotating genes: -log10(frequency of genes annotated to this term). Similarity between two terms is then measured based on IC, usually at the most informative common ancestor (MICA). Finally, similarity between two entities (eg genes) is derived from pairwise term similarity using best-matching based methods: average, maximum, and complete.
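The three steps above (IC, term similarity at the MICA, and best-matching entity similarity) can be sketched on a toy ontology in base R. The terms, genes and helper names (term_sim, gene_sim) are hypothetical, and the best-matching average used here is one common formulation, assumed for illustration rather than taken from the package.

```r
# Toy ontology: root -> A, root -> B, A -> C, A -> D (hypothetical terms)
# Ancestors of each term, including the term itself
anc <- list(
  root = "root",
  A = c("A", "root"),
  B = c("B", "root"),
  C = c("C", "A", "root"),
  D = c("D", "A", "root")
)
# Genes annotated to each term after propagation to ancestors (10 genes in total)
anno <- list(
  root = paste0("G", 1:10),
  A = paste0("G", 1:6),
  B = c("G3", paste0("G", 7:10)),
  C = c("G1", "G2"),
  D = c("G3", "G4", "G5")
)

# Information content: -log10(frequency of genes annotated to the term)
n_genes <- length(anno$root)
ic <- -log10(sapply(anno, length) / n_genes)

# Term similarity: IC of the most informative common ancestor (MICA)
term_sim <- function(t1, t2) {
  common <- intersect(anc[[t1]], anc[[t2]])
  max(ic[common])
}

# Entity similarity by best-matching average over all term pairs
gene_sim <- function(terms1, terms2) {
  m <- matrix(0, nrow = length(terms1), ncol = length(terms2))
  for (i in seq_along(terms1)) {
    for (j in seq_along(terms2)) {
      m[i, j] <- term_sim(terms1[i], terms2[j])
    }
  }
  0.5 * (mean(apply(m, 1, max)) + mean(apply(m, 2, max)))
}

sim_CD    <- term_sim("C", "D")             # MICA of C and D is A
sim_genes <- gene_sim("C", c("D", "B"))     # G1 (term C) versus G3 (terms D and B)
```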

3.2.1 xSocialiserGenes

xSocialiserGenes: conducts gene-based similarity analysis given a list of genes and the ontology in query. It supports several structured ontologies including Gene Ontology, Disease Ontology, and Phenotype Ontologies (in human and mouse), and returns socialised genes represented as a network with nodes for input genes and edges for pair-wise semantic similarity between them.

3.2.2 xSocialiserSNPs

xSocialiserSNPs: conducts SNP-based similarity analysis using GWAS Catalog traits mapped to Experimental Factor Ontology. Additional SNPs that are in linkage disequilibrium (LD) with the input SNPs can also be included in the similarity analysis. It returns socialised SNPs represented as a network with nodes for input SNPs and edges for pair-wise semantic similarity between them.

3.2.3 xSocialiser

xSocialiser: acts as a template for similarity analysis. It is an internal function upon which high-level functions (ie xSocialiserGenes and xSocialiserSNPs) rely.

3.3 Network functions

Network functions identify a gene subnetwork from a gene interaction network in which nodes/genes carry significance information. This information can be provided either directly (eg user-defined genes with a significance level; p-values or FDR) or indirectly (eg genes near user-defined SNPs with a significance level; GWAS-reported p-values). Given a gene interaction network whose nodes are labelled with this information, the algorithm that searches for a maximum-scoring gene subnetwork has been reported in our previous publication (Fang and Gough 2014). In brief: 1) score transformation: given a threshold of tolerable p-value, nodes with p-values below the threshold (nodes of interest) are scored positively, and nodes with p-values above it (intolerable) are scored negatively; 2) subnetwork identification: an interconnected gene subnetwork enriched with positive-score nodes is found, allowing a few negative-score nodes as linkers; and 3) subnetwork size control: an iterative procedure fine-tunes the tolerable threshold so that the identified gene subnetwork has a desired number of nodes.
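The score transformation in step 1 can be illustrated with a deliberately simplified scoring rule. The formula below, -log10(p/threshold), is an assumption chosen only because it is positive below the threshold and negative above it; it is not claimed to be the exact transformation used by the package. The gene names and p-values are made up.

```r
# Hypothetical node p-values
pvals <- c(geneA = 1e-6, geneB = 0.003, geneC = 0.04, geneD = 0.5)

# Simplified score (an assumption for illustration):
# positive when p < threshold, negative when p > threshold
score_nodes <- function(p, threshold) -log10(p / threshold)

scores <- score_nodes(pvals, threshold = 0.01)

# Relaxing the threshold increases the number of positive-score nodes,
# which is the lever an iterative size-control procedure can tune
thresholds <- c(1e-4, 1e-2, 1e-1)
n_positive <- sapply(thresholds, function(th) sum(score_nodes(pvals, th) > 0))
```

In this toy example, each relaxation of the threshold admits one more node of interest, which is why the threshold is a natural parameter for steering the subnetwork towards a desired size.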

3.3.1 xSubneterGenes

xSubneterGenes: takes as input a list of user-defined genes with their significance levels (p-values), superposes these genes onto a gene interaction network, and outputs a maximum-scoring gene subnetwork that contains as many of the most significant (highly scored) genes as possible, together with a few less significant (lowly scored) genes as linkers.

3.3.2 xSubneterSNPs

xSubneterSNPs: identifies a gene subnetwork that is likely modulated by input SNPs and/or their linkage disequilibrium (LD) SNPs, in two major steps. The first step defines and scores nearby genes that are located within a distance window of the input and/or LD SNPs. The second step uses xSubneterGenes to identify a maximum-scoring gene subnetwork.
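The first step (defining and scoring nearby genes) can be sketched in base R as follows. The coordinates, identifiers and the choice of giving each gene the smallest p-value among its nearby SNPs are all assumptions for illustration; the package may score nearby genes differently.

```r
# Hypothetical SNPs with positions and p-values
snps <- data.frame(
  snp  = c("rs1", "rs2", "rs3"),
  chr  = c("1", "1", "2"),
  pos  = c(1000, 50000, 2000),
  pval = c(1e-8, 1e-3, 1e-5),
  stringsAsFactors = FALSE
)
# Hypothetical gene coordinates
genes <- data.frame(
  gene  = c("GA", "GB", "GC"),
  chr   = c("1", "1", "2"),
  start = c(5000, 300000, 1000),
  end   = c(9000, 310000, 3000),
  stringsAsFactors = FALSE
)
window <- 10000  # maximum distance between a SNP and a gene body

# For each gene, collect SNPs on the same chromosome within the window
# and keep the smallest p-value (NA if no SNP is nearby)
gene_p <- sapply(seq_len(nrow(genes)), function(i) {
  same_chr <- snps$chr == genes$chr[i]
  dist <- pmax(genes$start[i] - snps$pos, snps$pos - genes$end[i], 0)
  hits <- snps$pval[same_chr & dist <= window]
  if (length(hits) == 0) NA_real_ else min(hits)
})
names(gene_p) <- genes$gene
gene_p <- gene_p[!is.na(gene_p)]  # genes with at least one nearby SNP
```

The resulting named vector of gene-level p-values has the same shape as the gene input that a gene-based subnetwork search expects.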

3.4 Infrastructure functions

Infrastructure functions handle the tasks on which the other functions rely: loading built-in data, propagating ontology annotations, calculating term-term semantic similarity, and converting and visualising graphs.

  • xRDataLoader: serves as a hub for loading built-in data about genes, SNPs, ontologies and annotations.
  • xDAGanno: propagates annotations to the ontology root according to the true-path rule.
  • xDAGsim: calculates semantic similarity between terms, and returns a network with nodes for terms and edges for pair-wise semantic similarity between them.
  • xConverter: converts an object between graph classes.
  • xCircos: visualises the semantic similarity between genes (or SNPs) by the colour of links in a circos plot.
  • xVisNet: visualises the graph in different layouts.
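As an illustration of the true-path rule mentioned for xDAGanno (a gene annotated to a term is also annotated to all of that term's ancestors), here is a minimal base R sketch on a toy DAG. The terms, genes and the helper ancestors() are hypothetical and do not reflect the package's internal data structures.

```r
# Toy DAG given as child -> parents (hypothetical terms; D has two parents)
parents <- list(root = character(0), A = "root", B = "root", C = "A", D = c("A", "B"))
# Direct annotations: term -> genes
direct <- list(C = c("G1", "G2"), D = "G3", B = "G4")

# All ancestors of a term, found by walking up the parent links
ancestors <- function(term) {
  ps <- parents[[term]]
  unique(c(ps, unlist(lapply(ps, ancestors))))
}

# True-path rule: copy each direct annotation to the term and all its ancestors
propagated <- setNames(vector("list", length(parents)), names(parents))
for (term in names(direct)) {
  for (target in c(term, ancestors(term))) {
    propagated[[target]] <- unique(c(propagated[[target]], direct[[term]]))
  }
}
```

After propagation the root term carries every annotated gene, and term B inherits G3 from its child D in addition to its own G4.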

3.5 Auxiliary functions

Auxiliary functions provide supplementary support during package development, such as code debugging and documentation generation.

  • xFunArgs: assigns arguments with default values for a given function, useful for code debugging.
  • xRdWrap: wraps long texts onto the next line for Rd files.
  • xRd2HTML: converts Rd files to HTML files.

4 Applications

An essential step in data analysis is making sense of a gene (or SNP) list in a biologically meaningful way. Genes (and/or SNPs) may be identified from differential expression analysis, eQTL mapping and GWAS. In this section, we showcase applications using several published datasets. Users are encouraged to adapt the provided code to analyse their own datasets.

First of all, load the package XGR:

library(XGR)
# the following packages are needed for visualisation
library(GenomicRanges)
library(RCircos)

4.1 Analysing differential genes

The first dataset we use is based on expression data in monocytes (Fairfax et al. 2014). This dataset JKscience_TS1A involves 228 individuals with expression data at four conditions: in the naive state (Naive), after 2-hour LPS (LPS2), after 24-hour LPS (LPS24), and after 24-hour exposure to interferon gamma (IFN). Differential expression analysis was performed using the limma package to identify genes that are differentially expressed between two conditions.

Extract genes that are significantly induced by interferon gamma as compared to the naive state

# Load differential expression analysis results
res <- xRDataLoader(RData.customised='JKscience_TS1A')
# Create a data frame for genes significantly induced by IFN
flag <- res$logFC_INF24_Naive < 0 & res$fdr_INF24_Naive < 0.01
df <- res[flag, c('Symbol','logFC_INF24_Naive','fdr_INF24_Naive')]

The first 5 rows of the data frame df are shown below. The column logFC_INF24_Naive gives the log2-transformed fold change, defined as naive expression divided by IFN expression; induced genes therefore have negative values.

Symbol    logFC_INF24_Naive  fdr_INF24_Naive
RERE      -1.24              8.41e-187
PAPD4     -0.06              6.13e-03
F3        -0.17              6.12e-05
LIN52     -0.04              8.08e-03
CD558651  -0.05              9.95e-05

4.1.1 Gene-based enrichment analysis

Enrichment analysis at the gene level can be done by choosing one of the currently supported ontologies:

Code              Ontology                                              Category
DO                Disease Ontology                                      Disease
GOMF              Gene Ontology Molecular Function                      Function
GOBP              Gene Ontology Biological Process                      Function
GOCC              Gene Ontology Cellular Component                      Function
HPPA              Human Phenotype Phenotypic Abnormality                Phenotype
HPMI              Human Phenotype Mode of Inheritance                   Phenotype
HPCM              Human Phenotype Clinical Modifier                     Phenotype
HPMA              Human Phenotype Mortality Aging                       Phenotype
MP                Mammalian/Mouse Phenotype                             Phenotype
DGIdb             DGI druggable gene categories                         Druggable
SF                SCOP domain superfamilies                             Domain
PS2               phylostratific age information (our ancestors)        Evolution
MsigdbH           Hallmark gene sets                                    MsigDB
MsigdbC1          Chromosome and cytogenetic band positional gene sets  MsigDB
MsigdbC2CGP       Chemical and genetic perturbation gene sets           MsigDB
MsigdbC2CPall     All pathway gene sets                                 MsigDB
MsigdbC2CP        Canonical pathway gene sets                           MsigDB
MsigdbC2KEGG      KEGG pathway gene sets                                MsigDB
MsigdbC2REACTOME  Reactome pathway gene sets                            MsigDB
MsigdbC2BIOCARTA  BioCarta pathway gene sets                            MsigDB
MsigdbC3TFT       Transcription factor target gene sets                 MsigDB
MsigdbC3MIR       microRNA target gene sets                             MsigDB
MsigdbC4CGN       Cancer gene neighborhood gene sets                    MsigDB
MsigdbC4CM        Cancer module gene sets                               MsigDB
MsigdbC5BP        GO biological process gene sets                       MsigDB
MsigdbC5MF        GO molecular function gene sets                       MsigDB
MsigdbC5CC        GO cellular component gene sets                       MsigDB
MsigdbC6          Oncogenic signature gene sets                         MsigDB
MsigdbC7          Immunologic signature gene sets                       MsigDB

Optionally, the user can provide the test background; by default, all annotatable genes are used. Here, the genes tested in the differential expression analysis are used as the test background.

background <- res$Symbol

Using a collection of pathways (all pathway gene sets, MsigdbC2CPall)

data <- df$Symbol
eTerm <- xEnricherGenes(data=data, background=background, ontology="MsigdbC2CPall")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'Pathway_enrichments.txt'
res_PW <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_PW), res_PW)
write.table(output, file="Pathway_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 10 significant pathways are shown below:

name                                        nAnno  nOverlap  zscore  pvalue   adjp
Immune System                               639    341       6.80    5.0e-12  2.9e-09
Interferon gamma signaling                  47     41        6.47    5.0e-12  2.9e-09
Interferon Signaling                        120    84        6.53    2.2e-11  7.9e-09
Interferon alpha/beta signaling             45     38        5.95    2.6e-10  6.9e-08
SCF(Skp2)-mediated degradation of p27/p21   41     34        5.47    5.4e-09  1.1e-06
Vif-mediated degradation of APOBEC3G        35     30        5.39    6.3e-09  1.1e-06
p53-Independent G1/S DNA damage checkpoint  34     29        5.26    1.4e-08  2.1e-06
Cytokine Signaling in Immune system         207    123       5.50    1.8e-08  2.3e-06
ER-Phagosome pathway                        39     32        5.23    2.4e-08  2.8e-06
Proteasome                                  33     28        5.13    3.0e-08  2.9e-06

Using Disease Ontology (DO)

data <- df$Symbol
eTerm <- xEnricherGenes(data=data, background=background, ontology="DO")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'DO_enrichments.txt'
res_DO <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=T)
output <- data.frame(term=rownames(res_DO), res_DO)
write.table(output, file="DO_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 5 significant terms are shown below:

name                                    nAnno  nOverlap  zscore  pvalue   adjp
viral infectious disease                528    270       5.35    0.0e+00  3.1e-05
disease by infectious agent             863    408       4.59    2.1e-06  7.3e-04
influenza                               77     50        4.41    3.7e-06  8.8e-04
autoimmune disease of endocrine system  56     36        3.65    8.9e-05  1.6e-02
measles                                 26     19        3.39    1.7e-04  2.4e-02

Using Mammalian Phenotype (MP)

data <- df$Symbol
eTerm <- xEnricherGenes(data=data, background=background, ontology="MP")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'MP_enrichments.txt'
res_MP <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_MP), res_MP)
write.table(output, file="MP_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 10 significant terms are shown below:

name                                          nAnno  nOverlap  zscore  pvalue   adjp
increased susceptibility to infection         324    172       4.53    2.7e-06  0.0071
altered susceptibility to infection           386    200       4.45    4.0e-06  0.0071
abnormal response to infection                417    213       4.32    7.2e-06  0.0085
abnormal antigen presentation                 54     37        4.11    1.2e-05  0.0110
abnormal dendritic cell antigen presentation  30     23        3.96    1.5e-05  0.0110
increased mature B cell number                113    67        3.97    2.9e-05  0.0170
altered susceptibility to viral infection     116    68        3.87    4.3e-05  0.0220
decreased B cell proliferation                93     56        3.77    6.1e-05  0.0240
abnormal pilomotor reflex                     11     10        3.36    5.6e-05  0.0240
abnormal immunoglobulin level                 351    177       3.67    1.1e-04  0.0360

4.1.2 Gene-based network analysis

Network analysis at the gene level requires choosing a pre-defined gene network as the whole network (called whole-network). Two sources of whole-network information are supported: the STRING database (Szklarczyk et al. 2015) and the Pathway Commons database (Cerami et al. 2011). STRING is a meta-integration of undirected interactions from a functional aspect, while Pathway Commons contains both undirected and directed interactions from a physical/pathway aspect. Both provide scores that control the confidence of interactions. The user can therefore choose interactions of varying quality in addition to interaction type:

Code               Interaction                                                       Database
STRING_highest     Functional interactions (with highest confidence scores>=900)     STRING
STRING_high        Functional interactions (with high confidence scores>=700)        STRING
STRING_medium      Functional interactions (with medium confidence scores>=400)      STRING
PCommonsUN_high    Physical/undirected interactions (with references & >=2 sources)  Pathway Commons
PCommonsUN_medium  Physical/undirected interactions (with references & >=1 sources)  Pathway Commons
PCommonsDN_high    Pathway/directed interactions (with references & >=2 sources)     Pathway Commons
PCommonsDN_medium  Pathway/directed interactions (with references & >=1 sources)     Pathway Commons
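To make the effect of the confidence cutoffs concrete, the following base R sketch filters a toy edge list at the three STRING thresholds listed above. The edges and their scores are made up; only the cutoff values (900, 700, 400) come from the table.

```r
# Hypothetical interaction edge list with confidence scores
edges <- data.frame(
  from  = c("A", "A", "B", "C"),
  to    = c("B", "C", "D", "D"),
  score = c(950, 720, 450, 300),
  stringsAsFactors = FALSE
)

# Confidence cutoffs as given in the table
cutoffs <- c(STRING_highest = 900, STRING_high = 700, STRING_medium = 400)

# Number of edges retained at each confidence level
n_edges <- sapply(cutoffs, function(th) sum(edges$score >= th))

# Edges of the 'high' confidence network
edges_high <- edges[edges$score >= cutoffs["STRING_high"], ]
```

A stricter cutoff gives a smaller but more reliable whole-network, which in turn constrains the subnetworks that can be found.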

For the directed pathway interactions, the user can also choose a network from an individual source:

Code                    Interaction                                            Database
PCommonsDN_Reactome     Pathway/directed interactions (only from Reactome)     Pathway Commons
PCommonsDN_KEGG         Pathway/directed interactions (only from KEGG)         Pathway Commons
PCommonsDN_HumanCyc     Pathway/directed interactions (only from HumanCyc)     Pathway Commons
PCommonsDN_PID          Pathway/directed interactions (only from PID)          Pathway Commons
PCommonsDN_PANTHER      Pathway/directed interactions (only from PANTHER)      Pathway Commons
PCommonsDN_ReconX       Pathway/directed interactions (only from ReconX)       Pathway Commons
PCommonsDN_PhosphoSite  Pathway/directed interactions (only from PhosphoSite)  Pathway Commons
PCommonsDN_CTD          Pathway/directed interactions (only from CTD)          Pathway Commons

In this subsection, we demonstrate how to identify a gene network from a pre-defined whole-network, given an input list of genes with significance information; in this case, the gene network induced by interferon gamma (IFN).

# find maximum-scoring gene subnetwork with the desired node number=75
data <- df[,c("Symbol","fdr_INF24_Naive")]
subnet <- xSubneterGenes(data=data, network="STRING_high", subnet.size=75)

The identified gene network with nodes colored according to FDR is shown below:

pattern <- -log10(as.numeric(V(subnet)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet, pattern=pattern, glayout=layout_(subnet, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)

4.2 Analysing eQTL SNPs

The second dataset we use is based on an immune-stimulated eQTL mapping study in monocytes (Fairfax et al. 2014). In this study, eQTLs were defined as SNPs showing significant association with gene expression at four conditions: in the naive state (Naive), after 2-hour LPS (LPS2), after 24-hour LPS (LPS24), and after exposure to interferon gamma (IFN). The eQTL association with gene expression is either in a cis- or trans-acting manner; accordingly they are called cis-eQTLs and trans-eQTLs. Genes whose expression is modulated by eQTLs are called eGenes. The significant cis-eQTLs are stored in JKscience_TS2A.

Extract cis-eQTLs induced after 24-hour exposure to interferon gamma

# Load cis-eQTL mapping results
cis <- xRDataLoader(RData.customised='JKscience_TS2A')
# Create a data frame for cis-eQTLs significantly induced by IFN
ind <- which(cis$IFN_t > 0 & cis$IFN_fdr < 0.05)
df_cis <- cis[ind, c('variant','Symbol','IFN_t','IFN_fdr')]

The first 5 rows of the data frame df_cis are shown below:

variant     Symbol  IFN_t     IFN_fdr
rs10002954  COQ2    6.961246  1.870000e-08
rs10005233  SNCA    7.789319  1.960000e-10
rs10005233  SNCA    4.501832  1.513695e-03
rs10006380  NUP54   4.101834  6.318482e-03
rs10006851  UBA6    5.132774  1.215260e-04

4.2.1 SNP-based enrichment analysis

Only for input SNPs reported in GWAS Catalog traits mapped to Experimental Factor Ontology (EFO)

data <- df_cis$variant
eTerm <- xEnricherSNPs(data=data, ontology="EF")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'EF_enrichments.txt'
res_EF <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=T)
output <- data.frame(term=rownames(res_EF), res_EF)
write.table(output, file="EF_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 10 terms are shown below:

name                              nAnno  nOverlap  zscore  pvalue   adjp
Vitiligo                          38     4         7.75    3.7e-06  0.00043
ulcerative colitis                130    6         5.79    1.7e-05  0.00100
autoimmune disease                1233   18        3.85    2.2e-04  0.00850
lipoprotein measurement           1313   18        3.55    4.9e-04  0.01300
immune system disease             1545   20        3.48    5.5e-04  0.01300
coronary heart disease            123    4         3.71    1.1e-03  0.02000
inflammatory bowel disease        420    8         3.36    1.3e-03  0.02100
multiple sclerosis                211    5         3.23    2.1e-03  0.02400
lipid or lipoprotein measurement  1583   19        3.03    1.9e-03  0.02400
lipid measurement                 1485   18        2.99    2.1e-03  0.02400

For input SNPs plus their LD SNPs (based on European populations) reported in GWAS Catalog traits mapped to EFO

data <- df_cis$variant
eTerm <- xEnricherSNPs(data=data, ontology="EF", include.LD="EUR", LD.r2=0.8)
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'EF_LD_enrichments.txt'
res_EF_LD <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=T)
output <- data.frame(term=rownames(res_EF_LD), res_EF_LD)
write.table(output, file="EF_LD_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 10 significant terms are shown below:

name                                                   nAnno  nOverlap  zscore  pvalue   adjp
inflammatory bowel disease                             420    40        8.21    1.0e-11  1.7e-09
immune system disease                                  1545   91        7.35    1.6e-11  1.7e-09
Parkinson’s disease                                    119    20        9.09    2.7e-11  1.9e-09
autoimmune disease                                     1233   76        7.10    1.1e-10  5.6e-09
ulcerative colitis                                     130    17        6.96    3.8e-08  1.6e-06
digestive system disease                               868    53        5.79    9.6e-08  3.4e-06
psoriasis                                              91     12        5.88    2.1e-06  6.3e-05
blood metabolite measurement                           221    19        5.10    6.9e-06  1.8e-04
lentiform nucleus measurement                          14     4         5.74    3.2e-05  7.5e-04
lipoprotein-associated phospholipase A(2) measurement  26     5         4.98    8.0e-05  1.7e-03

4.2.2 SNP-based similarity analysis

Only for input SNPs

data <- df_cis$variant
ig_SNP <- xSocialiserSNPs(data=data, ontology="EF")
# save similarity results to the file 'EF_similarity.txt'
output <- igraph::get.data.frame(ig_SNP, what="edges")
write.table(output, file="EF_similarity.txt", sep="\t", row.names=F)

Circos plot of the most similar edges is shown below:

xCircos(g=ig_SNP, entity="SNP", verbose=F)

For input SNPs plus their LD SNPs

data <- df_cis$variant
ig_SNP_LD <- xSocialiserSNPs(data=data, ontology="EF", include.LD="EUR", LD.r2=0.8)
# save similarity results to the file 'EF_LD_similarity.txt'
output <- igraph::get.data.frame(ig_SNP_LD, what="edges")
write.table(output, file="EF_LD_similarity.txt", sep="\t", row.names=F)

Circos plot of the most similar edges is shown below:

xCircos(g=ig_SNP_LD, entity="SNP", verbose=F)

4.2.3 SNP-based network analysis

In this section, we demonstrate how to identify a gene network from a pre-defined whole-network (see above), given an input list of SNPs with significance information; in this case, the gene network likely modulated by IFN-induced cis-eQTLs.

# find maximum-scoring gene subnetwork with the desired node number=60
data <- df_cis[,c("variant","IFN_fdr")]
subnet_SNP <- xSubneterSNPs(data=data, network="STRING_high", distance.max=200000, seed.genes=T, subnet.significance=1e-1, subnet.size=60)

The identified gene network with nodes colored according to scores is shown below:

pattern <- -log10(as.numeric(V(subnet_SNP)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet_SNP, pattern=pattern, glayout=layout_(subnet_SNP, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)

4.3 Analysing GWAS SNPs

The third dataset, ImmunoBase, consists of GWAS lead SNPs associated with immunologically related human diseases, obtained from ImmunoBase.

ImmunoBase <- xRDataLoader(RData.customised='ImmunoBase')
# get info about diseases
disease_list <- lapply(ImmunoBase, function(x) x$disease)
# get the number of disease associated variants/SNPs
variants_list <- lapply(ImmunoBase, function(x) length(names(x$variants)))
# get the number of genes that are located within 500kb distance window of SNPs
genes_list <- lapply(ImmunoBase, function(x) length(x$genes_variants))
# create a data frame
df_ib <- data.frame(Code=names(ImmunoBase), Disease=unlist(disease_list), num_SNPs=unlist(variants_list), num_nearby_genes=unlist(genes_list), stringsAsFactors=F)

A summary of diseases, GWAS lead SNPs and their nearby genes:

Code  Disease                               num_SNPs  num_nearby_genes
AA    Alopecia Areata (AA)                  31        203
AS    Ankylosing Spondylitis (AS)           38        437
ATD   Autoimmune Thyroid Disease (ATD)      37        142
CEL   Celiac Disease (CEL)                  197       647
CRO   Crohn’s Disease (CRO)                 257       1861
IBD   Inflammatory Bowel Disease (IBD)      194       2205
IGE   IgE and Allergic Sensitization (IGE)  11        107
JIA   Juvenile Idiopathic Arthritis (JIA)   31        340
MS    Multiple Sclerosis (MS)               317       1608
NAR   Narcolepsy (NAR)                      10        88
PBC   Primary Biliary Cirrhosis (PBC)       157       606
PSC   Primary Sclerosing Cholangitis (PSC)  15        169
PSO   Psoriasis (PSO)                       87        533
RA    Rheumatoid Arthritis (RA)             255       979
SJO   Sjogren Syndrome (SJO)                13        104
SLE   Systemic Lupus Erythematosus (SLE)    87        669
SSC   Systemic Scleroderma (SSC)            8         55
T1D   Type 1 Diabetes (T1D)                 310       1003
UC    Ulcerative Colitis (UC)               181       1618
VIT   Vitiligo (VIT)                        39        308

In the following subsections, we focus on two diseases, Crohn’s Disease (CRO) and Celiac Disease (CEL), identifying putative gene networks that are likely modulated by their corresponding lead (or LD) SNPs. Other diseases can be analysed in the same way.

4.3.1 SNP-modulated gene network in Crohn’s Disease

# get SNPs reported in CRO GWAS and their significance info (p-values)
gr <- ImmunoBase$CRO$variants
data <- as.matrix(mcols(gr)[, c('Variant','Pvalue')])

# find maximum-scoring gene subnetwork with the desired node number=30
subnet_CRO <- xSubneterSNPs(data=data, network="STRING_high", distance.max=500000, seed.genes=T, subnet.size=30)

The identified gene network with nodes colored according to scores (the higher the more significant) is shown below:

pattern <- -log10(as.numeric(V(subnet_CRO)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet_CRO, pattern=pattern, glayout=layout_(subnet_CRO, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)

Identify pathways enriched with genes in the identified network

data <- V(subnet_CRO)$name
eTerm <- xEnricherGenes(data=data, ontology="MsigdbC2CPall")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'CRO_Pathway_enrichments.txt'
res_CRO_PW <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_CRO_PW), res_CRO_PW)
write.table(output, file="CRO_Pathway_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 10 significant pathways are shown below:

name                                     nAnno  nOverlap  zscore  pvalue   adjp
IL27-mediated signaling events           26     5         16.50   2.1e-10  1.1e-08
GMCSF-mediated signaling events          37     5         13.80   2.0e-09  3.5e-08
IL23-mediated signaling events           37     5         13.80   2.0e-09  3.5e-08
IL12-mediated signaling events           63     5         10.40   5.6e-08  6.9e-07
IL4-mediated signaling events            65     5         10.20   6.8e-08  6.9e-07
Cytokine-cytokine receptor interaction   266    8         7.56    1.6e-07  1.3e-06
IFN-gamma pathway                        40     4         10.50   2.0e-07  1.5e-06
NO2-dependent IL 12 Pathway in NK cells  17     3         12.20   2.6e-07  1.6e-06
ErbB2/ErbB3 signaling events             44     4         9.96    3.3e-07  1.9e-06
IL6-mediated signaling events            47     4         9.62    4.6e-07  2.3e-06

4.3.2 SNP-modulated gene network in Celiac Disease

# get SNPs reported in CEL GWAS and their significance info (p-values)
gr <- ImmunoBase$CEL$variants
data <- as.matrix(mcols(gr)[, c('Variant','Pvalue')])

# find maximum-scoring gene subnetwork with the desired node number=30
subnet_CEL <- xSubneterSNPs(data=data, network="STRING_high", distance.max=500000, seed.genes=T, subnet.size=30)

The identified gene network with nodes colored according to scores (the higher the more significant) is shown below:

pattern <- -log10(as.numeric(V(subnet_CEL)$significance))
pattern[is.infinite(pattern)] <- max(pattern[!is.infinite(pattern)])
vmax <- ceiling(stats::quantile(pattern, 0.75))
vmin <- floor(min(pattern))
xVisNet(g=subnet_CEL, pattern=pattern, glayout=layout_(subnet_CEL, with_kk()), vertex.shape="sphere", colormap="yr", zlim=c(vmin,vmax), newpage=F, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2)

Identify pathways enriched with genes in the identified network

data <- V(subnet_CEL)$name
eTerm <- xEnricherGenes(data=data, ontology="MsigdbC2CPall")
# view enrichment results for the top significant terms
xEnrichViewer(eTerm)
# save enrichment results to the file 'CEL_Pathway_enrichments.txt'
res_CEL_PW <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp), sortBy="adjp", details=F)
output <- data.frame(term=rownames(res_CEL_PW), res_CEL_PW)
write.table(output, file="CEL_Pathway_enrichments.txt", sep="\t", row.names=F)

Enrichment results for the top 10 significant pathways are shown below:

name                                                                    nAnno  nOverlap  zscore  pvalue   adjp
Selective expression of chemokine receptors during T-cell polarization  29     9         31.1    9.0e-20  4.7e-18
Cytokine-cytokine receptor interaction                                  266    12        13.1    5.2e-14  1.4e-12
The Co-Stimulatory Signal During T-cell Activation                      21     5         20.2    1.5e-11  2.7e-10
Chemokine receptors bind chemokines                                     56     6         14.7    1.3e-10  1.8e-09
Th1/Th2 Differentiation                                                 19     4         17.0    1.4e-09  1.3e-08
Chemokine signaling pathway                                             190    8         10.3    1.3e-09  1.3e-08
Costimulation by the CD28 family                                        62     5         11.5    1.6e-08  1.2e-07
IL12-mediated signaling events                                          63     5         11.4    1.7e-08  1.2e-07
IL12 signaling mediated by STAT4                                        33     4         12.8    2.8e-08  1.7e-07
G alpha (i) signalling events                                           193    7         8.8     3.7e-08  2.1e-07

5 Session Info

Here is the output of sessionInfo() on the system on which this user manual was built:

> R version 3.2.4 (2016-03-10)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: OS X 10.11.4 (El Capitan)
> 
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
> 
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
> [8] methods   base     
> 
> other attached packages:
>  [1] BiocInstaller_1.18.5 stringr_1.0.0        rmarkdown_0.9.5     
>  [4] RCircos_1.1.3        GenomicRanges_1.20.8 GenomeInfoDb_1.4.3  
>  [7] IRanges_2.2.9        S4Vectors_0.6.6      BiocGenerics_0.14.0 
> [10] XGR_1.0.0            dnet_1.0.8           supraHex_1.7.3      
> [13] hexbin_1.27.1        igraph_1.0.1        
> 
> loaded via a namespace (and not attached):
>  [1] Biobase_2.28.0            foreach_1.4.3            
>  [3] splines_3.2.4             Formula_1.2-1            
>  [5] highr_0.5.1               latticeExtra_0.6-28      
>  [7] RBGL_1.44.0               BSgenome_1.36.3          
>  [9] Rsamtools_1.20.4          yaml_2.1.13              
> [11] RSQLite_1.0.0             lattice_0.20-33          
> [13] biovizBase_1.16.0         digest_0.6.9             
> [15] RColorBrewer_1.1-2        XVector_0.8.0            
> [17] colorspace_1.2-6          ggbio_1.16.1             
> [19] htmltools_0.3             Matrix_1.2-4             
> [21] plyr_1.8.3                OrganismDbi_1.10.0       
> [23] XML_3.98-1.3              biomaRt_2.24.0           
> [25] zlibbioc_1.14.0           scales_0.4.0             
> [27] BiocParallel_1.2.18       ggplot2_2.1.0            
> [29] GenomicFeatures_1.20.1    nnet_7.3-12              
> [31] survival_2.38-3           magrittr_1.5             
> [33] evaluate_0.8.3            GGally_1.0.1             
> [35] nlme_3.1-125              MASS_7.3-45              
> [37] foreign_0.8-66            graph_1.46.0             
> [39] tools_3.2.4               doMC_1.3.4               
> [41] formatR_1.3               munsell_0.4.3            
> [43] cluster_2.0.3             AnnotationDbi_1.30.1     
> [45] lambda.r_1.1.7            Biostrings_2.36.1        
> [47] compiler_3.2.4            futile.logger_1.4.1      
> [49] grid_3.2.4                RCurl_1.95-4.7           
> [51] iterators_1.0.8           dichromat_2.0-0          
> [53] VariantAnnotation_1.14.13 bitops_1.0-6             
> [55] codetools_0.2-14          gtable_0.2.0             
> [57] DBI_0.3.1                 reshape_0.8.5            
> [59] reshape2_1.4.1            GenomicAlignments_1.4.1  
> [61] gridExtra_2.2.1           knitr_1.12.3             
> [63] rtracklayer_1.28.6        Hmisc_3.17-2             
> [65] futile.options_1.0.0      Rgraphviz_2.12.0         
> [67] ape_3.4                   stringi_1.0-1            
> [69] Rcpp_0.12.4               rpart_4.1-10             
> [71] acepack_1.3-3.3

Ashburner, M, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, et al. 2000. “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.” Nat Genet 25 (1): 25–29. doi:10.1038/75556.

Cerami, E. G., B. E. Gross, E. Demir, I. Rodchenkov, O. Babur, N. Anwar, N. Schultz, G. D. Bader, and C. Sander. 2011. “Pathway Commons, a web resource for biological pathway data.” Nucleic Acids Research 39 (Database): D685–D690. doi:10.1093/nar/gkq1039.

Fairfax, Benjamin P, Peter Humburg, Seiko Makino, Vivek Naranbhai, Daniel Wong, Evelyn Lau, Luke Jostins, et al. 2014. “Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression.” Science (New York, N.Y.) 343 (March): 1246949. doi:10.1126/science.1246949.

Fang, Hai, and Julian Gough. 2014. “The ’dnet’ approach promotes emerging research on cancer patient survival.” Genome Medicine 6 (8): 64. doi:10.1186/s13073-014-0064-8.

Köhler, Sebastian, Sandra C Doelken, Christopher J Mungall, Sebastian Bauer, Helen V Firth, Isabelle Bailleul-Forestier, Graeme C M Black, et al. 2013. “The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data.” Nucleic Acids Research, November, 1–9. doi:10.1093/nar/gkt1026.

Schriml, L M, C Arze, S Nadendla, Y W Chang, M Mazaitis, V Felix, G Feng, and W A Kibbe. 2012. “Disease Ontology: a backbone for disease semantic integration.” Nucleic Acids Res 40 (Database issue): D940–6. doi:10.1093/nar/gkr972.

Smith, C L, and J T Eppig. 2009. “The Mammalian Phenotype Ontology: enabling robust annotation and comparative analysis.” Wiley Interdiscip Rev Syst Biol Med 1 (3): 390–99. doi:10.1002/wsbm.44.

Szklarczyk, Damian, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, et al. 2015. “STRING v10: protein–protein interaction networks, integrated over the tree of life.” Nucleic Acids Res 43 (Database issue): D447–D452. doi:10.1093/nar/gku1003.

Welter, Danielle, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, et al. 2014. “The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.” Nucleic Acids Research 42 (D1): D1001–6. doi:10.1093/nar/gkt1229.