Functional Annotation with Ensembl Biomart, GO, and KEGG

2017-05-28

Functional Annotation with Ensembl Biomart

The Ensembl Biomart database enables users to retrieve a vast diversity of annotation data for specific organisms. Initially, Steffen Durinck and Wolfgang Huber provided a powerful interface between the R language and Ensembl Biomart in the form of the R package biomaRt. The biomartr package extends the functionality of the biomaRt package and introduces a more organism-centered annotation retrieval concept.

The following sections will introduce users to the functionality and data retrieval procedures of biomartr and will show how biomartr extends the functionality of the initial biomaRt package.

Getting Started with biomaRt

The best way to get started with the methodology presented by the established biomaRt package is to understand the workflow of data retrieval. The database provided by Ensembl Biomart is organized into so-called marts, datasets, and attributes. When users want to retrieve information for a specific organism of interest, they first need to specify the mart and dataset in which the information for that organism can be found. Subsequently, they can specify the attributes that should be returned for the corresponding organism.

The availability of marts, datasets, and attributes can be checked by the following functions:

# install the biomaRt package
# source("http://bioconductor.org/biocLite.R")
# biocLite("biomaRt")

# load biomaRt
library(biomaRt)

# look at top 10 databases
head(listMarts(host = "www.ensembl.org"), 10)
#>                biomart               version
#> 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 88
#> 2   ENSEMBL_MART_MOUSE      Mouse strains 88
#> 3     ENSEMBL_MART_SNP  Ensembl Variation 88
#> 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 88
#> 5    ENSEMBL_MART_VEGA                  Vega

Users will observe that several marts providing annotation for specific classes of organisms or groups of organisms are available.

For our example, we will choose the ENSEMBL_MART_ENSEMBL mart and list all available datasets that are elements of this mart.

head(listDatasets(useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")), 10)
#>                        dataset                     description         version
#> 1       csavignyi_gene_ensembl     C.savignyi genes (CSAV 2.0)        CSAV 2.0
#> 2        tguttata_gene_ensembl Zebra Finch genes (taeGut3.2.4)     taeGut3.2.4
#> 3       pcapensis_gene_ensembl           Hyrax genes (proCap1)         proCap1
#> 4      mlucifugus_gene_ensembl      Microbat genes (Myoluc2.0)       Myoluc2.0
#> 5          fcatus_gene_ensembl     Cat genes (Felis_catus_6.2) Felis_catus_6.2
#> 6        saraneus_gene_ensembl           Shrew genes (sorAra1)         sorAra1
#> 7      mgallopavo_gene_ensembl      Turkey genes (Turkey_2.01)     Turkey_2.01
#> 8       trubripes_gene_ensembl           Fugu genes (FUGU 4.0)        FUGU 4.0
#> 9  aplatyrhynchos_gene_ensembl       Duck genes (BGI_duck_1.0)    BGI_duck_1.0
#> 10  dmelanogaster_gene_ensembl          Fruitfly genes (BDGP6)           BDGP6

The useMart() function is a wrapper function provided by biomaRt to connect a selected BioMart database (mart) with a corresponding dataset stored within this mart.

We select dataset hsapiens_gene_ensembl and now check for available attributes (annotation data) that can be accessed for Homo sapiens genes.

head(listAttributes(useDataset(dataset = "hsapiens_gene_ensembl", 
                               mart    = useMart("ENSEMBL_MART_ENSEMBL",
                                                 host = "www.ensembl.org"))), 10)
#>                     name              description         page
#> 1        ensembl_gene_id           Gene stable ID feature_page
#> 2  ensembl_transcript_id     Transcript stable ID feature_page
#> 3     ensembl_peptide_id        Protein stable ID feature_page
#> 4        ensembl_exon_id           Exon stable ID feature_page
#> 5            description         Gene description feature_page
#> 6        chromosome_name Chromosome/scaffold name feature_page
#> 7         start_position          Gene Start (bp) feature_page
#> 8           end_position            Gene End (bp) feature_page
#> 9                 strand                   Strand feature_page
#> 10                  band           Karyotype band feature_page

Please note the nested structure of this attribute query. An attribute query requires an additional wrapper function named useDataset(), in which useMart() and a corresponding dataset need to be specified. The result is a table storing the names of the available attributes for Homo sapiens as well as a short description of each.

Furthermore, users can retrieve all filters for Homo sapiens that can be specified in the actual BioMart query.
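The nesting can also be unrolled into two separate steps, which some readers may find easier to follow. The sketch below wraps both steps in a small helper; the name connect_dataset() is hypothetical and not part of biomaRt, and nothing contacts the Ensembl server until the helper is called.

```r
# Hypothetical helper (not part of biomaRt): connect to a mart first,
# then select a dataset within it, instead of nesting both calls.
connect_dataset <- function(dataset,
                            mart_name = "ENSEMBL_MART_ENSEMBL",
                            host      = "www.ensembl.org") {
  # step 1: connect to the BioMart database (mart)
  ensembl <- biomaRt::useMart(mart_name, host = host)
  # step 2: select the dataset stored within this mart
  biomaRt::useDataset(dataset = dataset, mart = ensembl)
}

# Usage (requires an internet connection and the biomaRt package):
# hsapiens <- connect_dataset("hsapiens_gene_ensembl")
# head(biomaRt::listAttributes(hsapiens), 10)
```

The returned object can be passed to listAttributes(), listFilters(), or getBM() in the same way as the nested expression.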

head(listFilters(useDataset(dataset = "hsapiens_gene_ensembl", 
                            mart    = useMart("ENSEMBL_MART_ENSEMBL",
                                              host = "www.ensembl.org"))), 10)
#>                  name                            description
#> 1     chromosome_name               Chromosome/scaffold name
#> 2               start                                  Start
#> 3                 end                                    End
#> 4          band_start                             Band Start
#> 5            band_end                               Band End
#> 6        marker_start                           Marker Start
#> 7          marker_end                             Marker End
#> 8       encode_region                          Encode region
#> 9              strand                                 Strand
#> 10 chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1

After accumulating all this information, it is now possible to perform an actual BioMart query by using the getBM() function.

In this example we will retrieve the attributes start_position, end_position, and description for the Homo sapiens gene "GUCA2A".

Since the input gene is given as an HGNC symbol, we need to specify the filters argument as filters = "hgnc_symbol".

# 1) select a mart and data set
mart <- useDataset(dataset = "hsapiens_gene_ensembl", 
                   mart    = useMart("ENSEMBL_MART_ENSEMBL",
                                     host = "www.ensembl.org"))

# 2) run a biomart query using the getBM() function
# and specify the attributes and filter arguments
geneSet <- "GUCA2A"

resultTable <- getBM(attributes = c("start_position","end_position","description"),
                     filters    = "hgnc_symbol", 
                     values     = geneSet, 
                     mart       = mart)

resultTable 
#>   start_position end_position                                                       description
#> 1       42162691     42164718 guanylate cyclase activator 2A [Source:HGNC Symbol;Acc:HGNC:4682]

When using getBM() users can pass all attributes retrieved by listAttributes() to the attributes argument of the getBM() function.
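Since getBM() expects attribute names that exist for the selected dataset, it can help to check a request against the attribute table before sending the query. The following is a minimal sketch: check_attributes() is a hypothetical helper, and the small table is copied by hand from the listAttributes() output shown earlier rather than fetched from Ensembl.

```r
# Attribute names copied from the listAttributes() output shown earlier;
# in practice this table would come from listAttributes(mart).
attr_tbl <- data.frame(
  name = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id",
           "ensembl_exon_id", "description", "chromosome_name",
           "start_position", "end_position", "strand", "band"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the requested attributes missing from the table
check_attributes <- function(requested, attr_tbl) {
  setdiff(requested, attr_tbl$name)
}

# all three attributes of the example query are available
check_attributes(c("start_position", "end_position", "description"), attr_tbl)
#> character(0)

# a misspelled attribute is reported
check_attributes(c("start_position", "end_positon"), attr_tbl)
#> [1] "end_positon"
```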

Getting Started with biomartr

The query methodology provided by Ensembl Biomart and the biomaRt package is a very well defined approach for accurate annotation retrieval. Nevertheless, it can (subjectively) seem non-intuitive to users who are learning it. Therefore, the biomartr package provides another query methodology that aims to be more organism centric.

Taken together, the following workflow allows users to perform fast BioMart queries for attributes using the biomart() function implemented in the biomartr package:

  1. get attributes, datasets, and marts via: organismAttributes()

  2. choose available biological features (filters) via: organismFilters()

  3. specify a set of query genes: e.g. retrieved with getGenome(), getProteome() or getCDS()

  4. specify all arguments of the biomart() function using steps 1) - 3) and perform a BioMart query
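As a rough sketch of these four steps, the function below chains the calls for a single organism. It uses the argument names that appear in the biomart() example later in this document; annotate_genes() is a hypothetical wrapper, the column name `name` in the attribute and filter tables is an assumption, and no query is sent until the function is called.

```r
# Hypothetical wrapper around the biomartr workflow (steps 1-4)
annotate_genes <- function(genes,
                           organism   = "Homo sapiens",
                           mart       = "ENSEMBL_MART_ENSEMBL",
                           dataset    = "hsapiens_gene_ensembl",
                           attributes = c("start_position", "end_position",
                                          "description"),
                           filters    = "hgnc_symbol") {
  # 1) look up attributes together with their datasets and marts
  available_attributes <- biomartr::organismAttributes(organism)
  # 2) look up the available filters
  available_filters <- biomartr::organismFilters(organism)
  # stop early if a requested name is not offered for this organism
  # (assumes both tables store the names in a column called `name`)
  stopifnot(all(attributes %in% available_attributes$name),
            all(filters %in% available_filters$name))
  # 3) + 4) run the BioMart query for the given set of genes
  biomartr::biomart(genes      = genes,
                    mart       = mart,
                    dataset    = dataset,
                    attributes = attributes,
                    filters    = filters)
}

# Usage (requires an internet connection and the biomartr package):
# annotate_genes("GUCA2A")
```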

Note that dataset names change very frequently due to updates of dataset versions. If some query functions do not work properly, users should therefore check with organismAttributes(update = TRUE) whether their dataset name has changed. For example, organismAttributes("Homo sapiens", topic = "id", update = TRUE) might reveal that a dataset name within the mart ENSEMBL_MART_ENSEMBL has changed.

Retrieve marts, datasets, attributes, and filters with biomartr

Retrieve Available Marts

The getMarts() function allows users to list all available databases that can be accessed through BioMart interfaces.

# load the biomartr package
library(biomartr)

# list all available databases
getMarts()
#> # A tibble: 16 x 2
#>                     mart                        version
#>                    <chr>                          <chr>
#>  1  ENSEMBL_MART_ENSEMBL               Ensembl Genes 88
#>  2    ENSEMBL_MART_MOUSE               Mouse strains 88
#>  3 ENSEMBL_MART_SEQUENCE                       Sequence
#>  4 ENSEMBL_MART_ONTOLOGY                       Ontology
#>  5  ENSEMBL_MART_GENOMIC            Genomic features 88
#>  6      ENSEMBL_MART_SNP           Ensembl Variation 88
#>  7  ENSEMBL_MART_FUNCGEN          Ensembl Regulation 88
#>  8     ENSEMBL_MART_VEGA                           Vega
#>  9           plants_mart        Ensembl Plants Genes 35
#> 10     plants_variations   Ensembl Plants Variations 35
#> 11           fungal_mart         Ensembl Fungi Genes 35
#> 12     fungal_variations    Ensembl Fungi Variations 35
#> 13          protist_mart      Ensembl Protists Genes 35
#> 14    protist_variations Ensembl Protists Variations 35
#> 15          metazoa_mart       Ensembl Metazoa Genes 35
#> 16    metazoa_variations  Ensembl Metazoa Variations 35

Retrieve Available Datasets from a Specific Mart

Now users can select a specific database to list all available datasets that can be accessed through this database. In this example we choose the ENSEMBL_MART_ENSEMBL database.

head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 5)
#> # A tibble: 5 x 3
#>                   dataset                 description         version
#>                     <chr>                       <chr>           <chr>
#> 1  pcapensis_gene_ensembl       Hyrax genes (proCap1)         proCap1
#> 2 mlucifugus_gene_ensembl  Microbat genes (Myoluc2.0)       Myoluc2.0
#> 3     fcatus_gene_ensembl Cat genes (Felis_catus_6.2) Felis_catus_6.2
#> 4   saraneus_gene_ensembl       Shrew genes (sorAra1)         sorAra1
#> 5 mgallopavo_gene_ensembl  Turkey genes (Turkey_2.01)     Turkey_2.01

Further datasets of this mart can be inspected with tail(). In the following sections we will select the dataset hsapiens_gene_ensembl and list the attributes that can be retrieved from it.

tail(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 38)
#> # A tibble: 38 x 3
#>                       dataset                   description    version
#>                         <chr>                         <chr>      <chr>
#>  1      csabaeus_gene_ensembl  Vervet-AGM genes (ChlSab1.1)  ChlSab1.1
#>  2     pvampyrus_gene_ensembl       Megabat genes (pteVam1)    pteVam1
#>  3     pcapensis_gene_ensembl         Hyrax genes (proCap1)    proCap1
#>  4  ptroglodytes_gene_ensembl Chimpanzee genes (CHIMP2.1.4) CHIMP2.1.4
#>  5    cporcellus_gene_ensembl    Guinea Pig genes (cavPor3)    cavPor3
#>  6  amelanoleuca_gene_ensembl         Panda genes (ailMel1)    ailMel1
#>  7       btaurus_gene_ensembl            Cow genes (UMD3.1)     UMD3.1
#>  8     trubripes_gene_ensembl         Fugu genes (FUGU 4.0)   FUGU 4.0
#>  9        oaries_gene_ensembl        Sheep genes (Oar_v3.1)   Oar_v3.1
#> 10 dnovemcinctus_gene_ensembl   Armadillo genes (Dasnov3.0)  Dasnov3.0
#> # ... with 28 more rows

Retrieve Available Attributes from a Specific Dataset

Having selected a database (ENSEMBL_MART_ENSEMBL) and a dataset (hsapiens_gene_ensembl), users can list all available attributes for this dataset using the getAttributes() function.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# list all available attributes for dataset: hsapiens_gene_ensembl
head( getAttributes(mart    = "ENSEMBL_MART_ENSEMBL", 
                    dataset = "hsapiens_gene_ensembl"), 10 )
#>                     name              description
#> 1        ensembl_gene_id           Gene stable ID
#> 2  ensembl_transcript_id     Transcript stable ID
#> 3     ensembl_peptide_id        Protein stable ID
#> 4        ensembl_exon_id           Exon stable ID
#> 5            description         Gene description
#> 6        chromosome_name Chromosome/scaffold name
#> 7         start_position          Gene Start (bp)
#> 8           end_position            Gene End (bp)
#> 9                 strand                   Strand
#> 10                  band           Karyotype band

Retrieve Available Filters from a Specific Dataset

Finally, the getFilters() function allows users to list available filters for a specific dataset that can be used for a biomart() query.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# list all available filters for dataset: hsapiens_gene_ensembl
head( getFilters(mart    = "ENSEMBL_MART_ENSEMBL", 
                 dataset = "hsapiens_gene_ensembl"), 10 )
#>                  name                            description
#> 1     chromosome_name               Chromosome/scaffold name
#> 2               start                                  Start
#> 3                 end                                    End
#> 4          band_start                             Band Start
#> 5            band_end                               Band End
#> 6        marker_start                           Marker Start
#> 7          marker_end                             Marker End
#> 8       encode_region                          Encode region
#> 9              strand                                 Strand
#> 10 chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1

Organism Specific Retrieval of Information

In most use cases, users will work with a single model organism or a set of model organisms and will mostly be interested in specific annotations for these organisms. The organismBM() function addresses this need and provides users with an organism centric query to the marts and datasets that are available for a particular organism of interest.

Note that when running the following functions for the first time, the data retrieval procedure will take some time due to the remote access to BioMart. The corresponding result is then saved in a text file named _biomart/listDatasets.txt within the tempdir() folder, allowing subsequent queries to be performed much faster. The tempdir() folder, however, is deleted when the R session ends. In a new R session, the initial call of these functions will therefore again take time to retrieve all organism specific data from the BioMart database.

This concept of locally storing all organism specific database linking information available in BioMart into an internal file allows users to significantly speed up subsequent retrieval queries for that particular organism.
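Whether such a cache already exists in the current session can be checked with base R alone. This is a small sketch that assumes the file location described above; it only inspects the file system and does not contact BioMart.

```r
# Path of the cache file as described above (inside this session's tempdir())
cache_file <- file.path(tempdir(), "_biomart", "listDatasets.txt")

if (file.exists(cache_file)) {
  message("Cached dataset listing found: ", cache_file)
} else {
  message("No cache yet; the first organism query will contact BioMart.")
}
```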

# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
organismBM(organism = "Homo sapiens")
#> # A tibble: 17 x 5
#>    organism_name                                                                            description                  mart                       dataset    version
#>            <chr>                                                                                  <chr>                 <chr>                         <chr>      <chr>
#>  1      hsapiens                                                               Human genes (GRCh38.p10)  ENSEMBL_MART_ENSEMBL         hsapiens_gene_ensembl GRCh38.p10
#>  2      hsapiens                                                    homo_sapiens sequences (GRCh38.p10) ENSEMBL_MART_SEQUENCE     hsapiens_genomic_sequence GRCh38.p10
#>  3      hsapiens                                                                         marker_feature  ENSEMBL_MART_GENOMIC         hsapiens_marker_start GRCh38.p10
#>  4      hsapiens                                                                        karyotype_start  ENSEMBL_MART_GENOMIC      hsapiens_karyotype_start GRCh38.p10
#>  5      hsapiens                                                                          karyotype_end  ENSEMBL_MART_GENOMIC        hsapiens_karyotype_end GRCh38.p10
#>  6      hsapiens                                                                     marker_feature_end  ENSEMBL_MART_GENOMIC           hsapiens_marker_end GRCh38.p10
#>  7      hsapiens                                                                                 encode  ENSEMBL_MART_GENOMIC               hsapiens_encode GRCh38.p10
#>  8      hsapiens         Human Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p10)      ENSEMBL_MART_SNP                  hsapiens_snp GRCh38.p10
#>  9      hsapiens                                                 Human Structural Variants (GRCh38.p10)      ENSEMBL_MART_SNP            hsapiens_structvar GRCh38.p10
#> 10      hsapiens Human Somatic Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p10)      ENSEMBL_MART_SNP              hsapiens_snp_som GRCh38.p10
#> 11      hsapiens                                         Human Somatic Structural Variants (GRCh38.p10)      ENSEMBL_MART_SNP        hsapiens_structvar_som GRCh38.p10
#> 12      hsapiens                                                      Human Binding Motifs (GRCh38.p10)  ENSEMBL_MART_FUNCGEN        hsapiens_motif_feature GRCh38.p10
#> 13      hsapiens                                            Human Other Regulatory Regions (GRCh38.p10)  ENSEMBL_MART_FUNCGEN     hsapiens_external_feature GRCh38.p10
#> 14      hsapiens                                                Human miRNA Target Regions (GRCh38.p10)  ENSEMBL_MART_FUNCGEN hsapiens_mirna_target_feature GRCh38.p10
#> 15      hsapiens                                                 Human Regulatory Features (GRCh38.p10)  ENSEMBL_MART_FUNCGEN   hsapiens_regulatory_feature GRCh38.p10
#> 16      hsapiens                                                 Human Regulatory Evidence (GRCh38.p10)  ENSEMBL_MART_FUNCGEN    hsapiens_annotated_feature GRCh38.p10
#> 17      hsapiens                                                               Human genes (GRCh38.p10)     ENSEMBL_MART_VEGA            hsapiens_gene_vega GRCh38.p10

The result is a table storing all marts and datasets from which annotations can be retrieved for Homo sapiens. Furthermore, a short description as well as the version of the dataset being accessed (very useful for publications) is returned.

Users will observe that 6 different marts provide 17 different datasets storing annotation information for Homo sapiens.
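The number of marts and datasets does not have to be read off by eye. As a small sketch, the mart column below is copied by hand from the organismBM() output shown above (a live session would use organismBM(organism = "Homo sapiens")$mart instead), and base R summarizes it.

```r
# mart column copied from the organismBM("Homo sapiens") output shown above
mart <- c("ENSEMBL_MART_ENSEMBL", "ENSEMBL_MART_SEQUENCE",
          rep("ENSEMBL_MART_GENOMIC", 5), rep("ENSEMBL_MART_SNP", 4),
          rep("ENSEMBL_MART_FUNCGEN", 5), "ENSEMBL_MART_VEGA")

n_datasets <- length(mart)          # one row per dataset
n_marts    <- length(unique(mart))  # number of distinct marts

# number of datasets provided by each mart
datasets_per_mart <- table(mart)
datasets_per_mart
```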

Please note, however, that scientific names of organisms must be written correctly! For example, “Homo Sapiens” will not be recognized, whereas “Homo sapiens” will be.
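If organism names come from user input, they can be brought into the expected capitalization before being passed on. The helper below is a hypothetical sketch and not part of biomartr; it only adjusts letter case and whitespace and cannot repair misspelled names.

```r
# Hypothetical helper: capitalize the genus and lower-case the rest,
# e.g. "Homo Sapiens" -> "Homo sapiens"
normalize_species <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  genus <- paste0(toupper(substring(parts[1], 1, 1)),
                  tolower(substring(parts[1], 2)))
  paste(c(genus, tolower(parts[-1])), collapse = " ")
}

normalize_species("Homo Sapiens")
#> [1] "Homo sapiens"
normalize_species("homo sapiens")
#> [1] "Homo sapiens"
```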

Similar to the biomaRt package query methodology, users need to specify attributes and filters to be able to perform accurate BioMart queries. Here the functions organismAttributes() and organismFilters() provide useful and intuitive concepts to obtain this information.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# return available attributes for "Homo sapiens"
head(biomartr::organismAttributes("Homo sapiens"), 20)

Users will observe that the organismAttributes() function returns a data.frame storing attribute names, datasets, and marts which are available for Homo sapiens. Since Ensembl release 87, the ENSEMBL_MART_SEQUENCE service provided by Ensembl has not been working properly. The organismAttributes() function therefore prints warning messages to make users aware when certain marts provided by Ensembl do not work properly yet.

An additional feature provided by organismAttributes() is the topic argument. The topic argument allows users to search for specific attributes, topics, or categories for faster filtering.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "id"
head(organismAttributes("Homo sapiens", topic = "id"), 20)

Now, all attribute names having id as part of their name are being returned.
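The effect of the topic argument can be pictured as a substring search on the attribute names. The sketch below is only an illustration on a hand-copied vector, not the package's actual implementation; it reuses the ten attribute names listed earlier for hsapiens_gene_ensembl.

```r
# attribute names copied from the getAttributes() output shown earlier
attribute_names <- c("ensembl_gene_id", "ensembl_transcript_id",
                     "ensembl_peptide_id", "ensembl_exon_id", "description",
                     "chromosome_name", "start_position", "end_position",
                     "strand", "band")

# keep every name that contains the literal string "id"
id_attributes <- attribute_names[grepl("id", attribute_names, fixed = TRUE)]

# the four names ending in "_id" remain
id_attributes
```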

Another example is topic = "homolog".

# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "homolog"
head(organismAttributes("Homo sapiens", topic = "homolog"), 20)

Or topic = "dn" and topic = "ds" for dn and ds value retrieval.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "dn"
head(organismAttributes("Homo sapiens", topic = "dn"))
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "ds"
head(organismAttributes("Homo sapiens", topic = "ds"))

Analogous to the organismAttributes() function, the organismFilters() function returns all filters that are available for a query organism of interest.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# return available filters for "Homo sapiens"
head(organismFilters("Homo sapiens"), 20)

The organismFilters() function also allows users to search for filters that correspond to a specific topic or category.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for filter topic "id"
head(organismFilters("Homo sapiens", topic = "id"), 20)

Performing BioMart queries with biomartr

The short introduction to the functionality of organismBM(), organismAttributes(), and organismFilters() will allow users to perform BioMart queries in a very intuitive organism centric way. The main function to perform BioMart queries is biomart().

For the following examples we will assume that we are interested in the annotation of specific genes from the Homo sapiens proteome. We want to map the corresponding refseq gene ids to a set of other gene ids used in other databases. For this purpose, we first need to consult the organismAttributes() function.

# show all elements of the data.frame
options(tibble.print_max = Inf)

head(organismAttributes("Homo sapiens", topic = "id"))

# retrieve the proteome of Homo sapiens from refseq
file_path <- getProteome( db       = "refseq",
                          organism = "Homo sapiens",
                          path     = file.path("_ncbi_downloads","proteomes") )

Hsapiens_proteome <- read_proteome(file_path, format = "fasta")

# remove the version suffix (e.g. ".2") from each id
gene_set <- unlist(sapply(strsplit(names(Hsapiens_proteome)[1:5], ".", fixed = TRUE), function(x) x[1]))

result_BM <- biomart( genes      = gene_set,
                      mart       = "ENSEMBL_MART_ENSEMBL", 
                      dataset    = "hsapiens_gene_ensembl",
                      attributes = c("ensembl_gene_id","ensembl_peptide_id"),
                      filters    = "refseq_peptide")

result_BM 

The biomart() function takes as arguments a set of genes (gene ids of the type specified in the filters argument), the corresponding mart and dataset, as well as the attributes which shall be returned.
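The id clean-up step in the example above keeps only the text before the first "." of each sequence name, so that the ids match the refseq_peptide filter. It can be tried offline on a few strings. The headers below are made-up placeholders in the style of RefSeq protein FASTA names, not real records.

```r
# Made-up placeholder headers (not real RefSeq records)
fasta_names <- c("NP_000001.1 hypothetical protein A [Homo sapiens]",
                 "NP_000002.3 hypothetical protein B [Homo sapiens]")

# split at each literal "." and keep the part before the first one
ids <- vapply(strsplit(fasta_names, ".", fixed = TRUE),
              function(x) x[1], character(1))
ids
#> [1] "NP_000001" "NP_000002"
```

Using vapply() instead of sapply() guarantees a character vector of the same length as the input, so no additional unlist() is needed.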

Gene Ontology

The biomartr package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO() function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO() function allows GO information retrieval from the Ensembl Biomart database.

In this example we will retrieve GO information for a set of Homo sapiens genes stored as hgnc_symbol.

GO Annotation Retrieval via BioMart

The getGO() function takes several arguments as input to retrieve GO information from BioMart. First, the scientific name of the organism of interest needs to be specified. Furthermore, a set of gene ids as well as their corresponding filter notation need to be specified (for example, the gene symbol GUCA2A has the filter notation hgnc_symbol; see organismFilters() for details). The database argument then defines the database from which GO information shall be retrieved.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for GO terms of an example Homo sapiens gene
GO_tbl <- getGO(organism = "Homo sapiens", 
                genes    = "GUCA2A",
                filters  = "hgnc_symbol")

GO_tbl

Hence, for each gene id the resulting table stores all annotated GO terms found in Ensembl Biomart.