Creating an ontology_index

Daniel Greene

2016-10-27

An ontology_index can be created in three ways: by loading a pre-existing one (for example by calling data(hpo)), by reading an ontology encoded in OBO or OWL format into R using get_ontology, or by calling the function ontology_index explicitly. An ontology_index is a named list of properties for each term, where each property is represented by a list or vector named by term, so lookups by term ID are simple. All ontology_index objects contain the id, name, parents, children and ancestors for each term, and additional properties can be added. For details on how to use an ontology_index, see the ‘Introduction to ontologyX’ vignette.
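As a minimal sketch of calling ontology_index explicitly, the snippet below builds a toy index from a list of parents and looks up properties by term. The term names are made up for illustration, and the snippet assumes the ontologyIndex package is installed.

```r
library(ontologyIndex)

# Made-up toy hierarchy: each element lists the parents of the named term
animal_superclasses <- list(
    animal=character(0),
    mammal="animal",
    cat="mammal",
    fish="animal"
)
animal_ontology <- ontology_index(parents=animal_superclasses)

# Every property is named by term, so a lookup is plain list/vector indexing
print(animal_ontology$parents[["cat"]])
print(animal_ontology$children[["animal"]])
print(animal_ontology$ancestors[["cat"]])
```

Because each property is an ordinary named list or vector, the usual R subsetting operators are all that is needed to query it.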

Reading in from a file

The functions get_OBO, get_OWL and get_ontology can read ontologies into R as ontology_index objects, the latter determining which function to call based on the file name. By default, the properties id, name, parents, children, ancestors and obsolete are populated.

Note: the xml2 package is required in order to read ontologies encoded in OWL format.

The syntax is simple:

ontology <- get_ontology(file)

Additional information is often present in the original files, for example definitions (the def field in OBO format) and part_of relationships to other terms. It is possible to include these extra properties by passing additional arguments to get_ontology or get_OBO/get_OWL as appropriate. The names given to the additional arguments become the names of the properties included in the returned ontology_index. The format of the arguments depends on whether you are reading an OBO or OWL formatted file.

For OBO files, these arguments should be character vectors in which the first element is a regular expression specifying what text to extract as the property, and the second element gives the mode of the returned array (defaulting to character, but potentially also taking the values logical, numeric, list, etc.). Here, for example, we additionally extract the "definition" field for each term:

ontology <- get_OBO(file, definition=c("^def: (.*)", "character"))

Now, in addition to the standard properties, ontology will have the ontology$definition field populated: a character vector named by term, as usual. Thus, definitions can be extracted simply using ontology$definition[termID].
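To see what such a regular expression captures, it can be tried on a few lines of OBO text using base R alone. This is only an illustration of the pattern, not of how get_OBO applies it internally; the stanza content is made up.

```r
# Lines as they might appear in one [Term] stanza of an OBO file (made-up content)
stanza <- c(
    "id: EX:0000001",
    "name: example term",
    "def: \"A made-up definition.\" [EX:curator]"
)

# The capture group (.*) takes everything after the "def: " tag
pattern <- "^def: (.*)"

# Keep only the lines carrying the tag, then pull out the captured text
matched <- grep(pattern, stanza, value=TRUE)
definition <- sub(pattern, "\\1", matched)
print(definition)
```

Testing a pattern this way before passing it to the reading function makes it easier to spot an expression that matches the tag but captures nothing.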

For OWL files, the additional arguments should be functions which, given an xml_nodeset (see the xml2 package) corresponding to all the terms in the ontology, return lists or vectors of term properties. A selection of such functions is included with the package, including OWL_IDs, OWL_labels, OWL_obsolete and OWL_is_a. The following expression can be used to load an ontology_index with the part_of relationship populated from an OWL formatted ontology:

ontology <- get_OWL(file, part_of=OWL_part_of)

Here, OWL_part_of is a function which returns a list of character vectors containing the part_of relationships for each term. Custom functions based on XPath expressions can be defined if other term properties are desired; see ?OWL_strings_from_nodes for details.
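As a rough sketch of what a custom XPath-based property function could look like, the snippet below uses xml2 directly on a tiny, made-up OWL fragment. The function name OWL_comments is hypothetical and not part of the package, and the snippet does not use OWL_strings_from_nodes; it only shows the general shape of a function that maps a node set of terms to one value per term.

```r
library(xml2)

# A tiny, made-up OWL fragment containing two classes
owl_text <- '<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/EX_0000001">
    <rdfs:label>root</rdfs:label>
    <rdfs:comment>Top of the toy hierarchy.</rdfs:comment>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/EX_0000002">
    <rdfs:label>leaf</rdfs:label>
  </owl:Class>
</rdf:RDF>'

doc <- read_xml(owl_text)
ns <- xml_ns(doc)

# One node per term
nodes <- xml_find_all(doc, "//owl:Class", ns)

# Hypothetical property function: one rdfs:comment string per term, NA if absent
OWL_comments <- function(nodes) {
    xml_text(xml_find_first(nodes, "./rdfs:comment", ns))
}

comments <- OWL_comments(nodes)
print(comments)
```

Using xml_find_first rather than xml_find_all keeps the result aligned with the terms, since it yields exactly one entry per input node even when the element is absent.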

By default, an error is thrown if the ontology specifies parent-offspring relationships involving terms which are missing. However, the ontology reading functions get_ontology, get_OWL and get_OBO have a remove_missing argument (FALSE by default). Setting it to TRUE causes parent-offspring relationships with a missing term to be ignored, so the complete parts of the ontology can still be used.
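The effect of ignoring relationships with missing terms can be pictured with a small base R sketch. This is an illustration of the idea on a made-up list of parents, not the package's own implementation.

```r
# Made-up parent relationships; "Z" is referenced but never defined as a term
parents <- list(
    A=character(0),
    B="A",
    C=c("B", "Z")
)

# Terms that are referred to as parents but have no entry of their own
missing_terms <- setdiff(unlist(parents, use.names=FALSE), names(parents))
print(missing_terms)

# Drop every relationship that points at a missing term
cleaned <- lapply(parents, function(p) p[p %in% names(parents)])
print(cleaned$C)
```

After the clean-up, term C keeps its relationship to B, so the part of the hierarchy that is fully specified remains usable.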

When reading ontologies from OWL files, the term IDs are often given as URLs:

"http://purl.obolibrary.org/obo/<ontology code>_<number>",

and thus the term IDs stored by the ontology_index object obtained with get_OWL will also be in this form. However, data is often supplied with the OBO ID format: <ontology code>:<number>. Two possible ways to make such data compatible with the ontologyX packages are to transform the term IDs in the data to the OWL format (i.e. to match the index), or to transform the ontology_index object itself (to match the data, so that its IDs are in OBO style). The former is simpler and can be done by replacing the colon in each ID with an underscore and pasting the result onto the appropriate URL prefix. The latter can be done with the help of regular expressions, for example as demonstrated below, by looping over the properties in the ontology_index and transforming them as appropriate depending on whether they are lists or vectors:

# Pattern matching OWL-style term URLs, capturing the ontology code and the number
term_url <- "http://purl.obolibrary.org/obo/(\\w+)_(\\d+)"
to_obo <- function(x) gsub(x=x, pattern=term_url, replacement="\\1:\\2")
for (property in names(ontology)) {
    values <- ontology[[property]]
    # Only character-valued properties can contain term IDs
    if (is.character(unlist(use.names=FALSE, values)))
        ontology[[property]] <- if (is.list(values)) lapply(values, to_obo) else to_obo(values)
    names(ontology[[property]]) <- to_obo(names(ontology[[property]]))
}
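The simpler approach, converting OBO-style IDs in the data to the URL form used by the index, might look like the following base R sketch. It replaces the colon with an underscore and prepends the URL prefix shown earlier; the example IDs are for illustration only.

```r
# OBO-style IDs as they might appear in a data set (illustrative values)
obo_ids <- c("HP:0000001", "HP:0000118")

# Replace the colon with an underscore and prepend the URL prefix
url_prefix <- "http://purl.obolibrary.org/obo/"
owl_ids <- paste0(url_prefix, sub(":", "_", obo_ids, fixed=TRUE))
print(owl_ids)
```

This leaves the ontology_index untouched, which is convenient when the same index is shared between several data sets.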

Adding term properties

Modifying an existing ontology_index to add term properties is as simple as modifying a list. In the example below, we’ll add the number of children for each term.

ontology$number_of_children <- sapply(ontology$children, length)

In the same manner, a valid ontology_index can be built up from scratch as a list, provided that the standard properties are included so that it can be used with functions in ontologyIndex.
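A minimal base R sketch of assembling the standard properties by hand is given below. The term names are made up, the helper get_anc is hypothetical, and the convention that a term's ancestors include the term itself is an assumption; whether every ontologyIndex function accepts a list built this way is not verified here.

```r
# Made-up toy hierarchy given as parents per term
parents <- list(root=character(0), a="root", b="root", c=c("a", "b"))
ids <- names(parents)

# children: invert the parent relationships
children <- lapply(setNames(nm=ids), function(term)
    ids[vapply(parents, function(p) term %in% p, logical(1))])

# ancestors: the term itself plus everything reachable through its parents
# (including the term itself is an assumption made for this sketch)
get_anc <- function(term) unique(c(term, unlist(lapply(parents[[term]], get_anc))))
ancestors <- lapply(setNames(nm=ids), get_anc)

# Assemble the standard properties, each named by term
toy <- list(
    id=setNames(nm=ids),
    name=setNames(nm=ids),
    parents=parents,
    children=children,
    ancestors=ancestors
)

# Extra properties are added exactly as for any other list
toy$number_of_children <- sapply(toy$children, length)
print(toy$number_of_children)
```

In practice, calling ontology_index with the parents is less error-prone, since it derives the remaining properties itself.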