This vignette contains example use cases for taxonbridge:
1. Detecting ambiguity and inconsistencies
2. Annotating a custom taxonomy
3. Visualizing a custom taxonomy
The first example illustrates how to detect ambiguity and inconsistency in a merged taxonomy. Start by loading the 2000-row sample dataset that comes with taxonbridge:
library(taxonbridge)
sample <- load_sample()
dim(sample)
#> [1] 2000 20
Next, retrieve all rows that have lineage information in both the GBIF backbone and NCBI:
lineages <- get_lineages(sample)
Then validate the lineages at the kingdom and family taxonomic ranks, and collect the resulting tibbles in a list. The phylum, class, and order ranks may also be used. In this example, setting valid = FALSE returns the entries that failed validation.
kingdom <- get_validity(lineages, rank = "kingdom", valid = FALSE)
#> Term conversion carried out on kingdom taxonomic rank
family <- get_validity(lineages, rank = "family", valid = FALSE)
candidates <- list(kingdom, family)
Finally, detect candidate incongruencies (excluding those with uninomial scientific names):
get_inconsistencies(candidates, uninomials = FALSE)
#> [1] "Gordonia neofelifaecis" "Attheya septentrionalis"
Two binomial names exhibit incongruency. Upon reference to the literature and the individual entries it can be seen that:
Attheya septentrionalis is assigned to different families of the problematica order Chaetocerotales
Gordonia neofelifaecis is a plant (family: Theaceae) in the GBIF but a bacterium in the NCBI (family: Gordoniaceae)
Attheya septentrionalis has the status “synonym” in the GBIF data:
lineages[lineages$canonicalName == "Attheya septentrionalis", "taxonomicStatus"]
#> # A tibble: 1 × 1
#> taxonomicStatus
#> <chr>
#> 1 synonym
Applying the get_status() function and rerunning the exercise leaves only Gordonia neofelifaecis as a binomial incongruency with biological provenance:
lineages <- get_status(get_lineages(sample), status = "accepted")
kingdom <- get_validity(lineages, rank = "kingdom", valid = FALSE)
#> Term conversion carried out on kingdom taxonomic rank
family <- get_validity(lineages, rank = "family", valid = FALSE)
candidates <- list(kingdom, family)
get_inconsistencies(candidates, uninomials = FALSE)
#> [1] "Gordonia neofelifaecis"
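The effect of the status filter can be illustrated without the package. The following is a minimal base R sketch on a hand-built data frame, not the implementation of get_status(); the toy_lineages object is hypothetical and holds only the two names and GBIF statuses reported above.

```r
# Hypothetical toy data: the two candidate names from this example together
# with the GBIF taxonomic statuses reported in the text above.
toy_lineages <- data.frame(
  canonicalName   = c("Attheya septentrionalis", "Gordonia neofelifaecis"),
  taxonomicStatus = c("synonym", "accepted"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose status is "accepted"; the synonym is dropped.
accepted <- toy_lineages[toy_lineages$taxonomicStatus == "accepted", ]
accepted$canonicalName
```

Only the accepted name remains, which mirrors why Attheya septentrionalis no longer appears among the candidates after filtering.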
The second example shows how to annotate a custom taxonomy. Again, start by loading the 2000-row sample dataset that comes with taxonbridge. Then apply the get_taxa() function to find all decapod crustaceans in the sample dataset:
library(taxonbridge)
sample <- load_sample()
decapoda <- get_taxa(sample, order = "decapoda")
The decapoda object will serve as your base taxonomy. Create a new object that only retains decapods known as swimming crabs:
swimming_crabs <- get_taxa(sample, family = "portunidae")
Next annotate your base taxonomy with this colloquial name for the family Portunidae:
decapoda <- annotate(decapoda, names = swimming_crabs$canonicalName,
                     new_column = "swimming_crabs", present = "1")
A new column by the name “swimming_crabs” has been added to the base taxonomy:
colnames(decapoda)
#> [1] "taxonID" "canonicalName" "taxonRank"
#> [4] "parentNameUsageID" "acceptedNameUsageID" "originalNameUsageID"
#> [7] "taxonomicStatus" "kingdom" "phylum"
#> [10] "class" "order" "family"
#> [13] "genericName" "specificEpithet" "infraspecificEpithet"
#> [16] "ncbi_id" "ncbi_lineage_names" "ncbi_lineage_ids"
#> [19] "ncbi_rank" "ncbi_lineage_ranks" "swimming_crabs"
Because only the present parameter, and not the absent parameter, was passed to annotate(), all entries that are not members of the Portunidae are assigned NA by default in the swimming_crabs column. Swimming crabs can therefore be retrieved from the base taxonomy with the following command:
decapoda[!is.na(decapoda$swimming_crabs), "canonicalName"]
#> # A tibble: 1 × 1
#> canonicalName
#> <chr>
#> 1 Callinectes sapidus
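The annotate-then-filter pattern can be sketched in base R on a small hand-built data frame. This is only an illustration of the idea under stated assumptions, not how annotate() is implemented; toy_taxonomy and names_to_flag are hypothetical names, and the three entries are decapods mentioned in this vignette.

```r
# Hypothetical toy base taxonomy with three decapod entries.
toy_taxonomy <- data.frame(
  canonicalName = c("Callinectes sapidus", "Palibythus magnificus", "Lysirude"),
  stringsAsFactors = FALSE
)

# Names that should receive the annotation.
names_to_flag <- c("Callinectes sapidus")

# Matching entries get "1"; all other entries get NA, mimicking the default
# behaviour described above when no value for absent entries is supplied.
toy_taxonomy$swimming_crabs <- ifelse(
  toy_taxonomy$canonicalName %in% names_to_flag, "1", NA_character_
)

# Retrieve the annotated entries by filtering on non-missing values.
flagged <- toy_taxonomy[!is.na(toy_taxonomy$swimming_crabs), "canonicalName"]
flagged
```

Using NA for unannotated rows keeps the retrieval step to a single !is.na() test, which is the same test used on the real base taxonomy above.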
For the third example, continue using the annotated base taxonomy from example 2. Prepare two distributions for the base taxonomy using prepare_rank_dist():
GBIF_dist <- prepare_rank_dist(decapoda, GBIF = TRUE)
NCBI_dist <- prepare_rank_dist(decapoda, NCBI = TRUE)
plot_mdb(GBIF_dist)
plot_mdb(NCBI_dist)
The plots show that the NCBI and GBIF entries differ. Inspecting the previously prepared distributions reveals that the NCBI lacks lineage data for two species:
GBIF_dist
#> $GBIF
#> # A tibble: 3 × 2
#> Rank Frequency
#> <chr> <int>
#> 1 family 1
#> 2 genus 3
#> 3 species 13
#>
#> attr(,"class")
#> [1] "one_rank"
NCBI_dist
#> $NCBI
#> # A tibble: 4 × 2
#> Rank Frequency
#> <chr> <int>
#> 1 family 1
#> 2 genus 3
#> 3 species 11
#> 4 <NA> 2
#>
#> attr(,"class")
#> [1] "one_rank"
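The two distributions can also be compared side by side. The sketch below rebuilds the printed frequencies as plain data frames so that it runs without the package; gbif, ncbi, and comparison are hypothetical names, and the numbers are copied from the output above.

```r
# Frequencies copied from the printed GBIF and NCBI distributions above.
gbif <- data.frame(Rank = c("family", "genus", "species"),
                   Frequency = c(1L, 3L, 13L),
                   stringsAsFactors = FALSE)
ncbi <- data.frame(Rank = c("family", "genus", "species", NA),
                   Frequency = c(1L, 3L, 11L, 2L),
                   stringsAsFactors = FALSE)

# Align both distributions on the union of ranks, including the NA rank.
ranks <- c("family", "genus", "species", NA)
gbif_freq <- gbif$Frequency[match(ranks, gbif$Rank)]
ncbi_freq <- ncbi$Frequency[match(ranks, ncbi$Rank)]

# A rank that is absent from a distribution counts as zero entries.
gbif_freq[is.na(gbif_freq)] <- 0L
ncbi_freq[is.na(ncbi_freq)] <- 0L

comparison <- data.frame(Rank = ranks, GBIF = gbif_freq, NCBI = ncbi_freq,
                         Difference = gbif_freq - ncbi_freq)
comparison
```

The totals agree between the two sources; the difference is confined to two entries that the GBIF counts as species and for which the NCBI reports no rank.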
Ensuring that both the NCBI and GBIF data have lineage information by using get_lineages() solves this problem, at the cost of losing two GBIF entries that are not available in the NCBI:
lineages <- get_lineages(decapoda)
GBIF_dist <- prepare_rank_dist(lineages, GBIF = TRUE)
NCBI_dist <- prepare_rank_dist(lineages, NCBI = TRUE)
plot_mdb(GBIF_dist)
plot_mdb(NCBI_dist)
Note that get_lineages() should be used with care since extinct species in the GBIF are unlikely to have lineage data in the NCBI.
Now that both the GBIF and NCBI data have lineage information, the validity of the lineage information can be assessed in the same way as in example 1:
get_validity(decapoda, valid = FALSE)
#> # A tibble: 2 × 21
#> taxonID canonicalName taxonRank parentNameUsageID acceptedNameUsageID
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2221665 Palibythus magnificus species 2221664 NA
#> 2 2226015 Lysirude genus 6927319 NA
#> # … with 16 more variables: originalNameUsageID <dbl>, taxonomicStatus <chr>,
#> # kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
#> # genericName <chr>, specificEpithet <chr>, infraspecificEpithet <chr>,
#> # ncbi_id <dbl>, ncbi_lineage_names <chr>, ncbi_lineage_ids <chr>,
#> # ncbi_rank <chr>, ncbi_lineage_ranks <chr>, swimming_crabs <chr>
Two entries have invalid data:
The species Palibythus magnificus belongs to the family Palinuridae in the GBIF but belongs to the family Synaxidae in the NCBI.
The genus Lysirude belongs to the family Lyreididae in the GBIF but belongs to the family Raninidae in the NCBI.
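The kind of comparison behind these two findings can be sketched in base R. This is a toy illustration only and does not reproduce how get_validity() works internally; the gbif_family and ncbi_family columns are hypothetical and were filled in by hand from the family assignments listed above, plus one consistent entry for contrast.

```r
# Hypothetical toy table: family assignments per source, entered by hand.
toy_families <- data.frame(
  canonicalName = c("Palibythus magnificus", "Lysirude", "Callinectes sapidus"),
  gbif_family   = c("Palinuridae", "Lyreididae", "Portunidae"),
  ncbi_family   = c("Synaxidae", "Raninidae", "Portunidae"),
  stringsAsFactors = FALSE
)

# Flag entries whose family differs between the sources. The comparison is
# case-insensitive so that differences in capitalisation are not reported.
mismatch <- tolower(toy_families$gbif_family) != tolower(toy_families$ncbi_family)
inconsistent <- toy_families$canonicalName[mismatch]
inconsistent
```

A plain string comparison like this flags any disagreement, including ones caused by synonymous family names, so flagged entries still need to be checked against the literature.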
Annotating the base taxonomy with these inconsistencies is a good idea:
decapoda <- annotate(decapoda, get_validity(decapoda, valid = FALSE)$canonicalName,
                     new_column = "family_inconsistencies", present = 1)
#> 2 annotations were made.
colnames(decapoda)
#> [1] "taxonID" "canonicalName" "taxonRank"
#> [4] "parentNameUsageID" "acceptedNameUsageID" "originalNameUsageID"
#> [7] "taxonomicStatus" "kingdom" "phylum"
#> [10] "class" "order" "family"
#> [13] "genericName" "specificEpithet" "infraspecificEpithet"
#> [16] "ncbi_id" "ncbi_lineage_names" "ncbi_lineage_ids"
#> [19] "ncbi_rank" "ncbi_lineage_ranks" "swimming_crabs"
#> [22] "family_inconsistencies"
decapoda[!is.na(decapoda$family_inconsistencies), "canonicalName"]
#> # A tibble: 2 × 1
#> canonicalName
#> <chr>
#> 1 Palibythus magnificus
#> 2 Lysirude