galah is an R interface to biodiversity data hosted by the Atlas of Living Australia (ALA). The ALA is a repository of biodiversity
data, focussed primarily on observations of individual life forms. As with the
Global Biodiversity Information Facility (GBIF), the
basic unit of data at ALA is an occurrence record, based on the 'Darwin Core' data standard.
galah enables users to locate and download species observations, taxonomic
information, or associated media such as images or sounds, and to restrict their
queries to particular taxa or locations. Users can specify which columns are
returned by a query, or restrict their results to observations that meet
particular quality-control criteria. All functions return a data.frame as
their standard format.
Functions in galah are designed according to a nested architecture. Users
who require data should begin by locating the relevant ala_ function (see
the downloading data section); the arguments within that function then call
correspondingly-named select_ functions; and finally, the specific values that
can be interpreted by those select_ functions are given by find_ functions.
Install from CRAN:
install.packages("galah")
Install the development version from GitHub:
install.packages("remotes")
remotes::install_github("AtlasOfLivingAustralia/galah")
See the README for system requirements.
Load the package
library(galah)
Each occurrence record contains taxonomic information, and also some
information about the observation itself, such as its location and the date
of the observation. Each piece of information associated with a
given occurrence is stored in a field, which corresponds to a column
when imported to a data.frame.
Data fields are important because they provide a means to filter
occurrence records; i.e. to return only the information that you need, and
no more. Consequently, much of the architecture of galah has been
designed to make filtering as simple as possible, by using functions with the
select_ prefix.
select_taxa() enables users to search for taxonomic names and check that the
results are 'correct' before using them to download data.
The function allows both free-text searches and searches where the rank(s) are
specified. Specifying the rank can be useful when names are ambiguous.
# free text search
taxa_filter <- select_taxa("Eolophus")
## Assuming that query term(s) provided are scientific or common names
# specifying ranks
# Aves is a class (the kingdom is Animalia), so it is supplied under its own rank
select_taxa(query = list(genus = "Eolophus", class = "Aves"))
## Assuming that query term(s) provided are scientific or common names
## search_term scientific_name scientific_name_authorship
## 1 Eolophus_Aves Eolophus Bonaparte, 1854
## taxon_concept_id rank match_type kingdom phylum class
## 1 urn:lsid:biodiversity.org.au:afd.taxon:b2de5e40-df8f-4339-827d-25e63454a4a2 genus exactMatch Animalia Chordata Aves
## order family genus issues
## 1 Psittaciformes Cacatuidae Eolophus noIssue
select_taxa() can optionally provide information about child concepts, and
counts of the number of records held by the ALA for the specified taxa.
select_taxa(query = "Eolophus", children = TRUE, counts = TRUE)
## Assuming that query term(s) provided are scientific or common names
## search_term scientific_name scientific_name_authorship
## 1: Eolophus Eolophus Bonaparte, 1854
## 2: <NA> Eolophus roseicapilla (Vieillot, 1817)
## taxon_concept_id rank match_type kingdom phylum
## 1: urn:lsid:biodiversity.org.au:afd.taxon:b2de5e40-df8f-4339-827d-25e63454a4a2 genus exactMatch Animalia Chordata
## 2: urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 species taxonIdMatch Animalia Chordata
## class order family genus issues species vernacular_name count
## 1: Aves Psittaciformes Cacatuidae Eolophus noIssue <NA> <NA> 779838
## 2: Aves Psittaciformes Cacatuidae Eolophus noIssue Eolophus roseicapilla Galah 779836
This shows that there is only one species in the genus Eolophus: Eolophus roseicapilla, the galah.
Users can provide an sf object or a Well-Known Text (WKT) string for
location-based filtering.
# st_read() comes from the sf package, which must be installed
locations <- select_locations(query = sf::st_read("act_rect.shp"))
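When no shapefile is available, a WKT string can be assembled directly. The following is a minimal base R sketch; bbox_to_wkt() is a hypothetical helper (not part of galah), and the coordinates are illustrative values only roughly enclosing the ACT.

```r
# Hypothetical helper: build a WKT rectangle from a bounding box in degrees.
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  stopifnot(xmin < xmax, ymin < ymax)
  # Vertices run counter-clockwise; the ring is closed by repeating the first
  xs <- c(xmin, xmax, xmax, xmin, xmin)
  ys <- c(ymin, ymin, ymax, ymax, ymin)
  # WKT writes each vertex as "x y" (longitude first), separated by commas
  paste0("POLYGON((", paste(xs, ys, collapse = ","), "))")
}

# Illustrative coordinates (assumption), not an official boundary
act_wkt <- bbox_to_wkt(148.7, -35.95, 149.45, -35.1)

# The string could then be passed on (requires galah; not run here):
# locations <- select_locations(query = act_wkt)
```

Building the string by hand keeps the example free of spatial dependencies; for anything more complex than a rectangle, an sf object is likely the safer route.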
As mentioned above, all occurrence records in the ALA contain additional
information about the record, stored in fields. Field-based filters are
specified with select_filters(), which takes individual filters, in the form
field = value, and/or a data quality profile.
To find available fields and corresponding valid values, field lookup
functions are provided. To find field names, use search_fields(); to find
valid field values, use find_field_values().
search_fields("basis")
## id
## 1: basisOfRecord
## 2: raw_basisOfRecord
## 3: BASIS_OF_RECORD_INVALID
## 4: OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD
## description
## 1: What this is a record of e.g. specimen, human observation, fossil http://rs.tdwg.org/dwc/terms/basisOfRecord
## 2: The basis of record as supplied by the data publisher http://rs.tdwg.org/dwc/terms/verbatimBasisOfRecord
## 3: Basis of record badly formed
## 4: Occurrence status inferred from basis of record
## type link
## 1: fields <NA>
## 2: fields <NA>
## 3: assertions <NA>
## 4: assertions <NA>
field_values <- find_field_values("basisOfRecord")
Build a field filter
filters <- select_filters(basisOfRecord = "HumanObservation")
By default, a filter includes records that match the given value. To negate a
filter, so that matching records are excluded instead, wrap the value in exclude().
filters <- select_filters(basisOfRecord = "HumanObservation",
occurrenceStatus = exclude("absent"))
A notable extension of the filtering approach is to remove records with low
'quality'. ALA performs quality control checks on all records that it stores.
These checks are used to generate new fields, which can then be used to filter
out records that are unsuitable for particular applications. However, there
are many possible data quality checks, and it is not always clear which are
most appropriate in a given instance. Therefore, galah supports ALA
data quality profiles, which can be passed to select_filters() to quickly
remove undesirable records. A full list of data quality profiles is returned by
find_profiles().
profiles <- find_profiles()
View filters included in a profile
find_profile_attributes("ALA")
## description
## 1: Exclude all records where spatial validity is "false"
## 2: Exclude all records with an assertion that the scientific name provided does not match any of the names lists used by the ALA. For a full explanation of the ALA name matching process see https://github.com/AtlasOfLivingAustralia/ala-name-matching
## 3: Exclude all records with an assertion that the scientific name provided is not structured as a valid scientific name. Also catches rank values or values such as "UNKNOWN"
## 4: Exclude all records with an assertion that the name and classification supplied can't be used to choose between 2 homonyms
## 5: Exclude all records with an assertion that kingdom provided doesn't match a known kingdom e.g. Animalia, Plantae
## 6: Exclude all records with an assertion that the scientific name provided in the record does not match the expected taxonomic scope of the resource e.g. Mammal records attributed to bird watch group
## 7: Exclude all records with an assertion of the occurence is cultivated or escaped from captivity
## 8: Exclude all records with an assertion of latitude value provided is zero
## 9: Exclude all records with an assertion of longitude value provided is zero
## 10: Exclude all records with an assertion of latitude and longitude have been transposed
## 11: Exclude all records with an assertion of coordinates are the exact centre of the state or territory
## 12: Exclude all records with an assertion of coordinates are the exact centre of the country
## 13: Exclude all records where duplicate status is "duplicate"
## 14: Exclude all records where coordinate uncertainty (in metres) is greater than 10km
## 15: Exclude all records with unresolved user assertions
## 16: Exclude all records with unconfirmed user assertions
## 17: Exclude all records where outlier layer count is 3 or more
## 18: Exclude all records where Record type is "Fossil specimen"
## 19: Exclude all records where Record type is "EnvironmentalDNA"
## 20: Exclude all records where Presence/Absence is "absent"
## 21: Exclude all records where year is prior to 1700
## description
## filter
## 1: -spatiallyValid:"false"
## 2: -assertions:TAXON_MATCH_NONE
## 3: -assertions:INVALID_SCIENTIFIC_NAME
## 4: -assertions:TAXON_HOMONYM
## 5: -assertions:UNKNOWN_KINGDOM
## 6: -assertions:TAXON_SCOPE_MISMATCH
## 7: -establishmentMeans:"MANAGED"
## 8: -decimalLatitude:0
## 9: -decimalLongitude:0
## 10: -assertions:"PRESUMED_SWAPPED_COORDINATE"
## 11: -assertions:"COORDINATES_CENTRE_OF_STATEPROVINCE"
## 12: -assertions:"COORDINATES_CENTRE_OF_COUNTRY"
## 13: -duplicateStatus:"ASSOCIATED"
## 14: -coordinateUncertaintyInMeters:[10001 TO *]
## 15: -userAssertions:50001
## 16: -userAssertions:50005
## 17: -outlierLayerCount:[3 TO *]
## 18: -basisOfRecord:"FOSSIL_SPECIMEN"
## 19: -(basisOfRecord:"MATERIAL_SAMPLE" AND contentTypes:"EnvironmentalDNA")
## 20: -occurrenceStatus:ABSENT
## 21: -year:[* TO 1700]
## filter
Include a profile in the filters
filters <- select_filters(basisOfRecord = "HumanObservation",
profile = "ALA")
Functions that return data from ALA are named with the prefix ala_,
followed by a suffix describing the information that they provide.
By combining different filter functions, it is possible to build complex
queries to return only the most valuable information for a given problem.
Once you have retrieved taxon information, you can use this to search for
occurrence records with ala_occurrences(). However, it is
also possible to download data on species via ala_species(),
or media content (largely images) via ala_media().
Alternatively, users can retrieve record counts using ala_counts().
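As a rough illustration of how the filter functions nest inside a download function, the sketch below wraps one combined query in a function. count_filtered_records() is a hypothetical name, and passing taxa and filters to ala_counts() is an assumption based on the description of its filter arguments in this document; only functions named in this document are used.

```r
# Sketch only: nothing is evaluated until the function is called, and
# calling it needs the galah package plus network access to the ALA.
count_filtered_records <- function(taxon, state, years) {
  galah::ala_counts(
    # Resolve the name first so the query uses an ALA taxon concept
    taxa = galah::select_taxa(taxon),
    # Field filters and the "ALA" data quality profile are combined
    filters = galah::select_filters(
      stateProvince = state,
      year = years,
      profile = "ALA"
    )
  )
}

# Example call (not run here):
# count_filtered_records("Eolophus roseicapilla",
#                        "Australian Capital Territory",
#                        seq(2010, 2020))
```

Checking the count before requesting the records themselves is a cheap way to confirm that a set of filters returns a manageable number of rows.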
In addition to the filter functions above, when downloading
occurrence data users can specify which columns are returned using
select_columns(). Individual column names and/or column groups can be
specified.
To view the fields for each group, see the documentation for select_columns().
To view the list of available fields, run search_fields().
cols <- select_columns("institutionID", group = "basic")
To download occurrence data you will need to specify your email in
ala_config(). This email must be associated with an active ALA account. See
the config section for more information.
ala_config(email = your_email_here, profile_path = path_to_profile)
Download occurrence records for Eolophus roseicapilla
occ <- ala_occurrences(taxa = select_taxa("Eolophus roseicapilla"),
filters = select_filters(stateProvince = "Australian Capital Territory",
year = seq(2010, 2020),
profile = "ALA"),
columns = select_columns("institutionID", group = "basic"))
head(occ)
## X decimalLatitude decimalLongitude eventDate scientificName
## 1 1 -35.3450 149.062 2012-12-04 Eolophus roseicapilla
## 2 2 -35.2501 149.131 2014-12-05 Eolophus roseicapilla
## 3 3 -35.3405 149.109 2018-12-27 Eolophus roseicapilla
## 4 4 -35.2033 149.013 2013-12-01 Eolophus roseicapilla
## 5 5 -35.3206 148.956 2017-12-04 Eolophus roseicapilla
## 6 6 -35.2660 148.932 2011-12-11 Eolophus roseicapilla
## taxonConceptID recordID
## 1 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ffef254e-0328-417f-9c8a-64bfacd32c00
## 2 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ffee66fa-968d-4a0f-b245-03d0bc56d303
## 3 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ffcb2236-2f65-4d0d-a347-6d5d30f80b97
## 4 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ffafbfd6-3b11-4ef6-881b-7d417a11db34
## 5 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ffa24dee-d1e9-4e97-8574-76433c299429
## 6 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ff87372e-bde7-4e87-929e-80e59ed94e8f
## data_resource
## 1 Garden Bird Surveys
## 2 eBird Australia
## 3 eBird Australia
## 4 eBird Australia
## 5 eBird Australia
## 6 eBird Australia
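Because the result is an ordinary data.frame, base R tools apply to it directly. The sketch below rebuilds two columns of the six rows printed above, so that it runs without a download, and summarises them; occ_head is a hypothetical stand-in for head(occ).

```r
# First six records, copied from the head(occ) output above (two columns only)
occ_head <- data.frame(
  eventDate = c("2012-12-04", "2014-12-05", "2018-12-27",
                "2013-12-01", "2017-12-04", "2011-12-11"),
  data_resource = c("Garden Bird Surveys", rep("eBird Australia", 5)),
  stringsAsFactors = FALSE
)

# Number of records contributed by each data resource
resource_counts <- table(occ_head$data_resource)

# Observation year, extracted from the ISO-formatted event date
occ_head$year <- as.integer(format(as.Date(occ_head$eventDate), "%Y"))
```

With a real download, the same two lines of summary code would be applied to occ itself.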
A common use case of the ALA is to identify which species occur in a specified
region, time period, or taxonomic group. ala_species() enables the user to
look up this information, using the common set of filter functions.
# List rodent species in the NT
species <- ala_species(taxa = select_taxa("Rodentia"),
filters = select_filters(stateProvince = "Northern Territory"))
## Assuming that query term(s) provided are scientific or common names
head(species)
## species species_name
## 1 urn:lsid:biodiversity.org.au:afd.taxon:f38bcd7e-ae6a-4734-bd64-06995bc230eb Mesembriomys gouldii
## 2 urn:lsid:biodiversity.org.au:afd.taxon:46611113-a1e3-45b1-b58c-7aef088a9da7 Zyzomys argurus
## 3 urn:lsid:biodiversity.org.au:afd.taxon:5d73fc2f-3caa-4b44-aa40-3711e8304f80 Pseudomys hermannsburgensis
## 4 urn:lsid:biodiversity.org.au:afd.taxon:49001532-929e-4b78-97d3-c885e97d671b Notomys alexis
## 5 urn:lsid:biodiversity.org.au:afd.taxon:89dfa41e-2c5a-44d1-80bf-8d4cd3c73089 Melomys burtoni
## 6 urn:lsid:biodiversity.org.au:afd.taxon:107696b5-063c-4c09-a015-6edfdb6f4d52 Mus musculus
## scientific_name_authorship taxon_rank kingdom phylum class order family genus vernacular_name
## 1 (J.E. Gray, 1843) species Animalia Chordata Mammalia Rodentia Muridae Mesembriomys Black-footed Tree-rat
## 2 (Thomas, 1889) species Animalia Chordata Mammalia Rodentia Muridae Zyzomys Common Rock-rat
## 3 (Waite, 1896) species Animalia Chordata Mammalia Rodentia Muridae Pseudomys Sandy Inland Mouse
## 4 Thomas, 1922 species Animalia Chordata Mammalia Rodentia Muridae Notomys Spinifex Hopping-mouse
## 5 (Ramsay, 1887) species Animalia Chordata Mammalia Rodentia Muridae Melomys Grassland Melomys
## 6 Linnaeus, 1758 species Animalia Chordata Mammalia Rodentia Muridae Mus House Mouse
ala_counts() provides summary counts of records in the ALA, without needing
to download all the records. In addition to the filter arguments, it has an
optional group_by argument, which provides counts binned by the requested
field.
# Total number of records in the ALA
ala_counts()
## [1] 95659866
# Total number of records, broken down by kingdom
ala_counts(group_by = "kingdom")
## kingdom count
## 1 Animalia 70502996
## 2 Plantae 21440207
## 3 Fungi 1865765
## 4 Chromista 914030
## 5 Protista 67314
## 6 Bacteria 58066
## 7 Protozoa 22465
## 8 Archaea 1103
## 9 Eukaryota 735
## 10 Virus 414
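Grouped counts are often easier to read as percentages. The sketch below copies the figures printed above into a data.frame, so it runs offline, and adds a percentage column; kingdom_counts is a hypothetical name. Note that the percentages are relative to the records listed in the grouped output, which sum to fewer than the overall total shown earlier.

```r
# Counts copied from the ala_counts(group_by = "kingdom") output above
kingdom_counts <- data.frame(
  kingdom = c("Animalia", "Plantae", "Fungi", "Chromista", "Protista",
              "Bacteria", "Protozoa", "Archaea", "Eukaryota", "Virus"),
  count = c(70502996, 21440207, 1865765, 914030, 67314,
            58066, 22465, 1103, 735, 414),
  stringsAsFactors = FALSE
)

# Percentage of the grouped records that falls in each kingdom
kingdom_counts$percent <- round(
  100 * kingdom_counts$count / sum(kingdom_counts$count), 2
)
```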
In addition to text data describing individual occurrences and their attributes, ALA stores images, sounds and videos associated with a given record. These can be downloaded to R using ala_media() and the same set of filters as the other data download functions.
# Download media attached to Eolophus roseicapilla records from 2020
media_data <- ala_media(
taxa = select_taxa("Eolophus roseicapilla"),
filters = select_filters(year = 2020),
download_dir = "media")
Various aspects of the galah package can be customized. To preserve the
configuration for future sessions, set profile_path to the location of a
.Rprofile file.
To download occurrence records, you will need to provide an email address registered with the ALA. You can create an account on the ALA website. Once an email address is registered with the ALA, store it in the config:
ala_config(email="myemail@gmail.com")
galah can cache most results to local files. This means that if the same code
is run multiple times, the second and subsequent iterations will be faster.
By default, this caching is session-based, meaning that the local files are stored in a temporary directory that is automatically deleted when the R session is ended. This behaviour can be altered so that caching is permanent, by setting the caching directory to a non-temporary location.
ala_config(cache_directory="example/dir")
By default, caching is turned off. To turn caching on, run
ala_config(caching=TRUE)
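Several of these options can be set in one call, in the same way that email and profile_path are combined earlier in this document. The sketch below is a hypothetical wrapper, configure_galah(), that uses only the option names shown in this section; it is an illustration rather than a recommended setup.

```r
# Hypothetical helper: nothing is passed to galah until it is called,
# and calling it requires the galah package to be installed.
configure_galah <- function(email, cache_dir = tempdir()) {
  galah::ala_config(
    email = email,               # must belong to an active ALA account
    caching = TRUE,              # switch caching on
    cache_directory = cache_dir  # tempdir() keeps the cache session-based
  )
}

# Example call (not run here):
# configure_galah("myemail@gmail.com")
```

Supplying a non-temporary directory as cache_dir would make the cache persist between sessions, as described above.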
If things aren't working as expected, more detail (particularly about web requests and caching behaviour) can be obtained by setting the verbose configuration option:
ala_config(verbose=TRUE)
ALA requires that you provide a reason when downloading occurrence data (via the galah ala_occurrences() function). The reason is set as “scientific research” by default, but you can change this using ala_config(). See find_reasons() for valid download reasons.
ala_config(download_reason_id=your_reason_id)