Create an ontology

library(ontologics)
library(dplyr, warn.conflicts = FALSE)
library(knitr)   # provides kable(), used below to print tables

Work with an ontology starts either by reading it in from an existing database or by creating a new ontology from scratch.

Even though this package is still under development, it already provides a function that reads an ontology from an *.rds file (a format optimized for use within R) and can write to formats that are useful for triplestores or the semantic web. This vignette focuses on the basic building blocks for creating a new ontology. How to map new concepts from external ontologies, and how to export an ontology so that it is interoperable with the semantic web, is covered in other parts of the package documentation.

An existing ontology

# read in example ontology
crops <- load_ontology(path = system.file("extdata", "crops.rds", package = "ontologics"))

crops   # ... has a pretty show-method
#>   sources : 1
#>     -> 'harmonised' (73)
#> 
#>   classes : 3 
#>    ∟ group    20   Groups of crop or livestock commoditi...
#>     ∟ class   53   Classes of crop or livestock commodi...
#>      ∟ crop    0   Crop or livestock commodities
#> 
#>   top concepts: 73 
#>     -> group: 'CEREALS' (10), 'FRUIT' (8), 'VEGETABLES' (6), 'UNGULATES' (5), 'BIOENERGY CROPS' (4), ...
#>     -> class: 'Bioenergy herbaceous' (20), 'Barley' (20), 'Fibre crops' (20), 'Flower herbs' (20), 'Grass crops' (20), ...
#>     -> crop:

The onto class is an S4 class with the three slots @sources, @classes and @concepts, each of which is reflected by an entry in the show-method. The classes in an ontology often have a hierarchical order, but this is not obligatory. In any case, the show-method lists the first three levels of the hierarchical structure together with the number of concepts at each level and the description of the level. Moreover, the five most frequent concepts are shown together with a visual representation of the frequency distribution of all concepts at the first three levels.
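Slots are reached with the `@` operator, and the tables stored inside a slot with `$`. The following is a minimal sketch that uses a hypothetical toy class (`toy_onto`) defined with `setClass()`. It only mimics the three slot names mentioned above and is not the package's actual class definition; the contents of the slots are made up for illustration.

```r
library(methods)

# Hypothetical stand-in for illustration only; NOT the definition of 'onto'
setClass("toy_onto",
         representation(sources  = "data.frame",
                        classes  = "list",
                        concepts = "list"))

toy <- new("toy_onto",
           sources  = data.frame(id = 1L, label = "harmonised"),
           classes  = list(harmonised = data.frame(id = ".xx", label = "landcover")),
           concepts = list(harmonised = data.frame(id = ".01", label = "Forests")))

slotNames(toy)            # names of the three slots
toy@sources$label         # '@' reaches into a slot, '$' into the table stored there
toy@concepts$harmonised   # in this sketch, concepts are stored per source
```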

Each of the three main slots has a function for adding new items to it (new_source(), new_class() and new_concept()), and an additional function creates mappings between your focal ontology and any external ontology (new_mapping()). There is more detailed information about the architecture of the onto class in the vignette Ontology database description.

New ontology

A new ontology is built by calling the function start_ontology(). This requires some metadata that is stored in the ontology and that serves to link this ontology properly to other linked open data.

lulc <- start_ontology(name = "land_surface_properties",
                       version = "0.0.1",
                       path = tempdir(), 
                       code = ".xx",
                       description = "showcase of the ontologics R-package", 
                       homepage = "https://github.com/luckinet/ontologics", 
                       license = "CC-BY-4.0")

lulc
#>   sources : 1
#>     -> 'harmonised' (0)
#> 
#>   classes : 0 
#> 
#>   top concepts: 0

This information is stored in the @sources slot, just like that of any other external data source. It is recommended to always set the code for building IDs with a leading symbol that can’t be transformed into a numeric/integer value. This avoids problems when the ontology is opened in a spreadsheet program, which may perform this transformation automatically without asking or informing the author.

kable(lulc@sources)
id label version date description homepage license notes
1 harmonised 0.0.1 2023-05-10 showcase of the ontologics R-package https://github.com/luckinet/ontologics CC-BY-4.0
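The recommendation about a leading symbol can be illustrated in base R. In this sketch, `type.convert()` is used only as a stand-in for the automatic type guessing of a spreadsheet program, and the IDs without a leading symbol (`"0501"`, `"0502"`) are hypothetical.

```r
# Hypothetical IDs built without a leading symbol
plain_ids  <- c("0501", "0502")
# IDs with leading and separating symbols, in the style used in this vignette
dotted_ids <- c(".05.01", ".05.02")

# type.convert() guesses a type, similar to what a spreadsheet import may do
converted_plain  <- type.convert(plain_ids, as.is = TRUE)
converted_dotted <- type.convert(dotted_ids, as.is = TRUE)

class(converted_plain)    # "integer": the leading zero is lost
class(converted_dotted)   # "character": the IDs survive unchanged
```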

Next, classes and their hierarchy need to be defined. Each concept is always a combination of a code, a label and a class. The code must be unique for each unique concept, but the label or the class can have the same value for two concepts. For instance, the concept football can have the class game or the class object and then mean two different things, despite having the same label.
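The uniqueness rule can be sketched with a plain data frame. The table below is hypothetical (codes, labels and classes are made up to mirror the football example) and is not created by the package.

```r
# Hypothetical concept table mirroring the football example
concepts <- data.frame(
  code  = c(".01", ".02"),
  label = c("football", "football"),
  class = c("game", "object")
)

# codes must be unique ...
anyDuplicated(concepts$code) == 0
# ... whereas a label may repeat, because the code tells the concepts apart
anyDuplicated(concepts$label) > 0
```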

# currently it is only possible to set one class at a time
lulc <- new_class(
  new = "landcover", 
  target = NA, 
  description = "A good definition of landcover",
  ontology = lulc)

lulc <- new_class(
  new = "land-use", 
  target = "landcover", 
  description = "A good definition of land use",
  ontology = lulc)

# the class IDs are derived from the code that was previously specified 
kable(lulc@classes$harmonised[, 1:6])
id label description has_broader has_close_match has_narrower_match
.xx landcover A good definition of landcover NA NA NA
.xx.xx land-use A good definition of land use landcover NA NA

Then, new concepts that have these classes can be defined. If you choose classes that are not yet defined, you’ll get a warning.

lc <- c(
  "Urban fabric", "Industrial, commercial and transport units",
  "Mine, dump and construction sites", "Artificial, non-agricultural vegetated areas",
  "Temporary cropland", "Permanent cropland", "Heterogeneous agricultural areas",
  "Forests", "Other Wooded Areas", "Shrubland", "Herbaceous associations",
  "Heterogeneous semi-natural areas", "Open spaces with little or no vegetation",
  "Inland wetlands", "Marine wetlands", "Inland waters", "Marine waters"
)

lulc <- new_concept(
  new = lc,
  class = "landcover",
  ontology = lulc
)

kable(lulc@concepts$harmonised[, 1:5])
id label description class has_broader
.01 Urban fabric NA landcover NA
.02 Industrial, commercial and transport units NA landcover NA
.03 Mine, dump and construction sites NA landcover NA
.04 Artificial, non-agricultural vegetated areas NA landcover NA
.05 Temporary cropland NA landcover NA
.06 Permanent cropland NA landcover NA
.07 Heterogeneous agricultural areas NA landcover NA
.08 Forests NA landcover NA
.09 Other Wooded Areas NA landcover NA
.10 Shrubland NA landcover NA
.11 Herbaceous associations NA landcover NA
.12 Heterogeneous semi-natural areas NA landcover NA
.13 Open spaces with little or no vegetation NA landcover NA
.14 Inland wetlands NA landcover NA
.15 Marine wetlands NA landcover NA
.16 Inland waters NA landcover NA
.17 Marine waters NA landcover NA

An ontology differs from a vocabulary in that the concepts it contains are semantically related to one another. For example, concepts can be nested into other concepts. Hence, let’s also create a second level of concepts that depends on the first level.

lu <- tibble(
  concept = c(
    "Fallow", "Herbaceous crops", "Temporary grazing",
    "Permanent grazing", "Shrub orchards", "Palm plantations",
    "Tree orchards", "Woody plantation", "Protective cover",
    "Agroforestry", "Mosaic of agricultural-uses",
    "Mosaic of agriculture and natural vegetation",
    "Undisturbed Forest", "Naturally Regenerating Forest",
    "Planted Forest", "Temporally Unstocked Forest"
  ),
  broader = c(
    rep(lc[5], 3), rep(lc[6], 6),
    rep(lc[7], 3), rep(lc[8], 4)
  )
)



lulc <- get_concept(label = lu$broader, ontology = lulc) %>% 
  left_join(lu %>% select(label = broader), .) %>% 
  new_concept(
    new = lu$concept,
    broader = .,
    class = "land-use",
    ontology = lulc
  )
#> Joining with `by = join_by(label)`

kable(lulc@concepts$harmonised[, 1:5])
id label description class has_broader
.01 Urban fabric NA landcover NA
.02 Industrial, commercial and transport units NA landcover NA
.03 Mine, dump and construction sites NA landcover NA
.04 Artificial, non-agricultural vegetated areas NA landcover NA
.05 Temporary cropland NA landcover NA
.05.01 Fallow NA land-use .05
.05.02 Herbaceous crops NA land-use .05
.05.03 Temporary grazing NA land-use .05
.06 Permanent cropland NA landcover NA
.06.01 Permanent grazing NA land-use .06
.06.02 Shrub orchards NA land-use .06
.06.03 Palm plantations NA land-use .06
.06.04 Tree orchards NA land-use .06
.06.05 Woody plantation NA land-use .06
.06.06 Protective cover NA land-use .06
.07 Heterogeneous agricultural areas NA landcover NA
.07.01 Agroforestry NA land-use .07
.07.02 Mosaic of agricultural-uses NA land-use .07
.07.03 Mosaic of agriculture and natural vegetation NA land-use .07
.08 Forests NA landcover NA
.08.01 Undisturbed Forest NA land-use .08
.08.02 Naturally Regenerating Forest NA land-use .08
.08.03 Planted Forest NA land-use .08
.08.04 Temporally Unstocked Forest NA land-use .08
.09 Other Wooded Areas NA landcover NA
.10 Shrubland NA landcover NA
.11 Herbaceous associations NA landcover NA
.12 Heterogeneous semi-natural areas NA landcover NA
.13 Open spaces with little or no vegetation NA landcover NA
.14 Inland wetlands NA landcover NA
.15 Marine wetlands NA landcover NA
.16 Inland waters NA landcover NA
.17 Marine waters NA landcover NA
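The IDs of the nested concepts in the table above follow a visible pattern: the ID of the broader concept, followed by a zero-padded running number. The base R sketch below reproduces that pattern under this assumption. It is not the package's implementation, and the variable names are chosen for illustration.

```r
# IDs of the broader concepts, one per new concept (read off the table above)
broader_id <- c(rep(".05", 3), rep(".06", 6), rep(".07", 3), rep(".08", 4))

# running number of each child within its parent
child_no <- ave(seq_along(broader_id), broader_id, FUN = seq_along)

# append the zero-padded running number to the parent ID
child_id <- paste0(broader_id, sprintf(".%02d", child_no))

head(child_id, 4)
```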

Here, get_concept() extracts the broader concepts into which the new level is nested. This ensures that only valid concepts are provided, i.e., ones that have already been included in the ontology.
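The idea of checking broader concepts against what is already defined can be sketched in base R with `match()`. The table `existing` is a hypothetical excerpt, and the label "Grassland" is a made-up example of a concept that has not been defined. This only illustrates the principle, not how get_concept() works internally.

```r
# Hypothetical excerpt of concepts that are already defined
existing <- data.frame(
  id    = c(".05", ".06"),
  label = c("Temporary cropland", "Permanent cropland"),
  stringsAsFactors = FALSE
)

# broader concepts requested for the new level; the last one is deliberately unknown
requested <- c("Temporary cropland", "Permanent cropland", "Grassland")

# match() returns NA where a label has no counterpart among the existing concepts
pos <- match(requested, existing$label)
broader_id <- existing$id[pos]

# labels that would have to be defined before they can serve as broader concepts
missing <- requested[is.na(pos)]
missing
```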