fhircrackr: Handling HL7 FHIR resources in R

2020-06-30

Introduction

fhircrackr is a package designed to help analyze HL7 FHIR resources.

FHIR stands for Fast Healthcare Interoperability Resources and is a standard describing data formats and elements (known as “resources”) as well as an application programming interface (API) for exchanging electronic health records. The standard was created by the Health Level Seven International (HL7) health-care standards organization. For more information on the FHIR standard, visit https://www.hl7.org/fhir/.

While FHIR is a very useful standard to describe and exchange medical data in an interoperable way, it is not very useful for statistical analyses of said data. This is due to the fact that FHIR data is stored in many nested and interlinked resources instead of matrix-like structures.

Thus, to be able to do statistical analyses a tool is needed that allows converting these nested resources into data frames. This process of flattening FHIR resources is not trivial, as the unpredictable degree of nesting and connectedness of the resources makes generic solutions to this problem not feasible.

We therefore implemented a package that makes it possible to download FHIR resources from a server into R and to flatten these resources into (multiple) data frames.

The package is still under development. The CRAN version of the package contains all functions that are already stable. For more recent (but potentially unstable) developments, the development version can be installed from GitHub using devtools::install_github("POLAR-fhiR/fhircrackr").

Prerequisites

The complexity of the problem requires a couple of prerequisites, both regarding knowledge and access to data. We will briefly list the preconditions for using the fhircrackr package here:

  1. First of all, you need the endpoint of the FHIR server you want to access. If you don’t have your own FHIR server, you can use one of the publicly available servers, such as https://hapi.fhir.org/baseR4 or http://fhir.hl7.de:8080/baseDstu3. The endpoint of a FHIR server is often referred to as [base].

  2. To download resources from the server, you should be familiar with FHIR search requests. FHIR search allows you to download sets of resources that match very specific requirements. As the focus of this package is dealing with FHIR resources in R, rather than the intricacies of FHIR search, we will mostly use simple examples of FHIR search requests. Most of them will have the form [base]/[type]?parameter(s), where [type] refers to the type of resource you are looking for and parameter(s) characterise specific properties those resources should have. For example, http://hapi.fhir.org/baseR4/Patient?gender=female downloads all Patient resources from the FHIR server at http://hapi.fhir.org/baseR4/ that represent female patients.

  3. In the first step, fhircrackr downloads the resources in xml format into R. To specify which elements from the FHIR resources you want in your data frame, you should have at least some familiarity with XPath expressions. A good tutorial on XPath expressions can be found here.
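To give a feel for the kind of XPath expressions used later on, the following sketch applies two of them to a tiny bundle with the xml2 package. The bundle content and all variable names are made up for illustration and do not come from any server.

```r
library(xml2)

# A made-up bundle with two Patient resources (illustrative data only)
bundle <- read_xml(
  "<Bundle>
     <Patient><id value='p1'/><gender value='female'/></Patient>
     <Patient><id value='p2'/><gender value='male'/></Patient>
   </Bundle>"
)

# ".//Patient" selects every Patient element below the current node
patients <- xml_find_all(bundle, ".//Patient")

# "gender/@value" selects the value attribute of each gender element
genders <- xml_text(xml_find_all(bundle, ".//Patient/gender/@value"))
```

Expressions of the second kind, a path to an element followed by /@value, are the ones used to pick individual attributes out of a resource.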

In the following we’ll go through a typical workflow with fhircrackr step by step.

Download and flatten FHIR resources from a server

Example 1: Download Patient resources

We will start with a very simple example and use fhir_search() to download Patient resources from a publicly available HAPI server after we’ve loaded the package with library(fhircrackr):
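A minimal sketch of such a call, using only the arguments request and max_bundles that are described in this section. The endpoint is an assumption (one of the public servers listed under Prerequisites), and the tryCatch() wrapper is an addition of this sketch, not something fhir_search() requires. It merely keeps the example from aborting when the package is missing or the server cannot be reached.

```r
# Full FHIR search request of the form [base]/[type]; the endpoint is an assumption
request <- "https://hapi.fhir.org/baseR4/Patient"

# Stop after the first two bundles. On any error the result is set to NULL
# instead of stopping the script.
patient_bundles <- tryCatch(
  fhircrackr::fhir_search(request = request, max_bundles = 2),
  error = function(e) {
    message("Download failed: ", conditionMessage(e))
    NULL
  }
)
```

The name patient_bundles is chosen here for illustration only.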

The minimum information fhir_search() requires is a string containing the full FHIR search request in the argument request. In general, a FHIR search request returns a bundle of the resources you requested. If there are a lot of resources matching your request, the search result isn’t returned in one big bundle but distributed over several of them. If the argument max_bundles is set to its default Inf, fhir_search() will return all available bundles, meaning all resources matching your request. If you set it to 2 as in the above example, the download will stop after the first two bundles. Note that in this case, the result may not contain all the resources from the server matching your request.

If you want to connect to a FHIR server that uses basic authentication, you can supply the arguments username and password.

Because endpoints can sometimes be hard to reach, fhir_search() will make five attempts to connect to the endpoint before it gives up. With the arguments max_attempts and delay_between_attempts you can control this number as well as the time interval between attempts.

As you can see in the next block of code, fhir_search() returns a list of xml objects where each list element represents one bundle of resources, so a list of two xml objects in our case:

If for some reason you cannot connect to a FHIR server at the moment but want to explore the following functions anyway, the package provides an example list of bundles containing Patient and MedicationStatement resources. See ?medication_bundles for how to use it.

Now we know that the patient data is somewhere inside these xml objects. To get it out, we will use fhir_crack(). The most important arguments fhir_crack() takes are bundles, the list of bundles that is returned by fhir_search(), and design, an object that tells the function which data to extract from the bundle. It returns a list of data.frames.

In general, design is a named list containing one element per data frame that will be created. The element names of design are going to be the names of the resulting data frames. It usually makes sense to create one data frame per type of resource. Because we have just downloaded resources of the type Patient, the design here would be a list of length 1.

The elements of design are lists themselves. They can be of length 1 or length 2, depending on the level of precision in extracting the attributes. There are three levels of precision in extracting the data for our data frame with fhir_crack():

1. Extract all available attributes

If we want to extract all available attributes, the list describing the data frame inside design is a list of length 1, containing only an XPath expression to the resource type we want to extract:
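A design of this kind could be written as follows. This is a sketch based on the description above; the object name design_all and the data frame name Patients are chosen for illustration.

```r
# One data frame named "Patients"; its description has length 1 and holds
# only the XPath expression that locates the resources inside each bundle.
design_all <- list(
  Patients = list(
    ".//Patient"
  )
)
```

The object would then be handed to fhir_crack() in its design argument, together with the list of downloaded bundles.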

Note that this can easily become a rather wide and sparse data frame. This is due to the fact that every attribute appearing in at least one of the resources will be turned into a variable (i.e. column), even if none of the other resources contain this attribute. For those resources, the value on that attribute will be set to NA. Depending on the variability of the resources, the resulting data frame can contain a lot of NA values. If a resource has multiple entries for an attribute, these are pasted together using the string provided in the argument sep as a separator. The column names in this option are automatically generated by pasting together the path to the respective attribute.

3. Extract specific attributes

If we know exactly which attributes we want to extract, we can specify them in a named list that we provide as the second element of the list describing the data.frame:
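As a sketch of this form, the following design asks for three attributes of the Patient resource. The column names (PID, GENDER, BIRTHDATE) and the object name are hypothetical; the paths are written relative to the Patient element, in the same style as the paths used elsewhere in this document.

```r
design_specific <- list(
  Patients = list(
    # First element: XPath expression to the resource type
    ".//Patient",
    # Second element: named list, column name = XPath to the attribute
    list(
      PID       = "id/@value",
      GENDER    = "gender/@value",
      BIRTHDATE = "birthDate/@value"
    )
  )
)
```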

This option will usually return the most tidy and clear data frames. You should always extract the resource id, because this is used to link to other resources you might also extract. If you are not sure which attributes are available or where they are located in the resource, it can be helpful to start by extracting all available attributes. Then you can get an overview of the available attributes and their location and continue with a second, more targeted extraction to get your final data frame.

Of course the previous example is using just one resource type. If you are interested in several types of resources, design will have more elements and the result will be a list of several data frames.

In general, design should therefore have the following form:
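Written out as a runnable skeleton, the general shape could look like this. All names and paths below are placeholders introduced for illustration, not real resource types or attributes.

```r
design_template <- list(
  # Name of the first data frame
  FirstDataFrame = list(
    ".//ResourceTypeA",                      # XPath to the resource type
    list(
      COLUMN.1 = "path/to/attribute1/@value", # column name = XPath to attribute
      COLUMN.2 = "path/to/attribute2/@value"
    )
  ),
  # Name of the second data frame; length 1 means "all available attributes"
  SecondDataFrame = list(
    ".//ResourceTypeB"
  )
)
```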

Example 2: Download MedicationStatement and corresponding Patient resources

In reality, your FHIR search requests are probably going to be slightly more complex than just asking for Patient resources. Consider the following example, where we want to download MedicationStatements referring to a certain medication that we specify by its SNOMED code, together with the Patient resources the MedicationStatements are linked to.

When the FHIR search request gets longer, it can be helpful to build up the request piece by piece like this:
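One way to assemble such a request with paste0() is sketched below. The endpoint is an assumption (one of the public servers named under Prerequisites); the code system and code are the ones that appear in the MEDICATION.SYSTEM and MEDICATION.CODE columns of the result shown further down, and _include is the standard FHIR search parameter for pulling in referenced resources.

```r
search_request <- paste0(
  "https://hapi.fhir.org/baseR4/",           # server endpoint (assumed)
  "MedicationStatement?",                    # resource type to search for
  "code=http://snomed.info/ct|429374003",    # restrict to this SNOMED code
  "&_include=MedicationStatement:subject"    # also return the linked Patient resources
)
```

Keeping each part on its own line makes it easy to comment on, swap, or remove individual search parameters.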

Then we can download the resources:

And convert them into two data frames, one for the MedicationStatements and one for the Patients:

design <- list(

    MedicationStatement = list(

        ".//MedicationStatement",

        list(
            MS.ID              = "id/@value",
            STATUS.TEXT        = "text/status/@value",
            STATUS             = "status/@value",
            MEDICATION.SYSTEM  = "medicationCodeableConcept/coding/system/@value",
            MEDICATION.CODE    = "medicationCodeableConcept/coding/code/@value",
            MEDICATION.DISPLAY = "medicationCodeableConcept/coding/display/@value",
            DOSAGE             = "dosage/text/@value",
            PATIENT            = "subject/reference/@value",
            LAST.UPDATE        = "meta/lastUpdated/@value"
        )
    ),

    Patients = list(

        ".//Patient",
        "./*/@value"
    )
)


list_of_tables <- fhir_crack(medication_bundles, design)
#> 
#>  MedicationStatement
#>  1....................
#>  2....................
#>  3....................
#> 
#>  Patients
#>  1....................
#>  2....................
#>  3....................
#> FHIR-Resources cracked.

head(list_of_tables$MedicationStatement)
#>   MS.ID STATUS.TEXT STATUS     MEDICATION.SYSTEM MEDICATION.CODE
#> 1 30233   generated active http://snomed.info/ct       429374003
#> 2 42012   generated active http://snomed.info/ct       429374003
#> 3 42091   generated active http://snomed.info/ct       429374003
#> 4 45646   generated active http://snomed.info/ct       429374003
#> 5 45724   generated active http://snomed.info/ct       429374003
#> 6 45802   generated active http://snomed.info/ct       429374003
#>   MEDICATION.DISPLAY           DOSAGE       PATIENT
#> 1   simvastatin 40mg 1 tab once daily Patient/30163
#> 2   simvastatin 40mg 1 tab once daily Patient/41945
#> 3   simvastatin 40mg 1 tab once daily Patient/42024
#> 4   simvastatin 40mg 1 tab once daily Patient/45579
#> 5   simvastatin 40mg 1 tab once daily Patient/45657
#> 6   simvastatin 40mg 1 tab once daily Patient/45735
#>                     LAST.UPDATE
#> 1 2019-09-26T14:34:44.543+00:00
#> 2 2019-10-09T20:12:49.778+00:00
#> 3 2019-10-09T22:44:05.728+00:00
#> 4 2019-10-11T16:17:42.365+00:00
#> 5 2019-10-11T16:30:24.411+00:00
#> 6 2019-10-11T16:32:05.206+00:00

head(list_of_tables$Patients)
#>   id.value gender.value birthDate.value
#> 1    60096         male      2019-11-13
#> 2    49443       female      1970-10-19
#> 3    46213       female      2019-10-11
#> 4    45735         male      1970-10-11
#> 5    42024       female      1979-10-09
#> 6    58504         male      2019-11-08

As you can see, the result now contains two data frames, one for Patient resources and one for MedicationStatement resources.

Example 3: Multiple entries

A particularly complicated problem in flattening FHIR resources is caused by the fact that there can be multiple entries for an attribute. The profile according to which your FHIR resources have been built defines how often a particular attribute can appear in a resource. This is called the cardinality of the attribute. For example, the Patient resource defined here can have zero or one birthdates but arbitrarily many addresses. In general, fhir_crack() will paste multiple entries for the same attribute together, using the separator provided by the sep argument. In most cases this will work just fine, but there are some special cases that require a little more attention.

Let’s have a look at the following example, where we have a bundle containing just three Patient resources:

This bundle contains three Patient resources. The first resource has just one entry for the address attribute. The second Patient resource has two entries containing the same elements for the address attribute. The third Patient resource has a rather messy address attribute, with three entries containing different elements.
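A bundle matching this description can be built by hand with xml2. The following is a sketch with made-up ids, cities and countries; only the structure (one, two and three address entries, with the first two entries of the third patient using complementary elements) follows the text.

```r
library(xml2)

bundle <- read_xml(
  "<Bundle>
     <Patient>
       <id value='id1'/>
       <address>
         <use value='home'/><city value='Amsterdam'/>
         <type value='physical'/><country value='Netherlands'/>
       </address>
     </Patient>
     <Patient>
       <id value='id2'/>
       <address>
         <use value='home'/><city value='Rome'/>
         <type value='physical'/><country value='Italy'/>
       </address>
       <address>
         <use value='work'/><city value='Stockholm'/>
         <type value='postal'/><country value='Sweden'/>
       </address>
     </Patient>
     <Patient>
       <id value='id3'/>
       <address><use value='home'/><city value='Berlin'/></address>
       <address><type value='postal'/><country value='France'/></address>
       <address>
         <use value='work'/><city value='London'/>
         <type value='postal'/><country value='England'/>
       </address>
     </Patient>
   </Bundle>"
)

# Functions that expect the result of a search take a list of bundles
bundle_list <- list(bundle)
```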

Let’s see what happens if we extract all attributes:

As you can see, multiple entries for the same attribute (address) are pasted together. This works fine for Patient 2, but for Patient 3 you can see a problem with the number of entries that are displayed. The original Patient resource had three (incomplete) address entries, but because the first two of them use complementary elements (use and city vs. type and country), the resulting pasted entries look as if there had been just two entries for the address attribute.

You can counter this problem with the add_indices argument and customize the appearance of the indices with brackets:

Now the indices display the entry the value belongs to. That way you can see that Patient resource 3 had three entries for the attribute address and you can also see which attributes belong to which entry.

Of course this is a very specific problem that only occurs if your resources have multiple entries with complementary elements. In the vast majority of cases, multiple entries in one resource will look identical, which makes numbering those entries superfluous.

Save and load downloaded bundles

Since fhir_crack() discards all data not specified in design, it makes sense to store the original search result, both for reproducibility and in case you realise later on that you need elements from the resources that you did not extract at first.

There are two ways of saving the FHIR bundles you downloaded: Either you save them as R objects, or you write them to an xml file.

Save and load bundles as R objects

If you want to save the list of downloaded bundles as an .rda or .RData file, you cannot just use R’s save() or save.image() on it, because this will break the external pointers in the xml objects representing your bundles. Instead, you have to serialize the bundles before saving and unserialize them after loading. For single xml objects, the package xml2 provides serialization functions. For convenience, however, fhircrackr provides the functions fhir_serialize() and fhir_unserialize() that you can use directly on the list of bundles returned by fhir_search():

If you load this bundle again, you have to unserialize it, before you can work with it:

After unserialization, the pointers are restored and you can continue to work with the bundles. Note that the example bundle medication_bundles that is provided with the fhircrackr package is also provided in its serialized form and has to be unserialized as described on its help page.
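The round trip described in this section can be illustrated for a single xml object with the xml2 functions mentioned above. This sketch uses a made-up one-patient bundle and a temporary file; it shows the underlying idea only and is not meant as a description of how fhir_serialize() and fhir_unserialize() are implemented.

```r
library(xml2)

# Made-up bundle for illustration
bundle <- read_xml("<Bundle><Patient><id value='id1'/></Patient></Bundle>")

# Serialize to a raw vector, which does not depend on an external pointer
raw_bundle <- xml_serialize(bundle, connection = NULL)

# The raw vector can safely be stored with save() and restored with load()
path <- file.path(tempdir(), "bundle.rda")
save(raw_bundle, file = path)
rm(raw_bundle)
load(path)

# Rebuild a usable xml object from the stored raw vector
restored <- xml_unserialize(raw_bundle)
patient_id <- xml_attr(xml_find_first(restored, ".//id"), "value")
```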

Save and load bundles as xml files

If you want to store the bundles in xml files instead of R objects, you can use the functions fhir_save() and fhir_load(). fhir_save() takes a list of bundles in the form of xml objects (as returned by fhir_search()) and writes them into the directory specified in the argument directory. Each bundle is saved as a separate xml file. If the folder defined in directory doesn’t exist, it is created in the current working directory.

To read bundles saved with fhir_save() back into R, you can use fhir_load():

fhir_load() takes the name of the directory (or path to it) as its only argument. All xml-files in this directory will be read into R and returned as a list of bundles in xml format just as returned by fhir_search().
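The behaviour described here, one xml file per bundle and reading back every xml file in a directory, can be sketched with xml2 alone. The bundles, the directory name and the file naming scheme below are assumptions made for this illustration; the sketch mimics the described behaviour and does not reproduce the code of fhir_save() or fhir_load().

```r
library(xml2)

# Two made-up bundles for illustration
bundles <- list(
  read_xml("<Bundle><Patient><id value='id1'/></Patient></Bundle>"),
  read_xml("<Bundle><Patient><id value='id2'/></Patient></Bundle>")
)

# Create the target directory if it does not exist yet
directory <- file.path(tempdir(), "bundles_as_xml")
dir.create(directory, showWarnings = FALSE)

# Write one file per bundle
for (i in seq_along(bundles)) {
  write_xml(bundles[[i]], file.path(directory, paste0(i, ".xml")))
}

# Read every xml file in the directory back into a list of bundles
files <- list.files(directory, pattern = "\\.xml$", full.names = TRUE)
reloaded <- lapply(files, read_xml)
```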

Download capability statement

The capability statement documents a set of capabilities (behaviors) of a FHIR Server for a particular version of FHIR. You can download this statement using the function fhir_capability_statement():

cap <- fhir_capability_statement("http://hapi.fhir.org/baseR4/")
#> 
#> Download completed. All available bundles were downloaded.
#> FHIR-Resources cracked.

fhir_capability_statement() takes a FHIR server endpoint and returns a list of data frames containing all information from the capability statement of this server.

You can then access the parts that interest you, for example:

cap$META$software.version
#> NULL

Further Options

Extract data below resource level

While we recommend extracting exactly one data frame per resource, it is technically possible to choose a different level per data frame:
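A design of this kind might look as follows. This is a sketch: the names design_codes and MedCodes are hypothetical, and the XPath expression simply points at the medicationCodeableConcept element instead of the resource itself.

```r
# The root of the extraction is an element inside the resource,
# not the MedicationStatement resource as a whole.
design_codes <- list(
  MedCodes = list(
    ".//medicationCodeableConcept"
  )
)
```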

The above example shows that instead of the MedicationStatement resource, we can choose the MedicationCodeableConcept as the root level for our extraction. This can be useful to get a quick and relatively clean overview of the types of codes used on this level of the resource. It is, however, important to note that this mode of extraction makes it impossible to recognise whether each row belongs to one resource or whether several of these rows came from the same resource. This of course also means that you cannot link this information to data from other resources, because this extraction mode discards that information.