REDCapR
REDCap is a powerful database solution used by many institutions around the world:
“REDCap is a secure web application for building and managing online surveys and databases. While REDCap can be used to collect virtually any type of data in any environment (including compliance with 21 CFR Part 11, FISMA, HIPAA, and GDPR), it is specifically geared to support online and offline data capture for research studies and operations. The REDCap Consortium, a vast support network of collaborators, is composed of thousands of active institutional partners in over one hundred countries who utilize and support their own individual REDCap systems.”
The {REDCapR} package streamlines calls to the REDCap API. Arguably, its main use is to import records from a REDCap project. This works well for simple projects, however becomes ugly when complex databases that include longitudinal structure and/or repeated instruments are involved.
We wrote the REDCapTidieR
package to make the life of
analysts who deal with complex REDCap databases easier. It does so by
building upon {REDCapR} to make its output
tidier. Instead of one large data frame composed of a
sparse matrix, the analyst gets to work with a set of tidy tibbles, one
for each REDCap instrument.
To demonstrate the use of REDCapTidieR
, let’s look at a
REDCap database that has information about some 734 superheroes, derived
from data scraped from the Superhero Database.
This REDCap project contains two instruments:
Heroes Information contains demographic data (name, eye color, height and weight, etc.). This is a nonrepeating instrument.
Super Hero Powers is a repeating instrument that captures all of the superpowers of a specific superhero. A superhero can have zero, one, or many superpowers associated with them.
Here is a screenshot of the REDCap Status Dashboard of this database. Note that Abin Sur (record #2) has a single circle in the Super Hero Powers column, indicating that they have one superpower. Agent 13 (record #8) has no superpowers.
Great! Now let’s import the superheroes data into R. We can use
REDCapR::redcap_read_oneshot()
which returns a list with
and element named data
that contains all of the data as a
data frame. We turned this data frame into a tibble for better
readability:
<- REDCapR::redcap_read_oneshot(redcap_uri, token)$data
superheroes
%>% tibble() superheroes
This data structure is sometimes called the sparse matrix. It’s what happens when REDCap mashes the contents of a database that has both repeating and non-repeating instruments into a single table.
While it may seem a good idea to have everything in one data frame, there are significant downsides, including:
It’s unwieldy! Although there are only 734 superheroes in the data set, there are 6,700 rows. Every transformation first requires whittling down a huge data set. This increases cognitive load.
It’s sparse - there are a lot of NA
values
indicating missing values. This is confusing and ambiguous, because
these NA
values don’t represent data fields left blank in
the database but instead are an artifact of how the table is
built.
Important metadata is missing. For example, it’s not trivial to determine which fields are associated with a specific instrument.
The meaning of a row in the data set is inconsistent. It depends
on whether the row is associated with a specific instance of a specific
repeated instrument or not (you can figure this out by looking at the
redcap_repeat_instrument
column). This is confusing because
we’d expect the granularity of a table to be consistent. It also
technically violates the definition of Tidy Data
because multiple types of observational units are stored in the same
table.
The main function of the {REDCapTidieR} package is the
read_redcap_tidy()
function. It has a similar API to
REDCapR::redcap_read_oneshot()
, requiring a REDCap database
URI and an API token.
Let’s try it out and observe the output:
library(REDCapTidieR)
<- read_redcap_tidy(redcap_uri, token)
superheroes_tidy
superheroes_tidy
This returns a tibble with two rows. This may be surprising because you might expect more rows from a database with 734 superheroes. However, this is a tibble of tibbles, or a supertibble.
In the REDCapTidieR supertibble, each row represents a REDCap instrument. The first column contains the instrument name. The second column is a list column containing a tibble for each instrument. The third column indicates the repeat/nonrepeat structure of the instrument.
There’s a good chance that if we pull data from a REDCap database
that we’d like that data to be represented as individual tibbles in the
global environment. While it’s possible to do this manually (see below),
this can become tedious if the REDCap project has many instruments. We
wrote the bind_tables()
function to automate this.
bind_tables()
takes the output of
read_redcap_tidy()
, extracts the individual tibbles and
binds them to an environment. By default, this is the global
environment, but you can also supply your own environment object which
will be modified using reference semantics.
Let’s take a look at the data frames in the global environment before
and after calling bind_tables()
:
ls.str(Filter(is.data.frame, as.list(.GlobalEnv)))
%>%
superheroes_tidy bind_tables()
ls.str(Filter(is.data.frame, as.list(.GlobalEnv)))
Note that there are now two additional tibbles,
heroes_information
and super_hero_powers
, in
the global environment!
If you don’t like the idea of tibbles magically appearing in your
environment or if you’d like a more pure approach to extracting tibbles,
you can use the extract_table()
or
extract_tables()
functions.
Use extract_table()
to extract a single tibble from the
supertibble:
%>%
superheroes_tidy extract_table("heroes_information")
Use extract_tables()
to create a named list of tibbles
from a supertibble. The default is to extract all tibbles:
<- superheroes_tidy %>%
superheroes_list_of_tibbles extract_tables()
str(superheroes_list_of_tibbles, max.level = 1)
A neat feature of these extraction functions is that they support
tidy-select
semantics and selectors for picking tables:
<- superheroes_tidy %>%
superheroes_list_of_tibbles_ending_with_powers extract_tables(ends_with("powers"))
str(superheroes_list_of_tibbles_ending_with_powers, max.level = 1)
REDCapTidieR
TibblesSo what do the REDCapTidieR tibbles (the ones inside the supertibble)
look like? Consider heroes_information
, which contains data
from a nonrepeating instrument, and note the
following:
This is not a sparse table - no NA
s
The tibble’s name heroes_information
is descriptive
- it tells you what’s in the tibble. The tibble’s name is the same as
the instrument name in REDCap.
Each row has the same observational unit - one
superhero, identified by its
record_id
.
heroes_information
Now look at the super_hero_powers
tibble, which contains
data from a repeating instrument, and note the
following:
Again, this is not a sparse table, making it easy to understand its data content, and the name is descriptive.
Each row has the same observational unit - in this case,
superpower per superhero, identified by
record_id
and redcap_repeat_instance
. This is
because super_hero_powers
comes from a repeated
instrument.
super_hero_powers
In summary, here are the rules by which REDCapTidieR
constructs tibbles:
There is one tibble per REDCap instrument. The tibble’s name is derived from the name given to the instrument in REDCap.
The first column is always the record’s ID column. Note: by
default, REDCap names the record id field record_id
, but
this can be changed. REDCapTidieR
is smart enough to deal
with this. For example, if the record ID field was renamed to
subject_id
then the first column of each tibble would be
subject_id
.
Additional identifying columns may follow the record ID column,
depending on the context and repeat type of the instrument. For example,
we have seen the redcap_repeat_instance
column which
appears when the tibble is derived from a repeated instrument. Tibbles
derived from longitudinal
projects have up to two additional columns, one for events and one
for arms.
Note: Taken in combination, the identifying columns of any
REDCapTidieR
tibble are guaranteed to be unique and NOT NULL, and can be used as composite primary key. This makes it easy to join REDCapTidieR tibbles!
After the identifying columns, data columns appear in the same
order as in the REDCap instrument. Columns derived from categorical
field types (yesno
, truefalse
,
dropdown
, radiobutton
, checkbox
)
are populated with data representing the label. This is
different than REDCapR’s default behavior which will show the
raw value.
The final field is always form_status_complete
,
which is an column to indicate whether the instrument was marked
complete.