In the iris dataset, replace the value of Sepal.Width with 4 if it exceeds 4.
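With dcmodify, this rule can be expressed and applied as follows; a minimal sketch, assuming the package is installed (the `modifier` and `modify` functions are introduced below):

```r
library(dcmodify)

# Rule: if Sepal.Width exceeds 4, replace it with 4
m <- modifier(if (Sepal.Width > 4) Sepal.Width <- 4)
iris <- modify(iris, m)
```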
Data cleaning workflows or scripts typically contain a lot of ‘if this, do that’ statements. Such statements are typically condensed expert knowledge. With this package, such ‘data modifying rules’ are taken out of the code and instead become parameters to the workflow. This allows you to maintain, document and reason about data modification rules separately from the flow of your programme.
This means you, the expert, can focus on the content and let R do the work.
The workflow of dcmodify is designed to take two concerns off your hands. The first concern is how to implement the many ideas and rules that define how and when to modify data. The second concern is how to apply such rules to your data. We therefore introduce a noun and a verb that govern the basic workflow.
modifier: This is an object that stores (conditional) data modification rules.
modify: This is a function that applies the rules in a modifier to your data.
Here’s an example using the
retailers data set from the validate package.
```
##   staff turnover other.rev total.rev staff.costs total.costs profit vat
## 1    75       NA        NA      1130          NA       18915  20045  NA
## 2     9     1607        NA      1607         131        1544     63  NA
## 3    NA     6886       -33      6919         324        6493    426  NA
```
First we define a set of modifying rules, using the modifier function.
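A sketch of such a rule definition; the two rules shown here match those exported to YAML later in this document:

```r
library(dcmodify)

m <- modifier(
  if (other.rev < 0) other.rev <- -1 * other.rev,
  if (is.na(staff.costs)) staff.costs <- mean(staff.costs)
)
```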
Next, the rules can be applied to our data.
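A sketch of applying a modifier object m to the retailers data:

```r
library(dcmodify)
library(validate)
data(retailers)

out <- modify(retailers, m)
head(out, 3)
```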
Alternatively, if you’re a fan of the magrittr package, you can do this:
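A sketch of such a pipeline, assuming the magrittr package is loaded:

```r
library(magrittr)
library(dcmodify)

# %<>% pipes retailers through modify_so and assigns the
# result back to `retailers`
retailers %<>% modify_so(
  if (other.rev < 0) other.rev <- -1 * other.rev
)
```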
The %<>% operator makes sure that the original dataset gets overwritten, and modify_so is a shortcut function for defining modification rules in-line.
The rules you define in a modifier are executed on records where the condition yields TRUE. In R this raises the question of what to do when the condition evaluates to NA for a record. For example, the condition other.rev < 0 in the first rule of m above evaluates to NA in the first record of the retailers dataset. Such cases are handled by treating them as if the condition evaluated to FALSE, so the record is left unchanged.
Modifier rules can also be defined and stored outside of the R script through the use of YAML files. A YAML file can be written by hand, or by exporting an existing modifier object via as_yaml. Exporting the modifier defined in the Basic workflow section would look as follows:
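A sketch; the exact export call may differ, but writing the string returned by as_yaml to a file (here the hypothetical filename myrules.yaml) could look like this:

```r
# Write the YAML representation of modifier m to a file
cat(as_yaml(m), file = "myrules.yaml")
```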
This code will create a YAML file with the following content:
```yaml
rules:
- expr: if (other.rev < 0) other.rev <- -1 * other.rev
  name: M1
  label: ''
  description: ''
  created: 2021-07-29 16:57:00
  origin: command-line
  meta:
- expr: if (is.na(staff.costs)) staff.costs <- mean(staff.costs)
  name: M2
  label: ''
  description: ''
  created: 2021-07-29 16:57:00
  origin: command-line
  meta:
```
Of all these keys, only expr is required; all others are optional.
Once a YAML file is created, modifier can read the modification rules from the file and store them as a modifier object. For this the .file argument is used:
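A sketch, assuming the rules were saved to a file named myrules.yaml (a hypothetical filename):

```r
library(dcmodify)

m <- modifier(.file = "myrules.yaml")
```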
Using separate files for the storage of rules has the advantage that the same set of rules can be easily shared across many different scripts.
You, the user, can assume that the rules are evaluated record by record. In reality, the package is smart enough to analyse the rules a little and, where possible, evaluate them in a vectorized manner. This way explicit (and slow) R loops are avoided as much as possible.
In short, when you call modify_so, each rule’s condition is evaluated for all records at once, after which the corresponding assignment is applied to the records where the condition yields TRUE.
The functionality of this package resembles dplyr::mutate, since it also allows one to specify data mutations on data frames (or other tabular data objects). The dplyr package is especially useful for interactive use, and also for use in programming through ‘underscored’ functions such as mutate_. The dcmodify package has been developed with a production environment in mind, where similar data sets are processed frequently. By taking the modifying rules out of the software, R programmers can build an application that allows users who are less knowledgeable about programming to specify their modification rules.
It can be interesting to study the effect of a certain set of data modifying rules. The lumberjack package is capable of tracking changes in data.
To start logging data you need to replace the magrittr pipe (
%>%) with the lumberjack operator
%>>% and insert some logging commands into the pipeline.
## Dumped a log at cellwise.csv
```
##   step                     time srcref
## 1    1 2021-09-24 11:29:10 CEST     NA
## 2    1 2021-09-24 11:29:10 CEST     NA
## 3    1 2021-09-24 11:29:10 CEST     NA
## 4    1 2021-09-24 11:29:10 CEST     NA
## 5    1 2021-09-24 11:29:10 CEST     NA
## 6    1 2021-09-24 11:29:10 CEST     NA
##                                                      expression key variable old
## 1 modify_so(if (height < mean(height)) height <- mean(height))    a   height  58
## 2 modify_so(if (height < mean(height)) height <- mean(height))    b   height  59
## 3 modify_so(if (height < mean(height)) height <- mean(height))    c   height  60
## 4 modify_so(if (height < mean(height)) height <- mean(height))    e   height  62
## 5 modify_so(if (height < mean(height)) height <- mean(height))    f   height  63
## 6 modify_so(if (height < mean(height)) height <- mean(height))    g   height  64
##   new
## 1  61
## 2  61
## 3  61
## 4  61
## 5  61
## 6  61
```
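A pipeline producing a cellwise log could be sketched as follows. The data frame dat is hypothetical (reconstructed to resemble the log above), and the sketch assumes lumberjack’s cellwise logger, which writes its log to cellwise.csv when dumped:

```r
library(lumberjack)
library(dcmodify)

# Hypothetical data: a key column identifying records, and a
# numeric variable to modify
dat <- data.frame(key = letters[1:7], height = 58:64)

out <- dat %>>%
  start_log(log = cellwise$new(key = "key")) %>>%
  modify_so(if (height < mean(height)) height <- mean(height)) %>>%
  dump_log()

# The cell-level changes are then available in 'cellwise.csv'
logdata <- read.csv("cellwise.csv")
```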
Conditional statements including
else are not supported yet. Rules containing
if() else are ignored with a warning.
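A common workaround is to split an if ... else rule into two rules with complementary conditions; a sketch, with hypothetical variable names:

```r
library(dcmodify)

# Not supported: if (x < 0) y <- "neg" else y <- "non-neg"
# Equivalent pair of rules:
m <- modifier(
  if (x < 0)  y <- "neg",
  if (x >= 0) y <- "non-neg"
)
```

Note that records where x is NA satisfy neither condition and are left unchanged.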