Update Datasets With replacer

Dragos Bandur

2022-08-19

Introduction

replacer, a value replacement utility currently based on package data.table, is intended for outside-database update of data. It requires the preparation of a lookup file which is a list of replacement requests with and, in special circumstances, without an index column.

The input and output data files are comma-separated values file format (csv). Various other file formats can therefore be processed after conversion.

Data processing such as cleanup, file format conversion and the appendage of new data are outside the scope of this utility.

There could be several lookup files associated with one data file and vice versa which could be batch-processed.

This vignette defines the terms related to the replacement and to the elements of the lookup file, and details the procedure of creating an efficient lookup file suggesting ways to maintain it relevant for future updates. It continues by introducing the two User-intended functions, for single file and batch file processing, and describes their internal workflow. Finally, it presents a screen output result obtained with a lookup without index and introduces the datasets that are used in examples.

Motivation

This utility is accessible to beginners to R working with outside-database files. It facilitates complex dataset updates with minimal prompt time by employing User-friendly functions which automatically follow a decision tree rooted in the presence of User-made index.

The choice for data.table is motivated by its concise and sound code, effectiveness and efficiency in processing large and complex datasets, and its independence on packages, or verses, other than base R itself; this utility will only install data.table on User’s machine, had it had not been installed already.

Hence, familiarity with basic R commands is necessary. Familiarity with data.table is not, yet strongly recommended!

Definitions

The following terms are specific to this utility:

Lookup may contain a combination of simple and multiple replacements, simple replacements only, multiple replacements only (i.e. missing only, duplicated values only, or mixed) as needed. Multiple replacements can be split as needed.

On special circumstances, column “id” may be absent from lookup while the presence of the other three standard columns is mandatory. Modifying standard columns’ names will result in error. Any extra columns in lookup will be automatically removed during internal conformance checks on lookup.

Multiple replacements apply to missing or duplicated non-missing values. While replacement values of duplicated values could be regarded as generic, they are entirely decided by the User. They are not restricted in this sense.

In this version, replacements of missing data are restricted to words formed with the word roots MIS|mis, PRE|pre, UN|un, ABS|abs and the adverbs YES|yes, NO|no for missing values of type character and with a number equal to- or larger than 3 integer 9s (i.e. 999, 9999 etc.) for missing values of double or integer types.

Short sentences beginning with any word root or adverb above may be accepted as replacements.

Example

Typical data replacement workflow: the initial data file and the corresponding lookup file (here named “data.csv” and “lookup.csv”) are saved in a directory, here named dir. Once the package is installed, the following commands should be typed at the R prompt:

> require(replacer)
> dir = 'C:/path/to/directory'
> replaceVals(dir) ## to update the values in 'data.csv'

Results will show on screen (see Section “Screen Output Example” which presents the result of a different data/lookup pair and requests set). A csv file similar to Out-Data is saved on dir.

In-Data
Rw a b uninvlvd
1 aa 7.2 0.6
2 17.6 0.1
3 cc 1.6 1.6
4 dd 0.8 1.2
5 ee -14.0 5.0
6 cc 0.1
7 1.6
8 -5.0 3.3
9 mm 1.6 9.6
10 5.2
11
12 cc
Lookup
Rw vars oldVals newVals id source
1 a ZZ 2 source1
2 a CC 8 source2
3 a cc DD 0 source3
4 a mm UU 9 source4
5 a cc CA 12 source5
6 a YES source6
7 b 1.607 1.1 0 source7
8 b 1.607 8.8 7 source8
9 b 999 source9
Out-Data
Rw a b uninvlvd
1 aa 7.2 0.6
2 ZZ 17.6 0.1
3 DD 1.1 1.6
4 dd 0.8 1.2
5 ee -14.0 5.0
6 DD 0.1
7 YES 8.8
8 CC -5.0 3.3
9 UU 1.1 9.6
10 YES 5.2
11 YES 999.0
12 CA 999.0

This small example covers the complete set of lookup requests with index present. Column “Rw” is here for convenience only. It should not be manually created.

The left-hand side table, In-Data, is the original data file and contains 3 columns of interest. The two columns (“a” and “b”) involved in replacements, are of character respectively, of numeric data types and both contain missing, duplicates and unique values. The “uninvlvd” column in In-Data is not listed in lookup’s vars, it is not involved in replacements and therefore, remains unchanged in Out-Data. Column source in Lookup is not standard and therefore, is automatically removed from computations.

The Out-Data result was achieved with a lookup file with User-made index; the replacement requests were as follows:

User-made Index In Lookup

A User-made index may or, may not be necessary in lookup:

Lookup With User-made Index

A User-made index is the safest value replacement procedure; it requires little insight in the structure of data and insures fast review of large datasets. When there is a full range of request types including splitting, the index should contain the value types shown in the above example, in lookup column id.

Zero Value Present

There could be several distinct sets of duplicated values in any data column. If the purpose is multiple replacements of any/all of these sets (or subsets of these sets) then, for each subset, the replacement request should take only one row in lookup having 0 as index value. Corresponding replacement values are at User’s discretion.

Empty (NA) Value Present

Same rules as above apply to multiple replacement of sets of missing values with two differences: 1) the corresponding index value is empty or NA, and 2) the generic value is restricted to a word or short sentence starting with any word root or adverb listed in “Definitions”.

For generic values of numeric type, data type preservation (see Note 2) is not required (i.e. it is not required to enter 9999.0 instead of 9999 in the id row corresponding to missing values of double data type).

Further revisions may bring more generic value customization if necessary.

Row Number Present

To request simple replacements, whether having unique, duplicated or missing values in oldVals column, the index should contain the corresponding data file row number. Sorting lookup by vars would maintain focus on one data column at a time.

Standard columns oldVals and newVals in lookup, revert to the character data type upon completion since they contain numeric and character values combined. This behavior is normal.

In conclusion, a User-made index covers all types of replacements, including splits.

Lookup Without User-made Index

There are special circumstances when the index is absent in lookup:

Index Must Be Absent In Lookup

There is no need for index when lookup contains only multiple replacement requests (Section “Screen Output Example”). An extension of this case involves splitting on missing values and is explained in the sub-section below.

Index May Be Absent In Lookup

This case requires a deeper insight in the structure of the data file such as knowing in advance which values are duplicated. When used standalone, helper function whichDups() - part of this utility - finds the duplicated values in all columns of the data file.

The requirements for an effective lookup table without index are:

Note 1 Whether the index is present or not in lookup, at the time of running the updates the row order in the data file must be identical to the row order in the data file as it was at the completion of lookup. If necessary, an index named other than “id” could be built and preserved inside the dataset and it will be treated as uninvolved column. Re-ordering the dataset by this index before run will retrieve the row order at the completion of lookup and will keep the lookup relevant for future data updates.

Note 2 It is recommended that User preserves the data type between values set for replacement and their replacements (i.e. integer/double to integer/double etc.). Although the helper function con2fcoales() automatically solves such discrepancies in the background, if an error message similar to “Item 2 is type integer but the first item is type double. Please coerce …” appears on screen, it means that 1) the requirement was obviously violated and that 2) the severity of violation was beyond the mitigation capability of con2fcoales().

In such cases, a review of all output messages and a review of the data file and the associated lookup should ensue.

Note 3 Due to their special type, splitting on missing values across data columns is allowed when index is absent, i.e. if one involved data column contains one missing value and other involved columns contain missing values set (or not) for multiple replacements, splitting is allowed and the unique missing value can be replaced with a non-generic replacement value.

Functions Intended For User

There are two functions designed for direct use and set to display a series of messages and comments informing the User on the computational path taken and on findings along the way. While the messages can be turned off, the comments remain visible.

Single file replacements are processed by function replaceVals(). When the data and lookup files are named simply “data.csv” and “lookup.csv” the function only requires the path to the directory where these files were stored, written as length 1 quoted character, with forward “/” or double backward “\\” slash and without end slash.

At start, this function performs a series of conformance checks on data and lookup, excludes non-standard columns, separates the data into involved and uninvolved columns then, rejoins them on exit to preserve the data format.

Batch file replacements are processed by function bReplace(). When messages are set to TRUE it offers an wide range of custom messages/comments for each data/lookup input pair and request type. During run, bReplace() calls replaceVals() function as many times as data/lookup pairs are on the names list, displays a named list of messages, comments, updated data and counts related to duplicated and/or missing values requests, for each pair.

Both functions save their updated data to the directory above, in csv file format.

Internal Workflow

Helper function sReplace() (described in the Manual) is the data replacement workhorse for this utility. When called by replaceVals(), it firstly checks for index presence in lookup. Upon result, the function moves along the branches of a decision tree:

If The Index Is Found Absent

The function starts by identifying duplicated and/or missing values within the involved columns as well as eventual splits on missing values.

If lookup requires multiple and simple replacements, the function separates lookup into maximum 3 subsets: 1) of multiple replacements for duplicated values, for which it later creates an internal index list, 2) of multiple replacements for missing values for which an internal index is not necessary, and 3) the remainder subset containing unique values, including unique missing values, for which it also creates an internal index list.

Requests for splitting on duplicated values with no index stop the function with an error.

Internal Index For Duplicated Values

The function creates an internal index list of row numbers corresponding to all elements of distinct subsets of duplicated values found within each involved data column and loops the function data.table::set() for replacements.

No Internal Index For Missing Values

As already mentioned, no index is created for multiple replacements of missing values as there is only one generic value per data column (this utility uses no other information on missing data beside the data type. For a different approach on this subject the User is directed to data imputation literature).

The subset of missing values is then reshaped, and the columns are coalesced with corresponding data columns, for each generic value present in lookup.

Index For Unique Values

As stated above, simple replacements of unique values without User-made index are possible. Once the internal index is created, this subset is reshaped, joined with the data on index and then, the corresponding columns are coalesced (please see these terms in data.table Manual).

If The Index Is Found Present

The function subsets the lookup using the special index values 0 and/or NA (or empty). At maximum, 3 subsets are formed as above. The replacement process is similar with the process used for absent index with the difference that the unique values already have the User-made index of row numbers.

In conclusion, whether single or batch file processing, all request types are processed in one run while Users have the ability of monitoring the internal workflow.

Screen Output Example

Presented below is an update performed on the “Chile” dataset from package carData.

At the R prompt, type or just copy/paste the following commands:

help(Chile, package = "carData")
require(replacer)
dir = system.file("extdata", package = "replacer")
replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)

Screen output:

reading data from: C:/R/R-4.1.0/library/replacer/extdata ...

completed reading data ...

checking standard columns in lookup ...

starting replacements ...

found no request for 1:1 replacements ...

searching for 1:many replacements ...

found request for multiple duplicated value replacments: creating index ...

replacing multiple duplicated values ...

found request for multiple missing value replacements: replacing ...

helper function has completed!

$` updated_chile_using_chile_nadup`
              region population    sex age      education income statusquo       vote
   1:          North     200000   Male  65        Primary  35000   1.00820         Y
   2:          North     200000   Male  29 Post-secondary   7500  -1.29617         N
   3:          North     200000 Female  38        Primary  15000   1.23072         Y
   4:          North     200000 Female  49        Primary  35000  -1.03163         N
   5:          North     200000 Female  23      Secondary  35000  -1.10496         N
  ---                                                                                       
2696: Metro Santiago      15000   Male  42        Primary  15000  -1.26247         N
2697: Metro Santiago      15000 Female  28        Primary  15000   1.32950         Y
2698: Metro Santiago      15000 Female  44        Primary  75000   1.42045         Y
2699: Metro Santiago      15000   Male  21      Secondary  75000   0.18315 PREFERS NO ANSWER
2700: Metro Santiago      15000   Male  20 Post-secondary  35000   1.38179         Y

$` multiple_dups_repl_counts`
          vars oldVals          newVals education population region  sex
 1:  education       P          Primary      1107         NA     NA   NA
 2:  education      PS   Post-secondary       462         NA     NA   NA
 3:  education       S        Secondary      1120         NA     NA   NA
 4: population  175000           200000        NA        140     NA   NA
 5: population   25000            50000        NA        360     NA   NA
 6:     region       C          Central        NA         NA    600   NA
 7:     region       M   Metro Santiago        NA         NA    100   NA
 8:     region       N            North        NA         NA    322   NA
 9:     region       S            South        NA         NA    718   NA
10:     region      SA City of Santiago        NA         NA    960   NA
11:        sex       F           Female        NA         NA     NA 1379
12:        sex       M             Male        NA         NA     NA 1321

$` multiple_NAs_repl_counts`
    region population        sex        age  education     income  statusquo       vote 
         0          0          0          1         11         98         17        168 

This update has replaced the abbreviations used in the source data with full names and complete words, as well as, updated the population totals on few regions with fictitious values. Missing values in column vote were replaced by a short sentence using the word root “PRE” listed above.

The requests were processed through a lookup table without User-made index.

The first output block named “updated_chile_using_chile_nadup” displays the head and the tail of the updated data file.

To view the complete updated dataset on screen, type:

upData = replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)[[1]]
View(upData)

Next block, named “multiple_dups_repl_counts”, displays a count of duplicated value updates processed on each involved column, by value and replacement value. In this example, the left-hand side of this block contains the complete cases of lookup.

Finally, the third block named “multiple_NAs_repl_counts” shows counts of missing values replaced by generic values within respective involved columns. In other cases, unrequested replacements within any of the involved columns will be commented in the screen output (see replaceVals() Examples).

All block names change according to input and request type; these names show what was processed and what was not. In case of error, the displayed messages/comments should hint of where in the process the error occurred.

Example Datasets

Although artificial, the datasets in this package cover situations encountered in practice. They contain fictitious values inserted for exemplification purpose with the exception of “chile” dataset, a copy of the “Chile” dataset from package carData; however, some replacement values within associated lookup files are also fictitious.

The batch-file example (see bReplace() Examples) contains the names of data and associated lookup files presented below. Duplicated values and counts of missing values are shown below for all these datasets. When several distinctive subsets of duplicated values are present in an involved column the column’s name appears indexed.

Data And Associated Lookup Files With Index

Duplicated
a c d1 d2
cc 0.737 15 8
Missing
a b c d e f
4 2 3 2 2 3

Associated lookup files:

lookup_id.csv”: mixed simple/multiple replacements of unique, duplicated and missing data, similar to the typical replacement workflow example. The file contains standard columns only.

lookup_idsimple.csv”: simple replacements i.e. without 0 or NA values in “id”. The file contains standard columns only.

Duplicated
age27 statusquo48 statusquo100 education2
44 1.46885 -1.15762 PS
Missing
region population sex age education income statusquo vote
0 0 0 1 11 98 17 168

Associated lookup file:

chile_id.csv”: splits of multiple replacements of duplicated values. Also, multiple replacements of missing data. The file includes non-standard column “source”.

Data And Associated Lookup Files Without Index

Duplicated
a b1 b2 b3 c1 c2 d1 d2
cc -4.979 2.142 1.121 0.737 0.476 15 8
Missing
a b c d
7 1 3 1

Associated lookup file:

lookup.csv”: mixed simple and multiple replacements of unique, duplicated and missing values. The file contains standard columns only.

Duplicated
b d
2.142 15
Missing
a b c d
1 3 1 1

Associated lookup file:

lookup_unique”: split across columns on missing values. The file contains standard columns only.

Other Associated Lookup Files

lookupDUP”, “lookupNA”: multiple replacements only of duplicated or missing values in “data_id” dataset. Both files contain standard columns only.

chile_nadup”: multiple replacements only of duplicated and missing values in “chile” dataset. The file contains non-standard column “source” .

Session Info

#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] replacer_1.0.2    kableExtra_1.3.4  knitr_1.39        data.table_1.14.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13   xml2_1.3.3        magrittr_2.0.3    munsell_0.5.0    
#>  [5] rvest_1.0.2       viridisLite_0.4.0 colorspace_2.0-3  R6_2.5.1         
#>  [9] rlang_1.0.4       fastmap_1.1.0     highr_0.9         stringr_1.4.0    
#> [13] httr_1.4.4        tools_4.2.1       webshot_0.5.3     xfun_0.32        
#> [17] cli_3.3.0         jquerylib_0.1.4   systemfonts_1.0.4 htmltools_0.5.3  
#> [21] yaml_2.3.5        digest_0.6.29     lifecycle_1.0.1   formatR_1.12     
#> [25] sass_0.4.2        glue_1.6.2        cachem_1.0.6      evaluate_0.16    
#> [29] rmarkdown_2.15    stringi_1.7.8     compiler_4.2.1    bslib_0.4.0      
#> [33] scales_1.2.0      svglite_2.1.0     jsonlite_1.8.0