dataCompareR aims to make it easy to compare two tabular data objects in R. It’s specifically designed to show the differences between two sets of data in a way that makes them easier to understand and, if necessary, helps you work out how to remedy them. In this regard, it aims to offer more useful output than all.equal when your two datasets do not match, but it isn’t intended to replace all.equal if you just want to test for equality.
It’s expected that dataCompareR will be used to compare data frames, but it can be used to compare any objects that can be coerced to data frames, such as data tables, tibbles or matrices. dataCompareR cannot compare data that is not tabular in format (nested JSON, irregular lists etc) but does handle tabular data that needs to be matched (or joined) on one or more keys (or ID columns).
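As a minimal sketch of the coercion step described above, the following compares two matrices by turning them into data frames first. The object names (matA, matB, dfA, dfB, compMat) are made up for illustration, and the explicit as.data.frame calls are there only to make the coerced columns visible before the comparison.

```r
library(dataCompareR)

# Two numeric matrices built from the measurement columns of iris
matA <- as.matrix(iris[, 1:4])
matB <- matA

# Introduce a small, known difference in the first three rows
matB[1:3, "Sepal.Length"] <- matB[1:3, "Sepal.Length"] + 0.5

# Coerce explicitly so the resulting column names and types can be inspected
dfA <- as.data.frame(matA)
dfB <- as.data.frame(matB)
str(dfA)

# Compare row by row (no keys supplied)
compMat <- rCompare(dfA, dfB)
print(compMat)
```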
The hope is that dataCompareR is easy to understand, so please don’t feel like you’re obliged to read this! The aim of this vignette is twofold - to offer end-to-end examples of using dataCompareR and, for those who want to know, to provide details of how the package performs the comparison.
For the purpose of this vignette we’ll intentionally modify iris to use for our comparison.
library(dataCompareR)
# We'll use iris for our comparison
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Make a copy of iris
iris2 <- iris
# And change it, first by subsetting just the first 140 rows
iris2 <- iris2[1:140,]
# then removing the Petal.Width column
iris2$Petal.Width <- NULL
# And then changing some values
iris2[1:10,1] <- iris2[1:10,1] + 1
And then run a comparison using the rCompare function:
# run the comparison
compIris <- rCompare(iris, iris2)
## Running rCompare...
## Warning: package 'bindrcpp' was built under R version 3.4.1
rCompare returns an S3 object which you can use with summary and print. summary is a good way to check the results:
# Check the results
summary(compIris)
## dataCompareR is generating the summary...
##
## Data Comparison
## ===============
##
## Date comparison run: 2017-11-10 15:31:18
## Comparison run on R version 3.4.0 (2017-04-21)
## With dataCompareR version 0.1.1
##
##
## Meta Summary
## ============
##
##
## Dataset Name Number of Rows Number of Columns
## ------------- --------------- ------------------
## iris 150 5
## iris2 140 4
##
##
## Variable Summary
## ================
##
## Number of columns in common: 4
## Number of columns only in iris: 1
## Number of columns only in iris2: 0
## Number of columns with a type mismatch: 0
## No match key used, comparison is by row
##
##
## Columns only in iris: Petal.Width
## Columns in both : PETAL.LENGTH, SEPAL.LENGTH, SEPAL.WIDTH, SPECIES
##
## Row Summary
## ===========
##
## Total number of rows read from iris: 150
## Total number of rows read from iris2: 140
## Number of rows in common: 140
## Number of rows dropped from iris: 10
## Number of rows dropped from iris2: 0
##
##
## Data Values Comparison Summary
## ==============================
##
## Number of columns compared with ALL rows equal: 3
## Number of columns compared with SOME rows unequal: 1
## Number of columns with missing value differences: 0
##
## Columns with all rows equal : PETAL.LENGTH, SEPAL.WIDTH, SPECIES
##
## Summary of columns with some rows unequal:
##
##
##
## Column Type (in iris) Type (in iris2) # differences Max difference # NAs
## ------------- --------------- ---------------- -------------- --------------- ------
## SEPAL.LENGTH double double 10 1 0
##
##
##
## Unequal column details
## ======================
##
##
##
## #### Column - SEPAL.LENGTH
## Showing sample of size 5
##
##
##
## SEPAL.LENGTH (iris) SEPAL.LENGTH (iris2) Type (iris) Type (iris2) Difference
## --- -------------------- --------------------- ------------ ------------- -----------
## 7 4.6 5.6 double double -1
## 2 4.9 5.9 double double -1
## 6 5.4 6.4 double double -1
## 1 5.1 6.1 double double -1
## 4 4.6 5.6 double double -1
Or you can save a copy of the report using saveReport:
# Write the summary to a file
saveReport(compIris, reportName = 'compIris')
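If you would rather not write the report into the working directory, something like the sketch below may help. It assumes that saveReport accepts reportLocation and showInViewer arguments, which is how the package documentation is recalled here; please check ?saveReport for the exact signature in your installed version. The names irisChanged, compExample and outDir are illustrative only, and producing an HTML report may additionally require rmarkdown and pandoc.

```r
library(dataCompareR)

# Build a small comparison to report on
irisChanged <- iris
irisChanged[1:10, 1] <- irisChanged[1:10, 1] + 1
compExample <- rCompare(iris, irisChanged)

# Write the report to a temporary directory instead of the working directory
# (reportLocation and showInViewer are assumed argument names, see ?saveReport)
outDir <- tempdir()
saveReport(compExample,
           reportName = "compExample",
           reportLocation = outDir,
           showInViewer = FALSE)

# List whatever files were produced for this report
list.files(outDir, pattern = "compExample")
```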
In the first example, we compared our data based on its row order. What if we want to match our data on a key? We’ll produce another test data set based on the pressure dataset.
# We'll use the pressure dataset for comparison
head(pressure)
## temperature pressure
## 1 0 0.0002
## 2 20 0.0012
## 3 40 0.0060
## 4 60 0.0300
## 5 80 0.0900
## 6 100 0.2700
# Make a copy of pressure
pressure2 <- pressure
# And change it, first by randomising the row order
pressure2 <- pressure2[sample(nrow(pressure2)),]
# then changing just one element, so for temperature of
pressure2[5,1]
## [1] 360
# We modify pressure to be twice as large
pressure2[5,2] <- pressure2[5,2] * 2
Run the comparison with rCompare, specifying that we want to match on temperature:
# run the comparison
compPressure <- rCompare(pressure, pressure2, keys = 'temperature')
## Running rCompare...
And this time, we’ll choose to get a shorter summary using print
# Check the results - use print for a quick summary
print(compPressure)
## All columns were compared, all rows were compared
## There are 1 mismatched variables:
## First and last 5 observations for the 1 mismatched variables
## TEMPERATURE valueA valueB variable typeA typeB diffAB
## 1 360 806 1612 PRESSURE double double -806
We can also extract the mismatching data to explore further using generateMismatchData, which generates a list containing two data frames, one per input table, each holding the mismatching rows from the comparison.
## Warning: package 'dplyr' was built under R version 3.4.2
# use generateMismatchData to pull out the mismatching rows from each table
mismatches <- generateMismatchData(compPressure, pressure, pressure2)
mismatches
## $pressure_mm
## TEMPERATURE PRESSURE
## 1 360 806
##
## $pressure2_mm
## TEMPERATURE PRESSURE
## 1 360 1612
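Matching is not limited to a single key. The sketch below passes a character vector of two column names to keys. The data (readingsA, readingsB and their site, year and value columns) is hypothetical and exists only to show the call; no output is reproduced here.

```r
library(dataCompareR)

# Hypothetical data: one row per (site, year) pair
readingsA <- data.frame(
  site  = c("a", "a", "b", "b"),
  year  = c(2016, 2017, 2016, 2017),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Same rows in a different order, with one value changed
readingsB <- readingsA[c(3, 1, 4, 2), ]
readingsB$value[readingsB$site == "b" & readingsB$year == 2017] <- 40

# Match rows on the combination of both key columns
compReadings <- rCompare(readingsA, readingsB, keys = c("site", "year"))
print(compReadings)
```

Because the rows are matched on the keys, the shuffled row order in readingsB should not count as a difference; only the changed value should be reported.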
The aspects of the dataCompareR::rCompare function that matter to the end user are:
- Data is coerced to a data frame using as.data.frame. If you need more advanced coercion, please do this before calling dataCompareR.
- Missing values are represented by NA and NaN, which are handled in the following way:
  - If both values are NA, match is TRUE
  - If both values are NaN, match is TRUE
  - If one value is NA and the other NaN, match is FALSE
  - If one value is NA, and the other is a valid value, match is FALSE
  - If one value is NaN, and the other is a valid value, match is FALSE
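The NA and NaN rules above can be sketched in base R as follows. This is an illustration of the rules only, not the package's actual implementation: valuesMatch is a hypothetical helper name, it assumes two numeric vectors of equal length, and it uses exact equality for valid values.

```r
# Hypothetical helper illustrating the NA / NaN matching rules
valuesMatch <- function(a, b) {
  stopifnot(length(a) == length(b))

  # NaN on both sides counts as a match
  bothNaN <- is.nan(a) & is.nan(b)

  # is.na() is also TRUE for NaN, so exclude NaN to isolate "true" NA
  bothNA <- (is.na(a) & !is.nan(a)) & (is.na(b) & !is.nan(b))

  # Ordinary equality, only where neither side is missing
  anyMissing <- is.na(a) | is.na(b)
  equalValues <- !anyMissing & (a == b)

  bothNaN | bothNA | equalValues
}

a <- c(NA, NaN, NA,  NA, NaN, 1)
b <- c(NA, NaN, NaN, 1,  1,   1)
valuesMatch(a, b)
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE
```

The first five elements correspond, in order, to the five rules listed above; the last is an ordinary pair of equal values.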