dataPreparation

2017-07-07

This vignette introduces dataPreparation: what it offers and how simple it is to use.

1 Introduction

1.1 Package presentation

Built on top of the data.table package, dataPreparation lets you perform most of the painful data preparation for a data science project with a minimal amount of code.

data.table and the package's other dependencies are handled automatically at installation.
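Getting started is the usual one-liner (this sketch assumes you install from CRAN):

```r
# Install from CRAN; data.table and the other dependencies come along.
install.packages("dataPreparation")
library(dataPreparation)
```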

1.2 Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky.

Here are the functions available in this package to tackle the main preparation steps:

Correct                   Transform                 Filter               Handle NA     Shape
findAndTransformDates     diffDates                 fastFilterVariables  fastHandleNa  shapeSet
findAndTransformNumerics  aggregateByKey            whichAreConstant                   setAsNumericMatrix
setColAsCharacter         setColAsFactorOrLogical   whichAreInDouble
setColAsNumeric                                     whichAreBijection
setColAsDate              fastRound

All of those functions are integrated in the full pipeline function prepareSet.

In this tutorial we will detail all those steps and how to handle them with this package, using an example data set.

1.3 Tutorial data

For this tutorial, we are going to use a messy version of the adult data set.

data(messy_adult)
print(head(messy_adult, n = 4))
        date1      date2       date3          date4    num1   num2
1: 2017-22-09 06/17/2017 24-Jul-2017   26-July-2017 -0.4543 -0,826
2: 2017-18-07 07/15/2017 26-Jul-2017   28-July-2017  1.4432     NA
3: 2017-05-01         NA 20-Aug-2017 22-August-2017  1.7226 0,7611
4: 2017-28-11 12/19/2017 14-May-2017    16-May-2017  1.9542     NA
   constant                      mail    num3 age    type_employer fnlwgt
1:        1      lucas.pierre@aol.com -0,4543  39        State-gov  77516
2:        1  caroline.marie@yahoo.com  1,4432  50 Self-emp-not-inc  83311
3:        1 jake.lucas@protonmail.com  1,7226  38          Private 215646
4:        1      marie.pierre@aol.com  1,9542  53          Private 234721
   education education_num            marital        occupation
1: Bachelors            13      Never-married      Adm-clerical
2: Bachelors            13 Married-civ-spouse   Exec-managerial
3:   HS-grad             9           Divorced Handlers-cleaners
4:      11th             7 Married-civ-spouse Handlers-cleaners
    relationship  race  sex capital_gain capital_loss hr_per_week
1: Not-in-family White Male         2174            0          40
2:       Husband White Male            0            0          13
3: Not-in-family White Male            0            0          40
4:       Husband Black Male            0            0          40
         country income
1: United-States  <=50K
2: United-States  <=50K
3: United-States  <=50K
4: United-States  <=50K

We added 9 really ugly columns to the data set:

  • the first four are dates, each stored with a different messy format,
  • three are numeric, some written with a comma as decimal separator,
  • one is constant,
  • one contains e-mail addresses.

Moreover, the same information can be contained in two different columns.

2 Correct functions

2.1 Identifying and transforming date columns

The first thing to do is to identify the columns that contain dates (the first four) and transform them.

messy_adult <- findAndTransformDates(messy_adult)
## [1] "findAndTransformDates: It took me 0.36s to identify formats"
## [1] "findAndTransformDates: It took me 0.38s to transform 4 columns to a Date format"
Let’s have a look at the transformation performed on those 4 columns:
date1_prev  date2_prev  date3_prev   date4_prev        transfo  date1       date2       date3       date4
2017-22-09  06/17/2017  24-Jul-2017  26-July-2017      =>       2017-09-22  2017-06-17  2017-07-24  2017-07-26
2017-18-07  07/15/2017  26-Jul-2017  28-July-2017      =>       2017-07-18  2017-07-15  2017-07-26  2017-07-28
2017-05-01  NA          20-Aug-2017  22-August-2017    =>       2017-01-05  NA          2017-08-20  2017-08-22
2017-28-11  12/19/2017  14-May-2017  16-May-2017       =>       2017-11-28  2017-12-19  2017-05-14  2017-05-16
2017-02-08  01/14/2017  31-Oct-2017  02-November-2017  =>       2017-08-02  2017-01-14  2017-10-31  2017-11-02
2017-28-07  NA          31-Dec-2017  02-January-2018   =>       2017-07-28  NA          2017-12-31  2018-01-02

As one can see, even though the formats were different and somewhat ugly, they were all handled.
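For reference, here is what that parsing amounts to in base R. The strptime patterns below are my own mapping of the formats shown above (not what the package necessarily uses internally), and %b/%B month names are locale dependent:

```r
# Manual base R equivalents of the automatic parsing above.
as.Date("2017-22-09", format = "%Y-%d-%m")   # year-day-month
as.Date("06/17/2017", format = "%m/%d/%Y")   # US style
as.Date("24-Jul-2017", format = "%d-%b-%Y")  # abbreviated month name
```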

2.2 Identifying and transforming numeric columns

And now the same thing with numeric columns:

messy_adult <- findAndTransformNumerics(messy_adult)
## [1] "findAndTransformNumerics: It took me 0.19s to identify 3 numerics column(s), i will set them as numerics"
## [1] "findAndTransformNumerics: It took me 0.06s to transform 3 column(s) to a numeric format"
num1_prev  num2_prev  num3_prev  transfo  num1     num2     num3
-0.4543    -0,826     -0,4543    =>       -0.4543  -0.8260  -0.4543
1.4432     NA         1,4432     =>       1.4432   NA       1.4432
1.7226     0,7611     1,7226     =>       1.7226   0.7611   1.7226
1.9542     NA         1,9542     =>       1.9542   NA       1.9542
-0.5645    0,8952     -0,5645    =>       -0.5645  0.8952   -0.5645
0.8582     0,7568     0,8582     =>       0.8582   0.7568   0.8582
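Under the hood the fix is simple; a base R sketch of the comma-to-dot conversion:

```r
# Replace the decimal comma, then coerce to numeric; this is essentially
# what the transformation does for "0,7611"-style columns.
raw <- c("-0,826", "0,7611", NA)
cleaned <- as.numeric(gsub(",", ".", raw))
```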

So now our data set is a bit less ugly.

3 Filter functions

3.1 Identifying useless columns

The idea now is to identify useless columns:

3.1.1 Look for constant variables

constant_cols <- whichAreConstant(messy_adult)
## [1] "whichAreConstant: constant is constant."
## [1] "whichAreConstant: it took me 0.22s to identify 1 constant column(s)"

3.1.2 Look for columns in double

double_cols <- whichAreInDouble(messy_adult)
## [1] "whichAreInDouble: num3 is exactly equal to num1. I put it in drop list."
## [1] "whichAreInDouble: it took me 0.22s to identify 1 double(s)"

3.1.3 Look for columns that are bijections of one another

bijections_cols <- whichAreBijection(messy_adult)
## [1] "whichAreBijection: date4 is a bijection of date3. I put it in drop list."
## [1] "whichAreBijection: num3 is a bijection of num1. I put it in drop list."
## [1] "whichAreBijection: education_num is a bijection of education. I put it in drop list."
## [1] "whichAreBijection: it took me 0.66s to identify 3 bijection(s)"

To check this, let’s have a look at the columns concerned:

kable(head(messy_adult[, .(constant, date3, date4, num1, num3, education, education_num)])) %>%
   kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 12)
constant  date3       date4       num1     num3     education  education_num
1         2017-07-24  2017-07-26  -0.4543  -0.4543  Bachelors  13
1         2017-07-26  2017-07-28  1.4432   1.4432   Bachelors  13
1         2017-08-20  2017-08-22  1.7226   1.7226   HS-grad    9
1         2017-05-14  2017-05-16  1.9542   1.9542   11th       7
1         2017-10-31  2017-11-02  -0.5645  -0.5645  Bachelors  13
1         2017-12-31  2018-01-02  0.8582   0.8582   Masters    14

Indeed:

  • constant was built constant: it only contains 1s,
  • num1 and num3 are equal,
  • date3 and date4 are always separated by 2 days: date4 doesn’t contain any new information for a ML algorithm,
  • education and education_num contain the same information, one as a numeric key and the other as the corresponding character label. whichAreBijection keeps the character column.
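The first two checks are easy to reproduce in base R on a toy data.frame; this is only a sketch of the idea (the package versions use exponential search and work by reference on large data.tables):

```r
# Toy data mirroring the situation above.
df <- data.frame(constant = rep(1, 4),
                 num1     = c(1, 2, 3, 4),
                 num3     = c(1, 2, 3, 4))

# Constant columns: a single unique value.
constant_cols <- names(df)[sapply(df, function(x) length(unique(x)) == 1)]

# Columns in double: exact duplicates of an earlier column.
double_cols <- names(df)[duplicated(as.list(df))]
```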

3.1.4 Filter them all

To directly filter all of them:

ncols = ncol(messy_adult)
messy_adult <- fastFilterVariables(messy_adult)
print(paste0("messy_adult now has ", ncol(messy_adult), " columns; so ", ncols - ncol(messy_adult), " less than before."))
## [1] "fastFilterVariables: I check for constant columns"
## [1] "whichAreConstant: constant is constant."
## [1] "whichAreConstant: it took me 0.2s to identify 1 constant column(s)"
## [1] "fastFilterVariables: I delete 1 constant column(s) in dataSet"
## [1] "fastFilterVariables: I check for columns in double"
## [1] "whichAreInDouble: num3 is exactly equal to num1. I put it in drop list."
## [1] "whichAreInDouble: it took me 0.18s to identify 1 double(s)"
## [1] "fastFilterVariables: I delete 1 column(s) that are in double in dataSet"
## [1] "fastFilterVariables: I check for columns that are bijections of another column"
## [1] "whichAreBijection: date4 is a bijection of date3. I put it in drop list."
## [1] "whichAreBijection: education_num is a bijection of education. I put it in drop list."
## [1] "whichAreBijection: it took me 0.53s to identify 2 bijection(s)"
## [1] "fastFilterVariables: I delete 2 column(s) that are bijections of another column in dataSet"
## [1] "messy_adult now has 20 columns; so 4 less than before."

4 useless columns have been deleted. Without those useless columns, your machine learning algorithm will at least run faster and might even give better results.

3.2 Rounding

One might want to round numeric variables in order to save some RAM, or for algorithmic reasons:

messy_adult <- fastRound(messy_adult, digits = 2)
date1       date2       date3       num1   num2   mail
2017-09-22  2017-06-17  2017-07-24  -0.45  -0.83  lucas.pierre@aol.com
2017-07-18  2017-07-15  2017-07-26  1.44   NA     caroline.marie@yahoo.com
2017-01-05  NA          2017-08-20  1.72   0.76   jake.lucas@protonmail.com
2017-11-28  2017-12-19  2017-05-14  1.95   NA     marie.pierre@aol.com
2017-08-02  2017-01-14  2017-10-31  -0.56  0.90   caroline.caroline@aol.com
2017-07-28  NA          2017-12-31  0.86   0.76   NA

4 Transform functions

Before sending this data set to a machine learning algorithm, a few more transformations should be performed.

The idea with the functions presented here is to perform those transformations in a RAM-efficient way.

4.1 Dates differences

Since no machine learning algorithm handles Dates, one needs to transform or drop them. One way to transform dates is to compute the differences between every pair of date columns.

We can also add an analysis date to compare each date with the date your data is from. For example, if you have a birth date, you may want to compute age as today - birth date.

Once this is done, we drop the date columns.

messy_adult <- diffDates(messy_adult, analysisDate = as.Date("2018-01-01"), units = "days")
date_cols <- names(messy_adult)[sapply(messy_adult, is.POSIXct)]
messy_adult[, c(date_cols) := NULL]
date1.Minus.date2  date1.Minus.date3  date2.Minus.date3  date1.Minus.analysisDate  date2.Minus.analysisDate  date3.Minus.analysisDate
97                 60                 -37                -100.95833                -197.95833                -160.9583333
3                  -8                 -11                -166.95833                -169.95833                -158.9583333
NA                 -227               NA                 -360.95833                NA                        -133.9583333
-21                198                219                -33.95833                 -12.95833                 -231.9583333
200                -90                -290               -151.95833                -351.95833                -61.9583333
NA                 -156               NA                 -156.95833                NA                        -0.9583333
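A single such difference is easy to sketch in base R; diffDates does this for every pair of date columns at once (the fractional days above come from the columns being parsed as timestamps, whereas plain Dates give whole days):

```r
# One pairwise difference and one difference against the analysis date.
d1 <- as.Date("2017-09-22")
d2 <- as.Date("2017-06-17")
analysis_date <- as.Date("2018-01-01")

as.numeric(difftime(d1, d2, units = "days"))             # 97, as in the first row above
as.numeric(difftime(d1, analysis_date, units = "days"))  # -101
```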

4.2 Aggregate according to a key

Say you want to model something by country: you would then compute an aggregation of this table in order to have one row per country.

agg_adult <- aggregateByKey(messy_adult, key = "country")
## [1] "aggregateByKey: I start to aggregate"
## [1] "aggregateByKey: 117 columns have been constructed. It took 0.88seconds. "
## [1] "117 columns have been built; for 42 countries."
country   occupation.Tech-support  occupation.Transport-moving  relationship.Husband  relationship.Not-in-family  relationship.Other-relative  relationship.Own-child  relationship.Unmarried
?         16                       25                           246                   149                         29                           63                      62
Cambodia  0                        0                            9                     4                           3                            1                       1
Canada    3                        8                            53                    37                          1                            11                      10
China     2                        0                            38                    17                          5                            6                       2
Columbia  3                        2                            16                    20                          3                            6                       13
Cuba      0                        8                            41                    18                          5                            12                      13

Whenever you have more than one row per individual, this function comes in pretty handy.
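A minimal base R sketch of the idea, on a toy table (aggregateByKey builds many such columns at once: counts, means, standard deviations, and per-level counts for factors):

```r
# One row per key, with a count and a per-column aggregate.
df <- data.frame(country = c("Cuba", "Cuba", "Canada"),
                 age     = c(40, 50, 30))

agg       <- aggregate(age ~ country, data = df, FUN = mean)  # mean age per country
nbr_lines <- table(df$country)                                # rows per country
```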

5 Handling NA values

Then, let’s handle NAs:

messy_adult <- fastHandleNa(messy_adult)
##     num1  num2                      mail age   ... hr_per_week
## 1: -0.45 -0.83      lucas.pierre@aol.com  39   ...          40
## 2:  1.44  0.00  caroline.marie@yahoo.com  50   ...          13
## 3:  1.72  0.76 jake.lucas@protonmail.com  38   ...          40
## 4:  1.95  0.00      marie.pierre@aol.com  53   ...          40
##          country income date1.Minus.date2 date1.Minus.date3
## 1: United-States  <=50K                97                60
## 2: United-States  <=50K                 3                -8
## 3: United-States  <=50K                 0              -227
## 4: United-States  <=50K               -21               198
##    date2.Minus.date3 date1.Minus.analysisDate date2.Minus.analysisDate
## 1:               -37               -100.95833               -197.95833
## 2:               -11               -166.95833               -169.95833
## 3:                 0               -360.95833                  0.00000
## 4:               219                -33.95833                -12.95833
##    date3.Minus.analysisDate
## 1:                -160.9583
## 2:                -158.9583
## 3:                -133.9583
## 4:                -231.9583

It sets default values in place of NAs. If you want to impute specific values (constants, or even a function, for example the mean of the column), check the fastHandleNa documentation.
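The default behaviour for numeric columns, sketched in base R (this is only an illustration of the idea; fastHandleNa handles every column type and works by reference):

```r
# Numeric NAs are replaced by a default value, here 0.
num2 <- c(-0.83, NA, 0.76, NA)
num2[is.na(num2)] <- 0
```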

6 Shape functions

There are two types of machine learning algorithms in R: those that accept data.tables and factors, and those that only accept numeric matrices.

Transforming a data set into something acceptable for a machine learning algorithm can be tricky.

The shapeSet function does it for you; you just have to choose whether you want a data.table or a numerical_matrix.

First with data.table:

clean_adult = shapeSet(copy(messy_adult), finalForm = "data.table", verbose = FALSE)
print(table(sapply(clean_adult, class)))
## 
##  factor integer numeric 
##       9       1      13

As one can see, only factors, integers and numerics are left.

Now with numerical_matrix:

clean_adult <- shapeSet(copy(messy_adult), finalForm = "numerical_matrix", verbose = FALSE)
num1   num2   mail  age  type_employer?  type_employerFederal-gov
-0.45  -0.83  1     39   0               0
1.44   0.00   1     50   0               0
1.72   0.76   1     38   0               0
1.95   0.00   1     53   0               0
-0.56  0.90   1     28   0               0
0.86   0.76   0     37   0               0

As one can see, with finalForm = "numerical_matrix" every character and factor column has been binarized.
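That binarization amounts to one-hot encoding, which base R's model.matrix also performs; a small sketch (the column names it produces match the type_employer... pattern shown above):

```r
# model.matrix expands a factor into one 0/1 indicator column per level;
# "- 1" drops the intercept so every level gets its own column.
df <- data.frame(type_employer = factor(c("State-gov", "Private", "Private")))
m  <- model.matrix(~ type_employer - 1, data = df)
```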

7 Full pipeline

It is possible to do all of this with one function.

To demonstrate, we will reload the ugly data set and perform the aggregation again:

data("messy_adult")
agg_adult <- prepareSet(messy_adult, finalForm = "data.table", key = "country", analysisDate = Sys.Date(), digits = 2)
## [1] "prepareSet: step one: correcting mistakes."
## [1] "fastFilterVariables: I check for constant columns"
## [1] "whichAreConstant: constant is constant."
## [1] "whichAreConstant: it took me 0.22s to identify 1 constant column(s)"
## [1] "fastFilterVariables: I delete 1 constant column(s) in dataSet"
## [1] "fastFilterVariables: I check for columns in double"
## [1] "whichAreInDouble: it took me 0.17s to identify 0 double(s)"
## [1] "fastFilterVariables: I check for columns that are bijections of another column"
## [1] "whichAreBijection: date3 is a bijection of date4. I put it in drop list."
## [1] "whichAreBijection: num1 is a bijection of num3. I put it in drop list."
## [1] "whichAreBijection: education_num is a bijection of education. I put it in drop list."
## [1] "whichAreBijection: it took me 0.43s to identify 3 bijection(s)"
## [1] "fastFilterVariables: I delete 3 column(s) that are bijections of another column in dataSet"
## [1] "findAndTransformNumerics: It took me 0.22s to identify 2 numerics column(s), i will set them as numerics"
## [1] "findAndTransformNumerics: It took me 0.05s to transform 2 column(s) to a numeric format"
## [1] "findAndTransformDates: It took me 0.2s to identify formats"
## [1] "findAndTransformDates: It took me 0.03s to transform 3 columns to a Date format"
## [1] "prepareSet: step two: transforming dataSet."
## [1] "aggregateByKey: I start to aggregate"
## [1] "aggregateByKey: 117 columns have been constructed. It took 0.83seconds. "
## [1] "prepareSet: step three: filtering dataSet."
## [1] "prepareSet: I check for constant columns"
## [1] "whichAreConstant: min.capital_gain is constant."
## [1] "whichAreConstant: it took me 0.28s to identify 1 constant column(s)"
## [1] "prepareSet: I delete 1 constant column(s) in result"
## [1] "prepareSet: I check for columns in double"
## [1] "whichAreInDouble: nbr.mail is exactly equal to nbrLines. I put it in drop list."
## [1] "whichAreInDouble: it took me 0.41s to identify 1 double(s)"
## [1] "prepareSet: I delete 1 column(s) that are in double in result"
## [1] "prepareSet: I check for columns that are bijections of another column"
## [1] "whichAreBijection: mean.age is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: sd.age is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: mean.fnlwgt is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: max.fnlwgt is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: sd.fnlwgt is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: mean.hr_per_week is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: sd.hr_per_week is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: mean.date4.Minus.analysisDate is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: sd.date4.Minus.analysisDate is a bijection of country. I put it in drop list."
## [1] "whichAreBijection: min.num2 is a bijection of mean.num2. I put it in drop list."
## [1] "whichAreBijection: max.num2 is a bijection of mean.num2. I put it in drop list."
## [1] "whichAreBijection: min.num3 is a bijection of mean.num3. I put it in drop list."
## [1] "whichAreBijection: max.num3 is a bijection of mean.num3. I put it in drop list."
## [1] "whichAreBijection: occupation.? is a bijection of type_employer.?. I put it in drop list."
## [1] "whichAreBijection: marital.Married-AF-spouse is a bijection of type_employer.Never-worked. I put it in drop list."
## [1] "whichAreBijection: occupation.Armed-Forces is a bijection of type_employer.Never-worked. I put it in drop list."
## [1] "whichAreBijection: sd.capital_gain is a bijection of mean.capital_gain. I put it in drop list."
## [1] "whichAreBijection: min.date1.Minus.date2 is a bijection of mean.date1.Minus.date2. I put it in drop list."
## [1] "whichAreBijection: max.date1.Minus.date2 is a bijection of mean.date1.Minus.date2. I put it in drop list."
## [1] "whichAreBijection: min.date1.Minus.date4 is a bijection of mean.date1.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: max.date1.Minus.date4 is a bijection of mean.date1.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: mean.date1.Minus.analysisDate is a bijection of mean.date1.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: min.date1.Minus.analysisDate is a bijection of mean.date1.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: max.date1.Minus.analysisDate is a bijection of mean.date1.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: sd.date1.Minus.analysisDate is a bijection of sd.date1.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: min.date2.Minus.date4 is a bijection of mean.date2.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: max.date2.Minus.date4 is a bijection of mean.date2.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: mean.date2.Minus.analysisDate is a bijection of mean.date2.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: min.date2.Minus.analysisDate is a bijection of mean.date2.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: max.date2.Minus.analysisDate is a bijection of mean.date2.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: sd.date2.Minus.analysisDate is a bijection of sd.date2.Minus.date4. I put it in drop list."
## [1] "whichAreBijection: it took me 7.02s to identify 31 bijection(s)"
## [1] "prepareSet: I delete 31 column(s) that are bijections of another column in result"
## [1] "prepareSet: step four: handling NA."
## [1] "prepareSet: step five: shaping result."
## [1] "Transforming characters into factors."
## [1] "Transforming numerical variables into factors when length(unique(col)) <=10."
## Going through 83 numerical variables to transform if necessary.
## [1] "Transforming logicals into binaries.\n"
## [1] "Previous distribution of column types:"
## col_class_init
##  factor numeric 
##       1      83 
## [1] "Current distribution of column types:"
## col_class_end
##  factor numeric 
##      32      52 
## [1] "Quantiles for the number of factor modalities:"
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##  2.0  2.1  3.0  4.0  5.0  7.0  8.0  9.0  9.8 10.0 42.0

As one can see, all the previous steps have been performed.

Let’s have a look at the result:

## [1] "84 columns have been built; for 42 countries."
country   nbrLines  mean.num2  sd.num2  mean.num3  sd.num3  min.age
?         583       0          0        0          0        17
Cambodia  19        0          0        0          0        18
Canada    121       0          0        0          0        17
China     75        0          0        0          0        22
Columbia  59        0          0        0          0        18
Cuba      95        0          0        0          0        21

8 Conclusion

We hope that this package is helpful and that it helps you prepare your data faster.

If you would like to see some features added to this package, or if you notice any issues, please tell us on GitHub. Also, if you want to contribute, please don’t hesitate to contact us.