Package presentation
Built on the data.table package, dataPreparation lets you do most of the painful data preparation for a data science project with a minimal amount of code.
This package is:
- fast (uses data.table and exponential search)
- RAM efficient (performs operations by reference and column-wise to avoid copying data)
- stable (most exceptions are handled)
- verbose (logs a lot)
data.table and other dependencies are handled at installation.
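To illustrate what "by reference" means in practice, here is a minimal sketch that uses data.table directly. The helper names `add_double_df` and `add_double_dt` are made up for this example and are not part of dataPreparation.

```r
library(data.table)

# A data.frame is copied when it is modified inside a function ...
add_double_df <- function(d) {
  d$y <- d$x * 2
  invisible(NULL)
}

# ... whereas `:=` updates a data.table in place, so the caller sees the change.
add_double_dt <- function(d) {
  d[, y := x * 2]
  invisible(NULL)
}

df <- data.frame(x = c(1, 2, 3))
dt <- data.table(x = c(1, 2, 3))

add_double_df(df)
add_double_dt(dt)

names(df)  # still only "x": the function worked on a copy
names(dt)  # "x" and "y": the column was added by reference
```

Because no copy is made, the memory cost of such an update is limited to the new column rather than the whole table.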
Main preparation steps
Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following:
- Read: load the data set (this package doesn't cover this step; for CSV files we recommend data.table::fread)
- Correct: most of the time there are some mistakes after reading (wrong formats, for instance) that have to be corrected
- Transform: aggregate according to a key, compute differences between dates, etc. in order to get information usable by an ML algorithm (i.e. numeric or categorical)
- Filter: get rid of useless information in order to speed up computation
- Handle NA: replace missing values
- Shape: put your data set in a nice shape usable by a ML algorithm
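As a rough illustration of what these steps involve when done by hand, the sketch below walks through Correct, Filter, Handle NA and Shape on a tiny made-up table, using only data.table. The table, its column names and the choice of 0 as replacement value are assumptions for this example; this is not how the package implements these steps.

```r
library(data.table)

# Hypothetical toy data with typical defects
dt <- data.table(
  id     = 1:4,
  amount = c("1,5", NA, "2,25", "0,75"),  # numbers stored as text with a decimal comma
  flag   = "a"                             # constant column: carries no information
)

# Correct: turn the decimal comma into a point, then convert to numeric
dt[, amount := as.numeric(sub(",", ".", amount, fixed = TRUE))]

# Filter: drop columns that hold a single distinct value
is_constant <- vapply(dt, function(col) uniqueN(col) == 1L, logical(1))
constant_cols <- names(dt)[is_constant]
if (length(constant_cols) > 0) dt[, (constant_cols) := NULL]

# Handle NA: replace missing numeric values by 0 (one possible convention)
set(dt, i = which(is.na(dt$amount)), j = "amount", value = 0)

# Shape: a numeric matrix, a common input format for ML algorithms
m <- as.matrix(dt)
```

Even on four rows, each step needs its own piece of code, which is the repetitive work the package aims to reduce.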
Here are the functions available in this package to tackle those issues:
| Correct | Transform | Filter | Handle NA | Shape |
|---|---|---|---|---|
| findAndTransformDates | diffDates | fastFilterVariables | fastHandleNa | shapeSet |
| findAndTransformNumerics | aggregateByKey | whichAreConstant | | setAsNumericMatrix |
| setColAsCharacter | setColAsFactorOrLogical | whichAreInDouble | | |
| setColAsNumeric | | whichAreBijection | | |
| setColAsDate | | fastRound | | |
All of those functions are integrated in the full pipeline function prepareSet.
In this tutorial we will detail all those steps and how to handle them with this package, using an example data set.
Tutorial data
For this tutorial, we are going to use a messy version of the adult database.
library(dataPreparation)
data(messy_adult)
print(head(messy_adult, n = 4))
date1 date2 date3 date4 num1 num2
1: 2017-22-09 06/17/2017 24-Jul-2017 26-July-2017 -0.4543 -0,826
2: 2017-18-07 07/15/2017 26-Jul-2017 28-July-2017 1.4432 NA
3: 2017-05-01 NA 20-Aug-2017 22-August-2017 1.7226 0,7611
4: 2017-28-11 12/19/2017 14-May-2017 16-May-2017 1.9542 NA
constant mail num3 age type_employer fnlwgt
1: 1 lucas.pierre@aol.com -0,4543 39 State-gov 77516
2: 1 caroline.marie@yahoo.com 1,4432 50 Self-emp-not-inc 83311
3: 1 jake.lucas@protonmail.com 1,7226 38 Private 215646
4: 1 marie.pierre@aol.com 1,9542 53 Private 234721
education education_num marital occupation
1: Bachelors 13 Never-married Adm-clerical
2: Bachelors 13 Married-civ-spouse Exec-managerial
3: HS-grad 9 Divorced Handlers-cleaners
4: 11th 7 Married-civ-spouse Handlers-cleaners
relationship race sex capital_gain capital_loss hr_per_week
1: Not-in-family White Male 2174 0 40
2: Husband White Male 0 0 13
3: Not-in-family White Male 0 0 40
4: Husband Black Male 0 0 40
country income
1: United-States <=50K
2: United-States <=50K
3: United-States <=50K
4: United-States <=50K
We added 9 really ugly columns to the data set:
- 4 dates with various formats and NAs
- 1 constant column
- 3 numeric columns with different decimal separators
- 1 email address
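To give an idea of why mixed date formats are painful, here is a minimal base R sketch that guesses the format of a column by trying a few candidates. The function `guess_date_format` and the list of candidate formats are assumptions made for this illustration. The values are taken from the first rows of date1 and date2 printed above. This is not the package's own detection logic.

```r
# Return the first candidate format that parses every non-missing value,
# or NA if none does.
guess_date_format <- function(x, formats) {
  values <- x[!is.na(x)]
  for (fmt in formats) {
    if (!anyNA(as.Date(values, format = fmt))) {
      return(fmt)
    }
  }
  NA_character_
}

candidates <- c("%Y-%m-%d", "%Y-%d-%m", "%m/%d/%Y")

date1 <- c("2017-22-09", "2017-18-07", "2017-05-01", "2017-28-11")
date2 <- c("06/17/2017", "07/15/2017", NA, "12/19/2017")

fmt1 <- guess_date_format(date1, candidates)
fmt2 <- guess_date_format(date2, candidates)

# Parse each column with the format that was found for it
as.Date(date1, format = fmt1)
as.Date(date2, format = fmt2)
```

A value such as 2017-05-01 is ambiguous on its own; in this sketch the format is settled by the other values of the column, which only parse as year-day-month.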
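The same information can also be contained in two different columns: in the rows printed above, num1 and num3 hold the same values with different decimal separators.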
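A small base R sketch makes this duplication visible. It uses only the four values of num1 and num3 shown in the printed sample; the variable name `num3_numeric` is introduced for this example, and nothing is claimed here about the remaining rows of the data set.

```r
# First rows of two columns from the printed sample
num1 <- c(-0.4543, 1.4432, 1.7226, 1.9542)
num3 <- c("-0,4543", "1,4432", "1,7226", "1,9542")

# Replace the decimal comma by a point before converting to numeric
num3_numeric <- as.numeric(sub(",", ".", num3, fixed = TRUE))

# After correction, the two columns carry the same values
isTRUE(all.equal(num1, num3_numeric))
```

Such duplicates can only be detected once the formats have been corrected, which is why the Correct step comes before the Filter step.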