The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code! We will use fewer than 10 lines of code and just 6 function names to explore the penguins data:
function | package | description |
---|---|---|
library() | {base} | load a package |
filter() | {dplyr} | subset rows using column values |
describe() | {explore} | describe variables of the table |
explore() | {explore} | explore a variable graphically |
explore_all() | {explore} | explore all variables of the table |
explain_tree() | {explore} | explain a target using a decision tree |
The penguins dataset comes with the palmerpenguins package. It has 344 observations and 8 variables. (https://github.com/allisonhorst/palmerpenguins)
So we have to load the palmerpenguins package. Furthermore, we use the packages {dplyr} for filter() and %>% and {explore} for data exploration.
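The setup chunk is not shown in this text. A minimal sketch of it, assuming all three packages are already installed, could look like this:

```r
# Load the packages named in the text (assumes they are installed).
library(dplyr)           # filter() and %>%
library(explore)         # describe(), explore(), explore_all(), explain_tree()
library(palmerpenguins)  # provides the penguins dataset
```

After this, `penguins` is available as a data frame without any further import step.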
penguins %>% describe()
#> # A tibble: 8 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 species fct 0 0 3 NA NA NA
#> 2 island fct 0 0 3 NA NA NA
#> 3 bill_length_mm dbl 2 0.6 165 32.1 43.9 59.6
#> 4 bill_depth_mm dbl 2 0.6 81 13.1 17.2 21.5
#> 5 flipper_length_mm int 2 0.6 56 172 201. 231
#> 6 body_mass_g int 2 0.6 95 2700 4202. 6300
#> 7 sex fct 11 3.2 3 NA NA NA
#> 8 year int 0 0 3 2007 2008. 2009
There are some NA values (unknown values) in the data. The variable containing the most NAs is sex (11 observations). flipper_length_mm and the other measurement variables contain only 2 observations with NAs each.
We use only penguins with known flipper length for the data exploration!
This reduces the data from 344 to 342 penguins.
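The filtering step itself is not shown here. One way to sketch it, staying within the six functions listed above, is a comparison that is `NA` for missing values, because `filter()` drops rows where the condition is `NA`. The object name `data` is an assumption chosen for illustration:

```r
library(dplyr)
library(explore)
library(palmerpenguins)

# Keep only penguins with a known flipper length.
# For a missing value the comparison yields NA, and filter() drops those rows.
# `data` is a hypothetical name for the filtered table.
data <- penguins %>% filter(flipper_length_mm > 0)

# Check the remaining variables and their NA counts.
data %>% describe()
```

An explicit `!is.na(flipper_length_mm)` condition would give the same rows, at the cost of one more function name.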
What is the relationship between all the variables and species?
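A sketch of how this question could be asked with `explore_all()`, passing species as the target so that each variable is plotted against it. The filtered table `data` is a hypothetical name and is rebuilt here so the snippet runs on its own:

```r
library(dplyr)
library(explore)
library(palmerpenguins)

# Hypothetical name for the table of penguins with known flipper length.
data <- penguins %>% filter(flipper_length_mm > 0)

# One plot per variable, each broken down by the target species.
data %>% explore_all(target = species)
```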
We already see some strong patterns in the data. flipper_length_mm separates species Gentoo, and bill_length_mm separates species Adelie from Chinstrap. We also see that Chinstrap and Gentoo are located on separate islands.
Now we explain species using a decision tree:
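The tree can be sketched with `explain_tree()`, again using species as the target. The table name `data` is an assumption, and the function's default settings are used:

```r
library(dplyr)
library(explore)
library(palmerpenguins)

# Hypothetical name for the table of penguins with known flipper length.
data <- penguins %>% filter(flipper_length_mm > 0)

# Fit and plot a decision tree that explains species from the other variables.
data %>% explain_tree(target = species)
```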
We found a simple way to identify the species using just flipper_length_mm and bill_length_mm.
Now let’s take a closer look at these variables:
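A sketch of the closer look, assuming `explore()` is called with both variables and species as the target, which plots one variable against the other with the species distinguished. The table name `data` is hypothetical:

```r
library(dplyr)
library(explore)
library(palmerpenguins)

# Hypothetical name for the table of penguins with known flipper length.
data <- penguins %>% filter(flipper_length_mm > 0)

# Plot flipper length against bill length, split by species.
data %>% explore(flipper_length_mm, bill_length_mm, target = species)
```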
The plot shows a good, though not perfect, separation between the 3 species!