furniture
We will first make a fictitious data set:
df <- data.frame(a = rnorm(100, 1.5, 2),
                 b = seq(1, 100, 1),
                 c = c(rep("control", 40), rep("Other", 7), rep("treatment", 50), rep("None", 3)),
                 d = c(sample(1:1000, 90, replace=TRUE), rep(-99, 10)))
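Because rnorm() and sample() draw random numbers, the values in a and d change on every run. A minimal sketch of a reproducible version follows. The seed value and the name df_seeded are assumptions introduced here for illustration; with a seed set, the summary statistics would differ from the ones printed in this document.

```r
# Arbitrary seed (an assumption, not part of the original example)
set.seed(42)

# Same structure as df, stored under a hypothetical name to avoid overwriting it
df_seeded <- data.frame(a = rnorm(100, 1.5, 2),
                        b = seq(1, 100, 1),
                        c = c(rep("control", 40), rep("Other", 7),
                              rep("treatment", 50), rep("None", 3)),
                        d = c(sample(1:1000, 90, replace = TRUE), rep(-99, 10)))

# Quick structural checks before any cleaning
dim(df_seeded)            # rows and columns
table(df_seeded$c)        # counts per level, including "Other" and "None"
sum(df_seeded$d == -99)   # number of placeholder values
```

Checking the level counts and the number of placeholders up front makes it easier to confirm later that the cleaning step did what was intended.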
There are two functions that we’ll demonstrate here:
washer
table1
washer is a great function for quick data cleaning. It is useful when the data contain placeholders, extra levels in a factor, or several values that need to be changed to another.
library(furniture)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
df <- df %>%
  mutate(d = washer(d, -99), ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control")) ## changes "Other" and "None" to "control"
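To make the effect of the two washer calls concrete, here is a rough base R sketch of the same replacements on small toy vectors. It only mirrors the behavior described in the comments above (placeholder to NA, several values to one value); it is not the package's implementation, and the vector names are made up for illustration.

```r
# Toy vectors mirroring columns d and c (hypothetical values)
d_raw <- c(512, -99, 37, -99)
c_raw <- c("control", "Other", "treatment", "None")

# Replace the placeholder -99 with NA
d_clean <- replace(d_raw, d_raw %in% -99, NA)

# Collapse "Other" and "None" into "control"
c_clean <- replace(c_raw, c_raw %in% c("Other", "None"), "control")

d_clean
c_clean
```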
Now that the data is “washed” we can start exploring and reporting.
table1(df, a, b, factor(c), d)
##
## |=====================================
##                 Mean/Count (SD/%)
##  Observations   100
##  a
##                 1.69 (2.00)
##  b
##                 50.50 (29.01)
##  factor(c)
##     control     50 (50.00%)
##     treatment   50 (50.00%)
##  d
##                 429.76 (295.97)
## |=====================================
The variables must be numeric or factor. Because table1 uses non-standard evaluation, we can transform variables inside the call (e.g., factor(c)). This extends to creating a whole new variable within the call as well.
table1(df, a, b, d, ifelse(a > 1, 1, 0))
##
## |============================================
##                        Mean/Count (SD/%)
##  Observations          100
##  a
##                        1.69 (2.00)
##  b
##                        50.50 (29.01)
##  d
##                        429.76 (295.97)
##  ifelse(a > 1, 1, 0)
##                        0.61 (0.49)
## |============================================
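The idea behind passing an expression such as ifelse(a > 1, 1, 0) directly to the function can be sketched in base R with substitute() and eval(). The helper summarise_expr below is hypothetical and is not how table1 is necessarily written; it only illustrates how a function can capture an unevaluated expression, use it as a label, and evaluate it against the columns of a data frame.

```r
# Hypothetical helper, not part of furniture
summarise_expr <- function(data, expr) {
  captured <- substitute(expr)                    # keep the expression unevaluated
  values <- eval(captured, data, parent.frame())  # evaluate it using the data's columns
  list(label = deparse(captured),                 # the expression text becomes the row label
       mean = mean(values, na.rm = TRUE),
       sd = sd(values, na.rm = TRUE))
}

toy <- data.frame(a = c(0.5, 1.5, 2.5, -1))
res <- summarise_expr(toy, ifelse(a > 1, 1, 0))
res$label  # "ifelse(a > 1, 1, 0)"
res$mean   # 0.5
```

This is also why the new variable appears in the table under the text of the expression itself unless a nicer label is supplied.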
This is just the beginning though. Two powerful things the function can do are shown below:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c),
       test=TRUE)
##
## |================================================================
##                       control          treatment        P-Value
##  Observations         50               50
##  a                                                      0.296
##                       1.48 (2.07)      1.90 (1.93)
##  b                                                      <.001
##                       28.50 (22.37)    72.50 (14.58)
##  d                                                      0.358
##                       402.28 (307.53)  459.79 (283.34)
##  ifelse(a > 1, 1, 0)                                    0.84
##                       0.60 (0.49)      0.62 (0.49)
## |================================================================
The splitby = ~factor(c) argument stratifies the means and counts by a factor variable (in this case, either control or treatment). When we use it, we can also automatically compute tests of significance with test=TRUE.
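The stratified numbers for b can be reproduced by hand, since b and the washed grouping variable are deterministic. The sketch below uses base R only. The use of t.test() is an assumption made here to illustrate a two-group comparison; the document does not state which test table1 applies.

```r
# Rebuild b and the washed grouping variable exactly as in the example data
b <- seq(1, 100, 1)
grp <- c(rep("control", 40), rep("Other", 7), rep("treatment", 50), rep("None", 3))
grp[grp %in% c("Other", "None")] <- "control"
grp <- factor(grp)

# Group-wise mean and standard deviation, as reported in the split table
group_means <- tapply(b, grp, mean)
group_sds <- tapply(b, grp, sd)
round(group_means, 2)  # control 28.50, treatment 72.50
round(group_sds, 2)    # control 22.37, treatment 14.58

# Assumption: a two-sample t-test stands in for the significance test
p_val <- t.test(b ~ grp)$p.value
p_val < 0.001          # consistent with the "<.001" shown for b
```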
Finally, you can polish it quite a bit using a few other options. For example, you can do the following:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c),
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"))
##
## |=========================================================
##                 Control          Treatment        P-Value
##  Observations   50               50
##  A                                                0.296
##                 1.48 (2.07)      1.90 (1.93)
##  B                                                <.001
##                 28.50 (22.37)    72.50 (14.58)
##  D                                                0.358
##                 402.28 (307.53)  459.79 (283.34)
##  New Var                                          0.84
##                 0.60 (0.49)      0.62 (0.49)
## |=========================================================
You can also format the numbers, adding a comma to large values (for example, 20,000 instead of 20000). No value in this example reaches 1,000, so the output below is unchanged:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c),
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"),
       format_number = TRUE)
##
## |=========================================================
##                 Control          Treatment        P-Value
##  Observations   50               50
##  A                                                0.296
##                 1.48 (2.07)      1.90 (1.93)
##  B                                                <.001
##                 28.50 (22.37)    72.50 (14.58)
##  D                                                0.358
##                 402.28 (307.53)  459.79 (283.34)
##  New Var                                          0.84
##                 0.60 (0.49)      0.62 (0.49)
## |=========================================================
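Since none of the summaries here are large enough to show a thousands separator, the following base R sketch shows the kind of formatting meant by "20,000 instead of 20000". It uses formatC() as a stand-in and assumes nothing about how format_number is implemented inside the package.

```r
# Whole numbers: insert a comma every three digits
big_counts <- formatC(c(20000, 1234567), format = "d", big.mark = ",")
big_counts   # "20,000" "1,234,567"

# A value below 1,000 is left without a separator
small_value <- formatC(402.28, format = "f", digits = 2, big.mark = ",")
small_value  # "402.28"
```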
The table can also be output as a LaTeX, markdown, or pandoc table (matching all the output types of knitr::kable). The following shows how to produce a LaTeX table:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c),
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"),
       output_type = "latex")
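For a sense of what LaTeX output looks like, here is a small sketch that calls knitr::kable() directly on a hand-built data frame. The data frame summary_df is hypothetical and simply copies two rows from the tables above; it is not the object table1 creates internally. The code is guarded so that it does nothing if knitr is not installed.

```r
latex_tab <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
  # Hypothetical summary copied by hand from the printed table
  summary_df <- data.frame(Variable = c("A", "B"),
                           Control = c("1.48 (2.07)", "28.50 (22.37)"),
                           Treatment = c("1.90 (1.93)", "72.50 (14.58)"))

  # format can also be "markdown" or "pandoc"
  latex_tab <- knitr::kable(summary_df, format = "latex")
  print(latex_tab)
}
```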
The last item to show is simple_table1(). Instead of reporting counts and percentages for categorical variables, it reports only percentages. I have more plans for this function, but for now that is the main benefit over table1().
simple_table1(df, a, b, d, ifelse(a > 1, 1, 0),
              splitby=~factor(c),
              test=TRUE,
              var_names = c("A", "B", "D", "New Var"),
              splitby_labels = c("Control", "Treatment"))
##
## |=========================================================
##                 Control          Treatment        P-Value
##  Observations   50               50
##  A                                                0.296
##                 1.48 (2.07)      1.90 (1.93)
##  B                                                <.001
##                 28.50 (22.37)    72.50 (14.58)
##  D                                                0.358
##                 402.28 (307.53)  459.79 (283.34)
##  New Var                                          0.84
##                 0.60 (0.49)      0.62 (0.49)
## |=========================================================
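The call above contains no categorical variables, so the difference between the two functions does not show up in its output. A base R sketch of the distinction, using the washed grouping variable from the example data, might look like the following. The exact string layout is an assumption modeled on the "50 (50.00%)" style printed earlier.

```r
# Rebuild the washed grouping variable from the example data
grp <- c(rep("control", 40), rep("Other", 7), rep("treatment", 50), rep("None", 3))
grp[grp %in% c("Other", "None")] <- "control"

counts <- table(grp)
pct <- 100 * prop.table(counts)

# Count and percentage, in the style table1 prints for factors
count_and_pct <- sprintf("%d (%.2f%%)", as.integer(counts), as.numeric(pct))
count_and_pct  # "50 (50.00%)" "50 (50.00%)"

# Percentage only, the kind of summary described for simple_table1
pct_only <- sprintf("%.2f%%", as.numeric(pct))
pct_only       # "50.00%" "50.00%"
```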
The three functions, table1, simple_table1, and washer, add simplicity to cleaning up and understanding your data. Use these pieces of furniture to make your quantitative life a bit easier.