Furniture

2016-11-15

Using furniture

We will first make a fictitious data set:

df <- data.frame(a = rnorm(100, 1.5, 2), 
                 b = seq(1, 100, 1), 
                 c = c(rep("control", 40), rep("Other", 7), rep("treatment", 50), rep("None", 3)),
                 d = c(sample(1:1000, 90, replace=TRUE), rep(-99, 10)))
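
Because rnorm() and sample() draw random numbers, the data set differs on every run. The sketch below shows one way to make it repeatable. The wrapper name make_df and the seed value 42 are assumptions made for illustration, and this seed will not reproduce the numbers printed in the tables later in this document.

```r
# Hypothetical wrapper: builds the same kind of data set, but repeatably.
# The seed (42) is arbitrary and is not the one used for the output shown here.
make_df <- function(seed = 42) {
  set.seed(seed)
  data.frame(a = rnorm(100, 1.5, 2),
             b = seq(1, 100, 1),
             c = c(rep("control", 40), rep("Other", 7),
                   rep("treatment", 50), rep("None", 3)),
             d = c(sample(1:1000, 90, replace = TRUE), rep(-99, 10)))
}

df1 <- make_df()
df2 <- make_df()
identical(df1, df2)  # same seed, same data
```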

There are three functions that we’ll demonstrate here:

  1. washer
  2. table1
  3. simple_table1

Washer

washer is a handy function for quick data cleaning. It helps when the data contain placeholder values, a factor has extra levels, or several values need to be changed to a single value.

library(furniture)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
df <- df %>%
  mutate(d = washer(d, -99),  ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control")) ## changes "Other" and "None" to "control"
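
To make the behavior concrete, here is a minimal base R sketch of the kind of replacement washer performs. The helper name wash is hypothetical and this is not the package's source; it only assumes what the calls above show, namely that the listed values are replaced by value, which defaults to NA.

```r
# Hypothetical stand-in for washer(), written for illustration only.
# Every element of x that matches one of the values in ... becomes `value`.
wash <- function(x, ..., value = NA) {
  targets <- c(...)
  x[x %in% targets] <- value
  x
}

d <- c(10, -99, 250, -99)
wash(d, -99)                                    # placeholders become NA

grp <- c("control", "Other", "treatment", "None")
wash(grp, "Other", "None", value = "control")   # extra groups folded into "control"
```

The sketch works on plain numeric and character vectors; handling of factors is left out on purpose.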

Table1

Now that the data is “washed” we can start exploring and reporting.

table1(df, a, b, factor(c), d)
## 
## |=====================================
##               Mean/Count (SD/%)
##  Observations 100              
##  a                             
##               1.69 (2.00)      
##  b                             
##               50.50 (29.01)    
##  factor(c)                     
##     control   50 (50.00%)      
##     treatment 50 (50.00%)      
##  d                             
##               429.76 (295.97)  
## |=====================================

Every variable must be numeric or a factor. Because table1 uses non-standard evaluation, you can transform a variable inside the call (e.g., factor(c)). You can even create an entirely new variable in the call:

table1(df, a, b, d, ifelse(a > 1, 1, 0))
## 
## |============================================
##                      Mean/Count (SD/%)
##  Observations        100              
##  a                                    
##                      1.69 (2.00)      
##  b                                    
##                      50.50 (29.01)    
##  d                                    
##                      429.76 (295.97)  
##  ifelse(a > 1, 1, 0)                  
##                      0.61 (0.49)      
## |============================================
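
The non-standard evaluation mentioned above can be sketched in base R. The function describe below is hypothetical and is not how table1 is implemented; it only illustrates the general technique of capturing the expressions unevaluated and then evaluating them with the data frame's columns in scope.

```r
# Hypothetical function illustrating non-standard evaluation (NSE).
describe <- function(data, ...) {
  caller <- parent.frame()
  # Capture the arguments as unevaluated expressions, dropping the `list` symbol
  exprs <- as.list(substitute(list(...)))[-1]
  # Evaluate each expression with the columns of `data` visible by name
  vals <- lapply(exprs, eval, envir = data, enclos = caller)
  # Label each result with the text of the expression that produced it
  names(vals) <- vapply(exprs,
                        function(e) paste(deparse(e), collapse = ""),
                        character(1))
  vapply(vals, function(v) mean(as.numeric(v)), numeric(1))
}

toy <- data.frame(a = c(0.5, 2, 3, -1))
res <- describe(toy, a, ifelse(a > 1, 1, 0))
res
```

Labeling each result with the deparsed expression is also why a row such as ifelse(a > 1, 1, 0) can appear as a variable name in a table built this way.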

This is just the beginning, though. Two more powerful features, stratifying by a grouping variable and testing for group differences, are shown below:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE)
## 
## |================================================================
##                      control         treatment       P-Value
##  Observations        50              50                     
##  a                                                   0.296  
##                      1.48 (2.07)     1.90 (1.93)            
##  b                                                   <.001  
##                      28.50 (22.37)   72.50 (14.58)          
##  d                                                   0.358  
##                      402.28 (307.53) 459.79 (283.34)        
##  ifelse(a > 1, 1, 0)                                 0.84   
##                      0.60 (0.49)     0.62 (0.49)            
## |================================================================

The argument splitby = ~factor(c) stratifies the means and counts by a factor variable (here, control or treatment). Once a splitby variable is supplied, test = TRUE adds tests of significance automatically.
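
For intuition, the stratified summaries and a two-group test can be approximated by hand in base R. This is only a sketch on made-up data: the text does not say which test table1 runs, so the Welch t-test used here is an assumption, and the seed and the toy data frame are arbitrary.

```r
set.seed(1)  # arbitrary seed, for repeatability of this sketch only
toy <- data.frame(
  x = c(rnorm(20, 0), rnorm(20, 1)),
  g = factor(rep(c("control", "treatment"), each = 20))
)

# Mean and SD of x within each level of g, formatted as "mean (sd)"
means <- tapply(toy$x, toy$g, mean)
sds   <- tapply(toy$x, toy$g, sd)
cell  <- sprintf("%.2f (%.2f)", means, sds)
names(cell) <- names(means)
cell

# One plausible significance test for two groups (assumed, not confirmed)
p <- t.test(x ~ g, data = toy)$p.value
p
```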

Finally, you can polish it quite a bit using a few other options. For example, you can do the following:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"))
## 
## |=========================================================
##               Control         Treatment       P-Value
##  Observations 50              50                     
##  A                                            0.296  
##               1.48 (2.07)     1.90 (1.93)            
##  B                                            <.001  
##               28.50 (22.37)   72.50 (14.58)          
##  D                                            0.358  
##               402.28 (307.53) 459.79 (283.34)        
##  New Var                                      0.84   
##               0.60 (0.49)     0.62 (0.49)            
## |=========================================================

You can also format the numbers, adding a comma to large values (e.g., 20,000 instead of 20000). No value in this example reaches 1,000, so the output below is unchanged:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"),
       format_number = TRUE)
## 
## |=========================================================
##               Control         Treatment       P-Value
##  Observations 50              50                     
##  A                                            0.296  
##               1.48 (2.07)     1.90 (1.93)            
##  B                                            <.001  
##               28.50 (22.37)   72.50 (14.58)          
##  D                                            0.358  
##               402.28 (307.53) 459.79 (283.34)        
##  New Var                                      0.84   
##               0.60 (0.49)     0.62 (0.49)            
## |=========================================================
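
The effect of comma formatting is easier to see on larger values. The sketch below uses base R's formatC() with big.mark; the helper name fmt is hypothetical, and whether format_number uses this function internally is not stated here.

```r
# Hypothetical helper: two decimals, with a comma as the thousands separator
fmt <- function(x) formatC(x, format = "f", digits = 2, big.mark = ",")

fmt(20000)   # "20,000.00"
fmt(429.76)  # "429.76", below 1,000 so no comma is added
```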

The table can also be output as a LaTeX, markdown, or pandoc table (matching all the output types of knitr::kable). The following produces a LaTeX table:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"),
       output_type = "latex")

Simple Table 1

The last function to show is simple_table1(). Instead of reporting counts and percentages for categorical variables, it reports only percentages. I have more plans for this function, but for now that is its main benefit over table1().

simple_table1(df, a, b, d, ifelse(a > 1, 1, 0),
              splitby=~factor(c), 
              test=TRUE,
              var_names = c("A", "B", "D", "New Var"),
              splitby_labels = c("Control", "Treatment"))
## 
## |=========================================================
##               Control         Treatment       P-Value
##  Observations 50              50                     
##  A                                            0.296  
##               1.48 (2.07)     1.90 (1.93)            
##  B                                            <.001  
##               28.50 (22.37)   72.50 (14.58)          
##  D                                            0.358  
##               402.28 (307.53) 459.79 (283.34)        
##  New Var                                      0.84   
##               0.60 (0.49)     0.62 (0.49)            
## |=========================================================

Conclusion

The three functions, table1, simple_table1, and washer, add simplicity to cleaning up and understanding your data. Use these pieces of furniture to make your quantitative life a bit easier.