Introduction to summarytools

Dominic Comtois

2018-01-13

summarytools is an R package providing tools to neatly and quickly summarize data, with functions that most R programmers once wished were included in base R. It can also make R a little easier to use for newbies. With a few lines of simple code, you can get a good look at the data at hand.

Emphasis has been put on both what is presented and how it is presented, so that the package can serve as both a data exploration tool and a reporting tool. It can be used on its own for minimal reports, or integrated into a larger set of tools such as RStudio’s for rmarkdown and knitr.

Four Core Functions

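The package is built around four main functions: freq() for frequency tables, descr() for descriptive (univariate) statistics, ctable() for cross-tabulations, and dfSummary() for data frame summaries.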

Output Options

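All summarytools objects returned by the main functions can be displayed as plain text in the R console, rendered in markdown documents, written to text or HTML files, or displayed as HTML in RStudio’s Viewer or in a web browser.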

Text-based output relies on the pander package, while HTML output relies on RStudio’s htmltools.

Bare-Bones Example: Frequency Table

To show what default (console) outputs look like, we’ll first generate a frequency table for iris$Species.

freq(iris$Species)
Frequencies   
Species     
Data frame: iris   
Type: Factor (unordered)   

                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
---------------- ------ --------- -------------- --------- --------------
          setosa     50     33.33          33.33     33.33          33.33
      versicolor     50     33.33          66.67     33.33          66.67
       virginica     50     33.33         100.00     33.33         100.00
            <NA>      0                               0.00         100.00
           Total    150    100.00         100.00    100.00         100.00

To get familiar with the output styles, try different values for style= and see how results look in the console.
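As a minimal sketch of that experiment, the calls below print the same table in three styles. The style names "simple", "grid" and "rmarkdown" are assumed here to be the values freq() accepts; check ?freq for the list supported by your installed version.

```r
library(summarytools)

# Same frequency table, three table styles (style names are assumptions, see above)
freq(iris$Species, style = "simple")     # plain console table
freq(iris$Species, style = "grid")       # grid borders around every cell
freq(iris$Species, style = "rmarkdown")  # table markup meant for .Rmd documents
```

Comparing the three outputs side by side in the console makes it easier to pick a style before knitting a full document.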

Markdown Outputs

When using style='rmarkdown' with freq() or descr(), the generated outputs are ready for markdown rendering. With dfSummary(), options for style are “multiline” (default) and “grid”, and plain.ascii=FALSE must be used to have proper line feeds in multiline cells.

Note: In an .Rmd document with knitr, always set the chunk option results='asis':

```{r, results='asis'}
library(summarytools)  
freq(tobacco$smoker, style='rmarkdown')  
```

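Frequencies

smoker
Data frame: tobacco
Type: Factor (unordered)

|       | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|------:|-----:|--------:|-------------:|--------:|-------------:|
|   Yes |  298 |   29.80 |        29.80 |   29.80 |        29.80 |
|    No |  702 |   70.20 |       100.00 |   70.20 |       100.00 |
|  <NA> |    0 |         |              |    0.00 |       100.00 |
| Total | 1000 |  100.00 |       100.00 |  100.00 |       100.00 |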

Descriptive (Univariate) Statistics

The descr() function accepts both vectors and data frames, in which case it will show statistics for all numerical variables it contains. We’ll use one of the datasets included in the package.

data(exams)
descr(exams[ ,3:5], style='rmarkdown')

Descriptive Statistics

Data Frame: exams
N: 30

|             | french |  math | geography |
|------------:|-------:|------:|----------:|
|        Mean |  73.94 | 73.54 |     70.04 |
|     Std.Dev |  10.79 |  9.19 |     10.65 |
|         Min |  44.80 | 55.60 |     47.20 |
|      Median |  73.60 | 73.75 |     68.50 |
|         Max |  94.70 | 93.20 |     96.30 |
|         MAD |   7.56 |  9.93 |     12.31 |
|         IQR |   8.50 | 13.35 |     11.90 |
|          CV |   6.85 |  8.00 |      6.58 |
|    Skewness |   0.03 |  0.12 |      0.10 |
| SE.Skewness |   0.43 |  0.44 |      0.43 |
|    Kurtosis |   0.45 | -0.58 |     -0.03 |
|     N.Valid |  29.00 | 28.00 |     29.00 |
|   Pct.Valid |  96.67 | 93.33 |     96.67 |
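Since descr() is described above as accepting both vectors and data frames, here is a short sketch of the two call forms. It assumes the exams dataset has columns named french and math, as the table headers above suggest.

```r
library(summarytools)
data(exams)

# A single numeric vector: one column of statistics
descr(exams$math)

# A data frame: one column of statistics per numerical variable
descr(exams[, c("french", "math")])
```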

To see variables in rows and statistics in columns instead, use transpose=TRUE:

descr(exams, style = 'rmarkdown', transpose = TRUE)

Cross-Tabulations

Let a few examples speak for themselves. First, a bare-bones cross-tabulation.

with(tobacco, ctable(smoker, diseased, prop = 'n', totals = FALSE))
Cross-Tabulation 
smoker * diseased   
Data Frame: tobacco   
-------- ---------- ----- -----
           diseased   Yes    No
  smoker                       
     Yes              125   173
      No               99   603
-------- ---------- ----- -----

Then show proportions, by row.

with(tobacco, ctable(smoker, diseased, prop = 'r'))
Cross-Tabulation / Row proportions 
smoker * diseased   
Data Frame: tobacco   
-------- ---------- ------------- ------------- ---------------
           diseased           Yes            No           Total
  smoker                                                       
     Yes              125 (41.9%)   173 (58.1%)    298 (100.0%)
      No               99 (14.1%)   603 (85.9%)    702 (100.0%)
   Total              224 (22.4%)   776 (77.6%)   1000 (100.0%)
-------- ---------- ------------- ------------- ---------------
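The prop argument used above selects which proportions are shown. As a sketch, and assuming 'c' and 't' are accepted alongside the 'r' and 'n' values shown in this document, column and total proportions would be requested like this:

```r
library(summarytools)

# Column proportions: each cell as a share of its `diseased` column
with(tobacco, ctable(smoker, diseased, prop = "c"))

# Total proportions: each cell as a share of all observations
with(tobacco, ctable(smoker, diseased, prop = "t"))
```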

The type of table generated by ctable() is unfortunately not (yet) supported by rmarkdown. But we can turn to the render method to circumvent this:

crosstable <- with(tobacco, ctable(smoker, diseased))
print(crosstable, method='render', footnote = NA)

Cross-Tabulation / Row proportions

smoker * diseased

Data Frame: tobacco

-------- ---------- ------------- ------------- ---------------
           diseased           Yes            No           Total
  smoker                                                       
     Yes              125 (41.9%)   173 (58.1%)    298 (100.0%)
      No               99 (14.1%)   603 (85.9%)    702 (100.0%)
   Total              224 (22.4%)   776 (77.6%)   1000 (100.0%)
-------- ---------- ------------- ------------- ---------------

Data Frame Summaries

This is the most elaborate function of the package. It incorporates elements of freq() and descr(), but goes beyond with its graphs (not yet supported with rmarkdown) and other attributes.

dfSummary(tobacco, style='grid', plain.ascii = FALSE, graph.col = FALSE)

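Data Frame Summary

tobacco
N: 1000

+----+------------------------+--------------------------+--------------------+-------------+-------------+
| No | Variable               | Stats / Values           | Freqs (% of Valid) | Valid       | Missing     |
+====+========================+==========================+====================+=============+=============+
| 1  | gender [factor]        | 1. F                     | 489 (50.0%)        | 978 (97.8%) | 22 (2.2%)   |
|    |                        | 2. M                     | 489 (50.0%)        |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 2  | age [numeric]          | mean (sd) : 49.6 (18.29) | 63 distinct val.   | 975 (97.5%) | 25 (2.5%)   |
|    |                        | min < med < max :        |                    |             |             |
|    |                        | 18 < 50 < 80             |                    |             |             |
|    |                        | IQR (CV) : 32 (0.37)     |                    |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 3  | age.gr [factor]        | 1. 18-34                 | 258 (26.5%)        | 975 (97.5%) | 25 (2.5%)   |
|    |                        | 2. 35-50                 | 241 (24.7%)        |             |             |
|    |                        | 3. 51-70                 | 317 (32.5%)        |             |             |
|    |                        | 4. 71 +                  | 159 (16.3%)        |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 4  | BMI [numeric]          | mean (sd) : 25.73 (4.49) | 974 distinct val.  | 974 (97.4%) | 26 (2.6%)   |
|    |                        | min < med < max :        |                    |             |             |
|    |                        | 8.83 < 25.62 < 39.44     |                    |             |             |
|    |                        | IQR (CV) : 5.72 (0.17)   |                    |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 5  | smoker [factor]        | 1. Yes                   | 298 (29.8%)        | 1000 (100%) | 0 (0%)      |
|    |                        | 2. No                    | 702 (70.2%)        |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 6  | cigs.per.day [numeric] | mean (sd) : 6.78 (11.88) | 37 distinct val.   | 965 (96.5%) | 35 (3.5%)   |
|    |                        | min < med < max :        |                    |             |             |
|    |                        | 0 < 0 < 40               |                    |             |             |
|    |                        | IQR (CV) : 11 (1.75)     |                    |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 7  | diseased [factor]      | 1. Yes                   | 224 (22.4%)        | 1000 (100%) | 0 (0%)      |
|    |                        | 2. No                    | 776 (77.6%)        |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 8  | disease [character]    | 1. Hypertension          | 36 (16.2%)         | 222 (22.2%) | 778 (77.8%) |
|    |                        | 2. Cancer                | 34 (15.3%)         |             |             |
|    |                        | 3. Cholesterol           | 21 ( 9.5%)         |             |             |
|    |                        | 4. Heart                 | 20 ( 9.0%)         |             |             |
|    |                        | 5. Pulmonary             | 20 ( 9.0%)         |             |             |
|    |                        | 6. Musculoskeletal       | 19 ( 8.6%)         |             |             |
|    |                        | 7. Diabetes              | 14 ( 6.3%)         |             |             |
|    |                        | 8. Hearing               | 14 ( 6.3%)         |             |             |
|    |                        | 9. Digestive             | 12 ( 5.4%)         |             |             |
|    |                        | 10. Hypotension          | 11 ( 5.0%)         |             |             |
|    |                        | [ 3 others ]             | 21 ( 9.4%)         |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+
| 9  | samp.wgts [numeric]    | mean (sd) : 1 (0.08)     | 0.86!: 267 (26.7%) | 1000 (100%) | 0 (0%)      |
|    |                        | min < med < max :        | 1.04!: 249 (24.9%) |             |             |
|    |                        | 0.86 < 1.04 < 1.06       | 1.05!: 324 (32.4%) |             |             |
|    |                        | IQR (CV) : 0.19 (0.08)   | 1.06!: 160 (16.0%) |             |             |
|    |                        |                          | ! rounded          |             |             |
+----+------------------------+--------------------------+--------------------+-------------+-------------+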

For dfSummary(), the available styles are “multiline” (default) and “grid”. When the output is meant for markdown, plain.ascii=FALSE must also be specified; otherwise line feeds in multiline cells will not render properly.

Redirecting Output

Text/Markdown Documents

Using the file= parameter with the view() or print() functions, we can redirect output into text files. And setting append=TRUE will append results to an existing text file:

my_summary <- dfSummary(tobacco)  
print(my_summary, file = "tobacco.txt", style = "grid")  # Creates tobacco.txt
my_stats <- descr(tobacco)
print(my_stats, file="tobacco.txt", append = TRUE) # Appends results to tobacco.txt

As you may have noticed, the style argument was used when calling the print() function. We could also have used it when calling the dfSummary() and descr() functions, in which case the style would have been written in the objects’ properties. Using this argument with print() overrides the style that is stored in the object. It is one of several arguments that can be used that way. See the documentation for print() to know all the details.
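The override behaviour described above can be sketched as follows. The output path is a hypothetical temporary file, and "grid" and "rmarkdown" are assumed to be valid style values for descr().

```r
library(summarytools)

# The style is stored in the object when it is created
my_stats <- descr(iris[, 1:4], style = "grid")

out <- tempfile(fileext = ".txt")   # hypothetical output location

# Written with the stored "grid" style
print(my_stats, file = out)

# The style passed to print() overrides the stored one for this call only
print(my_stats, file = out, style = "rmarkdown", append = TRUE)

# Inspect both versions, one after the other
readLines(out)
```

Storing the style in the object is convenient when the same object is printed several times; overriding it in print() is handy for a one-off export.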

HTML Documents

summarytools uses Bootstrap’s stylesheets to generate standalone HTML documents that can be displayed in a Web Browser or in RStudio’s Viewer using the generic print() function:

print(dfSummary(tobacco), method = 'browser')  # Displays results in default Web Browser
print(dfSummary(tobacco), method = 'viewer')   # Displays results in RStudio's Viewer
view(dfSummary(tobacco))                       # Same as line above -- view() is a wrapper function

Using the file= argument with an .html extension will simply generate an HTML document (without opening it).

print(dfSummary(tobacco), file = '~/Documents/tobacco_summary.html')

Here is a picture of the output:

[Figure: dfSummary in HTML format]

As with simple text files, you can also append additional content to existing HTML reports.
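A minimal sketch of building up one HTML report in two steps is shown below. The report path is hypothetical, and the example assumes append=TRUE behaves for .html files as it does for text files.

```r
library(summarytools)

report <- file.path(tempdir(), "iris_report.html")   # hypothetical path

# Creates the HTML file without opening it
print(dfSummary(iris), file = report)

# Adds a frequency table to the same report
print(freq(iris$Species), file = report, append = TRUE)
```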

Using by(), with(), and lapply()

summarytools functions support the use of by(), with(), and lapply(), at least when used in moderation.

Since objects generated by those native functions have their own class (they are special lists containing summarytools objects), they are not sent to the package’s generic print method automatically. For best results, the following method is recommended: first, store the object generated by one of the native functions. Then, use view(), either with method='pander' to show results in the console, or without the method argument to see (HTML) results in the Viewer or Browser.

stats <- by(data = exams$geography, INDICES = exams$gender, FUN = descr, style = 'rmarkdown')
view(stats, method = 'pander')

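Descriptive Statistics

geography
Data Frame: exams
Group: gender = Girl
N: 15

|             | geography |
|------------:|----------:|
|        Mean |     67.27 |
|     Std.Dev |      8.26 |
|         Min |     50.40 |
|      Median |     67.30 |
|         Max |     78.90 |
|         MAD |      9.34 |
|         IQR |     10.20 |
|          CV |      8.14 |
|    Skewness |     -0.34 |
| SE.Skewness |      0.58 |
|    Kurtosis |     -0.90 |
|     N.Valid |     15.00 |
|   Pct.Valid |    100.00 |

Group: gender = Boy
N: 15

|             | geography |
|------------:|----------:|
|        Mean |     73.00 |
|     Std.Dev |     12.35 |
|         Min |     47.20 |
|      Median |     71.20 |
|         Max |     96.30 |
|         MAD |     11.34 |
|         IQR |     15.48 |
|          CV |      5.91 |
|    Skewness |     -0.13 |
| SE.Skewness |      0.60 |
|    Kurtosis |     -0.48 |
|     N.Valid |     14.00 |
|   Pct.Valid |     93.33 |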
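The same store-then-view pattern should carry over to lapply(). The sketch below assumes view() handles a plain list of summarytools objects the way it handles the by() result above; the choice of columns is only for illustration.

```r
library(summarytools)

# One frequency table per selected column; the result is a plain named list
freq_tables <- lapply(tobacco[c("gender", "smoker", "diseased")], freq)

# Send the whole list through view() rather than relying on auto-printing
view(freq_tables, method = "pander")
```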

Other Tricks To Try Out

There are many things you can do to build elaborate, fine-tuned reports. Let’s mention a few…

Getting Most Properties of an Object With what.is()

When developing, we often use a number of functions to obtain an object’s properties. what.is() proposes to lump together the results of such functions (class(), typeof(), attributes() and others).

what.is(iris)
$properties
      property      value
1        class data.frame
2       typeof       list
3         mode       list
4 storage.mode       list
5          dim    150 x 5
6       length          5
7    is.object       TRUE
8  object.type         S3
9  object.size 7088 Bytes

$attributes.lengths
    names row.names     class 
        5       150         1 

$extensive.is
[1] "is.data.frame" "is.list"       "is.object"     "is.recursive" 
[5] "is.unsorted"  
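For comparison, these are some of the individual base R calls whose results appear in the $properties table above. This is only a sketch of the equivalent manual steps, not a description of how what.is() is implemented.

```r
# Base R calls matching several rows of what.is(iris)$properties
class(iris)        # "data.frame"
typeof(iris)       # "list"
mode(iris)         # "list"
dim(iris)          # 150 5
length(iris)       # 5
is.object(iris)    # TRUE

# Length of each attribute, comparable to $attributes.lengths
sapply(attributes(iris), length)
```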

Limitations

Learn More and Stay Up-to-Date

Check the project’s page for more examples; from there you can also submit feature requests or report any problems you encounter.

To install the package in its development version, use

install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools', ref='dev-current')

Final Note

The source of this document is an .Rmd file; knitr’s chunk option results has been set to 'asis', to make sure formatting is not coming from knitr itself.