R offers many tools to analyse the univariate or bivariate distribution of series. These include table and prop.table in base R, and group_by/summarise and count in dplyr. However, these functions are somewhat frustrating, as some very common tasks, like:

  • adding a total,
  • computing relative frequencies or percentage instead of counts,
  • merging high values of integer series in a >=N category,
  • computing cumulative distributions,

are tedious. Moreover, to our knowledge, R offers weak support for numerical series for which the value is not known at the individual level, but only the class to which it belongs. descstat is intended to provide user-friendly tools to perform these kinds of operations. More specifically, three kinds of tables can be constructed with descstat:

  • freq_table for frequency tables, suitable for factors or integer numerical series,
  • bins_table for bins tables, suitable for numerical series, either provided as a numeric or as a factor containing numerical classes,
  • cont_table for contingency tables of two series.

These functions are written in the tidyverse style, which means that the pipe operator can be used and that the series can be selected without quotes.

library("dplyr")    # pipe operator, count, mutate, pull, ...
library("ggplot2")
library("descstat")

Frequency tables

Frequency tables are suitable to summarize the univariate distribution of a categorical or an integer series. In the tidyverse, this task can be performed using the dplyr::count function. For example, the rgp data set, which is an extract of the French census, contains the number of children in households:

rgp %>% count(children)
## # A tibble: 8 x 2
##   children     n
##      <int> <int>
## 1        0   317
## 2        1   272
## 3        2   260
## 4        3   112
## 5        4    31
## # … with 3 more rows

Computing a frequency table

The descstat::freq_table function performs the same task:

rgp %>% freq_table(children)
## # A tibble: 8 x 2
##   children     n
## *    <int> <int>
## 1        0   317
## 2        1   272
## 3        2   260
## 4        3   112
## 5        4    31
## # … with 3 more rows

but it has several further arguments which can improve the result:

  • cols is a character containing one or several of the following letters:

    • n for the count or absolute frequencies (the default),
    • f for the (relative) frequencies,
    • p for the percentage (i.e. \(f\times 100\)),
    • N, F and P return the cumulative values of n, f and p.
  • total is a boolean; if TRUE, a total is returned,
  • max, suitable for an integer series only, is an integer giving the maximum value presented in the frequency table, e.g. max = 3 creates a last line labelled >=3.

The following command uses all the possible letters.

rgp %>% freq_table(children, cols = "nfpNFP")
## # A tibble: 8 x 7
##   children     n     f     p     N     F     P
## *    <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1        0   317 0.317  31.7   317 0.317  31.7
## 2        1   272 0.272  27.2   589 0.589  58.9
## 3        2   260 0.26   26     849 0.849  84.9
## 4        3   112 0.112  11.2   961 0.961  96.1
## 5        4    31 0.031   3.1   992 0.992  99.2
## # … with 3 more rows

As there are few occurrences of families with more than 3 children, we set max = 3 and add a total by setting total to TRUE.

rgp %>% freq_table(children, max = 3, total = TRUE)
## # A tibble: 5 x 2
##   children     n
##   <fct>    <int>
## 1 0          317
## 2 1          272
## 3 2          260
## 4 >=3        151
## 5 Total     1000
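
The columns requested through cols are simple transformations of the counts. As a rough base R sketch (an illustration only, not the package's implementation), they can be rebuilt from the counts of the table above; the object names are ours:

```r
# Counts taken from the table above (children = 0, 1, 2, >=3)
n <- c("0" = 317, "1" = 272, "2" = 260, ">=3" = 151)

tab <- data.frame(
  children = names(n),
  n = unname(n),
  f = unname(n) / sum(n),        # relative frequencies
  p = 100 * unname(n) / sum(n)   # percentages
)
tab$N <- cumsum(tab$n)           # cumulative counts
tab$F <- cumsum(tab$f)           # cumulative frequencies
tab$P <- cumsum(tab$p)           # cumulative percentages
tab
```

The last cumulative count equals the sample size, which is the value reported on the Total line.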

Printing a frequency table

Note that in the printed table, the children series is no longer numeric, as its last two values are >=3 and Total. Actually, freq_table returns an object of class freq_table which inherits from the tbl_df class. A look at the structure of the object:

rgp %>% freq_table(children, max = 3, total = TRUE) %>% str
## tibble [5 × 2] (S3: freq_table/tbl_df/tbl/data.frame)
##  $ children: num [1:5] 0 1 2 3.5 NA
##  $ n       : int [1:5] 317 272 260 151 1000

indicates that we still have a tibble, with a numeric children series whose last two values are 3.5 (for 3 and more) and NA (for the total).

A pre_print function is provided with a method for freq_table objects. It converts the children series to labels, with 3.5 and NA replaced by >=3 and Total. This pre_print method is called by the format method for freq_table objects, which then passes the result to the tbl_df method:

descstat:::format.freq_table
## function (x, ..., n = NULL, width = NULL, n_extra = NULL) 
## {
##     x <- pre_print(x)
##     class(x) <- setdiff(class(x), "freq_table")
##     format(x, ..., n = n, width = width, n_extra = n_extra)
## }
## <bytecode: 0x555af2996238>
## <environment: namespace:descstat>
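
The effect of this conversion can be mimicked in base R. The helper below, label_children, is hypothetical (it is not part of descstat) and only assumes the internal coding described above, where a non-integer value marks the >=N line and NA marks the total:

```r
# Internal coding described above: 3.5 stands for ">=3", NA for the total
children <- c(0, 1, 2, 3.5, NA)

# Hypothetical helper (not part of descstat): build printable labels
label_children <- function(x) {
  lab <- as.character(x)
  half <- !is.na(x) & x %% 1 != 0            # non-integer value marks the ">=N" line
  lab[half] <- paste0(">=", floor(x[half]))
  lab[is.na(x)] <- "Total"                   # NA marks the total line
  lab
}

label_children(children)
```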

The pre_print function should be called explicitly when using knitr::kable, as this function doesn't use any format method:

rgp %>% freq_table(children, max = 3, total = TRUE) %>%
    pre_print %>% knitr::kable()
children      n
0           317
1           272
2           260
>=3         151
Total      1000

Plotting a frequency table

The most natural way to plot a frequency table with ggplot is to use geom_col (or equivalently geom_bar with stat = 'identity').

cld <- rgp %>% freq_table(children, cols = "nf", max = 3)
cld %>% pre_print %>% ggplot(aes(children, f)) +
    geom_col(fill = "white", color = "black")

Note the use of the pre_print method, which turns the 3.5 numerical value into >=3.

For more elaborate graphics, a pre_plot method is provided, with a plot argument equal to banner or cumulative. With plot = "banner":

cld %>% pre_print %>% pre_plot("f", plot = "banner")
## # A tibble: 4 x 3
##   children     f  ypos
##   <fct>    <dbl> <dbl>
## 1 >=3      0.151 0.924
## 2 2        0.26  0.719
## 3 1        0.272 0.453
## 4 0        0.317 0.158

pre_plot returns a ypos series which indicates the coordinates where the labels corresponding to the frequencies should be written.
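
The ypos values printed above are consistent with placing each label in the middle of its stacked segment. A minimal base R sketch of that computation, under this assumption and using the frequencies of the table above:

```r
# Frequencies from the table above, in stacking order (bottom to top)
f <- c("0" = 0.317, "1" = 0.272, "2" = 0.26, ">=3" = 0.151)

# Middle of each stacked segment: upper edge minus half the segment height
ypos <- cumsum(f) - f / 2
ypos
```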

bnp <- cld %>% pre_print %>% pre_plot("f", plot = "banner") %>%
    ggplot(aes(x = 2, y = f, fill = children)) +
    geom_col() +
    geom_text(aes(y = ypos, label = f)) +
    scale_x_continuous(label = NULL) +
    scale_fill_brewer(palette = "Set3")
bnp

Using polar coordinates, we get a pie chart:

bnp + coord_polar(theta = "y") + theme_void()

Changing the range of the x values, we get a hole in the pie chart, which results in the so-called donut chart:

bnp + scale_x_continuous(limits = c(1, 2.5)) +
    coord_polar(theta = "y") + theme_void()

The other possible value for the plot argument of pre_plot is cumulative:

cld <- rgp %>% freq_table(children, "F", max = 5, total = TRUE)
cld %>% pre_plot(plot = "cumulative") %>% print(n = 5)
## # A tibble: 12 x 5
##   pos       x  xend     y   yend
##   <chr> <dbl> <dbl> <dbl>  <dbl>
## 1 hor       0    NA 0.317  0.317
## 2 vert     NA    NA 0.317 NA    
## 3 hor       1     0 0.589  0.589
## 4 vert      0     0 0.589  0.317
## 5 hor       2     1 0.849  0.849
## # … with 7 more rows

This returns four series which have the names of the aesthetics that are mandatory for geom_segment: x, xend, y and yend. It also returns a pos series, which makes it possible to draw the horizontal (hor) and the vertical (vert) segments differently, using for example the linetype aesthetic.

cld %>% pre_plot(plot = "cumulative") %>% 
    ggplot() +
    geom_segment(aes(x = x, xend = xend, y = y, yend = yend,
                     linetype = pos)) +
    guides(linetype = "none") +
    labs(x = "number of children", y = "cumulative frequency")

Bins tables

descstat::bins_table computes a bins table, i.e. a table that contains the frequencies for different classes of a numerical series. The series can either be stored as numeric values or already coded as classes in the original (raw) tibble.

For example, the wages data set contains two series called wage and size which respectively indicate the class of wages and the class of firm size.

wages %>% print(n = 3)
## # A tibble: 1,000 x 6
##   sector           age hours sex    wage    size    
##   <fct>          <int> <int> <fct>  <fct>   <fct>   
## 1 industry          37  1712 male   [14,16) [1,10)  
## 2 administration    57   598 female [4,6)   [50,100)
## 3 business          30  1820 male   [40,50) [20,50) 
## # … with 997 more rows

The univariate distribution of, for example, size can be computed using the dplyr::count or the descstat::freq_table functions.

wages %>% count(size)
## # A tibble: 6 x 2
##   size          n
##   <fct>     <int>
## 1 [1,10)      207
## 2 [10,20)      90
## 3 [20,50)     124
## 4 [50,100)    113
## 5 [100,250)   124
## # … with 1 more row
wages %>% freq_table(size)
## # A tibble: 6 x 2
##   size          n
## * <fct>     <int>
## 1 [1,10)      207
## 2 [10,20)      90
## 3 [20,50)     124
## 4 [50,100)    113
## 5 [100,250)   124
## # … with 1 more row

Creating bins tables

The bins_table function provides a richer interface. Firstly, it returns a series called x which is the center of the classes. Secondly, a breaks argument is provided, which is a numerical vector that can be used to reduce the number of classes:

wages %>% bins_table(size, breaks = c(20, 250))
## # A tibble: 3 x 3
##   size          x     n
## * <fct>     <dbl> <int>
## 1 [1,20)     10.5   297
## 2 [20,250)  135     361
## 3 [250,Inf) 365     342

or to create classes from numerical values:

padova %>% pull(price) %>% range
## [1]  35 950
padova %>% bins_table(price, breaks = c(250, 500, 750))
## # A tibble: 4 x 3
##   price         x     n
## * <fct>     <dbl> <int>
## 1 (0,250]     125   655
## 2 (250,500]   375   328
## 3 (500,750]   625    45
## 4 (750,Inf]   875    14
padova %>% bins_table(price, breaks = c(30, 250, 500, 750, 1000))
## # A tibble: 4 x 3
##   price           x     n
## * <fct>       <dbl> <int>
## 1 (30,250]      140   655
## 2 (250,500]     375   328
## 3 (500,750]     625    45
## 4 (750,1e+03]   875    14

Note that in this latter case, the first and the last values of breaks can be either:

  • outside the range of the series; in this case the lower bound of the first class and the upper bound of the last class are these values,
  • inside the range of the series; in this case the lower bound of the first class is 0 and the upper bound of the last class is Inf.
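
The second rule can be reproduced with base::cut. This is only a sketch of the idea, not the code of bins_table; it uses the first six prices of the padova data set (95, 225, 225, 148, 138, 460), which are printed elsewhere in this document:

```r
# First six prices of the padova data set
price <- c(95, 225, 225, 148, 138, 460)

# Breaks inside the range of the series, completed with 0 and Inf
breaks <- c(250, 500, 750)
cls <- cut(price, breaks = c(0, breaks, Inf))

levels(cls)   # class labels, open on the left and closed on the right
table(cls)    # counts per class
```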

Moreover, as for freq_table, the cols argument enables the computation of counts, frequencies and percentages, and the cumulative values are obtained using upper-case letters.

wages %>% bins_table(size, cols = "nFp", breaks = c(20, 100, 250))
## # A tibble: 4 x 5
##   size          x     n     F     p
## * <fct>     <dbl> <int> <dbl> <dbl>
## 1 [1,20)     10.5   297 0.297  29.7
## 2 [20,100)   60     237 0.534  23.7
## 3 [100,250) 175     124 0.658  12.4
## 4 [250,Inf) 325     342 1      34.2

cols can also contain d for the density, which is the frequency divided by the class width, m for the mass of the class (the product of the frequency and the value of the variable, expressed as a share of the total) and M for the cumulated mass.

wages %>% bins_table(size, cols = "dmM", breaks = c(20, 100, 250))
## # A tibble: 4 x 5
##   size          x        d      m      M
## * <fct>     <dbl>    <dbl>  <dbl>  <dbl>
## 1 [1,20)     10.5 0.0156   0.0208 0.0208
## 2 [20,100)   60   0.00296  0.0947 0.115 
## 3 [100,250) 175   0.000827 0.144  0.260 
## 4 [250,Inf) 325   0.00228  0.740  1
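
These three columns can be checked by hand. The base R sketch below starts from the frequencies and class centers of the previous tables; the width of the last (open) class is an assumption on our side, taken equal to the width of the previous class, which is consistent with its center of 325:

```r
f <- c(0.297, 0.237, 0.124, 0.342)   # frequencies of the four size classes
x <- c(10.5, 60, 175, 325)           # class centers (column x)
a <- c(19, 80, 150, 150)             # class widths; last one assumed equal to the previous one

d <- f / a                           # density
m <- f * x / sum(f * x)              # mass, as a share of the total mass
M <- cumsum(m)                       # cumulated mass
round(cbind(d, m, M), 4)
```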

A total can be computed for f, n and p, as for freq_table objects.

wages %>% bins_table(size, cols = "dMfnFp", breaks = c(20, 100, 250),
                     total = TRUE)
## # A tibble: 5 x 8
##   size      x     f     n      F     p        d       M
## * <fct> <dbl> <dbl> <int>  <dbl> <dbl>    <dbl>   <dbl>
## 1 [1,2…  10.5 0.297   297  0.297  29.7  1.56e-2  0.0208
## 2 [20,…  60   0.237   237  0.534  23.7  2.96e-3  0.115 
## 3 [100… 175   0.124   124  0.658  12.4  8.27e-4  0.260 
## 4 [250… 325   0.342   342  1      34.2  2.28e-3  1     
## 5 Total  NA   1      1000 NA     100   NA       NA

Classes and values

For the computation of descriptive statistics, classes should be replaced by values. By default, the center of the class is used, which means that the computation is done as if all the observations of the class had a value equal to its center. This value is returned as x when using bins_table. Three arguments control what happens for the first and the last class:

  • a specific value for the first class can be indicated using the xfirst argument,
  • a specific value for the last class can be indicated using the xlast argument,
  • the width of the last class (if it is open to infinity on the right) can be set as a multiple of the width of the second-to-last class using the wlast argument.

To set the center of the first class to 10 (and not to 10.5), we can use:

wages %>% bins_table(size, breaks = c(20, 100, 250), xfirst = 10)
## # A tibble: 4 x 3
##   size          x     n
## * <fct>     <dbl> <int>
## 1 [1,20)       10   297
## 2 [20,100)     60   237
## 3 [100,250)   175   124
## 4 [250,Inf)   325   342

To set the center of the last class to 400, we can either set xlast to this value or indicate that the width of the last class should be twice that of the second-to-last class:

wages %>% bins_table(size, breaks = c(20, 100, 250), xlast = 400)
wages %>% bins_table(size, breaks = c(20, 100, 250), wlast = 2)
## # A tibble: 4 x 3
##   size          x     n
## * <fct>     <dbl> <int>
## 1 [1,20)     10.5   297
## 2 [20,100)   60     237
## 3 [100,250) 175     124
## 4 [250,Inf) 400     342

Other values of the variable can be included in the table using the vals argument, which is a character including some of the following letters:

  • x for the center of the class,
  • l for the lower bound of the class,
  • u for the upper bound of the class,
  • a for the width of the class.

wages %>% bins_table(size, vals = "xlua", cols = "p",
                     breaks = c(20, 100, 250), wlast = 2)
## # A tibble: 4 x 6
##   size          x     l     u     a     p
## * <fct>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 [1,20)     10.5     1    20    19  29.7
## 2 [20,100)   60      20   100    80  23.7
## 3 [100,250) 175     100   250   150  12.4
## 4 [250,Inf) 400     250   550   300  34.2
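
The effect of wlast on the last class can be sketched in a few lines of base R. The class_centers helper is hypothetical (it is not a descstat function) and only encodes the rule stated above: the width of an open last class is wlast times the width of the previous class.

```r
# Hypothetical helper: class centers when the last class is open on the right
class_centers <- function(lower, upper, wlast = 1) {
  k <- length(lower)
  if (is.infinite(upper[k])) {
    # width of the last class = wlast times the width of the previous class
    upper[k] <- lower[k] + wlast * (upper[k - 1] - lower[k - 1])
  }
  (lower + upper) / 2
}

lower <- c(1, 20, 100, 250)
upper <- c(20, 100, 250, Inf)
class_centers(lower, upper)              # last center: 325
class_centers(lower, upper, wlast = 2)   # last center: 400
```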

Under the hood: converting classes to values

bins_table internally calls the function cls2val, which turns the classes into values of the variable. Its main argument is a factor (or a character) containing classes of numerical values, i.e. strings of the form [a,b) or (a,b], where a and b are respectively the lower and the upper bounds of the class. In the first case, the class is closed on the left (a is included) and open on the right (b is not included), whereas in the second case, the class is open on the left and closed on the right. This is the notation used when the base::cut function is used to transform a numerical value into a class, using a vector of breaks.

padova %>% pull(price) %>% head
## [1]  95 225 225 148 138 460
padova %>% pull(price) %>%
    cut(breaks = c(0, 250, 500, 1000), right = FALSE) %>% head
## [1] [0,250)   [0,250)   [0,250)   [0,250)   [0,250)  
## [6] [250,500)
## Levels: [0,250) [250,500) [500,1e+03)

Note the use of the right argument so that the classes are closed on the left and open on the right.

descstat::recut can be used to reduce the number of classes, by providing a numerical vector which should be a subset of the initial class limits:

wages %>% pull(wage) %>% levels %>% head
## [1] "[0.2,0.5)" "[0.5,1)"   "[1,1.5)"   "[1.5,2)"  
## [5] "[2,3)"     "[3,4)"
wages2 <- wages %>% mutate(wage = recut(wage, c(10, 20, 50)))
wages2 %>% pull(wage) %>% levels
## [1] "[0.2,10)" "[10,20)"  "[20,50)"  "[50,Inf)"

The cls2val function takes as arguments a series which contains numerical classes, a pos argument and the three arguments xfirst, xlast and wlast previously described. pos is a numerical value which can take any value between 0 and 1:

  • pos = 0 returns the lower bound of the class,
  • pos = 1 returns the upper bound of the class,
  • pos = 0.5 returns the center of the class.

wages2 %>% select(wage) %>%
    mutate(lb = cls2val(wage, 0),
           ub = cls2val(wage, 1),
           ct = cls2val(wage, 0.5, xfirst = 5, xlast = 100))
## # A tibble: 1,000 x 4
##   wage        lb    ub    ct
##   <fct>    <dbl> <dbl> <dbl>
## 1 [10,20)   10      20    15
## 2 [0.2,10)   0.2    10     5
## 3 [20,50)   20      50    35
## 4 [0.2,10)   0.2    10     5
## 5 [50,Inf)  50      80   100
## # … with 995 more rows
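
To make the role of pos concrete, here is a minimal base R sketch of the conversion for classes with finite bounds. class_value is a hypothetical helper, not the cls2val function itself, and it ignores the xfirst, xlast and wlast refinements (so the first center is 5.1 rather than the value of 5 forced by xfirst above):

```r
# Hypothetical helper mimicking the idea behind cls2val for bounded classes
class_value <- function(cls, pos = 0.5) {
  inner <- substr(cls, 2, nchar(cls) - 1)          # drop the opening and closing brackets
  parts <- strsplit(inner, ",", fixed = TRUE)      # split "a,b" into its two bounds
  lower <- as.numeric(vapply(parts, `[`, character(1), 1))
  upper <- as.numeric(vapply(parts, `[`, character(1), 2))
  lower + pos * (upper - lower)                    # pos = 0: lower, 1: upper, 0.5: center
}

cls <- c("[0.2,10)", "[10,20)", "[20,50)")
class_value(cls, 0)
class_value(cls, 1)
class_value(cls, 0.5)
```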

Plotting bins tables

ggplot provides three geoms to plot the distribution of a numerical series: geom_histogram, geom_freqpoly and geom_density. The first two use the bin stat, which means that they take a raw vector of numerical values, create classes, count the number of observations in each class and then plot the result; geom_density uses the density stat, which computes a kernel density estimate from the same raw values:

padova %>% ggplot(aes(price)) +
    geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white") +
    geom_freqpoly(aes(y = after_stat(density)), color = "red") +
    geom_density(color = "blue")

These geoms can be used when individual numerical data are available, but not when only classes are observed. A pre_plot method is provided for bins_table objects, with the plot argument either equal to histogram (the default) or freqpoly. The resulting table contains columns called x and y and can be plotted using geom_polygon for a histogram and geom_line for a frequency polygon.

wages %>% bins_table(wage, "d", breaks = c(10, 20, 30, 40, 50)) %>%
    pre_plot(plot = "histogram") %>%
    ggplot(aes(x, y)) + geom_polygon(fill = "white", color = "black")

wages %>% bins_table(wage, "d", breaks = c(10, 20, 30, 40, 50)) %>%
    pre_plot(plot = "freqpoly") %>%
    ggplot(aes(x, y)) + geom_line()

A Lorenz curve can be plotted using plot = "lorenz". Note that in this case, the table should contain F and M.

lzc <- wages %>% bins_table(wage, "MF", breaks = c(10, 20, 30, 40, 50)) %>%
    pre_plot(plot = "lorenz")
lzc
## # A tibble: 24 x 5
##   cls      pos   pts       F      M
## * <fct>    <fct> <lgl> <dbl>  <dbl>
## 1 [0.2,10) sw    FALSE 0     0     
## 2 [0.2,10) nw    TRUE  0     0     
## 3 [0.2,10) ne    TRUE  0.184 0.0391
## 4 [0.2,10) se    FALSE 0.184 0     
## 5 [10,20)  sw    FALSE 0.184 0     
## # … with 19 more rows

Each line in the resulting tibble indicates the coordinates (F for x and M for y) of the points that are necessary to plot the polygons under the Lorenz curve. pts is a logical which indicates which lines correspond to points that are on the Lorenz curve.

lzc %>% ggplot(aes(F, M)) +
    geom_polygon(fill = "lightyellow", color = "black") +
    geom_point(data = filter(lzc, pts)) +
    geom_line(data = tibble(F = c(0, 1), M = c(0, 1)), color = "blue") +
    geom_line(data = tibble(F = c(0, 1, 1), M = c(0, 0, 1)), color = "red")

Computing descriptive statistics

Descriptive statistics can easily be computed by applying functions to bins_table objects. The problem is that only a few statistical functions of the base and the stats packages are generic. For these, methods were written for bins_table objects; for the other ones, we had to create new functions.

base R      descstat
mean        mean
median      median
quantile    quantile
var         variance
sd          stdev
mad         madev
-           modval
-           medial
-           gini
To compute the central tendency statistics of the distribution of wages, we use:

z <- wages %>% bins_table(wage)
z %>% mean
## [1] 24.12
z %>% median
## [1] 21.94
z %>% modval
## # A tibble: 1 x 3
##   wage        x     n
##   <fct>   <dbl> <int>
## 1 [24,26)    25    69

median returns a value computed using Thales' theorem, i.e. by linear interpolation within the median class. modval returns the mode, as a one-line tibble containing the modal class, its center and its count.
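
The interpolation can be sketched as follows in base R. binned_quantile is a hypothetical helper written for this illustration, and the example reuses the four size classes of the earlier bins table rather than the wage series, so it does not reproduce the value printed above; the upper bound of the last class is an assumed value that plays no role here:

```r
# Hypothetical helper: quantile of a binned series by linear interpolation
binned_quantile <- function(lower, upper, f, prob = 0.5) {
  Fcum <- cumsum(f)
  k <- which(Fcum >= prob)[1]               # class that contains the quantile
  Fprev <- if (k > 1) Fcum[k - 1] else 0    # cumulative frequency below that class
  lower[k] + (prob - Fprev) / f[k] * (upper[k] - lower[k])
}

lower <- c(1, 20, 100, 250)
upper <- c(20, 100, 250, 400)               # last bound is an assumption
f <- c(0.297, 0.237, 0.124, 0.342)
binned_quantile(lower, upper, f, prob = 0.5)
```

Here the median falls in the [20,100) class, since the cumulative frequency goes from 0.297 to 0.534 over that class.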

For the dispersion statistics:

z %>% stdev
## [1] 14.68
z %>% variance
## [1] 215.4

For the quantiles, the argument y can be used to compute the quantiles using the values of the variable (y = "value", the default) or the masses (y = "mass"):

z %>% quantile(probs = c(0.25, 0.5, 0.75))
## [1] 13.45 21.94 32.31
z %>% quantile(y = "mass", probs = c(0.25, 0.5, 0.75))
## [1] 21.58 30.39 46.00

The quantile of level 0.5 is the median in the first case and the medial in the second case:

z %>% median
## [1] 21.94
z %>% medial
## [1] 30.39

gini computes the Gini coefficient of the series:

z %>% gini
## [1] 0.3403
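
A common way to approximate the Gini coefficient from binned data is to compute the area under the Lorenz curve with trapezoids. The sketch below applies this textbook formula to the four size classes used earlier (frequencies and class centers); we do not claim that it is the exact computation performed by gini, and since it uses another series it does not reproduce the value above:

```r
f <- c(0.297, 0.237, 0.124, 0.342)          # frequencies, classes sorted by increasing value
x <- c(10.5, 60, 175, 325)                  # class centers

Fcum <- c(0, cumsum(f))                     # cumulative frequencies (x axis of the Lorenz curve)
Mcum <- c(0, cumsum(f * x) / sum(f * x))    # cumulative masses (y axis of the Lorenz curve)

# Area under the Lorenz curve, one trapezoid per class
area <- sum(diff(Fcum) * (head(Mcum, -1) + tail(Mcum, -1)) / 2)
gini_sketch <- 1 - 2 * area
gini_sketch
```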

Contingency tables

With dplyr, a contingency table can be computed using count with two categorical variables.

Let's first reduce the number of classes of size and wage in the wages table.

wages2 <- wages %>%
    mutate(size = recut(size, c(20, 50, 100)),
           wage = recut(wage, c(10, 30, 50)))
wages2 %>% count(size, wage)
## # A tibble: 16 x 3
##   size    wage         n
##   <fct>   <fct>    <int>
## 1 [1,20)  [0.2,10)    74
## 2 [1,20)  [10,30)    169
## 3 [1,20)  [30,50)     41
## 4 [1,20)  [50,Inf)    13
## 5 [20,50) [0.2,10)    17
## # … with 11 more rows

To get a "wide" table, with the values of one of the two variables being the columns, we can use tidyr::pivot_wider:

wages2 %>% count(size, wage) %>%
    tidyr::pivot_wider(values_from = n, names_from = size)
## # A tibble: 4 x 5
##   wage     `[1,20)` `[20,50)` `[50,100)` `[100,Inf)`
##   <fct>       <int>     <int>      <int>       <int>
## 1 [0.2,10)       74        17         30          63
## 2 [10,30)       169        75         55         236
## 3 [30,50)        41        26         21         103
## 4 [50,Inf)       13         6          7          64

Creating contingency tables using cont_table

The same contingency table can be obtained using descstat::cont_table:

wages2 %>% cont_table(wage, size)
## # A tibble: 4 x 5
##   `wage|size` `[1,20)` `[20,50)` `[50,100)` `[100,Inf)`
##   <chr>          <int>     <int>      <int>       <int>
## 1 [0.2,10)          74        17         30          63
## 2 [10,30)          169        75         55         236
## 3 [30,50)           41        26         21         103
## 4 [50,Inf)          13         6          7          64

The result is a cont_table object, which is a tibble in "long" format, like the result of the dplyr::count function. The printing of the table in "wide" format is performed by the pre_print method, which is called by the format method, but should be used explicitly when using knitr::kable.

wages2 %>% cont_table(wage, size) %>%
    pre_print %>% knitr::kable()
wage|size   [1,20)  [20,50)  [50,100)  [100,Inf)
[0.2,10)        74       17        30         63
[10,30)        169       75        55        236
[30,50)         41       26        21        103
[50,Inf)        13        6         7         64

The total argument can be set to TRUE to get a row and a column of totals for the two series.

wages2 %>% cont_table(wage, size, total = TRUE)
## # A tibble: 5 x 6
##   `wage|size` `[1,20)` `[20,50)` `[50,100)` `[100,Inf)`
##   <chr>          <int>     <int>      <int>       <int>
## 1 [0.2,10)          74        17         30          63
## 2 [10,30)          169        75         55         236
## 3 [30,50)           41        26         21         103
## 4 [50,Inf)          13         6          7          64
## 5 Total            297       124        113         466
## # … with 1 more variable: Total <int>
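
For comparison, the same totals can be obtained in base R from the matrix of counts. This is only a sketch using the counts printed above; the object names are ours:

```r
# Counts of the contingency table above (rows: wage, columns: size)
n <- matrix(c( 74, 17, 30,  63,
              169, 75, 55, 236,
               41, 26, 21, 103,
               13,  6,  7,  64),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("[0.2,10)", "[10,30)", "[30,50)", "[50,Inf)"),
                            c("[1,20)", "[20,50)", "[50,100)", "[100,Inf)")))

# Add a column of row totals, then a row of column totals
tot <- cbind(n, Total = rowSums(n))
tot <- rbind(tot, Total = c(colSums(n), sum(n)))
tot
```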

A weights argument can be used to weight the observations so that the sample mimics the population. For example, the employment table contains a series of weights called weights:

employment %>% cont_table(age, sex, weights = weights, total = TRUE)
## # A tibble: 6 x 4
##   `age|sex`   male female  Total
##   <chr>      <dbl>  <dbl>  <dbl>
## 1 [15,30)   12676. 14050. 26726.
## 2 [30,40)    8482. 12563. 21045.
## 3 [40,50)    9254. 12355. 21610.
## 4 [50,60)   10090. 10456. 20546.
## 5 [60,Inf)  15246. 21409. 36654.
## # … with 1 more row
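
In a weighted table, each cell contains the sum of the weights of its observations instead of their number. A base R sketch of this idea with stats::xtabs, on a tiny made-up data frame (the values below are hypothetical and are not taken from the employment data set):

```r
# Hypothetical micro-data: one row per respondent, with a sampling weight
d <- data.frame(
  age = c("[15,30)", "[15,30)", "[30,40)", "[30,40)", "[30,40)"),
  sex = c("male", "female", "male", "male", "female"),
  weights = c(1.5, 2, 0.5, 1, 3)
)

# Sum of the weights in each (age, sex) cell
wtab <- xtabs(weights ~ age + sex, data = d)
wtab
```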

As for bins_table, central values for the first and the last class can be set using arguments xfirst#, xlast# and wlast#, where # is equal to 1 or 2 for the first and the second variable indicated in the cont_table function.

Plotting a contingency table

A contingency table can be plotted using geom_point, with the size of the points being proportional to the count of the cells. The pre_plot method replaces classes by values.

wages2 %>% cont_table(size, wage) %>% pre_plot %>%
    ggplot() + geom_point(aes(size, wage, size = n))

Computing the distributions from a contingency table

\(n_{ij}\) is the count of the cell corresponding to the \(i\)th modality of the first variable and the \(j\)th modality of the second one.

  • the joint distribution is obtained by dividing \(n_{ij}\) by the sample size,
  • the marginal distributions are obtained by summing the counts over the modalities of the other variable, \(n_{i.}=\sum_{j}n_{ij}\) for the first variable and \(n_{.j} = \sum_{i}n_{ij}\) for the second one, and dividing by the sample size,
  • the conditional distribution is obtained by dividing the joint by the marginal frequencies.

The joint, marginal and conditional functions return these three distributions. The last two require an argument y which is one of the two variables of the cont_table object.

wht <- wages2 %>% cont_table(size, wage)
wht %>% joint
## # A tibble: 4 x 5
##   `size|wage` `[0.2,10)` `[10,30)` `[30,50)` `[50,Inf)`
##   <chr>            <dbl>     <dbl>     <dbl>      <dbl>
## 1 [1,20)           0.074     0.169     0.041      0.013
## 2 [20,50)          0.017     0.075     0.026      0.006
## 3 [50,100)         0.03      0.055     0.021      0.007
## 4 [100,Inf)        0.063     0.236     0.103      0.064
wht %>% marginal(size)
## # A tibble: 4 x 2
##   size          f
## * <chr>     <dbl>
## 1 [1,20)    0.297
## 2 [20,50)   0.124
## 3 [50,100)  0.113
## 4 [100,Inf) 0.466
wht %>% conditional(size)
## # A tibble: 4 x 5
##   `size|wage` `[0.2,10)` `[10,30)` `[30,50)` `[50,Inf)`
##   <chr>            <dbl>     <dbl>     <dbl>      <dbl>
## 1 [1,20)          0.402      0.316     0.215     0.144 
## 2 [20,50)         0.0924     0.140     0.136     0.0667
## 3 [50,100)        0.163      0.103     0.110     0.0778
## 4 [100,Inf)       0.342      0.441     0.539     0.711
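
The three distributions can be recomputed from the matrix of counts in base R. The sketch below uses the counts of the size by wage table; it illustrates the definitions and is not the code of the joint, marginal and conditional functions:

```r
# Counts n_ij (rows: size, columns: wage)
n <- matrix(c(74, 169,  41, 13,
              17,  75,  26,  6,
              30,  55,  21,  7,
              63, 236, 103, 64),
            nrow = 4, byrow = TRUE,
            dimnames = list(size = c("[1,20)", "[20,50)", "[50,100)", "[100,Inf)"),
                            wage = c("[0.2,10)", "[10,30)", "[30,50)", "[50,Inf)")))

joint_f   <- n / sum(n)              # joint distribution
marg_size <- rowSums(joint_f)        # marginal distribution of size
marg_wage <- colSums(joint_f)        # marginal distribution of wage

# Distribution of size conditional on wage: each column sums to 1
cond_size <- sweep(joint_f, 2, marg_wage, "/")
round(cond_size, 3)
```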

Computing descriptive statistics

Descriptive statistics can be computed using any of the three distributions. Using the joint distribution, we get a tibble containing two columns for the two variables:

wht %>% joint %>% mean
## # A tibble: 1 x 2
##    size  wage
##   <dbl> <dbl>
## 1  74.2  24.7
wht %>% joint %>% stdev
## # A tibble: 1 x 2
##    size  wage
##   <dbl> <dbl>
## 1  51.0  15.5
wht %>% joint %>% variance
## # A tibble: 1 x 2
##    size  wage
##   <dbl> <dbl>
## 1 2598.  239.

The same (univariate) statistics can be obtained using the marginal distribution:

wht %>% marginal(size) %>% mean
## [1] 74.18
wht %>% marginal(size) %>% stdev
## [1] 50.97

or even more simply considering the univariate distribution computed by bins_table:

wages2 %>% bins_table(size) %>% mean
## [1] 74.18
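
These values can be checked by hand from the marginal frequencies and the class centers of size (10.5, 35, 75 and 125, the values reported as size_val further down in this document). The sketch below assumes frequency-weighted moments, i.e. a variance without the n - 1 correction, which is what the printed results suggest:

```r
f <- c(0.297, 0.124, 0.113, 0.466)   # marginal distribution of size
x <- c(10.5, 35, 75, 125)            # class centers of size

m <- sum(f * x)                      # weighted mean
v <- sum(f * (x - m)^2)              # weighted variance (no n - 1 correction)
c(mean = m, variance = v, stdev = sqrt(v))
```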

The mean, stdev and variance methods are actually only useful when applied to a conditional distribution; in this case, considering for example the conditional distribution of the first variable, as many values are returned as there are modalities of the second (conditioning) variable.

wht %>% conditional(wage) %>% mean
## # A tibble: 4 x 2
##   size       wage
##   <chr>     <dbl>
## 1 [1,20)     20.8
## 2 [20,50)    24.1
## 3 [50,100)   22.2
## 4 [100,Inf)  27.9
wht %>% conditional(wage) %>% variance
## # A tibble: 4 x 2
##   size       wage
##   <chr>     <dbl>
## 1 [1,20)     180.
## 2 [20,50)    175.
## 3 [50,100)   227.
## 4 [100,Inf)  276.

Regression curve

The total variance of \(X\) can be written as the sum of:

  • the explained variance, ie the variance of the conditional means,
  • the residual variance, ie the mean of the conditional variances.

\[ s_{x}^2 = \sum_j f_{.j} s^2_{x_j} + \sum_j f_{.j} (\bar{x}_j - \bar{\bar{x}}) ^ 2 \]

The decomposition of the variance can be computed by joining tables containing the conditional moments and the marginal distribution of the conditioning variable and then applying the formula:

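# Conditional means and variances of wage for each size class
cm <- wht %>% conditional(wage) %>% mean %>% rename(mean = wage)
cv <- wht %>% conditional(wage) %>% variance %>% rename(variance = wage)
# Marginal distribution of the conditioning variable (size)
md <- wht %>% marginal(size)
md %>% left_join(cm, by = "size") %>% left_join(cv, by = "size") %>%
    summarise(om = sum(f * mean),             # overall mean, weighted by f
              ev = sum(f * (mean - om) ^ 2),  # explained variance
              rv = sum(f * variance),         # residual variance
              tv = ev + rv)                   # total variance
## # A tibble: 1 x 4
##      om    ev    rv    tv
##   <dbl> <dbl> <dbl> <dbl>
## 1  24.7  10.0  229.  239.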

Or more simply using the var_decomp function:

wht_wage <- wht %>% var_decomp("wage")
wht_wage
## # A tibble: 4 x 5
##   size      size_val     f  mean   var
## * <chr>        <dbl> <dbl> <dbl> <dbl>
## 1 [1,20)        10.5 0.297  20.8  180.
## 2 [20,50)       35   0.124  24.1  175.
## 3 [50,100)      75   0.113  22.2  227.
## 4 [100,Inf)    125   0.466  27.9  276.

which has a summary method which computes the different elements of the decomposition:

wht_wage %>% summary
## # A tibble: 1 x 4
##   inter intra total  ratio
##   <dbl> <dbl> <dbl>  <dbl>
## 1  10.0  229.  239. 0.0419

and especially the correlation ratio, which is obtained by dividing the explained variance by the total variance.
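
The decomposition can also be checked in base R from the matrix of counts. The wage class centers used below (5.1, 20, 40 and 60) are an assumption on our side: they follow the default center rule, with the last class given the width of the previous one, and they reproduce the conditional means printed above. The object names are ours:

```r
# Counts n_ij (rows: size, columns: wage)
n <- matrix(c(74, 169,  41, 13,
              17,  75,  26,  6,
              30,  55,  21,  7,
              63, 236, 103, 64),
            nrow = 4, byrow = TRUE)
y <- c(5.1, 20, 40, 60)                     # assumed wage class centers

f_size <- rowSums(n) / sum(n)               # marginal distribution of size
cond   <- n / rowSums(n)                    # wage conditional on size: rows sum to 1
cmean  <- as.vector(cond %*% y)             # conditional means of wage
cvar   <- as.vector(cond %*% y^2) - cmean^2 # conditional variances of wage

omean <- sum(f_size * cmean)                # overall mean, weighted by f
inter <- sum(f_size * (cmean - omean)^2)    # explained (inter) variance
intra <- sum(f_size * cvar)                 # residual (intra) variance
c(inter = inter, intra = intra, total = inter + intra,
  ratio = inter / (inter + intra))
```

Note that the overall mean is the mean of the conditional means weighted by the marginal frequencies, not their simple average.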

The regression curve of wage on size can be plotted using wht_wage, together with error bars:

wht_wage %>% ggplot(aes(size_val, mean)) + geom_point() +
    geom_line(lty = "dotted") +
    geom_errorbar(aes(ymin = mean - sqrt(var), ymax = mean + sqrt(var))) +
    labs(x = "size", y = "wage")    

Linear regression

For the joint distribution, two other functions are provided, covariance and correlation (which are the equivalent of the non-generic stats::cov and stats::cor functions) to compute the covariance and the linear coefficient of correlation.

wht %>% joint %>% covariance
## [1] 92.26
wht %>% joint %>% correlation
## [1] 0.117

The regression line can be computed using the regline function:

rl <- regline(wage ~ size, wht)
rl
## [1] -67.586   1.244

which returns the intercept and the slope of the regression of wage on size. We can then draw the points and the regression line:

wht %>% pre_plot %>% ggplot() + geom_point(aes(size, wage, size = n)) +
    geom_abline(intercept = rl[1], slope = rl[2])