Many statistical methods, particularly Bayesian methods, produce random draws from a distribution. A useful feature of these draws is that they can be used to make inferences about derived quantities. The procedure is:
If, for instance, we have randoms draws of age-specific mortality rates, and we want make inferences about life expectancy (a summary indicator for mortality rates), then we proceed as follows:
For more on the theory behind manipulating random draws, and for an argument that R needs high-level tools to help with this manipulation, see Kerman and Gelman (2007).
Package rvec provides tools for working with random draws. The draws are held in a structure called an rvec, which can, for many purposes, be treated like an ordinary R vector, and manipulated using ordinary base R and tidyverse code. rvec also contains functions for summarizing across random draws.
We begin with a toy example, to illustrate basic functionality.
library(rvec)
#>
#> Attaching package: 'rvec'
#> The following objects are masked from 'package:stats':
#>
#> sd, var
#> The following object is masked from 'package:base':
#>
#> rank
<- list(c(3, 1, 0))
l <- rvec(l)
theta
theta#> <rvec_dbl<3>[1]>
#> [1] 3,1,0
The header <rvec_dbl<3>[1]>
describe the structure of theta
:
_dbl
indicates that theta
is composed of doubles<3>
indicates that theta
holds three random draws, and[1]
indicates that each draw has length 1.We can perform standard mathematical operations:
^2 + 1
theta#> <rvec_dbl<3>[1]>
#> [1] 10,2,1
theta
recycles to match the length of other vectors,
<- theta + c(1, -1)
beta
beta#> <rvec_dbl<3>[2]>
#> [1] 4,2,1 2,0,-1
beta
consists of three draws:
c(4, 2)
, obtained from adding 3
and c(1, -1)
,c(2, 0)
, obtained from adding 1
and c(1, -1)
, andc(1, -1)
, obtained from adding 0
and c(1, -1)
.To summarize across random draws, we use draws_*
functions, e.g.
draws_mean(beta)
#> [1] 2.3333333 0.3333333
Our next example is more involved, and includes the use of some standard tidyverse packages.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(ggplot2)
We analyse a posterior sample from a Bayesian model of divorce rates in New Zealand. The rates are divorces per thousand people per year, disaggregated by age and sex. The rates are not stored as an rvec, but instead in a ‘data base’ format, where each row describes a single draw.
divorce#> # A tibble: 22,000 × 4
#> age sex draw rate
#> <fct> <chr> <int> <dbl>
#> 1 15-19 Female 1 0.0462
#> 2 15-19 Female 2 0.0369
#> 3 15-19 Female 3 0.0448
#> 4 15-19 Female 4 0.0411
#> 5 15-19 Female 5 0.0333
#> 6 15-19 Female 6 0.0511
#> 7 15-19 Female 7 0.0249
#> 8 15-19 Female 8 0.0280
#> 9 15-19 Female 9 0.0339
#> 10 15-19 Female 10 0.0425
#> # ℹ 21,990 more rows
First we convert from database format to rvec format.
<- divorce |>
divorce_rv collapse_to_rvec(value = rate)
divorce_rv#> # A tibble: 22 × 3
#> age sex rate
#> <fct> <chr> <rdbl<1000>>
#> 1 15-19 Female 0.036 (0.019, 0.068)
#> 2 20-24 Female 0.67 (0.58, 0.78)
#> 3 25-29 Female 3.2 (3, 3.4)
#> 4 30-34 Female 5.8 (5.5, 6.1)
#> 5 35-39 Female 6.5 (6.2, 6.9)
#> 6 40-44 Female 7.1 (6.8, 7.4)
#> 7 45-49 Female 7.2 (6.9, 7.6)
#> 8 50-54 Female 6 (5.8, 6.3)
#> 9 55-59 Female 4.4 (4.2, 4.7)
#> 10 60-64 Female 2.7 (2.5, 3)
#> # ℹ 12 more rows
When the number of draws is large, the print method displays <median> (<2.5% quantile>, <97.5% quantile>)
for the distribution, rather than the individual draws.
We define the ‘total divorce rate’ to be the number of divorces that a person would expect to experience over their lifetime under prevailing divorce rates. The total divorce rate can be calculated as
|>
divorce_rv group_by(sex) |>
summarise(TDR = sum(rate) * 5 / 1000)
#> # A tibble: 2 × 2
#> sex TDR
#> <chr> <rdbl<1000>>
#> 1 Female 0.22 (0.22, 0.23)
#> 2 Male 0.23 (0.22, 0.23)
We summarize across draws using draws_ci()
, which, by default, calculates medians and 95% credible intervals. Function draws_ci()
returns a tibble rather than a vector, so, following standard mutate
rules, we do not explicitly create new columns.
|>
divorce_rv group_by(sex) |>
summarise(tdr = sum(rate) * 5 / 1000) |>
mutate(draws_ci(tdr))
#> # A tibble: 2 × 5
#> sex tdr tdr.lower tdr.mid tdr.upper
#> <chr> <rdbl<1000>> <dbl> <dbl> <dbl>
#> 1 Female 0.22 (0.22, 0.23) 0.218 0.223 0.228
#> 2 Male 0.23 (0.22, 0.23) 0.221 0.226 0.231
Next we calculate the ratio between female and male divorce rates,
<- divorce_rv |>
divorce_ratio pivot_wider(names_from = sex, values_from = rate) |>
mutate(ratio = Female / Male) |>
mutate(draws_ci(ratio))
and graph the result
ggplot(divorce_ratio,
aes(x = age,
ymin = ratio.lower,
y = ratio.mid,
ymax = ratio.upper)) +
geom_pointrange()
The class "rvec"
has four subclasses:
"rvec_dbl"
, which holds doubles, e.g. 3.142
, -1.01
."rvec_int"
, which holds integers, e.g. 42
, -1
."rvec_lgl"
, which holds TRUE
, FALSE
, and NA
."rvec_chr"
, which hold characters, e.g. "a"
, "Thomas Bayes"
.Internally, an rvec is a matrix, with each row representing one unknown quantity, and each column representing one draw from the joint distribution of the unknown quantities,
Draw 1 | Draw 2 | \(\dots\) | Draw \(n\) | |
---|---|---|---|---|
Quantity 1 | \(\theta_{11}\) | \(\theta_{12}\) | \(\dots\) | \(\theta_{1n}\) |
Quantity 2 | \(\theta_{21}\) | \(\theta_{22}\) | \(\dots\) | \(\theta_{2n}\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
Quantity \(m\) | \(\theta_{m1}\) | \(\theta_{m2}\) | \(\dots\) | \(\theta_{mn}\) |
Ordinary functions are applied independently to each column. For instance, calling sum()
on an rvec creates a new rvec with structure
Draw 1 | Draw 2 | \(\dots\) | Draw \(n\) | |
---|---|---|---|---|
Quantity 1 | \(\sum_{i=1}^m\theta_{i1}\) | \(\sum_{i=1}^m\theta_{i2}\) | \(\dots\) | \(\sum_{i=1}^m\theta_{in}\) |
Functions with a draws_
prefix are applied independently to each row. For instance, calling draws_mean()
on an rvec creates a new numeric vector with structure
Value | |
---|---|
Quantity 1 | \(\frac{1}{n}\sum_{j=1}^n\theta_{1j}\) |
Quantity 2 | \(\frac{1}{n}\sum_{j=1}^n\theta_{2j}\) |
\(\vdots\) | \(\vdots\) |
Quantity \(m\) | \(\frac{1}{n}\sum_{j=1}^n\theta_{mj}\) |
Each rvec holds a fixed number of draws. Two rvecs can only be used together in a function if
An individual rvec can be created from a list of vectors,
<- list(LETTERS, letters)
x rvec(x)
#> <rvec_chr<26>[2]>
#> [1] ..A.. ..a..
a matrix,
<- matrix(rnorm(2000), nrow = 2)
x rvec(x)
#> <rvec_dbl<1000>[2]>
#> [1] 0.012 (-1.9, 1.8) -0.015 (-1.9, 1.9)
or an atomic vector
<- c(TRUE, FALSE)
x rvec(x)
#> <rvec_lgl<1>[2]>
#> [1] T F
Function rvec()
chooses from classes "rvec_dbl"
, "rvec_int"
, "rvec_lgl"
, and "rvec_chr"
, based on the input. To enforce a particular class, use function rvec_dbl()
, rvec_int()
, rvec_lgl()
, or rvec_chr()
,
<- list(1:3)
x rvec(x)
#> <rvec_int<3>[1]>
#> [1] 1,2,3
rvec_dbl(x)
#> <rvec_dbl<3>[1]>
#> [1] 1,2,3
rvec_chr(x)
#> <rvec_chr<3>[1]>
#> [1] "1","2","3"
When the raw data take the form of a database with one draw per row, the most efficient way to create rvecs is to use collapse_to_rvec()
. See Section 2.2 for an example.
Section 6 shows how to create an rvec consisting of draws from a standard probability distribution.
Mathematical and logical operations are applied independently to each draw.
<- rvec(list(c(TRUE, FALSE),
x c(TRUE, TRUE)))
all(x)
#> <rvec_lgl<2>[1]>
#> [1] T,F
any(x)
#> <rvec_lgl<2>[1]>
#> [1] T,T
User-defined functions that consist entirely of standard mathematical and logical operations should work with no modifications.
<- function(p) log(p / (1-p))
logit tibble(
x = rvec(list(c(0.2, 0.4),
c(0.6, 0.9))),
y = logit(x)
)#> # A tibble: 2 × 2
#> x y
#> <rdbl<2>> <rdbl<2>>
#> 1 0.2,0.4 -1.386,-0.4055
#> 2 0.6,0.9 0.4055,2.197
Multiplying an rvec by a matrix produces an rvec,
<- rbind(c(1, 1),
m c(0, 1))
<- rvec(list(1:2,
x 3:4))
%*% x
m #> <rvec_dbl<2>[2]>
#> [1] 4,6 3,4
rvec contains a suite of functions for summarising weighted data:
weighted_mad()
weighted_mean()
weighted_median()
weighted_sd()
weighted_var()
All of these are built on functions from package matrixStats.
The elements of an rvec do not have a well-defined order when there is more than one draw. Functions sort()
and order()
fail when called on an rvec, unless n_draw
is 1.
Ranking does, however, have a useful interpretation. We apply the ranking operation independently to each draw, and return the results as an integer rvec.
|>
divorce_ratio select(age, ratio) |>
mutate(rank = rank(ratio))
#> # A tibble: 11 × 3
#> age ratio rank
#> <fct> <rdbl<1000>> <rint<1000>>
#> 1 15-19 1.7 (0.66, 4.3) 10 (2, 11)
#> 2 20-24 2.1 (1.6, 2.6) 11 (10, 11)
#> 3 25-29 1.6 (1.4, 1.7) 9 (9, 10)
#> 4 30-34 1.2 (1.1, 1.3) 8 (8, 9)
#> 5 35-39 1.1 (1, 1.2) 7 (5, 8)
#> 6 40-44 1 (0.98, 1.1) 6 (5, 7)
#> 7 45-49 1 (0.93, 1.1) 5 (5, 7)
#> 8 50-54 0.89 (0.83, 0.94) 4 (4, 5)
#> 9 55-59 0.79 (0.74, 0.85) 3 (2, 4)
#> 10 60-64 0.73 (0.65, 0.82) 2 (2, 3)
#> 11 65+ 0.51 (0.45, 0.57) 1 (1, 1)
The rank of the 15-19 age group is uncertain, while the rank of the 65+ age group is estimated precisely.
Standard R probability functions such as dnorm()
or rbinom()
do not allow rvec arguments. Package rvec provides modified functions that do. For instance,
<- rvec(list(c(-1, 0.2),
y c(3, -7)))
<- rvec(list(c(0, 1),
mu c(0, -1)))
dnorm_rvec(y, mean = mu, sd = 3)
#> <rvec_dbl<2>[2]>
#> [1] 0.1258,0.1283 0.08066,0.018
rbinom_rvec(n = 2, size = round(y+10), prob = 0.8)
#> <rvec_int<2>[2]>
#> [1] 8,9 12,3
The return value from an rvec probability function is an rvec if and only if at least one argument to the function is an rvec – with one exception. The exception is random variate functions, where a value can be supplied for a special argument called n_draw
. When a value for n_draw
is supplied, the return value is an rvec with n_draw
draws,
rnorm_rvec(n = 3, mean = 100, sd = 10, n_draw = 2)
#> <rvec_dbl<2>[3]>
#> [1] 110.4,93.06 106,96.98 115.3,100.7
This is a convenient way to create inputs to a simulation.
Standard R ways of selecting elements from vectors work with rvecs.
<- rvec(list(a = 1:2,
x b = 3:4,
c = 5:6))
1] ## element number
x[#> <rvec_int<2>[1]>
#> a
#> 1,2
c("a", "c")] ## element name
x[#> <rvec_int<2>[2]>
#> a c
#> 1,2 5,6
c(TRUE, FALSE, TRUE)] ## logical flag
x[#> <rvec_int<2>[2]>
#> a c
#> 1,2 5,6
The standard R function ifelse()
does not work at all with rvecs.
The tidyverse function if_else()
works when the true
, false
, or missing
arguments are rvecs,
<- rvec(list(1:2,
x 3:4))
if_else(condition = c(TRUE, FALSE),
true = x,
false = -x)
#> <rvec_int<2>[2]>
#> [1] 1,2 -3,-4
if_else()
does not work, however, when the condition
argument is an rvec.
When the condition
argument is an rvec, we need the rvec function if_else_rvec()
,
if_else_rvec(x <= 2, x, 2)
#> <rvec_dbl<2>[2]>
#> [1] 1,2 2,2
Function if_else_rvec()
can be used to independently transform or recode values across different draws,
<- rvec(list(c(1, 3.3),
x c(NA, -2)))
x#> <rvec_dbl<2>[2]>
#> [1] 1,3.3 NA,-2
<- if_else_rvec(is.na(x), 99, x)
x_recode
x_recode#> <rvec_dbl<2>[2]>
#> [1] 1,3.3 99,-2
The standard R concatenation function c()
works with rvecs,
<- rvec(list(c(0.1, 0.2),
x1 c(0.3, 0.4)))
<- rvec(list(c(0.5, 0.6),
x2 c(0.7, 0.8)))
c(x1, x2)
#> <rvec_dbl<2>[4]>
#> [1] 0.1,0.2 0.3,0.4 0.5,0.6 0.7,0.8
Unfortunately, cbind()
and rbind()
cannot be made to work properly on raw rvecs,
rbind(x1, x2)
#> data
#> x1 numeric,4
#> x2 numeric,4
cbind(x1, x2)
#> x1 x2
#> data numeric,4 numeric,4
though cbind()
does work if the rvecs are contained in data frames
<- data.frame(x1)
df1 <- data.frame(x2)
df2 cbind(df1, df2)
#> x1 x2
#> 1 0.1,0.2 0.5,0.6
#> 2 0.3,0.4 0.7,0.8
Tidyverse equivalents such as dplyr::bind_rows()
, dbplyr::bind_cols()
, vctrs::vec_cbind()
, and vctrs::vec_rbind()
do work with rvecs,
library(vctrs, warn.conflicts = FALSE)
vec_cbind(a = x1, b = x2)
#> a b
#> 1 0.1,0.2 0.5,0.6
#> 2 0.3,0.4 0.7,0.8
Base R function sapply()
does not work with rvecs (unless simplify
is set to FALSE
). rvec supplies a function called map_rvec()
(based on map functions in package purrr) that does the same job:
<- list(a = rvec(list(c(1, 4))),
l b = rvec(list(c(9, 16))))
l#> $a
#> <rvec_dbl<2>[1]>
#> [1] 1,4
#>
#> $b
#> <rvec_dbl<2>[1]>
#> [1] 9,16
map_rvec(l, sqrt)
#> <rvec_dbl<2>[2]>
#> a b
#> 1,2 3,4
Function as.matrix()
returns the data underlying an rvec.
<- matrix(1:6, nr = 2)
m
m#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
<- rvec(m)
x
x#> <rvec_int<3>[2]>
#> [1] 1,3,5 2,4,6
as.matrix(x)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
Function as_list_col()
returns a list of vectors
as_list_col(x)
#> [[1]]
#> [1] 1 3 5
#>
#> [[2]]
#> [1] 2 4 6
Functions such as point_interval in package ggdist accept lists of vector. A good way to access the sophisticated plotting facilities in ggdist, or in packages such as tidybayes and bayesplot, is to use as_list_col()
to convert an rvec to a list column.
Function expand_from_rvec()
is the inverse of function collapse_to_rvec()
, introduced in Section 2.2.
|>
divorce head(2)
#> # A tibble: 2 × 4
#> age sex draw rate
#> <fct> <chr> <int> <dbl>
#> 1 15-19 Female 1 0.0462
#> 2 15-19 Female 2 0.0369
|>
divorce collapse_to_rvec(values = rate) |>
head(2)
#> # A tibble: 2 × 3
#> age sex rate
#> <fct> <chr> <rdbl<1000>>
#> 1 15-19 Female 0.036 (0.019, 0.068)
#> 2 20-24 Female 0.67 (0.58, 0.78)
|>
divorce collapse_to_rvec(values = rate) |>
expand_from_rvec() |>
head(2)
#> # A tibble: 2 × 4
#> age sex draw rate
#> <fct> <chr> <int> <dbl>
#> 1 15-19 Female 1 0.0462
#> 2 15-19 Female 2 0.0369
Most functions in rvec are concerned with deriving random vectors from other random vectors: that is, with the column-wise calculations described in Section 3. But once we have derived the random vectors, we typically want to summarise them, using statistics such as means or quantiles: that is, we want to carry out row-wise calculations.
The functions for carrying out row-wise calculations on rvecs are:
draws_all()
draws_any()
draws_median()
draws_mean()
draws_mode()
draws_ci()
draws_quantile()
draws_fun()
Internally, most of these functions call functions from matrixStats, which means they are fast.
draws_ci()
, which calculates credible intervals, is the draws function that is used most often,
<- divorce |>
divorce_rv collapse_to_rvec(value = rate)
divorce_rv#> # A tibble: 22 × 3
#> age sex rate
#> <fct> <chr> <rdbl<1000>>
#> 1 15-19 Female 0.036 (0.019, 0.068)
#> 2 20-24 Female 0.67 (0.58, 0.78)
#> 3 25-29 Female 3.2 (3, 3.4)
#> 4 30-34 Female 5.8 (5.5, 6.1)
#> 5 35-39 Female 6.5 (6.2, 6.9)
#> 6 40-44 Female 7.1 (6.8, 7.4)
#> 7 45-49 Female 7.2 (6.9, 7.6)
#> 8 50-54 Female 6 (5.8, 6.3)
#> 9 55-59 Female 4.4 (4.2, 4.7)
#> 10 60-64 Female 2.7 (2.5, 3)
#> # ℹ 12 more rows
|>
divorce_rv mutate(draws_ci(rate))
#> # A tibble: 22 × 6
#> age sex rate rate.lower rate.mid rate.upper
#> <fct> <chr> <rdbl<1000>> <dbl> <dbl> <dbl>
#> 1 15-19 Female 0.036 (0.019, 0.068) 0.0193 0.0360 0.0678
#> 2 20-24 Female 0.67 (0.58, 0.78) 0.579 0.673 0.782
#> 3 25-29 Female 3.2 (3, 3.4) 3.02 3.22 3.45
#> 4 30-34 Female 5.8 (5.5, 6.1) 5.54 5.80 6.06
#> 5 35-39 Female 6.5 (6.2, 6.9) 6.21 6.54 6.86
#> 6 40-44 Female 7.1 (6.8, 7.4) 6.85 7.13 7.42
#> 7 45-49 Female 7.2 (6.9, 7.6) 6.87 7.22 7.57
#> 8 50-54 Female 6 (5.8, 6.3) 5.77 6.03 6.33
#> 9 55-59 Female 4.4 (4.2, 4.7) 4.17 4.40 4.66
#> 10 60-64 Female 2.7 (2.5, 3) 2.48 2.72 2.99
#> # ℹ 12 more rows
The first R package to provide a specialized object for handling multiple draws was rv. The specialized object, called an rv, can be manipulated and summarized much like an rvec. However, in software terms, an rv is not strictly a vector (calling is.vector()
on one returns FALSE
) and an rv does not always behave as expected inside a data frame. It is therefore not well suited to tidyverse-style work flows.
R package posterior provides several data structures for handling multiple draws, including one, called a rvar, that is similar to an rvec. An rvar is, however, is not limited to a single dimension, and has special facilities for dealing with multiple chains (as produced by Markov chain Monte Carlo methods.) These features are essential for some analyses, but they can make rvars harder to master, and they are not needed for most tidyverse-style work flows.
Whereas rvecs interpret summary functions such as mean()
and sum()
as operations to be applied independently on each draw, rvars interpret them as operations to be applied across draws. The result is that code written for ordinary R vectors will often work on rvecs, but need modification to work on rvars. The tidyverse function count()
, for instance, works with rvecs but not rvars.