Sebastian Funk on December 23, 2016
RBi is an R interface to LibBi, a library for Bayesian inference with state-space models using high-performance computer hardware.
The package has been tested on macOS/OSX and Linux. It requires a working installation of LibBi, which is most easily installed using the homebrew-science tap: install Homebrew (on OS X) or Linuxbrew (on Linux), then issue the following commands in a command shell (e.g., Terminal or similar):
brew tap homebrew/science
brew install libbi
If you have any trouble installing LibBi you can get help on the LibBi Users mailing list.
The path to the libbi script can be passed as an argument to RBi; otherwise the package tries to find it automatically using the which Linux/Unix command.
If you just want to process the output from LibBi, then you do not need to have LibBi installed.
The RBi package requires R (>= 3.2.0) as well as the packages:
reshape2
ncdf4
data.table
The easiest way to install the latest stable version of RBi is via CRAN. The package name is rbi (all lowercase):
install.packages('rbi')
Alternatively, the current development version can be installed using the devtools package:
# install.packages("devtools")
library('devtools')
install_github("libbi/rbi")
Use
library('rbi')
## Loading required package: data.table
to load the package.
The main computational engine and model grammar behind RBi is provided by LibBi. The LibBi manual is a good place to start for finding out everything there is to know about LibBi models and inference methods.
The RBi package mainly provides two classes: bi_model and libbi. The bi_model class is used to load, view and manipulate LibBi model files. The libbi class is used to run LibBi and perform inference.
The package also provides two methods for interacting with the NetCDF files used by LibBi, bi_read and bi_write. Lastly, it provides a get_traces function to analyse Markov-chain Monte Carlo (MCMC) traces using the coda package.
bi_model class
As an example, we consider a simplified version of the SIR model discussed in Del Moral et al. (2014). This is included with the RBi package and can be loaded with
model_file <- system.file(package="rbi", "SIR.bi") # get full file name from package
SIRmodel <- bi_model(model_file) # load model
The SIRmodel object now contains the model, which can be displayed with
SIRmodel
## bi model: SIR
## =============
## [1] "model SIR {"
## [2] " const h = 7"
## [3] " const N = 1000"
## [4] " const d_infection = 14"
## [5] " noise n_transmission"
## [6] " noise n_recovery"
## [7] " state S, I, R, Z"
## [8] " obs Incidence"
## [9] " param p_rep"
## [10] " param p_R0"
## [11] " sub parameter {"
## [12] " p_rep ~ uniform(0,1)"
## [13] " p_R0 ~ uniform(1,3)"
## [14] " }"
## [15] " sub initial {"
## [16] " S <- N - 1"
## [17] " I <- 1"
## [18] " R <- 0"
## [19] " Z <- 1"
## [20] " }"
## [21] " sub transition {"
## [22] " n_transmission ~ wiener()"
## [23] " n_recovery ~ wiener()"
## [24] " Z <- (t_now % 7 == 0 ? 0 : Z)"
## [25] " inline i_beta = p_R0 / d_infection * exp(n_transmission)"
## [26] " inline i_gamma = 1 / d_infection * exp(n_recovery)"
## [27] " ode (alg='RK4(3)', h=1e-1, atoler=1e-2, rtoler=1e-5) {"
## [28] " dS/dt = - i_beta * S * I / N"
## [29] " dI/dt = i_beta * S * I / N - i_gamma * I"
## [30] " dR/dt = i_gamma * I"
## [31] " dZ/dt = i_beta * S * I / N"
## [32] " }"
## [33] " }"
## [34] " sub observation {"
## [35] " Incidence ~ truncated_gaussian(mean = p_rep * Z, std = sqrt(p_rep * (1 - p_rep) * Z + 1), lower = 0)"
## [36] " }"
## [37] " sub proposal_initial {"
## [38] " S <- N - 1"
## [39] " I <- 1"
## [40] " R <- 0"
## [41] " Z <- 1"
## [42] " }"
## [43] " sub proposal_parameter {"
## [44] " inline _old_mean_ = p_rep"
## [45] " p_rep ~ truncated_gaussian(mean = p_rep, std = 1 * 0.023805117148865, lower = 0, upper = 1)"
## [46] " inline _old_mean_diff_ = p_rep - _old_mean_"
## [47] " p_R0 ~ truncated_gaussian(mean = p_R0 + (5.47399428332657) * _old_mean_diff_, std = 1 * 0.150151309758444, lower = 1, upper = 3)"
## [48] " }"
## [49] "}"
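To get a feel for what the transition block describes, the following base R sketch integrates a deterministic version of the same equations, with both noise terms set to zero. It is only an illustration: it uses a classical fixed-step fourth-order Runge-Kutta scheme rather than LibBi's adaptive RK4(3), the value of p_R0 is a made-up choice inside the uniform(1, 3) prior, and weekly incidence is obtained by differencing a cumulative counter instead of resetting Z.

```r
# Deterministic SIR sketch (noise terms set to zero; not LibBi's own integrator)
N <- 1000
d_infection <- 14
p_R0 <- 2.5                        # hypothetical value within the uniform(1, 3) prior
beta_rate <- p_R0 / d_infection    # i_beta with n_transmission = 0
gamma_rate <- 1 / d_infection      # i_gamma with n_recovery = 0

# right-hand side of the ODE system; C accumulates new infections
sir_rhs <- function(y) {
  infection <- beta_rate * y[["S"]] * y[["I"]] / N
  recovery <- gamma_rate * y[["I"]]
  c(S = -infection, I = infection - recovery, R = recovery, C = infection)
}

# one classical fourth-order Runge-Kutta step of size h
rk4_step <- function(y, h) {
  k1 <- sir_rhs(y)
  k2 <- sir_rhs(y + h / 2 * k1)
  k3 <- sir_rhs(y + h / 2 * k2)
  k4 <- sir_rhs(y + h * k3)
  y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
}

h <- 0.1
steps_per_week <- 70               # 7 days divided by the step size h
y <- c(S = N - 1, I = 1, R = 0, C = 0)
cumulative <- numeric(17)          # cumulative infections at weeks 0 to 16
for (week in 1:16) {
  for (i in seq_len(steps_per_week)) y <- rk4_step(y, h)
  cumulative[week + 1] <- y[["C"]]
}
weekly_incidence <- diff(cumulative)
round(weekly_incidence, 1)
```

Multiplying weekly_incidence by a reporting probability would give the mean of the observation model for that week.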
A part of the model can be shown with, for example,
SIRmodel[35:38]
## [1] " Incidence ~ truncated_gaussian(mean = p_rep * Z, std = sqrt(p_rep * (1 - p_rep) * Z + 1), lower = 0)"
## [2] " }"
## [3] " sub proposal_initial {"
## [4] " S <- N - 1"
or
SIRmodel$get_block("parameter")
## [1] "p_rep ~ uniform(0,1)" "p_R0 ~ uniform(1,3)"
To get a list of states, you can use
SIRmodel$get_vars("state")
## [1] "S" "I" "R" "Z"
Moreover, there are various methods for manipulating a model, such as remove_block, add_block, insert_lines, update_lines, remove_lines, set_name.
The fix method fixes a variable to one value. This can be useful, for example, to run the deterministic equivalent of a stochastic model for testing purposes.
Lastly, write writes a model to a file, and clone creates a new bi_model object which is an exact copy of the original one.
To get documentation for any of these methods, use the links in the documentation for bi_model, or look them up directly using ?bi_model_ followed by the function name, for example ?bi_model_write.
First, let's create a data set from the SIR model.
SIRdata <- bi_generate_dataset(SIRmodel, end_time=16*7, noutputs=16, seed=12345678)
This simulates the model a single time from time 0 until time 16*7 (e.g., 16 weeks with a daily time step), producing 16 outputs (one a week). Note that we have specified a seed to make this document reproducible. Remove the seed argument or specify a different seed to simulate your own data set.
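As a small base R sketch of the arithmetic, assuming the outputs are equally spaced and that the state at time 0 is written in addition to the 16 outputs (which is what the generated data set suggests), the output times can be reconstructed as follows. The variable names are only for illustration.

```r
end_time <- 16 * 7   # 112 days
noutputs <- 16       # one output per week

# 16 equally spaced outputs plus the initial state at time 0
output_times <- seq(0, end_time, length.out = noutputs + 1)
output_times
```

This gives 17 time points spaced 7 days apart, from 0 to 112.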
The bi_generate_dataset function returns a libbi object:
SIRdata
## Wrapper around LibBi
## * client: sample
## * path to working folder: /var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T//RtmpQPeexp
## * path to model file: /private/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T/RtmpQPeexp/SIR_modela100bd251c1.bi
## * path to output_file: /private/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T/RtmpQPeexp/SIR_outputa1001cf55bf7.nc
Note that, if no working_folder is specified, the model and output files will be stored in a temporary folder.
The generated dataset can be viewed and/or stored in a variable using bi_read:
dataset <- bi_read(SIRdata)
The bi_read function takes the name of a NetCDF file or a libbi object (in which case it locates the output file) and stores the contents in a list of data frames or vectors, depending on the dimensionality of the contents:
names(dataset)
## [1] "n_transmission" "n_recovery" "S" "I"
## [5] "R" "Z" "Incidence" "p_rep"
## [9] "p_R0" "clock"
dataset$p_R0
## [1] 2.593518
dataset$Incidence
## time value
## 1 0 1.0627469
## 2 7 0.9579713
## 3 14 2.7424396
## 4 21 5.8324430
## 5 28 19.4283761
## 6 35 26.4151382
## 7 42 51.3084677
## 8 49 65.8505966
## 9 56 15.3140031
## 10 63 14.3198948
## 11 70 9.7333426
## 12 77 3.7584477
## 13 84 0.6501478
## 14 91 2.1190116
## 15 98 2.0573166
## 16 105 0.3234377
## 17 112 0.4878791
We can visualise the generated incidence data with
plot(dataset$Incidence$time, dataset$Incidence$value)
libbi class
The libbi class manages the interaction with LibBi, such as sampling from the prior or posterior distribution.
bi <- libbi(SIRmodel)
This initialises a libbi object called bi with the model created earlier.
Let's sample from the prior of the SIR model:
bi$run(client="sample", target="prior", nsamples=1000, end_time=16*7, noutputs=16)
This step calls LibBi to sample from the prior distribution of the previously specified model, generating 1,000 samples and each time running the model for 16 * 7 = 112 time steps and writing 16 outputs (i.e., every 7 time steps). LibBi parses the model, creates C++ code, compiles it and runs the model. If the model is run again, it should run much more quickly because it will use the already compiled C++ code:
bi$run()
Any call of run preserves options passed to the previous call of run and libbi, unless they are overwritten by arguments passed to run (e.g., passing a new nsamples argument). Let's have a closer look at the bi object:
str(bi)
## Reference class 'libbi' [package "rbi"] with 12 fields
## $ client : chr "sample"
## $ config : chr ""
## $ options :List of 4
## ..$ target : chr "prior"
## ..$ nsamples: num 1000
## ..$ end-time: num 112
## ..$ noutputs: num 16
## $ path_to_libbi : Named chr "/usr/local/bin/libbi"
## ..- attr(*, "names")= chr "libbi"
## $ model :Reference class 'bi_model' [package "rbi"] with 2 fields
## ..$ model: chr [1:49] "model SIR {" "const h = 7" "const N = 1000" "const d_infection = 14" ...
## ..$ name : chr "SIR"
## ..and 34 methods, of which 20 are possibly relevant:
## .. add_block, clean_model, clone, find_block, fix, get_block,
## .. get_lines, get_vars, initialize, insert_lines, obs_to_noise,
## .. propose_prior, remove_block, remove_lines, replace_all, set_name,
## .. show#envRefClass, update_lines, write, write_model_file
## $ model_file_name : chr "/private/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T/RtmpQPeexp/SIR_modela1005d70e7b3.bi"
## $ working_folder : chr "/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T//RtmpQPeexp"
## $ dims : list()
## $ command : chr "cd /var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T//RtmpQPeexp;/usr/local/bin/libbi sample --target prior --nsamples 1000 --en"| __truncated__
## $ output_file_name: chr "/private/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T/RtmpQPeexp/SIR_outputa1001f7bc8b2.nc"
## $ log_file_name : chr "/private/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T/RtmpQPeexp/outputa10077b52e43.txt"
## $ run_flag : logi TRUE
## and 18 methods, of which 4 are possibly relevant:
## clone, initialize, run, show#envRefClass
We can see it contains 12 fields: client is the client we passed (see the LibBi manual for other clients, e.g. filter); config allows an external configuration file to be specified; the options field contains all the options we passed to bi$run:
bi$options
## $target
## [1] "prior"
##
## $nsamples
## [1] 1000
##
## $`end-time`
## [1] 112
##
## $noutputs
## [1] 16
The other fields contain various bits of information about the object, including the model used, the command used to run LibBi (bi$command) and the output file name:
bi$output_file_name
## [1] "/private/var/folders/69/p5_4w84n3yg8hj0xjnjs4mv00000gq/T/RtmpQPeexp/SIR_outputa1001f7bc8b2.nc"
We can get the results of the sampling run using bi_read
prior <- bi_read(bi$output_file_name)
or with the shorthand
prior <- bi_read(bi)
which looks at the output_file_name field to read in the data. Let's look at the returned object
str(prior)
## List of 9
## $ n_transmission:'data.frame': 17000 obs. of 3 variables:
## ..$ np : num [1:17000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:17000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:17000] 0 -0.88 -0.32 0.563 0.318 ...
## $ n_recovery :'data.frame': 17000 obs. of 3 variables:
## ..$ np : num [1:17000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:17000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:17000] 0 -0.388 2.283 0.566 0.408 ...
## $ S :'data.frame': 17000 obs. of 3 variables:
## ..$ np : num [1:17000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:17000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:17000] 999 998 998 997 994 ...
## $ I :'data.frame': 17000 obs. of 3 variables:
## ..$ np : num [1:17000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:17000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:17000] 1 1.252 0.703 1.032 2.605 ...
## $ R :'data.frame': 17000 obs. of 3 variables:
## ..$ np : num [1:17000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:17000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:17000] 0 0.392 1.709 2.1 2.971 ...
## $ Z :'data.frame': 17000 obs. of 3 variables:
## ..$ np : num [1:17000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:17000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:17000] 1 0.644 0.768 0.721 2.443 ...
## $ p_rep :'data.frame': 1000 obs. of 2 variables:
## ..$ np : num [1:1000] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ value: num [1:1000] 0.523 0.28 0.99 0.329 0.893 ...
## $ p_R0 :'data.frame': 1000 obs. of 2 variables:
## ..$ np : num [1:1000] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ value: num [1:1000] 1.13 2.61 1.42 2.23 1.26 ...
## $ clock : num 453941
This is a list of 9 objects, 8 representing each of the (noise/state) variables and parameters in the file, and one number, clock, representing the time spent running the model in microseconds.
We can see that the time-varying variables are represented as data frames with three columns: np (enumerating individual simulations), time and value. Parameters don't vary in time and just have np and value columns.
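Because each time-varying variable comes in this long np/time/value layout, summaries across simulations can be computed with standard data frame tools. The sketch below uses a small made-up data frame (three simulations, three time points) as a stand-in for an element such as prior$I; the numbers are invented for illustration and are not LibBi output.

```r
# toy stand-in for one time-varying element of the list returned by bi_read
toy <- data.frame(
  np = rep(0:2, each = 3),               # simulation index
  time = rep(c(0, 7, 14), times = 3),    # output times
  value = c(1, 1.3, 0.7,                 # simulation 0
            1, 2.1, 4.0,                 # simulation 1
            1, 0.9, 1.5)                 # simulation 2
)

# median across simulations at each time point
medians <- aggregate(value ~ time, data = toy, FUN = median)
medians
```

The same call with a different FUN, for example a quantile function, would give the bounds of an uncertainty band at each time point.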
Let's perform inference using particle marginal Metropolis-Hastings (PMMH). The following command runs 16 particles for each of 10,000 samples, i.e. 16 * 10,000 = 160,000 simulations, and therefore may take a little while to run (if you want to see the samples progress, use verbose=TRUE in the bi$run call).
bi$options$seed <- generate_seed()
bi$run(client="sample", target="posterior", nparticles=16, obs=SIRdata, sample_obs=TRUE, nsamples=10000)
bi$options$seed
## [1] 69888192
This samples from the posterior distribution. Remember that options are preserved from previous runs, so we don't need to specify nsamples, end_time and noutputs again, unless we want to change them. The same holds for seed, which is why we set it to a (pseudo-)random value as returned by the random generator of R using generate_seed. An alternative would have been to set bi$options$seed to NULL and let LibBi set the random seed, but then we could not have inspected it afterwards. The nparticles option specifies the number of particles, and the sample_obs option tells the run command that we also want observation samples from the posterior distribution of trajectories.
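One way to picture how options carry over between runs is as a merge of two named lists, in which newly passed arguments overwrite stored ones and everything else is kept. The base R sketch below mimics this with modifyList; it illustrates the behaviour described here and is not RBi's actual implementation. The obs and sample_obs arguments are left out for brevity.

```r
# options stored after the prior run (as shown by bi$options earlier)
previous <- list(target = "prior", nsamples = 1000, `end-time` = 112, noutputs = 16)

# arguments passed to the posterior run
new_args <- list(target = "posterior", nparticles = 16, nsamples = 10000)

# new arguments overwrite stored ones; untouched options are preserved
merged <- modifyList(previous, new_args)
str(merged)
```

In the merged list, target and nsamples take their new values, while end-time and noutputs are carried over unchanged.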
Init, input and observation files (see the LibBi manual for details) can be specified using the init, input and obs options, respectively. Each can be given as the name of a NetCDF file containing the data (i.e., a character vector of length one), a libbi object (in which case its output file will be used), or a list of data frames or numeric vectors. In the command above, obs is specified as a libbi object, so the Incidence variable of the SIRdata object will be taken as observations.
The time dimension (or column, if a data frame) in the passed init, input and/or obs files can be specified using the time_dim option. If this is not given, it will be assumed to be time, if such a dimension exists, or, if not, any numeric column not called value (or the contents of the value_column option). If this does not produce a unique column name, an error will be thrown. All other dimensions/columns in the passed options will be interpreted as additional dimensions in the data, and stored in the dims field of the libbi object.
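The rule for picking the time column can be sketched in a few lines of base R. The function guess_time_column below is hypothetical and written only to illustrate the rule as described; it is not part of RBi and RBi's own logic may differ in detail.

```r
# hypothetical helper illustrating the time-column rule (not an RBi function)
guess_time_column <- function(df, time_dim = NULL, value_column = "value") {
  if (!is.null(time_dim)) return(time_dim)          # explicit choice wins
  if ("time" %in% names(df)) return("time")         # default name
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  candidates <- setdiff(numeric_cols, value_column) # numeric, but not the value
  if (length(candidates) != 1) {
    stop("could not determine a unique time column; candidates: ",
         paste(candidates, collapse = ", "))
  }
  candidates
}

obs <- data.frame(day = c(0, 7, 14), value = c(1.1, 0.9, 2.7))
guess_time_column(obs)
```

With a second numeric column besides value (say, an age group index), the sketch stops with an error, which corresponds to the case where time_dim needs to be given explicitly.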
Any other options (apart from log_file_name, see the Debugging section) will be passed on to the command libbi; for a complete list, see the LibBi manual. Hyphens can be replaced by underscores so as not to confuse R (see end_time). Any arguments starting with enable/disable can be specified as boolean (e.g., assert=TRUE). Any dry- options can be specified with a "dry" argument, e.g., parse="dry".
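To make these naming conventions concrete, here is a rough base R sketch of how a list of R arguments could be turned into libbi command-line flags. The function to_libbi_flags is hypothetical and is not how RBi builds its command internally; in particular, it assumes that every logical argument is an enable/disable switch, which would not hold for RBi-specific arguments such as sample_obs.

```r
# hypothetical translation of R-style options into libbi-style flags
to_libbi_flags <- function(options) {
  flags <- vapply(names(options), function(name) {
    value <- options[[name]]
    flag <- gsub("_", "-", name)            # end_time becomes end-time
    if (is.logical(value)) {
      # TRUE/FALSE map to --enable-... / --disable-...
      paste0("--", if (value) "enable" else "disable", "-", flag)
    } else if (identical(value, "dry")) {
      paste0("--dry-", flag)                # parse = "dry" becomes --dry-parse
    } else {
      paste0("--", flag, " ", value)
    }
  }, character(1))
  paste(unname(flags), collapse = " ")
}

to_libbi_flags(list(end_time = 112, assert = TRUE, parse = "dry"))
```

The result can be compared with the bi$command field of a libbi object, which holds the command that RBi actually ran.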
Let's get the results of the preceding run command:
bi_contents(bi)
## [1] "time" "n_transmission" "n_recovery" "S"
## [5] "I" "R" "Z" "p_rep"
## [9] "p_R0" "clock" "loglikelihood" "logprior"
## [13] "Incidence"
posterior <- bi_read(bi)
str(posterior)
## List of 12
## $ n_transmission:'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 0 -0.681 -1.568 0.29 2.322 ...
## $ n_recovery :'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 0 0.06 0.507 0.303 1.063 ...
## $ S :'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 999 998 997 990 968 ...
## $ I :'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 1 1.44 1.1 6.16 21.14 ...
## $ R :'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 0 0.772 1.802 4.269 10.995 ...
## $ Z :'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 1 1.214 0.691 7.521 21.71 ...
## $ p_rep :'data.frame': 10000 obs. of 2 variables:
## ..$ np : num [1:10000] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ value: num [1:10000] 0.695 0.68 0.68 0.68 0.68 ...
## $ p_R0 :'data.frame': 10000 obs. of 2 variables:
## ..$ np : num [1:10000] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ value: num [1:10000] 1.46 1.43 1.43 1.43 1.43 ...
## $ clock : num 73731955
## $ loglikelihood :'data.frame': 10000 obs. of 2 variables:
## ..$ np : num [1:10000] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ value: num [1:10000] -62.2 -51.1 -51.1 -51.1 -51.1 ...
## $ logprior :'data.frame': 10000 obs. of 2 variables:
## ..$ np : num [1:10000] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ value: num [1:10000] -0.693 -0.693 -0.693 -0.693 -0.693 ...
## $ Incidence :'data.frame': 170000 obs. of 3 variables:
## ..$ np : num [1:170000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ time : num [1:170000] 0 7 14 21 28 35 42 49 56 63 ...
## ..$ value: num [1:170000] 0 0.671 0.693 4.722 17.702 ...
We can see that this has three more objects than previously when we specified target="prior": loglikelihood (the estimated log-likelihood of the parameters at each MCMC step), logprior (the estimated log-prior density of the parameters at each MCMC step), and Incidence (sampled observations, as we specified sample_obs=TRUE).
To analyse MCMC outputs, we can use the coda package and the get_traces function of RBi. Note that, to get exactly the same traces, you would have to use the seed randomly generated above.
library(coda)
traces <- mcmc(get_traces(bi))
We can study, for example, the acceptance rate
accRate <- 1 - rejectionRate(traces)
accRate
## p_rep p_R0
## 0.2377238 0.2377238
and visualise parameter traces and densities with
plot(traces)
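The acceptance rate computed above with coda can also be understood directly from a trace: a Metropolis-Hastings chain stays at its current value whenever a proposal is rejected, so the acceptance rate is the fraction of steps at which the value changes. The base R sketch below shows this on a short toy chain; its first five values are taken from the p_rep trace printed earlier, the remaining ones are made up.

```r
# toy trace: repeated values correspond to rejected proposals
chain <- c(0.695, 0.68, 0.68, 0.68, 0.68, 0.71, 0.71, 0.66)

moved <- diff(chain) != 0        # TRUE where the chain accepted a move
acceptance_rate <- mean(moved)   # 3 moves out of 7 transitions
acceptance_rate
```

This also suggests why p_rep and p_R0 show identical acceptance rates in the output above: presumably both parameters are proposed and accepted or rejected together, so they always move at the same steps.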
Compare the marginal posterior distributions to the “correct” parameters used to generate the data set:
dataset[bi$model$get_vars("param")]
## $p_rep
## [1] 0.2458042
##
## $p_R0
## [1] 2.593518
For more details on using coda to further analyse the chains, see the website of the coda package. For more plotting functionality, see the plot_libbi function in RBi.helpers.
If libbi throws an error, it is best to investigate with verbose = TRUE and to set working_folder to a folder that one can then use for debugging. Output of the libbi call can be saved in a file using the log_file_name option (by default a temporary file).
RBi.helpers contains higher-level methods to interact with LibBi, including methods for plotting the results of libbi runs and for adapting the proposal distribution and number of particles.