An Explanation of the Underlying Data Structures in rubias

Eric C. Anderson

2024-01-23

In order to be computationally efficient and allow for multiallelic markers, with rubias we boil most of the data down to a bunch of integer vectors in a data structure that we operate on with some compiled code.

This document is intended to document that data structure (mostly for Eric’s benefit at this point; we should have had a document like this a long time ago).
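To make the idea of storing genotypes in flat integer vectors a bit more concrete, here is a minimal Python sketch. It is not rubias code (rubias is R plus compiled code), and the layout it uses, ordered by individual, then locus, then gene copy, is an assumption chosen for illustration. The names `flat_index` and `PLOIDY` are hypothetical, as is the use of -1 for missing data.

```python
# Illustrative sketch only: the layout below is an assumption, not rubias's actual one.
PLOIDY = 2


def flat_index(ind, locus, copy, n_loci):
    """Position of one gene copy in a flat vector ordered by individual, locus, copy."""
    return (ind * n_loci + locus) * PLOIDY + copy


# Two individuals, two loci; alleles already coded as integers (-1 = missing).
genotypes = [
    [(0, 1), (2, 2)],    # individual 0: locus 0, locus 1
    [(1, 1), (-1, -1)],  # individual 1: locus 1 is missing
]
n_loci = 2

# Flatten the nested structure into a single integer vector.
flat = [a for ind in genotypes for geno in ind for a in geno]
print(flat)  # [0, 1, 2, 2, 1, 1, -1, -1]
print(flat[flat_index(0, 1, 0, n_loci)])  # 2
print(flat[flat_index(1, 0, 1, n_loci)])  # 1
```

Because alleles are just integer codes, a locus can carry any number of alleles without changing the layout, and compiled code can walk the vector using index arithmetic alone.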

The param_list

The basic data structure is what we call a param_list. It has the following named elements, which are briefly described here. We will describe each in detail in separate sections below.

Finally, we have some entries that we should have had from day one, but didn’t, so they aren’t used consistently throughout the code. They give access to the names of entities, in the order in which those entities ended up being ordered:

- indiv_names
- collection_names
- repunit_names
- locus_names
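As a rough illustration of why such name vectors are handy, the Python sketch below translates integer indices back into readable names. Everything in it is hypothetical: the collection and reporting-unit names are toy values, and `repunit_of_collection` and `describe_collection` are names invented for this example rather than elements of the param_list.

```python
# Illustrative sketch only: toy values, not output from rubias.
collection_names = ["Feather_R", "Butte_Cr", "Klamath_R"]  # hypothetical, in internal order
repunit_names = ["CentralValley", "Klamath"]               # hypothetical, in internal order

# Integer repunit index of each collection (0-based here; R would be 1-based).
repunit_of_collection = [0, 0, 1]


def describe_collection(c):
    """Translate an integer collection index back into readable names."""
    return collection_names[c], repunit_names[repunit_of_collection[c]]


print(describe_collection(2))  # ('Klamath_R', 'Klamath')
```

The point is that once everything is stored as integers, the name vectors are the only reliable record of which index corresponds to which entity.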

How/Where do all these get set?

This is a trickier question than it seems, because things are done slightly differently in the different top-level functions.

assess_reference_loo() and assess_reference_mc()

In both of these functions, the original data set gets read in, collection and repunit get converted to factors, and then the param_list is made inside a single function: tcf2param_list().
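Converting collection and repunit to factors amounts to fixing an ordered set of levels and replacing each label with its position in that set. The Python sketch below mimics that idea. The helper `as_factor` is hypothetical, and taking levels in order of first appearance is an assumption made for this example (R's `factor()` sorts the levels unless they are supplied explicitly).

```python
# Illustrative sketch only: mimics the idea of an R factor, not rubias itself.
def as_factor(labels, levels=None):
    """Return (codes, levels): integer codes for labels, like an R factor.

    If `levels` is None, levels are taken in order of first appearance
    (an assumption made for this sketch).
    """
    if levels is None:
        levels = list(dict.fromkeys(labels))  # unique values, order preserved
    position = {lev: i for i, lev in enumerate(levels)}
    codes = [position[x] for x in labels]
    return codes, levels


collection = ["Butte_Cr", "Butte_Cr", "Feather_R", "Klamath_R", "Feather_R"]
codes, levels = as_factor(collection)
print(levels)  # ['Butte_Cr', 'Feather_R', 'Klamath_R']
print(codes)   # [0, 0, 1, 2, 1]
```

Passing `levels` explicitly pins the ordering, which matters whenever integer codes from two data sets have to agree.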

assess_pb_bias_correction()

Same as above, this uses tcf2param_list() after doing a few other steps on the original data frame.

self_assign()

Uses tcf2param_list() unless it is given preCompiledParams. That option exists so that self_assign() can be run on already-compiled parameters during infer_mixture() to compute the locus-specific means and variances of the log-likelihoods.

infer_mixture()

This is the tough one. Because we end up doing multiple mixture collections, we couldn’t simply use tcf2param_list() in the function. Rather, we create a summary for the reference sample (keeping track of alleles found in both the reference and the mixture), and then we split the mixture samples up by mixture collection and use

Dealing with 012 matrices

One problem with the current approach is that it is terribly slow when you start to get 10K+ SNPs. It would be much faster to read and store those data in an 012 matrix. Here is how I am thinking I could deal with that:

Cool, in order to do all this I should make two new functions: reference_allele_counts_012 and allelic_list_012. That might give me enough insight that I could easily do it for infer_mixture, too.
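For what it is worth, the counting that an 012 matrix makes cheap can be sketched in a few lines of Python. This is not the planned reference_allele_counts_012 function, only the underlying idea, under stated assumptions: diploid individuals, biallelic SNPs, entries giving the number of copies of the alternate allele, and -1 marking a missing genotype. The function name `allele_counts_from_012` is hypothetical.

```python
# Illustrative sketch only: not the planned rubias function, just the counting idea.
MISSING = -1  # assumed code for a missing genotype in the 012 matrix


def allele_counts_from_012(matrix, collections):
    """Count alleles per collection and locus from an 012 matrix.

    matrix[i][l] is 0, 1, or 2 copies of the alternate allele for individual i
    at locus l, or MISSING. collections[i] is the collection label of individual i.
    Returns {collection: [(n_ref, n_alt), ...]} with one pair per locus.
    """
    n_loci = len(matrix[0]) if matrix else 0
    counts = {}
    for row, coll in zip(matrix, collections):
        per_locus = counts.setdefault(coll, [[0, 0] for _ in range(n_loci)])
        for l, g in enumerate(row):
            if g == MISSING:
                continue  # missing genotypes contribute no gene copies
            per_locus[l][0] += 2 - g  # copies of the reference allele
            per_locus[l][1] += g      # copies of the alternate allele
    return {c: [tuple(p) for p in v] for c, v in counts.items()}


matrix = [
    [0, 1, 2],
    [1, MISSING, 2],
    [2, 2, 0],
]
collections = ["A", "A", "B"]
print(allele_counts_from_012(matrix, collections))
# {'A': [(3, 1), (1, 1), (0, 4)], 'B': [(0, 2), (0, 2), (2, 0)]}
```

Each genotype is a single small integer, so there is no per-allele string handling, which is where one would expect the speedup with 10K+ SNPs to come from.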