An Explanation of the Underlying Data Structures in rubias
Eric C. Anderson
2026-03-06
Source:vignettes/rubias-underlying-data-structures.Rmd
rubias-underlying-data-structures.RmdIn order to be computationally efficient and allow for multiallelic
markers, with rubias we boil most of the data down to a
bunch of integer vectors in a data structure that we operate on with
some compiled code.
This document is intended to document that data structure (mostly for Eric’s benefit, at this point. We should have had a document like this a long time ago).
The param_list
The basic data structure is what we call a param_list.
it has the following named elements, which are briefly described here.
We will describe each in detail in separate sections below.
L: the number of loci, an integerN: the number of individuals (integer)C: the number of collections in the reference data set (integer)A: the number of alleles at each locus (integer vector of length L)CA: the “cumulative number of alleles” at each locus. For each locus this gives the base-0 index of the first allele at the locus (if you were to line all the alleles at each locus up one after another.)coll: an integer vector of length N that gives the index of the collection that each fish is in.coll_N: an integer vector of length C that gives the number of fish in each of the collectionsRU_vec: This is the hardest one to figure out / remember. Imagine that each collection has an index from 1 up to C, and imagine that each collection belongs to a single reporting unit. Each reporting unit is assigned an integer. Now, sort everything first by reporting unit index and then by collection index. The order that you get is the order of the collections inRU_vec. This vector is a named integer vector. The collections are in the order as described above. The names are the collection names and the values are the base-1 index of each collection.-
RU_starts: The base-0 index of the starting position of each reporting unit in theRU_vecvector. This is a named integer vector. For example, the first few entries of the chinook data set are:ploidies <- check_refmix(chinook, 5) cpar <- tcf2param_list(chinook, 5, summ = FALSE, ploidies = ploidies) cpar$RU_starts[1:5] #> CentralValleyfa CentralValleysp CentralValleywi CaliforniaCoast KlamathR #> 0 8 12 13 15and if we look at the first 15 elements of
RU_vecit gives us the names and the indices of the collections in those first 4 listed reporting units:cpar$RU_vec[1:15] #> Feather_H_sp Feather_H_fa Butte_Cr_fa #> 1 6 7 #> Mill_Cr_fa Deer_Cr_fa Mokelumne_R_fa #> 8 9 10 #> Battle_Cr Sacramento_R_lf Butte_Cr_Sp #> 11 12 2 #> Mill_Cr_sp Deer_Cr_sp UpperSacramento_R_sp #> 3 4 5 #> Sacramento_H Eel_R Russian_R #> 13 14 15 I: an integer vector giving the allelic type of each gene copy carried by each individual. For ploidy = 2 (the only case implemented so far) this vector is of length (N * L * 2). An entry of 0 denotes missing data, and the observed alleles are named 1, 2, …AC: this is a flat integer vector of the counts of alleles of different types in the different populations. It has length C * sum(A) (i.e. the number of collections in the reference times that total number of alleles at all the loci.). This is created by a somewhat lengthy process: first the functionreference_allele_counts()makes a long data frame that hascollection,locus,allele, andcounts. This then gets turned into a list of matrices ina_freq_list(). One matrix for each collection. The rows are the different alleles and the columns are the different populations. Then inlist_diploid_params()that list of matrices gets flattened into one long integer vector.
One of the weaknesses as I see that now, is that the loci are arranged alphabetically, rather than by input order. We should at least include the names of the loci in the order in which they appear so that we can get back to the loci, if necessary. The order of the loci coming out of this process is used to make sure that it corresponds to the order of the loci inI, which is good, but not super intuitive. At any rate, from the foregoing, it can be deduced that we can index into this vector thus (all indexes are base-0): if we want the count of the a-th allele at the l-th locus in the c-th collection then we get that by base-0 subsciptingACby[C * CA[l] + c * A[l] + a]. WhereCis the number of collections,CAis the cumulative number of alleles, andAis the number of alleles at each locus. Now it should be clear why we storeCA`—this is where we use it!sum_AC: the sum of the allele counts at each locus for each collection in the reference data set. (Basically the number of observed gene copies at the locus in the reference data set). This gets computed inlist_diploid_params()from the list of matrices returned bya_freq_list(). It is of length L * C. It is a named vector with the names takingLocus.Collection, but I don’t think those names get used at all. It gets indexed as[l * C + c]DP: this is a vector completely parallel toACbut in which the prior weights have been added to each allele in each collection.sum_DP: this is the sum of Dirichlet ParametersDPfor each locus and each collection. It is parallel tosum_AC.
Finally, we have some entries that we should have had from day one,
but didn’t, so they aren’t consistently used throughout the code to
access the names of entities ordered as they ended up ordered: -
indiv_names - collection_names -
repunit_names - locus_names
How/Where do all these get set?
This is a trickier question than it seems, because things are done slightly differently in the different top-level functions.
assess_reference_loo() and assess_reference_mc()
In both of these functions, the original data sets gets read in,
collection and repunit get converted to factors, and then the
param_list is made inside a single function:
tcf2param_list().
assess_pb_bias_correction()
Same as above, this uses tcf2param_list() after doing a
few other steps on the original data frame.
self_assign()
Uses tcf2param_list() unless it is using
preCompiledParams so that it can run through stuff during infer_mixture
to compute the locus-specific means and variances of the
log-likelihoods.
infer_mixture()
This is the tough one. Because we end up doing multiple mixture
collections, we couldn’t simply use tcf2param_list() in the
function. Rather, we create a summary for the reference sample (keeping
track of alleles found in both the reference and the mixture), and then
we split the mixture samples up by mixture collection and use
Dealing with 012 matrices
One problem with the current approach is that it is terribly slow when you start to get 10K+ SNPs. It would be much faster to read and store those data in an 012 matrix. Here is how I am thinking I could deal with that:
-
For the functions that use
tcf2param_list()I could just write another function,tcf2param_list_012(), that tookDas just a data frame withsample_type,collection, andrepunitandindiv, and then had an 012 matrix with the genetic data in it, with indiv names in the rownames and locus names in the colnames. Then we just have to deal with seeing AC_list and I_list correctly. Actually, looking at it now, I can just do it in the sametcf2param_list()function. If thed012parameter is not NULL we would:- make
cleaned$longNULL, and setcleaned$clean_shorttoD. - make the
AC_listdirectly from the 012 matrix. This should be super straightforward. I would probably want to drop monomorphic loci first. - make the I_list from the 012 matrix
- make
Cool, in order to do all this I should make two new functions:
reference_allele_counts_012 and
allelic_list_012. That might give me enough insight that I
could easily do it for infer_mixture, too.