Partition a reference dataset and estimate reporting group and collection proportions
Source:R/assess_reference_mc.R
assess_reference_mc.RdFrom a reference dataset, this draws (without replacement) a simulated mixture dataset with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.
Usage
assess_reference_mc(
reference,
gen_start_col,
reps = 50,
mixsize = 100,
seed = 5,
alpha_repunit = 1.5,
alpha_collection = 1.5,
min_remaining = 5,
alle_freq_prior = list(const_scaled = 1)
)Arguments
- reference
a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries.
- gen_start_col
the first column of genetic data in reference
- reps
number of reps to do
- mixsize
the number of individuals in each simulated mixture.
- seed
a random seed for simulations
- alpha_repunit
The dirichlet parameter for simulating the proportions of reporting units. Default = 1.5
- alpha_collection
The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5
- min_remaining
the minimum number of individuals which should be conserved in each reference collection during sampling without replacement to form the simulated mixture
- alle_freq_prior
a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
"const","scaled_const", and"empirical". See?list_diploid_paramsfor method details.
Details
This method is referred to as "Monte Carlo cross-validation".
The input parameters for assess_reference_mc are more restrictive than those of
assess_reference_loo. Rather than allowing a data.frame to specify Dirichlet
parameters, proportions, or counts for specific reporting units and collections,
assess_reference_mc only allows vector input (default = 1.5) for alpha_repunit
and alpha_collection. These inputs specify the uniform Dirichlet parameters for
all reporting units and collections, respectively.
For mixture proportion generation, the rho values are first drawn using a stick-breaking
model of the Dirichlet distribution, but with proportions capped by min_remaining.
Stick-breaking is then used to subdivide each reporting unit into collections. In
addition to the constraint that mixture sampling without replacement cannot deplete
the number of individuals in each collection below min_remaining, a similar
constraint is placed upon the number of individuals left in reporting units,
determined as min_remaining * (# collections in reporting unit).
Note that this implies that the data are only truly Dirichlet distributed when no
rejections based on min_remaining occur. This is a reasonable certainty with
sufficient reference collection sizes relative to the desired mixture size.