Partition a reference dataset and estimate reporting group and collection proportions

From a reference dataset, this draws (without replacement) a simulated mixture dataset with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.

Usage

assess_reference_mc(
  reference,
  gen_start_col,
  reps = 50,
  mixsize = 100,
  seed = 5,
  alpha_repunit = 1.5,
  alpha_collection = 1.5,
  min_remaining = 5,
  alle_freq_prior = list(const_scaled = 1)
)

Arguments

reference: a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries.
gen_start_col: the first column of genetic data in reference
reps: number of reps to do
mixsize: the number of individuals in each simulated mixture.
seed: a random seed for simulations
alpha_repunit: The dirichlet parameter for simulating the proportions of reporting units. Default = 1.5
alpha_collection: The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5
min_remaining: the minimum number of individuals which should be conserved in each reference collection during sampling without replacement to form the simulated mixture
alle_freq_prior: a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

Details

This method is referred to as "Monte Carlo cross-validation". The input parameters for assess_reference_mc are more restrictive than those of assess_reference_loo. Rather than allowing a data.frame to specify Dirichlet parameters, proportions, or counts for specific reporting units and collections, assess_reference_mc only allows vector input (default = 1.5) for alpha_repunit and alpha_collection. These inputs specify the uniform Dirichlet parameters for all reporting units and collections, respectively.

For mixture proportion generation, the rho values are first drawn using a stick-breaking model of the Dirichlet distribution, but with proportions capped by min_remaining. Stick-breaking is then used to subdivide each reporting unit into collections. In addition to the constraint that mixture sampling without replacement cannot deplete the number of individuals in each collection below min_remaining, a similar constraint is placed upon the number of individuals left in reporting units, determined as min_remaining * (# collections in reporting unit).

Note that this implies that the data are only truly Dirichlet distributed when no rejections based on min_remaining occur. This is a reasonable certainty with sufficient reference collection sizes relative to the desired mixture size.

Examples

# only 5 reps, so it doesn't take too long.  Typically you would
# do many more
ale_dev <- assess_reference_mc(alewife, 17, 5)