Estimate mixing proportions and origin probabilities from one or several mixtures

Takes a mixture and reference dataframe of two-column genetic data, and a desired method of estimation for the population mixture proportions (MCMC, PB, BR). Returns the output of the chosen estimation method

Usage

infer_mixture(
  reference,
  mixture,
  gen_start_col,
  method = "MCMC",
  alle_freq_prior = list(const_scaled = 1),
  pi_prior = NULL,
  pi_init = NULL,
  reps = 2000,
  burn_in = 100,
  pb_iter = 100,
  prelim_reps = NULL,
  prelim_burn_in = NULL,
  sample_int_Pi = 1,
  sample_theta = TRUE,
  pi_prior_sum = 1,
  total_catch_tib = NULL,
  variable_prob_is_catch = FALSE
)

Arguments

reference: a dataframe of two-column genetic format data, proceeded by "repunit", "collection", and "indiv" columns. Does not need "sample_type" column, and will be overwritten if provided
mixture: a dataframe of two-column genetic format data. Must have the same structure as reference dataframe, but "collection" and "repunit" columns are ignored. Does not need "sample_type" column, and will be overwritten if provided
gen_start_col: the first column of genetic data in both data frames
method: a choice between "MCMC", "PB", "BR" methods for estimating mixture proportions
alle_freq_prior: a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.
pi_prior: The prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of new pi vectors in MCMC. Default value of NA leads to the calculation of a symmetrical prior based on pi_prior_sum. To provide other values to certain collections, you can pass in a data frame with two columns, "collection" listing the relevant collection, and "pi_param" listing the desired prior for that collection. Specific priors may be listed for as few as one collection. The special collection name "DEFAULT_PI" is used to set the prior for all collections not explicitly listed; if no "DEFAULT_PI" is given, it is taken to be 1/(# collections).
pi_init: The initial value to use for the mixing proportion of collections. This lets the user start the chain from a specific value of the mixing proportion vector. If pi_init is NULL (the default) then the mixing proportions are all initialized to be equal. Otherwise, you pass in a data frame with one column named "collection" and the other named "pi_init". Every value in the pi_init column must be strictly positive (> 0), and a value must be given for every collection. If they sum to more than one the values will be normalized to sum to one.
reps: the number of iterations to be performed in MCMC
burn_in: how many reps to discard in the beginning of MCMC when doing the mean calculation. They will still be returned in the traces if desired.
pb_iter: how many bootstrapped data sets to do for bootstrap correction using method PB. Default is 100.
prelim_reps: for method "BR", the number of reps of conditional MCMC (as in method "MCMC") to perform prior to MCMC with baseline resampling. The posterior mean of mixing proportions from this conditional MCMC is then used as pi_init in the baseline resampling MCMC.
prelim_burn_in: for method "BR", this sets the number of sweeps out of prelim_reps that should be discarded as burn in when preparing the posterior means of the mixing proportions to be set as pi_init in the baseline resampling MCMC.
sample_int_Pi: how many iterations between storing the mixing proportions trace. Default is 1. Can't be 0. Can't be so large that fewer than 10 samples are taken from the burn in and the sweeps.
sample_theta: for method "BR", whether or not the function should store the posterior mean of the updated allele frequences. Default is TRUE
pi_prior_sum: For pi_prior = NA, the prior on the mixing proportions is set as a Dirichlet vector of length C, with each element being W/C, where W is the pi_prior_sum and C is the number of collections. By default this is 1. If it is made much smaller than 1, things could start to mix more poorly.
total_catch_tib: A tibble with two columns: `collection` and `tot_catch`. The entries to `collection` must correspond to different fisheries or mixture collections as listed in the `collections` column in `mixture`. The `tot_catch` must be a list column which contains an estimate of the total catch from the fishery. That could be a single number, or it could be a sample from a posterior distribution of the total catch. Providing a value for this parameter will make the function return a sample from the posterior for stock-specific total catch. Note that if you use this option then every single mixture collection in `mixture` must be represented in `total_catch_tib`.
variable_prob_is_catch: a logical indicating whether to use the `catch_is_prob` column to simulate for each fish whether or not it should be considered part of the catch.

Value

Tidy data frames in a list with the following components: mixing_proportions: the estimated mixing proportions of the different collections. indiv_posteriors: the posterior probs of fish being from each of the collections. mix_prop_traces: the traces of the mixing proportions. Useful for computing credible intervals. bootstrapped_proportions: If using method "PB" this returns the bootstrap corrected reporting unit proportions. If `total_catch_tib` is not NULL, then you will also found three other components in the output list: `stock_specific_total_catch_traces`, `posterior_predictive_remaining_catch_traces` and `allocation_count_traces`. These give samples from the posterior distribution of total catch. Note that `stock_specific_total_catch_traces` is the sum of `posterior_predictive_remaining_catch_traces` and `allocation_count_traces`. All three are returned because it can be instructive to be able to decompose the `stock_specific_total_catch_traces` into its two constituents.

Details

"MCMC" estimates mixing proportions and individual posterior probabilities of assignment through Markov-chain Monte Carlo conditional on the reference allele frequencies, while "PB" does the same with a parametric bootstrapping correction, and "BR" runs MCMC sweeps while simulating reference allele frequencies using the genotypes of mixture individuals and allocations from the previous sweep. All methods default to a uniform 1/(# collections or RUs) prior for the mixing proportions.

Examples

mcmc <- infer_mixture(reference = small_chinook_ref,
                      mixture = small_chinook_mix,
                      gen_start_col = 5,
                      method = "MCMC",
                      reps  = 200)
#> Collating data; compiling reference allele frequencies, etc.
#>    time: 0.10 seconds
#> Computing reference locus specific means and variances for computing mixture z-scores
#>    time: 0.01 seconds
#> Working on mixture collection: rec3 with 29 individuals
#>   calculating log-likelihoods of the mixture individuals.
#>    time: 0.00 seconds
#>   performing 200 total sweeps, 100 of which are burn-in and will not be used in computing averages in method "MCMC"
#>    time: 0.00 seconds
#>   tidying output into a tibble.
#>    time: 0.01 seconds
#> Working on mixture collection: rec1 with 36 individuals
#>   calculating log-likelihoods of the mixture individuals.
#>    time: 0.00 seconds
#>   performing 200 total sweeps, 100 of which are burn-in and will not be used in computing averages in method "MCMC"
#>    time: 0.00 seconds
#>   tidying output into a tibble.
#>    time: 0.01 seconds
#> Working on mixture collection: rec2 with 35 individuals
#>   calculating log-likelihoods of the mixture individuals.
#>    time: 0.00 seconds
#>   performing 200 total sweeps, 100 of which are burn-in and will not be used in computing averages in method "MCMC"
#>    time: 0.00 seconds
#>   tidying output into a tibble.
#>    time: 0.01 seconds