Collect essential data values before mixture proportion estimation

Takes all relevant information created in the previous steps of the data conversion pipeline and combines it into a single list, which serves as input for further calculations.

Usage

list_diploid_params(
  AC_list,
  I_list,
  PO,
  coll_N,
  RU_vec,
  RU_starts,
  alle_freq_prior = list(const_scaled = 1)
)

Arguments

AC_list: a list of allele count matrices; output from a_freq_list
I_list: a list of genotype vectors; output from allelic_list
PO: a vector of collection (population of origin) indices for every individual in the sample, in order identical to I_list
coll_N: a vector of the total number of individuals in each collection, in order of appearance in the dataset
RU_vec: a vector of collection indices, sorted by reporting unit
RU_starts: a vector of indices, designating the first collection for each reporting unit in RU_vec
alle_freq_prior: a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. The name of the list item determines the type of prior used, with options "const", "const_scaled" (the default shown under Usage), and "empirical". If "const", the listed number is added as a constant to the count for each allele, locus, and collection. If "const_scaled", the listed number is divided by the number of alleles at a locus, then added to the allele counts. If "empirical", the listed number is multiplied by the relative frequency of each allele across all populations, then added to the allele counts.
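
As a rough illustration of the three prior types described for alle_freq_prior, the R sketch below applies each rule to a toy allele count matrix for a single locus. The helper add_prior, the toy data, and the alleles-by-collections orientation are assumptions made for this example, not the package's internal code; the middle option is spelled const_scaled, as in the default shown under Usage.

```r
# Toy allele counts for one locus: rows are alleles, columns are collections.
# (This orientation is an assumption made for the illustration.)
ac <- matrix(c(4, 0, 6,
               1, 5, 4),
             nrow = 3,
             dimnames = list(paste0("allele_", 1:3), c("coll_1", "coll_2")))

# Hypothetical helper, not part of the package: adds one prior type to the counts.
add_prior <- function(ac, prior) {
  type <- names(prior)
  value <- prior[[1]]
  switch(type,
    # the same constant for every allele and collection
    const = ac + value,
    # the constant divided by the number of alleles at the locus
    const_scaled = ac + value / nrow(ac),
    # the constant times each allele's relative frequency across all collections
    empirical = ac + value * rowSums(ac) / sum(ac),
    stop("unknown prior type: ", type)
  )
}

dp_const     <- add_prior(ac, list(const = 1))
dp_scaled    <- add_prior(ac, list(const_scaled = 1))
dp_empirical <- add_prior(ac, list(empirical = 1))
```

With three alleles at the locus, const_scaled = 1 adds 1/3 to every count, whereas empirical = 1 adds more weight to alleles that are common across the whole toy data set.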

Value

list_diploid_params returns a list of the information necessary for the calculation of genotype likelihoods in MCMC:

L, N, and C represent the number of loci, individual genotypes, and collections, respectively. A is a vector of the number of alleles at each locus, and CA is the cumulative sum of A. coll (the collection indices supplied as PO), coll_N, RU_vec, and RU_starts are copied directly from input.

I, AC, sum_AC, DP, and sum_DP are vectorized versions of data previously represented as lists and matrices; indexing macros use L, N, C, A, and CA to access these vectors in later Rcpp-based calculations.
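
To make the idea of indexing a flattened vector concrete, the R sketch below mimics that kind of index arithmetic for an allele count vector ordered by locus, collection, and allele (the ordering given under Details). The lookup function ac_index, the 1-based convention, and the toy matrices are assumptions made for this example; the package's own macros operate in Rcpp and may differ in detail.

```r
# Toy allele count matrices, one per locus. Rows are alleles and columns are
# collections; this orientation is an assumption made for the illustration.
AC_toy <- list(
  locus_1 = matrix(c(3, 1, 0, 5), nrow = 2),       # 2 alleles x 2 collections
  locus_2 = matrix(c(1, 1, 2, 0, 3, 3), nrow = 3)  # 3 alleles x 2 collections
)

C  <- ncol(AC_toy[[1]])                        # number of collections
A  <- unname(vapply(AC_toy, nrow, integer(1))) # alleles per locus
CA <- cumsum(A)                                # cumulative sum of A

# Flatten so that alleles vary fastest, then collections, then loci.
AC_vec <- unlist(lapply(AC_toy, as.vector), use.names = FALSE)

# Hypothetical 1-based lookup: position of allele a, collection coll, locus l.
ac_index <- function(l, coll, a) {
  alleles_before <- if (l == 1) 0 else CA[l - 1]  # alleles in earlier loci
  C * alleles_before + (coll - 1) * A[l] + a
}

# The flattened vector and the original matrix agree at the same coordinates.
AC_vec[ac_index(2, 2, 3)] == AC_toy$locus_2[3, 2]
```

Only the per-locus allele counts and the number of collections are needed to recover any cell, which is why a single long vector plus a few small size vectors can stand in for the original list of matrices.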

Details

Genotypes represented in I_list are converted into a single long vector, ordered by locus, individual, and gene copy, with NA values represented as 0s. Similarly, AC_list is unlisted to AC, ordered by locus, collection, and allele. DP contains the Dirichlet prior parameters for likelihood calculations, created by adding the values calculated from alle_freq_prior to each allele count. sum_AC and sum_DP are the summed allele values for each locus of their parent vectors, ordered by locus and collection.
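
The flattening described above can be sketched in R on toy data. The structure assumed here for the genotype list (one element per locus, each holding integer vectors a and b for the two gene copies) and the helper flatten_locus are illustrative assumptions, not the package's internal code.

```r
# Toy genotype list: two loci, three individuals, gene copies a and b.
# Individual 3 is missing at locus 1.
I_toy <- list(
  locus_1 = list(a = c(1L, 2L, NA), b = c(2L, 2L, NA)),
  locus_2 = list(a = c(1L, 1L, 3L), b = c(3L, 1L, 2L))
)

# Hypothetical helper: interleave the two gene copies per individual
# and recode missing values as 0.
flatten_locus <- function(g) {
  v <- as.vector(rbind(g$a, g$b))  # a1, b1, a2, b2, ...
  v[is.na(v)] <- 0L
  v
}

# Ordered by locus, then individual, then gene copy.
I_vec <- unlist(lapply(I_toy, flatten_locus), use.names = FALSE)

# Per-locus, per-collection totals of a toy allele count list
# (rows are alleles, columns are collections; an assumed orientation).
AC_toy <- list(
  locus_1 = matrix(c(3, 1, 0, 5), nrow = 2),
  locus_2 = matrix(c(1, 1, 2, 0, 3, 3), nrow = 3)
)
sum_AC_toy <- unlist(lapply(AC_toy, colSums), use.names = FALSE)
```

In this layout, with N individuals, gene copy g of individual i at locus l sits at position (l - 1) * 2 * N + (i - 1) * 2 + g when counting from 1.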

Examples

example(allelic_list)
#> 
#> alllc_> example(a_freq_list)
#> 
#> a_frq_>  # Generate a list of individual genotypes by allele from
#> a_frq_>  # the alewife data's reference allele count tables
#> a_frq_>  example(reference_allele_counts)
#> 
#> rfrn__> ## count alleles in alewife reference populations
#> rfrn__> example(tcf2long)  # gets variable ale_long
#> 
#> tcf2ln> ## Convert the alewife dataset for further processing
#> tcf2ln> # the data frame passed into this function must have had
#> tcf2ln> # character collections and repunits converted to factors
#> tcf2ln> reference <- alewife
#> 
#> tcf2ln> reference$repunit <- factor(reference$repunit, levels = unique(reference$repunit))
#> 
#> tcf2ln> reference$collection <- factor(reference$collection, levels = unique(reference$collection))
#> 
#> tcf2ln> ale_long <- tcf2long(reference, 17)
#> 
#> rfrn__> ale_rac <- reference_allele_counts(ale_long$long)
#> 
#> a_frq_>  ale_ac <- a_freq_list(ale_rac)
#> 
#> alllc_> ale_cs <- ale_long$clean_short
#> 
#> alllc_> # Get the vectors of gene copies a and b for all loci in integer index form
#> alllc_> ale_alle_list <- allelic_list(ale_cs, ale_ac)$int
PO <- as.integer(factor(ale_long$clean_short$collection))
coll_N <- as.vector(table(PO))

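# List each (repunit, collection) pair that occurs in the data
Colls_by_RU <- dplyr::count(ale_long$clean_short, repunit, collection) %>%
  dplyr::filter(n > 0) %>%
  dplyr::select(-n)

# Count the collections in each reporting unit
PC <- rep(0, length(unique(Colls_by_RU$repunit)))
for (i in seq_len(nrow(Colls_by_RU))) {
  PC[Colls_by_RU$repunit[i]] <- PC[Colls_by_RU$repunit[i]] + 1
}

# Offset (starting at 0) of each reporting unit's first collection within RU_vec
RU_starts <- c(0, cumsum(PC))
RU_vec <- as.integer(Colls_by_RU$collection)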
param_list <- list_diploid_params(ale_ac, ale_alle_list, PO, coll_N, RU_vec, RU_starts)
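
The relationship between RU_starts and RU_vec built in the example above can be seen more easily on toy values. The sketch below assumes the same construction, c(0, cumsum(PC)), so that consecutive entries of RU_starts bracket each reporting unit's block in RU_vec; the helper colls_in_ru and the toy vectors are hypothetical and exist only for illustration.

```r
# Toy inputs: three reporting units holding 2, 1, and 3 collections.
PC_toy        <- c(2, 1, 3)
RU_starts_toy <- c(0, cumsum(PC_toy))        # offsets counted from 0
RU_vec_toy    <- c(4L, 1L, 6L, 2L, 3L, 5L)   # collection indices, grouped by reporting unit

# Hypothetical helper: collections that belong to reporting unit r (1-based).
colls_in_ru <- function(r) {
  first <- RU_starts_toy[r] + 1  # shift the 0-based offset to R's 1-based indexing
  last  <- RU_starts_toy[r + 1]
  RU_vec_toy[seq(first, last)]
}

colls_in_ru(1)
colls_in_ru(2)
colls_in_ru(3)
```

Because the offsets start at 0, the extra final entry of RU_starts_toy equals the total number of collections, which marks the end of the last reporting unit's block.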