This notebook documents the development of the strategy for putting together all the previous functions so that I can start from a data set of individual genotypes and physical positions and then I can do:
First we ensure that we load each of the functions:
library(tidyverse)
library(stringr)
library(SalarHybPower)
We also want to have some example data. We will just use the Nova Scotia data. But we will call this dat:
dat <- readRDS("intermediates/01/tidy-west.rds") %>%
mutate(pop = str_sub(id, 1, 3),
group = ifelse(pop == "AQU" | pop == "WLN", "farmed", "wild"))
We are going to have a single function that splits the same into training and test and also ranks the loci. It will return a list with components split_dat
and ranked_markers
.
#' split data set into training and test and rank markers based on training set
#'
#'
#' We are going to have a single function that splits the same into training and test
#' and also ranks the loci.
#' @param dat a data frame that has the columns: id, variant, gene_copy, allele,
#' chrom, coord, pop, group.
#' @param train_fract fraction of individuals from each group (and pop, too) to
#' assign to the training set.
#' @param min_dist the number of base pairs below which you will not dplyr::select a marker if
#' another one within min_dist on the same chromosome has already been dplyr::selected.
#' @return a list with components `split_dat` and `ranked_markers`.
#' @export
split_and_rank <- function(dat, train_fract = 0.5, min_dist = 6e04) {
A <- assign_train_test(dat, train_fract = train_fract)
R <- compute_epp(A)
M <- marker_selection(R, min_dist = min_dist)
list(split_dat = A, ranked_markers = M)
}
So, it works like this:
SAR <- split_and_rank(dat)
Joining, by = c("id", "pop", "group")
We will do this so that all the training individuals will end up with z0s
or z1s
in the data set, and then we will make as many of the others as we can. Since we are going to be just computing the scaled likelihoods for all these guys, it should be fine to just create everyone to be of one type (i.e. Pure, F1, F2, Bx, etc.) And I need to do something about whether I want to only have matings occur between individuals from the same population. (for F2s and BXs)
It looks like this:
#' Create a simulated data set for newhybrids using dplyr::selected/ranked markers
#'
#' We do this so that all the training individuals will end up with `z0s` or `z1s` in the data
#' set, and then we make as many of the others as we can. Since we are going to be just computing
#' the scaled likelihoods for all these guys, it is fine to just create everyone to be of one type
#' (i.e. Pure, F1, F2, Bx, etc.) This let's you make sure that backcrosses are all within the same
#' population.
#' @param SAR the output of split_and_rank()
#' @param wild_pop a vector of the names of the populations of wild fish to simulate from. It treats them
#' as all being from the same population if they are included together. This is not a
#' problem for F1's, but it is likely unrealistic for F2s and backcrosses. Unless the
#' F1 hybrids have no site fidelity (i.e. they stray freely...)
#' @param hyb_cat the hybrid category desired
#' @param L the number of loci to choose
#' @param dir the path of the directory to create to put the result into. It will
#' write the result into a file called \code{file_name}.
#' @param file_name the name of the output file. Defaults to "nh_data.txt".
#' @export
create_hybrid_dataset <- function(SAR, wild_pop, hyb_cat, L, dir = "hyb_dir", file_name = "nh_data.txt") {
# Get the L markers. store their names in a vector called V (for variants)
V <- SAR$ranked_markers %>%
dplyr::filter(selectable == TRUE & cumsum(selectable) <= L) %>%
.$variant
# break the data into training and test individuals
Train <- SAR$split_dat %>%
dplyr::filter(test_or_train == "train")
Test <- SAR$split_dat %>%
dplyr::filter(test_or_train == "test")
# make the rows that the Train indivs will be in the newhybrids data set.
# note that "farmed" will always be species 0.
trainNH <- list(
Train %>%
dplyr::filter(group == "farmed") %>%
nh_tablify(., V, opt_str = "z0s", id_prepend = "train_"),
Train %>%
dplyr::filter(group == "wild") %>%
nh_tablify(., V, opt_str = "z1s", id_prepend = "train_")
) %>%
dplyr::bind_rows()
# now make the data frames that represent the wild and farmed founders and
# scramble their haplotypes up
Wf <- Test %>%
dplyr::filter(group == "wild") %>%
dplyr::filter(pop %in% wild_pop) %>%
scramble_founder_haplotypes(., V)
Ff <- Test %>%
dplyr::filter(group == "farmed") %>%
scramble_founder_haplotypes(., V)
# now, simulate whatever type of hybrid was requested
Hybs <- switch(hyb_cat,
PureW = make_pures(Wf),
PureF = make_pures(Ff),
F1 = make_f1s(Wf, Ff),
F2 = make_f2s(Wf, Ff),
BX = make_bxs(Wf, Ff),
stop("unknown hyb_cat: ", hyb_cat)
)
# and prepare each of those simulated individuals as a row in new-hybs file
hybsNH <- Hybs %>%
dplyr::rename(`1` = hap1, `2` = hap2) %>%
tidyr::gather(data = ., key = "gene_copy", value = "allele", `1`, `2`) %>%
dplyr::mutate(gene_copy = as.integer(gene_copy)) %>%
nh_tablify(., V, opt_str = "", id_prepend = "")
# now, make the directory to write the result into
dir.create(dir)
dplyr::bind_rows(trainNH, hybsNH) %>%
write_nh(., path = file.path(dir, file_name))
}
See that that looks like:
AllPops <- c("BDN", "CNR", "GRR", "LHR", "LPR")
create_hybrid_dataset(SAR, wild_pop = AllPops, hyb_cat = "F1", L = 5000, dir = "hyb_dir")
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
'hyb_dir' already exists
Then we can do:
~/Documents/git-repos/newhybrids/newhybs -d nh_data.txt -g P0 1 0 0 -g p1 0 0 1 -g F1 0 1 0 -g F2 .25 .5 .25 -g BX0 .5 .5 0 -g BX1 0 .5 .5 --pi-prior fixed 1 1 1 1 1 1
OK! That is working pretty darn well. NewHybrids converges very quickly. I can easily get away with running it for 50 sweeps of burn in and 200 of data collection, which should go pretty darn fast, and I could do that over many data sets, each one from a different population. But it looks little funky. There are some individuals looking pure in there that should not be.
But, with 5K markers, newhybrids starts to throw this error:
WARNING: i=6 >= Num=6 in IntegerFromProbsRV(). Probably only a problem while initializing the chain. i reset to 0
Turn out that this is a problem during more than just initialization—you end up with salmon that are clearly F1s being called pure.
Let’s try using fewer markers
create_hybrid_dataset(SAR, wild_pop = "GRR", hyb_cat = "PureW", L = 1500, dir = "hyb_dir")
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
'hyb_dir' already exists
That seems to work correctly.
So, I only need to set up these simulations, write out all the data sets to directories with informative names, and then rip through them with Unix parallel. Then we need a function to read in the output into one massive tidy data frame for analysis.