Intro/Overview

This notebook documents the strategy for segregating linked markers from individuals and forming hybrids.

Here we ensure that we load each of the functions:

rfiles <- dir("R", full.names = TRUE)
dump <- lapply(rfiles, source)

Randomize Haplotypes in Founders

We are going to assume no LD (cuz our selected SNPs are quite far apart) so, to initialize our simulations we are going to need to randomly assign alleles to one haplotype or another in the founder individuals.

This is going to work a little like tablify_nh() in that we will pass in a tidy data frame and a list of variants, and then go from there. So, let’s get some data to work on:

library(tidyverse)
TT <- readRDS("intermediates/02/test_and_train.rds")
MM <- readRDS("intermediates/02/marker_rankings.rds")

So, the function looks like:


#' Randomly assign (no-LD) markers to one chromosome or another in a founder individual
#' 
#' blah blah
#' @param D a tidy data frame with at least the columns: id, variant, gene_copy, allele, chrom, coord, pop
#' @param V a vector of variant names giving which variants to use and the order they are
#' to be in.
#' @return A data frame with all the identifiers as before, but now the alleles are in two 
#' columns: hap1 and hap2, which are the alleles as if they occurred together on different
#' homologous chromosomes.  This gets sorted by id, chrom, and coord.
scramble_founder_haplotypes <- function(D, V) {
  # make one column for each gene copy
  D1 <- D %>%
    ungroup() %>%
    filter(variant %in% V) %>%
    tidyr::spread(data = ., key = gene_copy, value = allele) %>%
    mutate(segind = sample(x = c(T,F), size = n(), replace = TRUE),
           hap1 = ifelse(segind, `1`, `2`),
           hap2 = ifelse(segind, `2`, `1`)) %>%
    select(-`1`, -`2`, -segind) %>%
    arrange(id, chrom, coord)
  
}

Let’s see how it works:

Test <- TT %>%
  filter(test_or_train == "test")
v1000 <- MM %>%
  filter(selectable == TRUE & cumsum(selectable) <= 1000) %>%
  .$variant
SFG <- scramble_founder_haplotypes(Test, v1000)

Now, SFG is ready to have some gametes segregated.

Segregate gametes

In order to maintain maximal sample sizes whilst not incurring the sort of inflation of perceived accuracy that comes from sampling with replacement from the genes in the Test group we are going to segregate two gametes from each individual. These will be the opposites of one another—so, effectively all of the genetic material in an individual is getting segregated into the two gametes. (In effect there is no genetic drift in this type of sampling exercise…). The variance in what gets segregated around might be a little greater because the two gametes are not independent, but that is not going to bias the accuracy simulations.

This function is going to work on a data frame like SFG that has a hap1 and a hap2 column. It will create new columns gam1 and gam2 (gam is short for gamete there). We just give a chance of recombination between each pair of markers.

Here is what the function looks like

And this is what it looks like when we use it:

SG <- segregate_gametes(H = SFG)

Reproduction

Now all we need is a function that will combine the gametes of randomly chosen individuals into new individuals. THis is an interesting problem. We would like to make use of all the genetic material that we have from all the test individuals. We will be constrained by what types of hybrid categories we are making and the sizes of the test samples from the different populations.

arrange_matings

The way I am going to approach this is to create a function arrange_matings() that returns a data frame that describes which gametes get passed on to which individuals.

Let’s first get some data sets that we will be passing into arrange_matings().

tmp <- TT %>%
  filter(test_or_train == "test") %>%
  group_by(id, pop, group) %>%
  tally() %>%
  ungroup() %>%
  select(-n)
A <- tmp %>% filter(group == "wild")
B <- tmp %>% filter(group == "farmed")

Let’s write it:

#' return list of  who mates with whom to create hybrid offspring.  This will create F1's between 
#' the A and the B group.
#' @param A a data frame with id, pop, and group.  Should be exclusively from one of the groups (i.e. wild)
#' @param B a data frame with id and pop, and group.  Should be from the other group (i.e. farmed)
#' @return a data frame with columns id, popA, popB, idA gamA idB gamB that tells us which gametes from
#' which individuals each new individual is to be made of.
arrange_matings <- function(A, B) {
  
  n <-  min(nrow(A), nrow(B)) 
  Atib <- tibble(id = rep(sample(A$id, size = n, replace = FALSE), each = 2))  %>%
    mutate(gam = rep(c(1,2), length.out = n())) %>%
    sample_n(size = nrow(.), replace = FALSE)
  Btib <- tibble(id = rep(sample(B$id, size = n, replace = FALSE), each = 2))  %>%
    mutate(gam = rep(c(1,2), length.out = n())) %>%
    sample_n(size = nrow(.), replace = FALSE)
  
  
  Atib2 <- left_join(Atib, A)
  names(Atib2) <- paste(names(Atib2), "A", sep = "_")
  Btib2 <- left_join(Btib, B)
  names(Btib2) <- paste(names(Btib2), "B", sep = "_")
  
  F1s <- bind_cols(Atib2, Btib2) %>%
    mutate(id = paste("F1", 1:n(), sep = "_")) %>%
    select(id, everything())
  
  return(F1s)
}

Now that we have that function, we can create a mating list for F1s like this:

arrange_matings(A, B)
Joining, by = "id"
Joining, by = "id"

Some data to play with

We need some scrambled founder data for the following functions for testing and demo. So, here it is:

Wf <- SFG %>%
  filter(group == "wild")
Ff <- SFG %>%
  filter(group == "farmed")

make_f1s

And so, we can make a function to make F1’s like this:

#' @param Wf a data frame of founders with scrambled haplotypes in hap1 and hap2
#' @param Ff a data frame of founders (from the other species/population) 
#' with scrambled haplotypes in hap1 and hap2
make_f1s <- function(Wf, Ff) {
  A <- Wf %>%
    count(id) %>%
    select(-n)
  B <- Ff %>%
    count(id) %>%
    select(-n)
  
  ML <- arrange_matings(A, B) # get the list of matings
  
  Wft <- Wf %>%
    select(id, variant, chrom, coord, hap1, hap2) %>%
    rename(id_A = id, `1` = hap1, `2` = hap2) %>%
    tidyr::gather(data = ., key = "gam_A", value = "hap1", `1`, `2`) %>%
    mutate(gam_A = as.numeric(gam_A))
  
  Fft <- Ff %>%
    select(id, variant, chrom, coord, hap1, hap2) %>%
    rename(id_B = id, `1` = hap1, `2` = hap2) %>%
    tidyr::gather(data = ., key = "gam_B", value = "hap2", `1`, `2`) %>%
    mutate(gam_B = as.numeric(gam_B))
  
  left_join(ML, Wft) %>%
    left_join(., Fft)
  
}

Then we can use that to make a bunch of F1’s and immediately segrate some haplotypes from them.

newF1s <- make_f1s(Wf, Ff) %>%
  segregate_gametes()
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
newF1s[1:100,]

That is cool. The harder part is the later generations, but not so bad, it turns out.

make_f2s

This function will take the same input as make_f1s() but then it will split the result into two data frames and make F2s from it.

make_f2s <- function(Wf, Ff) {
  F1 <- make_f1s(Wf, Ff) %>%
    segregate_gametes()
  
  # now, make two new data frames that we will combine using make_F1s into F2s
  ids <- sample(unique(F1$id))  # permute all the ids around
  if (length(ids) %% 2 == 1) {  # if there is an odd number of F1s, then drop the first one
    ids <- ids[-1]
  }
  idA <- ids[1:(length(ids) / 2)]
  idB <- ids[(1 + length(ids) / 2):length(ids)]
  
  # now make new Wf and new Ff and run them back through make_F1s
  F1_clean <- F1 %>%
    select(-ends_with("_A"), -ends_with("_B"))
  
  newWf <- F1_clean %>% filter(id %in% idA)
  newFf <- F1_clean %>% filter(id %in% idB)
  
  F2 <- make_f1s(newWf, newFf) %>%
    mutate(id = stringr::str_replace(id, "^F1", "F2"))
  
}

And we can use that thing like this:

newF2s <- make_f2s(Wf, Ff)
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
newF2s[1:100, ]

make backcrosses

This is a single function that will make backcrosses to the first population (the A population). If you want to make backcrosses in the other direction, you will just need to call it with the populations in a different order. This is also going to be a wasteful function: it is not going to work hard to make pures out of any extra individuals, etc. It is going to make as many backcrosses as it can, and no more. It is also not going to be population aware—that all has to be done on the front end if you want to make sure that backcrosses are all within the same population, for example.

Before we launch into this, let’s talk about how many backcrosses you can make. Imagine that you are going to be backcrossing to the \(A\) population which has \(n_A\) test indivs. And the other population is the \(B\) population which has \(n_B\) individuals in it. To figure out how many A-backcross indivs you can make without reusing any of your gene copies, you start by computing \(M = \min(\lfloor n_a/3 \rfloor, n_b)\). Then the way you do this is

  1. Hold out \(2M\) of the \(n_a\) individuals for backcrossing later
  2. Make \(2M\) F1s out of \(M\) \(A\)’s and \(M\) \(B\)’s
  3. Mate the \(2M\) \(A\)’s to the \(2M\) F1’s to get \(4M\) backcrosses.

So, you can get \(4M\) backcrosses out of it.

Let’s write a function that does this:

#' @param idstr string that gets prepended to the number of each individual
make_bxs <- function(Wf, Ff, idstr = "BX") {
  A <- Wf %>%
    count(id) %>%
    select(-n)
  B <- Ff %>%
    count(id) %>%
    select(-n)
  
  M <- min(floor(nrow(A)/3), nrow(B))
  
  Aids <- sample(A$id) # permute these
  Bids <- sample(B$id) # permute these too
  
  Ai1 <- Aids[1:M]  # ids of A's for F1 creation
  Bi1 <- Bids[1:M]  # ids of B's for F1 creation
  
  Ai2 <- Aids[(M + 1):(3 * M)]  # ids of A's for making backcrosses
  
  # make the F1s and segregate em 
  F1 <- make_f1s(Wf %>% filter(id %in% Ai1), 
                 Ff %>% filter(id %in% Bi1)) %>%
    segregate_gametes()
  
  # now, make F1's between the F1s and the Ai2's
  BX <- make_f1s(Wf %>% filter(id %in% Ai2), F1) %>%
    mutate(id = stringr::str_replace(id, "^F1", idstr))
  
}

And we can use that like this:

newBXs <- make_bxs(Wf, Ff)
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
Joining, by = "id"
Joining, by = "id"
Joining, by = c("id_A", "gam_A")
Joining, by = c("id_B", "gam_B", "variant", "chrom", "coord")
newBXs[1:100,]
---
title: "Segregation and Reproduction"
output: 
  html_notebook:
    toc: true
---

```{r setup, include=FALSE}
# set the working directory always to the project directory (one level up)
knitr::opts_knit$set(root.dir = normalizePath(rprojroot::find_rstudio_root_file())) 
```


## Intro/Overview

This notebook documents the strategy for segregating linked markers from
individuals and forming hybrids.

Here we ensure that we load each of the functions:
```{r}
rfiles <- dir("R", full.names = TRUE)
dump <- lapply(rfiles, source)
```

## Randomize Haplotypes in Founders

We are going to assume no LD (cuz our selected SNPs are quite far apart)
so, to initialize our simulations we are going to need to randomly assign
alleles to one haplotype or another in the founder individuals. 

This is going to work a little like `tablify_nh()` in that we will pass in
a tidy data frame and a list of variants, and then go from there.
So, let's get some data to work on:
```{r get-data}
library(tidyverse)
TT <- readRDS("intermediates/02/test_and_train.rds")
MM <- readRDS("intermediates/02/marker_rankings.rds")
```

So, the function looks like:
```r
`r paste(readLines("R/scramble_founder_haplotypes.R"), collapse = "\n")`
```

Let's see how it works:
```{r demo-scramble}
Test <- TT %>%
  filter(test_or_train == "test")
v1000 <- MM %>%
  filter(selectable == TRUE & cumsum(selectable) <= 1000) %>%
  .$variant

SFG <- scramble_founder_haplotypes(Test, v1000)
```
Now, `SFG` is ready to have some gametes segregated.

## Segregate gametes

In order to maintain maximal sample sizes whilst not incurring the sort of 
inflation of perceived accuracy that comes from sampling with replacement
from the genes in the Test group we are going to segregate two gametes from 
each individual.  These will be the opposites of one another---so, effectively all
of the genetic material in an individual is getting segregated into the two gametes.
(In effect there is no genetic drift in this type of sampling exercise...).  The variance
in what gets segregated around might be a little greater because the two gametes are
not independent, but that is not going to bias the accuracy simulations.  

This function is going to work on a data frame like SFG that has a `hap1` and a `hap2`
column.  It will create new columns `gam1` and `gam2` (gam is short for gamete there).
We just give a chance of recombination between each pair of markers.

Here is what the function looks like

And this is what it looks like when we use it:
```{r use-seg-gam}
SG <- segregate_gametes(H = SFG)
```

## Reproduction

Now all we need is a function that will combine the gametes of randomly chosen individuals 
into new individuals. THis is an interesting problem.  We would like to make use of
all the genetic material that we have from all the test individuals.  We will be
constrained by what types of hybrid categories we are making and the sizes of the
test samples from the different populations.  

### arrange_matings

The way I am going to approach this is to create a function `arrange_matings()` that
returns a data frame that describes which gametes get passed on to which individuals.

Let's first get some data sets that we will be passing into `arrange_matings()`.
```{r trial-data}
tmp <- TT %>%
  filter(test_or_train == "test") %>%
  group_by(id, pop, group) %>%
  tally() %>%
  ungroup() %>%
  select(-n)

A <- tmp %>% filter(group == "wild")
B <- tmp %>% filter(group == "farmed")
```
Let's write it:
```r
`r paste(readLines("R/arrange_matings.R"), collapse = "\n")`
```

Now that we have that function, we can create a mating list for F1s like this:
```{r demo-arrange}
arrange_matings(A, B)
```

### Some data to play with

We need some scrambled founder data for the following functions for testing 
and demo.  So, here it is:
```{r make-demo-data}
Wf <- SFG %>%
  filter(group == "wild")
Ff <- SFG %>%
  filter(group == "farmed")
```
### make_f1s
And so, we can make a function to make F1's like this:
```r
`r paste(readLines("R/make_f1s.R"), collapse = "\n")`
```

Then we can use that to make a bunch of F1's and immediately segrate some haplotypes from them.
```{r demo_f1s}
newF1s <- make_f1s(Wf, Ff) %>%
  segregate_gametes()

newF1s[1:100,]
```

That is cool.  The harder part is the later generations, but not so bad,
it turns out.

### make_f2s

This function will take the same input as `make_f1s()` but then it will 
split the result into two data frames and make F2s from it.
```r
`r paste(readLines("R/make_f2s.R"), collapse = "\n")`
```

And we can use that thing like this:
```{r demo-f2}
newF2s <- make_f2s(Wf, Ff)
newF2s[1:100, ]
```

### make backcrosses

This is a single function that will make backcrosses to the first population (the A population).
If you want to make backcrosses in the other direction, you will just need to 
call it with the populations in a different order.  This is also going to be a wasteful function:
it is not going to work hard to make pures out of any extra individuals, etc.  It is going to make
as many backcrosses as it can, and no more. It is also not going to be population aware---that all
has to be done on the front end if you want to make sure that backcrosses are all within the same
population, for example.

Before we launch into this, let's talk about how many backcrosses you can make.  Imagine that you are
going to be backcrossing to the $A$ population which has $n_A$ test indivs.  And the other population
is the $B$ population which has $n_B$ individuals in it.  To figure out how many A-backcross 
indivs you can make without reusing any of your gene copies, you start by computing
$M = \min(\lfloor n_a/3 \rfloor, n_b)$.  Then the way you do this is

1. Hold out $2M$ of the $n_a$ individuals for backcrossing later
2. Make $2M$ F1s out of $M$ $A$'s and $M$ $B$'s
3. Mate the $2M$ $A$'s to the $2M$ F1's to get $4M$ backcrosses.

So, you can get $4M$ backcrosses out of it.

Let's write a function that does this:
```r
`r paste(readLines("R/make_bxs.R"), collapse = "\n")`
```

And we can use that like this:
```{r demo-bx}
newBXs <- make_bxs(Wf, Ff)
newBXs[1:100,]
```