Calculate the frequency of haplotypes that are recombinant or non-recombinant between the RoSA and the distal assays from Prince et al. (2017).
library(tidyverse)
library(parallel)
dir.create("outputs/011", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/011", recursive = TRUE, showWarnings = FALSE)
These data should stored as strings of genotypes encoded as letters:
M
= 2 copies of the L alleleP
= 2 copies of the E alleleH
= 1 copy of the L and one copy of the E allele.?
= genotype missingThe distal assays are given by 5-letter words and the RoSA assays by 8-letter words. The letters in each word are given in genome coordinate order.
We will continue to call the alleles M
or P
(rather than E
or L
). This terminology is vestigial, reflecting the language of “Mature” and “Premature” migrators that we have since shown to be inappropriate.
PR_tib <- read_rds("data/RoSA_prince_genotype_data.rds")
# This data set has the rosa column coded as E/L/H rather than P/M/H.
# But the code is set up for P/M/H. So, let's just switch it.
RPGD <- PR_tib %>%
mutate(rosa = str_replace_all(rosa, "L", "M") %>% str_replace_all("E", "P"))
Looking at this it seems the perfect case for phasing all these variants using PHASE (Stephens, Smith, and Donnelly 2001). This way we can get haplotype frequencies and, by making some very reasonable assumptions about the ancestral state of haplotypes, this will provide a good way to estimate the fraction of recombined haplotypes.
We will explode the words into vectors and then apply a function. Note, we will call the “M” allele “M” and the “P” allele “P” as it looks like almost any single-letter code for the variants can be interpreted correctly by PHASE.
#' @param side for heterozygous positions side = 1 give one result and side = 2 gives the other
halflotype <- function(x, side = 1) {
repl_1 <- c(M = "M", P = "P", H = "M", `?` = "?")
repl_2 <- c(M = "M", P = "P", H = "P", `?` = "?")
if (side == 1) {
repl <- repl_1
} else {
repl <- repl_2
}
unname(repl[x])
}
halflos <- RPGD %>%
mutate(
prince_vec = strsplit(prince, ""),
rosa_vec = str_split(rosa, ""),
prince1 = map(prince_vec, ~ halflotype(., 1)),
prince2 = map(prince_vec, ~ halflotype(., 2)),
rosa1 = map(rosa_vec, ~ halflotype(., 1)),
rosa2 = map(rosa_vec, ~ halflotype(., 2)),
)
h2 <- halflos %>%
gather("reg_hap", "halflo", rosa1, rosa2, prince1, prince2) %>%
mutate(
region = str_replace(reg_hap, "[12]", ""),
num_miss = map_int(halflo, ~ sum(. == "?"))
)
ggplot(h2, aes(x = num_miss)) +
geom_histogram(breaks = (0:11) - 0.5) +
facet_wrap(~region) +
scale_x_continuous(breaks = 0:10)
This is telling us that it make sense to ditch any genotypes with more than 1 missing data point, because if they are missing more than 1, they are likely missing all of them! So, we will do that. This tosses out entire individuals if they are missing genotypes at more than 1 SNP in either the distal or rosa regions.
halflo_filt <- h2 %>%
group_by(Indiv) %>%
filter(all(num_miss <= 1)) %>%
ungroup()
When we have spring/fall ecotype pairs in a particular sub-basin, I want to analyze them together in PHASE, and just refer to them by their shared sub-basin.
for_phase <- halflo_filt %>%
rename(PopEco = Population) %>%
mutate(Pop = str_replace(PopEco, "[fF]all$|[sS]p$", "")) %>%
select(Indiv, Pop, PopEco, reg_hap, halflo) %>%
spread(reg_hap, halflo) %>%
group_by(Pop) %>%
nest() %>%
mutate(num_fish = map_int(data, nrow)) %>%
filter(!is.na(Pop) & num_fish > 5) # filter out any unknown pops and any groups without many fish fish
for_phase
We have the positions in the data
directory (and in a supplemental table in the paper):
var_pos_tibble <- read_tsv("data/assay-positions.tsv")
var_poses <- var_pos_tibble %>%
pull(Position_on_Chr_28_in_Otsh_v1.0)
We will walk over the for_phase
tibble and write phase files out to different directories (named for the population) in /tmp, and we will write out the command to run it, and then run that with mclapply.
# this is a function that operates on a single tibble like those in "data" in for_phase.
# note: it is hardwired for 13 biallelic snps at the moment
#' @param R_seed The random number to use to set random numbers to use as seeds for PHASE
# for reproducibility.
write_phase_input <- function(dirname, D) {
dir.create(file.path("/tmp", dirname), showWarnings = FALSE)
outf <- file.path("/tmp", dirname, "input.txt")
cat(nrow(D), "\n", 13, "\n", sep = "", file = outf)
cat("P ", var_poses, "\n", sep = " ", file = outf, append = TRUE)
cat(rep("S", 13), "\n", sep = "", file = outf, append = TRUE)
write_ind <- function(Indiv, prince1, prince2, rosa1, rosa2, ...) {
cat(Indiv, "\n", file = outf, append = TRUE)
cat(c(prince1, rosa1), "\n", sep = " ", file = outf, append = TRUE)
cat(c(prince2, rosa2), "\n", sep = " ", file = outf, append = TRUE)
}
pwalk(D, write_ind)
# and now return the command one should use to run it
str_c("cd ", dirname(outf), "; PHASE -S", floor(runif(1, 2, 10000)), " input.txt output 500 1 500 > stdout 2> stderr;")
}
set.seed(110) # for reproducibility (sets seeds for PHASE...)
fp_wcall <- for_phase %>%
mutate(call = map2_chr(.x = Pop, .y = data, ~ write_phase_input(.x, .y)))
That set everything up, now let’s run it:
dump <- mclapply(fp_wcall$call, function(x) system(x), mc.cores = 8)
The last step takes only a few minutes, after which we can go and retrieve the estimated haplotype frequencies in each population. We read those in as tibbles:
phase_results <- fp_wcall %>%
mutate(freqs = map(Pop, ~ read_table2(file.path("/tmp", .x, "output_freqs")) %>% arrange(desc(`E(freq)`))))
And now we just need to organize those results a little bit:
summarized <- phase_results %>%
select(Pop, num_fish, freqs) %>%
unnest(cols = c(freqs)) %>%
select(-index) %>%
mutate(
distal_region = str_sub(haplotype, 1, 5),
rosa_region = str_sub(haplotype, 6, 14)
) %>%
select(Pop, num_fish, distal_region, rosa_region, `E(freq)`, S.E) %>%
filter(`E(freq)` >= 0.01) # filter out low frequency haplotypes, but let's try not doing it...
write_csv(summarized, "intermediates/011/haplo-freqs-unfiltered.csv")
# hey! Let's format that for a table with Es and Ls
hap_freqs <- read_csv("intermediates/011/haplo-freqs-unfiltered.csv") %>%
mutate(
distal_region = str_replace_all(distal_region, "P", "E") %>% str_replace_all(., "M", "L"),
rosa_region = str_replace_all(rosa_region, "P", "E") %>% str_replace_all(., "M", "L")
) %>%
rename(
Location = Pop,
`Distal Region` = distal_region,
`RoSA Region` = rosa_region,
Freq = `E(freq)`,
S.E. = S.E
) %>%
mutate(
Freq = sprintf("%.3f", Freq),
S.E. = sprintf("%.6f", S.E.)
) %>%
left_join(read_tsv("data/collection-name-modify.tsv"), by = "Location") %>%
select(`Collection location`, ESU, everything()) %>%
select(-Location) %>%
rename(`Collection N` = num_fish)
write_csv(hap_freqs, "outputs/011/data-table-distal-rosa-hap-freqs-unfiltered-raw.csv")
The late lineage haplotypes at the distal region are either “MMMMM”, “MPPMM”, or “MPMMM”. The early lineage haplotypes in the distal region are merely “PPPPP”. Everything else we call “A” for “ambiguous.”
Meanwhile, we summarise the RoSA region with “MMMMMMMM” as late lineage, while “PPPPPPPP” is early-lineage, and anything else is A for Ambiguous. Note that PHASE has imputed any missing genotypes (of which there will be only one for each region) so it is easy to deal with these….no ?
’s.
sf_freqs <- summarized %>%
mutate(distal_SF_type = case_when(
distal_region %in% c("MMMMM", "MPPMM", "MPMMM") ~ "L",
distal_region %in% c("PPPPP") ~ "E",
TRUE ~ "A"
)) %>%
mutate(rosa_SF_type = case_when(
rosa_region == "MMMMMMMM" ~ "L",
rosa_region == "PPPPPPPP" ~ "E",
TRUE ~ "A"
)) %>%
mutate(compo = str_c(distal_SF_type, rosa_SF_type, sep = "/")) %>%
group_by(Pop, compo) %>%
summarize(freq = sum(`E(freq)`)) %>%
spread(compo, freq)
sf_freqs
# let's format that nicely for output
tmp <- sf_freqs %>%
mutate_at(vars(-Pop), function(x) {
y <- ifelse(is.na(x), 0, x)
sprintf("%.2f", y)
}) %>%
left_join(., summarized %>% distinct(Pop, num_fish)) %>%
select(Pop, num_fish, everything()) %>%
rename(N = num_fish)
# and we can sort the columns and rows into the right order to
# just copy and paste this stuff into the Google Doc Table...
tmp_sorted <- tmp %>%
select(Pop, N, `A/L`, `A/E`, `L/L`, `L/E`, `E/L`, `E/E`) %>%
ungroup() %>%
mutate(Pop = factor(Pop, levels = c("Siletz", "KlamathEntry", "IGH", "Salmon", "Trinity", "EelVAfl", "RussianWSH", "Butte", "ColemanLF", "FRH", "Keswick_WR"))) %>%
mutate(Pop = recode_factor(Pop, IGH = "Klamath R. - Iron Gate H.", ColemanLF = "Battle Creek L.F.", .default = levels(Pop))) %>%
arrange(Pop)
# now, write that out to paste into the google doc
write_csv(tmp_sorted, "outputs/011/haplo-freqs-to-paste-into-google-doc.csv")
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2020-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.0)
## dbplyr 1.4.3 2020-04-19 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.0)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
## glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
## hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
## lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.0)
## rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.2)
## xfun 0.13 2020-04-13 [1] CRAN (R 3.6.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
##
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
## Time difference of 1.41681 mins
Prince, Daniel J., Sean M. O’Rourke, Tasha Q. Thompson, Omar A. Ali, Hannah S. Lyman, Ismail K. Saglam, Thomas J. Hotaling, Adrian P. Spidle, and Michael R. Miller. 2017. “The Evolutionary Basis of Premature Migration in Pacific Salmon Highlights the Utility of Genomics for Informing Conservation.” Science Advances 3 (8): e1603198. doi:10.1126/sciadv.1603198.
Stephens, Matthew, Nicholas J. Smith, and Peter Donnelly. 2001. “A New Statistical Method for Haplotype Reconstruction from Population Data.” The American Journal of Human Genetics 68 (4): 978–89. doi:10.1086/319501.