Anaylsis of the duplications between GREB1L and ROCK1 that are associated with run timing.

Be warned: ggtree take a very long time to layout unrooted trees in the “daylight” mode used here. It takes about 2.3 hours to produce the two unrooted daylight trees in this notebook when running on a mac laptop. In fact, it takes so long that I define a variable here called RECOMPUTE_DAYLIGHT_LAYOUT. If this is TRUE, then it will crunch through the whole process, and it will save the result into stored_results/009. If it is FALSE, it will just read the results from stored_results/009 and use them. That is much better than waiting over two hours to finish running the code in this RMarkdown document.

RECOMPUTE_DAYLIGHT_LAYOUT <- FALSE

Packages and paths

library(tidyverse)
library(broom)
library(ecaRbioinf)
library(ape)
library(ggtree)

## Warning: package 'ggtree' was built under R version 3.6.3

library(phangorn)
library(cowplot)


dir.create("outputs/009", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/009", recursive = TRUE, showWarnings = FALSE)

Getting SNP Data and inserting it into sequences

We are going to do the RoSA region from 12,260,034 to 12,289,484.

Grab the big_haps2.rds from the 004 notebook, and also the Johnson Creek fish. Choose here whether to use the stored big_haps (that we used) or others generated by BEAGLE that might be subtly different.

We recommend using the ones we provide, since there was some fiddling with scale bar positioning in the tree plotting that had to go on, and it is more likely to end up well-positioned with the identical input.

USE_STORED_BIG_HAPS <- TRUE

if (USE_STORED_BIG_HAPS != TRUE) {
  haps <- read_rds(path = "outputs/004/big_haps2.rds")
  fish_jc <- read_rds(path = "outputs/004/jc_fish_haps.rds")

  # Make a data frame with everyone, Johnson Creek included:
  haps_jc <- bind_rows(haps, fish_jc)
} else {
  haps <- read_rds("stored_results/004/big_haps2.rds")
  haps_jc <- read_rds("stored_results/004/big_haps2_with_jc.rds")
}

Determining which indivs are hets in this region

Now, we are going to want some of the meta data from that to color things on the tree. In particular, I want to know which individuals appear heterozygous. We can ascertain that by investigating how many S alleles there are in each haplotype. So, first, let’s look at the distribution of # of S alleles per haplotype:

# group on lots of things so they stick around
Ssums <- haps %>%
  filter(POS > 12.26e6, POS < 12.29e6) %>%
  group_by(haplo_name, Indiv, pop, ecotype, lineage) %>%
  summarise(NumSs = sum(alle2 == "S"))

ggplot(Ssums, aes(x = NumSs, fill = ecotype)) +
  geom_histogram(bins = 40)

From that it is pretty darn easy to see that we can identify heterozygous individuals as those that have one haplo with NumSs > 125 and NumSs < 125. Fun…

seq_meta <- Ssums %>%
  group_by(Indiv) %>%
  mutate(inHetFish = any(NumSs < 125) & any(NumSs >= 125)) %>%
  ungroup() %>%
  mutate(label = str_replace(haplo_name, "-([ab])", ".\\1")) %>% # make names the same as in tree
  select(label, everything())

# write this out for later playing
write_rds(seq_meta, path = "outputs/009/seq_meta.rds", compress = "xz")

Filter it down to biallelic SNPs

We are going to retain only those sites that have an A, C, G, or T in the allele.

bisnps <- haps %>%
  mutate(isSNP = allele %in% c("A", "C", "G", "T")) %>%
  group_by(POS) %>%
  summarise(allSNPs = all(isSNP == TRUE)) %>%
  filter(allSNPs == TRUE)

haps_bi <- haps %>%
  semi_join(bisnps, by = "POS")

# save these for exploring the effects of imputation error later (203-04).
# We will only save the portion in the RoSA that is used for making the trees
haps_bi %>%
  filter(POS > 12.26e6 & POS < 12.29e6) %>%
  write_rds(path = "outputs/009/haps_bi_rosa.rds", compress = "xz")

And we will do the same thing for the Johnson Creek fish, but we will also toss out any sites that are missing from that fish (or in anyone, but the other data is imputed, so has no missing data.)

bisnps_jc <- haps_jc %>%
  group_by(POS) %>%
  mutate(toss_it = any(is.na(allele))) %>%
  ungroup() %>%
  filter(toss_it == FALSE) %>%
  mutate(isSNP = allele %in% c("A", "C", "G", "T")) %>%
  group_by(POS) %>%
  summarise(allSNPs = all(isSNP == TRUE)) %>%
  filter(allSNPs == TRUE)

haps_bi_jc <- haps_jc %>%
  semi_join(bisnps_jc, by = "POS")

Now, we insert that variation into the chinook sequences from about 12.26 to 12.29 Mb. We use a function I wrote for that in ‘ecaRbioinf’. However, before we do that we make some small fastas that have small segments that we want:

samtools faidx genome/Otsh_v1.0_genomic.fna NC_037124.1:9661479-14824896 | gzip -c > intermediates/009/NC_037124.1_fragment_chinook.fna.gz

dump <- ecaRbioinf::insert_variants_into_sequences(
  V = haps_bi,
  fasta = "intermediates/009/NC_037124.1_fragment_chinook.fna.gz",
  lo = 12260034,
  hi = 12289484,
  file = "intermediates/009/chinook_rosa_seqs.fa"
)

dump_jc <- ecaRbioinf::insert_variants_into_sequences(
  V = haps_bi_jc,
  fasta = "intermediates/009/NC_037124.1_fragment_chinook.fna.gz",
  lo = 12260034,
  hi = 12289484,
  file = "intermediates/009/Prelim-chinook_rosa_seqs_jc.fa"
)

A minor adjustment for the Johnson Creek fish: It turns out that since we only have one haplotype for the JC fish, that causes some problems with insert_variants_into_sequences()—basically it duplicates the JC haplotype into an a and b version. So, we will just pull out the .b version here. We use awk to do that.

cd intermediates/009

awk 'BEGIN {skippy = 100;} /^>Narum_et_al_Chinook_Assembly.b/ {skippy = 0; next} {skippy++; if(skippy>1) print}' Prelim-chinook_rosa_seqs_jc.fa > chinook_rosa_seqs_jc.fa

Now, we will also want to get the coho sequence that has been aligned to this region:

samtools faidx stored_results/001/NC_037124.1_coho.fna.gz NC_037124.1:12260034-12289484 > intermediates/009/coho-rosa-seq.fa

# make everything uppercase in Coho and add it to the existing fastas with Chinook
(awk 'NR==1 {print; next} {printf("%s", toupper($1))} END {printf("\n");}' intermediates/009/coho-rosa-seq.fa;
cat intermediates/009/chinook_rosa_seqs.fa) > intermediates/009/chin-co-seqs.fa

(awk 'NR==1 {print; next} {printf("%s", toupper($1))} END {printf("\n");}' intermediates/009/coho-rosa-seq.fa;
cat intermediates/009/chinook_rosa_seqs_jc.fa) > intermediates/009/chin-co-seqs_jc.fa

Reading sequences into APE, computing distances, making a neighbor-joining tree

chinco <- ape::read.FASTA("intermediates/009/chin-co-seqs.fa")

names(chinco)[names(chinco) == "NC_037124.1:12260034-12289484"] <- "Coho_ref_sequence"

chdist <- ape::dist.dna(chinco)

ccnj <- ape::nj(chdist)

write.tree(ccnj, "outputs/009/rosa-newick.txt")

# now, do the same thing, but drop Coho:
chin_only <- chinco[!(names(chinco) == "Coho_ref_sequence")]
chin_only_dist <- ape::dist.dna(chin_only)
chin_only_nj <- ape::nj(chin_only_dist)
write.tree(chin_only_nj, "outputs/009/rosa-chinook-only-newick.txt")

And then we will do the same thing with JC fish in there, too:

chinco_jc <- ape::read.FASTA("intermediates/009/chin-co-seqs_jc.fa")

names(chinco_jc)[names(chinco_jc) == "NC_037124.1:12260034-12289484"] <- "Coho_ref_sequence"

chdist_jc <- ape::dist.dna(chinco_jc)

ccnj_jc <- ape::nj(chdist_jc)

write.tree(ccnj_jc, "outputs/009/rosa-newick_jc.txt")

# now, do the same thing, but drop Coho:
chin_only_jc <- chinco_jc[!(names(chinco_jc) == "Coho_ref_sequence")]
chin_only_dist_jc <- ape::dist.dna(chin_only_jc)
chin_only_nj_jc <- ape::nj(chin_only_dist_jc)
write.tree(chin_only_nj_jc, "outputs/009/rosa-chinook-only-newick_jc.txt")

Let’s report the number of segregating sites (phylogenetically informatve sites) for both of those sequence data sets

length(seg.sites(chin_only))

## [1] 154

length(seg.sites(chinco))

## [1] 875

length(seg.sites(chin_only_jc))

## [1] 149

length(seg.sites(chinco_jc))

## [1] 873

As a final hurrah, let’s make a seq_meta that includes the JC fish:

seq_meta_jc <- bind_rows(
  seq_meta,
  tibble(
    label = "Narum_et_al_Chinook_Assembly.a",
    haplo_name = "Narum_et_al_Chinook_Assembly-a",
    Indiv = "Narum_et_al_Chinook_Assembly",
    pop = "Johnson Creek",
    ecotype = "Spring/Summer Run",
    lineage = "Upper Columbia",
    NumSs = NA,
    inHetFish = FALSE # we don't know this since we just get one haplo, essentially.
  )
)

Colors and a function to add things to plots

source("R/define_fcolors_all_sf.R")

final_bits <- function(g) {
  g +
    geom_tippoint(aes(fill = Ecotype, shape = `Fish Genotype`, colour = Basin), size = 5, stroke = 1.1) +
    theme(legend.position = "right") +
    scale_shape_manual(values = c(25, 21, 22)) +
    scale_fill_manual(values = fcolors_all_sf) +
    scale_colour_manual(values = fcolors_all_sf) +
    guides(
      fill = guide_legend(override.aes = list(shape = 21, stroke = 0.1)),
      colour = guide_legend(override.aes = list(shape = 21, stroke = 1.1, fill = NA)),
      shape = guide_legend(title = "Fish Genotype")
    )
}

Plotting the Chinook-only tree with Johnson Creek in there

Read it in and attach meta data

But we will include the Johnson Creek fish.

tmptree_jc <- treeio::read.newick("outputs/009/rosa-chinook-only-newick_jc.txt") %>%
  treeio::as.treedata()

# now, one cool thing about the ggtree family of packages by
# Guangchuang Yu is that you can attach all sorts of info
# to these data.
sqm2_jc <- seq_meta_jc %>%
  mutate(Basin = recode(lineage, Sacto = "Sacramento", Klamath = "Klamath")) %>%
  mutate(Ecotype = recode(ecotype,
    Fall = "Fall Run",
    Spring = "Spring Run",
    Winter = "Winter Run",
    Late_Fall = "Late Fall Run"
  )) %>%
  mutate(`Fish Genotype` = ifelse(inHetFish, "Heterozygous", "Homozygous")) %>%
  mutate(`Fish Genotype` = ifelse(str_detect(label, "Narum"), "Unknown", `Fish Genotype`))


# write sqm2_jc out for future use in 204-03
write_rds(sqm2_jc, path = "outputs/009/sqm2_jc.rds", compress = "xz")

tree_jc <- treeio::full_join(tmptree_jc, sqm2_jc, by = "label")

## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Make a simple rectangular tree

This is fast and we can plot it in the notebook.

rectangular_final_jc <- final_bits(ggtree(tree_jc)) +
  geom_treescale(width = 0.001, offset = 2)

print(rectangular_final_jc)

Make an unrooted “daylight” tree

Be warned. This takes about an hour of computation.

if (RECOMPUTE_DAYLIGHT_LAYOUT == TRUE) {
  # note, the following takes almost an hour!
  unrooted_jc <- ggtree(tree_jc, layout = "unrooted")

  # save that in outputs so that we don't have to regenerate it
  write_rds(unrooted_jc, path = "stored_results/009/unrooted_jc.rds", compress = "xz")
} else {
  unrooted_jc <- read_rds(path = "stored_results/009/unrooted_jc.rds")
}


# add a time scale.
# I don't know why the default timescale is not working out there...maybe cuz it is an unrooted tree.

unrooted_final_jc <- final_bits(unrooted_jc) +
  geom_treescale(width = 0.001, offset = 0.00005, y = -0.07889505, x = -0.1300)

# save the figure as a PDF file
ggsave(unrooted_final_jc, filename = "outputs/009/chinook-only-distance-unrooted-tree_jc.pdf", width = 10, height = 8)

Make an unrooted tree with coho (but drop Johnson Creek)

Read the tree.

cctmptree <- treeio::read.newick("outputs/009/rosa-newick.txt") %>%
  treeio::as.treedata()

# now, one cool thing about the ggtree family of packages by
# Guangchuang Yu is that you can attach all sorts of info
# to these data.
sqm2 <- seq_meta %>%
  mutate(Basin = recode(lineage, Sacto = "Sacramento", Klamath = "Klamath")) %>%
  mutate(Ecotype = recode(ecotype,
    Fall = "Fall Run",
    Spring = "Spring Run",
    Winter = "Winter Run",
    Late_Fall = "Late Fall Run"
  )) %>%
  mutate(`Fish Genotype` = ifelse(inHetFish, "Heterozygous", "Homozygous"))


# add coho to seq_meta
sm2 <- sqm2 %>%
  bind_rows(., tibble(
    label = "Coho_ref_sequence",
    `Fish Genotype` = "Homozygous",
    Ecotype = "Coho",
    Basin = NA
  )) %>%
  rename(`ecotype/species` = ecotype)


cctree <- treeio::full_join(cctmptree, sm2, by = "label")

Plot an unrooted tree from that.

if (RECOMPUTE_DAYLIGHT_LAYOUT == TRUE) {
  ccunrooted <- ggtree(cctree, layout = "unrooted")

  # save that in outputs so that we don't have to regenerate it if we want to play with it
  write_rds(ccunrooted, path = "stored_results/009/ccunrooted.rds", compress = "xz")
} else {
  ccunrooted <- read_rds("stored_results/009/ccunrooted.rds")
}
ccfinal <- ccunrooted +
  geom_tippoint(aes(fill = Ecotype, shape = `Fish Genotype`), colour = "black", size = 3, stroke = 0.1) +
  theme(legend.position = "right") +
  scale_shape_manual(values = c(25, 21)) +
  scale_fill_manual(values = fcolors_all_sf) +
  scale_colour_manual(values = fcolors_all_sf) +
  guides(
    fill = guide_legend(title = "Ecotype/Species", order = 1, override.aes = list(shape = 21, stroke = 0.1)),
    shape = guide_legend(title = "Fish Genotype")
  ) +
  geom_treescale(width = 0.01, offset = 0.0005, y = -0.0701, x = -0.130)

ggsave(ccfinal, filename = "outputs/009/with-coho-unrooted-tree.pdf", width = 10, height = 8)

The layout procedure for that takes over an hour, and might have a random element, because it comes out looking different on different runs. The version that went into the paper has been placed in stored/results/009/with-coho-unrooted-tree.pdf.

Compute tree fractions

Last thing in this notebook is calcuating the fraction of the distance from Coho to the Fall-run group at which the spring run subtree branches off.

gt <- ggtree(cctree) +
  geom_tippoint(aes(fill = Ecotype, shape = `Fish Genotype`), colour = "black", size = 3, stroke = 0.1) +
  geom_text2(aes(label = node), hjust = -0.5) +
  theme(legend.position = "right") +
  scale_shape_manual(values = c(25, 21)) +
  scale_fill_manual(values = fcolors_all_sf) +
  scale_colour_manual(values = fcolors_all_sf) +
  guides(
    fill = guide_legend(title = "Ecotype/Species", order = 1, override.aes = list(shape = 21, stroke = 0.1)),
    shape = guide_legend(title = "Fish Genotype")
  )

ggsave(gt, filename = "outputs/009/help-find-distances-1.pdf", width = 30, height = 30)

From that, it is quite clear that Coho is node #206, and the node where the split between fall-run and spring-run occurs is node #431. So, we will measure the mean branch length between Coho and every late-returning haplotype. This is going to be B1, which is the mean over late-returning tips of the the cumulative branch length from 431 to the tip, plus B2 which is the distance from coho to node #431. Then B1 + B2 is a measure of the total “evolution” between coho and chinook. We will assume that half of that is due to evolution on the chinook branch and half on the coho branch from some common ancestor. So, Ch = (B1 + B2)/2 is the amount of evolution from the common ancestor of coho and chinook to modern day chinook. And so B1/Ch is our estimate of the fraction of time since the split with the coho common ancestor that spring and fall run split off.

So, now, let us look at those branch lengths:

desc_from_431 <- tidytree::offspring(as_tibble(cctree), 431) %>%
  filter(node != 206)

# now, there must be an easy way to do this in R, but I am just
# going to roll my own.  I want to get the cumulative branch length from
# each (late migrating) tip to node 431, and I can then take the average
# or the max of that, whichever I see fit.
MN <- max(c(desc_from_431$parent, desc_from_431$node))

# make a list with an element for each node.  The element will
# either be NULL, or it will be a list saying whether is a tip
# and who its parent is, and the edge length above it.  We will
# then write a simple recursion that starts at a tip and returns
# cumulative branch length when it hits 431.

LL <- vector(mode = "list", length = MN)
for (i in seq_along(desc_from_431$parent)) {
  LL[[desc_from_431$node[i]]]$parent <- desc_from_431$parent[i]
  LL[[desc_from_431$node[i]]]$blength <- desc_from_431$branch.length[i]
}
LL[[431]]$blength <- 0
# here is a function that traces through the parents and returns
# a vector of the branch lengths.  Then, all we have to do is
# sum those up.
recurs <- function(i) {
  if (is.null(LL[[i]])) {
    return(NA)
  }
  if (i != 431) {
    d <- LL[[i]]$blength
    names(d) <- i
    return(c(d, recurs(LL[[i]]$parent)))
  } else {
    d <- 0
    names(d) <- 431
    return(d)
  }
}

With that, we can now make get a vector of the cumulative branch lengths between every fall run haplotype and node 431:

fall_lengths <- lapply(1:nrow(desc_from_431), function(i) {
  if (is.na(desc_from_431$label[i])) {
    return(NA)
  } else {
    return(sum(recurs(i)))
  }
}) %>%
  unlist() %>%
  enframe() %>%
  filter(!is.na(value))

Here is the mean length:

mean_length <- mean(fall_lengths$value)
mean_length

## [1] 0.001509969

max_length <- max(fall_lengths$value)
max_length

## [1] 0.002009329

And we set B1 equal to that mean_length:

B1 <- mean_length

Now, what is the length from Coho to node 431:

as_tibble(cctree) %>%
  filter(label == "Coho_ref_sequence")

And the branch length there is 0.02704302:

B2 <- 0.02704302

So,

Ch <- (B1 + B2) / 2

# fraction of distance along chinook lineage since split with
# common ancestor with coho before we
# get to spring run lineage breaking off is:
FractBeforeSpring <- B1 / Ch

FractBeforeSpring

## [1] 0.1057661

OK, that is 10.6%. It is worthwile knowing that. However, it would be interesting to do something a little more formal to try to estimate the time of divergence of the \(E\) and the \(L\) lineages. There are a number of reasons that such an estimate must be taken as a rough estimate, chief among them that the assumption of a molecular clock is likely violated. Nonetheless, we will make a good faith effort at coming up with an estimate of the time of the split between the \(L\) and \(E\) linages.

Estimating the split time between \(E\) and \(L\) haplotype lineages

Why a molecular clock assumption is likely violated

The idea of a molecular clock relies quite heavily on the neutral theory’s conclusion that the rate of neutral substitutions should equal the rate at of neutral mutation. This has the nifty feature of being independent of the population size. So, that a molecular clock assumption might hold if:

you are dealing with a neutrally evolving segment of DNA
you are dealing with substitutions, as opposed to segregating variation within a species/lineage.

Neutrally evolving?

Number 1 above seems like it could be violated if the RoSA has been under selection. It certainly seems possible that mutations along the RoSA might have been subject to selective sweeps, etc. Number 2, on the other hand, is clearly violated, since, within the RoSA there are sites that segregate within the \(E\) or the \(L\) lineage. It is possible, too, that there is some gene conversion (or recombination) in this region, which means some sites might have fixed between \(E\) and \(L\) lineages, and then been transferred between the two by gene conversion and/or recombination within RoSA heterozygotes. So, I think that it would be hard to rigorously and formally justify a molecular clock assumption.

That said, let’s explore the data further with respect to the two points above. First, let us see if we can detect any evidence for different rates of mutation accumulation between the \(E\) and \(L\) lineages by comparing the distances between each haplotype of those two lineages and Coho (since, after the lineages split, differences in the rate of mutation accumulation should show up as differences in the pairwise differences).

# get the distances
dists_to_coho_tmp <- broom::tidy(chdist) %>%
  filter(item2 == "Coho_ref_sequence") %>%
  mutate(item1 = as.character(item1))

# now, get the haplotype-lineage each belongs to using the
# number of "early" alleles (from Ssums above).
hlineages <- Ssums %>%
  ungroup() %>%
  mutate(
    item1 = ifelse(str_detect(haplo_name, "-a$"), str_c(Indiv, ".a"), str_c(Indiv, ".b")),
    haplo_lineage = ifelse(NumSs < 125, "L", "E")
  ) %>%
  select(item1, haplo_lineage)

# join those on there
dists_to_coho <- dists_to_coho_tmp %>%
  left_join(hlineages, by = "item1") %>%
  select(-item2)

Now that we have those, let’s just make a plot:

dtc_plot <- ggplot(dists_to_coho, aes(x = distance, fill = haplo_lineage)) +
  geom_histogram() +
  facet_wrap(~haplo_lineage, ncol = 1) +
  scale_fill_manual(values = c(E = "gold", L = "blue"))

ggsave(dtc_plot, filename = "outputs/009/dtc_plot.pdf", width = 10, height = 7)

dtc_plot

That is a slightly higher distance to the E haplotype lineage from coho.

Let’s also just look at the average of those:

dists_to_coho %>%
  group_by(haplo_lineage) %>%
  summarise(ave_dist_to_coho = mean(distance))

Another way to go about evaluating the support for positive selection on the \(E\) lineage would be to look for evidence of selective sweeps in the form of a higher frequency of derived alleles (rare variants that have been swept to higher frequency) amongst the variable sites in the RoSA region on the \(E\) haplotypes.

Sliding window of derived allele frequency variants

Here is a quick way that we will investigate that. We will just look at the frequency of derived alleles in sites variable in chinook flanking (and within) the RoSA region upon haplotypes that carry the \(E\) vs the \(L\) RoSA lineage. We will do this in a sliding window from 10.2 Mb to 14.2 Mb.

# get the haplotype lineages again, with the dash in their name
hlineages_dash <- Ssums %>%
  ungroup() %>%
  mutate(haplo_lineage = ifelse(NumSs < 125, "L", "E")) %>%
  select(haplo_name, haplo_lineage)

# now, join that onto haps_bi and keep the Positions we want
haps_bi_lin <- haps_bi %>%
  left_join(hlineages_dash, by = "haplo_name") %>%
  filter(POS > 10.2e6, POS < 14.2e6)

# now, we compute the frequency of derived vs ancestral alleles
# at each position that have that specified (i.e. where there is
# a coho alignment). Note that we are going to do this
# only for spring and fall, and we will break it into two different
# lineages Sacto and Klamath
dfreqs <- haps_bi_lin %>%
  filter(!is.na(anc_vs_derived)) %>%
  filter(ecotype %in% c("Spring", "Fall")) %>%
  rename(basin = lineage) %>%
  mutate(basin = ifelse(basin == "Sacto", "Sacramento", "Klamath")) %>%
  group_by(POS, haplo_lineage, basin, anc_vs_derived) %>%
  tally() %>%
  spread(key = anc_vs_derived, value = n, fill = 0) %>%
  mutate(dfreq = D / (A + D))


# also, get all the positions so we can put a rug in:
posies <- tibble(x = unique(dfreqs$POS))


# now, run a smooth through that in a 50 Kb sliding window
dsmooth <- dfreqs %>%
  group_by(basin, haplo_lineage) %>%
  do(as_tibble(ksmooth(.$POS, .$dfreq, bandwidth = 50e3, n.points = 10000)))

# now, plot that
dsmooth_plot <- ggplot(dsmooth, aes(x = x / 1e6)) +
  annotate("rect", xmin = 12.26, xmax = 12.29, ymin = -Inf, ymax = Inf, fill = "pink", alpha = 0.6) +
  geom_line(aes(y = y, colour = haplo_lineage)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 25)) +
  scale_colour_manual(values = c(E = "gold", L = "blue")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  xlab("Genome position on Chromosome 28 (Mb)") +
  ylab("Average frequency of derived alleles within 50 Kb windows") +
  theme_bw() +
  facet_wrap(~basin, ncol = 1) +
  geom_rug(data = posies, size = 0.01) +
  theme(
    strip.background = element_blank(),
    panel.border = element_rect(colour = "black")
  ) +
  guides(colour = guide_legend(title = "RoSA\nHaplotype\nLineage"))


ggsave(dsmooth_plot, filename = "outputs/009/ad_plot_faceted.pdf", width = 10, height = 6)
dsmooth_plot

This is consistent with sliding window analyses (not shown) showing a higher fraction of sites have derived alleles in the spring run lineages than the fall run lineages in this region. That could be consistent with the early haplotype lineage participating in selective sweeps.

Let’also for fun, look at the fractional difference between L and E sliding windows at different positions. Let’s just add those on as other lines.

dcomps <- dsmooth %>%
  spread(key = haplo_lineage, value = y) %>%
  mutate(
    `E/L` = E / L,
    `L/E` = L / E
  )

dsmooth_plot +
  geom_line(data = dcomps, aes(y = `E/L`), colour = "red") +
  geom_line(data = dcomps, aes(y = `L/E`), colour = "orange")

Polymorphisms vs substitutions

Now, how about the issue of polymorphisms vs substitutions. For that, I want to simply count up the number of different alleles segregating amongst the E and L lineages, and see what the distribution of those allele freqs are amongst the different haplotype lineages.

We will do this by counting up whether the alleles are reference or alt.

# first filter down to variants within the RoSA region
rosa_snps_tmp <- haps_bi %>%
  filter(
    POS >= 12260034, POS <= 12289484,
    !is.na(anc_vs_derived)
  )

# then join on whether they are E or L lineage haplotypes
rosa_snps <- hlineages %>%
  mutate(
    Indiv = str_replace(item1, "\\.[ab]", ""),
    haplo = ifelse(str_detect(item1, "\\.a"), "a", "b")
  ) %>%
  select(-item1) %>%
  left_join(rosa_snps_tmp, ., by = c("Indiv", "haplo")) %>%
  mutate(ref_or_alt = ifelse(allele == REF, "REF", "ALT"))

# now, tally things up in the E and L lineages
rosa_ad_freqs <- rosa_snps %>%
  count(POS, haplo_lineage, ref_or_alt) %>%
  group_by(POS) %>%
  mutate(freq = n / sum(n)) %>%
  mutate(category = case_when(
    haplo_lineage == "E" & ref_or_alt == "REF" ~ "E_reference",
    haplo_lineage == "E" & ref_or_alt == "ALT" ~ "E_alternate",
    haplo_lineage == "L" & ref_or_alt == "REF" ~ "L_reference",
    haplo_lineage == "L" & ref_or_alt == "ALT" ~ "L_alternate",
    TRUE ~ NA_character_
  ))

Now, let’s just plot a them out in coordinate order first. We are going to abuse our colors a little bit here. E and REF will be gold (spring run color) but E and ALT will be yellow (winter run color) and we will do the same with blue (L and REF) and late-fall color (L and ALT).

freq_bars <- ggplot(rosa_ad_freqs, aes(x = factor(POS), y = freq, fill = category)) +
  geom_col() +
  scale_fill_manual(values = c(
    E_reference = "gold",
    E_alternate = "#ffff99",
    L_reference = "blue",
    L_alternate = "#a6cee3"
  ))

ggsave(freq_bars, filename = "outputs/009/freq_bars.pdf", width = 12, height = 8)

freq_bars

And report the number of these filtered variants:

length(unique(rosa_ad_freqs$POS))

## [1] 144

That makes it pretty clear that you wouldn’t want to consider all of these to be substitutions. So, that makes it tough. We could try filtering it down to only those that are very close to being fixed in either lineage. We could do that quickly and dirtily by taking the sum of the two largest frequencies at each position and only accepting it if it is greater than 0.98 (that gives us a little slop for genotyping error, etc.) Let’s see what that looks like:

rosa_ad_masked <- rosa_ad_freqs %>%
  group_by(POS) %>%
  mutate(call_it_a_sub = sum(sort(freq, decreasing = TRUE)[1:2]) > 0.99) %>%
  mutate(freq2 = ifelse(call_it_a_sub, freq, NA))

freq_bars2 <- ggplot(rosa_ad_masked, aes(x = factor(POS), y = freq2, fill = category)) +
  geom_col() +
  scale_fill_manual(values = c(
    E_reference = "gold",
    E_alternate = "#ffff99",
    L_reference = "blue",
    L_alternate = "#a6cee3"
  ))

ggsave(freq_bars2, filename = "outputs/009/freq_bars_substi.pdf", width = 12, height = 8)

## Warning: Removed 305 rows containing missing values (position_stack).

freq_bars2

## Warning: Removed 305 rows containing missing values (position_stack).

And report the number of these that were considered subtitutions:

rosa_ad_masked %>%
  filter(!is.na(freq2)) %>%
  count(POS) %>%
  nrow()

## [1] 50

So, we could consider assuming that all of these sites shown represent substitutions, and then try to time the split between them with those. Some of the problems with this is that it may be that, at one time, other sites became fixed for alternate alleles within each lineage (i.e. were actually substitutions within one of the lineages) but then the other allele was introduced back to the lineage by recombination or gene conversion. This would tend to make the estimated time of the split smaller than it should be. On the other hand, maybe these are not proper substitutions even though they appear fixed in this sample.

OK, let’s do that all more formally.

Phylogenetic inference from the apparent substitutions

Just a reminder here: we are dealing with only the SNPs (not dealing with indels and rearrangements, etc) to do a phylogenetic reconstruction. For this exercise, we will have only 3 taxa: E-haplotype lineage, L-haplotype lineage, and coho salmon (the outgroup).

Make sequence for the \(E\) and \(L\) lineage

For these guys, we want to make a tibble we can feed into insert_variants_into_sequences(). That is something that has Indiv, haplo, POS, and allele. We will do this by summarising and figuring out what the alleles are in the different lineages.

vars_for_seqs <- rosa_ad_masked %>%
  filter(call_it_a_sub == TRUE) %>%
  semi_join(rosa_snps, ., by = "POS") %>%
  count(POS, haplo_lineage, allele) %>%
  arrange(POS, haplo_lineage, desc(n)) %>%
  group_by(POS, haplo_lineage) %>%
  slice(1) %>%
  mutate(
    Indiv = ifelse(haplo_lineage == "E", "E-haplo-lineage", "L-haplo-lineage"),
    haplo = "a"
  )


# now insert into sequences
dump <- ecaRbioinf::insert_variants_into_sequences(
  V = vars_for_seqs,
  fasta = "intermediates/009/NC_037124.1_fragment_chinook.fna.gz",
  lo = 12260034,
  hi = 12289484,
  file = "intermediates/009/haplo-lineage-fixed-snps.fa"
)

Finally, put the coho in there and we should be ready to go to the next step.

cat intermediates/009/haplo-lineage-fixed-snps.fa intermediates/009/coho-rosa-seq.fa > intermediates/009/three-taxa.fa

ML Phylogenetic Inference

Read the seqs:

seq3taxa <- ape::read.FASTA("intermediates/009/three-taxa.fa")
names(seq3taxa) <- c("e_haplo", "l_haplo", "coho")

phydat3taxa <- phangorn::as.phyDat(seq3taxa)

Make a neighbor-joining tree and plot it.

nj3taxa <- ape::nj(dist.dna(seq3taxa))
plot(nj3taxa)

Also compute the likelihood of such a tree under a super simple model:

nj3t_logl <- phangorn::pml(nj3taxa, phydat3taxa)
nj3t_logl

## 
##  loglikelihood: -45570.46 
## 
## unconstrained loglikelihood: -50727.69 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
## 0.25 0.25 0.25 0.25

Now, use that as a starting tree, and compute a maximum likelihood phylogeny, whilst optimizing base frequencies and substitution rates, and using a gamma distribution for variation in substitution rates.

mlp3taxa <- phangorn::optim.pml(nj3t_logl, optNni = TRUE, optBf = TRUE, optQ = TRUE, optGamma = TRUE, optEdge = TRUE)

## optimize edge weights:  -45570.46 --> -45570.4 
## optimize base frequencies:  -45570.4 --> -45264.12 
## optimize rate matrix:  -45264.12 --> -45192.17 
## optimize edge weights:  -45192.17 --> -45192.17 
## optimize base frequencies:  -45192.17 --> -45192.17 
## optimize rate matrix:  -45192.17 --> -45192.17 
## optimize edge weights:  -45192.17 --> -45192.17 
## optimize base frequencies:  -45192.17 --> -45192.17 
## optimize rate matrix:  -45192.17 --> -45192.17 
## optimize edge weights:  -45192.17 --> -45192.17

mlp3taxa

## 
##  loglikelihood: -45192.17 
## 
## unconstrained loglikelihood: -50727.69 
## 
## Rate matrix:
##           a         c         g         t
## a 0.0000000 0.9920401 1.7852439 0.7283919
## c 0.9920401 0.0000000 0.4520061 1.7555913
## g 1.7852439 0.4520061 0.0000000 1.0000000
## t 0.7283919 1.7555913 1.0000000 0.0000000
## 
## Base frequencies:  
## 0.2993366 0.2155577 0.2163758 0.2687299

Does that improve things? Yes.

AIC(nj3t_logl)

## [1] 91146.92

AIC(mlp3taxa)

## [1] 90406.34

Plot the resulting tree:

plot(mlp3taxa$tree)

And write it out in Newick format

ape::write.tree(root(mlp3taxa$tree, outgroup = "coho", resolve.root = TRUE), file = "intermediates/009/3-taxa.nwk")

We can use that tree to make similar calculations of the times. We will compare the average branch length from the L-E split to each of L and E to 1/2 the distance from the L-E split to Coho:

LE2L <- mlp3taxa$tree$edge.length[2]
LE2E <- mlp3taxa$tree$edge.length[3]
Coho2LE <- mlp3taxa$tree$edge.length[1]

# here is notation that follows the supplement
B <- mean(c(LE2E, LE2L))
C <- Coho2LE



# this is a rough estimate of the fraction of half the branch from Coho from
# to "average" chinook that involves the divergence between L and E.
bFract <- B / (0.5 * (B + C))

# So, if the split time is estimated to be (min, est, max) of
cc_split_times <- c(5.43, 7.46, 9.51) # MYA from timetree.org
# Kumar S, Stecher G, Suleski M, Hedges SB (2017) TimeTree: A Resource for Timelines, Timetrees, and Divergence Times. Mol Biol Evol doi:10.1093/molbev/msx116

# then our estimate, surrounded by CIs would be:
cc_split_times * bFract

## [1] 0.3279036 0.4504901 0.5742843

In millions of years.

Make a figure with (a) and (b) for the supplement

tree_and_smooth <- cowplot::plot_grid(ccfinal,
  dsmooth_plot,
  nrow = 2,
  labels = c("A", "B"), rel_heights = c(.35, .65)
)

ggsave(tree_and_smooth, filename = "outputs/009/tree-and-smooth.pdf", width = 10, height = 10)

Session Info

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2020-05-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  ape         * 5.3     2019-03-17 [1] CRAN (R 3.6.0)
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.6   2020-04-05 [1] CRAN (R 3.6.2)
##  BiocManager   1.30.10 2019-11-16 [1] CRAN (R 3.6.0)
##  broom       * 0.5.6   2020-04-20 [1] CRAN (R 3.6.2)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.0)
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.0)
##  cowplot     * 1.0.0   2019-07-11 [1] CRAN (R 3.6.0)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
##  DBI           1.1.0   2019-12-15 [1] CRAN (R 3.6.0)
##  dbplyr        1.4.3   2020-04-19 [1] CRAN (R 3.6.2)
##  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.5   2020-03-07 [1] CRAN (R 3.6.0)
##  ecaRbioinf  * 0.1.0   2020-02-13 [1] local         
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.0)
##  farver        2.0.3   2020-01-16 [1] CRAN (R 3.6.0)
##  fastmatch     1.1-0   2017-01-28 [1] CRAN (R 3.6.0)
##  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 3.6.0)
##  fs            1.4.1   2020-04-04 [1] CRAN (R 3.6.2)
##  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.0   2020-03-05 [1] CRAN (R 3.6.0)
##  ggtree      * 2.0.4   2020-04-13 [1] Bioconductor  
##  glue          1.4.0   2020-04-03 [1] CRAN (R 3.6.2)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.2.0   2019-11-08 [1] CRAN (R 3.6.0)
##  hms           0.5.3   2020-01-08 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.0)
##  igraph        1.2.5   2020-03-19 [1] CRAN (R 3.6.0)
##  jsonlite      1.6.1   2020-02-02 [1] CRAN (R 3.6.0)
##  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.0)
##  labeling      0.3     2014-08-23 [1] CRAN (R 3.6.0)
##  lattice       0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.0)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.0)
##  lubridate     1.7.8   2020-04-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
##  Matrix        1.2-18  2019-11-27 [2] CRAN (R 3.6.2)
##  modelr        0.1.6   2020-02-22 [1] CRAN (R 3.6.0)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
##  phangorn    * 2.5.5   2019-06-19 [1] CRAN (R 3.6.0)
##  pillar        1.4.3   2019-12-20 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
##  plyr          1.8.6   2020-03-03 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 3.6.2)
##  quadprog      1.5-8   2019-11-20 [1] CRAN (R 3.6.0)
##  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.4   2020-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0   2019-05-16 [1] CRAN (R 3.6.0)
##  reshape2      1.4.4   2020-04-09 [1] CRAN (R 3.6.2)
##  rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.0)
##  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.0)
##  rstudioapi    0.11    2020-02-07 [1] CRAN (R 3.6.0)
##  rvcheck       0.1.8   2020-03-01 [1] CRAN (R 3.6.0)
##  rvest         0.3.5   2019-11-08 [1] CRAN (R 3.6.0)
##  scales        1.1.0   2019-11-18 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1   2020-04-20 [1] CRAN (R 3.6.2)
##  tidyr       * 1.0.2   2020-01-24 [1] CRAN (R 3.6.0)
##  tidyselect    1.0.0   2020-01-27 [1] CRAN (R 3.6.0)
##  tidytree      0.3.3   2020-04-02 [1] CRAN (R 3.6.2)
##  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 3.6.0)
##  treeio        1.10.0  2019-10-29 [1] Bioconductor  
##  vctrs         0.2.4   2020-03-10 [1] CRAN (R 3.6.0)
##  withr         2.2.0   2020-04-20 [1] CRAN (R 3.6.2)
##  xfun          0.13    2020-04-13 [1] CRAN (R 3.6.2)
##  xml2          1.3.2   2020-04-23 [1] CRAN (R 3.6.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.0)
## 
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:

Sys.time() - start_time

## Time difference of 2.868241 mins

RoSA DNA Distance Trees

Eric C. Anderson

Last Updated: 2020-05-14