Here we produce the haplo-raster plots. One with the alleles and the other with the read depths in each fish.

Packages and paths

library(tidyverse)
library(ecaRbioinf) # will put on github eventually


dir.create("outputs/006", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/006", recursive = TRUE, showWarnings = FALSE)

Load up needed data from previous steps

Grab the big_haps2.rds from the 004 notebook, and also the Johnson Creek fish. Choose here whether to use the stored big_haps (that we used) or others generated by BEAGLE that might be subtly different. In this works there is little difference, but, because BEAGLE randomly imputes and phases things, if you want to get a figure that looks just like the one we produced, you will want to use the stored results from 004.

USE_STORED_BIG_HAPS <- FALSE

if (USE_STORED_BIG_HAPS != TRUE) {
  big_haps2 <- read_rds(path = "outputs/004/big_haps2.rds")
  jc_fish <- read_rds(path = "outputs/004/jc_fish_haps.rds")

  # Make a data frame with everyone, Johnson Creek included:
  big_haps2_jc <- bind_rows(big_haps2, jc_fish)
} else {
  big_haps2 <- read_rds("stored_results/004/big_haps2.rds")
  big_haps2_jc <- read_rds("stored_results/004/big_haps2_with_jc.rds")
}

haps_fix <- read_rds(path = "outputs/004/big_haps2_fixed_info.rds")

First compile auxiliary genome features

Here we get data that is used for the annotation rows below the rasters.

Whether snps are in genes, exons, CDSs…

Intersect the gff file with the variants, after whittling it down to one individual

bcftools view -s DPCh_plate1_A01_S1  data/greb1l-ish-region.vcf.gz -Oz > intermediates/006/one-sample-vcf-greb1l-ish.vcf.gz

bedtools intersect -a intermediates/006/one-sample-vcf-greb1l-ish.vcf.gz  -b genome/GCF_002872995.1_Otsh_v1.0_genomic.gff.gz -wb > intermediates/006/variant-annotations.tsv

Then read that into a tibble

snp_anno <- read_tsv("intermediates/006/variant-annotations.tsv",
  col_names = c(
    "CHROM",
    "POS",
    "dot1",
    "REF",
    "ALT",
    "VSCORE",
    "dot2",
    "dot3",
    "format",
    "a_sample",
    "CHROM2",
    "method",
    "what",
    "start",
    "stop",
    "dot4",
    "strand",
    "phase",
    "biggie"
  )
)

# look at what we have here
snp_anno %>%
  count(what)

So, for a quick visualization, I am first going to classify sites as belonging to genes, pseudo-genes, or non-genes.

Subsequently, I will classify things within genes as exon, CDS, or other.

gene_stuff_func <- function(what) {
  if (any(what == "gene")) {
    ret <- "gene"
  } else if (any(what == "pseudogene")) {
    ret <- "pseudogene"
  } else {
    ret <- "non-gene"
  }
  ret
}
gene_stuff <- snp_anno %>%
  group_by(CHROM, POS) %>%
  summarise(ginc = gene_stuff_func(what))

exon_stuff_func <- function(what) {
  if (any(what == "CDS")) {
    ret <- "CDS"
  } else if (any(what == "exon")) {
    ret <- "exon"
  } else {
    ret <- "non-exon"
  }
  ret
}

# define exon_stuff only for things found in genes
exon_stuff <- snp_anno %>%
  semi_join(gene_stuff %>% filter(ginc == "gene"), by = c("CHROM", "POS")) %>%
  group_by(CHROM, POS) %>%
  summarise(einc = exon_stuff_func(what))

Codon position of the SNPs that are in coding seqeunces

We use snpEff’s designation of synonymous or missense. I will just define things that are missense as codon_pos1 and synonymous ones as codon_pos3. I am doing this because I am pretty sure that the haplo raster function is hard-wired to use those designations, rather than synonymous/missense.

First, get the SNPs in CDS

cds_snps <- snp_anno %>%
  filter(what == "CDS")

# read it in
snpeff <- read_rds("outputs/005/variants-and-annos-5.16.rds")

# only retain positions that are MODERATE or HIGH (this kicks out the synonymous ones)
HiMod <- snpeff %>%
  filter(snp_eff_effect %in% c("MODERATE", "HIGH")) %>%
  select(CHROM, POS, snp_eff_effect, snp_eff_what)

cds_snps <- cds_snps %>%
  mutate(syn_v_nonsyn_codon_pos = ifelse(POS %in% HiMod$POS, "codon_pos1", "codon_pos3"))

Repeat content around each SNP

I want to include at each SNP in the raster, some information about how much of the sequence nearby is considered repeat sequence. This might also be helpful in the future for other species when we go to design amplicons for sequencing.

I am going to do this just by designating each base with a lowercase letter part of a repeat-masked segment. The strategy will be to pull out the sequence at every position, then make a tibble and rollapply a window over it. The we can join these onto the positions used in each raster.

First, get chinook sequence in the region we are working with, which is from about 9.66 Mb to 14.83 Mb.

samtools faidx genome/Otsh_v1.0_genomic.fna NC_037124.1:9660000-14830000 > intermediates/006/greb-ish-seq-chinook.fna

gzip -f intermediates/006/greb-ish-seq-chinook.fna

Now, read it into a tibble:

dna_tibble <- read_lines("intermediates/006/greb-ish-seq-chinook.fna.gz")[-1] %>%
  paste(., collapse = "") %>%
  strsplit(., "") %>%
  .[[1]] %>%
  tibble(
    POS = seq(9660000, 14830000, by = 1),
    dna = .
  ) %>%
  mutate(inRep = dna %in% c("a", "c", "g", "t"))

# now rollapply that dude in a 200 bp window centered on the current base
dna_repetitive_200 <- dna_tibble %>%
  mutate(rep_mean_200 = zoo::rollmean(inRep, 200, fill = NA, na.rm = TRUE))

Positions of the RoSA markers and other SNPs

We will add dashed lines around the columns indicating the RoSA SNPs and also the distal SNPs identified in Prince et al. (2017). We have all those positions in the data directory.

h_pos_for_paper <- read_tsv("data/assay-positions.tsv") %>%
  rename(
    name = Marker_name,
    POS = Position_on_Chr_28_in_Otsh_v1.0
  ) %>%
  select(name, POS) %>%
  mutate(POS = as.integer(POS))

Calculations to order haplotypes by number of spring-alleles in the RoSA

Order haplos according to the number of spring alleles

# focus on a smaller region
rosa2 <- big_haps2 %>%
  filter(POS > 12.05e6, POS < 12.5e06)

While we are here quickly deduce the RoSA genotype of the reference fish, by focusing only on the SNPs from 12.26 to 12.29

rosa2 %>%
  filter(POS > 12.26e6, POS < 12.29e6) %>%
  group_by(POS, REF, ALT, spring_allele) %>%
  tally() %>%
  select(-n) %>%
  mutate(spring_is_ref = REF == spring_allele) %>%
  group_by(spring_is_ref) %>%
  tally()

So, that is saying that the reference fish here has 68 spring alleles and 134 fall alleles in the region from 12.26 to 12.29 Mb. We can compare that to the distribution we see in our fish later on when we are determining heterozygotes. 68 spring alleles puts it solidly within what we expect for a fall run fish.

Order everyone by number of springer alleles in there:

rosa2_sorted <- rosa2 %>%
  group_by(haplo_name) %>%
  mutate(sumS = sum(alle2 == "S")) %>%
  ungroup() %>%
  arrange(desc(sumS), haplo_name, POS)

The above data frame has haplotypes ordered from “most springy in the RoSA” to “least springy in the RoSA.”

Define different orders for the haplotypes

We want to order haplotypes by ecotype, then lineage, then population then by springyness. We make one order that includes the Johnson Creek fish, and then another one that does not (for the read depth raster).

# first, order by ecotype, and then within that, spring-run-assoc alleles
htmp <- rosa2_sorted %>%
  mutate(hnames = haplo_name) %>%
  mutate(efact = factor(ecotype, levels = c("Spring", "Winter", "Late_Fall", "Fall"))) %>%
  arrange(efact, desc(sumS))

# then, order by ecotype, and within that, springy-ness of the full individual.
# this would put the two haplotypes within an individual next to one another
ht2 <- htmp %>%
  group_by(Indiv) %>%
  mutate(sumS_indiv = sum(sumS)) %>%
  ungroup() %>%
  arrange(efact, desc(sumS_indiv), Indiv, sumS)


# ht2 is not what we want, though.  So,
# here let us order by ecotype, and then lineage, then pop, and only then on springyness
ht3 <- ht2 %>%
  arrange(efact, desc(lineage), pop, desc(sumS), Indiv)

h_ord_eco_line_pop_springy <- unique(ht3$hnames)

# here is a hacky thing to make an ordering in which the Johnson Creek fish
# is in there between the spring and the winters
last_trin_spring <- max(which(str_detect(h_ord_eco_line_pop_springy, "Trinity_River_Hatchery_Spring")))

h_ord_eco_line_pop_springy_jc <- c(
  h_ord_eco_line_pop_springy[1:last_trin_spring],
  "Narum_et_al_Chinook_Assembly-a",
  h_ord_eco_line_pop_springy[-(1:last_trin_spring)]
)

Prepare the tibbles for the row and column annotations

Make the annotation columns tibble

This is just to get the lineage and the ecotype in a decent form:

tmp <- c(1, 2, 3)
names(tmp) <- c("pop", "lineage", "ecotype")
annotation_columns <- rosa2_sorted %>%
  mutate(hnames = haplo_name) %>%
  select(hnames, lineage, ecotype, pop) %>%
  count(hnames, lineage, ecotype, pop) %>%
  select(-n) %>%
  gather(key = "column", value = "value", -hnames) %>%
  mutate(
    width = 0.015,
    order = tmp[column]
  )

# now, I am going to add a label column which is what we want printed on each of these

Make the row annotations tibble

I want to add to that a row at the bottom which says whether the “spring” allele is ancestral or derived or unknown (because there is no Coho sequence at that point)

ancy_row <- big_haps2 %>%
  count(POS, spring_allele) %>%
  select(-n) %>%
  left_join(haps_fix %>% select(POS, AA), by = "POS") %>%
  mutate(
    value = ifelse(spring_allele == AA, "ancestral", "derived"),
    row = "spring_anc_or_derived",
    order = 3,
    height = 0.02
  )

I also want to have a row below the exon info that shows whether the snp is synonymous or missense.

codon_row <- cds_snps %>%
  select(POS, syn_v_nonsyn_codon_pos) %>%
  rename(value = syn_v_nonsyn_codon_pos) %>%
  mutate(
    row = "codon_stuff",
    order = 2,
    height = 0.02
  )

So make a tibble with everything. We can semi-join this thing for different data sets.

annotation_rows_full <- bind_rows(
  exon_stuff %>%
    ungroup() %>%
    select(POS, einc) %>%
    rename(value = einc) %>%
    mutate(
      row = "exon_stuff",
      order = 1,
      height = 0.02
    ),
  codon_row,
  ancy_row
) %>%
  mutate(POS = as.integer(POS))

Read in our color schemes and a few helper functions

source("R/define_fcolors_all_sf.R")
source("R/function_prep_elements.R")
source("R/function_extra_labels.R")

Now we are ready to make our plots

Moderate zoom allelic-type haplo raster with Johnson Creek fish, hets not outlined in orange

plist <- prep_elements(12.1e6, 12.6e6, 0.05e6, format = "%.2f", DSET = big_haps2_jc)

## Warning: Column `order` has different attributes on LHS and RHS of join

jca <- haplo_raster_plot(
  D = plist$D,
  h_ord = h_ord_eco_line_pop_springy_jc,
  pos_annot = plist$pos_annot,
  pos_bar_text_size = 8.0,
  highlight_pos = h_pos_for_paper,
  annotation_columns = annotation_columns,
  annotation_rows = plist$ann_rows,
  fcolors = fcolors_all_sf,
  anno_row_start = 10,
  snp_quant_tibble = plist$repcontent,
  no_legend = TRUE,
  no_y_labels = TRUE,
  no_x_labels = TRUE
)

## Warning: Column `order` has different attributes on LHS and RHS of join

jca2 <- extra_labels(
  g = jca, D = plist$D, h_ord = h_ord_eco_line_pop_springy_jc,
  xmids = plist$ac_x_midpoints, ymids = plist$ar_y_midpoints,
  row_just = 0.5
) +
  expand_limits(x = -130, y = 337)

# now, fill in the white space in the annotation columns with the gray color.
# this is totally a hard-wired hack!
jcy <- which(rev(h_ord_eco_line_pop_springy_jc) == "Narum_et_al_Chinook_Assembly-a")
jca3 <- jca2 +
  annotate("rect", xmin = -120, xmax = 0, ymin = jcy - 0.5, ymax = jcy + 0.5, colour = NA, fill = "darkgray")


ggsave(jca3, filename = "outputs/006/haplo-raster-alleles.pdf", width = 49, height = 30)

## Warning: Removed 13 rows containing missing values (geom_segment).

## Warning: Removed 1 rows containing missing values (geom_text).

## Warning: Removed 1 rows containing missing values (geom_text).

Moderate Zoom read-depths, sorted by ecotype-lineage-pop-springy

We need to figure out the positions for the orange rectangle. It should be from 12.311 to about 12.3215. So, find the variant numbers associated with that. That means we have to prep the elements now…

plist <- prep_elements(12.1e6, 12.6e6, 0.05e6, format = "%.2f")

## Warning: Column `order` has different attributes on LHS and RHS of join

inner_variants <- tibble(POS = unique(plist$D$POS)) %>%
  ungroup() %>%
  mutate(idx = 1:n()) %>%
  filter(POS > 12.311e6, POS < 12.3215e6)

niv <- nrow(inner_variants)

# left and right points
left_x <- inner_variants$idx[1] - 2
right_x <- inner_variants$idx[niv] + 2

# top point is from about the 6th winter run haplotype
# bottom point is first fish
fish_ords <- tibble(fish = h_ord_eco_line_pop_springy) %>%
  mutate(idx = 1:n()) %>%
  mutate(winter_num = cumsum(str_detect(fish, "Winter")))

top_y <- nrow(fish_ords) - fish_ords$idx[fish_ords$winter_num == 5] + 1
bot_y <- nrow(fish_ords) - fish_ords$idx[nrow(fish_ords)] + 1

g <- haplo_raster_plot(
  D = plist$D,
  h_ord = h_ord_eco_line_pop_springy,
  pos_annot = plist$pos_annot,
  pos_bar_text_size = 8.0,
  highlight_pos = h_pos_for_paper,
  annotation_columns = annotation_columns,
  annotation_rows = plist$ann_rows,
  fcolors = fcolors_all_sf,
  anno_row_start = 10,
  snp_quant_tibble = plist$repcontent,
  plot_read_depths = TRUE,
  no_y_labels = TRUE,
  no_x_labels = TRUE
)

## Warning: Column `order` has different attributes on LHS and RHS of join

g2 <- extra_labels(
  g = g, D = plist$D, h_ord = h_ord_eco_line_pop_springy,
  xmids = plist$ac_x_midpoints, ymids = plist$ar_y_midpoints,
  row_just = 0.5
) +
  expand_limits(x = -130, y = 337) +
  theme(
    legend.key.width = unit(1.3, "in"),
    legend.key.height = unit(2, "in"),
    legend.text = element_text(size = 48.0),
    legend.title = element_text(size = 48.0)
  ) +
  annotate("rect", xmin = left_x, xmax = right_x, ymin = bot_y, ymax = top_y, fill = NA, colour = "orange", alpha = 0.5, size = 4)

ggsave(g2, filename = "outputs/006/haplo-raster-read-depths.pdf", width = 49, height = 30)

## Warning: Transformation introduced infinite values in discrete y-axis

## Warning: Transformation introduced infinite values in discrete y-axis

## Warning: Removed 13 rows containing missing values (geom_segment).

Session Info

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2020-05-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.6   2020-04-05 [1] CRAN (R 3.6.2)
##  broom         0.5.6   2020-04-20 [1] CRAN (R 3.6.2)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.0)
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
##  DBI           1.1.0   2019-12-15 [1] CRAN (R 3.6.0)
##  dbplyr        1.4.3   2020-04-19 [1] CRAN (R 3.6.2)
##  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.5   2020-03-07 [1] CRAN (R 3.6.0)
##  ecaRbioinf  * 0.1.0   2020-02-13 [1] local         
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.0)
##  farver        2.0.3   2020-01-16 [1] CRAN (R 3.6.0)
##  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 3.6.0)
##  fs            1.4.1   2020-04-04 [1] CRAN (R 3.6.2)
##  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.0   2020-03-05 [1] CRAN (R 3.6.0)
##  glue          1.4.0   2020-04-03 [1] CRAN (R 3.6.2)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.2.0   2019-11-08 [1] CRAN (R 3.6.0)
##  hms           0.5.3   2020-01-08 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.6.1   2020-02-02 [1] CRAN (R 3.6.0)
##  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.0)
##  lattice       0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.0)
##  lubridate     1.7.8   2020-04-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
##  modelr        0.1.6   2020-02-22 [1] CRAN (R 3.6.0)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
##  pillar        1.4.3   2019-12-20 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 3.6.2)
##  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.4   2020-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0   2019-05-16 [1] CRAN (R 3.6.0)
##  rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.0)
##  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.0)
##  rstudioapi    0.11    2020-02-07 [1] CRAN (R 3.6.0)
##  rvest         0.3.5   2019-11-08 [1] CRAN (R 3.6.0)
##  scales        1.1.0   2019-11-18 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1   2020-04-20 [1] CRAN (R 3.6.2)
##  tidyr       * 1.0.2   2020-01-24 [1] CRAN (R 3.6.0)
##  tidyselect    1.0.0   2020-01-27 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 3.6.0)
##  vctrs         0.2.4   2020-03-10 [1] CRAN (R 3.6.0)
##  viridisLite   0.3.0   2018-02-01 [1] CRAN (R 3.6.0)
##  withr         2.2.0   2020-04-20 [1] CRAN (R 3.6.2)
##  xfun          0.13    2020-04-13 [1] CRAN (R 3.6.2)
##  xml2          1.3.2   2020-04-23 [1] CRAN (R 3.6.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.0)
##  zoo           1.8-7   2020-01-10 [1] CRAN (R 3.6.0)
## 
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:

Sys.time() - start_time

## Time difference of 1.29479 mins

Citations

Prince, Daniel J., Sean M. O’Rourke, Tasha Q. Thompson, Omar A. Ali, Hannah S. Lyman, Ismail K. Saglam, Thomas J. Hotaling, Adrian P. Spidle, and Michael R. Miller. 2017. “The Evolutionary Basis of Premature Migration in Pacific Salmon Highlights the Utility of Genomics for Informing Conservation.” Science Advances 3 (8): e1603198. doi:10.1126/sciadv.1603198.

Haplo-raster Plots

Eric C. Anderson

Last Updated: 2020-05-14