The goal here is to infer haplotypes in the GREB1L region for use in later analyses. We work from a small portion of the VCF that is saved in this repository at data/greb1l-ish-region.vcf.gz.

Packages and paths

library(tidyverse)
library(vcfR)

dir.create("outputs/004", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/004", recursive = TRUE, showWarnings = FALSE)

Manipulating the VCF file

Get sample names and prepare a file to reheader the VCF

samples <- read_csv("data/wgs-chinook-samples.csv") %>%
  mutate(newname = paste0(Population, "-", NMFS_DNA_ID)) %>%
  mutate(newname = str_replace_all(newname, " +", "_"))

new_names <- samples %>%
  select(vcf_name, newname)

write_tsv(new_names, "intermediates/004/new_names_for_vcf.txt", col_names = FALSE)
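As a quick sanity check (a sketch, not part of the original pipeline), we can read the file back: bcftools reheader -s accepts either one new name per line or whitespace-separated old-name/new-name pairs, and we wrote the latter.

# Sketch: the reheader map should have two tab-separated columns,
# old sample name then new sample name, one line per sample.
read_tsv("intermediates/004/new_names_for_vcf.txt",
         col_names = c("vcf_name", "newname")) %>%
  head()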

Use bcftools to reheader the file with new sample names

The VCF we are reheadering covers about 5 Mb around the GREB1L RoSA.

bcftools reheader -s intermediates/004/new_names_for_vcf.txt data/greb1l-ish-region.vcf.gz > intermediates/004/fish_for_tree.vcf.gz

Filter sites

We will filter sites with vcftools. Criteria:

  1. Biallelic sites only (this drops only about 1,000 of the roughly 49,000 sites, which seems reasonable).
  2. A genotype call in the VCF for more than 40% of the individuals.
  3. Minor allele frequency > 0.05.

vcftools --gzvcf intermediates/004/fish_for_tree.vcf.gz --min-alleles 2 --max-alleles 2 --max-missing 0.4 --maf 0.05 --out intermediates/004/greb1l-ish-filtered --recode 
## 
## VCFtools - v0.1.12b
## (C) Adam Auton and Anthony Marcketta 2009
## 
## Parameters as interpreted:
##  --gzvcf intermediates/004/fish_for_tree.vcf.gz
##  --maf 0.05
##  --max-alleles 2
##  --min-alleles 2
##  --max-missing 0.4
##  --out intermediates/004/greb1l-ish-filtered
##  --recode
## 
## Using zlib version: 1.2.5
## After filtering, kept 160 out of 160 Individuals
## Outputting VCF file...
## After filtering, kept 25539 out of a possible 49264 Sites
## Run Time = 9.00 seconds

Attach ancestral alleles from coho

See 200-ancestral-states-from-coho.Rmd for details of the alignment. I put the resulting FASTA, which is congruent with the Chinook genome coordinates, into stored_results/001/NC_037124.1_coho.fna.gz, and it is faidx-indexed.

Now we attach the ancestral alleles to the VCF. (Sites without a known ancestral state come through with AA = "N"; those get marked as missing downstream.)

echo '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele">' > aa.hdr

bcftools +fill-from-fasta intermediates/004/greb1l-ish-filtered.recode.vcf -- -c AA -f stored_results/001/NC_037124.1_coho.fna.gz -h aa.hdr  > intermediates/004/almost-ready-for-imputation.vcf 
rm aa.hdr
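If you want to confirm that the AA tag was attached, here is a quick check with vcfR (a sketch; not needed for the pipeline):

# Sketch: tally how many sites carry an AA (ancestral allele) INFO value
aa_check <- read.vcfR("intermediates/004/almost-ready-for-imputation.vcf", verbose = FALSE)
table(is.na(extract.info(aa_check, element = "AA")))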

Extract Read Depth Information for those Genotypes

We want the read depths so that we can display that information in plots later on. Let's extract them:

rfi <- read.vcfR("intermediates/004/almost-ready-for-imputation.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 14060
##   header_line: 14061
##   variant count: 25539
##   column count: 169
## 
Meta line 14060 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 25539
##   Character matrix gt cols: 169
##   skip: 0
##   nrows: 25539
##   row_num: 0
## 
Processed variant: 25539
## All variants processed
rfi_depths <- vcfR2tidy(rfi)$gt %>%
  select(Indiv, POS, gt_DP) %>%
  mutate(DP = ifelse(is.na(gt_DP), 0, gt_DP)) %>% # sites with missing genotypes get a depth of 0
  select(-gt_DP)

While we are at it, let's look at the average per-site read depth of each fish:

dp_means <- rfi_depths %>%
  group_by(Indiv) %>%
  summarise(mean_depth = mean(DP)) %>%
  arrange(mean_depth)
dp_means

And now look at that:

ggplot(dp_means, aes(x = mean_depth)) +
  geom_histogram(binwidth = 0.1)

We are going to toss any individual with an average read depth below 0.45; that cutoff catches all of the really poor samples. Sadly, these are mostly fish that were sampled as carcasses in the Salmon River, but we want our data clean for haplotyping, so we drop those individuals at this stage.

dp_means %>%
  filter(mean_depth > 0.45) %>%
  .$Indiv %>%
  cat(sep = "\n", file = "intermediates/004/suff-read-depth-inds.txt")

Now we subset the VCF to just those individuals:

vcftools --vcf intermediates/004/almost-ready-for-imputation.vcf --keep intermediates/004/suff-read-depth-inds.txt --out intermediates/004/ready-for-imputation-suff-rd --recode 

cp intermediates/004/ready-for-imputation-suff-rd.recode.vcf intermediates/004/ready-for-imputation.vcf
## 
## VCFtools - v0.1.12b
## (C) Adam Auton and Anthony Marcketta 2009
## 
## Parameters as interpreted:
##  --vcf intermediates/004/almost-ready-for-imputation.vcf
##  --keep intermediates/004/suff-read-depth-inds.txt
##  --out intermediates/004/ready-for-imputation-suff-rd
##  --recode
## 
## Keeping individuals in 'keep' list
## After filtering, kept 146 out of 160 Individuals
## Outputting VCF file...
## After filtering, kept 25539 out of a possible 25539 Sites
## Run Time = 8.00 seconds

Impute and phase with BEAGLE

Others will have to adjust their paths to BEAGLE (set in script/java-jar-paths.sh). First, the imputation, which runs from the genotype likelihoods (gl=):

source script/java-jar-paths.sh

echo "Using Beagle jar at: $BEAGLE"
java -jar $BEAGLE gl=intermediates/004/ready-for-imputation.vcf  out=intermediates/004/greb1l-imputed > intermediates/004/beagle-impute-stdout.txt 
## Using Beagle jar at: /Users/eriq/Documents/others_code/BEAGLE_4.1/beagle.27Jan18.7e1.jar

And then the haplotype phasing, run on the imputed genotypes (gt=):

source script/java-jar-paths.sh

java -jar $BEAGLE gt=intermediates/004/greb1l-imputed.vcf.gz out=intermediates/004/greb1l-imp-phased > intermediates/004/beagle-phase-stdout.txt 

And after that, we reattach the ancestral states:

# we have to do this because BEAGLE doesn't put the contig name in the header
tabix -f intermediates/004/greb1l-imp-phased.vcf.gz

echo '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele">' > aa.hdr

bcftools +fill-from-fasta intermediates/004/greb1l-imp-phased.vcf.gz -- -c AA -f stored_results/001/NC_037124.1_coho.fna.gz -h aa.hdr  > intermediates/004/greb1l-imp-phased-with-anc.vcf
rm aa.hdr

OK, now we have intermediates/004/greb1l-imp-phased-with-anc.vcf which is a VCF file of phased haplotypes along with an ancestral allele (AA) tag in the INFO.
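As a quick optional check (a sketch), the GT fields should now contain "|"-separated alleles, which is how phased genotypes are written:

# Sketch: phased genotypes use "|" rather than "/" as the allele separator
ph <- read.vcfR("intermediates/004/greb1l-imp-phased-with-anc.vcf", verbose = FALSE)
extract.gt(ph, element = "GT")[1:3, 1:3]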

Further Processing

From here forward we will be using these variants in R, so we output them in .rds format for future use, along with the read-depth information.

Read in the haplotype data:

V <- "intermediates/004/greb1l-imp-phased-with-anc.vcf"
haps <- ecaRbioinf::vcf_haplos2tidy(V, Anc = "Coho")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 15
##   header_line: 16
##   variant count: 25539
##   column count: 155
## 
Meta line 15 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 25539
##   Character matrix gt cols: 155
##   skip: 0
##   nrows: 25539
##   row_num: 0
## 
Processed variant: 25539
## All variants processed
# if we kept sites that don't have an ancestral allele from Coho, mark those as NA so the derived allelic states will be NA too.
haps$fix <- haps$fix %>%
  mutate(AA = ifelse(AA == "N", NA, AA))

First, straighten out the haplotype names, etc.

big_haps <- haps$tidy %>%
  filter(Indiv != "Coho") %>%
  left_join(haps$avd, by = c("ChromKey", "POS", "Indiv", "haplo")) %>%
  mutate(haplo_name = paste0(Indiv, "-", haplo)) %>%
  left_join(rfi_depths, by = c("POS", "Indiv")) %>% # add the read depth information in there
  mutate(
    pop = str_replace_all(Indiv, "(_Spring|_Late_Fall|_Fall|_Winter).*$", ""), # add useful columns of meta data
    ecotype = str_match(Indiv, "(Spring|Late_Fall|Fall|Winter)")[, 1],
    lineage = ifelse(pop %in% c("Salmon_River", "Trinity_River_Hatchery"), "Klamath", "Sacto")
  )
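To see what those regular expressions are doing, here is a worked example on a hypothetical sample name (real names follow the Population-NMFS_DNA_ID pattern created earlier):

# Worked example on a hypothetical Indiv value:
x <- "Salmon_River_Spring-CH12345"  # hypothetical sample name
str_replace_all(x, "(_Spring|_Late_Fall|_Fall|_Winter).*$", "")  # pop: "Salmon_River"
str_match(x, "(Spring|Late_Fall|Fall|Winter)")[, 1]              # ecotype: "Spring"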

Now, to color everything by the allele that is most common amongst springers, we recode the alleles in a new column as S (for spring) or F (for fall):

spring_ones <- big_haps %>%
  filter(ecotype == "Spring") %>%
  group_by(POS, allele) %>%
  summarise(freq = n()) %>%
  filter(rank(-freq, ties.method = "first") == 1) %>% # keep the most common allele at each POS
  rename(spring_allele = allele) %>%
  ungroup()

big_haps <- big_haps %>%
  left_join(spring_ones, by = "POS") %>%
  mutate(alle2 = ifelse(allele == spring_allele, "S", "F"))

And now we need to join REF and ALT onto that. Down the road it will be important to know which allele was REF and which was ALT, so I am going to stick that information into big_haps2 so that we have it all in one place:

big_haps2 <- haps$fix %>%
  select(POS, REF, ALT) %>%
  left_join(big_haps, ., by = "POS") %>%
  select(ChromKey, POS, REF, ALT, everything())

Insert Variation from the Johnson Creek fish (Narum assembly)

For some of the upcoming analyses, it will be informative to include the variation seen in the fish from Johnson Creek that Narum et al. used in their assembly. In 200.2 I aligned their genome to Otsh_v1.0 and made a tibble of variants.

# first, get the distinct positions in big_haps2
bh2d <- big_haps2 %>%
  select(ChromKey, POS, REF, ALT, spring_allele) %>%
  distinct()

# get narvars and nar_refms (Johnson Creek Variants) from 003
load("outputs/003/greb_region_narum_variants.rda")

# now, we are going to only focus on the variants called in the broader
# data set, so we will left_join stuff onto those.
all_together <- bh2d %>%
  left_join(., narvars) %>%
  left_join(., nar_refms)

# and now we create a data frame that includes all the information for
# the interior columbia individual, as if it were part of our data set.
# We do this by choosing its alleles, and naming it, etc.
jc_fish <- all_together %>%
  mutate(
    Indiv = "Narum_et_al_Chinook_Assembly",
    haplo = "a",
    allele = case_when(
      (!is.na(Narum_ALT) & Narum_ALT == ALT) ~ Narum_ALT,
      nchar(REF) > 1 & Narum == ALT ~ REF, # REF has no deletion, but neither did Narum.  So Narum is REF here.  A tricky one.
      (REF == nar_refmsREF & REF == Narum) ~ Narum,
      TRUE ~ NA_character_
    ),
    haplo_name = "Narum_et_al_Chinook_Assembly-a",
    pop = "Johnson Ck.",
    ecotype = "Spring",
    lineage = "Upper Columbia",
    alle2 = ifelse(allele == spring_allele, "S", "F")
  ) %>%
  select(-narvar_REF, -Narum_ALT, -nar_refmsREF, -Narum)

Note that bind_rows(big_haps2, jc_fish) will add the Johnson Creek fish into the data set.
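For example:

# combine the sequenced fish with the Johnson Creek assembly haplotype
# into one tibble when downstream steps need both
big_haps2_with_jc <- bind_rows(big_haps2, jc_fish)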

Output final data sets as R objects

We now save the haplotypes in some tidy tibbles for future use in creating trees and haplo-rasters.

write_rds(big_haps2, path = "outputs/004/big_haps2.rds", compress = "xz")
write_rds(jc_fish, path = "outputs/004/jc_fish_haps.rds", compress = "xz")
write_rds(haps$fix, path = "outputs/004/big_haps2_fixed_info.rds", compress = "xz")

Note: The process of imputing and phasing with BEAGLE can produce different results on different runs of the algorithm. Downstream analyses can then differ subtly, though the important patterns remain the same. To reproduce exactly the figures presented in the paper, we provide the output of the BEAGLE run used for the downstream analyses in:

./stored_results/004/big_haps2.rds
./stored_results/004/big_haps2_with_jc.rds

Downstream analyses include options (selected with a variable named like USE_STORED_*) to use the stored results.
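For example, a downstream notebook might toggle between stored and freshly computed results with a pattern like this (the variable name here is hypothetical; each notebook defines its own):

# Hypothetical sketch of the USE_STORED_* switch; the actual variable
# names and paths are defined in the downstream notebooks themselves.
USE_STORED_BIG_HAPS2 <- TRUE
big_haps2 <- if (USE_STORED_BIG_HAPS2) {
  read_rds("stored_results/004/big_haps2.rds")
} else {
  read_rds("outputs/004/big_haps2.rds")
}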

Session Info

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2020-05-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version  date       lib source        
##  ape           5.3      2019-03-17 [1] CRAN (R 3.6.0)
##  assertthat    0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.6    2020-04-05 [1] CRAN (R 3.6.2)
##  broom         0.5.6    2020-04-20 [1] CRAN (R 3.6.2)
##  cellranger    1.1.0    2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2    2020-02-28 [1] CRAN (R 3.6.0)
##  cluster       2.1.0    2019-06-19 [2] CRAN (R 3.6.2)
##  colorspace    1.4-1    2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
##  DBI           1.1.0    2019-12-15 [1] CRAN (R 3.6.0)
##  dbplyr        1.4.3    2020-04-19 [1] CRAN (R 3.6.2)
##  digest        0.6.25   2020-02-23 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.5    2020-03-07 [1] CRAN (R 3.6.0)
##  ecaRbioinf    0.1.0    2020-02-13 [1] local         
##  ellipsis      0.3.0    2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14     2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1    2020-01-08 [1] CRAN (R 3.6.0)
##  farver        2.0.3    2020-01-16 [1] CRAN (R 3.6.0)
##  forcats     * 0.5.0    2020-03-01 [1] CRAN (R 3.6.0)
##  fs            1.4.1    2020-04-04 [1] CRAN (R 3.6.2)
##  generics      0.0.2    2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.0    2020-03-05 [1] CRAN (R 3.6.0)
##  glue          1.4.0    2020-04-03 [1] CRAN (R 3.6.2)
##  gtable        0.3.0    2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.2.0    2019-11-08 [1] CRAN (R 3.6.0)
##  hms           0.5.3    2020-01-08 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0    2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1    2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.6.1    2020-02-02 [1] CRAN (R 3.6.0)
##  knitr         1.28     2020-02-06 [1] CRAN (R 3.6.0)
##  labeling      0.3      2014-08-23 [1] CRAN (R 3.6.0)
##  lattice       0.20-38  2018-11-04 [2] CRAN (R 3.6.2)
##  lifecycle     0.2.0    2020-03-06 [1] CRAN (R 3.6.0)
##  lubridate     1.7.8    2020-04-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5      2014-11-22 [1] CRAN (R 3.6.0)
##  MASS          7.3-51.6 2020-04-26 [1] CRAN (R 3.6.2)
##  Matrix        1.2-18   2019-11-27 [2] CRAN (R 3.6.2)
##  memuse        4.1-0    2020-02-17 [1] CRAN (R 3.6.0)
##  mgcv          1.8-31   2019-11-09 [2] CRAN (R 3.6.2)
##  modelr        0.1.6    2020-02-22 [1] CRAN (R 3.6.0)
##  munsell       0.5.0    2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-142  2019-11-07 [2] CRAN (R 3.6.2)
##  permute       0.9-5    2019-03-12 [1] CRAN (R 3.6.0)
##  pillar        1.4.3    2019-12-20 [1] CRAN (R 3.6.0)
##  pinfsc50      1.1.0    2016-12-02 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4    2020-04-17 [1] CRAN (R 3.6.2)
##  R6            2.4.1    2019-11-12 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.4    2020-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1    2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1    2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0    2019-05-16 [1] CRAN (R 3.6.0)
##  rlang         0.4.5    2020-03-01 [1] CRAN (R 3.6.0)
##  rmarkdown     2.1      2020-01-20 [1] CRAN (R 3.6.0)
##  rstudioapi    0.11     2020-02-07 [1] CRAN (R 3.6.0)
##  rvest         0.3.5    2019-11-08 [1] CRAN (R 3.6.0)
##  scales        1.1.0    2019-11-18 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6    2020-02-17 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0    2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1    2020-04-20 [1] CRAN (R 3.6.2)
##  tidyr       * 1.0.2    2020-01-24 [1] CRAN (R 3.6.0)
##  tidyselect    1.0.0    2020-01-27 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.3.0    2019-11-21 [1] CRAN (R 3.6.0)
##  vcfR        * 1.10.0   2020-02-06 [1] CRAN (R 3.6.0)
##  vctrs         0.2.4    2020-03-10 [1] CRAN (R 3.6.0)
##  vegan         2.5-6    2019-09-01 [1] CRAN (R 3.6.0)
##  viridisLite   0.3.0    2018-02-01 [1] CRAN (R 3.6.0)
##  withr         2.2.0    2020-04-20 [1] CRAN (R 3.6.2)
##  xfun          0.13     2020-04-13 [1] CRAN (R 3.6.2)
##  xml2          1.3.2    2020-04-23 [1] CRAN (R 3.6.2)
##  yaml          2.2.1    2020-02-01 [1] CRAN (R 3.6.0)
## 
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:

Sys.time() - start_time
## Time difference of 12.02159 mins

Citations