One of the referees was curious about how accurate (or inaccurate) the process of imputation and phasing might have been on our relatively low coverage whole genome sequencing data. He was also curious about what effect it might have on the trees that we inferred from the phased haplotype data. In this notebook we begin to address that issue by first estimating the heterozygote miscall rate of our data amongst fall-run Chinook salmon from the Sacramento-San Joaquin Basin. Heterozygote miscall, as a common type of genotyping error in genotyping-by-sequencing data, was discussed in Hendricks et al. (2018), where there was also mention of software for estimating heterozygote miscall rates, as a function of read depth, via distortions from Hardy-Weinberg equilibrium. A method for doing so is now distributed as the R package ‘whoa’ by Eric C. Anderson, available from CRAN.

Here, our goal is to assess whether the imputed genotypes found amongst a large group of fish from several largely undifferentiated populations (the fall and late-fall collections from the Central Valley of California) are in proportions that indicate that there has not been a gross under-identification of heterozygous individuals.

Sacramento-San Joaquin Fall run Chinook salmon are largely undifferentiated between different tributaries (and from the Late-Fall run) which makes them suitable for consideration for use in ‘whoa.’

This notebook will draw upon the intermediate files produced in 004.

library(tidyverse)
library(vcfR)

# You must use a post-CRAN-release version of 'whoa'
# that deals with read depths of 0 (from fully imputed sites) appropriately.
# devtools::install_github("eriqande/whoa")
library(whoa)

dir.create("intermediates/204", showWarnings = FALSE, recursive = TRUE)
dir.create("outputs/204", showWarnings = FALSE, recursive = TRUE)

First explore the raw gentoypes called by GATK

Since GATK seems to not include population frequency information to calculate genotype posteriors (and, rather, appears to call genotypes based on the maximum of the genotype likelihoods), we expect that our low-read depth sequencing will yield raw genotypes from GATK that show evidence of extreme deficits of heterozygotes.

# get names of the fall and late fall central valley fish:
bcftools query -l intermediates/004/almost-ready-for-imputation.vcf | \
  awk '/San_Joaquin_River_Fall/ || /Coleman_Hatchery_Late_Fall/ || Feather_River_Hatchery_Fall' \
  > intermediates/204/cv-fall-run-names.txt 

# pick out just the fall/late-fall from the CV
bcftools view -S intermediates/204/cv-fall-run-names.txt -Oz \
  intermediates/004/almost-ready-for-imputation.vcf > intermediates/204/fall-late-fall-raw-from-gatk-5.16Mb.vcf.gz

We investigate that with scatterplots of genotype frequencies observed and expected under HWE in the 48 CV fall/late-fall fish:

raw_gatk <- read.vcfR("intermediates/204/fall-late-fall-raw-from-gatk-5.16Mb.vcf.gz")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 14062
##   header_line: 14063
##   variant count: 25539
##   column count: 41
## 
Meta line 1000 read in.
Meta line 2000 read in.
Meta line 3000 read in.
Meta line 4000 read in.
Meta line 5000 read in.
Meta line 6000 read in.
Meta line 7000 read in.
Meta line 8000 read in.
Meta line 9000 read in.
Meta line 10000 read in.
Meta line 11000 read in.
Meta line 12000 read in.
Meta line 13000 read in.
Meta line 14000 read in.
Meta line 14062 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 25539
##   Character matrix gt cols: 41
##   skip: 0
##   nrows: 25539
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant 9000
Processed variant 10000
Processed variant 11000
Processed variant 12000
Processed variant 13000
Processed variant 14000
Processed variant 15000
Processed variant 16000
Processed variant 17000
Processed variant 18000
Processed variant 19000
Processed variant 20000
Processed variant 21000
Processed variant 22000
Processed variant 23000
Processed variant 24000
Processed variant 25000
Processed variant: 25539
## All variants processed
raw_freqs <- exp_and_obs_geno_freqs(raw_gatk)

geno_freqs_scatter(raw_freqs, alpha = 0.04, max_plot_loci = 1e15)

Indeed. As expected, the raw genotype calls from GATK show substantial departures from HWE in a collection of from fish which it is to be expected.

Get the VCF of imputed/phased haplotypes from 004 and add depths to it

Here we test for an improvement in genotype proportions after imputation of genotypes from genotype likelihoods, and using haplotype information and linkage disequilibrium

cp intermediates/004/almost-ready-for-imputation.vcf intermediates/204
bgzip -f intermediates/204/almost-ready-for-imputation.vcf
bcftools index intermediates/204/almost-ready-for-imputation.vcf.gz


# reattach the DP information
bcftools annotate -a  intermediates/204/almost-ready-for-imputation.vcf.gz -c FMT/DP \
  -Oz -o intermediates/204/imputed-all-fish-with-DP.vcf.gz intermediates/004/greb1l-imp-phased.vcf.gz

# pick out just the fall/late-fall from the CV
bcftools view -S intermediates/204/cv-fall-run-names.txt -Oz \
  intermediates/204/imputed-all-fish-with-DP.vcf.gz > intermediates/204/fall-late-fall-imp-and-phased-5.16Mb.vcf.gz

Now, run it through whoa

First read it in and make a geno_freq scatter of it:

v <- read.vcfR("intermediates/204/fall-late-fall-imp-and-phased-5.16Mb.vcf.gz")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 19
##   header_line: 20
##   variant count: 25539
##   column count: 41
## 
Meta line 19 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 25539
##   Character matrix gt cols: 41
##   skip: 0
##   nrows: 25539
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant 9000
Processed variant 10000
Processed variant 11000
Processed variant 12000
Processed variant 13000
Processed variant 14000
Processed variant 15000
Processed variant 16000
Processed variant 17000
Processed variant 18000
Processed variant 19000
Processed variant 20000
Processed variant 21000
Processed variant 22000
Processed variant 23000
Processed variant 24000
Processed variant 25000
Processed variant: 25539
## All variants processed
gfreqs <- exp_and_obs_geno_freqs(v)

geno_freqs_scatter(gfreqs, alpha = 0.04, max_plot_loci = 1e15)

That is a considerable improvement. We can conclude from this that, at a gross level, the imputation from BEAGLE is working reasonably well, despite the fact that fish from multiple populations were included together in the BEAGLE imputation analysis.

Now, try to infer an overall het miscall rate:

overall <- infer_m(v, minBin = 1e15)

Look at the posterior mean estimate:

overall$m_posteriors

So, a het miscall rate of a little greater than 5%. That is quite low. Note that a more suitable name for this quantity might be the “heterozygote distortion rate,” since it indicates that there is a distortion of the heterozygote genotype frequencies that is consistent with heterozygotes being incorrectly called as homozygotes at a rate of 5%; however, the actual rate at which heterozygotes are called as homozygotes may be higher than this (as will be seen in the next analysis, 204-02).

Let’s look at the MCMC trace:

ggplot(overall$m_traces, aes(x = sweep, y = value)) +
  geom_line()

That shows good mixing.

Now, let’s see if we can break things down by different read depth bins:

binned <- infer_m(v, minBin = 2e04)

Check out the posterior means:

binned$m_posteriors

Let’s plot the traces:

binned$m_traces %>%
  left_join(binned$m_posteriors %>% select(bin, mean_dp)) %>%
  ggplot(., aes(x = sweep, y = value)) +
  geom_line() +
  facet_wrap(~ as.factor(mean_dp))

So, that shows what we expect: that heterozygotes appear to be miscalled as homozygotes at a higher rate when the read depth is low. This is exactly what we expect.

Note that these are error estimates for sites across the whole 5.16 Mb. We might expect the imputation and the phasing to be more accurate in areas with high LD, such as what we find in the RoSA.

Effective rate of genotyping errors

Also, those esimated heterozygote miscall rates are the rates at which heterozygotes are miscalled as homozygotes (the most common type of genotyping error with these sorts of data). However, only a fraction of the sites are heterozygous. So, what does that translate to for these fish in terms of average rate of genotyping error? Well it depends on the allele frequencies. Let’s do a quick calculation:

vt <- vcfR2tidy(v)$gt %>%
  mutate(DP = ifelse(is.na(gt_DP), 0, gt_DP)) %>%
  select(-gt_DS, -gt_GP, -gt_DP, -gt_GT_alleles)

dpjoins <- binned$m_posteriors %>%
  mutate(DP = as.integer(mean_dp)) %>%
  select(DP, mean) %>%
  rename(est_hmr = mean)

vt_with_hmr <- vt %>%
  left_join(dpjoins, by = "DP") %>%
  mutate(est_hmr = ifelse(is.na(est_hmr), 0.00138, est_hmr)) # if DP > 5 just say het miscall rate is 0.00138

vt_with_hmr %>%
  separate(gt_GT, into = c("a", "b"), convert = TRUE, sep = "\\|") %>%
  gather(key = "gene_copy", value = "allele", a, b) %>%
  group_by(POS) %>%
  mutate(
    n1 = sum(allele == 1),
    n0 = sum(allele == 0)
  ) %>%
  mutate(het_freq = 2 * n1 / (n0 + n1) * n0 / (n0 + n1)) %>%
  mutate(exp_geno_rate = het_freq * est_hmr) %>%
  ungroup() %>%
  summarise(mean_est_geno_err_rate = mean(exp_geno_rate))

So, if we use the rate of heterozygote miscalls estimated here from distortions from HWE, and we only count heterozygote miscalls, we get a rate of about 1.3% of all genotypes being incorrect. That is not a high fraction of all genotypes that are expected to be incorrect.

However, a more definitive comparison would involve comparing the imputed genotypes of high-read depth individuals to the imputed genotypes from those same individuals after subsampling the reads from them to a lower read depth, like 1.5. This is done in the next notebook (204-02)

Session Info

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2020-05-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version   date       lib source        
##  ape           5.3       2019-03-17 [1] CRAN (R 3.6.0)
##  assertthat    0.2.1     2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.6     2020-04-05 [1] CRAN (R 3.6.2)
##  broom         0.5.6     2020-04-20 [1] CRAN (R 3.6.2)
##  cellranger    1.1.0     2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2     2020-02-28 [1] CRAN (R 3.6.0)
##  cluster       2.1.0     2019-06-19 [2] CRAN (R 3.6.2)
##  colorspace    1.4-1     2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4     2017-09-16 [1] CRAN (R 3.6.0)
##  DBI           1.1.0     2019-12-15 [1] CRAN (R 3.6.0)
##  dbplyr        1.4.3     2020-04-19 [1] CRAN (R 3.6.2)
##  digest        0.6.25    2020-02-23 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.5     2020-03-07 [1] CRAN (R 3.6.0)
##  ellipsis      0.3.0     2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14      2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1     2020-01-08 [1] CRAN (R 3.6.0)
##  farver        2.0.3     2020-01-16 [1] CRAN (R 3.6.0)
##  forcats     * 0.5.0     2020-03-01 [1] CRAN (R 3.6.0)
##  fs            1.4.1     2020-04-04 [1] CRAN (R 3.6.2)
##  generics      0.0.2     2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.0     2020-03-05 [1] CRAN (R 3.6.0)
##  glue          1.4.0     2020-04-03 [1] CRAN (R 3.6.2)
##  gtable        0.3.0     2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.2.0     2019-11-08 [1] CRAN (R 3.6.0)
##  hms           0.5.3     2020-01-08 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0     2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1     2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.6.1     2020-02-02 [1] CRAN (R 3.6.0)
##  knitr         1.28      2020-02-06 [1] CRAN (R 3.6.0)
##  labeling      0.3       2014-08-23 [1] CRAN (R 3.6.0)
##  lattice       0.20-38   2018-11-04 [2] CRAN (R 3.6.2)
##  lifecycle     0.2.0     2020-03-06 [1] CRAN (R 3.6.0)
##  lubridate     1.7.8     2020-04-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5       2014-11-22 [1] CRAN (R 3.6.0)
##  MASS          7.3-51.6  2020-04-26 [1] CRAN (R 3.6.2)
##  Matrix        1.2-18    2019-11-27 [2] CRAN (R 3.6.2)
##  memuse        4.1-0     2020-02-17 [1] CRAN (R 3.6.0)
##  mgcv          1.8-31    2019-11-09 [2] CRAN (R 3.6.2)
##  modelr        0.1.6     2020-02-22 [1] CRAN (R 3.6.0)
##  munsell       0.5.0     2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-142   2019-11-07 [2] CRAN (R 3.6.2)
##  permute       0.9-5     2019-03-12 [1] CRAN (R 3.6.0)
##  pillar        1.4.3     2019-12-20 [1] CRAN (R 3.6.0)
##  pinfsc50      1.1.0     2016-12-02 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3     2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4     2020-04-17 [1] CRAN (R 3.6.2)
##  R6            2.4.1     2019-11-12 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.4     2020-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1     2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1     2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0     2019-05-16 [1] CRAN (R 3.6.0)
##  rlang         0.4.5     2020-03-01 [1] CRAN (R 3.6.0)
##  rmarkdown     2.1       2020-01-20 [1] CRAN (R 3.6.0)
##  rstudioapi    0.11      2020-02-07 [1] CRAN (R 3.6.0)
##  rvest         0.3.5     2019-11-08 [1] CRAN (R 3.6.0)
##  scales        1.1.0     2019-11-18 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1     2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6     2020-02-17 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0     2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1     2020-04-20 [1] CRAN (R 3.6.2)
##  tidyr       * 1.0.2     2020-01-24 [1] CRAN (R 3.6.0)
##  tidyselect    1.0.0     2020-01-27 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.3.0     2019-11-21 [1] CRAN (R 3.6.0)
##  vcfR        * 1.10.0    2020-02-06 [1] CRAN (R 3.6.0)
##  vctrs         0.2.4     2020-03-10 [1] CRAN (R 3.6.0)
##  vegan         2.5-6     2019-09-01 [1] CRAN (R 3.6.0)
##  viridisLite   0.3.0     2018-02-01 [1] CRAN (R 3.6.0)
##  whoa        * 0.0.2.999 2020-04-23 [1] local         
##  withr         2.2.0     2020-04-20 [1] CRAN (R 3.6.2)
##  xfun          0.13      2020-04-13 [1] CRAN (R 3.6.2)
##  xml2          1.3.2     2020-04-23 [1] CRAN (R 3.6.2)
##  yaml          2.2.1     2020-02-01 [1] CRAN (R 3.6.0)
## 
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:

Sys.time() - start_time
## Time difference of 1.16481 mins

Citations

Hendricks, Sarah, Eric C. Anderson, Tiago Antao, Louis Bernatchez, Brenna R. Forester, Brittany Garner, Brian K. Hand, et al. 2018. “Recent Advances in Conservation and Population Genomics Data Analysis.” Evolutionary Applications 11 (8): 1197–1211. doi:10.1111/eva.12659.