Packages and paths

library(tidyverse)
library(vcfR)

Build a SNPeff database

As instructed in the README file, be sure you have downloaded Chinook Otsh_v1.0 genome and the gff files and have put the (or symlinked them) in the genome directory.

Also, ensure that you have installed snpEff as described in the README.

Building the data base involves setting some symlinks within the snpEff directory and running it:

cd snpEff_v4_3t_core/snpEff/
mkdir -p data/Otsh_v1.0
cd data/Otsh_v1.0/

# get rid of these if the symlinks already exist from a previous run
rm -f genes.gff.gz
rm -f sequences.fa
# here we symlink sequences.fa to the genome and genes.gff.gz to the gff file.
ln -s ../../../../genome/GCF_002872995.1_Otsh_v1.0_genomic.gff.gz genes.gff.gz
ln -s ../../../../genome/Otsh_v1.0_genomic.fna sequences.fa

And then adding some lines to the config file:

# only to this if the lines have not already been added
# (i.e., if this has alread been run)
if grep -q Otsh_v1.0 snpEff_v4_3t_core/snpEff/snpEff.config; then
    echo "Otsh_v1.0 already added to config file"
else
    echo "Adding Otsh_v1.0 to config file"
    
    echo "
# Chinook salmon genome, Otsh_v1.0
Otsh_v1.0.genome : Oncorhynchus_tshawytscha_Otsh_V1.0
" >> snpEff_v4_3t_core/snpEff/snpEff.config
fi

## Otsh_v1.0 already added to config file

Now, run it. This takes about 5 minutes.

cd snpEff_v4_3t_core/snpEff
java -jar snpEff.jar build -gff3 -v Otsh_v1.0 > snpEff_stdout.log 2> snpEff_stderr.log

Running SNPeff

Having created a SNPeff data base from the Otshv1.0 genome and gff files using the standard procedures we should be able to annotate our little 5.16 Mb segment of DNA pretty quickly.

DIR=$(pwd)

mkdir -p intermediates/005

cd snpEff_v4_3t_core/snpEff

rm -f $DIR/intermediates/005/greb1l-5Mb-snpEff.vcf 
rm -f $DIR/intermediates/005/greb1l-5Mb-snpEff.vcf.gz 

java -jar snpEff.jar ann Otsh_v1.0 $DIR/data/greb1l-ish-region.vcf.gz > $DIR/intermediates/005/greb1l-5Mb-snpEff.vcf

# after that, bgzip it and index it
cd $DIR/intermediates/005
bgzip greb1l-5Mb-snpEff.vcf
bcftools index -t greb1l-5Mb-snpEff.vcf.gz

Now that those are there, I will be able to use them later, attaching them to the imputed variant data and computing some things. To facilitate that, let’s read it into R and clean it up and resave.

Read results into R and save in outputs

annoV <- read.vcfR("intermediates/005/greb1l-5Mb-snpEff.vcf.gz")

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 14061
##   header_line: 14062
##   variant count: 49264
##   column count: 169
## 
Meta line 1000 read in.
Meta line 2000 read in.
Meta line 3000 read in.
Meta line 4000 read in.
Meta line 5000 read in.
Meta line 6000 read in.
Meta line 7000 read in.
Meta line 8000 read in.
Meta line 9000 read in.
Meta line 10000 read in.
Meta line 11000 read in.
Meta line 12000 read in.
Meta line 13000 read in.
Meta line 14000 read in.
Meta line 14061 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 49264
##   Character matrix gt cols: 169
##   skip: 0
##   nrows: 49264
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant 9000
Processed variant 10000
Processed variant 11000
Processed variant 12000
Processed variant 13000
Processed variant 14000
Processed variant 15000
Processed variant 16000
Processed variant 17000
Processed variant 18000
Processed variant 19000
Processed variant 20000
Processed variant 21000
Processed variant 22000
Processed variant 23000
Processed variant 24000
Processed variant 25000
Processed variant 26000
Processed variant 27000
Processed variant 28000
Processed variant 29000
Processed variant 30000
Processed variant 31000
Processed variant 32000
Processed variant 33000
Processed variant 34000
Processed variant 35000
Processed variant 36000
Processed variant 37000
Processed variant 38000
Processed variant 39000
Processed variant 40000
Processed variant 41000
Processed variant 42000
Processed variant 43000
Processed variant 44000
Processed variant 45000
Processed variant 46000
Processed variant 47000
Processed variant 48000
Processed variant 49000
Processed variant: 49264
## All variants processed

TT <- vcfR2tidy(annoV, info_only = TRUE)

Have a look at some possible effect variants

Now, let’s break out the first three columns of the snpEff annotation.

# there are warnings here because we are just grabbing the first three
# columns of many with separate()
TTS <- TT$fix %>%
  separate(ANN, into = c("snp_eff_var", "snp_eff_what", "snp_eff_effect"), remove = FALSE, sep = "\\|")

Count up what we see for effects:

TTS %>%
  count(snp_eff_effect, snp_eff_what)

Let’s go ahead and save that.

dir.create("outputs/005", recursive = TRUE, showWarnings = FALSE)
write_rds(TTS, path = "outputs/005/variants-and-annos-5.16.rds", compress = "xz")

We will focus on what we find from 12.1 to 12.6 here, as that encompasses GREB1L and ROCK1 and is what we will show for the “moderate zoom” raster

FocReg <- TTS %>%
  filter(POS > 12.1e6 & POS < 12.6e6)

# summarize:
FocReg %>%
  count(snp_eff_effect)

Let’s have a quick look at the interesting ones:

FocReg %>%
  filter(snp_eff_effect != "MODIFIER") %>%
  select(CHROM, POS, snp_eff_effect, snp_eff_what, REF, ALT, ANN)

So, now, let’s just look at the HIGH and MODERATE ones there:

FocReg %>%
  filter(snp_eff_effect %in% c("MODERATE", "HIGH")) %>%
  select(CHROM, POS, snp_eff_effect, snp_eff_what, REF, ALT, ANN)

Let’s write this out:

candidates <- FocReg %>%
  filter(snp_eff_effect %in% c("MODERATE", "HIGH")) %>%
  select(CHROM, POS, snp_eff_effect, snp_eff_what, REF, ALT, ANN, everything())

write_rds(candidates, path = "outputs/005/candidate-mutations-12.1_to_12.6.rds")

Look further at the candidate variants

We grab the imputed SNPs and inferred haplotypes from 305 here to make comparisons on the allele frequencies at the candidate SNPs here.

big_haps2 <- read_rds(path = "outputs/004/big_haps2.rds")

Check if any of the candidate variants are not in there:

candidates %>%
  anti_join(big_haps2, by = "POS")

Check up on positions not included in big_haps2

OK, that is interesting. Those must have gotten filtered out. The filtering rules we used in 004 were:

biallelic (this just drops 1000 or so out of 49,000. Seems reasonable).
Genotype call in the vcf for more then 40% of the individuals.
Minor allele freq > 0.05

I am going to look at these markers:

bcftools index -t  intermediates/004/fish_for_tree.vcf.gz

# this one has only 2 copies of the alt allele
bcftools view -H  intermediates/004/fish_for_tree.vcf.gz NC_037124.1:12202720

# this one is also a low frequency variant (9 copies detected in data set, out of 202)
bcftools view -H  intermediates/004/fish_for_tree.vcf.gz NC_037124.1:12237855

# this one also is just two copies of the alt allele
bcftools view -H  intermediates/004/fish_for_tree.vcf.gz NC_037124.1:12239678


# This HIGH-effect candidate is clearly not associated with anything:
bcftools view -H  intermediates/004/fish_for_tree.vcf.gz NC_037124.1:12246066 | awk '{for(i=10; i<=NF;i++) print $i}' | awk -F":" 'BEGIN {SUBSEP = "   "} {n[$1]++} END {for(i in n) print i, n[i]}' 
./. 57
0/0 95
0/1 6
1/1 2
# and note that those "homozygotes" only have 1 read each, so are likely hets.

# and finally we check the last one, which shows us 10 out 204 allele calls.  Nope! not associated in any way.
bcftools view -H  intermediates/004/fish_for_tree.vcf.gz NC_037124.1:12471068

So, it is not like we are missing any of these in the list of markers we dealt with for any reason other than there is no way they are associated with anything.

Look at imputed genotypes at different candidate sites

So, now let us look at the imputed genotypes at all these sites. It might be useful to identify haplotypes as “spring” or “fall” lineage. So, let’s do that by enumerating the number of spring alleles in the RoSA, too.

big_haps3 <- big_haps2 %>%
  group_by(Indiv, haplo) %>%
  mutate(num_S_in_RoSA = sum(alle2[POS > 12.26e6 & POS < 12.29e6] == "S")) %>%
  semi_join(candidates, by = "POS") %>%
  ungroup()

Look at the distribution of number of S alleles in the RoSA:

big_haps3 %>%
  group_by(Indiv, haplo) %>%
  summarise(num_S_in_RoSA = num_S_in_RoSA[1]) %>%
  ggplot(aes(x = num_S_in_RoSA)) +
  geom_histogram()

So, we can easily call anything with more than 100 spring alleles part of the spring-run haplotype group.

big_haps4 <- big_haps3 %>%
  mutate(haplo_group = ifelse(num_S_in_RoSA > 100, "early", "late"))

Now we can compute the frequencies of these different alleles when we break things up by different groups.

Here, just using the stated ecotype of each individual are the results we get:

big_haps4 %>%
  group_by(POS, ecotype) %>%
  summarise(freq_of_S_allele = sum(alle2 == "S")) %>%
  spread(key = ecotype, value = freq_of_S_allele)

And, we can see where that is going! Clearly the 7 copies in the Fall populations are due to heterozygotes, and that is also probably the case a couple springers.

Let’s do this again with the relative frequencies.

big_haps4 %>%
  group_by(POS, ecotype) %>%
  summarise(freq_of_S_allele = mean(alle2 == "S")) %>%
  spread(key = ecotype, value = freq_of_S_allele)

So, let’s break it out by haplotype group:

big_haps4 %>%
  group_by(POS, haplo_group) %>%
  summarise(freq_of_S_allele = sum(alle2 == "S")) %>%
  spread(key = haplo_group, value = freq_of_S_allele)

OK, we need to know what the total number of haplotypes is in there too:

big_haps4 %>%
  group_by(POS, haplo_group) %>%
  summarise(
    freq_of_S_allele = sum(alle2 == "S"),
    freq_of_F_allele = sum(alle2 != "S")
  )

That shows us what we expect, but it is difficult to read. But it shows us that only two of the candidates have any diffs, and that one of them is perfect, whilst the other is missing by 1. Let’s do it in terms of fractions:

big_haps4 %>%
  group_by(POS, haplo_group) %>%
  summarise(freq_of_S_allele = mean(alle2 == "S")) %>%
  spread(key = haplo_group, value = freq_of_S_allele)

Codons in coho salmon

So, what remains now is to verify that the spring-associated alleles are the same as in coho. We use our aligned coho genome for that.

# pull out the bits from chinook:
samtools faidx genome/Otsh_v1.0_genomic.fna NC_037124.1:12265344-12265346
#>NC_037124.1:12265344-12265346
#CTT


# That corresponds to the codon AAG.  If the spring variant were in the
# reference genome then we'd get CTC.

# pull out the same bit from our coho alignment
samtools faidx stored_results/001/NC_037124.1_coho.fna.gz  NC_037124.1:12265344-12265346
#>NC_037124.1:12265344-12265346
#CTC


# there you have it!

# and now we do the same for the other non-synonymous position
samtools faidx genome/Otsh_v1.0_genomic.fna NC_037124.1:12265889-12265891
#>NC_037124.1:12265889-12265891
#GGA

# that, corresponds to the codon TCC.  So, the spring-associated
# form will be GGG.
samtools faidx stored_results/001/NC_037124.1_coho.fna.gz  NC_037124.1:12265889-12265891
#>NC_037124.1:12265889-12265891
#GGG

# Boom!

## >NC_037124.1:12265344-12265346
## CTT
## >NC_037124.1:12265344-12265346
## CTC
## >NC_037124.1:12265889-12265891
## GGA
## >NC_037124.1:12265889-12265891
## GGG

Now, let’s also check a final time on REF and ALT in the original VCF, to make sure that it all accords as it is supposed to here.

# Check the first position
bcftools view data/greb1l-ish-region.vcf.gz NC_037124.1:12265346 | awk '/^##/ {next} {print $1, $2, $3, $4, $5}' 
#CHROM POS ID REF ALT
#NC_037124.1 12265346 . T C

# Yep, that is correct

# and the second
bcftools view data/greb1l-ish-region.vcf.gz NC_037124.1:12265891 | awk '/^##/ {next} {print $1, $2, $3, $4, $5}' 
#CHROM POS ID REF ALT
#NC_037124.1 12265891 . A G

# also correct

## #CHROM POS ID REF ALT
## NC_037124.1 12265346 . T C
## #CHROM POS ID REF ALT
## NC_037124.1 12265891 . A G

OK! This all forms the basis for Table S1: “Two near-perfectly associated, non-synonymous variants in the GREB1L gene.”

Session Info

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2020-05-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version  date       lib source        
##  ape           5.3      2019-03-17 [1] CRAN (R 3.6.0)
##  assertthat    0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.6    2020-04-05 [1] CRAN (R 3.6.2)
##  broom         0.5.6    2020-04-20 [1] CRAN (R 3.6.2)
##  cellranger    1.1.0    2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2    2020-02-28 [1] CRAN (R 3.6.0)
##  cluster       2.1.0    2019-06-19 [2] CRAN (R 3.6.2)
##  colorspace    1.4-1    2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
##  DBI           1.1.0    2019-12-15 [1] CRAN (R 3.6.0)
##  dbplyr        1.4.3    2020-04-19 [1] CRAN (R 3.6.2)
##  digest        0.6.25   2020-02-23 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.5    2020-03-07 [1] CRAN (R 3.6.0)
##  ellipsis      0.3.0    2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14     2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1    2020-01-08 [1] CRAN (R 3.6.0)
##  farver        2.0.3    2020-01-16 [1] CRAN (R 3.6.0)
##  forcats     * 0.5.0    2020-03-01 [1] CRAN (R 3.6.0)
##  fs            1.4.1    2020-04-04 [1] CRAN (R 3.6.2)
##  generics      0.0.2    2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.0    2020-03-05 [1] CRAN (R 3.6.0)
##  glue          1.4.0    2020-04-03 [1] CRAN (R 3.6.2)
##  gtable        0.3.0    2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.2.0    2019-11-08 [1] CRAN (R 3.6.0)
##  hms           0.5.3    2020-01-08 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0    2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1    2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.6.1    2020-02-02 [1] CRAN (R 3.6.0)
##  knitr         1.28     2020-02-06 [1] CRAN (R 3.6.0)
##  labeling      0.3      2014-08-23 [1] CRAN (R 3.6.0)
##  lattice       0.20-38  2018-11-04 [2] CRAN (R 3.6.2)
##  lifecycle     0.2.0    2020-03-06 [1] CRAN (R 3.6.0)
##  lubridate     1.7.8    2020-04-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5      2014-11-22 [1] CRAN (R 3.6.0)
##  MASS          7.3-51.6 2020-04-26 [1] CRAN (R 3.6.2)
##  Matrix        1.2-18   2019-11-27 [2] CRAN (R 3.6.2)
##  memuse        4.1-0    2020-02-17 [1] CRAN (R 3.6.0)
##  mgcv          1.8-31   2019-11-09 [2] CRAN (R 3.6.2)
##  modelr        0.1.6    2020-02-22 [1] CRAN (R 3.6.0)
##  munsell       0.5.0    2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-142  2019-11-07 [2] CRAN (R 3.6.2)
##  permute       0.9-5    2019-03-12 [1] CRAN (R 3.6.0)
##  pillar        1.4.3    2019-12-20 [1] CRAN (R 3.6.0)
##  pinfsc50      1.1.0    2016-12-02 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4    2020-04-17 [1] CRAN (R 3.6.2)
##  R6            2.4.1    2019-11-12 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.4    2020-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1    2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1    2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0    2019-05-16 [1] CRAN (R 3.6.0)
##  rlang         0.4.5    2020-03-01 [1] CRAN (R 3.6.0)
##  rmarkdown     2.1      2020-01-20 [1] CRAN (R 3.6.0)
##  rstudioapi    0.11     2020-02-07 [1] CRAN (R 3.6.0)
##  rvest         0.3.5    2019-11-08 [1] CRAN (R 3.6.0)
##  scales        1.1.0    2019-11-18 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6    2020-02-17 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0    2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1    2020-04-20 [1] CRAN (R 3.6.2)
##  tidyr       * 1.0.2    2020-01-24 [1] CRAN (R 3.6.0)
##  tidyselect    1.0.0    2020-01-27 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.3.0    2019-11-21 [1] CRAN (R 3.6.0)
##  vcfR        * 1.10.0   2020-02-06 [1] CRAN (R 3.6.0)
##  vctrs         0.2.4    2020-03-10 [1] CRAN (R 3.6.0)
##  vegan         2.5-6    2019-09-01 [1] CRAN (R 3.6.0)
##  viridisLite   0.3.0    2018-02-01 [1] CRAN (R 3.6.0)
##  withr         2.2.0    2020-04-20 [1] CRAN (R 3.6.2)
##  xfun          0.13     2020-04-13 [1] CRAN (R 3.6.2)
##  xml2          1.3.2    2020-04-23 [1] CRAN (R 3.6.2)
##  yaml          2.2.1    2020-02-01 [1] CRAN (R 3.6.0)
## 
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:

Sys.time() - start_time

## Time difference of 6.37954 mins

Annotating Variants Near GREB1L

Eric C. Anderson

Last Updated: 2020-05-14