The goal here is to infer haplotypes in the GREB1L region for use in later analyses. We use a small portion of the VCF, saved in this repository at data/greb1l-ish-region.vcf.gz.
library(tidyverse)
library(vcfR)
dir.create("outputs/004", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/004", recursive = TRUE, showWarnings = FALSE)
# make more informative sample names of the form Population-NMFS_DNA_ID,
# with spaces replaced by underscores
samples <- read_csv("data/wgs-chinook-samples.csv") %>%
  mutate(newname = paste0(Population, "-", NMFS_DNA_ID)) %>%
  mutate(newname = str_replace_all(newname, " +", "_"))

# two-column file mapping old VCF sample names to the new names,
# in the format bcftools reheader expects
new_names <- samples %>%
  select(vcf_name, newname)

write_tsv(new_names, "intermediates/004/new_names_for_vcf.txt", col_names = FALSE)
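As a quick sanity check (not part of the original pipeline), we can confirm the mapping is one-to-one, which bcftools reheader requires:

# every sample should get a unique new name; n_distinct() is from dplyr
stopifnot(n_distinct(new_names$newname) == nrow(new_names))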
The VCF we are reheadering covers about 5 Mb around the GREB1L RoSA.
bcftools reheader -s intermediates/004/new_names_for_vcf.txt data/greb1l-ish-region.vcf.gz > intermediates/004/fish_for_tree.vcf.gz
We will filter sites with vcftools, keeping only biallelic sites with minor allele frequency of at least 0.05 that are genotyped in at least 40% of individuals:
vcftools --gzvcf intermediates/004/fish_for_tree.vcf.gz --min-alleles 2 --max-alleles 2 --max-missing 0.4 --maf 0.05 --out intermediates/004/greb1l-ish-filtered --recode
##
## VCFtools - v0.1.12b
## (C) Adam Auton and Anthony Marcketta 2009
##
## Parameters as interpreted:
## --gzvcf intermediates/004/fish_for_tree.vcf.gz
## --maf 0.05
## --max-alleles 2
## --min-alleles 2
## --max-missing 0.4
## --out intermediates/004/greb1l-ish-filtered
## --recode
##
## Using zlib version: 1.2.5
## After filtering, kept 160 out of 160 Individuals
## Outputting VCF file...
## After filtering, kept 25539 out of a possible 49264 Sites
## Run Time = 9.00 seconds
See 200-ancestral-states-from-coho.Rmd for details of the alignment. I put the resulting fasta, which is congruent with the Chinook genome, into data/NC_037124.1_coho.fna, and it is faidx-indexed.
Now we attach the ancestral alleles to the VCF. Sites without a known ancestral state come through with AA = N, and we will mark those as missing later:
echo '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele">' > aa.hdr
bcftools +fill-from-fasta intermediates/004/greb1l-ish-filtered.recode.vcf -- -c AA -f stored_results/001/NC_037124.1_coho.fna.gz -h aa.hdr > intermediates/004/almost-ready-for-imputation.vcf
rm aa.hdr
We want to get the per-genotype read depths so we can use that information later in the plots. So let's grab it:
rfi <- read.vcfR("intermediates/004/almost-ready-for-imputation.vcf")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 14060
## header_line: 14061
## variant count: 25539
## column count: 169
##
Meta line 14060 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 25539
## Character matrix gt cols: 169
## skip: 0
## nrows: 25539
## row_num: 0
##
Processed variant: 25539
## All variants processed
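Before going on, we can spot-check that the AA tag really got attached. A minimal sketch using vcfR's extract_info_tidy():

# illustrative check (not part of the original pipeline): tally sites
# with and without an ancestral allele in the INFO field
vcfR::extract_info_tidy(rfi, info_fields = "AA") %>%
  count(is.na(AA))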
rfi_depths <- vcfR2tidy(rfi)$gt %>%
  select(Indiv, POS, gt_DP) %>%
  mutate(DP = ifelse(is.na(gt_DP), 0, gt_DP)) %>%  # treat missing depths as 0
  select(-gt_DP)
Now, while we are at it, let's look at the average per-site read depth for the different fish in there:
dp_means <- rfi_depths %>%
  group_by(Indiv) %>%
  summarise(mean_depth = mean(DP)) %>%
  arrange(mean_depth)
dp_means
And now have a look at that distribution:
ggplot(dp_means, aes(x = mean_depth)) +
  geom_histogram(binwidth = 0.1)
We are going to toss any individual with an average read depth of less than 0.45, as that cutoff captures all of the really poor samples. Sadly, these are mostly fish that were sampled as carcasses in the Salmon River, but we want to keep our data clean for haplotyping, so we drop those individuals at this stage.
dp_means %>%
  filter(mean_depth > 0.45) %>%
  .$Indiv %>%
  cat(sep = "\n", file = "intermediates/004/suff-read-depth-inds.txt")
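From the vcftools output below, this cutoff drops 14 of the 160 fish. To see the tally directly (an illustrative check, not in the original notebook):

# individuals on either side of the 0.45 mean-depth cutoff
dp_means %>% count(mean_depth > 0.45)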
Now we grab just those individuals:
vcftools --vcf intermediates/004/almost-ready-for-imputation.vcf --keep intermediates/004/suff-read-depth-inds.txt --out intermediates/004/ready-for-imputation-suff-rd --recode
cp intermediates/004/ready-for-imputation-suff-rd.recode.vcf intermediates/004/ready-for-imputation.vcf
##
## VCFtools - v0.1.12b
## (C) Adam Auton and Anthony Marcketta 2009
##
## Parameters as interpreted:
## --vcf intermediates/004/almost-ready-for-imputation.vcf
## --keep intermediates/004/suff-read-depth-inds.txt
## --out intermediates/004/ready-for-imputation-suff-rd
## --recode
##
## Keeping individuals in 'keep' list
## After filtering, kept 146 out of 160 Individuals
## Outputting VCF file...
## After filtering, kept 25539 out of a possible 25539 Sites
## Run Time = 8.00 seconds
Others will have to adjust their paths to BEAGLE. First, the imputation:
source script/java-jar-paths.sh
echo "Using Beagle jar at: $BEAGLE"
java -jar $BEAGLE gl=intermediates/004/ready-for-imputation.vcf out=intermediates/004/greb1l-imputed > intermediates/004/beagle-impute-stdout.txt
## Using Beagle jar at: /Users/eriq/Documents/others_code/BEAGLE_4.1/beagle.27Jan18.7e1.jar
And then the haplotype phasing:
source script/java-jar-paths.sh
java -jar $BEAGLE gt=intermediates/004/greb1l-imputed.vcf.gz out=intermediates/004/greb1l-imp-phased > intermediates/004/beagle-phase-stdout.txt
And after that, we reattach the ancestral states:
# gotta do this because BEAGLE doesn't put the contig name in there
tabix -f intermediates/004/greb1l-imp-phased.vcf.gz
echo '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele">' > aa.hdr
bcftools +fill-from-fasta intermediates/004/greb1l-imp-phased.vcf.gz -- -c AA -f stored_results/001/NC_037124.1_coho.fna.gz -h aa.hdr > intermediates/004/greb1l-imp-phased-with-anc.vcf
rm aa.hdr
OK, now we have intermediates/004/greb1l-imp-phased-with-anc.vcf, a VCF file of phased haplotypes with an ancestral allele (AA) tag in the INFO field. From here forward, we will be using those variants in R, so we output them in rds format for future use, along with the read-depth information.
Read in the haplotype data:
V <- "intermediates/004/greb1l-imp-phased-with-anc.vcf"
haps <- ecaRbioinf::vcf_haplos2tidy(V, Anc = "Coho")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 15
## header_line: 16
## variant count: 25539
## column count: 155
##
Meta line 15 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 25539
## Character matrix gt cols: 155
## skip: 0
## nrows: 25539
## row_num: 0
##
Processed variant: 25539
## All variants processed
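For reference, ecaRbioinf::vcf_haplos2tidy() returns a list; from its use below we rely on haps$fix (site-level information such as POS, REF, ALT, and AA), haps$tidy (one row per haplotype per site), and haps$avd (which we join on by ChromKey, POS, Indiv, and haplo). To peek at the structure (illustrative):

# top-level structure of the list returned by vcf_haplos2tidy()
str(haps, max.level = 1)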
# if we kept sites that don't have an ancestral allele from Coho, mark
# those as NA so the derived allelic states will be NA too
haps$fix <- haps$fix %>%
  mutate(AA = ifelse(AA == "N", NA, AA))
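A quick tally of how many sites that affects (illustrative, not in the original notebook):

# sites with AA == NA will have NA derived-allele states downstream
haps$fix %>% count(is.na(AA))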
First, straighten out the haplotype names, etc.
big_haps <- haps$tidy %>%
  filter(Indiv != "Coho") %>%
  left_join(haps$avd, by = c("ChromKey", "POS", "Indiv", "haplo")) %>%
  mutate(haplo_name = paste0(Indiv, "-", haplo)) %>%
  left_join(rfi_depths, by = c("POS", "Indiv")) %>%  # add the read-depth information in there
  mutate(
    pop = str_replace_all(Indiv, "(_Spring|_Late_Fall|_Fall|_Winter).*$", ""),  # add useful columns of metadata
    ecotype = str_match(Indiv, "(Spring|Late_Fall|Fall|Winter)")[, 1],
    lineage = ifelse(pop %in% c("Salmon_River", "Trinity_River_Hatchery"), "Klamath", "Sacto")
  )
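It is worth eyeballing the derived metadata columns to make sure the regular expressions behaved (an illustrative check, not part of the original pipeline):

# one row per observed population/ecotype/lineage combination
big_haps %>%
  distinct(pop, ecotype, lineage) %>%
  arrange(lineage, pop)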
Now, to color everything by the allele that is most common amongst the springers, we recode a column of alleles as S (for spring) and F (for fall):
spring_ones <- big_haps %>%
  filter(ecotype == "Spring") %>%
  group_by(POS, allele) %>%
  summarise(freq = n()) %>%
  filter(rank(-freq, ties.method = "first") == 1) %>%  # keep the most common allele at each position
  rename(spring_allele = allele) %>%
  ungroup()
big_haps <- big_haps %>%
  left_join(spring_ones, by = "POS") %>%
  mutate(alle2 = ifelse(allele == spring_allele, "S", "F"))
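Since rank(-freq, ties.method = "first") keeps exactly one allele per position, spring_ones should have one row per POS. A quick assertion (illustrative, not in the original notebook):

# every position contributes exactly one spring allele
stopifnot(!any(duplicated(spring_ones$POS)))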
And now we need to join REF and ALT onto that. Down the road, it will be important to know which allele was REF and which was ALT, so I'm going to stick that information into big_haps2 so that we have it all in one place:
big_haps2 <- haps$fix %>%
  select(POS, REF, ALT) %>%
  left_join(big_haps, ., by = "POS") %>%
  select(ChromKey, POS, REF, ALT, everything())
For some of the upcoming analyses, it will be informative to include the variation seen in the fish from Johnson Creek that Narum et al. used in their assembly. In 200.2 I aligned their genome to Otsh_v1.0 and made a tibble of variants.
# first, get the distinct positions in big_haps2
bh2d <- big_haps2 %>%
  select(ChromKey, POS, REF, ALT, spring_allele) %>%
  distinct()

# get narvars and nar_refms (Johnson Creek variants) from 003
load("outputs/003/greb_region_narum_variants.rda")

# now, we are going to focus only on the variants called in the broader
# data set, so we left_join stuff onto those
all_together <- bh2d %>%
  left_join(., narvars) %>%
  left_join(., nar_refms)

# and now we create a data frame that includes all the information for
# the Interior Columbia individual, as if it were part of our data set.
# We do this by choosing its alleles, naming it, etc.
jc_fish <- all_together %>%
  mutate(
    Indiv = "Narum_et_al_Chinook_Assembly",
    haplo = "a",
    allele = case_when(
      (!is.na(Narum_ALT) & Narum_ALT == ALT) ~ Narum_ALT,
      nchar(REF) > 1 & Narum == ALT ~ REF,  # REF has no deletion, but neither did Narum, so Narum is REF here. A tricky one.
      (REF == nar_refmsREF & REF == Narum) ~ Narum,
      TRUE ~ NA_character_
    ),
    haplo_name = "Narum_et_al_Chinook_Assembly-a",
    pop = "Johnson Ck.",
    ecotype = "Spring",
    lineage = "Upper Columbia",
    alle2 = ifelse(allele == spring_allele, "S", "F")
  ) %>%
  select(-narvar_REF, -Narum_ALT, -nar_refmsREF, -Narum)
Note that bind_rows(big_haps2, jc_fish) will add the Johnson Creek fish into the data set.
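For example (a sketch; the object name simply mirrors the stored result big_haps2_with_jc.rds noted below):

# combine the retained fish with the Johnson Creek assembly haplotype
big_haps2_with_jc <- bind_rows(big_haps2, jc_fish)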
We now save the haplotypes in some tidy tibbles for future use in creating trees and haplo-rasters.
write_rds(big_haps2, path = "outputs/004/big_haps2.rds", compress = "xz")
write_rds(jc_fish, path = "outputs/004/jc_fish_haps.rds", compress = "xz")
write_rds(haps$fix, path = "outputs/004/big_haps2_fixed_info.rds", compress = "xz")
Note: The process of imputing and phasing with BEAGLE can produce different results on different runs of the algorithm. Downstream analyses can then be subtly different, though the important patterns remain the same. To reproduce exactly the figures presented in the paper, we provide the output from the BEAGLE run used for downstream analyses in:
./stored_results/004/big_haps2.rds
./stored_results/004/big_haps2_with_jc.rds
Downstream analyses include options (selected with a variable named like USE_STORED_*) to use the stored results.
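The general shape of that switch looks something like this (a hypothetical sketch; the actual variable names differ from notebook to notebook):

# USE_STORED_BIG_HAPS2 is a made-up example name for the USE_STORED_* pattern
USE_STORED_BIG_HAPS2 <- TRUE
big_haps2 <- if (USE_STORED_BIG_HAPS2) {
  read_rds("stored_results/004/big_haps2.rds")
} else {
  read_rds("outputs/004/big_haps2.rds")
}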
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2020-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## ape 5.3 2019-03-17 [1] CRAN (R 3.6.0)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## cluster 2.1.0 2019-06-19 [2] CRAN (R 3.6.2)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.0)
## dbplyr 1.4.3 2020-04-19 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
## ecaRbioinf 0.1.0 2020-02-13 [1] local
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.0)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
## glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
## hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
## lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## MASS 7.3-51.6 2020-04-26 [1] CRAN (R 3.6.2)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 3.6.2)
## memuse 4.1-0 2020-02-17 [1] CRAN (R 3.6.0)
## mgcv 1.8-31 2019-11-09 [2] CRAN (R 3.6.2)
## modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
## permute 0.9-5 2019-03-12 [1] CRAN (R 3.6.0)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pinfsc50 1.1.0 2016-12-02 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.0)
## rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
## vcfR * 1.10.0 2020-02-06 [1] CRAN (R 3.6.0)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
## vegan 2.5-6 2019-09-01 [1] CRAN (R 3.6.0)
## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.6.0)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.2)
## xfun 0.13 2020-04-13 [1] CRAN (R 3.6.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
##
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
## Time difference of 12.02159 mins