library(tidyverse)
library(ecaRbioinf)
dir.create("outputs/003", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/003", recursive = TRUE, showWarnings = FALSE)
Narum et al. (2018) mapped all their pool-seq to a chromosomal scale Chinook assembly that came from a Johnson Creek fish. As the paper says:
This fish was collected at a weir on Johnson Creek, Idaho, USA and was euthanized post-spawning. Chinook salmon in this population represent an ecotype known to be part of the recognized Snake River spring/summer-run evolutionarily significant unit (ESU) and are considered interior Columbia River stream-type phylogenetic lineage.
It would be very much worth getting a sense for what this fish looks like in the RoSA region.
Narum et al ordered all their chromosomes against the mykiss genome, it appears. So, what they call Chromosome 28 is the same as Otsh_V1.0’s chromosome NC_037124.1. So, we will use my lastz pipeline to map the two together and then whittle it down to the Otsh_V1.0’s coordinate system.
The query sequence from the Narum assembly shall be CM009021.1
, and the target sequence will be NC_037124.1.
We will set the identity to be quite high, since this is the same species. Let’s give her a whirl:
mkdir -p intermediates/003/lastz_workdir
./script/lastz-seed-and-chain-narum-chinook.sh intermediates/003/lastz_workdir NC_037124.1 CM009021.1
That takes a few minutes and produces half a gig or so of output (most of which is not needed) which is not stored in this repository, so, if you want to continue in this notebook, you do have to run that last command.
Now, let’s have a look at the dotplot:
dots <- readr::read_tsv("intermediates/003/lastz_workdir/step=20_notransition_inner=1000_identity=97.rdp")
plot(dots, type = "l")
OK. Their assemply is a little different, but it looks like it is right on track.
First thing we might want to do is assess how many ambibuous (heterzygous) IUPAC codes there are in the Narum et al reference sequence:
# just focus on Chrom 28 for now:
samtools faidx genome/GCA_002831465.1_CHI06_genomic.fna CM009021.1 > intermediates/003/CM009021.1.fna
Table it in R:
dna <- read_lines("intermediates/003/CM009021.1.fna")[-1]
dvec <- strsplit(paste(dna, collapse = ""), "")[[1]]
table(dvec)
## dvec
## a A c C g G N t T
## 4815360 5169680 3534816 3864124 3509655 3854533 3119080 4832640 5162476
Great. That means that we don’t have to deal with heterozygous positions in the Narum reference.
Now we compare the two aligned sequences, focusing on isolated SNPs and simple insertions and deletions that do not abut against SNPs, to find variants in that sequence.
It will be easier to assess how well that is doing, too.
# ts = target_seq. qs = query seq.
tq_seqs <- tibble(
ts = toupper(read_lines("./intermediates/003/lastz_workdir/tmp_step=20_notransition_inner=1000_identity=97.001.column")),
qs = toupper(read_lines("./intermediates/003/lastz_workdir/tmp_step=20_notransition_inner=1000_identity=97.002.column"))
)
# here is a little R function that I can pass that into which will create a report of variants that
# includes simple insertions and deletions and simple SNPs
extract_variants_between_two_seqs <- function(S) {
# first, add another column which says whether the sequence region is a multinucleotide area.
# This is a region that is an unbroken series of records in which either the target or the
# query or both have a "-" or in which the target and query mismatch. The way we will go about this
# is first making a base_mismatch column (bm) which is TRUE if and only if the qs and ts are neither "-"
# and also if the the bases mismatch. We also want to record whether the row is
# in a dash-block (db), i.e. if one or the other or both of the columns is "-". While we are at it
# we will number the target sequence bases that are not "-"'s in the Otsh_v1.0 reference, because those, in fact, are the positions
# relative to the target reference sequence.
# Ultimately, we will call an MNP-block-candidate as any collections of rows in which there is
# an unbroken path through bm and db rows. We will filter this later to toss out those
# MNP-candidates that are just adjacent SNPs (basically those are MNP-block-candidates
# that have no dashes in them).
S2 <- S %>%
mutate(
bm = (qs != "-" & ts != "-") & qs != ts,
db = (qs == "-" | ts == "-"),
pos = {
tp <- rep(NA, length.out = n())
tp[ts != "-"] <- 1:sum(ts != "-")
tp
}, # this is a little ugly but gets the job done
mnp_bc_tmp = bm | db
)
# now we want to keep a record of those positions in which both
# target and sequence are the same, and are not N's (and not "-"s)
# to know which SNPs we can say the Narum fish had the reference base on
ref_matches <- S2 %>%
filter(ts %in% c("A", "C", "G", "T") & ts == qs) %>%
rename(POS = pos, REF = ts, Narum = qs) %>%
select(POS, REF, Narum)
# now, we number these mnp-block-candidates so we can group on them. We use rle to give the
# candidates even indexes
S2rle <- rle(S2$mnp_bc_tmp)
# we will define a block as being simple SNPs if that is the only type of variant found in the block
# (and note that many of these will be of length 1). But, we will toss from consideration in that case
# any simple SNP block that includes any N's in the target or the query.
# we will only call something a simple indel block if there are no more than 200 positions in it. (It appears
# that GATK has returned indels up to 193 in length in our WGS data, so I will stick with something around that size.)
S3 <- S2 %>%
mutate(mnp_bc_flag = rep(1:length(S2rle$values), S2rle$lengths)) %>%
group_by(mnp_bc_flag) %>%
mutate(are_simple_snps = mnp_bc_tmp & !any(db) & !any(ts == "N" | qs == "N")) %>%
mutate(are_simple_indels = mnp_bc_tmp & (all(qs == "-") | all(ts == "-")) & n() < 200) %>%
ungroup()
# OK, now, in the simple indels we have to include the position immediately before each true_mnp_block. (This position
# should (if I have done everything right) match between target and query).
S3rle <- rle(S3$are_simple_indels)
# to group the row before each of these true mnp_blocks with the block
# I just make indices in which the odd ones are 1 less in length and the even ones
# are 1 more in length. In order to get things to the right length we need to test whether
# it is ending in a true mnp block or not.
s3reps <- rep(1:length(S3rle$values), S3rle$lengths + rep(c(-1, 1), length.out = length(S3rle$values)))
if (S3$are_simple_indels[length(S3$are_simple_indels)] == FALSE) {
s3reps[length(s3reps) + 1] <- s3reps[length(s3reps)]
}
S4 <- S3 %>%
mutate(simple_indels_grp_idxs = s3reps)
# now all that remains is picking out the REF and ALT alleles at each position
# (simple SNPs first, then simple indels), and then sorting them all back into order.
# We will carry out those operations sepaerately for snps and indes
S4_snps <- S4 %>%
filter(are_simple_snps == TRUE) %>%
mutate(POS = pos, REF = ts, ALT = qs) %>%
select(POS, REF, ALT)
S4_indels <- S4 %>%
filter(simple_indels_grp_idxs %% 2 == 0) %>%
group_by(simple_indels_grp_idxs) %>%
summarise(
POS = pos[1],
REF = str_replace_all(paste(ts, collapse = ""), "-", ""),
ALT = str_replace_all(paste(qs, collapse = ""), "-", "")
) %>%
ungroup() %>%
select(-simple_indels_grp_idxs)
# now we return it
ret <- bind_rows(S4_snps, S4_indels) %>%
arrange(POS) %>%
rename(Narum_ALT = ALT)
list(
variants = ret,
ref_matches = ref_matches
)
}
Let’s give it a whirl, and extract those variants:
chrom28_narum_variants <- extract_variants_between_two_seqs(tq_seqs)
Note the number of variants as a fraction of the chromosome length:
nrow(chrom28_narum_variants$variants) / sum(tq_seqs$ts != "-")
## [1] 0.003906725
So, roughly 4 in every 1000 bp. That seems pretty reasonable.
There is a lot of info in there that we won’t need down the road, so just grab the variants from 9.6 to 14.8 Mb or so:
# filter the variants down to NC_037124.1:9660000–14825000
narvars <- chrom28_narum_variants$variants %>%
filter(POS >= 9.66e6 & POS <= 14.825e6) %>%
rename(narvar_REF = REF)
# filter the ref-matches down to those coordinates as well
nar_refms <- chrom28_narum_variants$ref_matches %>%
filter(POS >= 9.66e6 & POS <= 14.825e6) %>%
rename(nar_refmsREF = REF)
Finally, write this to outputs, as we will want to join it to the phased SNPs in 004 to include in haplotype rasters and trees. We will save these into an rda file.
save(narvars, nar_refms, file = "outputs/003/greb_region_narum_variants.rda", compress = "xz")
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2020-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.0)
## dbplyr 1.4.3 2020-04-19 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
## ecaRbioinf * 0.1.0 2020-02-13 [1] local
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
## glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
## hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
## lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.0)
## rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.2)
## xfun 0.13 2020-04-13 [1] CRAN (R 3.6.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
##
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
## Time difference of 8.812589 mins
Narum, Shawn R., Alex Di Genova, Steven J. Micheletti, and Alejandro Maass. 2018. “Genomic Variation Underlying Complex Life-History Traits Revealed by Genome Sequencing in Chinook Salmon.” Proceedings of the Royal Society B: Biological Sciences 285 (1883): 20180935. doi:10.1098/rspb.2018.0935.