Here we produce the haplo-raster plots. One with the alleles and the other with the read depths in each fish.
library(tidyverse)
library(ecaRbioinf) # will put on github eventually
dir.create("outputs/006", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/006", recursive = TRUE, showWarnings = FALSE)
Grab the big_haps2.rds from the 004 notebook, and also the Johnson Creek fish. Choose here whether to use the stored big_haps (that we used) or others generated by BEAGLE that might be subtly different. In this works there is little difference, but, because BEAGLE randomly imputes and phases things, if you want to get a figure that looks just like the one we produced, you will want to use the stored results from 004.
USE_STORED_BIG_HAPS <- FALSE
if (USE_STORED_BIG_HAPS != TRUE) {
big_haps2 <- read_rds(path = "outputs/004/big_haps2.rds")
jc_fish <- read_rds(path = "outputs/004/jc_fish_haps.rds")
# Make a data frame with everyone, Johnson Creek included:
big_haps2_jc <- bind_rows(big_haps2, jc_fish)
} else {
big_haps2 <- read_rds("stored_results/004/big_haps2.rds")
big_haps2_jc <- read_rds("stored_results/004/big_haps2_with_jc.rds")
}
haps_fix <- read_rds(path = "outputs/004/big_haps2_fixed_info.rds")
Here we get data that is used for the annotation rows below the rasters.
Intersect the gff file with the variants, after whittling it down to one individual
bcftools view -s DPCh_plate1_A01_S1 data/greb1l-ish-region.vcf.gz -Oz > intermediates/006/one-sample-vcf-greb1l-ish.vcf.gz
bedtools intersect -a intermediates/006/one-sample-vcf-greb1l-ish.vcf.gz -b genome/GCF_002872995.1_Otsh_v1.0_genomic.gff.gz -wb > intermediates/006/variant-annotations.tsv
Then read that into a tibble
snp_anno <- read_tsv("intermediates/006/variant-annotations.tsv",
col_names = c(
"CHROM",
"POS",
"dot1",
"REF",
"ALT",
"VSCORE",
"dot2",
"dot3",
"format",
"a_sample",
"CHROM2",
"method",
"what",
"start",
"stop",
"dot4",
"strand",
"phase",
"biggie"
)
)
# look at what we have here
snp_anno %>%
count(what)
So, for a quick visualization, I am first going to classify sites as belonging to genes, pseudo-genes, or non-genes.
Subsequently, I will classify things within genes as exon, CDS, or other.
gene_stuff_func <- function(what) {
if (any(what == "gene")) {
ret <- "gene"
} else if (any(what == "pseudogene")) {
ret <- "pseudogene"
} else {
ret <- "non-gene"
}
ret
}
gene_stuff <- snp_anno %>%
group_by(CHROM, POS) %>%
summarise(ginc = gene_stuff_func(what))
exon_stuff_func <- function(what) {
if (any(what == "CDS")) {
ret <- "CDS"
} else if (any(what == "exon")) {
ret <- "exon"
} else {
ret <- "non-exon"
}
ret
}
# define exon_stuff only for things found in genes
exon_stuff <- snp_anno %>%
semi_join(gene_stuff %>% filter(ginc == "gene"), by = c("CHROM", "POS")) %>%
group_by(CHROM, POS) %>%
summarise(einc = exon_stuff_func(what))
We use snpEff’s designation of synonymous or missense. I will just define things that are missense as codon_pos1
and synonymous ones as codon_pos3
. I am doing this because I am pretty sure that the haplo raster function is hard-wired to use those designations, rather than synonymous/missense.
First, get the SNPs in CDS
cds_snps <- snp_anno %>%
filter(what == "CDS")
# read it in
snpeff <- read_rds("outputs/005/variants-and-annos-5.16.rds")
# only retain positions that are MODERATE or HIGH (this kicks out the synonymous ones)
HiMod <- snpeff %>%
filter(snp_eff_effect %in% c("MODERATE", "HIGH")) %>%
select(CHROM, POS, snp_eff_effect, snp_eff_what)
cds_snps <- cds_snps %>%
mutate(syn_v_nonsyn_codon_pos = ifelse(POS %in% HiMod$POS, "codon_pos1", "codon_pos3"))
I want to include at each SNP in the raster, some information about how much of the sequence nearby is considered repeat sequence. This might also be helpful in the future for other species when we go to design amplicons for sequencing.
I am going to do this just by designating each base with a lowercase letter part of a repeat-masked segment. The strategy will be to pull out the sequence at every position, then make a tibble and rollapply a window over it. The we can join these onto the positions used in each raster.
First, get chinook sequence in the region we are working with, which is from about 9.66 Mb to 14.83 Mb.
samtools faidx genome/Otsh_v1.0_genomic.fna NC_037124.1:9660000-14830000 > intermediates/006/greb-ish-seq-chinook.fna
gzip -f intermediates/006/greb-ish-seq-chinook.fna
Now, read it into a tibble:
dna_tibble <- read_lines("intermediates/006/greb-ish-seq-chinook.fna.gz")[-1] %>%
paste(., collapse = "") %>%
strsplit(., "") %>%
.[[1]] %>%
tibble(
POS = seq(9660000, 14830000, by = 1),
dna = .
) %>%
mutate(inRep = dna %in% c("a", "c", "g", "t"))
# now rollapply that dude in a 200 bp window centered on the current base
dna_repetitive_200 <- dna_tibble %>%
mutate(rep_mean_200 = zoo::rollmean(inRep, 200, fill = NA, na.rm = TRUE))
We will add dashed lines around the columns indicating the RoSA SNPs and also the distal SNPs identified in Prince et al. (2017). We have all those positions in the data
directory.
h_pos_for_paper <- read_tsv("data/assay-positions.tsv") %>%
rename(
name = Marker_name,
POS = Position_on_Chr_28_in_Otsh_v1.0
) %>%
select(name, POS) %>%
mutate(POS = as.integer(POS))
# focus on a smaller region
rosa2 <- big_haps2 %>%
filter(POS > 12.05e6, POS < 12.5e06)
While we are here quickly deduce the RoSA genotype of the reference fish, by focusing only on the SNPs from 12.26 to 12.29
rosa2 %>%
filter(POS > 12.26e6, POS < 12.29e6) %>%
group_by(POS, REF, ALT, spring_allele) %>%
tally() %>%
select(-n) %>%
mutate(spring_is_ref = REF == spring_allele) %>%
group_by(spring_is_ref) %>%
tally()
So, that is saying that the reference fish here has 68 spring alleles and 134 fall alleles in the region from 12.26 to 12.29 Mb. We can compare that to the distribution we see in our fish later on when we are determining heterozygotes. 68 spring alleles puts it solidly within what we expect for a fall run fish.
Order everyone by number of springer alleles in there:
rosa2_sorted <- rosa2 %>%
group_by(haplo_name) %>%
mutate(sumS = sum(alle2 == "S")) %>%
ungroup() %>%
arrange(desc(sumS), haplo_name, POS)
The above data frame has haplotypes ordered from “most springy in the RoSA” to “least springy in the RoSA.”
We want to order haplotypes by ecotype, then lineage, then population then by springyness. We make one order that includes the Johnson Creek fish, and then another one that does not (for the read depth raster).
# first, order by ecotype, and then within that, spring-run-assoc alleles
htmp <- rosa2_sorted %>%
mutate(hnames = haplo_name) %>%
mutate(efact = factor(ecotype, levels = c("Spring", "Winter", "Late_Fall", "Fall"))) %>%
arrange(efact, desc(sumS))
# then, order by ecotype, and within that, springy-ness of the full individual.
# this would put the two haplotypes within an individual next to one another
ht2 <- htmp %>%
group_by(Indiv) %>%
mutate(sumS_indiv = sum(sumS)) %>%
ungroup() %>%
arrange(efact, desc(sumS_indiv), Indiv, sumS)
# ht2 is not what we want, though. So,
# here let us order by ecotype, and then lineage, then pop, and only then on springyness
ht3 <- ht2 %>%
arrange(efact, desc(lineage), pop, desc(sumS), Indiv)
h_ord_eco_line_pop_springy <- unique(ht3$hnames)
# here is a hacky thing to make an ordering in which the Johnson Creek fish
# is in there between the spring and the winters
last_trin_spring <- max(which(str_detect(h_ord_eco_line_pop_springy, "Trinity_River_Hatchery_Spring")))
h_ord_eco_line_pop_springy_jc <- c(
h_ord_eco_line_pop_springy[1:last_trin_spring],
"Narum_et_al_Chinook_Assembly-a",
h_ord_eco_line_pop_springy[-(1:last_trin_spring)]
)
This is just to get the lineage and the ecotype in a decent form:
tmp <- c(1, 2, 3)
names(tmp) <- c("pop", "lineage", "ecotype")
annotation_columns <- rosa2_sorted %>%
mutate(hnames = haplo_name) %>%
select(hnames, lineage, ecotype, pop) %>%
count(hnames, lineage, ecotype, pop) %>%
select(-n) %>%
gather(key = "column", value = "value", -hnames) %>%
mutate(
width = 0.015,
order = tmp[column]
)
# now, I am going to add a label column which is what we want printed on each of these
I want to add to that a row at the bottom which says whether the “spring” allele is ancestral or derived or unknown (because there is no Coho sequence at that point)
ancy_row <- big_haps2 %>%
count(POS, spring_allele) %>%
select(-n) %>%
left_join(haps_fix %>% select(POS, AA), by = "POS") %>%
mutate(
value = ifelse(spring_allele == AA, "ancestral", "derived"),
row = "spring_anc_or_derived",
order = 3,
height = 0.02
)
I also want to have a row below the exon info that shows whether the snp is synonymous or missense.
codon_row <- cds_snps %>%
select(POS, syn_v_nonsyn_codon_pos) %>%
rename(value = syn_v_nonsyn_codon_pos) %>%
mutate(
row = "codon_stuff",
order = 2,
height = 0.02
)
So make a tibble with everything. We can semi-join this thing for different data sets.
annotation_rows_full <- bind_rows(
exon_stuff %>%
ungroup() %>%
select(POS, einc) %>%
rename(value = einc) %>%
mutate(
row = "exon_stuff",
order = 1,
height = 0.02
),
codon_row,
ancy_row
) %>%
mutate(POS = as.integer(POS))
source("R/define_fcolors_all_sf.R")
source("R/function_prep_elements.R")
source("R/function_extra_labels.R")
plist <- prep_elements(12.1e6, 12.6e6, 0.05e6, format = "%.2f", DSET = big_haps2_jc)
## Warning: Column `order` has different attributes on LHS and RHS of join
jca <- haplo_raster_plot(
D = plist$D,
h_ord = h_ord_eco_line_pop_springy_jc,
pos_annot = plist$pos_annot,
pos_bar_text_size = 8.0,
highlight_pos = h_pos_for_paper,
annotation_columns = annotation_columns,
annotation_rows = plist$ann_rows,
fcolors = fcolors_all_sf,
anno_row_start = 10,
snp_quant_tibble = plist$repcontent,
no_legend = TRUE,
no_y_labels = TRUE,
no_x_labels = TRUE
)
## Warning: Column `order` has different attributes on LHS and RHS of join
jca2 <- extra_labels(
g = jca, D = plist$D, h_ord = h_ord_eco_line_pop_springy_jc,
xmids = plist$ac_x_midpoints, ymids = plist$ar_y_midpoints,
row_just = 0.5
) +
expand_limits(x = -130, y = 337)
# now, fill in the white space in the annotation columns with the gray color.
# this is totally a hard-wired hack!
jcy <- which(rev(h_ord_eco_line_pop_springy_jc) == "Narum_et_al_Chinook_Assembly-a")
jca3 <- jca2 +
annotate("rect", xmin = -120, xmax = 0, ymin = jcy - 0.5, ymax = jcy + 0.5, colour = NA, fill = "darkgray")
ggsave(jca3, filename = "outputs/006/haplo-raster-alleles.pdf", width = 49, height = 30)
## Warning: Removed 13 rows containing missing values (geom_segment).
## Warning: Removed 1 rows containing missing values (geom_text).
## Warning: Removed 1 rows containing missing values (geom_text).
We need to figure out the positions for the orange rectangle. It should be from 12.311 to about 12.3215. So, find the variant numbers associated with that. That means we have to prep the elements now…
plist <- prep_elements(12.1e6, 12.6e6, 0.05e6, format = "%.2f")
## Warning: Column `order` has different attributes on LHS and RHS of join
inner_variants <- tibble(POS = unique(plist$D$POS)) %>%
ungroup() %>%
mutate(idx = 1:n()) %>%
filter(POS > 12.311e6, POS < 12.3215e6)
niv <- nrow(inner_variants)
# left and right points
left_x <- inner_variants$idx[1] - 2
right_x <- inner_variants$idx[niv] + 2
# top point is from about the 6th winter run haplotype
# bottom point is first fish
fish_ords <- tibble(fish = h_ord_eco_line_pop_springy) %>%
mutate(idx = 1:n()) %>%
mutate(winter_num = cumsum(str_detect(fish, "Winter")))
top_y <- nrow(fish_ords) - fish_ords$idx[fish_ords$winter_num == 5] + 1
bot_y <- nrow(fish_ords) - fish_ords$idx[nrow(fish_ords)] + 1
g <- haplo_raster_plot(
D = plist$D,
h_ord = h_ord_eco_line_pop_springy,
pos_annot = plist$pos_annot,
pos_bar_text_size = 8.0,
highlight_pos = h_pos_for_paper,
annotation_columns = annotation_columns,
annotation_rows = plist$ann_rows,
fcolors = fcolors_all_sf,
anno_row_start = 10,
snp_quant_tibble = plist$repcontent,
plot_read_depths = TRUE,
no_y_labels = TRUE,
no_x_labels = TRUE
)
## Warning: Column `order` has different attributes on LHS and RHS of join
g2 <- extra_labels(
g = g, D = plist$D, h_ord = h_ord_eco_line_pop_springy,
xmids = plist$ac_x_midpoints, ymids = plist$ar_y_midpoints,
row_just = 0.5
) +
expand_limits(x = -130, y = 337) +
theme(
legend.key.width = unit(1.3, "in"),
legend.key.height = unit(2, "in"),
legend.text = element_text(size = 48.0),
legend.title = element_text(size = 48.0)
) +
annotate("rect", xmin = left_x, xmax = right_x, ymin = bot_y, ymax = top_y, fill = NA, colour = "orange", alpha = 0.5, size = 4)
ggsave(g2, filename = "outputs/006/haplo-raster-read-depths.pdf", width = 49, height = 30)
## Warning: Transformation introduced infinite values in discrete y-axis
## Warning: Transformation introduced infinite values in discrete y-axis
## Warning: Removed 13 rows containing missing values (geom_segment).
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2020-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.0)
## dbplyr 1.4.3 2020-04-19 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
## ecaRbioinf * 0.1.0 2020-02-13 [1] local
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.0)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
## glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
## hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
## lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.0)
## rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.6.0)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.2)
## xfun 0.13 2020-04-13 [1] CRAN (R 3.6.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
## zoo 1.8-7 2020-01-10 [1] CRAN (R 3.6.0)
##
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
## Time difference of 1.29479 mins
Prince, Daniel J., Sean M. O’Rourke, Tasha Q. Thompson, Omar A. Ali, Hannah S. Lyman, Ismail K. Saglam, Thomas J. Hotaling, Adrian P. Spidle, and Michael R. Miller. 2017. “The Evolutionary Basis of Premature Migration in Pacific Salmon Highlights the Utility of Genomics for Informing Conservation.” Science Advances 3 (8): e1603198. doi:10.1126/sciadv.1603198.