Here we compute the absolute value of the allele freq difference between spring and fall run fish over all the assembled chromosomes (and preliminarily in all the scaffolds > 5 Mb, though these were not retained for the final figure). In this notebook we also do the same over the 5.16 Mb around GREB1L and create plots.
library(tidyverse)
library(viridis)
dir.create("outputs/002", recursive = TRUE, showWarnings = FALSE)
dir.create("intermediates/002", recursive = TRUE, showWarnings = FALSE)
We do this in R.
pmeta <- read_csv("data/wgs-chinook-samples.csv")
# get file with fall-run names in it
pmeta %>%
filter(run_type == "Fall") %>%
.$vcf_name %>%
cat(., file = "intermediates/002/fall-run.txt", sep = "\n")
# get file with spring-run names in it
pmeta %>%
filter(run_type == "Spring") %>%
.$vcf_name %>%
cat(., file = "intermediates/002/spring-run.txt", sep = "\n")
I copied the results from the previous section to the cluster’s scratch:
/u/flashscratch/e/eriq/abs-diff-alle-freq/inputs/fall-run.txt
/u/flashscratch/e/eriq/abs-diff-alle-freq/inputs/spring-run.txt
Now we just need to use those to subset the vcf files. We will use vcftools. This assumes a directory vcf
that includes all the vcf files.
# in: /u/flashscratch/e/eriq/abs-diff-alle-freq
mkdir subsetted-vcfs
ls ~/nobackup-kruegg/chinook-wgs/vcf/*.gz > vcflist.txt
#here is what that looks like:
head vcflist.txt
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037097.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037098.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037099.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037100.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037101.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037102.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037103.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037104.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037105.1.vcf.gz
/u/home/e/eriq/nobackup-kruegg/chinook-wgs/vcf/NC_037106.1.vcf.gz
# now we make a job list for these:
awk '{printf("%d\tfall-run\t%s\n", ++n, $1); printf("%d\tspring-run\t%s\n", ++n, $1);}' vcflist.txt > job_list.txt
And now a quick shell script to run this as a job array:
#!/bin/bash
#$ -cwd
#$ -V
#$ -N bvcf
#$ -o bvcf-$TASK_ID.log
#$ -e bvcf-$TASK_ID.error
#$ -l h_data=4G,time=1:00:00
#$ -M eric.anderson@noaa.gov
#$ -t 1-3:1
#$ -m a
source /u/local/Modules/default/init/modules.sh
module load vcftools
# get the relevant line from the id file and store as a bash array,
# then get the necessary parts of it
IDFILE=job_list.txt
# get it as an array of three things
LINE=($(awk -v N=$SGE_TASK_ID '$1==N {print}' $IDFILE))
# then break out what we need
ecotype=${LINE[1]}
vcf=${LINE[2]}
b=$(basename $vcf)
f=${b/.vcf.gz/}
vcftools --gzvcf $vcf --keep inputs/$ecotype.txt --out subsetted-vcfs/${ecotype}-$f --recode
sleep 180
And we launch that like this:
qsub split-vcfs-job-array.sh
That got through everything pretty quickly.
Let’s make a file list for jobs and an output directory
ls -l subsetted-vcfs/*.vcf | awk '{printf("%d\t%s\n", ++n, $NF)}' > angsd_jobs.txt
mkdir angsd_outputs
Then a quick job array:
#!/bin/bash
#$ -cwd
#$ -V
#$ -N aaf
#$ -o aaf-$TASK_ID.log
#$ -e aaf-$TASK_ID.error
#$ -l h_data=4G,time=1:00:00
#$ -M eric.anderson@noaa.gov
#$ -t 1-3:1
#$ -m a
source /u/local/Modules/default/init/modules.sh
# get the relevant line from the id file and store as a bash array,
# then get the necessary parts of it
IDFILE=angsd_jobs.txt
# get it as an array of three things
LINE=($(awk -v N=$SGE_TASK_ID '$1==N {print}' $IDFILE))
# then break out what we need
vcf=${LINE[1]}
b=$(basename $vcf)
f=${b/.recode.vcf/}
FAI=/u/home/e/eriq/nobackup-kruegg/chinook-wgs/genome/Otsh_v1.0_genomic.fna.fai
angsd -vcf-gl $vcf -fai $FAI -nind 64 -domaf 3 -out angsd_outputs/$f > angsd_outputs/$f.stdout 2> angsd_outputs/$f.stderr
sleep 180
Launch that:
qsub angsd_array.sh
That finished pretty quickly.
First, get all the mafs.gz files over to my laptop. I put them all into:
/Users/eriq/Documents/UnsyncedData/chinook-wgs-alle-freqs-fall-v-spring
Anyone reproducing this will have to adjust their paths accordingly.
Now, our strategy is to read the spring run and the fall run separately, and filter each on the read depth requirement, separately. Then we will do an inner join of all those.
Get the spring run:
spring_files <- dir(path = "~/Documents/UnsyncedData/chinook-wgs-alle-freqs-fall-v-spring",
pattern = "spring-run",
full.names = TRUE)
spring_freqs <- lapply(spring_files, function(x) {
read_tsv(x) %>%
filter(nInd >= 30)
}) %>%
bind_rows()
Then get the fall run:
fall_files <- dir(path = "~/Documents/UnsyncedData/chinook-wgs-alle-freqs-fall-v-spring",
pattern = "fall-run",
full.names = TRUE)
fall_freqs <- lapply(fall_files, function(x) {
read_tsv(x) %>%
filter(nInd >= 30)
}) %>%
bind_rows()
Now we do the inner join:
spring_fall <- inner_join(
fall_freqs,
spring_freqs,
by = c("chromo", "position"),
suffix = c("_fall", "_spring")) %>%
mutate(
ave_freq = (unknownEM_fall + unknownEM_spring) / 2,
abs_diff = abs(unknownEM_fall - unknownEM_spring)
)
And it is quite straightforward from here. Note that we have 9,170,403 SNPs here.
I considered compressing spring_fall
into a stored result, but that was going to be somewhere >200 Mb, and that did not seem worth it.
First, we are going to make a 2-D histogram of average allele freq vs absolute difference.
g <- ggplot(spring_fall, aes(x = ave_freq, y = abs_diff)) +
geom_hex(binwidth = 0.001) +
scale_fill_viridis_c()
g
So, that gives us some good guidance for choosing an abs_diff cutoff to ease the plotting. Like 0.3 or 0.375 might be good.
But, let’s see how many SNPs have abs_diff > 0.2, 0.25, 0.3?
pts <- c(0.2, 0.25, 0.3, 0.35, 0.375)
names(pts) <- pts
lapply(pts, function(x) sum(spring_fall$abs_diff > x))
## $`0.2`
## [1] 139719
##
## $`0.25`
## [1] 45901
##
## $`0.3`
## [1] 13821
##
## $`0.35`
## [1] 4201
##
## $`0.375`
## [1] 2483
So, 0.25 is probably a decent cutoff to keep the figure file size down, but also retain most of the information. Let’s make a manhattan plot of that.
I have some code for this, but need the chromosome lengths: We need to get the chromosome lengths for all of these. We can pull those out of the VCF file.
(echo "Chromosome chrom_length"; gunzip -c data/greb1l-ish-region.vcf.gz 2> /dev/null | awk -F"=" '/^##contig/ {print $3, $4} !/^#/ {exit}' | sed 's/,length//g; s/>//g;') > intermediates/002/chrom_lengths.txt
Now, filter down to abs_diff > 0.25, and get the chrom_lengths
sf_lite <- spring_fall %>%
filter(abs_diff > 0.25) %>%
rename(Chromosome = chromo)
chrom_lengths <- read_table2("intermediates/002/chrom_lengths.txt")
Now, my function to make the plot. This website https://www.r-graph-gallery.com/wp-content/uploads/2018/02/Manhattan_plot_in_R.html had a nice discussion of it. We’ve store two functions in R/
: my_mh_prep
and plot_mh
and source them here:
source("R/manhattan_plot_funcs.R")
Now, use those funcs:
sf_prepped <- my_mh_prep(sf_lite, chrom_lengths)
mh_plot <- plot_mh(sf_prepped$snps, sf_prepped$axisdf) +
xlab("Position along chromosome") +
ylab("Absolute value of allele frequency difference, spring-run vs. fall-run")
#ggsave(mh_plot, filename = "outputs/002/abs-diff-alle-freq-manhattan.pdf", width = 12, height = 8)
mh_plot
Looking at that plot, it is clear that there is a big chunk missing out of NC_037105.1 in the middle of the chromosome, and also at the right hand end of it. There is also a big chunk missing from NC_037106.1. Interestingly, the gap in the middle of NC_037105.1 shows up in a previous association study I did directly from the bams. But not the gap at the end of NC_037105.1, nor on NC_037106.1. Something must have gone awry with GATK in those regions. So, we will go back to the bams to fill those spots.
Let’s look at which areas we are missing here
spring_fall %>%
group_by(chromo) %>%
summarise(min = min(position),
max = max(position)) %>%
left_join(chrom_lengths, by = c("chromo" = "Chromosome"))
Yeah, 106.1 is missing something like 35 Mb of stuff in it. I don’t see clear evidence that everything is missing at the end of 105.1, though.
Let’s have a look at those chromosomes and snp density:
sf_lite %>%
filter(Chromosome %in% c("NC_037105.1", "NC_037106.1") ) %>%
ggplot(., aes(x = position, y = abs_diff)) +
geom_point(alpha = 0.2) +
facet_wrap(~ Chromosome, scales = "free", ncol = 1) +
scale_x_continuous(breaks = (1:10) * 1e7)
However the bams all seem to have plenty of alignments in places where there are no SNPs:
2019-04-01 22:59 /chinook_WGS_processed_large_contigs/--% pwd
/Volumes/KanaloaEXTRA/chinook_WGS_processed_large_contigs
2019-04-01 23:04 /chinook_WGS_processed_large_contigs/--% for i in *.bam; do echo -n "$i "; samtools view $i NC_037105.1:40000000-60000000 | wc; done
DPCh_plate1_A01_S1.rmdup.bam 64331 1058220 30255186
DPCh_plate1_A02_S2.rmdup.bam 147407 2426280 69154811
DPCh_plate1_A03_S3.rmdup.bam 24809 407627 11701193
DPCh_plate1_A04_S4.rmdup.bam 28937 475422 13613302
DPCh_plate1_A05_S5.rmdup.bam 102850 1692399 48234066
DPCh_plate1_A06_S6.rmdup.bam 57725 949480 27093163
DPCh_plate1_A07_S7.rmdup.bam 126506 2081428 59374228
DPCh_plate1_A08_S8.rmdup.bam 135176 2223432 63340986
So, there are very clearly alignments all along there in the BAMS. I shall call SNPs on those locations from those BAMS, and I will estimate alle freqs there using ANGSD, to fill in the holes.
In fact I will be a little more general, and see if there were any other “holes” in the VCF that we can fill by re-running those sections with just the BAMS in Angsd.
We are going to focus only on the chromosomes (none of the NW_ segments). I am going to find any 100 Kb sections that lack any SNPs at all in spring_fall, and then I will merge those all into contiguous pieces, and then extract those from the bams, and then just do allele frequency estimation with ANGSD on those piecemeal BAMS.
We will chop each chromosome up into 1 Mb sections with cut
. We cut them all with the same intervals that will accommodate the longest chromosome. Then we will filter that stuff later. We aren’t going to worry much about the ends.
longest <- max(chrom_lengths$chrom_length)
breaks <- c(seq(1, longest, by = 1000000), longest)
labs <- sprintf("%d-%d", breaks, breaks + 999999) #, sprintf("%09d-%09d", max(breaks) + 99999, longest))
labs <- labs[-length(labs)]
intvls <- spring_fall %>%
rename(Chromosome = chromo) %>%
select(Chromosome, position) %>%
mutate(interval = cut(position, breaks = breaks, labels = labs)) %>%
count(Chromosome, interval, .drop = FALSE) %>%
mutate(interval = as.character(interval)) %>%
separate(interval, into = c("lo", "hi"), sep = "-", remove = FALSE, convert = TRUE) %>%
left_join(., chrom_lengths, by = "Chromosome") %>%
filter(lo < chrom_length) %>%
filter(str_detect(Chromosome, "^NC_"))
At this juncture we write a little function that will give integer indexes to contiguous runs using rle (run-length-encoding).
#' @param v a logical vector
contig_groups <- function(v) {
rl <- rle(v)
idx <- 1:length(rl$values)
rep(idx, rl$lengths)
}
redo_from_bams <- intvls %>%
arrange(Chromosome, lo) %>%
mutate(cgroup = contig_groups(n)) %>%
filter(n == 0) %>%
group_by(Chromosome, cgroup) %>%
summarise(lo = min(lo),
hi = max(hi),
length = hi - lo + 1)
# and now all we need to do is print a string of regions to pick out of the bams:
bamstr <- redo_from_bams %>%
mutate(region = sprintf("%s:%d-%d", Chromosome, lo, hi))
# here is what it looks like
cat(bamstr$region, sep = " ")
## NC_037097.1:33000001-34000000 NC_037097.1:71000001-74000000 NC_037098.1:5000001-6000000 NC_037098.1:21000001-22000000 NC_037099.1:26000001-27000000 NC_037099.1:28000001-29000000 NC_037100.1:13000001-14000000 NC_037100.1:35000001-36000000 NC_037100.1:37000001-38000000 NC_037100.1:39000001-40000000 NC_037101.1:24000001-25000000 NC_037101.1:28000001-29000000 NC_037102.1:13000001-14000000 NC_037102.1:27000001-29000000 NC_037102.1:33000001-34000000 NC_037103.1:15000001-16000000 NC_037103.1:81000001-82000000 NC_037105.1:47000001-48000000 NC_037105.1:52000001-54000000 NC_037105.1:78000001-90000000 NC_037106.1:25000001-61000000 NC_037107.1:48000001-49000000 NC_037108.1:39000001-40000000 NC_037108.1:52000001-54000000 NC_037109.1:55000001-56000000 NC_037109.1:66000001-69000000 NC_037110.1:10000001-11000000 NC_037110.1:12000001-13000000 NC_037112.1:40000001-41000000 NC_037112.1:42000001-43000000 NC_037112.1:50000001-51000000 NC_037114.1:22000001-23000000 NC_037114.1:24000001-25000000 NC_037114.1:32000001-41000000 NC_037115.1:48000001-51000000 NC_037116.1:12000001-13000000 NC_037116.1:36000001-38000000 NC_037117.1:33000001-38000000 NC_037121.1:18000001-19000000 NC_037125.1:19000001-20000000 NC_037125.1:21000001-22000000 NC_037126.1:1-1000000 NC_037126.1:29000001-30000000 NC_037127.1:30000001-37000000 NC_037130.1:11000001-12000000
# here is where we put it into a file:
cat(bamstr$region, sep = " ", file = "intermediates/002/reg_str_file.txt")
So, that will be picking about 100 Mb out of the bams.
Doing this on my laptop, pulling them off KanaloaExtra drive, putting them into /tmp, and then I will stuff them up to Hoffman.
/chinook-wgs/--% (master)
for i in /Volumes/KanaloaEXTRA/chinook_WGS_processed_large_contigs/*.bam; do
j=$(basename $i);
echo $i;
samtools view -b -1 $i $(cat intermediates/002/reg_str_file.txt) > /tmp/bam-holes/$j;
done
I copied that all to flashscratch on hoffman.
I will do this in: /u/flashscratch/e/eriq/bam-holes
# first, get the spring and fall run lists
/bam-holes/--% cat ../abs-diff-alle-freq/inputs/fall-run.txt | awk '{print "../full-wgs-chinook-data/bams/" $1 ".rmdup.bam"}' > inputs/fall-run-bamlist.txt
/bam-holes/--% cat ../abs-diff-alle-freq/inputs/spring-run.txt | awk '{print "../full-wgs-chinook-data/bams/" $1 ".rmdup.bam"}' > inputs/spring-run-bamlist.txt
# note that I put the bams into full-chinook-data/bams, but they are just the 100 Mb or so
# of the gaps
# and then we just need an ANGSD command line for each:
angsd -out out -doMajorMinor 1 -doMaf 3 -bam inputs/fall-run-bamlist.txt -GL 1 -SNP_pval 1e-6 -minInd 30
# that appears to be working just fine, but is pretty darn slow.
# I shall job-array it over 1 Mb chunks...
# that means i need to index the bams...done
Now, get a file with 1 Mb regions to process in a job array
intvls %>%
filter(n == 0) %>%
mutate(angsd_job_str = str_c(1:n(), "\t", Chromosome, ":", lo, "-", hi)) %>%
.$angsd_job_str %>%
cat(., sep = "\n", file = "intermediates/002/bam-hole-regions.txt")
Then I put spring and fall on there:
Then I put that on hoffman.
And now we can make a job array script to run those:
#!/bin/bash
#$ -cwd
#$ -V
#$ -N bamhole
#$ -o bamhole-$TASK_ID.log
#$ -e bamhole-$TASK_ID.error
#$ -l h_data=8G,time=24:00:00
#$ -M eric.anderson@noaa.gov
#$ -t 1-3:1
#$ -m a
# get the relevant line from the id file and store as a bash array,
# then get the necessary parts of it
IDFILE=inputs/bh-spring-fall-comms.txt
# get it as an array of three things
LINE=($(awk -v N=$SGE_TASK_ID '$1==N {print}' $IDFILE))
# then break out what we need
region=${LINE[1]}
run=${LINE[2]}
jobnum=$(printf "%04d" $SGE_TASK_ID)
f=${run}_job_$jobnum.${region/:/_}
mkdir -p angsd_outputs
angsd -out angsd_outputs/$f -doMajorMinor 1 -doMaf 3 -bam inputs/${run}-run-bamlist.txt \
-r $region -GL 1 -SNP_pval 1e-6 \
-minInd 30 > angsd_outputs/$f.stdout 2> angsd_outputs/$f.stderr
sleep 180
Then launch those:
/bam-holes/--% qsub bam-hole-job-array.sh
Your job-array 679949.1-4:1 ("bamhole") has been submitted
/bam-holes/--% qsub bam-hole-job-array.sh
Your job-array 679953.5-238:1 ("bamhole") has been submitted
Those all ran to completion overnight, and now I will put the mafs.gz files into outputs. They are locally in ~/Documents/git-repos/chinook-wgs/outputs/317/angsd_outputs
and take up 80 Mb. Other users will need to modify the paths as appropriate.
files <- dir(path = "~/Documents/git-repos/chinook-wgs/outputs/317/angsd_outputs", full.names = TRUE)
ecotype <- ifelse(str_detect(files, "spring"), "spring", "fall")
angsd_bam_hole_freqs <- lapply(seq_along(files), function(i) {
fin <- files[i]
read_tsv(fin) %>%
mutate(ecotype = ecotype[i])
}) %>% bind_rows()
Note that this already has nInd >= 30.
bf_spring <- angsd_bam_hole_freqs %>%
filter(ecotype == "spring")
bf_fall <- angsd_bam_hole_freqs %>%
filter(ecotype == "fall")
full_comp <- bf_fall %>%
full_join(bf_spring, by = c("chromo", "position"), suffix = c("_fall", "_spring"))
tmp <- full_comp %>%
filter(xor(is.na(unknownEM_fall), is.na(unknownEM_spring)))
# plot histogram of spring freqs for variants absent in fall
ggplot(tmp %>% filter(is.na(unknownEM_fall)), aes(x = unknownEM_spring)) +
geom_histogram(binwidth = 0.01)
# and confirm that
tmp %>%
filter(is.na(unknownEM_fall)) %>%
arrange(desc(unknownEM_spring)) %>%
head()
# nothing near fixation
# plot histogram of fall freqs for variants absent in spring
ggplot(tmp %>% filter(is.na(unknownEM_spring)), aes(x = unknownEM_fall)) +
geom_histogram(binwidth = 0.01)
# check top of that list
tmp %>%
filter(is.na(unknownEM_spring)) %>%
arrange(desc(unknownEM_fall)) %>%
head()
ANGSD remains a little mysterious about how it identifies minor alleles, but in previous tests it did REF and ALT consistently.
Let’s inner join these and only compare things that were found in both.
# join them and then
# get just the columns that are common to sf_lite
inner_comp <- bf_fall %>%
inner_join(bf_spring, by = c("chromo", "position"), suffix = c("_fall", "_spring")) %>%
mutate(
ave_freq = (unknownEM_fall + unknownEM_spring) / 2,
abs_diff = abs(unknownEM_fall - unknownEM_spring)
) %>%
rename(Chromosome = chromo) %>%
.[names(sf_lite)] %>%
mutate(source = "bam")
# If we have a value from angsd that we already have from the VCF, then use the VCF one,
# but, of course we shouldn't have any of those...
sf_av_lite <- inner_comp %>%
anti_join(., sf_lite, by = c("Chromosome", "position")) %>%
filter(abs_diff >= 0.2) %>% # keep only these
bind_rows(., sf_lite %>% mutate(source = "vcf")) %>%
arrange(Chromosome, position)
Now we should be able to make the Manhattan plot:
sf_av_prepped <- my_mh_prep(sf_av_lite, chrom_lengths)
mh_av_plot <- plot_mh(sf_av_prepped$snps, sf_av_prepped$axisdf) +
xlab("Position along chromosome") +
ylab("Absolute value of allele frequency difference, spring-run vs. fall-run")
mh_av_plot
## Warning: Removed 2943 rows containing missing values (geom_point).
So, there is still a big gap on 105. Let’s explore this:
sf_av_lite %>%
filter(Chromosome == "NC_037105.1") %>%
ggplot(aes(x = position)) +
geom_histogram(binwidth = 1000000)
What about the raw unfiltered stuff from spring and fall?
bf_spring %>%
filter(chromo == "NC_037105.1") %>%
ggplot(aes(x = position)) +
geom_histogram(binwidth = 1000000) +
ggtitle("spring")
bf_fall %>%
filter(chromo == "NC_037105.1") %>%
ggplot(aes(x = position)) +
geom_histogram(binwidth = 1000000) +
ggtitle("fall")
So, we didn’t pick up any new stuff int the 43 Mb and 53 Mb area. So, I think there just is something difficult going on there. Perhaps it is a centromere or something. Even on human GWAS manhattan plots there are some comparable gaps, so I am going to leave it.
I have to rename the chromosomes appropriately. Gonna just pull the table off the NCBI page:
chrom_name_tib <- read_tsv("data/chromosome-name-demangle.txt") %>%
mutate(chrom_num = as.integer(str_replace_all(molecule_name, "Chromosome ", ""))) %>%
rename(cluster = RefSeq_sequence) %>%
select(cluster, chrom_num)
sf_av_renamed <- sf_av_prepped
sf_av_renamed$axisdf <- sf_av_renamed$axisdf %>%
left_join(chrom_name_tib, by = c("cluster")) %>%
mutate(cluster = chrom_num) %>%
select(-chrom_num) %>%
filter(!is.na(cluster))
sf_av_renamed$snps <- sf_av_renamed$snps %>%
filter(str_detect(Chromosome, "^NC"))
mh_avr_plot <- plot_mh(sf_av_renamed$snps, sf_av_renamed$axisdf) +
xlab("Position on numbered chromosomes") +
ylab("Absolute value of allele frequency difference, spring-run vs. fall-run") +
scale_y_continuous(limits = c(0.25, 1.01), expand = c(0, 0))
mh_avr_plot
## Warning: Removed 2943 rows containing missing values (geom_point).
Now, I am going to save the object so I can reproduce the plot again more easily if need be.
write_rds(mh_avr_plot, path = "outputs/002/mh_avr_plot.rds", compress = "xz")
We do this separately from a small part of the VCF file (included in the repo, here) to do it between fall run and spring and also between fall run and spring + winter run.
We grabbed about 5.165 Mb worth of genome out of the Chromosome 28 VCF:
vcftools --gzvcf NC_037124.1.vcf.gz --out grebby.vcf --chr NC_037124.1 --from-bp 9660000 --to-bp 14825000 --recode
To keep repository size down, we have bgzipped and indexed the results and saved them into: ./data/greb1l-ish-region.vcf.gz
Now, just make text files with all the spring or all the falls in there
meta <- read_csv("data/wgs-chinook-samples.csv")
meta %>%
filter(run_type == "Spring") %>%
.$vcf_name %>%
cat(., sep = "\n", file = "intermediates/002/springers.txt")
meta %>%
filter(run_type == "Fall") %>%
.$vcf_name %>%
cat(., sep = "\n", file = "intermediates/002/falls.txt")
# let's do it for winter run too
meta %>%
filter(run_type == "Winter" | run_type == "Spring") %>%
.$vcf_name %>%
cat(., sep = "\n", file = "intermediates/002/spring-winters.txt")
Done quickly with bcftools
bcftools view -S intermediates/002/springers.txt data/greb1l-ish-region.vcf.gz -Oz > intermediates/002/spring-fish.vcf.gz
bcftools view -S intermediates/002/falls.txt data/greb1l-ish-region.vcf.gz -Oz > intermediates/002/fall-fish.vcf.gz
bcftools view -S intermediates/002/spring-winters.txt data/greb1l-ish-region.vcf.gz -Oz > intermediates/002/spring-winter-fish.vcf.gz
Run as configured for your own system on the VCF files created above.
cd intermediates/002
angsd -vcf-gl spring-fish.vcf.gz -fai ../../genome/Otsh_v1.0_genomic.fna.fai -nind 64 -domaf 3 -out springs 2> springs.redirect.stderr
angsd -vcf-gl fall-fish.vcf.gz -fai ../../genome/Otsh_v1.0_genomic.fna.fai -nind 64 -domaf 3 -out falls 2> falls.redirect.stderr
angsd -vcf-gl spring-winter-fish.vcf.gz -fai ../../genome/Otsh_v1.0_genomic.fna.fai -nind 80 -domaf 3 -out spring-winters 2> spring-winters.redirect.stderr
Then copy the mafs.gz outfiles back to laptop and we have a look into outputs/002
fallmaf <- read_tsv("intermediates/002/falls.mafs.gz")
springmaf <- read_tsv("intermediates/002/springs.mafs.gz")
spring_wintmaf <- read_tsv("intermediates/002/spring-winters.mafs.gz")
# merge the spring and the fall here, and filter so that we have 30 individuals in each
everyone_moiged <- left_join(fallmaf, springmaf, by = c("chromo", "position")) %>%
mutate(abs_diff = abs(knownEM.y - knownEM.x),
diff_ge_98 = abs_diff > 0.98) %>%
mutate(`diff > 0.98` = diff_ge_98)
# then filter it down so that we only keep ones that have enough
# individuals that were successfully genotyped
moiged <- everyone_moiged %>%
filter(nInd.x >= 30 & nInd.y >= 30)
# now .x = fall and .y = spring
# it turns out that major and minor line up with both data sets
# merge the spring-winter and the fall here, and filter so that we have 30 individuals in each
moiged2 <- left_join(fallmaf, spring_wintmaf, by = c("chromo", "position")) %>%
mutate(abs_diff = abs(knownEM.y - knownEM.x),
diff_ge_98 = abs_diff > 0.98) %>%
filter(nInd.x >= 30 & nInd.y >= 38) %>%
mutate(`diff > 0.98` = diff_ge_98)
Now, let’s add the exons in GREB1L and ROCK1 from the GFF to this region.
# make a bed file that will pick out the span from GREB1L and ROCK1
echo "NC_037124.1 12230325 12424394" | awk '{printf("%s\t%s\t%s\n", $1, $2, $3);}' > intermediates/002/greb-rock.bed
# extract that region from the GFF
bedtools intersect -a ~/Documents/UnsyncedData/Otsh_v1.0/GCF_002872995.1_Otsh_v1.0_genomic.gff.gz -b intermediates/002/greb-rock.bed > intermediates/002/greb-rock.gff
Now, read that in and play with it:
# read em in
gff <- read_tsv("intermediates/002/greb-rock.gff", col_names = FALSE) %>%
setNames(c("chrom",
"source",
"what",
"start",
"stop",
"X1", "X2", "X3",
"keyvals"
))
# expand the keyvals
gff_exp <- ecaRbioinf::separate_key_value_pairs(gff, "keyvals")
## Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Now, we will filter that down to only exons and CDS and then we will categorize each as GREB1L or ROCK1, and we will toss columns that we won’t need.
exocds <- gff_exp %>%
filter(what %in% c("exon", "CDS")) %>%
mutate(gene = case_when(
str_detect(product, "^GREB1") ~ "GREB1L",
str_detect(product, "^rho") ~ "ROCK1",
TRUE ~ NA_character_
)) %>%
select(what, start, stop, gene) %>%
mutate(y = ifelse(what == "exon", 1, 2))
Keep the exons.
exons <- exocds %>%
filter(what == "exon") %>%
mutate(start_mb = start / 1e06,
stop_mb = stop / 1e06)
greb <- exons %>%
filter(gene == "GREB1L") %>%
summarise(min = min(start_mb),
max = max(stop_mb))
rock <- exons %>%
filter(gene == "ROCK1") %>%
summarise(min = min(start_mb),
max = max(stop_mb))
# I am going to spooge these out in case they can be used later
save(exons, greb, rock, file = "outputs/002/exons-greb-rock.rda")
fdiffs <- bind_rows(
`Spring-run vs Fall-run` = moiged,
`Winter-run + Spring-run vs Fall-run` = moiged2,
.id = "comparison"
)
# get the nearly fixed ones so that we can plot them on top to make sure
# they show up well.
fdiffs_fixed <- fdiffs %>% filter(`diff > 0.98` == TRUE)
We also plot the locations of GREB1L and ROCK1 on there too.
# do this to get it on just the bottom facet
exons$comparison <- "Winter-run + Spring-run vs Fall-run"
# now prepare a data frame to put the gene extents on the bottom facet only
genes <- tibble(
text = c("GREB1L", "ROCK1"),
min = c(greb$min, rock$min),
max = c(greb$max, rock$max),
comparison = "Winter-run + Spring-run vs Fall-run",
hjust = c(1, 0),
text_x = c(greb$min - 0.005, rock$max + 0.005)
)
svf_ad_for_paper <- ggplot() +
annotate("rect", xmin = 12.26, xmax = 12.29, ymin = -Inf, ymax = Inf, fill = "pink", alpha = 0.6) +
facet_wrap(~ comparison, ncol = 1, scales = "free_y") +
geom_point(data = fdiffs,
mapping = aes(x = position / 1e06, y = abs_diff, colour = `diff > 0.98`,
alpha = `diff > 0.98`),
size = 0.9) +
geom_rect(data = exons,
mapping = aes(xmin = start_mb, xmax = stop_mb, ymin = -0.20, ymax = -0.04),
fill = "darkviolet") +
ylab("Absolute allele frequency difference,\nspring-run vs. fall-run") +
xlab("Position on Chromosome 28 (Mb)") +
theme_bw() +
scale_color_manual(values = c("gray", "black")) +
scale_alpha_manual(values = c(0.3, 1.0)) +
geom_point(data = fdiffs_fixed,
mapping = aes(x = position / 1e06, y = abs_diff, colour = `diff > 0.98`,
alpha = `diff > 0.98`),
colour = "black", alpha = 1.0, size = 0.9) + # plot the black ones on top
guides(color = guide_legend(title = "Frequency difference\ngreater than 0.98"),
alpha = FALSE) +
scale_x_continuous(breaks = seq(11.6, 12.9, by = 0.1), limits = c(11.6, 12.9)) +
geom_hline(yintercept = 0) +
scale_y_continuous(labels = c("0.00", "0.25", "0.50", "0.75", "1.00"), breaks = c(0.00, 0.25, 0.50, 0.75, 1.00)) +
theme(legend.position = "none") +
theme(strip.background = element_blank(),
panel.border = element_rect(colour = "black")) +
geom_segment(data = genes, aes(x = min, xend = max), y = -0.12, yend = -0.12, colour = "darkviolet", size = 1.5) +
geom_text(data = genes, aes(label = text, x = text_x, hjust = hjust), y = -0.12, vjust = 0.5, size = 3.2)
# now, save the plot for later
write_rds(svf_ad_for_paper, path = "outputs/002/ggplot_svf_ad_for_paper.rds", compress = "xz")
svf_ad_for_paper
## Warning: Removed 51563 rows containing missing values (geom_point).
mh_avr_plot <- read_rds("outputs/002/mh_avr_plot.rds")
mh_avr_plot_squat <- mh_avr_plot +
ylab("Absolute allele\nfrequency difference,\nspring- vs. fall-run") +
theme(axis.text.x = element_text(size = 7))
# get the zoomed in one, too
svf_ad_for_paper <- read_rds(path = "outputs/002/ggplot_svf_ad_for_paper.rds") +
ylab("Absolute allele\nfrequency difference") +
guides(color = guide_legend(title = "Difference\n > 0.98"),
alpha = FALSE)
library(cowplot)
figure_2 <- plot_grid(mh_avr_plot_squat,
svf_ad_for_paper,
nrow = 2,
labels = c("A", "B"), rel_heights = c(.4, .6))
## Warning: Removed 2943 rows containing missing values (geom_point).
## Warning: Removed 51563 rows containing missing values (geom_point).
ggsave(figure_2, filename = "outputs/002/allele-frequencies.pdf", height = 6, width = 7)
ggsave(figure_2, filename = "outputs/002/allele-frequencies.png", height = 6, width = 7)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2020-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## cowplot * 1.0.0 2019-07-11 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.0)
## dbplyr 1.4.3 2020-04-19 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
## ecaRbioinf 0.1.0 2020-02-13 [1] local
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.0)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
## glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
## hexbin 1.28.1 2020-02-03 [1] CRAN (R 3.6.0)
## hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
## lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.0)
## rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
## viridis * 0.5.1 2018-03-29 [1] CRAN (R 3.6.0)
## viridisLite * 0.3.0 2018-02-01 [1] CRAN (R 3.6.0)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.2)
## xfun 0.13 2020-04-13 [1] CRAN (R 3.6.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
##
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
## Time difference of 3.496044 mins