In the previous notebook we performed Principal Components Analysis on 114 fish and will use that now.
We need to set up the files to use the -doAsso option. The fish in the analysis on are the rownames of eigenvectors.tsv
.
# make the bamlist
awk 'NR>1 {print "chinook_WGS_processed/" $1 ".rmdup.bam"}' stored_results/203/eigenvectors.tsv > intermediates/203/asso-bamlist.txt
# now make a binary file. We will let fall = 0 and spring = 1
(
awk 'NR>1' data/wgs-chinook-samples.csv;
awk 'NR>1 {print $1}' stored_results/203/eigenvectors.tsv
) | awk -F"," '
NF>1 && $NF=="Fall" {type[$1]=0;}
NF>1 && $NF=="Spring" {type[$1]=1;}
NF>1 {next}
{print type[$1]}
' > intermediates/203/asso-yBin.txt
# now make a file of the covariates. We will make a series of them
# with different numbers of PC's used.
for C in 0 3 10 30 60; do
C0=$(printf "%03d" $C)
echo $C0
awk 'NR>1' stored_results/203/eigenvectors.tsv | \
cut -d" " -f 2-$((C + 1)) > intermediates/203/asso-covariates-$C0.txt
done
With that we are ready to do our runs. We will use the segmentation of the genome that was developed in 202-01.
# start the first 20 jobs to test
sbatch --array=1-20 ./script/203-02-angsd-doAsso-array.sh
# Submitted batch job 51087
# that seems to be doing well. So start up the rest:
sbatch --array=21-2492%240 ./script/203-02-angsd-doAsso-array.sh
# Submitted batch job 51142
# after those are done, check for failures:
for i in intermediates/203/asso/*.arg; do gotit=$(tail -n 1 $i | awk 'BEGIN {msg="No"} /ALL done/ {msg="Yes"} END {print msg}'); if [ $gotit = No ]; then echo $i; fi; done
# results:
intermediates/203/asso/2110.000.arg
intermediates/203/asso/2110.003.arg
intermediates/203/asso/2110.010.arg
intermediates/203/asso/2110.030.arg
intermediates/203/asso/2110.060.arg
# so, we just have to crank up the memory on that one segment
sbatch --array=2110 --mem-per-cpu=30G ./script/203-02-angsd-doAsso-array.sh
# Submitted batch job 54770
# that fixed those.
I also inspected the .stderr
files and found that up to 30 PCs as covariates, things looked stable, but with 60 PCs there were reports of the test becoming unreliable and a suggestion to set minHigh higher. So, I will just not process the results with 60 principal components as covariates. 30 is plenty.
Here we catenate all those results so we can read them into R and compress them.
mkdir intermediates/203/catenated
(echo "n" | awk '{printf("nCov\tCHROM\tPOS\tLRT\n")}'
for C in 000 003 010 030; do
cat intermediates/203/asso/*.$C.lrt0.gz | gunzip -c | \
awk -v nC=$C '$1 != "Chromosome" && $7 > 0 {printf("%d\t%s\t%s\t%s\n", nC, $1, $2, $7)}'
done) > intermediates/203/catenated/asso-results.txt
Now, read that into R and inspect:
library(tidyverse)
lrt <- read_tsv("intermediates/203/catenated/asso-results.txt")
# now filter it down to only those sites with LRT > 2
# (that still leaves us with about 1.6M of them over the 4 numbers
# of covariates)
lrt_f <- lrt %>%
filter(LRT > 2)
# and save that into stored_results
write_rds(lrt_f, path = "stored_results/203/pca-corrected-assoc-lrts-filtered.rds", compress = "xz")
We also want to have the raw results that include the RoSA region. By raw we mean that we haven’t filtered out sites that did not yield a value (i.e. were not filtered out by ANGSD). The 1 Mb block that includes the RoSA has index 1583.
mkdir -p stored_results/203/assoc_raw
cp intermediates/203/asso/1583.* stored_results/203/assoc_raw/
We are now done with the calcs done on the cluster. Back to the laptop…
We now will do some analyses of these GWAS LRT values.
library(tidyverse)
dir.create("outputs/203/", showWarnings = FALSE, recursive = TRUE)
lrt <- read_rds("stored_results/203/pca-corrected-assoc-lrts-filtered.rds")
First, see how the LRT values changed with increasing numbers of PCs used as covariates.
tmp <- lrt %>%
spread(key = nCov, value = LRT) %>%
rename(lrt_no_pc_covariates = `0`) %>%
gather(key = "num_pc_covariates", value = "lrt", `3`:`30`)
# now, we want to be able to see how many points there are with no covariate
# correction (because some end up becoming non-computable by ANGSD when
# the covariates are added).
lrt_against_0 <- tmp %>%
filter(num_pc_covariates == 3) %>%
mutate(num_pc_covariates = "0", lrt = lrt_no_pc_covariates) %>%
bind_rows(tmp) %>%
mutate(num_pc_covariates = factor(as.integer(num_pc_covariates)))
ggplot(lrt_against_0, aes(x = lrt_no_pc_covariates, y = lrt, colour = num_pc_covariates)) +
geom_point(shape = 21) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed")
## Warning: Removed 3100011 rows containing missing values (geom_point).
So, we see that adding 3 PCs as covariates tends to reduce the LRT value. We have some where using 10 or 30 PCs reduces it to zero, but also a number of cases where adding 10 PCs ends up causing no LRT to be computed (the calculations likely become unstable.)
We also have a couple of crazy fliers where adding 10 or 30 PCs as covariates inflates the LRT values dramatically. Those should not be trusted, I believe.
Let’s look at this same thing but shade points by whether they are within 1 Mb of the RoSA or not, to see if we have any associated SNPs outside of that region.
lrt_two_class <- lrt_against_0 %>%
mutate(status = case_when(
CHROM == "NC_037124.1" & POS > 11.26e6 & POS < 13.29e6 ~ "1 Mb from RoSA",
TRUE ~ "Not near RoSA"
)) %>%
arrange(desc(status), desc(lrt))
ggplot(lrt_two_class, aes(x = lrt_no_pc_covariates, y = lrt, colour = status)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
scale_colour_manual(values = c(`Not near RoSA` = "gray", `1 Mb from RoSA` = "black")) +
theme_bw()
## Warning: Removed 3100011 rows containing missing values (geom_point).
OK, we have that most of the high values are near the RoSA.
We now want to make Manhattan plots, from the results above it is clear that we should be quite wary of any values where the LRT with covariate correction is much greater than the LRT with no covariate correction (as those likely represent spurious results). So, I will filter out values that are 1.1x the value with no covariate correction, otherwise there are just manifestly too many.
Maybe we will also plot unfiltered results for 3 PCs. Note that Anders does not recommend using a lot of PCs (better to use something around 3, he says: https://github.com/ANGSD/angsd/issues/149)
lrt_pvalues <- lrt_against_0 %>%
mutate(abs_diff = -(log10(exp(1)) * pchisq(lrt, df = 1, lower.tail = FALSE, log.p = TRUE))) %>%
rename(Chromosome = CHROM, position = POS) %>%
mutate(num_pc_covariates = as.integer(as.character(num_pc_covariates)))
# note, we name the column abs_diff to be compatible with our plotting functions
# but will rename the axis to "negative log10 p-value".
# now, make a version where we filter out the wacky high values that occur
# when adding covariates
lrt_p_values_no_fliers <- lrt_pvalues %>%
filter(lrt < 1.1 * lrt_no_pc_covariates)
Now, prep for MH plots:
source("R/manhattan_plot_funcs.R")
chrom_lengths <- read_tsv("stored_results/202/chrom_lengths_from_bams.txt.gz")
Now, simply make a series of plots. Here is a function to facilitate that:
asso_mh_plots <- function(D) {
MM <- my_mh_prep(D, chrom_lengths)
# correct the position of the NWs label.
idx <- which(MM$axisdf$cluster == "NWs")
MM$axisdf$clust_start[idx] <- MM$axisdf$clust_start[idx - 1] + MM$axisdf$clust_length[idx - 1] + 1
MM$axisdf$clust_center[idx] <- (2 * MM$axisdf$clust_start[idx] + MM$axisdf$clust_length[idx]) / 2
MP <- plot_mh(MM$snps, MM$axisdf) +
xlab("Position Along Chromosome") +
ylab("Negative Log-10 P-value") +
ylim(0.8, NA)
MP
}
And then go for it:
These are the association p-values doing no correction:
asso_mh_plots(D = lrt_p_values_no_fliers %>% filter(num_pc_covariates == 0))
And here we look at the top 40 SNPs:
lrt_pvalues %>%
filter(num_pc_covariates == 0) %>%
arrange(desc(abs_diff)) %>%
slice(1:40) %>%
rename(Log10_Pvalue = abs_diff)
One thing that we note is that we don’t find only one value within the RoSA (the one at 12263509), and its Pvalue is far from the smallest. Here is what I suspect is going on: the association method in doAsso 2
, which allows covariates to be added to it, does not manage well when individuals are mostly homozygotes. As the web page states:
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of of homozygous major, heterozygous and homozygous minor genotypes. At least two of the three genotypes categories needs at least [int] individuals with a genotype probability above 0.9. This filter avoids the scenario where all individuals have genotypes with the same probability e.g. all are heterozygous with a high probability or all have 0.33333333 probability for all three genotypes.
What is happening here is that the RoSA region has a lot of markers that are nearly fixed for alternate alleles between the spring and the fall run group (the cases and control), and as a consequence, ANGSD does not attempt to calculate a p-value for them. In other words, the association test is not well suited to things that are totally fixed in alternate groups.
We will follow up on this below by looking at the results printed out for the RoSA region in stored_results/203/assoc_raw
.
Here, we don’t remove the LRT values that are > 1.1x what they are with no PCA covariates.
asso_mh_plots(D = lrt_pvalues %>% filter(num_pc_covariates == 3))
## Warning: Removed 482012 rows containing missing values (geom_point).
While we are at it, we should see which 10 of those SNPs are most highly associated.
lrt_pvalues %>%
filter(num_pc_covariates == 3) %>%
arrange(desc(abs_diff)) %>%
slice(1:10) %>%
rename(Log10_Pvalue = abs_diff)
Only two in the RoSA.
Here, we don’t remove the LRT values that are > 1.1x what they are with no PCA covariates.
asso_mh_plots(D = lrt_pvalues %>% filter(num_pc_covariates == 10))
## Warning: Removed 910840 rows containing missing values (geom_point).
We see far fewer points and a lot of spurious ones.
We can remove the ones that are > 1.1x the LRT value with no PCA covariates.
asso_mh_plots(D = lrt_p_values_no_fliers %>% filter(num_pc_covariates == 10))
So, not much there, but that is primarily a consequence of the method filtering out sites.
I think that is enough to see.
We can compare allele frequency absolute differences to the p-values, and also see whether the SNPs were filtered out by ANGSD. We can read in the values that we produced in 202.
abs_diffs <- read_rds("stored_results/202/abs_diffs_from_bams_gt0.25.rds")
raw_lrts_files <- dir(path = "stored_results/203/assoc_raw/", pattern = "*lrt*", full.names = TRUE)
names(raw_lrts_files) <- basename(raw_lrts_files)
all_raw <- lapply(raw_lrts_files, function(x) {
read_tsv(x)
}) %>%
bind_rows(.id = "file") %>%
separate(file, into = c("seg_idx", "num_pcs_as_covars", "a", "b"), sep = "\\.", convert = TRUE) %>%
select(-a, -b) %>%
left_join(abs_diffs, by = c("Chromosome" = "chromo", "Position" = "position"))
That gives us all those data, together. Note that abs_diff is NA for many of them, because we filtered the absolute allele freq diffs to those > 0.25. Let’s have a look at 100 SNPs with the highest absolute frequency differences:
all_raw %>%
arrange(desc(abs_diff)) %>%
slice(1:100)
There, we see that most of these are in the RoSA. Additionally, we see that none of them had an LRT calculated for them because ANGSD does not have great confidence in their genotypes, because with doPost = 1
, as recommended in the ANGSD documentation, the allele frequency based prior for genotype probablities gives twice the weight to heterozygotes (since the pooled allele freq is always near 0.5 for these markers that are fixed between the E and L lineage haplotypes).
So, let’s filter the above to keep only the ones that ANGSD actually estimated an LRT:
all_raw %>%
arrange(desc(abs_diff)) %>%
filter(LRT != -999) %>%
slice(1:500)
So, the ones with a high LRT just happen to be those that have enough heterozygotes for the SNP to not be filtered out.
The irony here is that if someone just naively tried to do this type of association test in ANGSD with 10 PCs for covariates, and they didn’t look closely at their results, they could easily have missed the association entirely.
Since there are obviously a lot of pecularities and filtering issues when doing this type of association test with ANGSD, the results become much harder to interpret. It is much easier to intepret the absolute differences in allele frequency. Furthermore, those absolute diffs are on the appropriate scale to see the decay in LD on the flanks of the RoSA etc. For all these reasons, we will not replace Figure 2 with a Manhattan plot of association p-values.
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2020-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.0)
## dbplyr 1.4.3 2020-04-19 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.0)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
## glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
## hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
## lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-142 2019-11-07 [2] CRAN (R 3.6.2)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.0)
## rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.2)
## xfun 0.13 2020-04-13 [1] CRAN (R 3.6.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
##
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
## Time difference of 4.482326 mins