In the notebook 003-process-5-plates-of-birds-at-192-fluidigm-snps.Rmd
, we processed the chips into two column format, tossed out some hard-to-score loci, ending up with 179 SNPs. In 004-combine-RAD-and-fludigm-breeders-and-run-STRUCTURE.Rmd
we retrieved the genotypes of the 175 RAD-genotyped individuals and combined them with the 393 Fluidigm-typed individuals and ran all of these 568 breeding birds through structure with the 179 SNPs.
From those results, it looked like there was very little variation in the structure output across multiple runs for the model with K=5, and the clusters found there reflected geography, consistently.
Now it is time to whittle those 179 loci down to a panel of 96 that we will be using to genotype the wintering birds. The way we will do this is by using the SNP allele frequencies estimated in the K=5 run (we use the first K=5 run, but they all came out the same, so it should make no difference) to guide us. I have committed with the repository the output file from that run that includes the allele frequencies at the different loci. That file is at ./stored_results/004/StructOuput_genos_slg_pipe.txt_dat001_k005_Rep001.txt_f
.
We rank each locus in terms of its utility for accurately assigning individuals from each of the 5 different clusters using the expected value of the posterior probability that an individual came from its own cluster. Let p1,…,p5 denote the structure-estimated frequencies of the "0" allele at a SNP in the 5 clusters, respectively. Then, under HW-equilibrium it is easy to compute the expected frequencies of the three different genotypes P00,i=p2iP01,i=2pi(1−pi)P11,i=(1−pi)2 for i∈{1,…,5}.
If we assumed that an individual with genotype 00 was a-priori equally likely to be from any of the 5 clusters, then we could define the posterior probability that an 00 genotype (at a single locus) came from cluster j as Q00,j=P00,j/∑5i=1P00,i. The same goes for the heterozygote genotype (01) and the other homozygote (11). So we have: Q00,j=P00,j∑5i=1P00,iQ01,j=P01,j∑5i=1P01,iQ11,j=P11,j∑5i=1P11,i for j∈{1,…,5}.
Thus, for an individual that is actually from cluster j the expected posterior probability that it is from cluster j based on data from this one locus is: Ej=P00,jQ00,j+P01,jQ01,j+P11,jQ11,j. In the sequel, we use E(ℓ)j to refer to the Ej value for locus ℓ.
So, our approach for selecting loci here is going to be:
In order to do this easily, we write a function that takes a vector of allele frequencies from biallelic loci in K different clusters/populations, and returns a vector of Ej values.
This function is called Ej_from_freqs
in ./R/wifl-popgen-funcs.R
. So, we need to source that in here:
library(tidyverse)
source("R/Ej_from_freqs.R")
dir.create("outputs/005", showWarnings = FALSE, recursive = TRUE)
We need to parse the structure output file. This can be done with R, but it is way easier with awk
.
First We put them into an easily readable form in a temp file called xxx_tmp_freqs.txt
.
rm -f xxx_tmp_freqs.txt
awk '
/^Estimated Allele Frequencies in each cluster/ {go = 1}
/^Locus/ && go==1 {locus = $4; n = 0; next}
go==1 {++n}
n==3 && go==1 {printf("%s", locus); for(i=3;i<=NF;i++) printf("\t%s", $i); printf("\n")}
' stored_results/004/StructOuput_genos_slg_pipe.txt_dat001_k005_Rep001.txt_f > outputs/005/tmp_struct_freqs.txt
Then we read those into a data frame and put some names on the columns
sfreqs <- read_delim("outputs/005/tmp_struct_freqs.txt", delim = "\t", col_names = FALSE)
names(sfreqs) <- c("struct_name", paste("clust_", 1:5, sep = ""))
There is one silly wrinkle here: structure truncates the locus names. Jeez!! Fortunately, we have them in the slg_pipe output. The respository has a copy of them in a text file at inputs/004/slg-pipe-locus-order.rds
. We read that in and then fill out the locus names.
Anyhow, it is just a quick slap of the locus names on there.
fullloc <- read_rds(file = "inputs/004/slg-pipe-locus-order.rds")
sfreqs2 <- sfreqs %>%
mutate(locus = fullloc) %>%
select(locus, clust_1:clust_5)
# here is what that looks like:
sfreqs2
locus <chr> | clust_1 <dbl> | clust_2 <dbl> | clust_3 <dbl> | clust_4 <dbl> | clust_5 <dbl> |
---|---|---|---|---|---|
ada_brew_1 | 0.605 | 0.943 | 0.811 | 0.762 | 0.361 |
ada_brew_17 | 0.820 | 0.735 | 0.856 | 0.784 | 0.805 |
ada_ext_11 | 0.467 | 0.374 | 0.702 | 0.359 | 0.217 |
ada_ext_4 | 0.968 | 0.980 | 0.573 | 0.963 | 0.996 |
ada_Kern_3 | 0.961 | 0.834 | 0.931 | 0.933 | 0.997 |
ada_tra_6 | 0.962 | 0.705 | 0.965 | 0.776 | 0.452 |
brew_ext_10 | 0.831 | 0.633 | 0.995 | 0.767 | 0.941 |
brew_ext_3 | 0.834 | 0.908 | 0.602 | 0.838 | 0.707 |
brew_Kern_3 | 0.261 | 0.509 | 0.511 | 0.440 | 0.359 |
brew_NS_7 | 0.705 | 0.714 | 0.400 | 0.656 | 0.568 |
This is a matter of sorting by rank and cluster and then taking unduplicated ones until we have 96
stidy <- sfreqs2 %>%
tidyr::gather(data = ., key = "cluster", value = "freq", clust_1:clust_5)
picks <- stidy %>%
group_by(locus) %>%
mutate(Ej = Ej_from_freqs(freq)) %>%
group_by(cluster) %>%
mutate(rank = rank(desc(Ej))) %>%
ungroup() %>%
arrange(rank) %>%
mutate(first_occurrence = !duplicated(locus),
cml = cumsum(first_occurrence),
include = (first_occurrence == TRUE) & (cml <= 96))
If we want to have a look at the freqs we can page through this:
left_join(picks, sfreqs2) %>%
select(-first_occurrence, -freq)
locus <chr> | cluster <chr> | Ej <dbl> | rank <dbl> | cml <int> | include <lgl> | clust_1 <dbl> | clust_2 <dbl> | clust_3 <dbl> | clust_4 <dbl> | |
---|---|---|---|---|---|---|---|---|---|---|
ada_WM_6 | clust_1 | 0.7491235 | 1 | 1 | TRUE | 0.280 | 0.943 | 0.993 | 0.925 | |
brew_tra_1 | clust_2 | 0.4774170 | 1 | 2 | TRUE | 0.748 | 0.255 | 0.889 | 0.585 | |
ada_ext_1 | clust_3 | 0.7420034 | 1 | 3 | TRUE | 0.991 | 0.941 | 0.319 | 0.946 | |
brew_tra_14 | clust_4 | 0.2735743 | 1 | 4 | TRUE | 0.978 | 0.523 | 0.897 | 0.621 | |
ext_tra_5 | clust_5 | 0.7856068 | 1 | 5 | TRUE | 0.905 | 0.925 | 0.905 | 0.713 | |
ext_pure_Kern_5 | clust_1 | 0.6740522 | 2 | 6 | TRUE | 0.053 | 0.706 | 0.906 | 0.722 | |
brew_tra_9 | clust_2 | 0.3717976 | 2 | 7 | TRUE | 0.190 | 0.646 | 0.304 | 0.414 | |
brew_ext_6 | clust_3 | 0.6146368 | 2 | 8 | TRUE | 0.998 | 0.991 | 0.583 | 0.970 | |
ada_ext_10 | clust_4 | 0.2611643 | 2 | 9 | TRUE | 0.654 | 0.419 | 0.685 | 0.345 | |
brew_tra_3 | clust_5 | 0.7390769 | 2 | 10 | TRUE | 0.985 | 0.975 | 0.947 | 0.870 |
Let us write out the top 96 loci:
top96 <- picks %>%
filter(include == TRUE)
# and write them to the intermediates directory
write_csv(top96, file = "outputs/005/top96_loci.csv")
Just going to make a quick histogram of the Ej values that got specifically chosen for the different clusters.
ggplot(picks, aes(x = Ej, fill = include)) +
geom_histogram(bins = 30) +
facet_wrap(~ cluster, ncol = 3)
## Warning: Removed 5 rows containing non-finite values (stat_bin).
But, of course, if we want to see all the included loci we have to do something a little different
tmp <- picks %>%
mutate(in_panel = locus %in% top96$locus)
ggplot(tmp, aes(x = Ej, fill = in_panel)) +
geom_histogram(bins = 30) +
facet_wrap(~ cluster, ncol = 3)
## Warning: Removed 5 rows containing non-finite values (stat_bin).
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed, listed in HH:MM:SS
.
td <- Sys.time() - start_time
tdf <- hms::as_hms(ceiling(hms::as_hms(td)))
dir.create("stored_run_times", showWarnings = FALSE, recursive = TRUE)
write_rds(tdf, "stored_run_times/005.rds")
tdf
## 00:00:05
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os macOS Mojave 10.14.6
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2021-03-04
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
## backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
## broom 0.7.3 2020-12-16 [1] CRAN (R 4.0.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.2)
## cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.2)
## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
## dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
## farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.2)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.2)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
## ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
## haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.2)
## hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.2)
## htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
## knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
## lubridate 1.7.9.2 2020-11-13 [1] CRAN (R 4.0.2)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.2)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
## ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.2)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
## readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.2)
## rlang 0.4.9 2020-11-26 [1] CRAN (R 4.0.2)
## rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.2)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
## scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
## tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.2)
## tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
## vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
## xfun 0.19 2020-10-30 [1] CRAN (R 4.0.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
##
## [1] /Users/eriq/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library