library(tidyverse)
library(genoscapeRtools)
library(snps2assays)
dir.create("outputs/002", recursive = TRUE, showWarnings = FALSE)
In order to select a set of assays with high power for resolving different groups of Willow Flycatchers, we divided individuals into subspecies groups (Subspecies_Pure) and also groups within subspecies (Within_Subspecies) that we would like to be able to resolve. For a number of different pairwise comparisons between these groups, we used the allele frequencies estimated from the RAD-seq data to compute the expected probability that an individual from each group would be correctly assigned to its own group. These calculations are identical to those that appear on p. 118 of Clemento et al. (2014); however, we summarize them here as well.
The central ingredient is what we call the expected correct assignment probability (abbreviated cap in the tables of output below). Let the frequency of the alternate allele in population 1 be \(p\), while the reference allele is at frequency \(1 - p\). Under the assumption of Hardy-Weinberg equilibrium, the expected frequencies of the three genotypes are: \[
\begin{aligned}
P_{00} &= (1 - p)^2 \\
P_{01} &= 2p(1-p) \\
P_{11} &= p^2, \\
\end{aligned}
\] where 00 denotes the genotype with two copies of the reference allele, 11 denotes a homozygote with two copies of the alternate allele, and 01 denotes a heterozygote genotype.
In population 2, let the alternate allele frequency be \(q\), with the corresponding expected genotype frequencies being: \[ \begin{aligned} Q_{00} &= (1 - q)^2 \\ Q_{01} &= 2q(1-q) \\ Q_{11} &= q^2. \\ \end{aligned} \] With these genotype frequencies, the probability that a randomly drawn individual from population 1 would be correctly assigned to population 1, as opposed to population 2, when using just this locus for population assignment is simply: \[ C_1 = P_{00}\delta(P_{00} \geq Q_{00}) + P_{01}\delta(P_{01} \geq Q_{01}) + P_{11}\delta(P_{11} \geq Q_{11}), \] where \(\delta(x)\) equals 1 when statement \(x\) is true and 0 otherwise. Likewise, the probability that a randomly drawn individual from population 2 would be correctly assigned to population 2, as opposed to population 1, is: \[ C_2 = Q_{00}\delta(Q_{00} > P_{00}) + Q_{01}\delta(Q_{01} > P_{01}) + Q_{11}\delta(Q_{11} > P_{11}). \] We define the expected correct assignment probability as the mean of those two quantities: \(\mathrm{CAP} = \frac{C_1 + C_2}{2}\). In the following, \(\mathrm{CAP}\) is calculated for a number of relevant comparisons between populations/groups; the \(\mathrm{CAP}\) values are used to rank the loci within each of these comparisons; and then, by cycling over the comparisons and taking the remaining top-ranked loci from each, a final list of markers is obtained as explained below.
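To make the calculation concrete, here is a minimal R sketch of \(\mathrm{CAP}\) for a single locus. This is just an illustration of the formulas above; the actual values in this notebook come out of comp_pair(), below.
# compute CAP for one locus from alternate-allele freqs p (pop 1) and q (pop 2)
cap_one_locus <- function(p, q) {
  P <- c((1 - p)^2, 2 * p * (1 - p), p^2)  # HWE genotype freqs in pop 1
  Q <- c((1 - q)^2, 2 * q * (1 - q), q^2)  # HWE genotype freqs in pop 2
  C1 <- sum(P[P >= Q])  # prob. a pop-1 bird is assigned to pop 1
  C2 <- sum(Q[Q > P])   # prob. a pop-2 bird is assigned to pop 2
  (C1 + C2) / 2
}
cap_one_locus(0.1, 0.9)  # a strongly divergent locus: CAP = 0.9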
For each relevant pairwise population/group comparison, we chose the top 300 markers, and then cycled over those comparisons, taking a single top-ranked SNP from each and moving down the lists until we had a sufficient number of candidates. We then further filtered these candidates according to whether Fluidigm assays could be designed for them.
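That cycling scheme can be sketched as follows. This is a minimal illustration rather than the pipeline code; ranked_lists here is a hypothetical named list holding one character vector of SNP names per comparison, each sorted by decreasing cap.
# take one remaining top-ranked SNP from each comparison in turn, skipping
# SNPs already chosen, until we have n_wanted candidates
round_robin_pick <- function(ranked_lists, n_wanted) {
  chosen <- character(0)
  for (i in seq_len(max(lengths(ranked_lists)))) {
    for (lst in ranked_lists) {
      if (length(chosen) >= n_wanted) return(chosen)
      if (i <= length(lst) && !(lst[i] %in% chosen)) {
        chosen <- c(chosen, lst[i])
      }
    }
  }
  chosen
}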
To the final set of candidates we also added the climate-associated SNPs from Ruegg et al. (2018).
These are the SNPs that we are choosing because of allele frequency differences between certain groups.
We develop SNPs from amongst those that passed our "clean" filters (even though, later, we will include more variation in order to ensure there is little variation in the flanking regions).
wifl_clean <- read_rds("data/rad_wifl_clean_175_105000.rds")
We also want the groups that we are putting the birds into. Here is the file of designations that Kristen has supplied:
meta_full <- read_csv("data/appendices/app1-SITable1_Breed_IndvAssigwgenos.csv")
grps <- meta_full %>%
filter(geno_method == "RAD") %>%
select(Field_Number, Subspecies_Pure, Within_Subspecies) %>%
rename(Individual = Field_Number)
Let's just count up how many birds we have in each group and subgroup:
grps %>%
group_by(Subspecies_Pure, Within_Subspecies) %>%
tally()
This makes it clear that these categories are not strictly nested. But we have a healthy number of birds to choose from for most categories, so that is good.
We need to get a list of the IDs of the individuals in each of those groups:
major_grps <- grps %>%
filter(!is.na(Subspecies_Pure)) %>%
split(.$Subspecies_Pure) %>%
lapply(., function(x) x$Individual)
minor_grps <- grps %>%
filter(!is.na(Within_Subspecies)) %>%
split(.$Within_Subspecies) %>%
lapply(., function(x) x$Individual)
# see what that looks like...
minor_grps[1:2]
## $ada_North
## [1] "1590-97493" "1590-97494" "1590-97497" "1590-97498" "1710-20366"
## [6] "1710-46312" "1740-91796" "1740-91797" "1740-91798" "1740-91800"
##
## $ada_south
## [1] "1740-51823" "1740-51824" "1740-51825" "1740-51826" "1740-51827"
## [6] "1740-51828" "1590-97390" "1590-97391" "1590-97392" "1590-97393"
## [11] "1740-51646" "1740-51684" "1740-51686" "1740-51687" "1740-91921"
## [16] "1740-91922" "1740-91923"
We can count up the 012 genotypes within each of those groups using a function from genoscapeRtools:
major_grp_freqs <- group_012_cnts(wifl_clean, major_grps)
minor_grp_freqs <- group_012_cnts(wifl_clean, minor_grps)
Here is some code to pick out the comparisons that we want to look at; we then build one big data frame of all those comparisons.
comps <- combn(unique(major_grp_freqs$group), 2)
# keep a pair of groups only if fewer than two of its members match "ext"
keep <- (str_detect(comps, "ext") %>%
  matrix(nrow = 2) %>%
  colSums()) < 2
comps <- comps[, keep]
major_comps <- lapply(1:ncol(comps), function(c) {
comp_pair(major_grp_freqs, comps[1, c], comps[2, c])
}) %>%
bind_rows()
# print out which comparisons those are:
major_comps %>% group_by(comparison) %>% tally()
We will only look at comparisons within subspecies in these cases:
comps2 <- combn(unique(minor_grp_freqs$group), 2)
# now restrict that to just the comparisons within subspp
pref <- str_replace_all(comps2, "_.*", "") %>%
matrix(nrow = 2)
keep2 <- pref[1,] == pref[2,]
comps2 <- comps2[, keep2]
minor_comps <- lapply(1:ncol(comps2), function(c) {
comp_pair(minor_grp_freqs, comps2[1, c], comps2[2, c])
}) %>%
bind_rows()
# print out which comparisons that is
minor_comps %>% group_by(comparison) %>% tally()
So, that is 21 comparisons, total.
This is relatively easy; it is just filtering. We rank the SNPs within each comparison and then take the top 300. There might be slightly fewer than 300 SNPs in a comparison if there are ties, but that is OK.
We bind them all into a single tibble at the end.
top300 <- lapply(list(major = major_comps, minor = minor_comps), function(x) {
x %>%
group_by(comparison) %>%
mutate(rank_full = rank(-cap)) %>%
filter(rank_full <= 300)
}) %>%
bind_rows(.id = "type_of_comparison")
# we print the first 10 lines out here so they can be seen
top300 %>%
ungroup() %>%
slice(1:10)
There are 6353 rows in that data frame, but the total number of unique SNPs is smaller: 4154.
For this, we are going to take top300 and filter it by whether or not each SNP is assayable. We define a SNP as assayable if there is no SNP with MAF > 0.02 within 20 bp of it (with that surrounding variation taken from our less-filtered data set, in which a SNP is called if it appears in at least 50% of the individuals).
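The logic of that rule can be sketched in a few lines. This is an illustration only, not the snps2assays internals; snp_tbl is a hypothetical tibble with columns CHROM, POS, and maf.
# a candidate fails if any *other* variant with MAF > minMAF lies within
# `flank` bp of it on the same scaffold
flag_assayable <- function(snp_tbl, flank = 20, minMAF = 0.02) {
  worry <- dplyr::filter(snp_tbl, maf > minMAF)  # variation to design around
  dplyr::mutate(snp_tbl, assayable = purrr::map2_lgl(CHROM, POS, function(ch, po) {
    !any(worry$CHROM == ch & abs(worry$POS - po) <= flank & worry$POS != po)
  }))
}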
First, get that "less filtered" data. This is from the 012 file with 350K SNPs:
wifl <- read_rds(file = "data/rad_wifl_less_filtered_219_349014.rds")
Then we assess the assayability of all of these. The first thing we do is filter on MAF > 0.02 to get all the SNPs (all the variation, really, since we are going to ignore the chance of indels) that we might need to worry about falling in the flanking regions. While we are at it, we write all those positions to a file so we can pull them out of the VCF file later to learn what the REF and ALT alleles are.
# keep track of which of the SNPs we are going to "worry about"
# as far as whether they might be in the flanking region
snps_to_care_about <- compile_assayable(wifl, minMAF = 0.02, flank = 20)
# while we are at it, let's write out a tab-delimited file of these positions
# so we can easily pull them out of the VCF file later
snps_to_care_about %>%
select(snp) %>%
tidyr::separate(snp, into = c("CHROM", "POS"), sep = "--") %>%
write.table(., quote = FALSE, row.names = FALSE, col.names = FALSE, file = "outputs/002/snps-to-care-about.txt", sep = "\t")
Then we keep only the ones that are assayable:
cassay <- snps_to_care_about %>%
filter(assayable == TRUE) # we will keep only the assayable ones (possibly on edges)
# have a look at the first 10 of those
cassay %>%
slice(1:10)
So, now we can filter top300 down to the SNPs that are assayable, and then join it with the assayability information:
top_assayable <- top300 %>%
rename(snp = pos) %>%
filter(snp %in% cassay$snp) %>%
left_join(., cassay)
That gives us 1735 unique SNPs to play around with. If we wanted to filter it even harder, we could throw out ones that were on the edges (i.e. no SNP within 1000 bp of it).
top_off_edge <- top_assayable %>%
filter(on_left_edge == FALSE, on_right_edge == FALSE)
which gives us 1315 unique SNPs.
Let's look at the 40 best SNPs from each of the comparisons, and record how many of those top-40 sets each SNP appears in.
tmp <- top_off_edge %>%
select(type_of_comparison, snp, leftpad, rightpad, comparison, freq.x, freq.y, ntot.x, ntot.y, cap, rank_full) %>%
group_by(type_of_comparison, comparison) %>%
top_n(40, -rank_full) %>%
mutate(idx = 1:n()) %>% # this and the next two lines are just so that we get exactly
filter(idx <= 40) %>% # 40---the ties usually occur when the sample size is small and so
select(-idx) %>% # it is OK to pitch them.
ungroup()
num_occur <- tmp %>%
group_by(snp) %>%
summarise(num_comps = n())
top40 <- left_join(tmp, num_occur)
# have a glimpse at that:
top40 %>%
slice(1:10)
The LFMM scores for all SNPs were computed in Ruegg et al. (2018). We load those values here and join them to our top40 divergence SNPs.
lfmm <- read_rds("data/wifl-lfmm-results.rds")
# and now filter it to the smallest p-value for each snp:
lfmm_mins <- lfmm %>%
group_by(snp) %>%
filter(fdr_p == min(fdr_p))
# and now we join that on there
top40_lfmm <- top40 %>%
left_join(., lfmm_mins)
# print the whole thing as a CSV for inspection
write_csv(top40_lfmm, file = "outputs/002/top40_lfmm.csv")
# print the top 10 rows to this notebook
top40_lfmm %>%
slice(1:10)
It is possible to flip through that and pretty quickly see what is going on. There are 594 unique SNPs there, and it should be pretty easy to whittle that down to 192, or however many we think we might want to develop into assays.
Just for fun, let's look at the distribution of all the min LFMM p-values in these SNPs compared to all the rest:
# first get a data frame of just one occurrence per SNP
one_each <- top40_lfmm %>%
group_by(snp) %>%
mutate(tmp = 1:n()) %>%
filter(tmp == 1)
# then plot
ggplot() +
geom_density(data = lfmm_mins, mapping = aes(x = fdr_p), alpha = 0.5, fill = "blue") +
geom_density(data = one_each, mapping = aes(x = fdr_p), alpha = 0.5, fill = "orange")
## Warning: Removed 58 rows containing non-finite values (stat_density).
Orange are the selected SNPs and blue are all SNPs. We see that there is a difference in the distributions but that might just be because we have chosen SNPs that do not have very low minor allele frequencies.
When designing this panel of Fluidigm assays, we also wanted to include 25 SNPs that were associated with climate variables. This work was done as part of Ruegg et al. (2018). Basically, we wanted to include the 5 most significant assayable SNPs for each of the 5 climatic variables with the highest loadings in Ruegg et al. (2018).
Since the most significant assayable loci are often shared between these climate variables, we end up taking the 8 most significant assayable SNPs for each climate variable, which gives us 24 unique SNPs that we will call the "climate-associated" SNPs.
climate_vars <- c("BIO_11", "BIO_10", "BIO_5", "BIO_9", "BIO_17")
climate_snps <- lfmm %>%
filter(variable %in% climate_vars) %>%
left_join(cassay %>% filter(on_left_edge == FALSE, on_right_edge == FALSE), .) %>% # limit it to just the assayable ones THAT ARE NOT ON EDGES
group_by(variable) %>%
top_n(n = 8, wt = -fdr_p) %>%
group_by(snp) %>%
arrange(fdr_p) %>%
mutate(comparison = paste(variable, sprintf("%.2e ", fdr_p), collapse = ", "),
type_of_comparison = "climate") %>%
select(-variable, -fdr_p) %>%
filter(!duplicated(snp)) %>%
ungroup() %>%
select(-(num_gene_copies:rightpos), -(assayable:on_left_edge))
climate_snps
For our purposes, we deemed it more important to be able to resolve individuals from some minor or major groups than from others. Furthermore, some comparisons between groups had such low sample sizes that we chose not to try to harvest many SNP candidates from them, and so forth. We codified this by choosing the number of SNP candidates to include from each comparison in our final list. The data frame with those numbers is here:
snp_pair_wts <- read_tsv("data/wifl_snp_comparison_counts.txt")
snp_pair_wts
We will pick out the number of SNPs from each comparison, as given above, then add the climate SNPs on top of those, and filter out any "divergence SNPs" that are duplicates of the climate SNPs (or of other divergence SNPs) or that are physically too close to another SNP.
Because we will end up removing duplicates in this process, we must start with more than we want at the end. So we allow ourselves to increase the number of SNPs taken from each comparison using the expand_it variable, below:
expand_it <- 1.8 # this is the factor by which to increase the number of SNPs from each comparison
div_candidates <- top40_lfmm %>%
left_join(snp_pair_wts) %>%
group_by(comparison) %>%
arrange(desc(cap)) %>%
mutate(idx = 1:n()) %>%
filter(idx <= (number_markers) * expand_it) %>% # keep only a certain number from each comparison
ungroup()
# now, add the climate SNPs in there, filter out the duplicates, and then mark those
# that are too close to their neighbors and count up how many we are left with
candidate_snps <- bind_rows(climate_snps, div_candidates) %>%
filter(!duplicated(snp)) %>% # filter out the duplicates from the divergence SNPs
tidyr::separate(snp, into = c("CHROM", "POS"), sep = "--", convert = TRUE, remove = FALSE) %>% # get CHROM and POS back
arrange(CHROM, POS) %>%
group_by(CHROM) %>%
mutate(dist = POS - lag(POS, 1),
could_be_too_close = ifelse(!is.na(dist) & dist <= 10^4, TRUE, FALSE), # flag them if closer than 10 KB
toss_it = FALSE) # we add the toss_it column since we will modify that by hand to do final thinning
# now return the approximate number after thinning
sum(!candidate_snps$could_be_too_close)
## [1] 210
So, that has given us, potentially, 210 candidates. That is more than 192, but some of them might not be designable because of the GC content. But, now we need to decide which ones to toss when they are too close together on the genome. There are only a few of these and it will be good to get eyes on the data anyway. So, at this point we write out the file.
write_csv(candidate_snps, file = "outputs/002/candidate_snps_untossed.csv")
Subsequently, we got our eyes on that output and hand-selected which SNPs to toss by putting a TRUE in the toss_it column, saving the result in exactly the same format in stored_results/002/candidate_snps_tossed.csv in our repository.
Our list of 210 final candidates is here:
final_candidates <- read_csv("stored_results/002/candidate_snps_tossed.csv") %>%
filter(toss_it == FALSE)
Now, from those candidates, we make a tab-delimited file with the name of each SNP we are focusing on, along with a string that can be used (with tabix) to pick out any variation within 250 bp of that SNP.
We will make the LeftStartingPoint 250 bp to the left of POS and the RightEndPoint 250 bp to the right.
regions <- final_candidates %>%
mutate(left = POS - 250,
right = POS + 250) %>%
mutate(extractor = paste(CHROM, ":", left, "-", right, sep = "")) %>%
select(snp, extractor)
# show the reader what this looks like
head(regions)
write.table(regions, row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t", file = "outputs/002/regions-for-candidates.txt")
For the final stage, we need to go back to the original VCF so that we know what the REF and ALT alleles are and can put together the assay order. We continue to ignore the possibility of indels, which means we just have to pull the SNPs that we care about out of the VCF file. We don't need the individual genotype data for these, so we simply request info on one individual, NULL, that is not even in the file.
This step was done on the cluster; here is a transcript of the session. The full VCF file, with the variation for all individuals, is larger than we want to put in the repository, but the important part of the file, which is simply the positions, is included in stored_results.
[kruegg@n2238 WIFL_10.15.16]$ pwd
/u/home/k/kruegg/nobackup-klohmuel/WIFL/WIFL_10.15.16
[kruegg@n2238 WIFL_10.15.16]$ module load vcftools
[kruegg@n2238 WIFL_10.15.16]$ module load bcftools
# filter it all down....
[kruegg@n2238 WIFL_10.15.16]$ vcftools --vcf WIFL_10.15.16.recode.vcf --out design-variants --recode --positions snps-to-care-about.txt --indv NULL
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf WIFL_10.15.16.recode.vcf
--out design-variants
--positions snps-to-care-about.txt
--recode
--indv NULL
Keeping individuals in 'keep' list
After filtering, kept 0 out of 219 Individuals
Outputting VCF file...
[kruegg@n2238 WIFL_10.15.16]$ bgzip design-variants.recode.vcf
# then that was tabix indexed.
The bgzipped, tabix-indexed design-variants.recode.vcf.gz is placed in stored_results/002 in the repository.
Now, we can retrieve variants overlapping our candidate regions pretty easily, but we need to keep each such variant associated with the actual SNP, so we did that with a little shell programming here:
# for each line (snp_name <TAB> region) of the regions file, use tabix to pull
# the variants in that region from the VCF and prepend the snp name to each
while read line; do
array=($line);
tabix stored_results/002/design-variants.recode.vcf.gz "${array[1]}" | \
awk -v snp="${array[0]}" '
{printf("%s\t%s\n", snp, $0);}
';
done < outputs/002/regions-for-candidates.txt > outputs/002/relevant_variation.txt
The result of that looks like this:
head outputs/002/relevant_variation.txt
## scaffold1006|size252274--136397 scaffold1006|size252274 136158 . C T 35032.8 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136276 . T C 6472.29 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136285 . C T 23582 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136397 . C A 49764 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136418 . A G 158703 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136426 . T C 96229.6 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136580 . G A 20303.2 . . GT:AD:DP:GQ:PL
## scaffold1006|size252274--136397 scaffold1006|size252274 136604 . T G 3768.48 . . GT:AD:DP:GQ:PL
## scaffold101|size1699134--364887 scaffold101|size1699134 364752 . T C 64148.6 . . GT:AD:DP:GQ:PL
## scaffold101|size1699134--364887 scaffold101|size1699134 364770 . G A 5528.3 . . GT:AD:DP:GQ:PL
And now we can read that into something that will be appropriate for snps2assays:
variation <- read_tsv("outputs/002/relevant_variation.txt", col_names = FALSE) %>%
select(-X2, -X4, -(X7:X10)) %>%
setNames(c("CHROM", "truePOS", "REF", "ALT")) %>%
mutate(snppos = as.integer(str_replace(CHROM, "^.*--", ""))) %>%
mutate(POS = (truePOS - snppos) + 251) %>%
select(CHROM, POS, REF, ALT, truePOS)
# here is what that looks like
head(variation)
Note that we actually use the name of the SNP as if it were the "CHROM", because we will pull out sequence around just that spot. And POS is the position of each variant within that fragment (the target SNP itself lands at position 251, because we put 250 bp of flanking region on each side).
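As a quick sanity check of that arithmetic, using values from the output above: the target SNP at scaffold position 136397 lands at fragment position 251, and its neighbors land where we would expect.
snppos  <- 136397                     # scaffold position of the target SNP
truePOS <- c(136158, 136397, 136604)  # a few variants from the region above
(truePOS - snppos) + 251
## [1]  12 251 458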
We retrieve the consensus sequence from 250 bp on either side of each SNP we want. We use an ugly bash script, much like the tabix one above, to get the consensus sequence fragments and store them in a file along with the SNP name.
# same pattern: use samtools faidx to pull each region's consensus sequence
# from the genome and flatten it onto one line next to the snp name
while read line; do
array=($line);
samtools faidx genome/wifl-genome.fna "${array[1]}" | \
awk -v snp="${array[0]}" '
/^>/ {printf("%s\t", snp); next}
{printf("%s", $1)}
END {printf("\n")}
'
done < outputs/002/regions-for-candidates.txt > outputs/002/consensus_sequences_of_snps.txt
Then we can read those into a data frame:
consSeq <- read_tsv("outputs/002/consensus_sequences_of_snps.txt", col_names = FALSE) %>%
setNames(c("CHROM", "Seq"))
# see what it looks like
head(consSeq)
These are just the SNPs that we want, with CHROM and POS (where, again, CHROM is the name of the SNP):
targets <- final_candidates %>%
select(snp, POS) %>%
rename(CHROM = snp) %>%
mutate(
truePOS = POS,
snppos = as.integer(str_replace(CHROM, "^.*--", ""))
) %>%
mutate(POS = (truePOS - snppos) + 251) %>%
select(CHROM, POS)
# see what that looks like
head(targets)
The moment we have all been waiting for---we want to see if this is going to work or not...
yay_assays <- assayize(V = variation, targets = targets, consSeq = consSeq, allVar = FALSE)
Then we joined onto that a column or two telling us why any particular SNP was chosen (i.e., major or minor comparison, and its rank within those):
yay_assays_with_ranks <- candidate_snps %>%
ungroup() %>%
select(-(CHROM:rightpad)) %>%
rename(CHROM = snp) %>%
left_join(yay_assays, .)
Ultimately, we ended up naming the assays according to which comparisons they appeared to be useful for, and their rank within each of those comparisons. Those names, paired with the CHROM--POS specification of each SNP, are stored in the repository in data/wifl-snp-short-names.tsv. We join them to yay_assays_with_ranks before writing that object out as a CSV file.
snp_names <- read_tsv("data/wifl-snp-short-names.tsv") %>%
rename(CHROM = `snp_as_CHROM--POS`)
yay_assays_with_ranks %>%
left_join(snp_names, by = "CHROM") %>%
select(snp_short_name, everything()) %>%
write_csv(file = "outputs/002/wifl-assay-candidates-with-ranks.csv")
Note that when I finally finished this, I hoped everything was correct, but it felt hard to verify. It turns out you can spot-check things visually using IGV, which is a fun way to verify that you have designed assays for SNPs in the right places.
I used samtools view to pull the 500 bp around each SNP out of the big merged bamfile (filtering on MAPQ > 40). I brought that to my laptop, sorted and indexed it, and then viewed the alignments in IGV. I mostly wanted to verify that I had each SNP in the right spot. I spot-checked a good handful and they were all correct.
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed, listed in HH:MM:SS:
td <- Sys.time() - start_time
tdf <- hms::as_hms(ceiling(hms::as_hms(td)))
dir.create("stored_run_times", showWarnings = FALSE, recursive = TRUE)
write_rds(tdf, "stored_run_times/002.rds")
tdf
## 00:00:45
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os macOS Mojave 10.14.6
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2021-03-04
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib
## assertthat 0.2.1 2019-03-21 [1]
## backports 1.2.1 2020-12-09 [1]
## broom 0.7.3 2020-12-16 [1]
## cellranger 1.1.0 2016-07-27 [1]
## cli 2.2.0 2020-11-20 [1]
## colorspace 2.0-0 2020-11-11 [1]
## crayon 1.3.4 2017-09-16 [1]
## DBI 1.1.0 2019-12-15 [1]
## dbplyr 2.0.0 2020-11-03 [1]
## digest 0.6.27 2020-10-24 [1]
## dplyr * 1.0.2 2020-08-18 [1]
## ellipsis 0.3.1 2020-05-15 [1]
## evaluate 0.14 2019-05-28 [1]
## fansi 0.4.1 2020-01-08 [1]
## farver 2.0.3 2020-01-16 [1]
## forcats * 0.5.0 2020-03-01 [1]
## fs 1.5.0 2020-07-31 [1]
## generics 0.1.0 2020-10-31 [1]
## genoscapeRtools * 0.1.0 2021-01-17 [1]
## ggplot2 * 3.3.2 2020-06-19 [1]
## glue 1.4.2 2020-08-27 [1]
## gtable 0.3.0 2019-03-25 [1]
## haven 2.3.1 2020-06-01 [1]
## hms 0.5.3 2020-01-08 [1]
## htmltools 0.5.0 2020-06-16 [1]
## httr 1.4.2 2020-07-20 [1]
## jsonlite 1.7.2 2020-12-09 [1]
## knitr 1.30 2020-09-22 [1]
## labeling 0.4.2 2020-10-20 [1]
## lifecycle 0.2.0 2020-03-06 [1]
## lubridate 1.7.9.2 2020-11-13 [1]
## magrittr 2.0.1 2020-11-17 [1]
## modelr 0.1.8 2020-05-19 [1]
## munsell 0.5.0 2018-06-12 [1]
## pillar 1.4.7 2020-11-20 [1]
## pkgconfig 2.0.3 2019-09-22 [1]
## ps 1.5.0 2020-12-05 [1]
## purrr * 0.3.4 2020-04-17 [1]
## R6 2.5.0 2020-10-28 [1]
## Rcpp 1.0.5 2020-07-06 [1]
## readr * 1.4.0 2020-10-05 [1]
## readxl 1.3.1 2019-03-13 [1]
## reprex 0.3.0 2019-05-16 [1]
## rlang 0.4.9 2020-11-26 [1]
## rmarkdown 2.6 2020-12-14 [1]
## rstudioapi 0.13 2020-11-12 [1]
## rvest 0.3.6 2020-07-25 [1]
## scales 1.1.1 2020-05-11 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## snps2assays * 0.1 2021-01-17 [1]
## stringi 1.5.3 2020-09-09 [1]
## stringr * 1.4.0 2019-02-10 [1]
## tibble * 3.0.4 2020-10-12 [1]
## tidyr * 1.1.2 2020-08-27 [1]
## tidyselect 1.1.0 2020-05-11 [1]
## tidyverse * 1.3.0 2019-11-21 [1]
## vctrs 0.3.6 2020-12-17 [1]
## withr 2.3.0 2020-09-22 [1]
## xfun 0.19 2020-10-30 [1]
## xml2 1.3.2 2020-04-23 [1]
## yaml 2.2.1 2020-02-01 [1]
## source
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.1)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (eriqande/genoscapeRtools@3e1dbfc)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (eriqande/snps2assays@8666f06)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
##
## [1] /Users/eriq/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
Clemento, Anthony J, Eric D Crandall, John Carlos Garza, and Eric C Anderson. 2014. “Evaluation of a Single Nucleotide Polymorphism Baseline for Genetic Stock Identification of Chinook Salmon (Oncorhynchus Tshawytscha) in the California Current Large Marine Ecosystem.” Fishery Bulletin 112 (2-3).
Ruegg, Kristen, Rachael A Bay, Eric C Anderson, James F Saracco, Ryan J Harrigan, Mary Whitfield, Eben H Paxton, and Thomas B Smith. 2018. "Ecological Genomics Predicts Climate Vulnerability in an Endangered Southwestern Songbird." Ecology Letters 21 (7): 1085–96.