In this document, we are getting the data that Brendan sent to me into a nice PLINK format.
Along the way we are making a few assumptions, etc. Basically what we are doing is:
intermediates/01
.The final products from all this are these files in intermediates/01
:
binary-ns.bed binary-ns.fam binary-ns.nosex binary-west.bim binary-west.log
binary-ns.bim binary-ns.log binary-west.bed binary-west.fam binary-west.nosex
I will commit these and all future analyses will start with those file.After all that, I decided that it was easiest to keep everything in a tidy format, so this also creates compressed output files: tidy-ns.rds
and tidy-west.rds
in intermediates/01
.
Some libraries
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages --------------------------------------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(stringr)
First, grab the locus names out of the genotype files and put them in some temporary files:
mkdir data/processed
gzcat data/NSaggregated_RAW.txt.gz | awk '$1 != "Pop" && NR>1 && NF==1' > data/processed/ns_loci.txt
gzcat data/WestAgged_RAW.txt.gz | awk '$1 != "Pop" && NR>1 && NF==1' > data/processed/west_loci.txt
Then we can compare that to what we have in the two different files:
ns_snp <- read_table("data/processed/ns_loci.txt", col_names = FALSE)
cols(
X1 = col_character()
)
west_snp <- read_table("data/processed/west_loci.txt", col_names = FALSE)
cols(
X1 = col_character()
)
conv <- read_csv("data/Convert_220K_names_to_4.7K_Ilmn-Affy_CommonSNP.csv")
Parsed with column specification:
cols(
`Illumina V2 ID` = col_character(),
`Affy 220K ID` = col_character()
)
# linkage map has some commas instead of periods for decimal points
# so we need to fix that. (Also I had to edit the csv file (Mono in a numeric column))
lmap <- read_csv("data/LinkageMap.csv",
col_types = cols(`Female map` = col_character(),
`Male map` = col_double())) %>%
mutate(fem_map = as.numeric(str_replace(`Female map`, ",", ".")))
# so these things don't overlap super well:
dim(ns_snp)
[1] 4359 1
dim(west_snp)
[1] 4367 1
sum(ns_snp$X1 %in% lmap$SNP_ID)
[1] 2862
sum(ns_snp$X1 %in% c(lmap$SNP_ID, conv$`Illumina V2 ID`))
[1] 3636
sum(west_snp$X1 %in% lmap$SNP_ID)
[1] 2876
sum(west_snp$X1 %in% c(lmap$SNP_ID, conv$`Illumina V2 ID`))
[1] 3653
So, we are a long way from having everything on the map.
Here is what I am going to do—it is an exact science to be sure, but we are interested in seeing what sort of effects physical linkage might have. I willl randomly sprinkle unmapped SNPs around the genome and seeing what the effect is.
We will take a simple mean:
sex_aved <- lmap %>%
mutate(sex_ave_map = (fem_map + `Male map`) / 2)
# and we will store that in the intermediates directory so that we
# can grab it later to tell us which markers were on the map and which
# weren't.
saveRDS(sex_aved, file = "intermediates/01/sex_averaged_map.rds", compress = "xz")
For the markers that are not on the map, for the purposes of simulation, I am going to randomly intersperse them, and then interpolate the map distances, and pretend that is the truth. This is not right, of course, but it will be a good approximation.
To do this, we will write a quick function called sprinkle_into_map
: you pass it 1) the sex-averaged map, and 2) a list of all the markers; in turn it retains only the markers from the list. Those markers that do not appear on the map are sorted randomly into the mapped markers and their “pseudo-mapped” positions are given as the average value of the flanking mapped SNPs.
#' make a map with unmapped markers randomly placed in there
#' @param m the data frame with the sex-averaged map
#' @param s the data frame with a column X1 that has the snp names
#' @return a data frame with two new columns: new_chrom and new_map
#' that have what we want.
sprinkle_into_map <- function(m, s) {
# this just puts them all together (drops mapped SNPs not on the s list)
t1 <- left_join(s,m, by = c("X1" = "SNP_ID")) %>%
arrange(Chromosome, sex_ave_map) %>%
mutate(new_chrom = Chromosome)
# now we count up how many mapped SNPs are in each chromosome. We need this
# info because we are going to sprinkle unmapped SNPs onto chromosomes at a
# rate proportional to the number of mapped SNPs already on the chromosome.
chrom_cts <- t1 %>%
filter(!is.na(Chromosome)) %>%
count(Chromosome)
# in this ugly step we assign the unmapped SNPs to chromosomes
t1$new_chrom[is.na(t1$new_chrom)] <- sample(x = chrom_cts$Chromosome,
size = sum(is.na(t1$new_chrom)),
replace = TRUE,
prob = chrom_cts$n)
# now, we sort do a step that gives simulates the order of the unmapped
# markers---essentially sprinkling them into the existing ones. But we never
# put an unmapped one on the end.
t2 <- t1 %>%
arrange(new_chrom, sex_ave_map) %>%
group_by(new_chrom) %>%
mutate(tmp_rank = 1:n(),
tmp_rank = ifelse(!is.na(sex_ave_map), tmp_rank, sample(x = 2:(sum(!is.na(sex_ave_map)) - 1),
size = sum(is.na(sex_ave_map)),
replace = TRUE))) %>%
arrange(new_chrom, tmp_rank)
# and now we just have some NA values in the sex_ave_map column to fill. I will fill them up and
# down with the last observed value and then average those, which will have the effect of
# filling them in with the mean of the two flanking mapped SNPs
t3 <- t2 %>%
mutate(fillup = sex_ave_map,
filldown = sex_ave_map) %>%
tidyr::fill(fillup, .direction = "up") %>%
tidyr::fill(filldown, .direction = "down") %>%
mutate(new_map = (fillup + filldown) / 2)
# then drop a few columns and return after making pseudo-bp coordinates
# that correspond to 1 cm = 1 Mb. And if things are in the same linkage
# group they just get put 1 bp apart
t3 %>%
select(-(tmp_rank:filldown)) %>%
group_by(new_chrom, new_map) %>%
mutate(pseudo_bp = new_map * 10^6 + 1:n()) %>%
ungroup()
}
So, let’s sprinkle! We set seeds so we get the same results each time.
set.seed(123)
ns_new <- sprinkle_into_map(sex_aved, ns_snp)
set.seed(456)
west_new <- sprinkle_into_map(sex_aved, west_snp)
Now that we have a (pseudo)-linkage map for all the markers we can extract the genotypes from the Genepop files and convert it all into PLINK format. I will do this with awk
. Recall that we got the locus names from those files earlier. Note that we are likely going to have to reorder the markers in these files so that they correspond to the sprinkled marker order.
mkdir data/processed
gzcat data/NSaggregated_RAW.txt.gz | awk 'NF>100 {printf("%s %s 0 0 0 -9", $1, $1); for(i=3;i<=NF;i++) printf(" %d %d", substr($i, 1, 3), substr($i, 4, 3)); printf("\n");}' > data/processed/tmp_ns.ped
gzcat data/WestAgged_RAW.txt.gz | awk 'NF>100 {printf("%s %s 0 0 0 -9", $1, $1); for(i=3;i<=NF;i++) printf(" %d %d", substr($i, 1, 3), substr($i, 4, 3)); printf("\n");}' > data/processed/tmp_west.ped
Now, we have to make the .map files. This should include only those markers that are actually in the GenePop files and which are in the sprinkled maps, and we want them to be in map order.
ns_loci <- scan("data/processed/ns_loci.txt", what = "character")
Read 4359 items
ns_plink_map <- ns_new %>%
mutate(map_dummy = 0) %>%
rename(Variant = X1) %>%
select(new_chrom, Variant, map_dummy, pseudo_bp) %>%
filter(Variant %in% ns_loci)
west_loci <- scan("data/processed/west_loci.txt", what = "character")
Read 4367 items
west_plink_map <- west_new %>%
mutate(map_dummy = 0) %>%
rename(Variant = X1) %>%
select(new_chrom, Variant, map_dummy, pseudo_bp) %>%
filter(Variant %in% west_loci)
Now that we have those pseudo-map files, we are going to need to put the loci in the right order in the PLINK files. We will just read it all into a tbl_df() and then select the columns to be in the right order:
id_cols <- paste("id", 1:6, sep = "")
ns_cols <- paste(rep(ns_loci, each = 2), 1:2, sep = "_")
ns_cols_reordered <- paste(rep(ns_plink_map$Variant, each = 2), 1:2, sep = "_")
nsdf <- read_delim("data/processed/tmp_ns.ped", delim = " ", col_names = c(id_cols, ns_cols), progress = FALSE)
Parsed with column specification:
cols(
.default = col_integer(),
id1 = col_character(),
id2 = col_character()
)
See spec(...) for full column specifications.
ns_reordered <- nsdf[, c(id_cols, ns_cols_reordered)]
west_cols <- paste(rep(west_loci, each = 2), 1:2, sep = "_")
west_cols_reordered <- paste(rep(west_plink_map$Variant, each = 2), 1:2, sep = "_")
westdf <- read_delim("data/processed/tmp_west.ped", delim = " ", col_names = c(id_cols, west_cols), progress = FALSE)
Parsed with column specification:
cols(
.default = col_integer(),
id1 = col_character(),
id2 = col_character()
)
See spec(...) for full column specifications.
west_reordered <- westdf[, c(id_cols, west_cols_reordered)]
# and, having gotten those together, we can now write them to PLINK .ped files
write_delim(ns_reordered, path = "data/processed/ns.ped", delim = " ", col_names = FALSE)
write_delim(west_reordered, path = "data/processed/west.ped", delim = " ", col_names = FALSE)
# and we write out the .map files as well
write_delim(ns_plink_map, path = "data/processed/ns.map", delim = "\t", col_names = FALSE)
write_delim(west_plink_map, path = "data/processed/west.map", delim = "\t", col_names = FALSE)
These will be the “final products” from this particular Rmd. We will store them in intermediates
so that we can work from them directly later on. For this to work you are gonna need plink
on your system path somewhere.
mkdir -p intermediates/01
cd data/processed
plink --file ns --aec --make-bed --out ../../intermediates/01/binary-ns > /dev/null 2>&1
plink --file west --aec --make-bed --out ../../intermediates/01/binary-west > /dev/null 2>&1
And now we can read in the names of individuals and get ready to select different subsets.
ns_names <- read_delim("intermediates/01/binary-ns.fam", delim = " ", col_names = FALSE) %>%
mutate(group = str_sub(X1, 1, 3))
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_integer(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_integer()
)
west_names <- read_delim("intermediates/01/binary-west.fam", delim = " ", col_names = FALSE) %>%
mutate(group = str_sub(X1, 1, 3))
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_integer(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_integer()
)
And here we just offer a little summary of the numbers in each:
ns_names %>% count(group)
west_names %>% count(group)
Those numbers are in agreement with the numbers of fish from the different wild populations in the paper. Note that “ns” is the Maritimes, and “west” is Newfoundland.
Before we go off and use PLINK to compute Fst, I am going to explore just putting all the data in a tidy format, and then computing expected assignment probabilities, since that is another valid way of choosing loci, and possibly better in some contexts.
So, how about a function to read plink ped and map files to tidy format. We have written one of those in its own file, so that it will be easier to package it up later.
source("R/plink2tidy.R")
Here is a listing of it:
#' @param ped path to a space delimited plink ped file
#' @param map path to a tab delimited plink map file
plink2tidy <- function(ped, map) {
m <- read_delim(map, delim = "\t", col_names = FALSE, progress = FALSE) %>%
setNames(c("chrom", "variant", "dummy", "coord"))
p <- read_delim(ped, delim = " ", col_names = FALSE, progress = FALSE, na = "0") %>%
setNames(c("id", "fid", "d1", "d2", "d3", "d4", paste(rep(m$variant, each = 2), c(1,2), sep = " ")))
# now we gather p and drop the columns that are not relevant here
# and then split the gene copies up
pt <- p %>%
select(-(fid:d4)) %>%
tidyr::gather(data = ., key = "snp_and_gc", value = "allele", -id) %>%
tidyr::separate(data = ., col = snp_and_gc, into = c("variant", "gene_copy"), sep = " ", convert = TRUE)
# now bind the marker information on there and make chrom a factor so it sorts in the right order
mp <- left_join(pt, m %>% select(-dummy)) %>%
mutate(chrom = factor(chrom, levels = unique(m$chrom))) %>%
arrange(chrom, coord, id, gene_copy)
# return that
mp
}
Now we use that function to store tidy versions of those data sets:
ns_tidy <- plink2tidy("data/processed/ns.ped", "data/processed/ns.map")
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_double(),
X4 = col_double()
)
Parsed with column specification:
cols(
.default = col_integer(),
X1 = col_character(),
X2 = col_character(),
X3 = col_character(),
X4 = col_character(),
X5 = col_character()
)
See spec(...) for full column specifications.
Joining, by = "variant"
west_tidy <- plink2tidy("data/processed/west.ped", "data/processed/west.map")
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_double(),
X4 = col_double()
)
Parsed with column specification:
cols(
.default = col_integer(),
X1 = col_character(),
X2 = col_character(),
X3 = col_character(),
X4 = col_character(),
X5 = col_character()
)
See spec(...) for full column specifications.
Joining, by = "variant"
saveRDS(ns_tidy, file = "intermediates/01/tidy-ns.rds", compress = "xz")
saveRDS(west_tidy, file = "intermediates/01/tidy-west.rds", compress = "xz")