What we call Panel 3 of assays includes the 96 SNPs that we chose to use to type the wintering birds so as to assign them to different breeding regions in North America.
In this notebook, we process the raw output of four plates (plates 4 through 7) of birds typed with Fluidigm panel 3.
We process the call-map output from the Fludigm software to make genotype files that are easily read by different programs, and by R. The call-map CSV files here have been converted to have Unix line endings (from the PC line endings that they typically have by default.)
library(tidyverse)
dir.create("outputs/007", recursive = TRUE, showWarnings = FALSE)
# run things through some scripts
cd data/fluidigm_panel_3_chips;
../../script/process-fluidigm-call-map-results.sh Plate*_Panel3.csv;
mv AllFiles_TwoCol.txt ../../outputs/007/panel3_all_two_col.txt;
mv AllZeroReports.txt ../../outputs/007/panel3_zero_reports.txt;
# clean up all the intermediate files
rm *gsisim* *twocol* xxxtempxxx
We have individuals genotyped in several different plates/panels, etc. Some might have been typed in more than one such unit. We call these geno_units, and we keep track of them for purposes of identifying which unit's results we will retain.
Join the panels.
# then read them in
panel3 <- read_tsv("outputs/007/panel3_all_two_col.txt") %>%
rename(Field_Number = Sample_ID)
dim(panel3)
## [1] 376 193
That is 376 rows (individuals) corresponding to 94 birds per plate for 4 plates. There are 193 columns, which is 1 ID column and then two columns for each of 96 SNPs.
To do this, we put it all in long format, and then it is entirely easier to deal with downstream:
nc <- ncol(panel3)
# name the gene copies .1 and .2
names(panel3)[seq(2, nc, by = 2)] <- paste(
names(panel3)[seq(2, nc, by = 2)],
".1",
sep = ""
)
names(panel3)[seq(3, nc, by = 2)] <- str_replace(
names(panel3)[seq(3, nc, by = 2)],
"_1$",
".2"
)
panel3_long <- panel3 %>%
pivot_longer(
cols = -Field_Number,
names_to = "locgc",
values_to = "allele"
) %>%
separate(locgc, into = c("assay_name", "gene_copy"), sep = "\\.") %>%
mutate(
allele_ACGT = recode(
allele,
`1` = "A",
`2` = "C",
`3` = "G",
`4` = "T",
`0` = NA_character_
),
geno_unit = "panel3"
) %>%
select(geno_unit, everything())
There were some plates that included re-runs and additional samples. These are stored in the data/fluidigm_additional_chips
folder.
p16 <- read_csv("data/fluidigm_additional_chips/WIFL.Plate16_PopID_CBdetailedResults.csv") %>%
select(Name, Assay, Converted) %>%
rename(
Field_Number = Name,
assay_name = Assay
) %>%
filter(Field_Number != "NTC") %>%
mutate(
Converted = ifelse(
Converted %in% c("No Call", "Invalid"),
NA,
Converted
),
geno_unit = "plate16"
) %>%
select(geno_unit, everything())
## Warning: Duplicated column names deduplicated: 'Allele X' => 'Allele X_1' [11],
## 'Allele Y' => 'Allele Y_1' [12]
And now we can break those genotypes into a long format:
p16_long <- p16 %>%
separate(Converted, into = c("1", "2"), remove = TRUE, sep = ":") %>%
pivot_longer(
cols = c(`1`, `2`),
names_to = "gene_copy",
values_to = "allele_ACGT"
) %>%
mutate(
allele = case_when(
allele_ACGT == "A" ~ 1,
allele_ACGT == "C" ~ 2,
allele_ACGT == "G" ~ 3,
allele_ACGT == "T" ~ 4
)
)
p316 <- bind_rows(
panel3_long,
p16_long
)
See if there are any re-genos:
p316 %>%
count(geno_unit, Field_Number) %>%
count(Field_Number) %>%
filter(n > 1)
It would appear that there are no re-genos there.
Then we have to associate the assay names with the names based on scaffold and position and the REF and ALT at each of those.
assay_meta <- read_csv("data/WIFL_fluidigm_assay_info.csv") %>%
dplyr::select(SNP_Name, CHROM, REF, ALT)
p316_2 <- left_join(
p316,
assay_meta,
by = c("assay_name" = "SNP_Name")
) %>%
mutate(
allele_01 = case_when( # here we code alleles as 0 = REF, 1 = ALT, NA = missing
is.na(allele_ACGT) ~ NA_integer_,
allele_ACGT == REF ~ 0L,
allele_ACGT == ALT ~ 1L,
TRUE ~ -999L
)
) %>%
rename(CHROMPOS = CHROM)
That is a long format that is worth seeing, because it will make it fairly straightforward to combine different data sets, etc. Here is what it looks like:
head(p316_2)
We write those genotypes out in case we need them later.
write_rds(p316_2, file = "outputs/007/470_birds_at_96_panel3_fluidigm_snps.rds", compress = "xz")
We might have some wintering birds typed on other plates, so read those earlier plates in and only keep the 96 assays of panel 3.
the564 <- read_rds("outputs/003/564_birds_at_192_fluidigm_snps.rds") %>%
mutate(geno_unit = "early_plates") %>%
select(geno_unit, everything()) %>%
semi_join(panel3_long, by = "assay_name")
all_them <- bind_rows(p316_2, the564) %>%
semi_join(panel3_long, by = "assay_name") %>%
mutate(allele = ifelse(allele == 0, NA, allele))
We also attach some other names to them and note their stage.
app1_WM <- read_csv("data/appendices/app1-winter-indivs-assignments-and-genos.csv")
needed_meta <- app1_WM %>%
filter(geno_method == "fluidigm") %>%
select(Field_Number, geno_method, Lat, Long, Stage)
winter_genos <- inner_join(all_them, needed_meta, by = "Field_Number") %>%
select(Field_Number, geno_method, Lat, Long, Stage, everything()) %>%
filter(Stage == "Wintering")
Now, check how many genotypes there are, because there may have been overlap on some assays
winter_genos %>%
count(Field_Number) %>%
count(n)
So, 21 of the individuals don't have a value recorded at one locus, it appears. Not to worry, that will become an NA when we widen it.
Look at missing data:
# what are the possible allele values?
winter_genos %>%
count(allele, allele_ACGT)
And, what is the distribution of the number of typed loci?
loci_typed <- winter_genos %>%
filter(!is.na(allele)) %>%
group_by(Field_Number) %>%
summarise(num_non_missing_loci = n() / 2)
ggplot(loci_typed, aes(x = num_non_missing_loci)) +
geom_histogram(binwidth = 1)
And, at the same time, we want to write out a data frame that we can compare to the wintering birds data in the appendix.
We need a wide data frame with columns sample_type
, repunit
, collection
, and indiv
, as well as two columns for each locus. We will use the allele base calls.
tmp <- winter_genos %>%
select(Field_Number, CHROMPOS, gene_copy, allele_ACGT) %>%
arrange(Field_Number, CHROMPOS, allele_ACGT)
# make a wide data frame to compare to the appendix
compare_to_appendix <- tmp %>%
group_by(Field_Number, CHROMPOS) %>%
summarise(geno = paste(allele_ACGT, collapse = ":")) %>%
ungroup() %>%
mutate(geno = ifelse(geno == "NA:NA", NA, geno)) %>%
pivot_wider(
names_from = CHROMPOS,
values_from = geno
)
# need to insert some code here to confirm that this gives us
# what we are distributing in the appendix
# now, make a rubias compliant tibble. It turns out that we have
# to arrange by gene-copy on to get the .1 and .2 columns in the
# correct order. Who knew...
winter_wide <- tmp %>%
arrange(Field_Number, CHROMPOS, gene_copy) %>%
pivot_wider(
names_from = c(CHROMPOS, gene_copy),
names_sep = ".",
values_from = allele_ACGT
)
fni <- seq(2, ncol(winter_wide), by = 2)
names(winter_wide)[fni] <- str_replace(
names(winter_wide)[fni],
"\\.1$",
""
)
# then, add the required columns
winter_rubias <- winter_wide %>%
mutate(
sample_type = "mixture",
repunit = NA_character_,
collection = "winter"
) %>%
rename(indiv = Field_Number) %>%
select(
sample_type,
repunit,
collection,
indiv,
everything()
)
# write that out to outputs
write_rds(
winter_rubias,
file = "outputs/007/winter_rubias.rds",
compress = "xz"
)
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed, listed in HH:MM:SS
.
td <- Sys.time() - start_time
tdf <- hms::as_hms(ceiling(hms::as_hms(td)))
dir.create("stored_run_times", showWarnings = FALSE, recursive = TRUE)
write_rds(tdf, "stored_run_times/003.rds")
tdf
## 00:00:10
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os macOS Mojave 10.14.6
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2021-03-04
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
## backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
## broom 0.7.3 2020-12-16 [1] CRAN (R 4.0.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.2)
## cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.2)
## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
## dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
## farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.2)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.2)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
## ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
## haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.2)
## hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.2)
## htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
## knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
## lubridate 1.7.9.2 2020-11-13 [1] CRAN (R 4.0.2)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.2)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
## ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.2)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
## readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.2)
## rlang 0.4.9 2020-11-26 [1] CRAN (R 4.0.2)
## rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.2)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
## scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
## tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.2)
## tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
## vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
## xfun 0.19 2020-10-30 [1] CRAN (R 4.0.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
##
## [1] /Users/eriq/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library