In this notebook, we process the raw output of the 192 SNPs that are on two plates of assays on Fluidigm chips to store the results in an R-friendly format. We did 6 plates of birds named Plate_1, Plate_2, Plate_3, Plate_8, Plate_9, and Plate_10. On each plate there are 94 birds (plus 2 no-template-controls). Each plate was assayed with two different panels of 96 SNPs, Panel1 and Panel2, respectively. So, in total, for this section of the project, we ran 12 Fluidigm chips, giving us data on 564 birds at 192 SNP assays.

Here, we process the call-map output from the Fludigm software to make genotype files that are easily read by different programs, and by R. The call-map CSV files here have been converted to have Unix line endings (from the PC line endings that they typically have by default.)

library(tidyverse)

dir.create("outputs/003", recursive = TRUE, showWarnings = FALSE)

Read the chips and process with shell scripts

# run things through some scripts
cd data/fluidigm_panel_1_and_2_chips; 

../../script/process-fluidigm-call-map-results.sh Plate*_Panel1.csv; 
mv AllFiles_TwoCol.txt ../../outputs/003/panel1_all_two_col.txt; 
mv AllZeroReports.txt ../../outputs/003/panel1_zero_reports.txt;
# clean up all the intermediate files
rm *gsisim* *twocol* xxxtempxxx

../../script/process-fluidigm-call-map-results.sh Plate*_Panel2.csv; 
mv AllFiles_TwoCol.txt ../../outputs/003/panel2_all_two_col.txt; 
mv AllZeroReports.txt ../../outputs/003/panel2_zero_reports.txt;
# clean up all the intermediate files
rm *gsisim* *twocol* xxxtempxxx

Read into R, join the two panels, and write them all out

Join the panels.

# then read them in
panel1 <- read_tsv("outputs/003/panel1_all_two_col.txt")
panel2 <- read_tsv("outputs/003/panel2_all_two_col.txt")

# then join them
both <- left_join(panel1, panel2) %>%
  rename(Field_Number = Sample_ID)

That is 564 rows (individuals) corresponding to 94 birds per plate for 6 plates.

Now, make the genotype calls 0,1,2 and name the loci by scaffold and position so that these can be combined with the RAD data

To do this, we put it all in long format, and then it is entirely easier to deal with downstream:

nc <- ncol(both)
# name the gene copies .1 and .2
names(both)[seq(2, nc, by = 2)] <- paste(
  names(both)[seq(2, nc, by = 2)], 
  ".1", 
  sep = ""
)
names(both)[seq(3, nc, by = 2)] <- str_replace(
  names(both)[seq(3, nc, by = 2)], 
  "_1$", 
  ".2"
)

genos_long <- both %>%
  pivot_longer(
    cols = -Field_Number,
    names_to = "locgc",
    values_to = "allele"
  ) %>%
  separate(locgc, into = c("assay_name", "gene_copy"), sep = "\\.") %>%
  mutate(
    allele_ACGT = recode(
      allele,
      `1` = "A",
      `2` = "C",
      `3` = "G",
      `4` = "T",
      `0` = NA_character_
    )
  )

Then we have to associate the assay names with the names based on scaffold and position and the REF and ALT at each of those.

assay_meta <- read_csv("data/WIFL_fluidigm_assay_info.csv") %>%
  dplyr::select(SNP_Name, CHROM, REF, ALT)

genos_long2 <- left_join(
  genos_long,
  assay_meta,
  by = c("assay_name" = "SNP_Name")
) %>%
  mutate(
    allele_01 = case_when( # here we code alleles as 0 = REF, 1 = ALT, NA = missing
      is.na(allele_ACGT) ~ NA_integer_,
      allele_ACGT == REF ~ 0L,
      allele_ACGT == ALT ~ 1L,
      TRUE ~ -999L
    )
  ) %>%
  rename(CHROMPOS = CHROM)

That is a long format that is worth seeing, because it will make it fairly straightforward to combine different data sets, etc. Here is what it looks like:

head(genos_long2)

We write those genotypes out in case we need them later.

dir.create("outputs/003", showWarnings = FALSE, recursive = TRUE)
write_rds(genos_long2, file = "outputs/003/564_birds_at_192_fluidigm_snps.rds", compress = "xz")

Extract the birds that were identified as breeders from the above

We also attach some other names to them, for the structure runs, etc. And we record their 012 genos, as well (albeit in two rows for each locus by individual), but we can always slice those out as need be.

app1_SI_1 <- read_csv("data/appendices/app1-SITable1_Breed_IndvAssigwgenos.csv") 

fluidigm_breeder_names <- app1_SI_1 %>%
  filter(geno_method == "fluidigm") %>%
  select(Field_Number, short_name, group_af_short_name, geno_method, Stage)

fluidigm_breeders_long <- inner_join(fluidigm_breeder_names, genos_long2, by = "Field_Number") %>%
  group_by(Field_Number, CHROMPOS) %>%
  mutate(geno012 = sum(allele_01)) %>%
  ungroup()

Check on the extent of missing data, and drop high-missing loci

miss_rates <- fluidigm_breeders_long %>%
  group_by(CHROMPOS) %>%
  summarise(miss_fract = mean(is.na(allele_ACGT)))

ggplot(miss_rates, aes(x = miss_fract)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = 0.05, linetype = "dashed")

The dashed line there shows a 5% cutoff, which is what we applied to toss out other loci.

Here are the loci that we toss:

toss_em <- miss_rates %>%
  filter(miss_fract > 0.05) %>%
  arrange(desc(miss_fract))

# have a look at all of them
left_join(
  miss_rates %>% arrange(desc(miss_fract)),
  assay_meta,
  by = c("CHROMPOS" = "CHROM")
)

Remove those to make our final data set of Fluidigm-typed breeders and write it out.

final_fluidigm_breeders_179_loci_393_birds_long <- fluidigm_breeders_long %>%
  anti_join(toss_em, by = "CHROMPOS")
write_rds(
  final_fluidigm_breeders_179_loci_393_birds_long,
  file = "outputs/003/final_fluidigm_breeders_179_loci_393_birds_long.rds",
  compress = "xz"
)

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed, listed in HH:MM:SS.

td <- Sys.time() - start_time
tdf <- hms::as_hms(ceiling(hms::as_hms(td)))
dir.create("stored_run_times", showWarnings = FALSE, recursive = TRUE)
write_rds(tdf, "stored_run_times/003.rds")
tdf
## 00:00:16

Session Info

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       macOS Mojave 10.14.6        
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2021-03-04                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
##  backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
##  broom         0.7.3   2020-12-16 [1] CRAN (R 4.0.2)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.0.2)
##  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.2)
##  colorspace    2.0-0   2020-11-11 [1] CRAN (R 4.0.2)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
##  DBI           1.1.0   2019-12-15 [1] CRAN (R 4.0.2)
##  dbplyr        2.0.0   2020-11-03 [1] CRAN (R 4.0.2)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
##  dplyr       * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
##  farver        2.0.3   2020-01-16 [1] CRAN (R 4.0.2)
##  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 4.0.2)
##  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
##  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
##  ggplot2     * 3.3.2   2020-06-19 [1] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.2)
##  haven         2.3.1   2020-06-01 [1] CRAN (R 4.0.2)
##  hms           0.5.3   2020-01-08 [1] CRAN (R 4.0.2)
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.2)
##  httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.2)
##  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
##  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.0.2)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.2)
##  lubridate     1.7.9.2 2020-11-13 [1] CRAN (R 4.0.2)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
##  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.0.2)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.2)
##  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.2)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
##  ps            1.5.0   2020-12-05 [1] CRAN (R 4.0.2)
##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
##  Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
##  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
##  readxl        1.3.1   2019-03-13 [1] CRAN (R 4.0.2)
##  reprex        0.3.0   2019-05-16 [1] CRAN (R 4.0.2)
##  rlang         0.4.9   2020-11-26 [1] CRAN (R 4.0.2)
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.2)
##  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
##  rvest         0.3.6   2020-07-25 [1] CRAN (R 4.0.2)
##  scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.2)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
##  tibble      * 3.0.4   2020-10-12 [1] CRAN (R 4.0.2)
##  tidyr       * 1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
##  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
##  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 4.0.2)
##  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.2)
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
##  xfun          0.19    2020-10-30 [1] CRAN (R 4.0.2)
##  xml2          1.3.2   2020-04-23 [1] CRAN (R 4.0.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
## 
## [1] /Users/eriq/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Literature Cited