Intro/Overview

After I write some functions for segregating haplotypes and creating hybrids I am going to need some functions for turning data frames of genotypes into newhybrids input files. I am going to work on this before doing the segregation code since I think it will inform what I want the end result of that segregation code to look like.

Our basic goal here is to pass in a data frame of individuals that has at least the columns: id, variant, gene_copy, allele, and also pass in a vector of variant names, and then this will find all the common variant names and make a data frame that is suitable for printing out for newhybrids.

Mostly this will be a lot of tidyr…

library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages --------------------------------------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
library(stringr)
source("R/nh_tablify.R")
source("R/write_nh.R")

Functions

This is just one that makes tables of nonLumped genotypes that we will be able to print to newhybrids files. (NOTE to self, this is going to work poorly if the group is missing data at all indivs at a locus you want…)

#' @param D a tidy data frame with at least the columns: id, variant, gene_copy, allele
#' @param V a vector of variant names giving which variants to use and the order they are
#' to be in.
#' @param opt_str  The options string to be added to each individual (like "z0s")
#' @param id_prepend this string will be prepended to the fish id (i.e. use it for "train_", or "test_")
nh_tablify <- function(D, V, opt_str = "", id_prepend = "") {
  D1 <- D %>%
    filter(variant %in% V)
  
  common_loc <- V[V %in% D1$variant]  # these are the loci from V that are in D, in V-order
  
  # now spread the gene copies and then the variants
  D1 %>%
    select(id, variant, gene_copy, allele) %>%
    tidyr::spread(data = ., key = gene_copy, value = allele) %>%
    mutate(gc1 = as.integer(ifelse(is.na(`1`), -1, `1`)),
           gc2 = as.integer(ifelse(is.na(`2`), -1, `2`))) %>%
    select(-`1`, -`2`) %>%
    mutate(geno = paste(gc1, gc2, sep = " ")) %>%
    select(-gc1, -gc2) %>% 
    tidyr::spread(data = ., key = variant, value = geno, fill = "-1 -1") %>%
    .[, c("id", common_loc)] %>%
    mutate(option = opt_str) %>%
    mutate(id = paste(id_prepend, id, sep = "")) %>%
    select(id, option, everything())
  
}

Now, if you have bind_rows-ed together a bunch of products of nh_tablify() then you can pass them to this function to write a newhybs file:

#' write a new_hybs file
#'
#' @param G a data frame like that prepared by nh_tablify, or perhaps several of those
#' that have been bind_rows()-ed together.
#' @param path  the file path you want to write to.
write_nh <- function(G, path = "nh_data.txt") {
  # first write the preamble
  cat("NumIndivs ", nrow(G), "\nNumLoci ", ncol(G) - 2, "\nDigits 1\nFormat NonLumped\n\n", file = path)
  
  # then the locus names
  cat(c("LocusNames  ", names(G)[-(1:2)]), sep = " ", eol = "\n", file = path, append = TRUE)
  
  # then format the genotype frame and write it
  G %>%
    mutate(id = paste("n", id, sep = " ")) %>%
    ungroup() %>%
    mutate(idx = 1:n()) %>%
    select(idx, everything()) %>%
    write.table(file = path, append = TRUE, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "   ", eol = "\n")
  
}

Putting those together

OK, now that we have those functions we can put them all together with a simple example using the intermediate data stored from the last document (02). First let us get the data.

TT <- readRDS("intermediates/02/test_and_train.rds")
MM <- readRDS("intermediates/02/marker_rankings.rds")

Now, choose the top 1000 loci to use:

v1000 <- MM %>%
  filter(selectable == TRUE & cumsum(selectable) <= 1000) %>%
  .$variant

Now, we are going are just going to put pure wild and pure farmed individuals in there as a test case. We are going to include the training individuals with the z0s and z1s options. We can do this all into one big list first:

nh_list <- list(
train_farm = TT %>%
  filter(test_or_train == "train", group == "farmed") %>%
  nh_tablify(., v1000, opt_str = "z0s", id_prepend = "train_"),
train_wild = TT %>%
  filter(test_or_train == "train", group == "wild") %>%
  nh_tablify(., v1000, opt_str = "z1s", id_prepend = "train_"),
test_farm = TT %>%
  filter(test_or_train == "test", group == "farmed") %>%
  nh_tablify(., v1000, opt_str = "", id_prepend = "test_"),
test_wild = TT %>%
  filter(test_or_train == "test", group == "wild") %>%
  nh_tablify(., v1000, opt_str = "", id_prepend = "test_")
)
dir.create("scratch", showWarnings = FALSE)
bind_rows(nh_list) %>%
  write_nh(path = "scratch/all_pures.txt")

Note that if I want to run that in newhybrids with a fixed prior for the different categories, I can do that like this:

~/Documents/git-repos/newhybrids/newhybs -d all_pures.txt -g P0 1 0 0 -g p1 0 0 1 -g F1 0 1 0 -g F2 .25 .5 .25 -g BX0 .5 .5 0 -g BX1 0 .5 .5 --pi-prior fixed  1 1 1 1 1 1

Looking at the output, it is easy to verify that when the prior is fixed and equal for all categories the aa-PofZ.txt and aa-ScaledLikelihood.txt files contain exactly the same values. But the aa-PofZ.txt also has an IndivName column (using the “n” option for names), so that will be the one that we want to use.

Great. We are almost ready to put it all together, I just need to make a function that creates hybrids from the test individuals.

LS0tCnRpdGxlOiAiVXRpbGl0aWVzIGZvciBNYWtpbmcgTmV3SHlicyBpbnB1dCBGaWxlcyIKb3V0cHV0OiAKICBodG1sX25vdGVib29rOgogICAgdG9jOiB0cnVlCi0tLQoKCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQojIHNldCB0aGUgd29ya2luZyBkaXJlY3RvcnkgYWx3YXlzIHRvIHRoZSBwcm9qZWN0IGRpcmVjdG9yeSAob25lIGxldmVsIHVwKQprbml0cjo6b3B0c19rbml0JHNldChyb290LmRpciA9IG5vcm1hbGl6ZVBhdGgocnByb2pyb290OjpmaW5kX3JzdHVkaW9fcm9vdF9maWxlKCkpKSAKYGBgCgojIyBJbnRyby9PdmVydmlldwoKQWZ0ZXIgSSB3cml0ZSBzb21lIGZ1bmN0aW9ucyBmb3Igc2VncmVnYXRpbmcgaGFwbG90eXBlcyBhbmQgY3JlYXRpbmcgaHlicmlkcyBJCmFtIGdvaW5nIHRvIG5lZWQgc29tZSBmdW5jdGlvbnMgZm9yIHR1cm5pbmcgZGF0YSBmcmFtZXMgb2YgZ2Vub3R5cGVzIGludG8KbmV3aHlicmlkcyBpbnB1dCBmaWxlcy4gIEkgYW0gZ29pbmcgdG8gd29yayBvbiB0aGlzIGJlZm9yZSBkb2luZyB0aGUgc2VncmVnYXRpb24KY29kZSBzaW5jZSBJIHRoaW5rIGl0IHdpbGwgaW5mb3JtIHdoYXQgSSB3YW50IHRoZSBlbmQgcmVzdWx0IG9mIHRoYXQgc2VncmVnYXRpb24KY29kZSB0byBsb29rIGxpa2UuCgpPdXIgYmFzaWMgZ29hbCBoZXJlIGlzIHRvIHBhc3MgaW4gYSBkYXRhIGZyYW1lIG9mIGluZGl2aWR1YWxzIHRoYXQgaGFzIAphdCBsZWFzdCB0aGUgY29sdW1uczogaWQsIHZhcmlhbnQsIGdlbmVfY29weSwgYWxsZWxlLCBhbmQgYWxzbyBwYXNzIGluCmEgdmVjdG9yIG9mIHZhcmlhbnQgbmFtZXMsIGFuZCB0aGVuIHRoaXMgd2lsbCBmaW5kIGFsbCB0aGUgY29tbW9uIHZhcmlhbnQKbmFtZXMgYW5kIG1ha2UgYSBkYXRhIGZyYW1lIHRoYXQgaXMgc3VpdGFibGUgZm9yIHByaW50aW5nIG91dCBmb3IgbmV3aHlicmlkcy4KCk1vc3RseSB0aGlzIHdpbGwgYmUgYSBsb3Qgb2YgdGlkeXIuLi4KCmBgYHtyIGxvYWQtbGlic30KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkoc3RyaW5ncikKCnNvdXJjZSgiUi9uaF90YWJsaWZ5LlIiKQpzb3VyY2UoIlIvd3JpdGVfbmguUiIpCmBgYAoKIyMgRnVuY3Rpb25zCgpUaGlzIGlzIGp1c3Qgb25lIHRoYXQgbWFrZXMgdGFibGVzIG9mIG5vbkx1bXBlZCBnZW5vdHlwZXMgdGhhdCB3ZSB3aWxsIGJlIGFibGUgdG8gCnByaW50IHRvIG5ld2h5YnJpZHMgZmlsZXMuIChOT1RFIHRvIHNlbGYsIHRoaXMgaXMgZ29pbmcgdG8gd29yayBwb29ybHkgaWYgCnRoZSBncm91cCBpcyBtaXNzaW5nIGRhdGEgYXQgYWxsIGluZGl2cyBhdCBhIGxvY3VzIHlvdSB3YW50Li4uKQpgYGByCmByIHBhc3RlKHJlYWRMaW5lcygiUi9uaF90YWJsaWZ5LlIiKSwgY29sbGFwc2UgPSAiXG4iKWAKYGBgCgpOb3csIGlmIHlvdSBoYXZlIGJpbmRfcm93cy1lZCB0b2dldGhlciBhIGJ1bmNoIG9mIHByb2R1Y3RzIG9mIGBuaF90YWJsaWZ5KClgIHRoZW4geW91IGNhbgpwYXNzIHRoZW0gdG8gdGhpcyBmdW5jdGlvbiB0byB3cml0ZSBhIG5ld2h5YnMgZmlsZToKYGBgcgpgciBwYXN0ZShyZWFkTGluZXMoIlIvd3JpdGVfbmguUiIpLCBjb2xsYXBzZSA9ICJcbiIpYApgYGAKCiMjIFB1dHRpbmcgdGhvc2UgdG9nZXRoZXIKCk9LLCBub3cgdGhhdCB3ZSBoYXZlIHRob3NlIGZ1bmN0aW9ucyB3ZSBjYW4gcHV0IHRoZW0gYWxsIHRvZ2V0aGVyIHdpdGggYSBzaW1wbGUgZXhhbXBsZSB1c2luZwp0aGUgaW50ZXJtZWRpYXRlIGRhdGEgc3RvcmVkIGZyb20gdGhlIGxhc3QgZG9jdW1lbnQgKDAyKS4gICBGaXJzdCBsZXQgdXMgZ2V0IHRoZSBkYXRhLgpgYGB7ciBnZXQtZGF0YX0KVFQgPC0gcmVhZFJEUygiaW50ZXJtZWRpYXRlcy8wMi90ZXN0X2FuZF90cmFpbi5yZHMiKQpNTSA8LSByZWFkUkRTKCJpbnRlcm1lZGlhdGVzLzAyL21hcmtlcl9yYW5raW5ncy5yZHMiKQpgYGAKCk5vdywgY2hvb3NlIHRoZSB0b3AgMTAwMCBsb2NpIHRvIHVzZToKYGBge3IgdG9wMjAwfQp2MTAwMCA8LSBNTSAlPiUKICBmaWx0ZXIoc2VsZWN0YWJsZSA9PSBUUlVFICYgY3Vtc3VtKHNlbGVjdGFibGUpIDw9IDEwMDApICU+JQogIC4kdmFyaWFudApgYGAKCk5vdywgd2UgYXJlIGdvaW5nIGFyZSBqdXN0IGdvaW5nIHRvIHB1dCBwdXJlIHdpbGQgYW5kIHB1cmUgZmFybWVkIGluZGl2aWR1YWxzIGluIHRoZXJlCmFzIGEgdGVzdCBjYXNlLiAgV2UgYXJlIGdvaW5nIHRvIGluY2x1ZGUgdGhlIHRyYWluaW5nIGluZGl2aWR1YWxzIHdpdGggdGhlIHowcyBhbmQgejFzIApvcHRpb25zLiAgV2UgY2FuIGRvIHRoaXMgYWxsIGludG8gb25lIGJpZyBsaXN0IGZpcnN0OgpgYGB7ciBtYWtlLWNodW5rc30KbmhfbGlzdCA8LSBsaXN0KAp0cmFpbl9mYXJtID0gVFQgJT4lCiAgZmlsdGVyKHRlc3Rfb3JfdHJhaW4gPT0gInRyYWluIiwgZ3JvdXAgPT0gImZhcm1lZCIpICU+JQogIG5oX3RhYmxpZnkoLiwgdjEwMDAsIG9wdF9zdHIgPSAiejBzIiwgaWRfcHJlcGVuZCA9ICJ0cmFpbl8iKSwKdHJhaW5fd2lsZCA9IFRUICU+JQogIGZpbHRlcih0ZXN0X29yX3RyYWluID09ICJ0cmFpbiIsIGdyb3VwID09ICJ3aWxkIikgJT4lCiAgbmhfdGFibGlmeSguLCB2MTAwMCwgb3B0X3N0ciA9ICJ6MXMiLCBpZF9wcmVwZW5kID0gInRyYWluXyIpLAp0ZXN0X2Zhcm0gPSBUVCAlPiUKICBmaWx0ZXIodGVzdF9vcl90cmFpbiA9PSAidGVzdCIsIGdyb3VwID09ICJmYXJtZWQiKSAlPiUKICBuaF90YWJsaWZ5KC4sIHYxMDAwLCBvcHRfc3RyID0gIiIsIGlkX3ByZXBlbmQgPSAidGVzdF8iKSwKdGVzdF93aWxkID0gVFQgJT4lCiAgZmlsdGVyKHRlc3Rfb3JfdHJhaW4gPT0gInRlc3QiLCBncm91cCA9PSAid2lsZCIpICU+JQogIG5oX3RhYmxpZnkoLiwgdjEwMDAsIG9wdF9zdHIgPSAiIiwgaWRfcHJlcGVuZCA9ICJ0ZXN0XyIpCikKCmRpci5jcmVhdGUoInNjcmF0Y2giLCBzaG93V2FybmluZ3MgPSBGQUxTRSkKYmluZF9yb3dzKG5oX2xpc3QpICU+JQogIHdyaXRlX25oKHBhdGggPSAic2NyYXRjaC9hbGxfcHVyZXMudHh0IikKYGBgCgpOb3RlIHRoYXQgaWYgSSB3YW50IHRvIHJ1biB0aGF0IGluIG5ld2h5YnJpZHMgd2l0aCBhIGZpeGVkIHByaW9yIGZvciB0aGUgZGlmZmVyZW50IGNhdGVnb3JpZXMsIEkgCmNhbiBkbyB0aGF0IGxpa2UgdGhpczoKYGBgc2gKfi9Eb2N1bWVudHMvZ2l0LXJlcG9zL25ld2h5YnJpZHMvbmV3aHlicyAtZCBhbGxfcHVyZXMudHh0IC1nIFAwIDEgMCAwIC1nIHAxIDAgMCAxIC1nIEYxIDAgMSAwIC1nIEYyIC4yNSAuNSAuMjUgLWcgQlgwIC41IC41IDAgLWcgQlgxIDAgLjUgLjUgLS1waS1wcmlvciBmaXhlZCAgMSAxIDEgMSAxIDEKYGBgCkxvb2tpbmcgYXQgdGhlIG91dHB1dCwgaXQgaXMgZWFzeSB0byB2ZXJpZnkgdGhhdCB3aGVuIHRoZSBwcmlvciBpcyBmaXhlZCBhbmQgZXF1YWwgZm9yIGFsbCBjYXRlZ29yaWVzCnRoZSBgYWEtUG9mWi50eHRgIGFuZCBgYWEtU2NhbGVkTGlrZWxpaG9vZC50eHRgIGZpbGVzIGNvbnRhaW4gZXhhY3RseSB0aGUgc2FtZSB2YWx1ZXMuCkJ1dCB0aGUgYGFhLVBvZloudHh0YCBhbHNvIGhhcyBhbiBJbmRpdk5hbWUgY29sdW1uICh1c2luZyB0aGUgIm4iIG9wdGlvbiBmb3IgbmFtZXMpLCBzbyB0aGF0CndpbGwgYmUgdGhlIG9uZSB0aGF0IHdlIHdhbnQgdG8gdXNlLiAKCkdyZWF0LiAgV2UgYXJlIGFsbW9zdCByZWFkeSB0byBwdXQgaXQgYWxsIHRvZ2V0aGVyLCBJIGp1c3QgbmVlZCB0byBtYWtlIGEgZnVuY3Rpb24KdGhhdCBjcmVhdGVzIGh5YnJpZHMgZnJvbSB0aGUgdGVzdCBpbmRpdmlkdWFscy4gIAoK