Eric C. Anderson's Work-life

Open research with GitHub

Convert genepop to two column format

04 February 2015

People here at the SWFSC have started using stacks to process ddRad data. They can output data from stacks in genepop format, and sometimes the want to convert that into the format that is useful for slg_pipe.

I would typically have done that sort of thing using sed and awk, but I thought I would give a whirl at doing it in R.

Below is a function I wrote for Martha. It doesn’t do any error checking and it only applies to the particular genepop format that Martha had…but that is what stacks spit out, so it ought to work. I’m sure it could break in myriad ways.

Anyway, here it is. I suppose I should really put this in a gist…

#' convert genepop format (as Martha has it) without much error checking
#' 
#' This is just a quick, and pretty ugly thing.
#'
#' @param infile name of the genepop infile
#' @param outfile name for the two-column format file you wnat to be produced
#' @param NUM Number of digits in the genepop format. 
#' @value This returns a list with the comment line from the genepop file that has been processed.
gpop2twocol <- function(infile, outfile = "two-col.txt", NUM = 2) {
  require(stringr)
  
  x <- readLines(infile)
  ret <- list()  # for the output
  ret$comment <- x[1]  # store the comment and discard from x
  x <- x[-1]
  
  # logical vector of where the pop specifiers are
  poplines <- str_detect(toupper(x), "^POP")
  
  # logical of where commas are:
  commalines <- str_detect(x, ",")
  
  # find the locus lines.  They all have a single word in them
  # and they don't have a comma and the aren't a pop line
  loclines <- sapply(strsplit(x, "\t"), length) == 1  & !poplines & !commalines 
  
  # here are the headers for the loci, essentially
  locus_names <- rep(x[loclines], each = 2)
  
  
  # now we need to get the alleles for everyone
  inds <- x[commalines]
  inds <- str_replace_all(inds, ",", "")
  indslist <- strsplit(inds, "\t")
  
  inds_for_output <- sapply(indslist, function(x) {
    ID <- x[1]
    a <- x[-1]
    tmp <- paste(as.numeric(str_sub(a, 1, NUM)), "\t", as.numeric(str_sub(a, NUM+1, 2 * NUM)), sep="")
    tmp2 <- c(ID, tmp)
    paste(tmp2, collapse = "\t")
  })
  
  
  # now output everything to a file
  header_line <- paste(c("",locus_names), collapse = "\t")
  
  cat(c(header_line, inds_for_output), sep = "\n", file = outfile)
  
  ret$comment
}

So, go ahead and copy that function, source it in R, and then do something like this:

gpop2twocol("MyFile.txt")