We will be studying the coalescent process, a mathematical framework that is central to understanding the genetic variation that occurs in a genetic sample that you might collect from a population.
We will spend some time simulating data from the coalescent process
using Dick Hudson’s program ms
(Hudson 2002). I have written a little R package
called ‘GSImulator’ that has pre-compiled versions of ms
for windows and Mac. It is easy to install ‘GSImulator’ using the
‘devtools’ package.
To read in some trees from text files we will use the ‘ape’ package (Paradis, Claude, and Strimmer 2004), and to plot them we will use the ‘ggtree’ package (Yu et al. 2018).
So, there are some steps to prepare your computer for these excercises.
You should have installed:
Then, within R, you need to get some packages.
Important: when you are installing packages, and you are asked if you would like to update existing packages to later versions, then you should answer “No!” at least for the first time. Chances are everything will work fine without spending a long time updating everything.
# if you don't already have the tidyverse, install it:
if(!("tidyverse" %in% rownames(installed.packages()))) {
install.packages("tidyverse")
}
# if you don't already have the remotes package, install it:
if(!("remotes" %in% rownames(installed.packages()))) {
install.packages("remotes")
}
# now, get GSImulator from my GitHub page
remotes::install_github("eriqande/GSImulator", dependencies = FALSE)
# then get ape
if(!("ape" %in% rownames(installed.packages()))) {
install.packages("ape")
}
# and finally install ggtree from bioconductor
if(!("ggtree" %in% rownames(installed.packages()))) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggtree")
}
# when it says: Update all/some/none? [a/s/n], your answer should be n
Once that is done, you can test that you have everything is working by running a few lines of code. First, load the libraries and make sure there are no glaring errors or failures:
library(tidyverse)
library(ape)
library(ggtree)
library(GSImulator)
Then test that we have ms
working. If we invoke
ms
with no arguments it gives us a message about what kinds
of arguments one can use. When you give the following commands, you
should get the output below:
# ms_binary() gives the path to ms on your system
system2(command = ms_binary(), args = "")
Too few command line arguments
usage: ms nsam howmany
Options:
-t theta (this option and/or the next must be used. Theta = 4*N0*u )
-s segsites ( fixed number of segregating sites)
-T (Output gene tree.)
-F minfreq Output only sites with freq of minor allele >= minfreq.
-r rho nsites (rho here is 4Nc)
-c f track_len (f = ratio of conversion rate to rec rate. tracklen is mean length.)
if rho = 0., f = 4*N0*g, with g the gene conversion rate.
-G alpha ( N(t) = N0*exp(-alpha*t) . alpha = -log(Np/Nr)/t
-I npop n1 n2 ... [mig_rate] (all elements of mig matrix set to mig_rate/(npop-1)
-m i j m_ij (i,j-th element of mig matrix set to m_ij.)
-ma m_11 m_12 m_13 m_21 m_22 m_23 ...(Assign values to elements of migration matrix.)
-n i size_i (popi has size set to size_i*N0
-g i alpha_i (If used must appear after -M option.)
The following options modify parameters at the time 't' specified as the first argument:
-eG t alpha (Modify growth rate of all pop's.)
-eg t i alpha_i (Modify growth rate of pop i.)
-eM t mig_rate (Modify the mig matrix so all elements are mig_rate/(npop-1)
-em t i j m_ij (i,j-th element of mig matrix set to m_ij at time t )
-ema t npop m_11 m_12 m_13 m_21 m_22 m_23 ...(Assign values to elements of migration matrix.)
-eN t size (Modify pop sizes. New sizes = size*N0 )
-en t i size_i (Modify pop size of pop i. New size of popi = size_i*N0 .)
-es t i proportion (Split: pop i -> pop-i + pop-npop, npop increases by 1.
proportion is probability that each lineage stays in pop-i. (p, 1-p are admixt. proport.
Size of pop npop is set to N0 and alpha = 0.0 , size and alpha of pop i are unchanged.
-ej t i j ( Join lineages in pop i and pop j into pop j
size, alpha and M are unchanged.
-f filename ( Read command line arguments from file filename.)
See msdoc.pdf for explanation of these parameters.
/Users/eriq/Library/R/4.3/library/GSImulator/bin/ms-Darwin
And try executing the following few lines of code that should produce a plot of a tree somewhat like the one below (though it may be different since you could be starting from different random seeds).
# note that we add stderr = FALSE to the following command because of a bug in R 3.5.1
system2(command = ms_binary(), args = "10 1 -T", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
ggtree(ctree, layout = "rectangular") +
geom_tiplab(size = 2) + coord_flip() + scale_x_reverse()
If you are having problems getting this working, please talk to Eric.