We will be studying the coalescent process, a mathematical framework that is central to understanding the genetic variation that occurs in a genetic sample that you might collect from a population.

We will spend some time simulating data from the coalescent process using Dick Hudson’s program ms (Hudson 2002). I have written a little R package called ‘GSImulator’ that has pre-compiled versions of ms for windows and Mac. It is easy to install ‘GSImulator’ using the ‘devtools’ package.

To read in some trees from text files we will use the ‘ape’ package (Paradis, Claude, and Strimmer 2004), and to plot them we will use the ‘ggtree’ package (Yu et al. 2018).

So, there are some steps to prepare your computer for these excercises.

You should have installed:

  1. A recent version of R (> 4.0, say)
  2. RStudio (you don’t have to use this, but it is a great IDE)

Then, within R, you need to get some packages.

Important: when you are installing packages, and you are asked if you would like to update existing packages to later versions, then you should answer “No!” at least for the first time. Chances are everything will work fine without spending a long time updating everything.

# if you don't already have the tidyverse, install it:
if(!("tidyverse" %in% rownames(installed.packages()))) {
  install.packages("tidyverse")
}


# if you don't already have the remotes package, install it:
if(!("remotes" %in% rownames(installed.packages()))) {
  install.packages("remotes")
}

# now, get GSImulator from my GitHub page
remotes::install_github("eriqande/GSImulator", dependencies = FALSE)

# then get ape
if(!("ape" %in% rownames(installed.packages()))) {
  install.packages("ape")
}


# and finally install ggtree from bioconductor
if(!("ggtree" %in% rownames(installed.packages()))) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install("ggtree")
}

# when it says: Update all/some/none? [a/s/n], your answer should be n

Once that is done, you can test that you have everything is working by running a few lines of code. First, load the libraries and make sure there are no glaring errors or failures:

library(tidyverse)
library(ape)
library(ggtree)
library(GSImulator)

Then test that we have ms working. If we invoke ms with no arguments it gives us a message about what kinds of arguments one can use. When you give the following commands, you should get the output below:

# ms_binary() gives the path to ms on your system
system2(command = ms_binary(), args = "")
Too few command line arguments
usage: ms nsam howmany 
  Options: 
     -t theta   (this option and/or the next must be used. Theta = 4*N0*u )
     -s segsites   ( fixed number of segregating sites)
     -T          (Output gene tree.)
     -F minfreq     Output only sites with freq of minor allele >= minfreq.
     -r rho nsites     (rho here is 4Nc)
         -c f track_len   (f = ratio of conversion rate to rec rate. tracklen is mean length.) 
             if rho = 0.,  f = 4*N0*g, with g the gene conversion rate.
     -G alpha  ( N(t) = N0*exp(-alpha*t) .  alpha = -log(Np/Nr)/t
     -I npop n1 n2 ... [mig_rate] (all elements of mig matrix set to mig_rate/(npop-1) 
         -m i j m_ij    (i,j-th element of mig matrix set to m_ij.)
         -ma m_11 m_12 m_13 m_21 m_22 m_23 ...(Assign values to elements of migration matrix.)
         -n i size_i   (popi has size set to size_i*N0 
         -g i alpha_i  (If used must appear after -M option.)
       The following options modify parameters at the time 't' specified as the first argument:
     -eG t alpha  (Modify growth rate of all pop's.)
     -eg t i alpha_i  (Modify growth rate of pop i.) 
     -eM t mig_rate   (Modify the mig matrix so all elements are mig_rate/(npop-1)
     -em t i j m_ij    (i,j-th element of mig matrix set to m_ij at time t )
     -ema t npop  m_11 m_12 m_13 m_21 m_22 m_23 ...(Assign values to elements of migration matrix.)
     -eN t size  (Modify pop sizes. New sizes = size*N0 ) 
     -en t i size_i  (Modify pop size of pop i.  New size of popi = size_i*N0 .)
     -es t i proportion  (Split: pop i -> pop-i + pop-npop, npop increases by 1.
         proportion is probability that each lineage stays in pop-i. (p, 1-p are admixt. proport.
         Size of pop npop is set to N0 and alpha = 0.0 , size and alpha of pop i are unchanged.
     -ej t i j   ( Join lineages in pop i and pop j into pop j
          size, alpha and M are unchanged.
      -f filename     ( Read command line arguments from file filename.)
 See msdoc.pdf for explanation of these parameters.
/Users/eriq/Library/R/4.3/library/GSImulator/bin/ms-Darwin 

And try executing the following few lines of code that should produce a plot of a tree somewhat like the one below (though it may be different since you could be starting from different random seeds).

# note that we add  stderr = FALSE to the following command because of a bug in R 3.5.1
system2(command = ms_binary(), args = "10 1 -T", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree") 
ggtree(ctree, layout = "rectangular") +
  geom_tiplab(size = 2) + coord_flip() + scale_x_reverse()

If you are having problems getting this working, please talk to Eric.

References

Hudson, Richard R. 2002. “Generating Samples Under a Wright Neutral Model of Genetic Variation.” Bioinformatics 18 (2): 337–38. https://doi.org/10.1093/bioinformatics/18.2.337.
Paradis, E., J. Claude, and K. Strimmer. 2004. “APE: Analyses of Phylogenetics and Evolution in R Language.” Bioinformatics 20: 289–90.
Yu, Guangchuang, Tommy Tsan-Yuk Lam, Huachen Zhu, and Yi Guan. 2018. “Two Methods for Mapping and Visualizing Associated Data on Phylogeny Using Ggtree.” Molecular Biology and Evolution 35: 3041–43. https://doi.org/10.1093/molbev/msy194.
LS0tCnRpdGxlOiAiUHJlcGFyaW5nIGZvciB0aGUgQ29hbGVzY2VudCIKb3V0cHV0OgogIGh0bWxfbm90ZWJvb2s6CiAgICB0b2M6IG5vCiAgaHRtbF9kb2N1bWVudDoKICAgIGRmX3ByaW50OiBwYWdlZAogICAgdG9jOiBubwpiaWJsaW9ncmFwaHk6IHJlZmVyZW5jZXMuYmliCi0tLQoKV2Ugd2lsbCBiZSBzdHVkeWluZyB0aGUgY29hbGVzY2VudCBwcm9jZXNzLCBhIG1hdGhlbWF0aWNhbCBmcmFtZXdvcmsKdGhhdCBpcyBjZW50cmFsIHRvIHVuZGVyc3RhbmRpbmcgdGhlIGdlbmV0aWMgdmFyaWF0aW9uIHRoYXQgb2NjdXJzIGluIGEgZ2VuZXRpYwpzYW1wbGUgdGhhdCB5b3UgbWlnaHQgY29sbGVjdCBmcm9tIGEgcG9wdWxhdGlvbi4KCldlIHdpbGwgc3BlbmQgc29tZSB0aW1lIHNpbXVsYXRpbmcgZGF0YSBmcm9tIHRoZSBjb2FsZXNjZW50IHByb2Nlc3MgdXNpbmcgCkRpY2sgSHVkc29uJ3MgcHJvZ3JhbSBgbXNgIFtAaHVkc29uR2VuZXJhdGluZ1NhbXBsZXNXcmlnaHQyMDAyXS4gIEkgaGF2ZSB3cml0dGVuCmEgbGl0dGxlIFIgcGFja2FnZSBjYWxsZWQgJ0dTSW11bGF0b3InIHRoYXQgaGFzIHByZS1jb21waWxlZCB2ZXJzaW9ucyBvZiBgbXNgIGZvcgp3aW5kb3dzIGFuZCBNYWMuIEl0IGlzIGVhc3kgdG8gaW5zdGFsbCAnR1NJbXVsYXRvcicgdXNpbmcgdGhlICdkZXZ0b29scycgcGFja2FnZS4KClRvIHJlYWQgaW4gc29tZSB0cmVlcyBmcm9tIHRleHQgZmlsZXMgd2Ugd2lsbCB1c2UgdGhlICdhcGUnIHBhY2thZ2UgW0BhcGUtcGFja2FnZV0sIGFuZCB0byBwbG90CnRoZW0gd2Ugd2lsbCB1c2UgdGhlICdnZ3RyZWUnIHBhY2thZ2UgW0BnZ3RyZWUtcGFja2FnZV0uICAKClNvLCB0aGVyZSBhcmUgc29tZSBzdGVwcyB0byBwcmVwYXJlIHlvdXIgY29tcHV0ZXIgZm9yIHRoZXNlIGV4Y2VyY2lzZXMuCgpZb3Ugc2hvdWxkIGhhdmUgaW5zdGFsbGVkOgoKMS4gQSByZWNlbnQgdmVyc2lvbiBvZiBSICg+IDQuMCwgc2F5KQoyLiBSU3R1ZGlvICh5b3UgZG9uJ3QgaGF2ZSB0byB1c2UgdGhpcywgYnV0IGl0IGlzIGEgZ3JlYXQgSURFKQoKVGhlbiwgd2l0aGluIFIsIHlvdSBuZWVkIHRvIGdldCBzb21lIHBhY2thZ2VzLgoKKipJbXBvcnRhbnQ6IHdoZW4geW91IGFyZSBpbnN0YWxsaW5nIHBhY2thZ2VzLCBhbmQgeW91IGFyZSBhc2tlZCBpZiB5b3Ugd291bGQKbGlrZSB0byB1cGRhdGUgZXhpc3RpbmcgcGFja2FnZXMgdG8gbGF0ZXIgdmVyc2lvbnMsIHRoZW4geW91IHNob3VsZCBhbnN3ZXIgIk5vISIKYXQgbGVhc3QgZm9yIHRoZSBmaXJzdCB0aW1lLiAgQ2hhbmNlcyBhcmUgZXZlcnl0aGluZyB3aWxsIHdvcmsgZmluZSB3aXRob3V0CnNwZW5kaW5nIGEgbG9uZyB0aW1lIHVwZGF0aW5nIGV2ZXJ5dGhpbmcuICoqCgpgYGB7ciwgZXZhbD1GQUxTRX0KIyBpZiB5b3UgZG9uJ3QgYWxyZWFkeSBoYXZlIHRoZSB0aWR5dmVyc2UsIGluc3RhbGwgaXQ6CmlmKCEoInRpZHl2ZXJzZSIgJWluJSByb3duYW1lcyhpbnN0YWxsZWQucGFja2FnZXMoKSkpKSB7CiAgaW5zdGFsbC5wYWNrYWdlcygidGlkeXZlcnNlIikKfQoKCiMgaWYgeW91IGRvbid0IGFscmVhZHkgaGF2ZSB0aGUgcmVtb3RlcyBwYWNrYWdlLCBpbnN0YWxsIGl0OgppZighKCJyZW1vdGVzIiAlaW4lIHJvd25hbWVzKGluc3RhbGxlZC5wYWNrYWdlcygpKSkpIHsKICBpbnN0YWxsLnBhY2thZ2VzKCJyZW1vdGVzIikKfQoKIyBub3csIGdldCBHU0ltdWxhdG9yIGZyb20gbXkgR2l0SHViIHBhZ2UKcmVtb3Rlczo6aW5zdGFsbF9naXRodWIoImVyaXFhbmRlL0dTSW11bGF0b3IiLCBkZXBlbmRlbmNpZXMgPSBGQUxTRSkKCiMgdGhlbiBnZXQgYXBlCmlmKCEoImFwZSIgJWluJSByb3duYW1lcyhpbnN0YWxsZWQucGFja2FnZXMoKSkpKSB7CiAgaW5zdGFsbC5wYWNrYWdlcygiYXBlIikKfQoKCiMgYW5kIGZpbmFsbHkgaW5zdGFsbCBnZ3RyZWUgZnJvbSBiaW9jb25kdWN0b3IKaWYoISgiZ2d0cmVlIiAlaW4lIHJvd25hbWVzKGluc3RhbGxlZC5wYWNrYWdlcygpKSkpIHsKICBpZiAoIXJlcXVpcmVOYW1lc3BhY2UoIkJpb2NNYW5hZ2VyIiwgcXVpZXRseSA9IFRSVUUpKQogICAgaW5zdGFsbC5wYWNrYWdlcygiQmlvY01hbmFnZXIiKQogIEJpb2NNYW5hZ2VyOjppbnN0YWxsKCJnZ3RyZWUiKQp9CgojIHdoZW4gaXQgc2F5czogVXBkYXRlIGFsbC9zb21lL25vbmU/IFthL3Mvbl0sIHlvdXIgYW5zd2VyIHNob3VsZCBiZSBuCmBgYAoKCk9uY2UgdGhhdCBpcyBkb25lLCB5b3UgY2FuIHRlc3QgdGhhdCB5b3UgaGF2ZSBldmVyeXRoaW5nIGlzIHdvcmtpbmcgYnkgcnVubmluZyBhIGZldyBsaW5lcyBvZiBjb2RlLgpGaXJzdCwgbG9hZCB0aGUgbGlicmFyaWVzIGFuZCBtYWtlIHN1cmUgdGhlcmUgYXJlIG5vIGdsYXJpbmcgZXJyb3JzIG9yIGZhaWx1cmVzOgpgYGB7ciwgbWVzc2FnZT1GQUxTRSwgd2FybmluZz1GQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkoYXBlKQpsaWJyYXJ5KGdndHJlZSkKbGlicmFyeShHU0ltdWxhdG9yKQpgYGAKClRoZW4gdGVzdCB0aGF0IHdlIGhhdmUgYG1zYCB3b3JraW5nLgpJZiB3ZSBpbnZva2UgYG1zYCB3aXRoIG5vIGFyZ3VtZW50cyBpdCBnaXZlcyB1cyBhIG1lc3NhZ2UgYWJvdXQgd2hhdCBraW5kcwpvZiBhcmd1bWVudHMgb25lIGNhbiB1c2UuICBXaGVuIHlvdSBnaXZlIHRoZSBmb2xsb3dpbmcgY29tbWFuZHMsIHlvdSBzaG91bGQKZ2V0IHRoZSBvdXRwdXQgYmVsb3c6CmBgYHtyfQojIG1zX2JpbmFyeSgpIGdpdmVzIHRoZSBwYXRoIHRvIG1zIG9uIHlvdXIgc3lzdGVtCnN5c3RlbTIoY29tbWFuZCA9IG1zX2JpbmFyeSgpLCBhcmdzID0gIiIpCmBgYAoKCkFuZCB0cnkgZXhlY3V0aW5nIHRoZSBmb2xsb3dpbmcgZmV3IGxpbmVzIG9mIGNvZGUgdGhhdCBzaG91bGQgcHJvZHVjZSBhCnBsb3Qgb2YgYSB0cmVlIHNvbWV3aGF0IGxpa2UgdGhlIG9uZSBiZWxvdwoodGhvdWdoIGl0IG1heSBiZSBkaWZmZXJlbnQgc2luY2UgeW91IGNvdWxkIGJlIHN0YXJ0aW5nIGZyb20gZGlmZmVyZW50CnJhbmRvbSBzZWVkcykuCmBgYHtyfQojIG5vdGUgdGhhdCB3ZSBhZGQgIHN0ZGVyciA9IEZBTFNFIHRvIHRoZSBmb2xsb3dpbmcgY29tbWFuZCBiZWNhdXNlIG9mIGEgYnVnIGluIFIgMy41LjEKc3lzdGVtMihjb21tYW5kID0gbXNfYmluYXJ5KCksIGFyZ3MgPSAiMTAgMSAtVCIsIHN0ZG91dCA9ICJvdXQudHJlZSIsIHN0ZGVyciA9IEZBTFNFKQpjdHJlZSA8LSByZWFkLnRyZWUoIm91dC50cmVlIikgCmdndHJlZShjdHJlZSwgbGF5b3V0ID0gInJlY3Rhbmd1bGFyIikgKwogIGdlb21fdGlwbGFiKHNpemUgPSAyKSArIGNvb3JkX2ZsaXAoKSArIHNjYWxlX3hfcmV2ZXJzZSgpCmBgYAoKSWYgeW91IGFyZSBoYXZpbmcgcHJvYmxlbXMgZ2V0dGluZyB0aGlzIHdvcmtpbmcsIHBsZWFzZSB0YWxrIHRvIEVyaWMuCgoKCiMjIFJlZmVyZW5jZXMK