library(tidyverse)
library(GSImulator)
library(ape)
library(ggtree)
ms
will simulate infinite sites mutation very readily.
There are two options.
First we can tell it to simulate an exact number of segregating sites
(like 7, for example):
cat("101 203", file = "seedms") # this sets the seed for ms
system2(command = ms_binary(), args = "10 1 -s 7 -T")
/Users/eriq/Library/R/4.3/library/GSImulator/bin/ms-Darwin 10 1 -s 7 -T
101 203 59243
//
(1:0.342,((3:0.079,(7:0.011,(4:0.009,10:0.009):0.002):0.068):0.205,(9:0.121,((2:0.015,8:0.015):0.023,(5:0.017,6:0.017):0.020):0.084):0.163):0.057);
segsites: 7
positions: 0.0769 0.1640 0.3143 0.3899 0.4311 0.7488 0.9669
0000000
0001001
0101100
0101000
0001011
0001001
0101000
0001001
1011001
0101000
Notice that the samples can be thought of as being labeled 1–10, and
the mutations can be thought of as being labeled 1–7. Each line like
0010010 is an individual gene sequence sample. A 1 means it inherited
the mutation, and a 0 means it inherited the non-mutated base.
In class exercise 1 with 7 groups
Work with your group. Each group will be assigned a site (i.e. a
mutation). Figure out which upon which branch in the tree, below, that
mutation must have occurred. Each group will then stick their mutation
to their branch with some tape.

In class excerise 2
This exercise will get us thinking about what is called the
“four-gamete test”. Each group will take their site and one other site
that they can choose from amongst the 7, and count up the number of
different pairs of mutations at the two sites are found by looking at
the segretating sites output above, and then write that on the
board.
For example, for sites 1 and 2 the result that gets written on the board
should look like:
Each group will do this with two pairs of sites. For example, if your
group is assigned site 1, you might to it for sites 1 and 2 and sites 1
and 5.
Write each little table on the board with a title like “Sites 1,2”
above each. When that is done, we will talk about it.
Sprinkling mutations at rate \(\theta =
4N\mu L\)
Instead of using the -s
option to specify the exact
number of segregating sites, you can also use the -t
option
to specify \(\theta = 4N\mu L\) where
\(\nu\) is the per-base-pair
per-generation mutation rate and \(L\)
is the length of the DNA sequence in number of base pairs. Under the
infinite sites model, each mutation is an entirely novel mutation.
LS0tCnRpdGxlOiAiSW5maW5pdGUgU2l0ZXMgTXV0YXRpb24iCm91dHB1dDogCiAgaHRtbF9ub3RlYm9vazoKICAgIHRvYzogdHJ1ZQogICAgdG9jX2Zsb2F0OiB0cnVlCmF1dGhvcjogIkVyaWMgQy4gQW5kZXJzb24iCmJpYmxpb2dyYXBoeTogcmVmZXJlbmNlcy5iaWIKLS0tCgoKYGBge3IsIG1lc3NhZ2U9RkFMU0UsIHdhcm5pbmc9RkFMU0V9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KEdTSW11bGF0b3IpCmxpYnJhcnkoYXBlKQpsaWJyYXJ5KGdndHJlZSkKYGBgCgpgbXNgIHdpbGwgc2ltdWxhdGUgaW5maW5pdGUgc2l0ZXMgbXV0YXRpb24gdmVyeSByZWFkaWx5LiAgVGhlcmUgYXJlIHR3bwpvcHRpb25zLiAgCgpGaXJzdCB3ZSBjYW4gdGVsbCBpdCB0byBzaW11bGF0ZSBhbiBleGFjdCBudW1iZXIgb2Ygc2VncmVnYXRpbmcgc2l0ZXMgKGxpa2UgNywgZm9yCmV4YW1wbGUpOgpgYGB7cn0KCmNhdCgiMTAxIDIwMyIsIGZpbGUgPSAic2VlZG1zIikgICMgdGhpcyBzZXRzIHRoZSBzZWVkIGZvciBtcwpzeXN0ZW0yKGNvbW1hbmQgPSBtc19iaW5hcnkoKSwgYXJncyA9ICIxMCAxIC1zIDcgLVQiKQpgYGAKCk5vdGljZSB0aGF0IHRoZSBzYW1wbGVzIGNhbiBiZSB0aG91Z2h0IG9mIGFzIGJlaW5nIGxhYmVsZWQgMS0tMTAsIGFuZCB0aGUgbXV0YXRpb25zCmNhbiBiZSB0aG91Z2h0IG9mIGFzIGJlaW5nIGxhYmVsZWQgMS0tNy4gIEVhY2ggbGluZSBsaWtlIDAwMTAwMTAgaXMgYW4gaW5kaXZpZHVhbApnZW5lIHNlcXVlbmNlIHNhbXBsZS4gIEEgMSBtZWFucyBpdCBpbmhlcml0ZWQgdGhlIG11dGF0aW9uLCBhbmQgYSAwIG1lYW5zIGl0IGluaGVyaXRlZAp0aGUgbm9uLW11dGF0ZWQgYmFzZS4KCiMjIEluIGNsYXNzIGV4ZXJjaXNlIDEgd2l0aCA3IGdyb3VwcwoKV29yayB3aXRoIHlvdXIgZ3JvdXAuICBFYWNoIGdyb3VwIHdpbGwgYmUgYXNzaWduZWQgYSBzaXRlIChpLmUuIAphIG11dGF0aW9uKS4gIEZpZ3VyZSBvdXQgd2hpY2ggdXBvbiB3aGljaCBicmFuY2ggaW4gdGhlIHRyZWUsIGJlbG93LCB0aGF0IG11dGF0aW9uIG11c3QgaGF2ZSBvY2N1cnJlZC4gIEVhY2gKZ3JvdXAgd2lsbCB0aGVuIHN0aWNrIHRoZWlyIG11dGF0aW9uIHRvIHRoZWlyIGJyYW5jaCB3aXRoIHNvbWUgdGFwZS4KYGBge3IgZWNobz1GQUxTRX0KY2F0KCIxMDEgMjAzIiwgZmlsZSA9ICJzZWVkbXMiKSAgIyB0aGlzIHNldHMgdGhlIHNlZWQgZm9yIG1zCnN5c3RlbTIoY29tbWFuZCA9IG1zX2JpbmFyeSgpLCBhcmdzID0gIjEwIDEgLXMgNyAtVCIsIHN0ZG91dCA9ICJvdXQudHJlZSIsIHN0ZGVyciA9IEZBTFNFKQpjdHJlZSA8LSByZWFkLnRyZWUoIm91dC50cmVlIikgCmdndHJlZShjdHJlZSwgbGF5b3V0ID0gInJlY3Rhbmd1bGFyIikgKwogIGdlb21fdGlwbGFiKHNpemUgPSA1KQpgYGAKCgojIyBJbiBjbGFzcyBleGNlcmlzZSAyCgpUaGlzIGV4ZXJjaXNlIHdpbGwgZ2V0IHVzIHRoaW5raW5nIGFib3V0IHdoYXQgaXMgY2FsbGVkIHRoZSAKImZvdXItZ2FtZXRlIHRlc3QiLiAgRWFjaCBncm91cCB3aWxsIHRha2UgdGhlaXIgc2l0ZSBhbmQgb25lIG90aGVyIHNpdGUKdGhhdCB0aGV5IGNhbiBjaG9vc2UgZnJvbSBhbW9uZ3N0IHRoZSA3LCBhbmQgY291bnQgdXAgdGhlIG51bWJlciBvZgpkaWZmZXJlbnQgcGFpcnMgb2YgbXV0YXRpb25zIGF0IHRoZSB0d28gc2l0ZXMgYXJlIGZvdW5kIGJ5IGxvb2tpbmcgYXQgdGhlIApzZWdyZXRhdGluZyBzaXRlcyBvdXRwdXQgYWJvdmUsIGFuZCB0aGVuIHdyaXRlIHRoYXQgb24gdGhlIGJvYXJkLiAgCkZvciBleGFtcGxlLCBmb3Igc2l0ZXMgMSBhbmQgMiB0aGUgcmVzdWx0IHRoYXQgZ2V0cyB3cml0dGVuIG9uIHRoZSBib2FyZCBzaG91bGQKbG9vayBsaWtlOgoKICBQYWlyICB8ICAgbiAKLS0tLS0tLS18LS0tLQowMCB8IDUKMDEgfCA0CjEwIHwgMQoKRWFjaCBncm91cCB3aWxsIGRvIHRoaXMgd2l0aCB0d28gcGFpcnMgb2Ygc2l0ZXMuICBGb3IgZXhhbXBsZSwgaWYgeW91cgpncm91cCBpcyBhc3NpZ25lZCBzaXRlIDEsIHlvdSBtaWdodCB0byBpdCBmb3Igc2l0ZXMgMSBhbmQgMiBhbmQgc2l0ZXMgMSBhbmQgNS4KCldyaXRlIGVhY2ggbGl0dGxlIHRhYmxlIG9uIHRoZSBib2FyZCB3aXRoIGEgdGl0bGUgbGlrZSAiU2l0ZXMgMSwyIiBhYm92ZSBlYWNoLgpXaGVuIHRoYXQgaXMgZG9uZSwgd2Ugd2lsbCB0YWxrIGFib3V0IGl0LgoKCgojIyMgU3ByaW5rbGluZyBtdXRhdGlvbnMgYXQgcmF0ZSAkXHRoZXRhID0gNE5cbXUgTCQKCkluc3RlYWQgb2YgdXNpbmcgdGhlIGAtc2Agb3B0aW9uIHRvIHNwZWNpZnkgdGhlIGV4YWN0IG51bWJlciBvZiBzZWdyZWdhdGluZyBzaXRlcywKeW91IGNhbiBhbHNvIHVzZSB0aGUgYC10YCBvcHRpb24gdG8gc3BlY2lmeSAkXHRoZXRhID0gNE5cbXUgTCQgd2hlcmUgJFxudSQgaXMgdGhlIApwZXItYmFzZS1wYWlyIHBlci1nZW5lcmF0aW9uIG11dGF0aW9uIHJhdGUgYW5kICRMJCBpcyB0aGUgbGVuZ3RoIG9mIHRoZSBETkEgc2VxdWVuY2UKaW4gbnVtYmVyIG9mIGJhc2UgcGFpcnMuICBVbmRlciB0aGUgaW5maW5pdGUgc2l0ZXMgbW9kZWwsIGVhY2ggbXV0YXRpb24gaXMgYW4gZW50aXJlbHkKbm92ZWwgbXV0YXRpb24uCg==