This document shows how to use the R package Rrunstruct to conduct, from within R, multiple structure
runs based on the parameters chosen for any parameter set specified in the structure
front-end. It goes hand in hand with the computer practical in the SISG MCMC course. Before you follow this document, you will want to have completed the first part of the computer practical and have done at least one run of structure
using the front-end.
The goal here is to use some R tools to parse structure output so that we can quickly assess whether multiple runs of structure's MCMC algorithm give similar results. Along the way we will encounter a couple of challenges, most notably that the cluster labels may not be concordant across multiple structure runs. We will also take some time to look at some of the tools in today's Hadley Wickham R ecosystem (dplyr, ggplot2, tidyr, stringr, etc.) that are helpful here.
Before we can begin, there is some setup you might have to do.
You will need to install the package Rrunstruct
. Because the package includes the structure
executable file, it is not available on CRAN, but can be installed from GitHub. Doing so requires the devtools
package.
devtools::install_github("eriqande/Rrunstruct")
If that doesn't work, it is probably because you need the devtools package. Install it and then try again:
install.packages("devtools")
devtools::install_github("eriqande/Rrunstruct")
To run structure you just need to:

1. Use setwd() to set the working directory to the structure project that holds the parameter set you want to use. On my computer that is ~/Desktop/scot-cats/PlainPars. On your computer, that path will have to be set accordingly.
2. Call structure_runs(), telling it where to put the output, what K values to use, how many reps for each, and the seed to use for the first run (this gets incremented by one for each successive run).

A Big Note: The mainparams file in the PlainPars directory needs to have the RANDOMIZE variable set to 0 so that a different random number generator seed can be set for each run of structure. This can be achieved by running a set of runs through the scheduler using the front end, or by editing the mainparams file with a simple text editor.
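If you want to double-check that setting from within R, something like this works (a quick sketch, not part of Rrunstruct; it simply looks for the RANDOMIZE line in the mainparams file):

mp <- readLines("~/Desktop/scot-cats/PlainPars/mainparams")  # adjust this path for your own computer
grep("RANDOMIZE", mp, value = TRUE)  # the line shown should have the value 0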
library(Rrunstruct)
setwd("~/Desktop/scot-cats/PlainPars") # you've got to change this for your own computer!
structure_runs(outdir = "struct-runs-from-r-1", K = 1:4, reps = 3, seed = 5566)
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_1_rep_1 at Wed Jul 26 23:14:44 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_1_rep_1 at Wed Jul 26 23:14:47 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_1_rep_2 at Wed Jul 26 23:14:47 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_1_rep_2 at Wed Jul 26 23:14:49 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_1_rep_3 at Wed Jul 26 23:14:49 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_1_rep_3 at Wed Jul 26 23:14:52 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_2_rep_1 at Wed Jul 26 23:14:52 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_2_rep_1 at Wed Jul 26 23:14:55 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_2_rep_2 at Wed Jul 26 23:14:55 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_2_rep_2 at Wed Jul 26 23:14:58 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_2_rep_3 at Wed Jul 26 23:14:58 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_2_rep_3 at Wed Jul 26 23:15:01 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_3_rep_1 at Wed Jul 26 23:15:01 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_3_rep_1 at Wed Jul 26 23:15:06 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_3_rep_2 at Wed Jul 26 23:15:06 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_3_rep_2 at Wed Jul 26 23:15:10 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_3_rep_3 at Wed Jul 26 23:15:10 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_3_rep_3 at Wed Jul 26 23:15:14 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_4_rep_1 at Wed Jul 26 23:15:14 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_4_rep_1 at Wed Jul 26 23:15:18 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_4_rep_2 at Wed Jul 26 23:15:18 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_4_rep_2 at Wed Jul 26 23:15:23 2017
Launching structure. Writing stdout to file struct-runs-from-r-1/struct_stdout_K_4_rep_3 at Wed Jul 26 23:15:23 2017
Done with structure. Results written to file struct-runs-from-r-1/struct_results_K_4_rep_3 at Wed Jul 26 23:15:28 2017
Once those runs have completed, you can take a look at the outputs. All of the output files should be in the directory struct-runs-from-r-1 inside the PlainPars directory. Let's see what they look like:
dir(path = "struct-runs-from-r-1")
[1] "struct_results_K_1_rep_1_f" "struct_results_K_1_rep_2_f" "struct_results_K_1_rep_3_f"
[4] "struct_results_K_2_rep_1_f" "struct_results_K_2_rep_2_f" "struct_results_K_2_rep_3_f"
[7] "struct_results_K_3_rep_1_f" "struct_results_K_3_rep_2_f" "struct_results_K_3_rep_3_f"
[10] "struct_results_K_4_rep_1_f" "struct_results_K_4_rep_2_f" "struct_results_K_4_rep_3_f"
[13] "struct_stdout_K_1_rep_1" "struct_stdout_K_1_rep_2" "struct_stdout_K_1_rep_3"
[16] "struct_stdout_K_2_rep_1" "struct_stdout_K_2_rep_2" "struct_stdout_K_2_rep_3"
[19] "struct_stdout_K_3_rep_1" "struct_stdout_K_3_rep_2" "struct_stdout_K_3_rep_3"
[22] "struct_stdout_K_4_rep_1" "struct_stdout_K_4_rep_2" "struct_stdout_K_4_rep_3"
The naming convention is pretty simple: the *results* files are what structure writes out at the end of each run, and the *stdout* files capture what the command-line version of structure prints while it is running. For instance, you could pick out just the results files like this (a quick sketch with stringr, not something the package requires):
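library(stringr)
files <- dir(path = "struct-runs-from-r-1")
files[str_detect(files, "results")]  # just the results files

Now, let's have a look at the first couple of lines of one of the stdout (trace) files: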
cat(readLines("struct-runs-from-r-1/struct_stdout_K_3_rep_1")[1:50], sep = "\n")
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Reading file "mainparams".
datafile is
/Users/eriq/Desktop/scot-cats/project_data
Reading file "extraparams".
Reading file "/Users/eriq/Desktop/scot-cats/project_data".
Number of alleles per locus: min= 9; ave=12.4; max=17
--------------------------------------
Finished initialization; starting MCMC
4000 iterations + 2000 burnin
Rep#: Alpha F1 F2 F3 Ln Like
1: 1.020 0.010 0.010 0.010 --
2: 1.006 0.010 0.010 0.010 --
3: 1.006 0.010 0.010 0.010 --
4: 1.006 0.010 0.010 0.010 --
5: 1.002 0.010 0.010 0.010 --
6: 1.002 0.010 0.010 0.010 --
7: 1.020 0.010 0.010 0.010 --
8: 1.045 0.010 0.010 0.010 --
9: 1.046 0.010 0.010 0.010 --
Rep#: Alpha F1 F2 F3 Ln Like
10: 1.033 0.010 0.010 0.010 --
11: 1.060 0.010 0.010 0.010 --
12: 1.064 0.010 0.010 0.010 --
13: 1.101 0.010 0.010 0.010 --
14: 1.067 0.010 0.010 0.010 --
15: 1.062 0.010 0.010 0.010 --
16: 1.060 0.010 0.010 0.010 --
17: 1.030 0.010 0.010 0.010 --
18: 1.004 0.010 0.013 0.010 --
19: 1.004 0.010 0.013 0.010 --
Rep#: Alpha F1 F2 F3 Ln Like
20: 1.019 0.010 0.013 0.010 --
21: 0.981 0.010 0.013 0.010 --
And have a look at a bit of one of the results files. Note that you are going to want to extract the "Inferred ancestry of individuals" section from this file:
cat(readLines("struct-runs-from-r-1/struct_results_K_3_rep_1_f")[1:100], sep = "\n")
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Command line arguments: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/Rrunstruct/bin/structure_mac -K 3 -o struct-runs-from-r-1/struct_results_K_3_rep_1 -D 5572
Input File: /Users/eriq/Desktop/scot-cats/project_data
Run parameters:
304 individuals
9 loci
3 populations assumed
2000 Burn-in period
4000 Reps
RANDOMIZE turned off
--------------------------------------------
Proportion of membership of each pre-defined
population in each of the 3 clusters
Given Inferred Clusters Number of
Pop 1 2 3 Individuals
1: 0.058 0.081 0.860 27
2: 0.089 0.149 0.763 61
3: 0.231 0.475 0.293 91
4: 0.082 0.162 0.757 28
5: 0.111 0.258 0.631 19
6: 0.030 0.075 0.895 4
7: 0.784 0.179 0.037 10
8: 0.675 0.277 0.048 18
9: 0.741 0.219 0.040 46
--------------------------------------------
Allele-freq. divergence among pops (Net nucleotide distance),
computed using point estimates of P.
1 2 3
1 - 0.0221 0.1228
2 0.0221 - 0.0930
3 0.1228 0.0930 -
Average distances (expected heterozygosity) between individuals in same cluster:
cluster 1 : 0.7322
cluster 2 : 0.7521
cluster 3 : 0.6584
--------------------------------------------
Estimated Ln Prob of Data = -8714.2
Mean value of ln likelihood = -8496.9
Variance of ln likelihood = 434.5
Mean value of alpha = 0.1224
Mean value of Fst_1 = 0.0490
Mean value of Fst_2 = 0.0361
Mean value of Fst_3 = 0.1940
Inferred ancestry of individuals:
Label (%Miss) Pop: Inferred clusters
1 Z12 (0) 3 : 0.140 0.779 0.081
2 Z13 (0) 3 : 0.110 0.862 0.027
3 Z14 (0) 3 : 0.062 0.741 0.197
4 Z15 (0) 3 : 0.551 0.257 0.192
5 Z16 (0) 3 : 0.104 0.363 0.533
6 Z17 (0) 1 : 0.016 0.023 0.961
7 Z18 (0) 3 : 0.075 0.190 0.735
8 Z19 (0) 3 : 0.594 0.296 0.110
9 Z20 (11) 2 : 0.024 0.137 0.838
10 Z21 (0) 2 : 0.029 0.036 0.935
11 Z30 (0) 3 : 0.621 0.362 0.017
12 Z31 (0) 3 : 0.910 0.068 0.022
13 Z34 (0) 3 : 0.647 0.322 0.031
14 DB1 (0) 2 : 0.107 0.766 0.128
15 DB2 (0) 2 : 0.037 0.498 0.465
16 DB3 (0) 3 : 0.064 0.586 0.350
17 DB5 (0) 2 : 0.477 0.415 0.108
18 DB6 (0) 2 : 0.026 0.028 0.946
19 DB7 (0) 5 : 0.052 0.072 0.876
20 DB8 (11) 3 : 0.886 0.088 0.026
21 DB9 (0) 2 : 0.277 0.422 0.301
22 DB10 (0) 3 : 0.332 0.608 0.059
23 DB11 (0) 3 : 0.067 0.353 0.579
24 DB12 (0) 1 : 0.074 0.080 0.846
25 DB16 (0) 2 : 0.045 0.053 0.902
26 DB17 (0) 3 : 0.108 0.071 0.821
27 DB18 (0) 3 : 0.147 0.050 0.803
28 DB19 (0) 3 : 0.048 0.923 0.029
29 DB21 (22) 5 : 0.041 0.064 0.894
30 DB22 (0) 2 : 0.111 0.109 0.780
31 DB23 (0) 2 : 0.047 0.483 0.470
32 DB24 (0) 3 : 0.547 0.410 0.043
33 DB25 (0) 4 : 0.088 0.090 0.821
Although R's syntax for text processing is perhaps not as compact as awk's or sed's, and it might be slower than python, R has nice facilities for parsing text files. The package stringr, used in combination with the function readLines(), is particularly easy to use and helpful.
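For a small taste of that, here is a sketch (based on the results-file layout shown above; this is not code from the package) that locates the "Inferred ancestry of individuals" block in one of the results files:

res <- readLines("struct-runs-from-r-1/struct_results_K_3_rep_1_f")
start <- which(str_detect(res, "Inferred ancestry of individuals"))  # the header line of that block
res[(start + 2):(start + 4)]  # the individual rows begin two lines below the header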
We won't go into the full details now, though. Instead we have a single function that reads all of the stdout and results files and wrangles them into long-format (tidy) data frames. If you want the details of all that, you can check the package out on GitHub. I used the packages tidyr and dplyr extensively (but the code is still kind of ugly…).
Here we get all the output into a list of two data frames:
L <- read_and_process_structure_output("struct-runs-from-r-1")
And let’s take a look at the tops of the two data frames in that list.
First, the estimated Q values for each individual:
L$q
The columns K and Rep give the K value assumed in the structure run and which replicate the run was. Index is the numerical index of the individual, Label is the label given to the individual sample, Miss is the percentage of loci missing data in that individual, and Pop is the putative population or sampling location of the individual (as given in the data input). cluster gives the cluster that probability refers to, and probability is the posterior mean of the Q value for that cluster. mp is the number of the label permutation that makes the results of this replicate as congruent as possible with replicate #1. Finally, cluster_relabeled is the label of the cluster once the clusters have been relabeled to be congruent with replicate #1. We will explore that further below.
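To give a flavor of what mp means, here is a rough sketch (not the package's actual code) of one way to find such a label permutation: try every ordering of the K cluster labels and keep the one that makes a replicate's matrix of Q values most similar to replicate 1's. Here q_rep1 and q_rep are hypothetical (individuals x K) matrices of Q values, and the permutations come from the gtools package.

best_permutation <- function(q_rep1, q_rep) {
  perms <- gtools::permutations(ncol(q_rep1), ncol(q_rep1))  # all K! orderings of the cluster labels
  diffs <- apply(perms, 1, function(p) sum(abs(q_rep1 - q_rep[, p])))
  perms[which.min(diffs), ]  # the column order that best matches replicate 1
}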
Here we look at the traces:
L$traces
The columns K
, Rep
, and mp
are as above. Sweep tells us which sweep of the MCMC the row refers to; it runs from 1 to 6000 in this case because it includes the sweeps from the burn-in period. variable tells us which variable we are looking at: either "Alpha" (the \(\alpha\) variable) or one of the \(F\) parameters (i.e., F1, F2, etc.). variables_relabeled holds the name of the variable after it has been relabeled to be (more or less) congruent with replicate 1. value is the value of that variable.
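For example (a quick sketch assuming the columns just described), the tidy layout makes it easy to summarise each variable after the 2000-sweep burn-in:

library(dplyr)
L$traces %>%
  filter(Sweep > 2000) %>%  # drop the burn-in sweeps
  group_by(K, Rep, variable) %>%
  summarise(post_mean = mean(value))  # posterior mean of each variable in each run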
Once structure’s outputs are in this type of long-format data frame, it is relatively easy to plot and compare between runs.
First we will plot the traces of \(\alpha\) and the \(F\) parameters and see how they compare across runs.
Here are the \(\alpha\)’s:
ggplot(L$traces %>% filter(variable == "Alpha"), aes(x = Sweep, y = value)) +
geom_line(colour = "blue") +
facet_grid(K ~ Rep)
That is interesting. Not surprisingly, when K is 1, \(\alpha\) meanders all over the place: it is not really used in the model in that case, so there is nothing constraining it.
For \(K>1\) the results look pretty similar between the reps. Let's drop K = 1 and focus on \(K>1\):
ggplot(L$traces %>% filter(variable == "Alpha", K != 1), aes(x = Sweep, y = value)) +
geom_line(colour = "blue") +
facet_grid(K ~ Rep)
When we look at the values of F, we see the labeling issue:
ggplot(L$traces %>% filter(str_detect(variable, "^F")), aes(x = Sweep, y = value, colour = variable)) +
geom_line() +
facet_grid(K ~ Rep)
When \(K>1\) the colors don’t match up because the clusters don’t have the same name. We can rectify that by using the relabeled version:
ggplot(L$traces %>% filter(str_detect(variables_relabeled, "^F")), aes(x = Sweep, y = value, colour = variables_relabeled)) +
geom_line() +
facet_grid(K ~ Rep)
We see that the course of the F variables during MCMC looks similar between reps for K=1 and K=2, but quite different between reps for K>2. It appears that the runs for K>2 might have to be run for much longer (and even then they may not reliably converge to the same place).
Often what we are really interested in are the Q values for different individuals. It is thus important to assess how much those differ between runs. One way to do that is to plot the estimated Q values in each rep > 1 against those in the first rep. To do that, we can join the rep-1 results (at each K) onto the data frame of results:
Qs <- L$q %>%
filter(Rep == 1) %>% # get just the first rep
group_by(K, Index, cluster_relabeled) %>% # group by these to preserve the columns
transmute(rep1_cluster = cluster_relabeled, rep1_prob = probability) %>% # rename columns
ungroup() %>%
inner_join(L$q) %>% # slap those values back on the full data frame, matching on K, Index, and cluster_relabeled
filter(Rep != 1) %>% # toss out the rows comparing Rep 1 to Rep 1
mutate(which_cluster = paste(rep1_cluster)) # name a new column with something better for the plot
Adding missing grouping variables: `K`, `Index`, `cluster_relabeled`
Joining, by = c("K", "Index", "cluster_relabeled")
Here is what that data frame looks like:
Qs
And now it is easy to compare the Q values in each Rep after Rep 1 with the Q values in Rep 1:
ggplot(Qs, aes(x = rep1_prob, y = probability, colour = which_cluster)) +
geom_point() +
facet_grid(K ~ Rep)
This is interesting, as it shows that for K=2 the three runs all converged to pretty much the same Q values. The "wiggle" off the \(y=x\) line there is due to Monte Carlo variance, not to the chains converging to very different parts of the space, which is clearly what has happened for some of the replicates at K=3 and K=4.
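If you want to put a number on that wiggle, here is a quick sketch (using the Qs data frame built above) that computes the mean absolute difference from the rep-1 estimates:

Qs %>%
  group_by(K, Rep) %>%
  summarise(mean_abs_diff = mean(abs(probability - rep1_prob)))  # should be small where the reps agree closely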
Let's see if we can tighten up that Monte Carlo variance by doing longer runs. I have made a new directory LongerRun with the structure GUI and set it up to do 5000 burn-in sweeps and 45000 sweeps after burn-in. Once again, you have to ensure that the RANDOMIZE setting in the mainparams file in that directory is set to 0. Let's do 4 reps at K=2 to see how that works out.
# do the runs
setwd("~/Desktop/scot-cats/LongerRun") # this must be changed on other computers
The working directory was changed to /Users/eriq/Desktop/scot-cats/LongerRun inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.
structure_runs(outdir = "r-runs-2", K = 2, reps = 4, seed = 9876)
Launching structure. Writing stdout to file r-runs-2/struct_stdout_K_2_rep_1 at Wed Jul 26 23:15:48 2017
Done with structure. Results written to file r-runs-2/struct_results_K_2_rep_1 at Wed Jul 26 23:16:17 2017
Launching structure. Writing stdout to file r-runs-2/struct_stdout_K_2_rep_2 at Wed Jul 26 23:16:17 2017
Done with structure. Results written to file r-runs-2/struct_results_K_2_rep_2 at Wed Jul 26 23:16:48 2017
Launching structure. Writing stdout to file r-runs-2/struct_stdout_K_2_rep_3 at Wed Jul 26 23:16:48 2017
Done with structure. Results written to file r-runs-2/struct_results_K_2_rep_3 at Wed Jul 26 23:17:17 2017
Launching structure. Writing stdout to file r-runs-2/struct_stdout_K_2_rep_4 at Wed Jul 26 23:17:17 2017
Done with structure. Results written to file r-runs-2/struct_results_K_2_rep_4 at Wed Jul 26 23:17:47 2017
Then slurp up the results:
setwd("~/Desktop/scot-cats/LongerRun") # this must be changed on other computers
The working directory was changed to /Users/eriq/Desktop/scot-cats/LongerRun inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.
L2 <- read_and_process_structure_output("r-runs-2")
reading files
Adding missing grouping variables: `K`, `Rep`, `Index`, `Label`, `Miss`
finding label permutations for the reps
relabeling the traces and results
Then, compare rep 1 to the others:
# format the data frame
Qs2 <- L2$q %>%
filter(Rep == 1) %>% # get just the first rep
group_by(K, Index, cluster_relabeled) %>% # group by these to preserve the columns
transmute(rep1_cluster = cluster_relabeled, rep1_prob = probability) %>% # rename columns
ungroup() %>%
inner_join(L2$q) %>% # slap those values back on the full data frame, matching on K, Index, and cluster_relabeled
filter(Rep != 1) %>% # toss out the rows comparing Rep 1 to Rep 1
mutate(which_cluster = paste(rep1_cluster)) # name a new column with something better for the plot
Adding missing grouping variables: `K`, `Index`, `cluster_relabeled`
Joining, by = c("K", "Index", "cluster_relabeled")
Plot those up:
ggplot(Qs2, aes(x = rep1_prob, y = probability, colour = which_cluster)) +
geom_point() +
facet_wrap( ~ Rep, ncol = 3)
Those look pretty concordant apart from a few individuals.
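If you want to see which individuals those are, here is a quick sketch (using the Qs2 data frame built above) that sorts by the largest departure from the rep-1 estimate:

Qs2 %>%
  mutate(abs_diff = abs(probability - rep1_prob)) %>%  # departure from the rep-1 estimate
  arrange(desc(abs_diff)) %>%
  head(10)  # the ten most discordant cluster-by-individual entries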