species_1_simulation.Rmd
For this first example, we use the hypothetical life history of species 1. First we have to set spip up to run with that life history.
spip
has a large number of demographic parameters.
Typically spip
is run as a command-line program in Unix. In
CKMRpop, all that action goes on under the hood, but you still have to
use the spip
parameters. This vignette is not about using
spip
. For a short listing of all the spip
options, do this:
If you want a full, complete, long listing of all the
spip
options, then you can do:
library(CKMRpop)
spip_help_full()
All of the “long-form” options to spip
are given on the
Unix command line starting with two dashes, like
--fem-surv-probs
. To set parameters within
CKMRpop
to send to spip, you simply make a named list of
input values. The names of the items in the list are the long-format
option names without the leading two dashes. For an example,
see the package data object species_1_life_history
, as
described below.
These parameters are included in the package in the variable
species_1_life_history
. It is named list of parameters to
send to spip. The list names are the names of the spip options. It looks
like this:
species_1_life_history
#> $`max-age`
#> [1] 20
#>
#> $`fem-surv-probs`
#> [1] 0.75 0.76 0.76 0.77 0.77 0.78 0.78 0.79 0.79 0.80 0.80 0.80 0.81 0.81 0.82
#> [16] 0.82 0.82 0.82 0.82 0.82
#>
#> $`male-surv-probs`
#> [1] 0.75 0.76 0.76 0.77 0.77 0.78 0.78 0.79 0.79 0.80 0.80 0.80 0.81 0.81 0.82
#> [16] 0.82 0.82 0.82 0.82 0.82
#>
#> $`fem-prob-repro`
#> [1] 0.00 0.00 0.00 0.02 0.09 0.36 0.75 0.94 0.99 1.00 1.00 1.00 1.00 1.00 1.00
#> [16] 1.00 1.00 1.00 1.00 1.00
#>
#> $`male-prob-repro`
#> [1] 0.00 0.00 0.00 0.00 0.01 0.05 0.22 0.64 0.91 0.98 1.00 1.00 1.00 1.00 1.00
#> [16] 1.00 1.00 1.00 1.00 1.00
#>
#> $`fem-asrf`
#> [1] 0.00 0.00 0.00 3.56 3.72 3.88 4.04 4.20 4.36 4.52 4.68 4.84 5.00 5.16 5.32
#> [16] 5.48 5.64 5.80 5.96 6.12
#>
#> $`male-asrp`
#> [1] 0.00 0.00 0.00 3.56 3.72 3.88 4.04 4.20 4.36 4.52 4.68 4.84 5.00 5.16 5.32
#> [16] 5.48 5.64 5.80 5.96 6.12
#>
#> $`offsp-dsn`
#> [1] "negbin"
#>
#> $`fem-rep-disp-par`
#> [1] 0.7
#>
#> $`male-rep-disp-par`
#> [1] 0.7
#>
#> $`mate-fidelity`
#> [1] 0.3
#>
#> $`sex-ratio`
#> [1] 0.5
We want to add instructions to those, telling spip how long to run the simulation, and what the initial census sizes should be.
So, first, we copy species_1_life_history
to a new
variable, SPD
:
SPD <- species_1_life_history
Now, we can add things to SPD.
The number of new fish added each year is called the “cohort-size”. Once we know that, we can figure out what the stable age distribution would be given the survival rates, and we can use that as our starting point. There is a function in the package that helps with that:
# before we tell spip what the cohort sizes are, we need to
# tell it how long we will be running the simulation
SPD$`number-of-years` <- 100 # run the sim forward for 100 years
# this is our cohort size
cohort_size <- 300
# Do some matrix algebra to compute starting values from the
# stable age distribution:
L <- leslie_from_spip(SPD, cohort_size)
# then we add those to the spip parameters
SPD$`initial-males` <- floor(L$stable_age_distro_fem)
SPD$`initial-females` <- floor(L$stable_age_distro_male)
# tell spip to use the cohort size
SPD$`cohort-size` <- paste("const", cohort_size, collapse = " ")
Spip let’s you specify what fraction of fish of different ages should be sampled in different years. Here we do something simple, and instruct spip to sample 1% of the fish of ages 1, 2, and 3 (after the episode of death, see the spip vignette…) every year from year 50 to 75.
samp_frac <- 0.03
samp_start_year <- 50
samp_stop_year <- 75
SPD$`discard-all` <- 0
SPD$`gtyp-ppn-fem-post` <- paste(
samp_start_year, "-", samp_stop_year, " ",
samp_frac, " ", samp_frac, " ", samp_frac, " ",
paste(rep(0, SPD$`max-age` - 3), collapse = " "),
sep = ""
)
SPD$`gtyp-ppn-male-post` <- SPD$`gtyp-ppn-fem-post`
There are two function that do all this for you. The function
run_spip()
runs spip in a temporary directory. After
running spip, it also processes the output with a few shell scripts. The
function returns the path to the temporary directory. You pass that
temporary directory path into the function slurp_spip()
to
read the output back into R. It looks like this:
set.seed(5) # set a seed for reproducibility of results
spip_dir <- run_spip(pars = SPD) # run spip
slurped <- slurp_spip(spip_dir, 2) # read the spip output into R
Note that setting the seed allows you to get the same results from spip. If you don’t set the seed, that is fine. spip will be seeded by the next two integers in current random number sequence.
If you are doing multiple runs and you want them to be different, you should make sure that you don’t inadvertently set the seed to be the same each time.
Although during massive production simulations, you might not go back to every run and summarize it to see what it looks like, when you are parameterizing demographic simulations you will want to be able to quickly look at observed demographic rates and things. There are a few functions in CKMRpop that make this quick and easy to do.
This is just a convenience function to make a pretty plot so you can check to see what the population demographics look like:
ggplot_census_by_year_age_sex(slurped$census_postkill)
This shows that the function leslie_from_spip()
does a
good job of finding the initial population numbers that accord with the
stable age distribution.
We can compute the survival rates like this:
surv_rates <- summarize_survival_from_census(slurped$census_postkill)
That returns a list. One part of the list is a tibble with observed survival fractions. The first 40 rows look like this:
surv_rates$survival_tibble %>%
slice(1:40)
#> # A tibble: 40 × 7
#> year pop age sex n cohort surv_fract
#> <int> <int> <int> <chr> <int> <int> <dbl>
#> 1 20 0 20 female 1 0 0
#> 2 20 0 19 female 2 1 0.5
#> 3 21 0 20 female 1 1 0
#> 4 20 0 18 female 1 2 1
#> 5 21 0 19 female 1 2 1
#> 6 22 0 20 female 1 2 0
#> 7 20 0 17 female 2 3 1
#> 8 21 0 18 female 2 3 1
#> 9 22 0 19 female 2 3 1
#> 10 23 0 20 female 2 3 0
#> # ℹ 30 more rows
The second part of the list holds a plot with histograms of age-specific, observed survival rates across all years. The blue line is the mean over all years.
surv_rates$plot_histos_by_age_and_sex
To compare these values to the parameter values for the simulation, you must pass those to the function:
surv_rates2 <- summarize_survival_from_census(
census = slurped$census_prekill,
fem_surv_probs = SPD$`fem-surv-probs`,
male_surv_probs = SPD$`male-surv-probs`
)
# print the plot
surv_rates2$plot_histos_by_age_and_sex
Here, the red dashed line is the value chosen as the parameter for the simulations. The means are particularly different for the older age classes, which makes sense because there the total number of individuals in each of those year classes is smaller.
It makes sense to check that your simulation is delivering a reasonable distribution of offspring per year. This is the number of offspring that survive to just before the first prekill census. Keep in mind that, for super high-fecundity species, we won’t model every single larva, we just don’t start “keeping track of them” until they reach a stage that is recognizable in some way.
We make this summary from the pedigree information. In order to get the number of adults that were present, but did not produce any offspring, we also need to pass in the postkill census information. Also, to get lifetime reproductive output, we need to know how old individuals were when they died, so we also pass in the information about deaths.
To make all the summaries, we do:
offs_and_mates <- summarize_offspring_and_mate_numbers(
census_postkill = slurped$census_postkill,
pedigree = slurped$pedigree,
deaths = slurped$deaths, lifetime_hexbin_width = c(1, 2)
)
Note that we are setting the lifetime reproductive output hexbin width to be suitable for this example.
The function above returns a list of plots, as follows:
offs_and_mates$plot_age_specific_number_of_offspring
Especially when dealing with viviparous species (like sharks and mammals) it is worth checking this to make sure that there aren’t some females having far too many offspring.
Especially with long-lived organisms, it can be instructive to see how lifetime reproductive output varies with age at death.
offs_and_mates$plot_lifetime_output_vs_age_at_death
Yep, many individuals have no offspring, and you have more kids if you live longer.
Out of all the offspring born each year, we can tabulate the fraction that were born to males (or females) of each age. This summary shows a histogram of those values. The represent the distribution of the fractional contribution of each age group each year.
offs_and_mates$plot_fraction_of_offspring_from_each_age_class
The blue vertical lines show the means over all years.
Some of the parameters in spip
affect the distribution
of the number of mates that each individual will have. We can have a
quick look at whether the distribution of number of mates (that produced
at least one offspring) appears to be what we might hope it to be.
mates <- count_and_plot_mate_distribution(slurped$pedigree)
That gives us a tibble with a summary, like this:
head(mates$mate_counts)
#> # A tibble: 6 × 6
#> sex year pop parent num_offs num_mates
#> <chr> <int> <int> <chr> <int> <int>
#> 1 female 20 0 F0_0_0 2 1
#> 2 female 20 0 F10_0_1 5 4
#> 3 female 20 0 F10_0_11 3 1
#> 4 female 20 0 F10_0_12 1 1
#> 5 female 20 0 F10_0_13 2 1
#> 6 female 20 0 F10_0_2 3 2
And also a plot:
mates$plot_mate_counts
From the samples that we slurped up from the spip output we compile
all the related pairs that we found in there with a single function. It
is important to note that this finds all the related pairs that share
ancestors back within num_generations
generations. Recall,
that we ran slurp_spip()
with
num_generations = 2
which means that we check for matching
ancestors up to and including the grandparents of the sample.
crel <- compile_related_pairs(slurped$samples)
The result that comes back has a single row for each pair. The individuals appear in each pair such that the first name comes before the second name, alphabetically, in each pair. There is information in list columns about the year(s) that each member of the pair was sampled in and also years in which they were born, and also the indices of the populations they were sampled from. (Knowing the population will become useful when/if we start simulating multiple populations connected by gene flow). Here we show the first 10 pairs in the samples:
crel %>%
slice(1:10)
#> # A tibble: 10 × 31
#> id_1 id_2 conn_comp dom_relat max_hit dr_hits upper_member
#> <chr> <chr> <dbl> <chr> <int> <list> <int>
#> 1 F47_0_19 M53_0_109 1 FC 1 <int [2]> NA
#> 2 F47_0_19 M53_0_11 1 FC 1 <int [2]> NA
#> 3 F47_0_25 F48_0_4 1 FC 1 <int [2]> NA
#> 4 F47_0_25 F50_0_118 1 FC 1 <int [2]> NA
#> 5 F47_0_25 F50_0_17 1 FC 1 <int [2]> NA
#> 6 F47_0_25 F50_0_96 1 FC 1 <int [2]> NA
#> 7 F47_0_25 F52_0_18 1 FC 1 <int [2]> NA
#> 8 F47_0_25 F57_0_84 1 FC 1 <int [2]> NA
#> 9 F47_0_25 M49_0_118 1 FC 1 <int [2]> NA
#> 10 F47_0_25 M51_0_135 1 FC 1 <int [2]> NA
#> # ℹ 24 more variables: times_encountered <int>,
#> # primary_shared_ancestors <list>, psa_tibs <list>, pop_pre_1 <chr>,
#> # pop_post_1 <chr>, pop_dur_1 <chr>, pop_pre_2 <chr>, pop_post_2 <chr>,
#> # pop_dur_2 <chr>, sex_1 <chr>, sex_2 <chr>, born_year_1 <int>,
#> # born_year_2 <int>, samp_years_list_pre_1 <list>, samp_years_list_1 <list>,
#> # samp_years_list_dur_1 <list>, samp_years_list_post_1 <list>,
#> # samp_years_list_pre_2 <list>, samp_years_list_2 <list>, …
Because some pairs might be related in multiple ways (i.e., they
might be paternal half-sibs, but, through their mother’s lineage, they
might also be half-first cousins), things can get complicated. However,
CKMRpop
has an algorithm to categorize pairs according to
their “most important” relationship.
The column dom_relat
gives the “dominant” or “closest”
relationship between the pair. The possibilities, when considering up to
two generations of ancestors are:
Se
: self.PO
: parent-offspringSi
: sibling.GP
: grandparentalA
: avuncular (aunt-niece)FC
: first cousin.The max_hit
column can be interpreted as the number of
shared ancestors at the level of the dominant relationship. For example
a pair of half-siblings are of category Si
and have
max_hit = 1
, because they share one parent. On the other
hand, a pair with category Si
and max_hit = 2
would be full siblings.
Likewise, A
with max_hit = 1
is a
half-aunt-niece or half-uncle-nephew pair, while A
with
max_hit = 2
would be a full-aunt-niece or full-uncle-nephew
pair.
The column dr_hits
gives the number of shared ancestors
on the upper vs lower diagonals of the ancestry match matrices (see
below). These are meaningful primarily for understanding the
“directionality” of non-symmetrical relationships. Some explanation is
in order: some relationships, like Se, Si, and FC are
symmetrical relationships, because, if, for example Greta is
your sibling, then you are also Greta’s sibling. Likewise, if you are
Milton’s first cousin, then Milton is also your first cousin. Other
relationships, like PO, A, and GP, are not symmetrical: If Chelsea is
your mother, then you are not Chelsea’s mother, you are Chelsea’s child.
In the non-symmetrical relationships there is always one member who is
typically expected to be older than the other. This is a requirement in
a direct-descent relationship (like parent-offspring, or
grandparent-grandchild), but is not actually required in avuncular
relationships (i.e. it is possible to have an aunt that is younger than
the nephew…). We refer to the “typically older” member of
non-symmetrical pairs as the “upper member” and the
upper_member
column of the output above tells us whether
id_1
or id_2
is the upper member in such
relationships, when upper_member
is 1 or 2, respectively.
upper_member
is NA for symmetrical relationships and it can
be 0 for weird situations that should rarely arise where, for example a
pair A and B is related such that A is B’s half-uncle, but B is A’s
half-aunt.
Often the dominant relationship is the only relationship between the
pair. However, if you want to delve deeper into the full relationship
out to num_generations
generations, you can analyze the
ancestry match matrix for the pair, which is stored in the the
anc_match_matrix
column. This matrix holds a TRUE for each
shared ancestor in the two individual’s ancestry (out to
num_generations
). If this seems obtuse, it should become
more understandable when we look at some figures, later.
Here, we count up the number of pairs that fall into different relationship types:
relat_counts <- count_and_plot_ancestry_matrices(crel)
The first component of the return list is a tibble of the
relationship counts in a highly summarized form tabulating just the
dom_relat
and max_hit
over all the pairs.
relat_counts$highly_summarised
#> # A tibble: 10 × 3
#> dom_relat max_hit n
#> <chr> <int> <int>
#> 1 FC 1 1456
#> 2 A 1 912
#> 3 Si 1 420
#> 4 FC 2 114
#> 5 A 2 56
#> 6 PO 1 31
#> 7 Si 2 24
#> 8 GP 1 19
#> 9 FC 3 2
#> 10 A 3 1
This is telling us there are 1456 half-first-cousin pairs, 912 half-avuncular (aunt/uncle with niece/nephew) pairs, and so forth.
As noted before, the table just lists the dominant relationship of
each pair. If you want to quickly assess, within those dominant
categories, how many specific ancestry match matrices underlie them, you
can look at the dr_counts
component of the output:
relat_counts$dr_counts
#> # A tibble: 65 × 7
#> dom_relat dr_hits max_hit anc_match_matrix n tot_dom ID
#> <chr> <list> <int> <list> <int> <int> <chr>
#> 1 FC <int [2]> 1 <lgl [7 × 7]> 280 1572 001-FC[1,1] - 280
#> 2 FC <int [2]> 1 <lgl [7 × 7]> 258 1572 002-FC[1,1] - 258
#> 3 FC <int [2]> 1 <lgl [7 × 7]> 255 1572 003-FC[1,1] - 255
#> 4 FC <int [2]> 1 <lgl [7 × 7]> 192 1572 004-FC[1,1] - 192
#> 5 FC <int [2]> 1 <lgl [7 × 7]> 134 1572 005-FC[1,1] - 134
#> 6 FC <int [2]> 1 <lgl [7 × 7]> 129 1572 006-FC[1,1] - 129
#> 7 FC <int [2]> 1 <lgl [7 × 7]> 105 1572 007-FC[1,1] - 105
#> 8 FC <int [2]> 1 <lgl [7 × 7]> 103 1572 008-FC[1,1] - 103
#> 9 FC <int [2]> 2 <lgl [7 × 7]> 26 1572 009-FC[2,2] - 26
#> 10 FC <int [2]> 2 <lgl [7 × 7]> 24 1572 010-FC[2,2] - 24
#> # ℹ 55 more rows
Each of these distinct ancestry match matrices for each dominant relationship can be visualized in a series of faceted plots, which are also returned. For example, the ancestry match matrices seen amongst the FC relationships are:
relat_counts$dr_plots$FC
Within each dominant relationship, the distinct ancestry matrices in
each separate panel are named according to their number
(001
, 002
, 003
, etc),
relationship and the dr_hits
vector, (FC[1,2]
)
and the number of times this ancestry match matrix was observed amongst
pairs in the sample (like - 4
). So
006-FC[1,1] - 129
was observed in 129 of the sampled
pairs.
It is worth pointing out that inbred individuals can be easily seen
in these plots. For example, the 013-FC[2,2] - 5
plot shows
five pairwise relationships in which individual 2 is inbred, because its
father’s mother (pm) and mother’s mother (mm) are the same
individual.
Let’s look at the distinct ancestry match matrices from the siblings:
relat_counts$dr_plots$Si
Here in 004
–008
we see some interesting
case where ind_1 and ind_2 are half siblings, through one parent, but
are also half-first cousins through the other..
At times, for example, when looking for the more bizarre relationships, you might just want to visualize all the ancestry match matrices in order of the number of times that they occur. The number of different ancestry match matrices (and the matrices themselves) can be accessed with:
relat_counts$anc_mat_counts
#> # A tibble: 65 × 2
#> anc_match_matrix n
#> <list> <int>
#> 1 <lgl [7 × 7]> 280
#> 2 <lgl [7 × 7]> 258
#> 3 <lgl [7 × 7]> 255
#> 4 <lgl [7 × 7]> 244
#> 5 <lgl [7 × 7]> 223
#> 6 <lgl [7 × 7]> 202
#> 7 <lgl [7 × 7]> 192
#> 8 <lgl [7 × 7]> 167
#> 9 <lgl [7 × 7]> 134
#> 10 <lgl [7 × 7]> 129
#> # ℹ 55 more rows
But more useful for visualizing things is
relat_counts$anc_mat_plots
which is a list that holds a
series of pages/plots showing all the different ancestry matrices seen.
Here are the first 30:
relat_counts$anc_mat_plots[[1]]
And here are the remaining 14 relationship types:
relat_counts$anc_mat_plots[[2]]
These are worth staring at for a while, and making sure you understand what they are saying. I spent a lot of time staring at these, which is how I settled upon a decent algorithm for identifying the dominant relationship in each.
When using spip
within CKMRpop
you have to
specify the fraction of individuals in the population that you want to
sample at any particular time. You must set those fractions so that,
given the population size, you end up with roughly the correct number of
samples for the situation you are trying to simulate. Sometimes,
however, you might want to have sampled exactly 5,000 fish. Or some
other number. The function downsample_pairs
lets you
randomly discard specific instances in which an individual was sampled
so that the number of individuals (or sampling instances) that remains
is the exact number you want.
For example, looking closely at slurped$samples
shows
that 386 distinct individuals were sampled:
nrow(slurped$samples)
#> [1] 386
However, those 386 individuals represent multiple distinct sampling instances, because some individuals may sampled twice, as, in this simulation scenario, sampling the individuals does not remove them from the population:
slurped$samples %>%
mutate(ns = map_int(samp_years_list, length)) %>%
summarise(tot_times = sum(ns))
#> # A tibble: 1 × 1
#> tot_times
#> <int>
#> 1 394
Here are some individuals sampled at multiple times
SS2 <- slurped$samples %>%
filter(map_int(samp_years_list, length) > 1) %>%
select(ID, samp_years_list)
SS2
#> # A tibble: 8 × 2
#> ID samp_years_list
#> <chr> <list>
#> 1 F53_0_145 <int [2]>
#> 2 M55_0_22 <int [2]>
#> 3 M60_0_7 <int [2]>
#> 4 M64_0_29 <int [2]>
#> 5 M64_0_100 <int [2]>
#> 6 F65_0_24 <int [2]>
#> 7 F66_0_20 <int [2]>
#> 8 M69_0_50 <int [2]>
And the years that the first two of those individuals were sampled are as follows:
# first indiv:
SS2$samp_years_list[[1]]
#> [1] 55 56
# second indiv:
SS2$samp_years_list[[2]]
#> [1] 56 57
Great! Now, imagine that we wanted to see how many kin pairs we found when our sampling was such that we had only 100 instances of sampling (i.e., it could have been 98 individuals sampled in total, but two of them were sampled in two different years). We do like so:
subsampled_pairs <- downsample_pairs(
S = slurped$samples,
P = crel,
n = 100
)
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#> always returns an ungrouped data frame and adjust accordingly.
#> ℹ The deprecated feature was likely used in the CKMRpop package.
#> Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
Now there are only 188 pairs instead of 3035.
We can do a little calculation to see if that makes sense: because the number of pairs varies roughly quadratically, we would expect that the number of pairs to decrease by a quadratic factor of the number of samples:
# num samples before downsampling
ns_bd <- nrow(slurped$samples)
# num samples after downsampling
ns_ad <- nrow(subsampled_pairs$ds_samples)
# ratio of sample sizes
ssz_rat <- ns_ad / ns_bd
# square of the ratio
sq_rat <- ssz_rat ^ 2
# ratio of number of pairs found amongst samples
num_pairs_before <- nrow(crel)
num_pairs_after_downsampling = nrow(subsampled_pairs$ds_pairs)
ratio <- num_pairs_after_downsampling / num_pairs_before
# compare these two things
c(sq_rat, ratio)
#> [1] 0.06711590 0.06194399
That checks out.
Finally, in order to visually summarize all the kin pairs that were found, with specific reference to their age, time of sampling, and sex, I find it helpful to use what I have named the “Uncooked Spaghetti Plot”. There are multiple subpanels on this plot. Here is how to read/view these plots:
F->F
,
F->M
, M->F
, and M->M
.
These refer to the different possible combinations of sexes of the
individuals in the pair.
F
for female or M
for
male) denoting the sex of the “upper_member” of the relationship. That
is, if it is PO, then the sex of the parent is the first letter. The sex
of the non-upper-member is the second letter. Thus a PO
pair that consists of a father and a daughter would appear in a plot
that is in the PO
row in the M->F
column.F->M
subpanel.max_hits
) at the level of the dominant relationship. This
is how you can distinguish full-sibs from half-sibs, etc.We crunch out the data and make the plot like this:
# because we jitter some points, we can set a seed to get the same
# result each time
set.seed(22)
spag <- uncooked_spaghetti(
Pairs = crel,
Samples = slurped$samples
)
Now, the plot can be printed like so:
spag$plot
One issue that arises frequently in CKMR is the concern (especially in small populations) that the pairs of related individuals are not independent. The simplest way in which this occurs is when, for example, A is a half-sib of B, but B is also a half-sib of C, so that the pairs A-B and B-C share the individual B. These sorts of dependencies can be captured quickly by thinking of individuals as vertices and relationships between pairs of individuals as edges, which defines a graph. Finding all the connected components of such a graph provides a nice summary of all those pairs that share members and hence are certainly not independent.
The CKMRpop
package provides the connected component of
this graph for every related pair discovered. This is in column
conn_comp
of the output from
compile_related_pairs()
. Here we can see it from our
example, which shows that the first 10 pairs all belong to the same
connected component, 1.
crel %>%
slice(1:10)
#> # A tibble: 10 × 31
#> id_1 id_2 conn_comp dom_relat max_hit dr_hits upper_member
#> <chr> <chr> <dbl> <chr> <int> <list> <int>
#> 1 F47_0_19 M53_0_109 1 FC 1 <int [2]> NA
#> 2 F47_0_19 M53_0_11 1 FC 1 <int [2]> NA
#> 3 F47_0_25 F48_0_4 1 FC 1 <int [2]> NA
#> 4 F47_0_25 F50_0_118 1 FC 1 <int [2]> NA
#> 5 F47_0_25 F50_0_17 1 FC 1 <int [2]> NA
#> 6 F47_0_25 F50_0_96 1 FC 1 <int [2]> NA
#> 7 F47_0_25 F52_0_18 1 FC 1 <int [2]> NA
#> 8 F47_0_25 F57_0_84 1 FC 1 <int [2]> NA
#> 9 F47_0_25 M49_0_118 1 FC 1 <int [2]> NA
#> 10 F47_0_25 M51_0_135 1 FC 1 <int [2]> NA
#> # ℹ 24 more variables: times_encountered <int>,
#> # primary_shared_ancestors <list>, psa_tibs <list>, pop_pre_1 <chr>,
#> # pop_post_1 <chr>, pop_dur_1 <chr>, pop_pre_2 <chr>, pop_post_2 <chr>,
#> # pop_dur_2 <chr>, sex_1 <chr>, sex_2 <chr>, born_year_1 <int>,
#> # born_year_2 <int>, samp_years_list_pre_1 <list>, samp_years_list_1 <list>,
#> # samp_years_list_dur_1 <list>, samp_years_list_post_1 <list>,
#> # samp_years_list_pre_2 <list>, samp_years_list_2 <list>, …
It should clearly be noted that the size of the connected components
will be affected by the size of the population (with smaller
populations, more of the related pairs will share members) and the
number of generations back in time over which generations are compiled
(if you go back for enough in time, all the pairs will be related to one
another). In our example case, with a small population (so it can be
simulated quickly for building the vignettes) and going back
num_generations = 2
generations (thus including
grandparents and first cousins, etc.) we actually find that all
of the pairs are in the same connected component. Wow!
Because this simulated population is quite small, at this juncture we
will reduce the number of generations so as to create more connected
components amongst these pairs for illustration. So, let us compile just
the pairs with num_generations = 1
. To do this, we must
slurp up the spip results a second time
slurped_1gen <- slurp_spip(spip_dir, num_generations = 1)
And after we have done that, we can compile the related pairs:
crel_1gen <- compile_related_pairs(slurped_1gen$samples)
Look at the number of pairs:
nrow(crel_1gen)
#> [1] 475
That is still a lot of pairs, so let us downsample to 150 samples so that our figures are not overwhelmed by connected components.
set.seed(10)
ssp_1gen <- downsample_pairs(
S = slurped_1gen$samples,
P = crel_1gen,
n = 150
)
And also tally up the number of pairs in different connected components:
ssp_1gen$ds_pairs %>%
count(conn_comp) %>%
arrange(desc(n))
#> # A tibble: 27 × 2
#> conn_comp n
#> <dbl> <int>
#> 1 3 14
#> 2 14 8
#> 3 12 7
#> 4 17 7
#> 5 2 6
#> 6 10 6
#> 7 7 3
#> 8 25 3
#> 9 24 2
#> 10 1 1
#> # ℹ 17 more rows
There are some rather large connected components there. Let’s plot them.
# for some reason, the aes() function gets confused unless
# ggraph library is loaded...
one_gen_graph <- plot_conn_comps(ssp_1gen$ds_pairs)
one_gen_graph$plot
Note that if you want to attach labels to those nodes, to see which individuals we are talking about, you can do this (and also adjust colors…):
one_gen_graph +
ggraph::geom_node_text(aes(label = name), repel = TRUE, size = 1.2) +
scale_edge_color_manual(values = c(`PO-1` = "tan2", `Si-1` = "gold", `Si-2` = "blue"))
#> NULL
And, for fun, look at it with 2 generations and all of the samples:
plot_conn_comps(crel)$plot
What a snarl! With a small population, several generations, and large samples, in this case…everyone is connected!
We can simulate the genotypes of the sampled individuals at unlinked markers that have allele frequencies (amongst the founders) that we specify. We provide the desired allele frequencies in a list. Here we simulate uniformly distributed allele frequencies at 100 markers, each with a random number of alleles that is 1 + Poisson(3):
Then run spip with those allele frequencies:
set.seed(5)
spip_dir <- run_spip(
pars = SPD,
allele_freqs = freqs
)
# now read that in and find relatives within the grandparental range
slurped <- slurp_spip(spip_dir, 2)
Now, the variable slurped$genotypes
has the genotypes we
requested. The first column, (ID
) is the ID of the
individual (congruent with the ID
column in
slurped$samples
) and the remaining columns are for the
markers. Each locus occupies one column and the alleles are separated by
a slash.
Here are the first 10 individuals at the first four loci:
slurped$genotypes[1:10, 1:5]
#> # A tibble: 10 × 5
#> ID Locus_1 Locus_2 Locus_3 Locus_4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 F47_0_19 3/3 3/1 4/1 3/3
#> 2 F47_0_25 3/2 3/3 1/4 1/1
#> 3 F48_0_4 2/4 3/1 3/2 1/1
#> 4 F48_0_28 2/3 1/3 2/4 3/1
#> 5 F48_0_56 3/3 3/3 1/2 1/1
#> 6 F48_0_138 1/3 3/1 4/4 1/3
#> 7 M48_0_89 4/4 2/3 1/2 1/1
#> 8 F49_0_16 2/2 2/2 2/1 1/1
#> 9 F49_0_49 4/3 1/1 3/1 3/3
#> 10 F49_0_62 2/2 3/3 2/3 1/3