library(CKMRsim)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.2      readr     2.1.4
#>  forcats   1.0.0      stringr   1.5.0
#>  ggplot2   3.4.3      tibble    3.2.1
#>  lubridate 1.9.2      tidyr     1.3.0
#>  purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Prerequisites

In order to pursue simulations in the face of physical linkage, you must download and install the external dependency, Mendel version 16. For Windows and Mac OS X, CKMRsim will look for the Mendel binary in its default install location. So download it from http://software.genetics.ucla.edu/mendel and do a default install. Note that you need to register with your email in order to download Mendel.

An Example

‘CKMRsim’ comes with an example data set of linked markers. The data themselves are concocted—they are merely the microhaplotypes that have been assigned some fictitious positions within a genome—but that is useful for demonstration. The data are allele frequencies in the format required by ‘CKMRsim’. Here are the first few lines of them”

linked_mhaps
#> # A tibble: 825 × 7
#>    Chrom Locus            Pos Allele LocIdx AlleIdx   Freq
#>    <int> <chr>          <dbl> <chr>   <int>   <int>  <dbl>
#>  1     1 tag_id_2255  2778713 AGC         1       1 0.745 
#>  2     1 tag_id_2255  2778713 AAC         1       2 0.223 
#>  3     1 tag_id_2255  2778713 AGT         1       3 0.0326
#>  4     1 tag_id_1508 10014085 GG          2       1 0.956 
#>  5     1 tag_id_1508 10014085 AG          2       2 0.0220
#>  6     1 tag_id_1508 10014085 GA          2       3 0.0220
#>  7     1 tag_id_2513 11890256 CGGAA       3       1 0.621 
#>  8     1 tag_id_2513 11890256 CGGAG       3       2 0.121 
#>  9     1 tag_id_2513 11890256 CGGGA       3       3 0.0895
#> 10     1 tag_id_2513 11890256 CGGGG       3       4 0.0684
#> # ℹ 815 more rows

See that each marker is given a position (in base pairs) along the chromosome upon which it resides.

By default, ‘CKMRsim’ assumes a simple recombination rate of 1 cM per megabase. If you have an actual genetic linkage map for you markers, then you can accommodate that by giving base-pair positions (Pos) of the markers that correspond to the positions that would yield the observed recombination fractions under a constant rate of 1 cM/Mb.

Here we will show the process of simulating full sibling pairs with and without the physical linkage in linked_mhaps. We will also show that physical linkage does not affect parent-offspring or unrelated pairs.

Simulating Pairs

Creating the CKMR object is just the same as it was in the “Example 1” vignette.

ck_lmh <- create_ckmr(
  D = linked_mhaps,
  kappa_matrix = kappas[c("PO", "FS", "U"), ],
  ge_mod_assumed = ge_model_TGIE,
  ge_mod_true = ge_model_TGIE,
  ge_mod_assumed_pars_list = list(epsilon = 0.01),
  ge_mod_true_pars_list = list(epsilon = 0.01)
)

Simulating the markers under the assumption of no linkage is also the same:

Qs_lmh_no_link <- simulate_Qij(ck_lmh,
                               calc_relats = c("PO", "FS", "U"),
                               sim_relats = c("PO", "FS", "U") 
                               )
#> Simulating unlinked markers from Y_l_true matrices for relationship: PO
#> Simulating unlinked markers from Y_l_true matrices for relationship: FS
#> Simulating unlinked markers from Y_l_true matrices for relationship: U

In order to simulate the markers with physical linkage, you use the same simulate_Qij() function, but you tell it to simulate with physical linkage and you also have to give it a list of pedigrees so that it can pass the correct pedigrees to the program MENDEL, which actually does the simulation. CKMRsim comes with such a list. At the moment, it only has three entries” “PO”, FS”, “HS”. If you want to add more (like cousins) feel free. Just note that the two focal individuals of the pair have to be individuals 1 and 2.
If you need help, ping me at eric.anderson@noaa.gov.

So, simulating with linkage looks like this:

# we only run this if it is on eric's laptop.  This way
# we can build the vignette even on CRAN's machines that don't
# have Mendel installed
Qs_lmh_with_link <- simulate_Qij(ck_lmh,
                                 calc_relats = c("PO", "FS", "U"),
                                 sim_relats = c("PO", "FS", "U"), 
                                 unlinked = FALSE,
                                 pedigree_list = pedigrees
)

Note that CKMRsim is wise enough to simply simulate markers as unlinked for the PO, U, and MZ cases (because, when dealing with pairs with those relationships, the presence of physical linkage makes no difference to the outcome).

Carrying on, it is instructive to look at the distributions of the FS/U log-likelihood ratios. They should differ between the cases with and without linkage, most notably the case with linkage should produce values with greater variance. Let us see if that is the case. To do that, we extract the log-likelihood ratio values, put them together in a single data frame, and then plot them. Note that the extract_logls function returns a column that says whether the simulation was done with or without linkage, so it is straightforward to do this in two lines:

lnl_logls <- list(
  extract_logls(Qs_lmh_no_link, numer = c(FS = 1), denom = c(U = 1)),
  extract_logls(Qs_lmh_with_link, numer = c(FS = 1), denom = c(U = 1))
) %>%
  bind_rows()

Now compare the cases when the truth is full-sibling:

lnl_logls %>%
  filter(true_relat == "FS") %>%
  ggplot(aes(x = logl_ratio, fill = simtype)) +
  geom_histogram(bins = 70, 
                 position = position_identity(),
                 alpha = 0.5)

See, logl ratios have the same mean top and bottom.

lnl_logls %>%
  filter(true_relat == "FS") %>%
  group_by(simtype) %>%
  summarise(mean_logl_rat = mean(logl_ratio),
            mean_numer = mean(numer_logl))
#> # A tibble: 2 × 3
#>   simtype  mean_logl_rat mean_numer
#>   <chr>            <dbl>      <dbl>
#> 1 linked            42.4      -390.
#> 2 unlinked          42.3      -391.