CKMRsim-example-1

Introduction

In this example, we will work through a relatively complete analysis of data from kelp rockfish. These data have been used in two publications to date (Baetscher et al. 2019, 2018). The genotype data are included in the package as a gzip compressed csv file. The contents are identical to what can be downloaded from Dryad at: http://datadryad.org/bitstream/handle/10255/dryad.205630/kelp_genos_used_two_col.csv?sequence=1.

Before we get started let’s load the packages we need. We’ll be using a lot of functions from the tidyverse, so let’s do it.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.2     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(CKMRsim)

Reading in the Genotype Data set

We first must read in the genotypes of the individuals. This file holds genotypes of all the adults and all of the juveniles. The first part of our analysis involves simulations using allele frequencies. Since we don’t expect large full sibling groups in our data, it is reasonable to use both adults and juveniles, together, to estimate allele frequencies in this kelp rockfish population.

It is also worth noting that, in general, it is important to include most/all of the individuals you will be using in your CKMR study when estimating allele frequencies. This is because all alleles that you might encounter when doing kin-finding must have been accounted for and must have allele frequencies computed for them. Especially when using multiallelic markers like microhaplotypes or microsatellites, it is important to make sure that every allele that you might encounter during kin-finding is accounted for during the allele frequency and power estimation phases of the analysis.

The file that holds the genotypes comes with the CKMRsim package at:

geno_path <- system.file("extdata/kelp_genos_used_two_col.csv.gz", package = "CKMRsim")

When we read that in, we get a lot of warnings (they have been suppressed here so that the vignette will pass CRAN checks) that column names have been deduplicated. This is because each locus in the file occupies two columns and the names of those columns are the same (i.e., the name of the locus.)

genos1 <- suppressWarnings(read_csv(geno_path))
#> New names:
#> Rows: 6091 Columns: 157
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (1): NMFS_DNA_ID dbl (156): Plate_1_A01_Sat_GW603857_consensus...2,
#> Plate_1_A01_Sat_GW603857_...
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `Plate_1_A01_Sat_GW603857_consensus` ->
#>   `Plate_1_A01_Sat_GW603857_consensus...2`
#> • `Plate_1_A01_Sat_GW603857_consensus` ->
#>   `Plate_1_A01_Sat_GW603857_consensus...3`
#> • `Plate_1_A11_Sat_GE820299_consensus` ->
#>   `Plate_1_A11_Sat_GE820299_consensus...4`
#> • `Plate_1_A11_Sat_GE820299_consensus` ->
#>   `Plate_1_A11_Sat_GE820299_consensus...5`
#> • `Plate_2_A09_Sat_EW986980_consensus` ->
#>   `Plate_2_A09_Sat_EW986980_consensus...6`
#> • `Plate_2_A09_Sat_EW986980_consensus` ->
#>   `Plate_2_A09_Sat_EW986980_consensus...7`
#> • `Plate_2_C08_Sat_EW987116_consensus` ->
#>   `Plate_2_C08_Sat_EW987116_consensus...8`
#> • `Plate_2_C08_Sat_EW987116_consensus` ->
#>   `Plate_2_C08_Sat_EW987116_consensus...9`
#> • `Plate_3_C03_Sat_GE798118_consensus` ->
#>   `Plate_3_C03_Sat_GE798118_consensus...10`
#> • `Plate_3_C03_Sat_GE798118_consensus` ->
#>   `Plate_3_C03_Sat_GE798118_consensus...11`
#> • `Plate_4_E10_Sat_EW976030_consensus` ->
#>   `Plate_4_E10_Sat_EW976030_consensus...12`
#> • `Plate_4_E10_Sat_EW976030_consensus` ->
#>   `Plate_4_E10_Sat_EW976030_consensus...13`
#> • `Plate_4_G06_Sat_EW976181_consensus` ->
#>   `Plate_4_G06_Sat_EW976181_consensus...14`
#> • `Plate_4_G06_Sat_EW976181_consensus` ->
#>   `Plate_4_G06_Sat_EW976181_consensus...15`
#> • `tag_id_1049` -> `tag_id_1049...16`
#> • `tag_id_1049` -> `tag_id_1049...17`
#> • `tag_id_108` -> `tag_id_108...18`
#> • `tag_id_108` -> `tag_id_108...19`
#> • `tag_id_1184` -> `tag_id_1184...20`
#> • `tag_id_1184` -> `tag_id_1184...21`
#> • `tag_id_1229` -> `tag_id_1229...22`
#> • `tag_id_1229` -> `tag_id_1229...23`
#> • `tag_id_1366` -> `tag_id_1366...24`
#> • `tag_id_1366` -> `tag_id_1366...25`
#> • `tag_id_1399` -> `tag_id_1399...26`
#> • `tag_id_1399` -> `tag_id_1399...27`
#> • `tag_id_1428` -> `tag_id_1428...28`
#> • `tag_id_1428` -> `tag_id_1428...29`
#> • `tag_id_143` -> `tag_id_143...30`
#> • `tag_id_143` -> `tag_id_143...31`
#> • `tag_id_1441` -> `tag_id_1441...32`
#> • `tag_id_1441` -> `tag_id_1441...33`
#> • `tag_id_1449` -> `tag_id_1449...34`
#> • `tag_id_1449` -> `tag_id_1449...35`
#> • `tag_id_1471` -> `tag_id_1471...36`
#> • `tag_id_1471` -> `tag_id_1471...37`
#> • `tag_id_1498` -> `tag_id_1498...38`
#> • `tag_id_1498` -> `tag_id_1498...39`
#> • `tag_id_1558` -> `tag_id_1558...40`
#> • `tag_id_1558` -> `tag_id_1558...41`
#> • `tag_id_1576` -> `tag_id_1576...42`
#> • `tag_id_1576` -> `tag_id_1576...43`
#> • `tag_id_1598` -> `tag_id_1598...44`
#> • `tag_id_1598` -> `tag_id_1598...45`
#> • `tag_id_1604` -> `tag_id_1604...46`
#> • `tag_id_1604` -> `tag_id_1604...47`
#> • `tag_id_1613` -> `tag_id_1613...48`
#> • `tag_id_1613` -> `tag_id_1613...49`
#> • `tag_id_162` -> `tag_id_162...50`
#> • `tag_id_162` -> `tag_id_162...51`
#> • `tag_id_1652` -> `tag_id_1652...52`
#> • `tag_id_1652` -> `tag_id_1652...53`
#> • `tag_id_170` -> `tag_id_170...54`
#> • `tag_id_170` -> `tag_id_170...55`
#> • `tag_id_1708` -> `tag_id_1708...56`
#> • `tag_id_1708` -> `tag_id_1708...57`
#> • `tag_id_1748` -> `tag_id_1748...58`
#> • `tag_id_1748` -> `tag_id_1748...59`
#> • `tag_id_1751` -> `tag_id_1751...60`
#> • `tag_id_1751` -> `tag_id_1751...61`
#> • `tag_id_1762` -> `tag_id_1762...62`
#> • `tag_id_1762` -> `tag_id_1762...63`
#> • `tag_id_179` -> `tag_id_179...64`
#> • `tag_id_179` -> `tag_id_179...65`
#> • `tag_id_1808` -> `tag_id_1808...66`
#> • `tag_id_1808` -> `tag_id_1808...67`
#> • `tag_id_1836` -> `tag_id_1836...68`
#> • `tag_id_1836` -> `tag_id_1836...69`
#> • `tag_id_1871` -> `tag_id_1871...70`
#> • `tag_id_1871` -> `tag_id_1871...71`
#> • `tag_id_1880` -> `tag_id_1880...72`
#> • `tag_id_1880` -> `tag_id_1880...73`
#> • `tag_id_1889` -> `tag_id_1889...74`
#> • `tag_id_1889` -> `tag_id_1889...75`
#> • `tag_id_1950` -> `tag_id_1950...76`
#> • `tag_id_1950` -> `tag_id_1950...77`
#> • `tag_id_1961` -> `tag_id_1961...78`
#> • `tag_id_1961` -> `tag_id_1961...79`
#> • `tag_id_1966` -> `tag_id_1966...80`
#> • `tag_id_1966` -> `tag_id_1966...81`
#> • `tag_id_1982` -> `tag_id_1982...82`
#> • `tag_id_1982` -> `tag_id_1982...83`
#> • `tag_id_1994` -> `tag_id_1994...84`
#> • `tag_id_1994` -> `tag_id_1994...85`
#> • `tag_id_1999` -> `tag_id_1999...86`
#> • `tag_id_1999` -> `tag_id_1999...87`
#> • `tag_id_2008` -> `tag_id_2008...88`
#> • `tag_id_2008` -> `tag_id_2008...89`
#> • `tag_id_2017` -> `tag_id_2017...90`
#> • `tag_id_2017` -> `tag_id_2017...91`
#> • `tag_id_2062` -> `tag_id_2062...92`
#> • `tag_id_2062` -> `tag_id_2062...93`
#> • `tag_id_2082` -> `tag_id_2082...94`
#> • `tag_id_2082` -> `tag_id_2082...95`
#> • `tag_id_2114` -> `tag_id_2114...96`
#> • `tag_id_2114` -> `tag_id_2114...97`
#> • `tag_id_2134` -> `tag_id_2134...98`
#> • `tag_id_2134` -> `tag_id_2134...99`
#> • `tag_id_2178` -> `tag_id_2178...100`
#> • `tag_id_2178` -> `tag_id_2178...101`
#> • `tag_id_220` -> `tag_id_220...102`
#> • `tag_id_220` -> `tag_id_220...103`
#> • `tag_id_2203` -> `tag_id_2203...104`
#> • `tag_id_2203` -> `tag_id_2203...105`
#> • `tag_id_221` -> `tag_id_221...106`
#> • `tag_id_221` -> `tag_id_221...107`
#> • `tag_id_2214` -> `tag_id_2214...108`
#> • `tag_id_2214` -> `tag_id_2214...109`
#> • `tag_id_2237` -> `tag_id_2237...110`
#> • `tag_id_2237` -> `tag_id_2237...111`
#> • `tag_id_2247` -> `tag_id_2247...112`
#> • `tag_id_2247` -> `tag_id_2247...113`
#> • `tag_id_2258` -> `tag_id_2258...114`
#> • `tag_id_2258` -> `tag_id_2258...115`
#> • `tag_id_2301` -> `tag_id_2301...116`
#> • `tag_id_2301` -> `tag_id_2301...117`
#> • `tag_id_250` -> `tag_id_250...118`
#> • `tag_id_250` -> `tag_id_250...119`
#> • `tag_id_2607` -> `tag_id_2607...120`
#> • `tag_id_2607` -> `tag_id_2607...121`
#> • `tag_id_2635` -> `tag_id_2635...122`
#> • `tag_id_2635` -> `tag_id_2635...123`
#> • `tag_id_265` -> `tag_id_265...124`
#> • `tag_id_265` -> `tag_id_265...125`
#> • `tag_id_325` -> `tag_id_325...126`
#> • `tag_id_325` -> `tag_id_325...127`
#> • `tag_id_402` -> `tag_id_402...128`
#> • `tag_id_402` -> `tag_id_402...129`
#> • `tag_id_410` -> `tag_id_410...130`
#> • `tag_id_410` -> `tag_id_410...131`
#> • `tag_id_436` -> `tag_id_436...132`
#> • `tag_id_436` -> `tag_id_436...133`
#> • `tag_id_572` -> `tag_id_572...134`
#> • `tag_id_572` -> `tag_id_572...135`
#> • `tag_id_67` -> `tag_id_67...136`
#> • `tag_id_67` -> `tag_id_67...137`
#> • `tag_id_788` -> `tag_id_788...138`
#> • `tag_id_788` -> `tag_id_788...139`
#> • `tag_id_843` -> `tag_id_843...140`
#> • `tag_id_843` -> `tag_id_843...141`
#> • `tag_id_855` -> `tag_id_855...142`
#> • `tag_id_855` -> `tag_id_855...143`
#> • `tag_id_874` -> `tag_id_874...144`
#> • `tag_id_874` -> `tag_id_874...145`
#> • `tag_id_875` -> `tag_id_875...146`
#> • `tag_id_875` -> `tag_id_875...147`
#> • `tag_id_879` -> `tag_id_879...148`
#> • `tag_id_879` -> `tag_id_879...149`
#> • `tag_id_913` -> `tag_id_913...150`
#> • `tag_id_913` -> `tag_id_913...151`
#> • `tag_id_934` -> `tag_id_934...152`
#> • `tag_id_934` -> `tag_id_934...153`
#> • `tag_id_942` -> `tag_id_942...154`
#> • `tag_id_942` -> `tag_id_942...155`
#> • `tag_id_981` -> `tag_id_981...156`
#> • `tag_id_981` -> `tag_id_981...157`

Let’s look at the first bit of the resulting tibble:

genos1[1:10, 1:10]
#> # A tibble: 10 × 10
#>    NMFS_DNA_ID Plate_1_A01_Sat_GW603857_consensus...2 Plate_1_A01_Sat_GW603857…¹
#>    <chr>                                        <dbl>                      <dbl>
#>  1 R011743                                          1                          1
#>  2 R011744                                          1                          1
#>  3 R011753                                          1                          1
#>  4 R011755                                          1                          2
#>  5 R011765                                          1                          1
#>  6 R011766                                          1                          2
#>  7 R011769                                          1                          2
#>  8 R011772                                          1                          1
#>  9 R011775                                          1                          2
#> 10 R011778                                          1                          1
#> # ℹ abbreviated name: ¹Plate_1_A01_Sat_GW603857_consensus...3
#> # ℹ 7 more variables: Plate_1_A11_Sat_GE820299_consensus...4 <dbl>,
#> #   Plate_1_A11_Sat_GE820299_consensus...5 <dbl>,
#> #   Plate_2_A09_Sat_EW986980_consensus...6 <dbl>,
#> #   Plate_2_A09_Sat_EW986980_consensus...7 <dbl>,
#> #   Plate_2_C08_Sat_EW987116_consensus...8 <dbl>,
#> #   Plate_2_C08_Sat_EW987116_consensus...9 <dbl>, …

That is not a super informative print out. But it does show us that the NMFS_DNA_ID column is the identifier for the indvidual fish, and that the genetic data start in column 2. Also, the even columns have the locus name and the odd columns (after the first one) are named with the locus names plus a _1 appended to them. The alleles of each genotype are recorded as numbers.

Which are adults and which are juveniles?

We also have in the package two data files that list the IDs of the fish that are adults (and potentially parents) and which are juveniles (and potentially offspring). These correspond to the dryad files kelp_adults.rds and kelp_juveniles.rds. We get those data and then make a tibble of IDs with a column telling what life stage they were sampled at, and keep only those IDs that we have genotypes for in genos1:

adults <- read_rds(file = system.file("extdata/kelp_adults.rds", package = "CKMRsim"))
juvies <- read_rds(file = system.file("extdata/kelp_juveniles.rds", package = "CKMRsim"))


life_stages <- bind_rows(
  tibble(stage = "adult",
         NMFS_DNA_ID = adults),
  tibble(stage = "juvenile",
         NMFS_DNA_ID = juvies)
) %>%
  filter(NMFS_DNA_ID %in% genos1$NMFS_DNA_ID)

Here we can see the numbers of adult and juvenile fish we are dealing with:

life_stages %>%
  count(stage)
#> # A tibble: 2 × 2
#>   stage        n
#>   <chr>    <int>
#> 1 adult     1847
#> 2 juvenile  4244

Estimating Power for Relationship Inference

Estimating the power that this set of markers is expected to give us for pairwise relationship inference involves doing Monte Carlo simulations from the allele frequencies in the population. So, we are going to first have to estimate the allele frequencies from our genotype data set. And, we have to present those allele frequencies in a specific format that CKMRsim is expecting.

Allele Frequency Input Format

CKMRsim is designed to be able to work with information about markers that are in known locations on different chromosomes within the genome. Note that genome coordinates of markers are not required but the input format has a place to record the chromome and position of each marker. If these are not known, we fill them in with dummy values, as will be shown. If simulations are done without physical linkage, then the values used have no effect.

The package data long_markers shows an example of the allele frequency format that CKMRsim expects. It looks like this:

long_markers
#> # A tibble: 53,637 × 7
#>    Chrom Locus     Pos Allele LocIdx AlleIdx   Freq
#>    <int> <chr>   <dbl> <chr>   <int>   <int>  <dbl>
#>  1     4 chr4-1   3845 a1          1       1 0.790 
#>  2     4 chr4-1   3845 a2          1       2 0.210 
#>  3     4 chr4-2  71520 a1          2       1 0.485 
#>  4     4 chr4-2  71520 a3          2       2 0.237 
#>  5     4 chr4-2  71520 a2          2       3 0.167 
#>  6     4 chr4-2  71520 a4          2       4 0.112 
#>  7     4 chr4-3 105104 a3          3       1 0.499 
#>  8     4 chr4-3 105104 a2          3       2 0.402 
#>  9     4 chr4-3 105104 a1          3       3 0.0989
#> 10     4 chr4-4 256481 a4          4       1 0.402 
#> # ℹ 53,627 more rows

The main columns are as follows:

Chrom should be a character or integer denoting which chromosome the marker is on. (For example 1 or X or Omy12)
Locus gives a character vector with the names of the markers/loci. Please don’t use spaces in the marker names!
Pos gives the genome coordinates of the marker. This needs to be a number (double or integer)
Allele gives the name of each allele at each locus. This must be a character vector. So, in our example, we will coerce the numbers we have for each locus into a character vector. Please do not use spaces in allele names.
Freq is the frequency of each allele in the population. These should sum to 1.0 over each locus.

The remaining columns, LocIdx and AlleIdx are integer indices that get assigned to each locus and to each allele within each index. These columns get filled by using the reindex_markers() function.

Computing Allele Frequencies from Genotype Data

The strategy for doing this in the tidyverse is to make a long data frame with columns NMFS_DNA_ID, Locus, gene_copy and Allele, and then summarising as appropriate. Notice that, since these fish are diploids, we will let gene_copy be a or b to denote whether we are talking about the allelic type of the first or second gene copy of the locus in a diploid individual. To do this, we will rename the column headers to look like Locus.gene_copy, then do some gathering and separating.

MISSING DATA: Note that missing data in this data set is denoted by 0, and it must be denoted by NA. So, turn Allele == 0 to NA:

# make note of the current order of loci in the data set
nc <- ncol(genos1)
loci <- str_replace(names(genos1)[seq(2, nc, by = 2)], "\\.\\.\\.[0-9]+$", "")

#  reset the locus names
names(genos1)[seq(2, nc, by = 2)] <- str_c(loci, "1", sep = ".")
names(genos1)[seq(3, nc, by = 2)] <- str_c(loci, "2", sep = ".")

# then make some long format genotypes
long_genos <- genos1 %>% 
  gather(key = "loc", value = "Allele", -NMFS_DNA_ID) %>%
  separate(loc, into = c("Locus", "gene_copy"), sep = "\\.") %>%
  mutate(Allele = as.character(Allele)) %>%
  mutate(Allele = ifelse(Allele == "0", NA, Allele)) %>%
  rename(Indiv = NMFS_DNA_ID)

IMPORTANT NOTE The data frame long_genos here is a long-format specification of multilocus genotypes of individuals. This is a a basic input type that will be used later in CKMRsim. It looks like this:

long_genos
#> # A tibble: 950,196 × 4
#>    Indiv   Locus                              gene_copy Allele
#>    <chr>   <chr>                              <chr>     <chr> 
#>  1 R011743 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  2 R011744 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  3 R011753 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  4 R011755 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  5 R011765 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  6 R011766 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  7 R011769 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  8 R011772 Plate_1_A01_Sat_GW603857_consensus 1         1     
#>  9 R011775 Plate_1_A01_Sat_GW603857_consensus 1         1     
#> 10 R011778 Plate_1_A01_Sat_GW603857_consensus 1         1     
#> # ℹ 950,186 more rows

Note this about the data frame:

It has four columns: Indiv, Locus, gene_copy, and Allele.
Alleles are named with characters (even if they are numbers, they must be coerced to characters).
The gene_copy column contains either a 1 or a 2 in every row, telling us which copy of the gene (in a diploid) is which allele.
Missing data in the Allele column is given by NA.

Now we can compute the allele frequencies, and while we are at it, we will say that these markers are all on chromosome Unk in positions given by their order in the original data set. And we add columns for AlleIdx and LocIdx, as required to pass this tibble into reindex_markers(), but we can leave each entry as NA.

alle_freqs <- long_genos %>%
  count(Locus, Allele) %>%
  group_by(Locus) %>%
  mutate(Freq = n / sum(n),
         Chrom = "Unk",
         Pos = as.integer(factor(Locus, levels = loci))) %>%
  ungroup() %>%
  select(Chrom, Pos, Locus, Allele, Freq) %>%
  arrange(Pos, desc(Freq)) %>%
  mutate(AlleIdx = NA,
         LocIdx = NA) %>%
  filter(!is.na(Allele))

# see what that looks like:
alle_freqs
#> # A tibble: 993 × 7
#>    Chrom   Pos Locus                              Allele     Freq AlleIdx LocIdx
#>    <chr> <int> <chr>                              <chr>     <dbl> <lgl>   <lgl> 
#>  1 Unk       1 Plate_1_A01_Sat_GW603857_consensus 1       6.33e-1 NA      NA    
#>  2 Unk       1 Plate_1_A01_Sat_GW603857_consensus 2       3.66e-1 NA      NA    
#>  3 Unk       1 Plate_1_A01_Sat_GW603857_consensus 3       4.10e-4 NA      NA    
#>  4 Unk       1 Plate_1_A01_Sat_GW603857_consensus 4       2.46e-4 NA      NA    
#>  5 Unk       1 Plate_1_A01_Sat_GW603857_consensus 5       8.21e-5 NA      NA    
#>  6 Unk       1 Plate_1_A01_Sat_GW603857_consensus 6       8.21e-5 NA      NA    
#>  7 Unk       2 Plate_1_A11_Sat_GE820299_consensus 1       7.01e-1 NA      NA    
#>  8 Unk       2 Plate_1_A11_Sat_GE820299_consensus 2       2.49e-1 NA      NA    
#>  9 Unk       2 Plate_1_A11_Sat_GE820299_consensus 3       2.45e-2 NA      NA    
#> 10 Unk       2 Plate_1_A11_Sat_GE820299_consensus 4       1.19e-2 NA      NA    
#> # ℹ 983 more rows

Note: It is very important to remove the NAs from the allele frequencies as is done in the last step above.

Once that is done, we pass them through the reindex_markers() function to yield a data frame of allele frequencies that CKRMsim can use to do simulations.

afreqs_ready <- reindex_markers(alle_freqs)

We now have what we need to do a simulation analysis of power for pairwise relationship inference from these markers.

Note that we will come back to long_genos later to actually do the pairwise relationship inference.

Create a CKMR object (`create_ckmr()`)

The first step in the simulation process is to create an object of class ckmr which holds a large number of matrices useful for simulating genotypes (and thus for estimating power for relationship inference). The main function that we use for this is create_ckmr(). This function takes, as input, the allele frequencies formatted as we have above, and a few other settings having to do with the relationships that the user is interested in simulation and the genotyping error models that the user want to consider, and then it creates all the necessary elements for doing the simulations.

Specifically, the arguments you need to pass to create_ckmr() are as follows:

D : the tibble of allele frequencies that has been run through reindex_markers().
kappa_matrix : A matrix that describes the pairwise relationships that you will be wanting to jointly simulate genotypes and likelihoods for. Each row is named by what you want to call the relationship and there are three columns which give, respectively, the probability that a pair with such a relationship share 0, 1, or 2 genes identical-by-descent. The CKMRsim package comes with a matrix called kappas that has this information for 12 relationships:
- MZ : monozgotic twins (or “self”). This can be used to figure out how much power you have for identifying the same individual, sampled twice.
- PO : parent-offspring.
- FS : full siblings.
- HS : half siblings.
- GP : grandparent - grandoffspring.
- AN : aunt-neice (same as uncle-nephew or any such avuncular relationship)
- DFC : double first cousins.
- FC : first cousins.
- HC : half cousins.
- U : unrelated. The kappas matrix looks like this:
```
kappas
#>        kappa0  kappa1   kappa2
#> MZ   0.000000 0.00000 1.000000
#> PO   0.000000 1.00000 0.000000
#> FS   0.250000 0.50000 0.250000
#> HS   0.500000 0.50000 0.000000
#> GP   0.500000 0.50000 0.000000
#> AN   0.500000 0.50000 0.000000
#> DFC  0.562500 0.37500 0.062500
#> HAN  0.750000 0.25000 0.000000
#> FC   0.750000 0.25000 0.000000
#> HFC  0.875000 0.12500 0.000000
#> DHFC 0.765625 0.21875 0.015625
#> SC   0.937500 0.06250 0.000000
#> HSC  0.968750 0.03125 0.000000
#> U    1.000000 0.00000 0.000000
```
ge_mod_assumed : a function that describes the genotyping error model that will be applied to the simulated data when computing the likelihoods of the genotypes. In this case we will use ge_model_TGIE which is appropriate for integer-coded data. More information about this is in the vignette about writing genotyping error functions: the-tgie-function.
ge_mod_true : a function that describes the genotyping error model that will actually be used to simulate the genotype data. Being able to separately specify these two models (assumed and true) allows the user to investigate the effects of misspecification of the genotyping error model.
ge_mod_assumed_pars_list a list of named parameters for the assumed genotyping error model (or, leave blank to use the defaults).
ge_mod_true_pars_list a list of named parameters for the true genotyping error model (or, leave blank to use the defaults).

For the present example we are going to start with the assumption that the true and the assumed genotyping error models are the same: true-genotype-independent model with error rate (epsilon) equal to 1 in 200. We will be concerned with how well we can distinguish Parent-offspring pairs from unrelated, and full-sibling pairs from unrelated, and we also want to see if half-siblings might sometimes look like full siblings. So, we will create this ckmr object with the following relationships: PO, FS, HS, U. We do all that like this:

ex1_ckmr <- create_ckmr(
  D = afreqs_ready,
  kappa_matrix = kappas[c("PO", "FS", "HS", "U"), ],
  ge_mod_assumed = ge_model_TGIE,
  ge_mod_true = ge_model_TGIE,
  ge_mod_assumed_pars_list = list(epsilon = 0.005),
  ge_mod_true_pars_list = list(epsilon = 0.005)
)

Now, when we print this object we get a little summary of what is in it:

ex1_ckmr
#> A ckmr object with 78 loci having between 3 and 27 alleles.
#> Locus names: Unk.Plate_1_A01_Sat_GW603857_consensus.1, Unk.Plate_1_A11_Sat_GE820299_consensus.2,   ...   , Unk.tag_id_981.78
#> Relationships:  PO, FS, HS, U
#> Assumed and true genotyping error models are the same

This is an S3 object of class ckmr.

Simulate genotype pairs and calculate their log-probabilities (`simulate_Qij()`)

Once we have created our ckmr object, we can use it to efficiently simulate the multilocus genotypes of pairs of individuals (of different relationships). At the same time, we can calculate the log-probability of those genotypes of simulated pairs of individuals under the assumption that they are of one relationship type or another. Those log genotype probabilities are the main ingredient for computing the log-likelihood ratios used for inferring relationships.

The main function that we use in this step is simulate_Qij(). This function simulates genotypes from different relationships, and then for each simulated genotype pair it also calculates the log probability of the pair of genotypes conditional on the pair being of one or several relationships. This function has four main inputs that we discuss here (the others allow the addition of missing data and physical linkage, and are discussed in other vignettes).

C : the ckmr object to use for simulation and probabilty calculation.
sim_relats : the set of true relationships you want to simulate genotypes from.
calc_relats : the set of assumed relationships you wish to compute genotype probabilities for, from the simulated data.
reps : for each relationship in sim_relats, the number of genotype pairs to simulate. This is, by default 10,000.

The arguments sim_relats and calc_relats can use some extra discussion here. Keep in mind that we are going to use the output of this function to learn about the distribution of likelihood ratios. For example, if we are trying to identify parent-offspring pairs, we will typically use the ratio of he probability of a pair’s genotypes given that they are parent-offspring, divided by the probability of the pair’s genotypes given that they are unrelated. We can denote such a likelihood ratio by PO/U. If we are looking for full sibling pairs, we would, rather, typically use the likelihood ratio FS/U. If, on the other we hand we wanted to know how well full-siblings might be resolved from half-siblings, we could use FS/HS. Whatever the target of one’s question, in order for CKMRsim to calculate such likelihood ratios from the simulated data, it is necessary to compute the probabilities that are used in the ratios—that is what the calc_relats argument is for. You should set it to a vector of the relationships that occur in the numerator or the denominator of the likelihood ratios that you might want to investigate. For example, if we want to investigate the distribution of the ratios PO/U, FS/U, and FS/HS, then calc_relats should be c("PO", "FS", "HS", "U").

sim_relats refers to the set of relationships that pairs of individuals are simulated from. You will typically be interested in the distribution of a likelihood ratio when the truth is relationship X. For example, if doing parentage inference, then to compute false positive and false negative rates, you will need to know the distribution of PO/U when pairs are truly unrelated (U) and when they are truly parent-offspring, (PO). But, you might also be interested to know what the distribution of PO/U looks like when the pairs are actually full-siblings. If you were interested in all of the above, then you would set sim_relats equal to c("PO", "FS", "U").

In this vignette, for our current example, we will assume that we want to know whether our set of markers has sufficient power for identifying PO pairs and FS pairs from U pairs by using the likelihood ratios PO/U and FS/U. So, we will clearly want calc_relats = c("PO", "FS", "U"). But, let us also imagine that we are curious whether individuals that are truly HS might be mistakenly identified as FS; in such a case we will be curious about the distribution of FS/U when the truth is HS, so we will also need to simulate from HS. Hence we will use `sim_relats = c(“PO”, “FS”, “HS”, “U”)

It is worth pointing out at this point that any relationship found in calc_relats or sim_relats must have been included as a rowname in the kappa_matrix argument to create_ckmr()!

Here we simulate all the genotype-pair probabilities we will need:

ex1_Qs <- simulate_Qij(ex1_ckmr, 
                       calc_relats = c("PO", "FS", "U"),
                       sim_relats = c("PO", "FS", "HS", "U") )
#> Simulating unlinked markers from Y_l_true matrices for relationship: PO
#> Simulating unlinked markers from Y_l_true matrices for relationship: FS
#> Simulating unlinked markers from Y_l_true matrices for relationship: HS
#> Simulating unlinked markers from Y_l_true matrices for relationship: U

If we print the resulting S3 object (of class Qij), we get a little information about it:

ex1_Qs
#> A Qij object with 10000 reps
#> simulated with markers unlinked
#> "sim_relats" relationships: PO, FS, HS, U
#> "calc_relats"   relationships: PO, FS, U
#> rando_miss_n: 0

Compute log-likelihood ratios to view them (`extract_logls()`)

This next step is not entirely necessary, but it helps to create pictures to understand what we are doing here, and the extract_logls() function lets us get simulated log-likelihood ratios out of the Qij object we created above and plot a histogram or a density plot of the distributions. This is helpful for developing intuition about things and understanding what is going on.

The main function here is extract_logls(). It simply takes an object of class Qij and uses the information within it to compute log-likelihood ratios from all the genotype probabilities that are stored within it. There is one small twist: the numerator and denominator of the likelihood ratios can be mixtures of different relationship categories. For example, instead of just computing PO/U, you might want to compute a log likelihood ratio that reflects your belief that, of the genotyping pairs you are sampling, 95% are unrelated and the other 5% are full siblings. Then you might want to base your inference on a log-likelihood ratio that looked like: PO/(0.95 * U + 0.05 * FS).

Often you will not want to do that, but, because extract_logls() is designed to do that, you have explicitly tell it how much weight to give each relationship category in the numerator and the denominator of the log likelihood ratio.

An example here should help. Let’s say we are interested in the distribution of the log likelihood ratio PO/U. Then we would do:

PO_U_logls <- extract_logls(ex1_Qs,
                            numer = c(PO = 1),
                            denom = c(U = 1))

That returns a tibble of log likelihood values for the numerator and the denominator and also the ratio of the two, one row for each simulated value in ex1_Qs (in the present case that is 10,000 values for each of the true relationship being PO, FS, HS, and U):

PO_U_logls
#> # A tibble: 40,000 × 11
#>    simtype  PO_sim  rando_miss_n numer_wts denom_wts true_relat   rep numer_logl
#>    <chr>    <chr>          <dbl> <chr>     <chr>     <chr>      <int>      <dbl>
#>  1 unlinked not_fo…            0 PO=1      U=1       PO             1      -226.
#>  2 unlinked not_fo…            0 PO=1      U=1       PO             2      -245.
#>  3 unlinked not_fo…            0 PO=1      U=1       PO             3      -218.
#>  4 unlinked not_fo…            0 PO=1      U=1       PO             4      -223.
#>  5 unlinked not_fo…            0 PO=1      U=1       PO             5      -229.
#>  6 unlinked not_fo…            0 PO=1      U=1       PO             6      -215.
#>  7 unlinked not_fo…            0 PO=1      U=1       PO             7      -227.
#>  8 unlinked not_fo…            0 PO=1      U=1       PO             8      -242.
#>  9 unlinked not_fo…            0 PO=1      U=1       PO             9      -245.
#> 10 unlinked not_fo…            0 PO=1      U=1       PO            10      -247.
#> # ℹ 39,990 more rows
#> # ℹ 3 more variables: denom_logl <dbl>, logl_ratio <dbl>, true_log_prob <dbl>

We can visualize the distribution of each of these quite easily with ggplot:

ggplot(PO_U_logls,
            aes(x = logl_ratio, fill = true_relat)) +
  geom_density(alpha = 0.25)

OK, that shows that there is very little overlap in the PO/U log likelihood ratio between unrelated pairs and parent-offspring pairs; however pairs that are truly full siblings are quite likely to have a high PO/U log likelihood ratio, and even half-sib pairs have considerable overlap in PO/U log likelihood ratio values with true parent-offspring pairs.

Let’s look at the above plot, but limit our focus to PO and U pairs:

ggplot(PO_U_logls %>% filter(true_relat %in% c("PO", "U")),
            aes(x = logl_ratio, fill = true_relat)) +
  geom_density(alpha = 0.25)

This shows that if we are only trying to distinguish PO pairs from U pairs, then we should apparently do quite well if we say that every pair we see with a PO/U logl ratio (that is the shorthand we will use, henceforth, for “log likelihood ratio”) greater than, say, 5, is a PO pair. From the picture, it appears that there is no probability density under the U curve above 0 (and certainly not 5). However you must keep in mind that, if you are comparing, say, 5000 offspring to 5000 parents, then there are 25 million pairs (almost all of them unrelated), which equates to 25 million chances to incorrectly declare one of those unrelated pairs a parent-offspring pair. What is required here is some way to compute accurately, the very small probability that an unrelated pair has a PO/U logl ratio > 5. Because, even if that probability is only 1 in a million, you will still expect 25 of the unrelated pairs to have a PO/U logl ratio that exceeds 5!

The next section discusses the CKMRsim function mc_sample_simple() that allows these small probabilities to be estimated.

Before we proceed there, let us also imagine that we might be interested in distinguishing full-siblings from unrelated pairs. In that case, it would be best to use the FS/U logl ratio. The distributions of those (for the truth being either FS or U) can be visualized like so:

FS_U_logls <- extract_logls(ex1_Qs,
                            numer = c(FS = 1),
                            denom = c(U = 1))

ggplot(FS_U_logls %>% filter(true_relat %in% c("FS", "U")),
            aes(x = logl_ratio, fill = true_relat)) +
  geom_density(alpha = 0.25)

Estimating False Negative and False Positive Rates (`mc_sample_simple()`)

The function mc_sample_simple() lets you estimate false positive rates and false negative rates from the simulated values in an object of class Qij. The function is configured so that it will automatically compute both regular or “vanilla” Monte Carlo estimates of these probabilities, and also importance sampling (IS) Monte Carlo estimates, which are particularly good for estimating very small probabilities.

By default, it uses importance sampling to compute false positive rates that are associated with false-negative rates of 0.3, 0.2, 0.1, 0.05, 0.01, and 0.001, but you can set those false negative rates to be whatever you wish.

Here is a simple use case to estimate false positive rates when the true relationship is U, but we are looking for PO pairs. This will, by default, use importance sampling.

ex1_PO_is <- mc_sample_simple(ex1_Qs, 
                 nu = "PO",
                 de = "U")

ex1_PO_is
#> # A tibble: 5 × 10
#>     FNR      FPR       se num_nonzero_wts Lambda_star pstar mc_method numerator
#>   <dbl>    <dbl>    <dbl>           <int>       <dbl> <chr> <chr>     <chr>    
#> 1  0.01 4.36e- 9 3.35e-10            9900        14.2 PO    IS        PO       
#> 2  0.05 2.13e-10 1.01e-11            9500        18.4 PO    IS        PO       
#> 3  0.1  2.77e-11 1.05e-12            9000        20.8 PO    IS        PO       
#> 4  0.2  1.84e-12 5.81e-14            8000        23.9 PO    IS        PO       
#> 5  0.3  2.32e-13 6.68e-15            7000        26.2 PO    IS        PO       
#> # ℹ 2 more variables: denominator <chr>, true_relat <chr>

This shows false positive rates (FPR) of around \(10^{-6}\) and smaller when the false negative rate (FNR) is 0.01 and greater.

If we wanted to see what the results would look like if we were using a logl ratio of 5 as a cutoff, we could do this:

ex1_PO_is_5 <- mc_sample_simple(ex1_Qs, 
                 nu = "PO",
                 de = "U", 
                 lambda_stars = 5)

ex1_PO_is_5
#> # A tibble: 6 × 10
#>      FNR      FPR       se num_nonzero_wts Lambda_star pstar mc_method numerator
#>    <dbl>    <dbl>    <dbl>           <int>       <dbl> <chr> <chr>     <chr>    
#> 1 0.0001 7.41e- 7 3.60e- 7            9999         5   PO    IS        PO       
#> 2 0.01   4.36e- 9 3.35e-10            9900        14.2 PO    IS        PO       
#> 3 0.05   2.13e-10 1.01e-11            9500        18.4 PO    IS        PO       
#> 4 0.1    2.77e-11 1.05e-12            9000        20.8 PO    IS        PO       
#> 5 0.2    1.84e-12 5.81e-14            8000        23.9 PO    IS        PO       
#> 6 0.3    2.32e-13 6.68e-15            7000        26.2 PO    IS        PO       
#> # ℹ 2 more variables: denominator <chr>, true_relat <chr>

We see we would have had a false positive rate around 7.41e-07, if we used a logl ratio cutoff of 5. Since there are 1847 adults and 4244 juvniles in this study, there will be \(1847 \times 4244 = 7,838,668\) pairs being tested for being parent-offspring pairs. A per-pair FPR of 7.41e-07 would leave us with an expected number of false positive pairs on the order of 6.

My general recommendation for being confident about not erroneously identifying unrelated individuals as related pairs is to require that the FPR be about 10 to 100 times smaller than the reciprocal of the number of comparisons. So, in this case,

0.1 * (4244 * 1847) ^ (-1)
#> [1] 1.275727e-08

or smaller would be a good FPR to shoot for. That would be a logl ratio cutoff of between 12 and 15. For fun, let’s look at FPRs and FNRs at a number of such cutoffs:

ex1_PO_is_8_12 <- mc_sample_simple(ex1_Qs, 
                 nu = "PO",
                 de = "U", 
                 lambda_stars = seq(12, 15, by = .1))
ex1_PO_is_8_12
#> # A tibble: 36 × 10
#>       FNR     FPR       se num_nonzero_wts Lambda_star pstar mc_method numerator
#>     <dbl>   <dbl>    <dbl>           <int>       <dbl> <chr> <chr>     <chr>    
#>  1 0.0046 1.65e-8 2.03e- 9            9954        12   PO    IS        PO       
#>  2 0.0049 1.48e-8 1.75e- 9            9951        12.1 PO    IS        PO       
#>  3 0.005  1.42e-8 1.67e- 9            9950        12.2 PO    IS        PO       
#>  4 0.0052 1.32e-8 1.52e- 9            9948        12.3 PO    IS        PO       
#>  5 0.0056 1.15e-8 1.26e- 9            9944        12.4 PO    IS        PO       
#>  6 0.0056 1.15e-8 1.26e- 9            9944        12.5 PO    IS        PO       
#>  7 0.0058 1.08e-8 1.16e- 9            9942        12.6 PO    IS        PO       
#>  8 0.006  1.02e-8 1.06e- 9            9940        12.7 PO    IS        PO       
#>  9 0.0062 9.57e-9 9.76e-10            9938        12.8 PO    IS        PO       
#> 10 0.0063 9.30e-9 9.40e-10            9937        12.9 PO    IS        PO       
#> # ℹ 26 more rows
#> # ℹ 2 more variables: denominator <chr>, true_relat <chr>

Let’s find the logl ratio value that gives the largest FPR smaller than 1.275727e-08:

ex1_PO_is_8_12 %>%
  filter(FPR > 1.275727e-08) %>%
  arrange(FPR) %>%
  slice(1)
#> # A tibble: 1 × 10
#>      FNR       FPR      se num_nonzero_wts Lambda_star pstar mc_method numerator
#>    <dbl>     <dbl>   <dbl>           <int>       <dbl> <chr> <chr>     <chr>    
#> 1 0.0052   1.32e-8 1.52e-9            9948        12.3 PO    IS        PO       
#> # ℹ 2 more variables: denominator <chr>, true_relat <chr>

This suggests that requiring a logl ratio cutoff of about 15 should mean almost no false positives, but perhaps a false negative or two (if the genotyping error rate is accurate).

What about for finding full-siblings?

We can do the same for full siblings. Note that with 4244 juveniles that we might sift through, looking for full siblings, we will be making 4244 * 4243 / 2 comparisons, which is about 9 million. So, we would like to have an FPR of about 1e-08 or less, and that corresponds to a false negative rate of about 0.2.

ex1_FS_is <- mc_sample_simple(ex1_Qs, 
                 nu = "FS",
                 de = "U")
ex1_FS_is
#> # A tibble: 5 × 10
#>     FNR      FPR       se num_nonzero_wts Lambda_star pstar mc_method numerator
#>   <dbl>    <dbl>    <dbl>           <int>       <dbl> <chr> <chr>     <chr>    
#> 1  0.01 1.70e- 6 1.58e- 7            9900        7.92 FS    IS        FS       
#> 2  0.05 2.19e- 8 1.14e- 9            9500       13.5  FS    IS        FS       
#> 3  0.1  2.16e- 9 8.78e-11            9000       16.4  FS    IS        FS       
#> 4  0.2  8.17e-11 2.92e-12            8000       19.9  FS    IS        FS       
#> 5  0.3  5.98e-12 1.99e-13            7000       22.6  FS    IS        FS       
#> # ℹ 2 more variables: denominator <chr>, true_relat <chr>

Note! The results here differ from those reported in the data for Baetscher et al. (2019) because we are using a different genotyping error model here, with a higher error rate than is likely…

Actually doing the comparisons

Now that we have computed the power for relationship inference and identified reasonable logl ratio cutoffs for doing pairwise relationship inference, the next step is to actually do all the pairwise comparisons and find kin pairs.

Before we do that, it is always good practice to first look for duplicate samples.

Screen for duplicate samples (`find_close_matching_genotypes()`)

Believe it or not, when your lab is handling tens of thousands of samples, it is quite possible that the same DNA ended up in two different wells that are supposed to represent different individuals. Fortunately, it is easy to find duplicate samples because they typically share the same genotype at all loci (apart from a few that would be due to genotyping errors.)

We have a function find_close_matching_genotypes() for that. It operates on the long genotypes data frame, and also uses some information within the ckmr object that we created from the allele frequencies computed from that long genotype data frame. It returns all pairs of individuals with fewer than max_mismatch loci at which genotypes mismatch.

matchers <- find_close_matching_genotypes(LG = long_genos,
                                          CK = ex1_ckmr,
                                          max_mismatch = 6)

matchers
#> [1] indiv_1      indiv_2      ind1         ind2         num_mismatch
#> [6] num_loc     
#> <0 rows> (or 0-length row.names)

This suggests that any duplicate genotypes have already been removed from this data set. Good.

Compute Logl Ratios for All Pairwise Comparisons (`pairwise_kin_logl_ratios()`)

We have a simple function that computes the log likelihood ratio for all pairwise comparisons between two sets of individuals. Each set of individuals is specified as a long-format genotype data frame that derives from the long-format genotype data frame that went into making the allele frequencies for the ckmr object. You pass it two such long-format genotype data frames, the associated ckmr object, and then you specify the relationship you want in the numerator and in the denominator for the log likelihood ratio. Let’s use it first to compare all parents to all offspring.

Looking for parent offspring pairs

First, break long_genos into two data frames:

parent_ids <- life_stages %>%
  filter(stage == "adult") %>%
  pull(NMFS_DNA_ID)

offspring_ids <- life_stages %>%
  filter(stage == "juvenile") %>%
  pull(NMFS_DNA_ID)

candidate_parents <- long_genos %>% 
  filter(Indiv %in% parent_ids)

candidate_offspring <- long_genos %>% 
  filter(Indiv %in% offspring_ids)

Then, use those:

po_pairwise_logls <- pairwise_kin_logl_ratios(D1 = candidate_parents, 
                                              D2 = candidate_offspring, 
                                              CK = ex1_ckmr,
                                              numer = "PO",
                                              denom = "U",
                                              num_cores = 1)
# note, num_cores is set to 1 because more than that breaks
# it on CRAN checks.  But, leave it blank (i.e., don't use the
# num_core option) to, by default, use all
# your cores in parallel on a non-windows machine.

If desired we can retain only those with logl ratio > 12, which is in the vicinity of a reasonable cutoff of about 15 which we determined through simulation.

po_pairwise_logls %>%
  filter(logl_ratio > 12) %>%
  arrange(desc(logl_ratio))
#> # A tibble: 8 × 4
#>   D2_indiv D1_indiv logl_ratio num_loc
#>   <chr>    <chr>         <dbl>   <int>
#> 1 R031608  R019764        43.3      78
#> 2 R015015  R012195        40.0      78
#> 3 R014780  R016256        34.0      78
#> 4 R015337  R015977        31.6      78
#> 5 R031274  R019641        29.8      78
#> 6 R015263  R019881        24.6      78
#> 7 R013607  R019760        17.6      78
#> 8 R029094  R015684        15.0      75

Looking for full-sib pairs among the juveniles

We can look for full-sib pairs among the juveniles using the same function. We just pass the same set of offspring genotypes in for both the D1 and the D2 parameters to the function.

fs_pairwise_logls <- pairwise_kin_logl_ratios(D1 = candidate_offspring,
                                              D2 = candidate_offspring,
                                              CK = ex1_ckmr,
                                              numer = "FS",
                                              denom = "U", 
                                              num_cores = 1)
#> D1 and D2 are identical: dropping self comparisons and keeping only first instance of each pair

# note, num_cores is set to 1 because more than that breaks
# it on CRAN checks.  But, leave it blank (i.e., don't use the
# num_core option) to, by default, use all
# your cores in parallel on a non-windows machine.

Now, look at all those with logl_ratio > 12 as well, (also close to about 14 or 15, which we determined from simulation to be a reasonable cutoff value, albeit with a high false negative rate.)

fs_pairwise_logls %>%
  filter(logl_ratio > 12) %>%
  arrange(desc(logl_ratio))
#> # A tibble: 41 × 4
#>    D2_indiv D1_indiv logl_ratio num_loc
#>    <chr>    <chr>         <dbl>   <int>
#>  1 R014981  R014982        73.7      76
#>  2 R021481  R027461        63.6      78
#>  3 R031276  R031280        58.8      78
#>  4 R014137  R014139        45.0      77
#>  5 R013839  R014675        42.9      78
#>  6 R020421  R027348        41.4      78
#>  7 R021571  R025619        41.0      78
#>  8 R014520  R014536        38.8      78
#>  9 R021500  R021511        36.5      78
#> 10 R015404  R020276        36.0      78
#> # ℹ 31 more rows

There are two steps that really should be done before that:

Observe the distribution of internal heterozygosity across all individuals in the data set. This will be used to compare to values from kin pairs to detect cases where, for example, two individuals of the wrong species are identified as kin in a data set, or when two individuals look similar because they each suffered some odd contamination or genotyping problem.

Literature Cited

Baetscher, Diana S., Eric C. Anderson, Elizabeth A. Gilbert-Horvath, Daniel P. Malone, Emily T. Saarman, Mark H. Carr, and John Carlos Garza. 2019. “Dispersal of a Nearshore Marine Fish Connects Marine Reserves and Adjacent Fished Areas Along an Open Coast.” Molecular Ecology 28 (7): 1611–23. https://doi.org/10.1111/mec.15044.

Baetscher, Diana S., Anthony J. Clemento, Thomas C. Ng, Eric C. Anderson, and John C. Garza. 2018. “Microhaplotypes Provide Increased Power from Short-Read DNA Sequences for Relationship Inference.” Molecular Ecology Resources 18 (2): 296–305. https://doi.org/10.1111/1755-0998.12737.

Eric C. Anderson

2024-07-31