Finding Kin-Pairs with Genetic Data
The Wildlife Society CKMR Workshop, Sunday November 6, 2022
Basic Overview of the Problem
- Finding kin pairs involves repeatedly asking the question, “Are you my mother (or brother, or half-sibling, etc.)?”
- You ask that question for every pair of individuals in your data set.
- How many times is that?
- Lots of chances to answer the question incorrectly
Why are we doing this pairwise?
Why not do full, joint pedigree reconstruction?
After all, doing so let’s you use all the available data
Because:
- CKMR works best in sparse situations
- Little advantage to “joint”, multi-individual inference (Colony, MasterBayes, Pedfactory) when true relationships are few and far between.
- (Joint inference just takes a lot longer…)
- Much harder to assess False Positive Rates in joint-inference,
full-pedigree reconstruction methods than in pairwise inference.
Getting Statistical about “Are you my mother?”
Cast the question as a statistical test applied to pairs:
- \(H_0\) the pair is Unrelated
- \(H_A\) the pair is Parent-offspring (or full-sib, etc.)
Then, if it were the case that all pairs were either Unrelated or Parent-offspring
And, if we fixed our false positive rate (FPR) to be \(\alpha\)…
Then, amongst all possible statistical tests, the one with the highest power (lowest false negative rate) will be based on the likelihood ratio. \[
\frac{P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~\mathrm{PO})}
{P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~\mathrm{Unrelated})}
\] Where \(G_i\) and \(G_j\) denote the observed genotypes in the two individuals of the pair.
This holds for any two relationships: \[
\frac{P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~R_1)}
{P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~R_2)}
\] is a most powerful test (Neyman-Pearson Lemma).
Review
We previously showed we can use \(\boldsymbol{\kappa}(R)\) (the \(\kappa\) coefficients for relationship \(R\)) to calculate the joint probability \(P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~R)\):
\[
\begin{aligned}
P(G_i, G_j|\boldsymbol{\kappa}) &= \kappa_0 P(G_i)P(G_j) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{\tiny\mathrm{no~gene~copies~IBD}} \\
&+ \kappa_1 P(G_i)P(G_j | G_i=\mathrm{ma}, G_j=\mathrm{kid}) ~~~~~~~~~{\tiny\mathrm{1~gene~copy~IBD}} \\
&+ \kappa_2 P(G_i)\mathcal{I}\{P(G_j)=P(G_i)\} ~~~~~~~~~~~~~~~~~~~~~~~~~~{\tiny\mathrm{2~gene~copies~IBD}} \\
\end{aligned}
\] where \(\mathcal{I}\{\cdot\}\) is an indicator function returning 1 when what is inside it is true, and 0 otherwise.
What about at multiple loci?
With \(L\) different loci, writing the genotype at the \(\ell^\mathrm{th}\) locus as \(G^{(\ell)}\) we can use: \[
P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}(R)) = \prod_{\ell = 1}^L P(G_i^{(\ell)}, G_j^{(\ell)}~|~\boldsymbol{\kappa}(R))
\] This is actually the correct probability so long as:
- The markers are not in linkage disequilibrium
- The markers are not physically linked to one another
So, assumption (2) often gets made for this calculation, even when it is known to be wrong. (We come back to this later.)
Back to likelihood ratios
Using that (approximate) joint probability of the genotypes we can compute the (approximate) likelihood ratio suggested by the NP-lemma: \[
\mathrm{For~PO~vs~Unrelated~pairs:}~~~~~~~~~~~~~~~~~~~\frac{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}= (0, 1,0))}
{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}= (1, 0,0))}
\]
Let’s give the log of that a name: \(\Lambda\):
\[
\Lambda_{PO/U} = \log\biggl(\frac{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}= (0, 1,0))}
{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}= (1, 0,0))} \biggr)
\]
It will be convenient later to recognize this as a sum:
\[
\begin{aligned}
\Lambda_{PO/U} & = & \log\biggl(
\prod_{\ell = 1}^L \frac{P(G_i^{(\ell)}, G_j^{(\ell)}~|~\boldsymbol{\kappa}= (0, 1, 0))}
{P(G_i^{(\ell)}, G_j^{(\ell)}~|~\boldsymbol{\kappa}= (1, 0, 0))}
\biggr) \\
& = &
\sum_{\ell = 1}^L\log\biggl(
\frac{P(G_i^{(\ell)}, G_j^{(\ell)}~|~\boldsymbol{\kappa}= (0, 1, 0))}
{P(G_i^{(\ell)}, G_j^{(\ell)}~|~\boldsymbol{\kappa}= (1, 0, 0))}
\biggr)
\end{aligned}
\]
Wait! What about Genotyping error?
Good point.
In our survey of genotyping technologies, we noted that they all can produce errors to some extent.
Errors are most problematic in assessing likelihoods for the PO relationship.
These probabilities can be computed in a way that accounts for genotyping error: \[
\Lambda_{PO/U} = \log\biggl(\frac{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\mu}, \boldsymbol{\kappa}= (0, 1,0))}
{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\mu}, \boldsymbol{\kappa}= (1, 0,0))} \biggr)
\]
Error model can be simple (or complex)
R package ‘CKMRsim’ (we will use it this afternoon) can accommodate all possible single-locus genotyping models.
How do we use these \(\Lambda\)’s?
- Truly PO pairs tend to have higher values of \(\Lambda_{PO/U}\) than do unrelated pairs.
- The distribution of \(\Lambda\) is not analytically tractable
- So, investigate it through Monte Carlo simulation.
So, Set a threshold \(\Lambda_c\). For example:
FALSE NEGATIVE RATE:
probability that a true kin pair (PO in this case) has \(\Lambda \leq \Lambda_c\).
FALSE POSITIVE RATE:
probability that a non kin pair (U in this case) has \(\Lambda > \Lambda_c\).
Hopes and Desires
- Obviously you want FNR as low as possible, but FPR must be low as well.
- How low must FPR be? That depends on the number of pairwise comparisons.
- 10,000 candidate parents and 10,000 candidate offspring = \(10^{8}\) pairwise comparisons.
- I think you should aim for FPR at least 100 times smaller than the reciprocal of the number of pairwise comparisons; preferably smaller.
- How do you estimate such small probabilities?
Calculating the FPR
\[
\mathrm{FPR}_{\Lambda_c} =
\sum_{\Lambda_{PO/U}}\mathcal{I}\{ \Lambda_{PO/U} > \Lambda_c\}
P(\Lambda_{PO/U}~|~i,j~\mathrm{Unrelated})
\approx
\frac{1}{M}\sum_{m=1}^M
\mathcal{I}\{ \Lambda_{PO/U}^{(m)} > \Lambda_c\}
\] with each \(\Lambda^{(m)}_{PO/U}\) simulated from its distribution assuming the pair is unrelated.
\[
\begin{aligned}
\mathrm{FPR}_{\Lambda_c} &=
\sum_{\Lambda_{PO/U}}\mathcal{I}\{ \Lambda_{PO/U} > \Lambda_c\}
\frac{P(\Lambda_{PO/U}~|~i,j~\mathrm{Unrelated})}
{P(\Lambda_{PO/U}~|~i,j~\mathrm{PO})}
P(\Lambda_{PO/U}~|~i,j~\mathrm{PO}) \\
& \approx \frac{1}{M}\sum_{m=1}^M
\biggl[\mathcal{I}\{ \Lambda_{PO/U}^{(m)} > \Lambda_c\}
\frac{P(\Lambda^{(m)}_{PO/U}~|~i,j~\mathrm{Unrelated})}
{P(\Lambda^{(m)}_{PO/U}~|~i,j~\mathrm{PO})}\biggr]
\end{aligned}
\] with each \(\Lambda^{(m)}_{PO/U}\) simulated from the importance sampling distribution, \(P(\Lambda^{(m)}_{PO/U}~|~i,j~\mathrm{PO})\).
- Useful for calculating FPR for comparison of any relationship to Unrelated,
regardless of physical linkage.
Decreasing the FPR—Three main ways
- Increase \(\Lambda_c\), thus increasing the FNR, too.
Easy to accommodate this in CKMR models if you can estimate the FNR.
Decreasing the FPR—Three main ways
- Increasing the number of loci. Example HS vs Unrelated:
Decreasing the FPR—Three main ways
- Increasing the heterozygosity of each locus
- Using markers with more alleles:
Decreasing the FPR—Three main ways
- Increasing the heterozygosity of each locus
- Or, just by having allele frequencies that are closer to even.
- Example: 100 SNPs at minor allele frequency of 0.05, 0.1, 0.2, 0.3, 0.5:
How does physical linkage affect all this?
The good:
- If you have dense genomic data and a recombination map, you can distinguish between HS pairs and Aunt-niece pairs
- Different expected distribution of the lengths of IBD segments.
- Much more complex calculations and no one has done this in the context of CKMR AFAIK.
- If you don’t know where the markers reside in the genome, or you have no good linkage map you can still compute \(\Lambda\). It is a valuable statistic even if it is not the correct probability.
The bad and the ugly:
- Physical linkage increases the true variance in \(\Lambda\) in all but three relationship categories.
- More overlap between the \(\Lambda\) values from different categories.
How does physical linkage affect the distribution of \(\Lambda\)?
- Up till now, all simulations have assumed no linkage.
- This means that each locus, independently, is simulated to have 0, 1, or 2 gene copies IBD, according to \(\boldsymbol{\kappa}\).
- In reality, the IBD status of nearby sites in the genome will be correlated, and so their contributions to the sum: \[
\Lambda = \sum_{\ell = 1}^L\log\biggl( \frac{P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}_\mathrm{numer}}{P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}_\mathrm{denom}}\biggr)
\] will also be positively correlated.
- So the variance of that sum is increased.
- Think \(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\mathrm{Cov}(X, Y)\).
The Effect of Physical Linkage (An example)
- With many markers, the effect of linkage can be profound.
- 2,569 genetic markers spread across 16 Chromosomes
A Quick Sketch of Kin-finding Operationally
- Compare each sample to every other.
- For each calculate a variety of \(\Lambda\)’s. Each one appropriate for distinguishing between a particular pair of relationships.
- \(\Lambda_{PO/U}\)
- \(\Lambda_{FS/U}\)
- \(\Lambda_{PO/FS}\)
- \(\Lambda_{FS/HS}\)
- \(\Lambda_{HS/HAN}\)
- etc.
- Identify the most related pairs first (PO, FS)—they are the easiest.
- Continue down through half-siblings, while paying particular care to distinguish them from half-aunt-niece.
- Compare empirical to theoretical distributions of pairwise likelihoods to get a sense for how your genetic model fits.
Wrap-Up
- We’ve had a fairly high-level overview of how kin-finding proceeds.
- In the afternoon session we will get in and do it with the bearded seal genetic data set.
Parting thought
- Our capacity to assemble high quality genomes is rapidly expanding.
- Mapping our data to a good genome will tell us where those markers reside in the genome.
- We don’t typically have linkage maps.
- But the genome coordinates are probably enough
- It’s time to start computing likelihoods whilst accounting for physical linkage.