CKMRsim-writing-geno-error-funcs.Rmd
CKMRsim
allows users to specify genotyping error models
for locus \(\ell\) in terms of a matrix
\(\mathbf{C}_\ell\). If there are \(G\) possible genotypes, then \(\mathbf{C}_\ell\) is a \(G \times G\) matrix where the element in
row \(s\) and column \(t\) is the probability that an individual
truly carrying genotype \(s\) is
observed to have genotype \(t\). This
model covers any possible type of genotyping error in which errors are
independent between different loci and are also independent between
different individuals. That is a wide class of models.
CKMRsim
comes with some “pre-packaged” genotyping error
models, but users might want to create their own that are tailored to
the particular marker types that they work on. Currently, the way to
define a genotyping error model is to define a function that takes an
input L
, which carries information about the locus, and any
additional named parameters (like genotyping error rates.) At some
point, I will reimplement most of the list structures in
CKMRsim
as list-columns in tibbles, and then it will be
much easier to define locus-specific, and even allele-specific, error
rates, etc. For now, it is a little tough to do that cleanly and
programmatically on a large scale…
In this vignette, I will show how I created genotyping error functions for two of the “pre-packaged” versions: True-Genotype-Independent Errors (TGIE), and a model tailored for microsatellites that includes allele mis-calling and allele dropout.
Every genotyping error model function should have a name starting
with ge_model_
. For example we will name our TGIE model
ge_model_TGIE()
. This naming convention is not a
requirement, but it helps to keep things organized.
These ge_model_
functions are required to have one
argument named L
. This will be where information about the
locus gets passed into the function. Currently, the information
available within L
about the locus is contained in two list
components:
L$freqs
: a named vector of allele frequencies. The
names of the vector are the names of the corresponding alleles. So, for
example, if the names were microhaplotypes, like “ACCAT”, then that
information would be available for use. Alternatively, if the names are
microsatellite allele lengths, like “112”, then those would also be
available for creating the genotyping error model. Note that the alleles
are listed in this vector in the order in which CKMRsim represents them
internally (for the C++ based functions), so whenever you do anything
with the alleles, you want to keep them in this order!L$geno_freqs
: a named vector of genotype frequencies
ordered in the way that CKMRsim
has ordered the possible
diploid genotypes that can be formed from the different alleles. The
names are the names of the genotypes, which are merely the names of the
two alleles in the genotype, separated by space-slash-space. For example
“112 / 116”. (Note that, by default, CKMRsim names genotypes this way.
There should be no problem parsing such genotype names into the names of
the constituent alleles, so long as alleles never have spaces in their
names. Spaces in allele names would be a problem. If you have spaces in
your allele names, then please fix that!After the required L
argument, the
ge_model_
function can have any other arguments that
control the behavior of the genotyping error.
The return value of the function must be the \(G \times G\) matrix \(\mathbf{C}_\ell\) described above. The rows
of this matrix must sum to one, since they are probabilities. Note that
\(G\) is given by
length(L$geno_freqs)
.
True-genotype-independent error is a very simple genotyping error model that has been in use for a long time in relationship inference. The main goal of it is to ensure that genotyping errors don’t create zero-probability situations. It is not intended to model the actual genotyping error process, but it is mathematically tractable in a lot of situations, and it is useful in cases where little is known about the genotyping error process. One such situation occurs when someone gives you a file of genotypes that are coded up as integers, but those integers mean very little, i.e., they just denote separate alleles which could be microsatellite alleles (but not the actual lengths), or they could be microhaplotype alleles (but not the actual variant sequences), or they could be SNPs (but not the actual bases, etc).
We will pretend for now that we are dealing with such integer-coded
data and that we might have multiple alleles at each locus. For example,
the components of L
might look like this:
> L$freqs
1 2 3 4 5
6.329010e-01 3.662781e-01 4.104416e-04 2.462650e-04 8.208833e-05
6
8.208833e-05
> L$geno_freqs
1 / 1 1 / 2 1 / 3 1 / 4 1 / 5
4.005637e-01 4.636356e-01 5.195378e-04 3.117227e-04 1.039076e-04
1 / 6 2 / 2 2 / 3 2 / 4 2 / 5
1.039076e-04 1.341597e-01 3.006716e-04 1.804029e-04 6.013432e-05
2 / 6 3 / 3 3 / 4 3 / 5 3 / 6
6.013432e-05 1.684623e-07 2.021548e-07 6.738493e-08 6.738493e-08
4 / 4 4 / 5 4 / 6 5 / 5 5 / 6
6.064644e-08 4.043096e-08 4.043096e-08 6.738493e-09 1.347699e-08
6 / 6
6.738493e-09
This model is parameterized by a genotyping error rate, \(\epsilon\), and errors occur independently between individuals and loci. With probability \(1-\epsilon\) a genotyping error does not occur so that the observed genotype is the true genotype. With probability \(\epsilon\) the genotype is subject to genotying error, in which case, the observed value is drawn from the remaining genotypes, proportionally to their frequencies. Under such a model, using \(y_\ell\) to denote the observed genotype and \(x_\ell\) to denote the true genotype, we have the row \(s\) and column \(t\) entry of \(\mathbf{C}_\ell\) being the the conditional probability: \[ P(y_\ell = g_{t} | x_\ell = g_{s}) = \left\{ \begin{array}{ll} (1-\epsilon), & s = t \\ \epsilon q(g_{t}) / (1 - g_s), & s \neq t. \end{array} \right. \] where \(q(g_{s})\) is the frequency of genotype \(s\).
Thus, to implement TGIE, all we need to do is write code to return
the matrix \(\mathbf{C_\ell}\) given
the input L$geno_freqs
and another parameter
epsilon
. That looks like this:
#' implements a simple true-genotype-independent genotyping error model
#' @param L required locus specific information
#' @param epsilon the rate at which genotypes are incorrectly observed.
ge_model_TGIE <- function(L, epsilon = 0.01) {
# first make a matrix G x G matrix of the genotype frequencies
gmat <- matrix(rep(L$geno_freqs, length(L$geno_freqs)), byrow = TRUE, nrow = length(L$geno_freqs))
# Now, we make the diagonals 1 - epsilon exactly and rescale
# everything else to sum to epsilon.
diag(gmat) <- 0
gmat <- t(apply(gmat, 1, function(x) x * (epsilon/sum(x))))
diag(gmat) <- 1 - epsilon
# put the genotype names on for the row and column names since that can be
# handy down the road.
rownames(gmat) <- names(L$geno_freqs)
colnames(gmat) <- names(L$geno_freqs)
# return gmat, which is the matrix C.
gmat
}
That is all there is to it. And we can run it on a simple 2-allele example here:
library(CKMRsim)
L <- list(
freqs = c(A = 0.6, B = 0.4),
geno_freqs = c(`A / A` = 0.36, `A / B` = 0.48, `B / B` = 0.16)
)
C <- ge_model_TGIE(L, epsilon = 0.02)
C
#> A / A A / B B / B
#> A / A 0.980000000 0.01500000 0.005000000
#> A / B 0.013846154 0.98000000 0.006153846
#> B / B 0.008571429 0.01142857 0.980000000
It is always good to check that your function returns a matrix whose rows sum to one.
rowSums(C)
#> A / A A / B B / B
#> 1 1 1
Here we develop a microsatellite model that tries to capture a couple of features of microsatellites:
Anyway, we note here that we have both allelic mis-calls and dropout
errors. CKMRsim includes a function called
combine_miscalls_and_dropouts()
that makes it fairly easy
to implement the combination of these two types of errors, using a
simple model (see the paper Anderson, in prep, for details.) In order to
use this function one needs to specify two different sets of
probabilities:
It is important to note that each of the two copies of a gene at a locus has a chance to drop out or be miscalled. Thus, if you have a true genotype of \(CD\) and the \(C\) allele has a dropout rate of 0.01 and \(D\) a dropout rate of 0.02, then, a dropout will occur at the locus with probability \(0.01 + 0.02 = 0.03\).
To develop this model we will want to refer to an example locus. We use one of the ones from the Ruzzante et al. data set in one of the brook trout populations. This is here:
example_L_microsat
#> $freqs
#> 79 61 73 58 82 76
#> 0.465231788 0.269867550 0.125827815 0.099337748 0.028145695 0.006622517
#> 70 85
#> 0.003311258 0.001655629
#>
#> $geno_freqs
#> 79 / 79 79 / 61 79 / 73 79 / 58 79 / 82 79 / 76
#> 2.164406e-01 2.511019e-01 1.170782e-01 9.243016e-02 2.618854e-02 6.162010e-03
#> 79 / 70 79 / 85 61 / 61 61 / 73 61 / 58 61 / 82
#> 3.081005e-03 1.540503e-03 7.282849e-02 6.791369e-02 5.361607e-02 1.519122e-02
#> 61 / 76 61 / 70 61 / 85 73 / 73 73 / 58 73 / 82
#> 3.574405e-03 1.787202e-03 8.936012e-04 1.583264e-02 2.499890e-02 7.083023e-03
#> 73 / 76 73 / 70 73 / 85 58 / 58 58 / 82 58 / 76
#> 1.666594e-03 8.332968e-04 4.166484e-04 9.867988e-03 5.591860e-03 1.315732e-03
#> 58 / 70 58 / 85 82 / 82 82 / 76 82 / 70 82 / 85
#> 6.578659e-04 3.289329e-04 7.921802e-04 3.727907e-04 1.863953e-04 9.319767e-05
#> 76 / 76 76 / 70 76 / 85 70 / 70 70 / 85 85 / 85
#> 4.385773e-05 4.385773e-05 2.192886e-05 1.096443e-05 1.096443e-05 2.741108e-06
That shows what the L
argument to our genotyping error
model might look like.
When we implement this, here is how we get the features we describe above. First, to get the matrix \(\mathbf{W}\), taking account of proximity of different alleles (in terms of length) we do this:
Then, to get some sort of “large-allele dropout” effect, we will do this:
Here is a function that accomplishes that. This is just the function definition from the package.
#' a simple "length-aware" genotyping error model for microsatellites
#'
#' In this genotyping error model, mis-calls are more likely to nearby
#' allele lengths than distant ones, and larger alleles are more likely
#' to drop out than small or average-sized ones.
#'
#' @param L an element of the list created by \code{\link{long_markers_to_X_l_list}}. Such
#' an element basically holds the information at a single locus. The idea here
#' is that every ge_mod_* function takes in an object like L, and then can
#' use any piece of information in it about alleles or genotypes to
#' configure a genotyping error model. In the case of microsatellites,
#' the names of the alleles must be the allele lengths, like "64" or "112".
#' @param miscall_rate the rate at which alleles are miscalled.
#' @param miscall_decay the parameter (or a geometric distribution) that determines
#' how quickly the probability of the miscall being to a certain allele length
#' decreases as you get further and further away from the true allele length.
#' The larger this values, the more quickly the probability decays. For every
#' microsatellite allele step size further from the true allele size, the probability
#' increases by a factor of 1 - miscall_decay. Thus, if miscall_decay = 1
#' then all errors are a step-size of one away, and if it is 0, then
#' the miscall is equally likely to occur to any other allele length.
#' @param dropout_rate the base rate at which alleles of "typical" size dropout.
#' @param dropout_scale_factor the rate of drop out increases for alleles
#' larger than the (frequency-weighted) average size of allele by a factor of
#' dropout_scale_factor times the z-score of the allele length.
#' @export
#'
ge_model_microsat1 <- function(L,
miscall_rate = 0.005,
miscall_decay = 0.8,
dropout_rate = 0.01,
dropout_scale_factor = 0.25
) {
# first, compute the step-size as the minimum distance between any two alleles
lengths <- as.integer(names(L$freqs))
diffs <- outer(lengths, lengths, function(x,y) abs(x - y))
diag(diffs) <- NA
min_diff = min(diffs, na.rm = TRUE)
step_size = min_diff
# now, we compute the decay factors between all pairs of alleles,
# and then we multiply that by the base-miscall rate to get the
# matrix W of allele-to-allele miscall rates.
steps <- diffs/step_size
preW <- miscall_decay * (1 - miscall_decay) ^ (steps - 1)
# and those need to be normalized so that the off-diagonals sum to miscall_rate
W <- t(apply(preW, 1, function(x) x * miscall_rate / sum(x, na.rm = TRUE)))
# and we put 1 - miscall_rate on the diagonal
diag(W) <- 1 - miscall_rate
# and we MUST put the allele lengths on there as the names
rownames(W) <- names(L$freqs)
colnames(W) <- names(L$freqs)
# Now, let's make the vector D of dropout rates.
# first get the mean allele length
mean_len <- sum(lengths * L$freqs)
# and then we get the variance and the standard deviation.
# Note that we have the freqs of each lengths, but not the sample
# sizes, so we can't do the n-1 sort of correction for variance.
var_len <- sum( L$freqs * (lengths - mean_len) ^ 2)
sd_len <- sqrt(var_len)
# now compute the z-scores
zscores <- (lengths - mean_len) / sd_len
# now we can compute the vector D of dropout probabilities
D <- rep(dropout_rate, length.out = length(L$freqs))
D[zscores > 0] <- dropout_rate * (1 + zscores[zscores > 0] * dropout_scale_factor)
names(D) <- names(L$freqs)
# And now we create the matrix C by pushing all these into the function
# that combines allelic mis-calls and dropouts. Note that by setting Ws
# to W here we assume that the occurrence of a dropout does not affect the
# rate at which mis-calls occur.
ret <- combine_miscalls_and_dropouts(geno_names = names(L$geno_freqs),
D = D,
W = W,
Ws = W)
ret
}
We can run that function with the default values to see what the results look like. The matrix is quite large (there are many genotypic states) but so we just look at the first 4 rows and columns.
default <- ge_model_microsat1(example_L_microsat)
default[1:4, 1:4]
#> 79 / 79 79 / 61 79 / 73 79 / 58
#> 79 / 79 0.990146081 1.272959e-06 7.955992e-04 2.545918e-07
#> 79 / 61 0.009951501 9.680772e-01 3.706518e-05 4.633071e-03
#> 79 / 73 0.010852496 1.723054e-05 9.676134e-01 3.446108e-06
#> 79 / 58 0.009950312 4.816547e-03 7.706602e-06 9.680772e-01
Note that, because of the way dropout errors occur, you are more likely to encounter a genotyping error when the true genotype is a heterozygote.
It will be worth the user’s time to play around with different settings to see the effect of them.
Microhaplotypes are clusters of a handful of SNPs that are all close enough to one another that they can be assayed by amplifying the segment of DNA they are on and then sequencing it on a next generation sequencing machine. Since such machines are given the sequence of a single piece of DNA (typically about 100 to 300 bp long), the alleles that occur together on such a read are certain to have been together on the same haplotype. These microhaplotypes are thus multiallelic markers. The main type of genotyping error that occurs with them is a dropout error, in which an individual carries one or two copies of the gene that do not amplify well, and thus will not yield enough reads to detect that allele. On top of that, sequencing errors might lead one to read an incorrect sequence for one of the alleles. In this case, it is more likely that the erroneous haplotype is “close” to the true one (i.e., differs at few SNPs), rather than far from it (different at many SNPs from the true haplotype.)
The alleles of a microhaplotype can be named by the allelic type of the SNPs within it. For example, at a microsatellite composed of 5 SNPs, one allele might be “ACCGT”. Here we show a simple error model for genotypes involving such microhaplotype alleles. It has the following features:
miscall_rate
. In some circumstances, we might want loci
with many SNPs to have a higher miscall rate. This is achieved by
multiplying the miscall rate by the number of SNPs.combine_miscalls_and_dropouts()
.We have, in the package, an example of what a microhaplotype locus might look like:
example_L_microhap
#> $freqs
#> GCA ACA ACG ATA GCG GTA
#> 0.543478261 0.260869565 0.086956522 0.065217391 0.038043478 0.005434783
#>
#> $geno_freqs
#> GCA / GCA GCA / ACA GCA / ACG GCA / ATA GCA / GCG GCA / GTA
#> 2.953686e-01 2.835539e-01 9.451796e-02 7.088847e-02 4.135161e-02 5.907372e-03
#> ACA / ACA ACA / ACG ACA / ATA ACA / GCG ACA / GTA ACG / ACG
#> 6.805293e-02 4.536862e-02 3.402647e-02 1.984877e-02 2.835539e-03 7.561437e-03
#> ACG / ATA ACG / GCG ACG / GTA ATA / ATA ATA / GCG ATA / GTA
#> 1.134216e-02 6.616257e-03 9.451796e-04 4.253308e-03 4.962193e-03 7.088847e-04
#> GCG / GCG GCG / GTA GTA / GTA
#> 1.447306e-03 4.135161e-04 2.953686e-05
So, we will need a function to compute distances between the microhap alleles, and then we will need to use that to compute a matrix \(\mathbf{W}\), like we did for microsatellites, and then we will combine that with dropout rates. In this function those dropout rates are made the same for every allele, but that could be changed.
Here is what such a function looks like: