Genetic Inheritance and Genetic Data

Eric C. Anderson

The Wildlife Society CKMR Workshop, Sunday November 6, 2022

Outline

  1. Motivation
  2. Mendelian Inheritance
  3. Identity-by-descent
  4. Physical linkage
  5. Genetic markers and identity in state
  6. Pairwise joint genotype probabilities
  7. Types of genetic markers

Motivation

  • Close-kin mark-recapture depends on identifying pairs of related individuals
  • Genetic data are used to identify these relatives
  • Relatives can be identified because they “share” more of their genome than unrelated (or less related) individuals.
  • A rigorous definition of what is meant by “genome sharing” is required.

  • We look like our relatives due both to shared genetics and shared environment
  • Quantitative genetics: what fraction of phenotypic resemblance is due to genetics
  • Kin-finding: using the fraction of shared genetic markers to infer relationship

Starting from the beginning

  • Gregor Mendel (1822-1884)
  • Inferred laws of genetic segregation by observing patterns of single-locus traits in pea plants.
  • Not applicable to most (polygenic) traits
  • But, we now know his laws are quite applicable to the transmission of genetic material in diploid organisms.
  • They apply to discussing the inheritance of a locus.

What the heck is a locus?

  • The word “locus” in genetics, much like the word “gene”, has been applied to a lot of things.

  • For our purposes, we will use it to refer to a “chunk” of DNA that is defined by its position in the genome.

  • For example, around base position X on Chromosome 3:

  • A diploid carries two copies of the DNA at each locus: one on the chromosome from mom, the other on the chromosome from dad.

Mendel’s First Law of Segregation (in sexual diploids)

When gametes are formed, the two copies of each locus segregate so that each gamete carries exactly one copy of each locus.

Since individuals are formed by the union of gametes, this dictates the segregation of DNA to offspring:

  • Each child receives exactly one copy of a locus from its mother and one from its father.
  • The copy of the locus received from the parent is chosen randomly (with probability \(\frac{1}{2}\) for each) from amongst the two copies in the parent.
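
A minimal simulation sketch of the first law. The function names `segregate` and `make_child` and the gene-copy labels (`m1`, `m2`, `p1`, `p2`) are illustrative assumptions, not part of the workshop materials.

```python
import random

def segregate(parent, rng):
    """Pick one of the parent's two gene copies, each with probability 1/2."""
    return rng.choice(parent)

def make_child(ma, pa, rng):
    """Form a child from one maternal and one paternal gene copy."""
    return (segregate(ma, rng), segregate(pa, rng))

rng = random.Random(42)          # seeded so the run is repeatable
ma = ("m1", "m2")                # labels for Ma's two gene copies
pa = ("p1", "p2")                # labels for Pa's two gene copies

kids = [make_child(ma, pa, rng) for _ in range(10_000)]
frac_m1 = sum(kid[0] == "m1" for kid in kids) / len(kids)
print(f"fraction of kids carrying m1: {frac_m1:.3f}")  # should be close to 0.5
```

Every simulated kid carries exactly one maternal and one paternal copy, and each parental copy turns up in about half of the kids.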

Consider the case of the maternal allele segregated to an offspring, in the pedigree to the right.

  • Circles are females.
  • Males are squares.


\(\mathrm{Probability} = \frac{1}{2}\)


\(\mathrm{Probability} = \frac{1}{2}\)

Mendel’s Second Law

“The gene copy that gets segregated to a gamete/offspring at one locus is independent of the gene copy that gets segregated at another locus.”

  • This is not universally true.
  • It is only true for loci that are on different chromosomes.

When two loci are on the same chromosome they are referred to as “physically linked”

AND, they do not segregate independently.

Chromosomes, Crossovers, and Recombination

  • Chromosomes are inherited in big chunks
  • Crossovers occur between the maternal and paternal chromosomes of an individual.
  • An odd number of crossovers between two loci = recombination
  • Crossover modeled as a Poisson process along chromosomes.
  • Loci close together are less likely to have a recombination between them
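
A small sketch of how the Poisson model turns distance into a recombination probability. It assumes the number of crossovers between two loci is Poisson with mean d, the genetic distance in Morgans, with no crossover interference; the probability of an odd count is then (1 - exp(-2d)) / 2, which is Haldane's map function. The function name is made up for illustration.

```python
import math

def recombination_fraction(d_morgans):
    """P(odd number of crossovers) when the count is Poisson with mean d."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

for d in (0.001, 0.01, 0.1, 0.5, 2.0):
    print(f"d = {d:5.3f} Morgans -> r = {recombination_fraction(d):.4f}")
```

For small d the recombination fraction is roughly d itself, and it levels off just below 1/2 for loci far apart, which then segregate almost as if they were unlinked.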

  • We will come back to linked loci later today.
  • For now, we focus on a single locus…

Genetic Identity by Descent

  • When the DNA at a locus is a direct copy of the DNA in an ancestor…
  • Or when the stretches of DNA in two different individuals at a locus are both copies of the same piece of DNA in a recent ancestor…
  • These pieces of DNA are termed identical by descent (IBD).

  • Ma and Kid are IBD at one gene copy (fuchsia)
  • Pa and Kid are IBD at one gene copy (blue)

  • Kid-1 and Kid-2 share 1 gene copy IBD (blue)

  • Kid-1 and Kid-2 share no gene copies IBD (blue)

The number of gene copies IBD

  • A pair of non-inbred diploid individuals can share 0, 1, or 2 gene copies IBD at a locus.

  • Each possible relationship between two non-inbred individuals can be characterized by the expected fraction of the genome at which the two individuals share 0, 1, or 2 gene copies IBD. \[ \boldsymbol{\kappa}= (\kappa_0, \kappa_1, \kappa_2) \]

    • \(\kappa_0\): expected fraction of genome with 0 gene copies IBD
    • \(\kappa_1\): expected fraction of genome with 1 gene copy IBD
    • \(\kappa_2\): expected fraction of genome with 2 gene copies IBD
  • These are also the marginal probabilities that a pair of individuals share 0, 1, or 2 gene copies IBD at a single locus.

  • Sometimes called “Cotterman coefficients”, we will call them “kappas”

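  • These are not the coefficient of relationship or coefficient of consanguinity. (Those single numbers are not sufficient: parent-offspring and full-sibling pairs, for example, have the same coefficient of relationship but different kappas.)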

\(\boldsymbol{\kappa}\) for Parent-Offspring Pairs

  • We start with an easy one
  • What is \(\boldsymbol{\kappa}\) for the parent-offspring relationship?
  • Well, a parent and offspring always share exactly one gene copy IBD.
  • \[ \begin{aligned} \kappa_0 &= 0 \\ \kappa_1 &= 1 \\ \kappa_2 &= 0 \\ \end{aligned} \]
  • or \[ \boldsymbol{\kappa}= (0, 1, 0) \]

\(\boldsymbol{\kappa}\) for Half-Sibling Pairs

  • A slightly harder one:
  • We know that, with probability 1, Pa segregates a copy of a single gene to Kid-1
  • With probability \(\frac{1}{2}\) Pa segregates a copy of the same gene to Kid-2
  • With probability \(\frac{1}{2}\) Pa segregates a copy of the other gene to Kid-2
  • \[ \begin{aligned} \kappa_0 &= 1/2\\ \kappa_1 &= 1/2 \\ \kappa_2 &= 0 \\ \end{aligned} \]
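
The half-sibling values can be checked by brute-force enumeration. This sketch assumes the two mothers are unrelated and nobody is inbred, so only Pa's gene copies can be IBD between the kids; the labels `p1` and `p2` are illustrative.

```python
from fractions import Fraction
from itertools import product

pa = ("p1", "p2")                       # Pa's two gene copies at the locus
half = Fraction(1, 2)

# kappa[x] accumulates the probability that the kids share x gene copies IBD
kappa = [Fraction(0), Fraction(0), Fraction(0)]
for to_kid1, to_kid2 in product(pa, repeat=2):
    prob = half * half                  # each segregation has probability 1/2
    n_ibd = 1 if to_kid1 == to_kid2 else 0
    kappa[n_ibd] += prob

print(kappa)
```

Two of the four equally likely outcomes hand both kids the same copy, giving the (1/2, 1/2, 0) shown above.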

\(\boldsymbol{\kappa}\) for Full-Sibling Pairs

  • Like two independent half-sib relationships…
  • With probability \(\frac{1}{2}\), Kid-1 and Kid-2 share 1 gene copy (or 0 gene copies) IBD from Ma
  • Independently, with probability \(\frac{1}{2}\), Kid-1 and Kid-2 share 1 gene copy (or 0 gene copies) IBD from Pa.
  • So, no gene copies IBD = none from Ma and none from Pa = \(\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}\).
  • Two gene copies IBD = one from Ma and one from Pa = \(\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}\)
  • One gene copy IBD = one from Ma and zero from Pa, or zero from Ma and one from Pa = \(\frac{1}{2} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = \frac{1}{2}\).
  • So, \(\kappa_0 = \frac{1}{4}~~~~~~\kappa_1 = \frac{1}{2}~~~~~~\kappa_2 = \frac{1}{4}\)
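
The "two independent half-sib relationships" argument amounts to combining two independent 0-or-1 sharing events, one per parent. A short sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

half = Fraction(1, 2)
# Probability of sharing 0 or 1 gene copies IBD through a single shared parent
via_ma = {0: half, 1: half}
via_pa = {0: half, 1: half}

kappa_fs = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
for n_ma, p_ma in via_ma.items():
    for n_pa, p_pa in via_pa.items():
        kappa_fs[n_ma + n_pa] += p_ma * p_pa   # independence across parents

print(kappa_fs)
```

The two ways of sharing exactly one copy (through Ma only, or through Pa only) add up to the 1/2 in the middle.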

Some relationships and their \(\boldsymbol{\kappa}\) values
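
As a sketch, the kappas derived so far can be collected in a small lookup table. The unrelated entry (1, 0, 0) follows directly from the definition. The formula used for the coefficient of relationship, kappa_1 / 2 + kappa_2, is the standard one for non-inbred pairs and is an addition here, not something stated on the slides.

```python
from fractions import Fraction as F

KAPPAS = {
    "U":  (F(1), F(0), F(0)),            # unrelated
    "PO": (F(0), F(1), F(0)),            # parent-offspring
    "FS": (F(1, 4), F(1, 2), F(1, 4)),   # full siblings
    "HS": (F(1, 2), F(1, 2), F(0)),      # half siblings
}

def coefficient_of_relationship(kappa):
    """Expected fraction of gene copies shared IBD: kappa_1 / 2 + kappa_2."""
    _, k1, k2 = kappa
    return k1 / 2 + k2

for name, kappa in KAPPAS.items():
    assert sum(kappa) == 1               # each kappa vector is a distribution
    print(name, [str(k) for k in kappa], "r =", coefficient_of_relationship(kappa))
```

PO and FS come out with the same coefficient of relationship even though their kappas differ, which is why the single-number summaries are not enough for telling relationships apart.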

Must we worry about all these relationships for CKMR?

Inferring relationships from genetic data

Aha! Can we use genetic data to estimate the fraction of genome IBD, and from that estimate the pairwise relationship?

Yes! But there are some complications:

  • Genetic data do not let us observe IBD directly.
  • Genetic data at a locus give you the genotype, \(G\), of an individual.
  • Genotype = the unordered combination of the allelic types of the two gene copies in a diploid.
  • More colloquially, Genotype = the two alleles carried by an individual at a locus.

Denoting alleles:

  • When we need to, we will denote alleles by \(A_i\) or \(A_j\).
  • Different subscripts mean different alleles.
  • So, the genotype of individual \(i\) at a single locus could be \(G_i = A_k A_\ell\) or \(G_i = A_iA_j\) or \(G_i = A_jA_j\).

What are alleles?

What we end up calling alleles depends entirely on the type of genetic marker that we end up using.

  • Microsatellites: Alleles are lengths of repeat regions between conserved primers.
  • SNPs: Alleles are different DNA bases at a single position in the genome.
  • etc…(More on this later)

Identity in state (IIS) vs Identity by descent (IBD)

  • If the alleles carried by different individuals are the same allelic type they are called identical in state (IIS).
  • Two alleles can be IIS even though they are not IBD.
  • But two gene copies that are IBD will be IIS unless there was a recent mutation (very unlikely) or a genotyping error (somewhat more likely)
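
To get a feel for how often non-IBD gene copies are IIS anyway, here is a rough sketch. It assumes the two gene copies are drawn independently from a population with known allele frequencies; the function name and the example frequencies are made up for illustration.

```python
def prob_iis_unrelated(allele_freqs):
    """Chance that two non-IBD gene copies drawn at random have the same allele."""
    return sum(p * p for p in allele_freqs)

print(prob_iis_unrelated([0.5, 0.5]))   # a SNP with two equally common alleles
print(prob_iis_unrelated([0.1] * 10))   # a marker with ten equally common alleles
```

With two equally common alleles, half of all non-IBD comparisons are IIS by chance; with ten equally common alleles it is about one in ten, so more alleles per marker means IIS says more about IBD.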

Evidence for a pairwise relationship

We base our inference of the pairwise relationship between individuals \(i\) and \(j\) on the likelihood of the relationship, \(R\). \[ L(R) = P(G_i, G_j | \boldsymbol{\kappa}(R)) \] where \(\boldsymbol{\kappa}(R)\) denotes the \(\kappa\) coefficients for relationship \(R\).

The likelihood of relationship \(R\) is the joint probability of the genotypes of \(i\) and \(j\) at the locus, given the \(\kappa\) coefficients of relationship \(R\)

  • This is for a single locus, but we will add multiple loci later.
  • Likelihoods are only relevant when compared against one another
  • So we will use these in the context of likelihood ratios, like: \[ \frac{L(\mathrm{PO})}{L(\mathrm{U})} ~~~~~~~~\mathrm{or} ~~~~~~~~\frac{L(\mathrm{PO})}{L(\mathrm{FS})} ~~~~~~~~\mathrm{or} ~~~~~~~~\frac{L(\mathrm{FS})}{L(\mathrm{HS})} \]

The likelihood is useful for comparing the evidence for different relationships.

Calculating \(P(G_i, G_j|\boldsymbol{\kappa})\)

  • \(\boldsymbol{\kappa}\) gives you the marginal probability that a pair shares 0, 1, or 2, gene copies IBD at a locus.
  • So, \[ P(G_i, G_j|\boldsymbol{\kappa}) = \kappa_0P_0(G_i, G_j) + \kappa_1P_1(G_i, G_j) + \kappa_2P_2(G_i, G_j) \] where \(P_x(G_i,G_j)\) is the joint probability of the genotypes of the pair, given that they share \(x\) gene copies IBD.

Which is: \[ \begin{aligned} P(G_i, G_j|\boldsymbol{\kappa}) &= \kappa_0 P(G_i)P(G_j) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{\tiny\mathrm{no~gene~copies~IBD}} \\ &+ \kappa_1 P(G_i)P(G_j | G_i=\mathrm{ma}, G_j=\mathrm{kid}) ~~~~~~~~~{\tiny\mathrm{1~gene~copy~IBD}} \\ &+ \kappa_2 P(G_i)\mathcal{I}\{G_j=G_i\} ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{\tiny\mathrm{2~gene~copies~IBD}} \\ \end{aligned} \] where \(\mathcal{I}\{\cdot\}\) is an indicator function returning 1 when what is inside it is true, and 0 otherwise.

  • Trivial, except for \(P(G_j | G_i=\mathrm{ma}, G_j=\mathrm{kid})\) which is still pretty easy.

\(P(G_j | G_i=\mathrm{ma}, G_j=\mathrm{kid})\)

With \(p_i\) equal to the relative frequency of allele \(A_i\) in the population, we can enumerate the possibilities: \[ \begin{array}{l|cc} G_\mathrm{kid}\downarrow ~/~ G_\mathrm{ma}\rightarrow & A_iA_i & A_iA_j \\ \hline A_iA_i & p_i & p_i / 2 \\ A_jA_j & 0 & p_j / 2 \\ A_iA_j & p_j & (p_i + p_j)/2 \\ A_iA_k & p_k & p_k/2 \\ A_jA_k & 0 & p_k/2 \\ A_kA_\ell & 0 & 0 \\ \end{array} \]
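
The table can be rebuilt from first principles: Ma passes on either of her gene copies with probability 1/2, and the kid's other copy is assumed to be a random draw from the population. The sketch below uses a hypothetical function name and made-up allele frequencies for four alleles.

```python
from collections import defaultdict

def kid_genotype_distribution(ma, freq):
    """Distribution of the kid's genotype given Ma's genotype."""
    dist = defaultdict(float)
    for from_ma in ma:                       # each copy transmitted w.p. 1/2
        for from_pop, p in freq.items():     # other copy drawn from population
            kid = tuple(sorted((from_ma, from_pop)))   # genotypes are unordered
            dist[kid] += 0.5 * p
    return dict(dist)

freq = {"Ai": 0.4, "Aj": 0.3, "Ak": 0.2, "Al": 0.1}   # illustrative values
for ma in [("Ai", "Ai"), ("Ai", "Aj")]:
    print("Ma =", ma)
    for kid, prob in sorted(kid_genotype_distribution(ma, freq).items()):
        print("   kid =", kid, " prob =", round(prob, 3))
```

Kid genotypes that share no allele with Ma never appear in the output, matching the zeros in the table.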

So, we know how to calculate \(P(G_i, G_j|\boldsymbol{\kappa}(R))\).
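
Putting the pieces together, here is a minimal single-locus sketch of the likelihood and a couple of likelihood ratios. It assumes Hardy-Weinberg proportions for \(P(G)\), which the slides do not state explicitly. All function names, the allele labels, and the allele frequencies are illustrative.

```python
def genotype_prob(g, freq):
    """P(G) under Hardy-Weinberg proportions (an assumption of this sketch)."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

def prob_kid_given_parent(kid, parent, freq):
    """P(G_kid | G_parent) for a parent-offspring pair."""
    a, b = kid
    total = 0.0
    for t in parent:                      # parent transmits t with probability 1/2
        if t == a:
            total += 0.5 * freq[b]        # the kid's other copy must then be b
        elif t == b:
            total += 0.5 * freq[a]        # the kid's other copy must then be a
    return total

def joint_genotype_prob(gi, gj, kappa, freq):
    """P(G_i, G_j | kappa): kappa-weighted mixture over 0, 1, 2 copies IBD."""
    k0, k1, k2 = kappa
    p_gi = genotype_prob(gi, freq)
    p0 = p_gi * genotype_prob(gj, freq)
    p1 = p_gi * prob_kid_given_parent(gj, gi, freq)
    p2 = p_gi * (1.0 if sorted(gi) == sorted(gj) else 0.0)
    return k0 * p0 + k1 * p1 + k2 * p2

KAPPA = {"U": (1, 0, 0), "PO": (0, 1, 0),
         "FS": (0.25, 0.5, 0.25), "HS": (0.5, 0.5, 0)}
freq = {"A1": 0.5, "A2": 0.3, "A3": 0.2}      # made-up allele frequencies
gi, gj = ("A1", "A2"), ("A1", "A2")           # both individuals heterozygous

lik = {rel: joint_genotype_prob(gi, gj, k, freq) for rel, k in KAPPA.items()}
print("L(PO)/L(U)  =", lik["PO"] / lik["U"])
print("L(FS)/L(HS) =", lik["FS"] / lik["HS"])
```

A single locus moves these ratios only a little away from 1, which is why many loci are needed before relationships can be told apart with any confidence.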

Switch Gears: Genotyping Technology and Genotyping Errors

Overview:

  • Next generation sequencing: basic background on Illumina sequencing and a little bioinformatics.
  • HOW do you target certain parts of the genome?
  • WHAT variation do you use at these parts of the genome?
  • Genotyping errors, biases, and their consequences.

Next Generation Sequencing (NGS)

  • Many genotyping technologies today do not use NGS:
    • Traditional Microsatellites
    • SNP chips
    • Fluidigm assays
  • Nonetheless, CKMR is feasible today because of high-throughput sequencing.
  • You (or someone on your team) need some background/understanding of the genotyping technology you use.
  • We’ll discuss Illumina sequencing today. (most common and cost effective)

Basic Illumina Sequencing — Library Prep

Inside the Illumina Sequencer

  • The “lawn”
  • Bridge PCR / Cluster Generation
  • Sequencing by Synthesis
  • If “paired-end” then sequence is read from both ends of the fragment.

An informative video from Illumina for the ultra-interested.

The Raw Data:

Paired end sequences, sample-specific FASTQ files

These data are not useful unless you have a reference to map them to.

Alignment or “Mapping” to a Reference

Finding where in the reference sequence each read belongs

  • In “production genotyping” by NGS, this is where the bioinformatics typically starts (after some quality control).
  • You get:
    • Where in the reference each read aligns
    • Info about gaps, mismatches and misaligning bits
    • A measure of confidence that the location is correct

A Picture of Aligned Reads, Zoomed In

A Picture of Aligned Reads, Zoomed Out a Bit

Variant Detection and Genotype Calling

  • Variant Detection:
    • Using aligned reads to find positions in the genome that are polymorphic (SNPs, Indels, etc.)
  • Genotype Calling:
    • Using the aligned reads of an individual at a specific position to identify the genotype of that individual at that locus/variant.
    • Each read is a copy of the DNA on one chromosome
    • Reads from both copies of the chromosome align in the same position.
    • Allelic type from each chromosome is more accurately inferred if there are many reads from it.
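
One way to see why depth matters: at a heterozygous site, a genotype call can only be right if reads from both chromosomes show up. The sketch below assumes each read comes from either chromosome with probability 1/2 and ignores sequencing error, so it is a simplification rather than a genotype-calling model. The function name is hypothetical.

```python
def prob_het_shows_one_allele(depth):
    """P(all `depth` reads at a heterozygous site come from the same chromosome),
    assuming each read samples either chromosome with probability 1/2 and
    ignoring sequencing error."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5 ** depth

for depth in (1, 2, 5, 10, 20):
    print(f"depth {depth:2d}: {prob_het_shows_one_allele(depth):.6f}")
```

At a depth of 5 a heterozygote still looks homozygous in the reads about 6% of the time under these assumptions; by a depth of 20 the chance is negligible.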

The Fundamental Tradeoff in NGS

  • Number of reads is “fixed”

    • NovaSeq 6000 S4 flow cell
      • 2.5 Billion reads per lane
      • Each lane \(\approx\) $5,000.
  • Higher read depth at a locus = more accurate genotype calls.

  • Reads must be apportioned to:

    • Different individuals
    • Different loci / parts of the genome
  • If you target less of the genome, you can genotype more individuals, accurately.
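
The tradeoff is simple arithmetic. This back-of-the-envelope sketch uses the 2.5 billion reads per lane quoted above and assumes reads are spread perfectly evenly across individuals and loci, which real libraries never achieve. The `on_target` parameter and the example numbers of individuals and loci are hypothetical.

```python
def mean_depth(total_reads, n_individuals, n_loci, on_target=1.0):
    """Average reads per locus per individual if reads are spread evenly."""
    return total_reads * on_target / (n_individuals * n_loci)

lane_reads = 2.5e9   # reads per lane, the figure quoted above
print(mean_depth(lane_reads, n_individuals=1_000, n_loci=2_000))
print(mean_depth(lane_reads, n_individuals=1_000, n_loci=200_000))
```

For the same 1,000 individuals on one lane, cutting the number of targeted loci a hundredfold raises the mean depth a hundredfold.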

HOW do you target certain parts of the genome?

  • Not at all (Whole-genome, shotgun sequencing)
  • Around specific enzyme cut sites (RAD and friends)
  • Probe hybridization / capture (MyBaits, etc.)
  • Enzyme cutsite + Capture (Rapture, DArTcap)
  • Amplicon Sequencing (GT-seq)

The Awesome Wonder of Whole Genome Sequencing Data—3 megabases of Chr28 in 124 Steelhead

  • Amazing for many applications.
  • A bit too many variants for CKMR. And too expensive.

RAD-Seq and ddRAD-Seq

What RAD gets you

  • 15,000 to 250,000 loci / SNPs
  • Costs can be as low as around $30 per sample
  • A somewhat tedious library prep
  • There can be a lot of missing genotypes
  • Even the non-missing genotypes are not always super reliable (and are sometimes highly error prone).

RAD for CKMR?

  • RAD/ddRAD not always ready for CKMR prime-time
    • Cost, Missing data, Het-miscall rates
  • Might be useful for confirming kin pairs discovered using less expensive methods

Capture arrays

Enrich sequences by sticking them to probes / “baits”

By itself, probably not cost effective. However, in combination with RAD…

Combination RAD + Capture

  • A promising combination.
  • Initial RAD library prep to attach bar codes and select for DNA around cut sites.
  • Use capture probes to enrich for specific RAD loci that you have selected.
  • Great for targeting 500 to 5,000 genomic regions/loci.
  • 100s to 1,000s of individuals at $1 to $5 / sample genotyping cost
  • RAD + capture = RAPTURE; ddRAD + capture = DArTcap
  • Read depths potentially in the 100s per locus per individual
  • In the afternoon workshop we will be using some of these data.
  • Commercial enterprise (Diversity Arrays Technology) that does this for a fee.

Amplicon Sequencing

Use PCR primers to amplify (in multiplex) target regions of the genome.

  • Suitable for 100 - 400 Markers
  • Great for PO and FS relationships
  • Can work for HS in high-diversity species when focusing on microhaplotypes.
    • Confirm with RAD afterward.

Microsatellites from Next Generation Sequencing

This is also based on Amplicon Sequencing.

  • Getting to play around with these data blew my mind.
  • Still awaiting further studies of the genotyping accuracy of these.
  • Could be quite useful, but still requires a lot of them.

Wrap Up

  • Mendel’s principles lead to simple expressions for joint genotype probabilities.

  • Likelihood of different relationships, given genetic data, can be compared.

  • Genetic data can take many forms, but most likely it will involve next-generation sequencing.

  • Next session: kin-finding.