Pop. Gen. IV: Finite Populations

Finite populations and neutral genetic diversity

In our last lecture, we focused on simple models of selection in large (effectively infinite) populations. While many populations are indeed quite large1, infinite population sizes are clearly impossible. So what happens in finite populations, in which we have a known number of individuals? To keep things simple, we are going to (temporarily) ignore fitness differences between genotypes; that is, we are going to focus on neutral genetic variation2. Recall from our first lecture that, in the absence of any other evolutionary process:

1 Example for Drosophila?

2 Example for Drosophila?

\[ \E[p_{t+1}] = p_t \]

“If nothing happens,
    then nothing happens!”
                - Bengt-Olle Bengtson

This equation, which we discussed as the law of constancy of allele frequencies, tells us that given some current allele frequency \(p_t\), the EXPECTED frequency at time \(t+1\) will remain the same. Now, in an infinite population, \(p_{t+1} = p_t\) exactly. However, in a finite population, we have to take into account variation due to random sampling of genotypes/alleles. That is, if there is a fixed and finite number of individuals (\(N\)) in each generation, how do we choose the genotypes of the \(N\) individuals in the next generation?

  • Answer: random sampling of haploid gametes.

The Wright-Fisher Model of Genetic Drift

We will be walking through a class of models developed independently by Sewall Wright & R.A. Fisher (we will meet him again later). Consider a population of \(N\) diploid individuals. For simplicity, we will assume these are self-compatible hermaphrodites. Each new generation is obtained from the previous one by repeating the following basic steps:

  1. Choose two alleles at random and with replacement from the \(2N\) alleles present in the parent generation.
  2. Make an exact copy of those two alleles.
  3. Place those allele copies in the new generation as one of the \(N\) diploid offspring genotypes.
  4. Repeat for all \(N\) individuals.
Figure 1: Cartoon sketch of a W-F model.
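The sampling scheme above is easy to put into code. Here is a minimal sketch in Python (the helper names `wf_generation` and `wf_trajectory` are my own, not from the lecture):

```python
import random

def wf_generation(count, N):
    """One Wright-Fisher generation: each of the 2N offspring alleles is an
    independent draw (with replacement) from a parental pool in which
    `count` of the 2N alleles are of type A1."""
    p = count / (2 * N)
    return sum(1 for _ in range(2 * N) if random.random() < p)

def wf_trajectory(p0, N, generations, seed=0):
    """Track the number of A1 copies across repeated rounds of W-F sampling."""
    random.seed(seed)
    count = round(p0 * 2 * N)
    counts = [count]
    for _ in range(generations):
        count = wf_generation(count, N)
        counts.append(count)
    return counts

# A population of N = 50 diploids starting at p = 0.5, followed for 100 generations.
traj = wf_trajectory(p0=0.5, N=50, generations=100)
```

Running this repeatedly shows the characteristic random walk of allele frequencies, bounded by loss (0 copies) and fixation (\(2N\) copies).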

Hopefully you can already see that this simple model can be a very powerful tool for studying finite populations. Specifically, the above cartoon immediately suggests that we can use a W-F model to study the statistical properties of allele frequencies in finite populations. It also suggests a deep connection between the process of randomly sampling haploid gametes to produce an offspring generation, and the properties of lineages of alleles that can be traced backwards through the evolutionary history of a population.

We will be exploring both of these aspects in the following sections.

Decay of heterozygosity due to drift

Let’s begin our exploration of statistical properties of finite populations by looking at how genetic drift influences genetic variation in a finite population. To do this, we first need to establish two definitions:

  • Identity by state (I.B.S.): What it means for two alleles to be identical by state depends on the context. However, for our purposes, we will retain the same definition that we gave for different alleles earlier, and say that two alleles are identical by state if they are the same variant (e.g., \(A_1\)). Alleles can be identical by state, but differ in other important ways…

  • Identity by descent (I.B.D.): Alleles are said to be identical by descent if they share a single common ancestor allele in their recent evolutionary history. What we mean by ‘recent’ is a little fuzzy here, since technically all alleles in a population have a single ancestor at some point in their history (a point we will return to soon). Note that being identical by descent is different from being identical by state. Alleles that are I.B.D. are I.B.S. by definition. But alleles that are I.B.S. are not necessarily I.B.D. due to mutations in different individuals or recombination events.

Armed with these two definitions, let’s return to our Wright-Fisher population. There are two ways for an individual to inherit two alleles that are identical by state:

Figure 2: There are two ways to inherit alleles that are Identical By State

What are the probabilities of these two outcomes?

  1. The \(2\) alleles are copies of the same allele in the parental generation. This occurs with probability \(1/(2N)\).
  2. The \(2\) alleles are copies of two different alleles in the parental generation (which occurs with probability \(1 − 1/(2N)\)), but those two ancestor alleles are themselves I.B.S. (which occurs with probability \(1 - H_t\)).

Recall from our lecture on H-W that to be ‘homozygous’ means an individual has inherited two alleles that are I.B.S. Likewise, inheriting two alleles that are NOT I.B.S. leads to a ‘heterozygous’ genotype. Hence, the overall probability that \(2\) alleles are identical by state is equal to \(1−H\), where \(H\) is the population genetic diversity index, or heterozygosity! This leads to the following classic derivation: \[ \Pr \left[ \text{IBS} \right]_{t+1} = 1 - H_{t+1} = \underbrace{\left( \frac{1}{2 N} \right)}_{\text{IBS b/c IBD}} + \underbrace{\left(1 - \frac{1}{2 N} \right) (1 - H_t)}_{\text{not IBD, but still IBS}} \]

Rearranging, we get: \[ \begin{aligned} H_{t+1} &= 1 - \left( \frac{1}{2 N} \right) - \left(1 - \frac{1}{2 N} \right) (1 - H_t) \\ &= \left(1 - \frac{1}{2 N} \right) H_t \end{aligned} \]

and we can describe the deterministic trajectory for the expected heterozygosity over time as: \[ \begin{aligned} H_{t} &= H_0 \left(1 - \frac{1}{2 N} \right)^t \\ &\approx H_0\, e^{-t/(2N)} \end{aligned} \]

Let’s take a look at what this expression looks like:
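We can also check the decay curve numerically. Below is a sketch that averages heterozygosity over replicate W-F populations and compares it to \(H_0 (1 - 1/(2N))^t\) (the helper `simulate_H` is my own name, not from the lecture):

```python
import random

def simulate_H(N, p0, generations, replicates, seed=1):
    """Average heterozygosity 2p(1-p) across replicate W-F populations."""
    rng = random.Random(seed)
    H_by_gen = [0.0] * (generations + 1)
    for _ in range(replicates):
        count = round(p0 * 2 * N)
        for t in range(generations + 1):
            p = count / (2 * N)
            H_by_gen[t] += 2 * p * (1 - p)
            # Binomial resampling of the next generation's 2N alleles.
            count = sum(1 for _ in range(2 * N) if rng.random() < p)
    return [h / replicates for h in H_by_gen]

N, p0, T = 20, 0.5, 40
H_sim = simulate_H(N, p0, T, replicates=500)
H_theory = [2 * p0 * (1 - p0) * (1 - 1 / (2 * N)) ** t for t in range(T + 1)]
```

With enough replicates, the simulated average hugs the theoretical decay curve, even though any single population fluctuates wildly.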

Insight!

Genetic drift, the simple process of stochastic survival of alleles from one generation to the next, inevitably leads to the loss of genetic variation! Why does this happen? Because in a finite population, the long-term outcome is ALWAYS extinction of one allele and fixation of the other!

Without some other force introducing new genetic variation (e.g., mutation, migration), or selection that actively maintains genetic variation (e.g., heterozygote advantage), heterozygosity is expected to decay over time due to genetic drift.

The Effective Population Size (\(N_e\))

Natural populations never meet the assumptions of a W-F model. However, some of the possible violations lead to models that are still relatively simple extensions of the classic W-F model; others require entirely new models. We need a way of translating the predictions of W-F models to real situations:

The effective population size (\(N_e\)): The size that a W-F population would have if it lost genetic variation at the same rate as the modeled/studied population. Provides a way of quantifying the effective size of a population due to a variety of factors that violate the assumptions of W-F.

Most extensions of W-F share one property with it: the variance in allele frequency after mating is of the form \[ \Var \left[ p^{\prime} \right] = \frac{pq}{2N_e}, \]

where \(N_e\) is called the effective population size, or more precisely, the variance effective population size. All properties of the more complicated models that depend on \(\Var \left[ p^{\prime} \right]\) are the same as those of a W-F population with an appropriate choice of \(N_e\).

We will skip the derivations, but here are a few forms of \(N_e\) that you need to know:

Cyclic change in population size

Figure 3: Cyclic changes in population size

\[ N_e = \frac{t}{\sum_{i=1}^t \frac{1}{N_i}} = \text{Harmonic Mean} \]

For \(N_1 = 10\), \(N_2 = 100\), \(N_3 = 1000\), we have \(N_e \approx 27\), while the arithmetic mean is \(\bar{N} = 370\).
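As a quick check on the arithmetic (the helper name `harmonic_mean_Ne` is mine):

```python
def harmonic_mean_Ne(sizes):
    """N_e under cyclic population-size change: the number of generations
    divided by the sum of reciprocal sizes, i.e., the harmonic mean."""
    return len(sizes) / sum(1 / n for n in sizes)

sizes = [10, 100, 1000]
Ne = harmonic_mean_Ne(sizes)        # ~27: dominated by the smallest size
Nbar = sum(sizes) / len(sizes)      # arithmetic mean: 370
```

The harmonic mean is dominated by its smallest term, which is exactly why bottlenecks depress \(N_e\) so strongly.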

Insight: Population bottlenecks reduce \(N_e\)!

Two Sexes

\[ N_e = \frac{4 N_f N_m}{N_f + N_m}, \]

where \(N_f\) and \(N_m\) are the numbers of females and males, respectively. If \(N_f = N_m = N/2\), then: \[ N_e = \frac{4 (\frac{N}{2})^2}{\frac{N}{2} + \frac{N}{2}} = \frac{N^2}{N} = N \]

Insight: All results for standard W-F model hold when we have an equal sex ratio!!!

Skewed sex ratios reduce \(N_e\)!!!

If \(N_f = 10\) and \(N_m = 100\), then \[ N_e = \frac{4 \times 10 \times 100}{10 + 100} = \frac{4000}{110} \approx 36 \]

Inbreeding

\[ N_e = \frac{N}{1 + F} \]

Insight: Inbreeding reduces \(N_e\)!!!
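The two-sex and inbreeding formulas are equally simple to encode. A sketch (the function names are my own):

```python
def Ne_two_sexes(Nf, Nm):
    """Effective size with Nf breeding females and Nm breeding males."""
    return 4 * Nf * Nm / (Nf + Nm)

def Ne_inbreeding(N, F):
    """Effective size with inbreeding coefficient F."""
    return N / (1 + F)

equal = Ne_two_sexes(55, 55)       # equal sex ratio recovers N = 110
skewed = Ne_two_sexes(10, 100)     # ~36: a skewed sex ratio sharply reduces N_e
inbred = Ne_inbreeding(100, 1.0)   # complete inbreeding (F = 1) halves N_e to 50
```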

Drift and Mutation

Recall from Lecture 3 that we explored the consequences of mutation-selection balance in effectively infinite populations, where mutation introduced new deleterious alleles into the population and selection removed them. This led to a dynamic balance between the two processes: an equilibrium with intermediate frequencies of the deleterious allele.

Here, we will examine a similar situation, except now we will be focusing on neutral variation in finite populations. In this case, mutation still introduces new variants into a population, but now genetic drift is the opposing force that removes variation. Mutation and drift will have opposite effects on expected heterozygosity, but where do they perfectly counterbalance one another? Where is the Mutation-Drift balance for \(H_t\)?

As before, let’s start with mutation. New mutations will enter a population at a rate of \(2 N \mu\) per generation (that’s just the number of haploid genomes in the population, multiplied by the mutation rate). It is easier to follow the derivation if we start by quantifying the probability that a pair of alleles will be I.B.S., which will equal \(1 - H_t\). Let’s walk through the logic:

  • If a mutation occurs on either of two initially I.B.S. alleles, they will no longer be I.B.S.
  • Then, just as we did earlier, we recognize that there are two ways that a pair of alleles in the next generation can be I.B.S.
    • They can be I.B.S. because they are I.B.D., OR
    • They can be I.B.S. even though they are not I.B.D
  • Below, we step through the resulting derivation

\[ \begin{aligned} 1 - H_{t+1} &= \underbrace{(1 - \mu)^2}_{\Pr(\text{No Mutation})} \underbrace{\left( \underbrace{\left( \frac{1}{2 N} \right)}_{\text{IBS b/c IBD}} + \underbrace{\left(1 - \frac{1}{2 N} \right) (1 - H_t)}_{\text{Not IBD, still IBS}} \right)}_{\Pr(\text{IBS})} \\ &\approx (1 - 2 \mu) \left( \left( \frac{1}{2 N} \right) + \left(1 - \frac{1}{2 N} \right) (1 - H_t) \right) + \mathcal{O}(\mu^2) \\ &\approx \frac{1}{2N} + \left(1 - \frac{1}{2 N} \right) (1 - H_t) - 2\mu(1 - H_t) + \mathcal{O}(\mu^2) \end{aligned} \]

rearranging, we get \[ H_{t+1} \approx \left(1 - \frac{1}{2 N} \right) H_t + 2 \mu (1 - H_t) \]

The per-generation change in \(H\) is then: \[ \Delta H \approx -\frac{1}{2 N} H + 2 \mu (1 - H) \]

At equilibrium3, \(\Delta H = 0\), and we can solve for the expected heterozygosity:

3 Exercise: solve for \(\hat{H}\) yourself

\[ \hat{H} = \frac{4 N \mu}{1 + 4 N \mu} \]
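We can confirm this equilibrium by simply iterating the recursion for \(H_{t+1}\) derived above (a sketch; `iterate_H` is my own name):

```python
def iterate_H(N, mu, H0=0.0, generations=200_000):
    """Iterate H_{t+1} = (1 - 1/(2N)) H_t + 2*mu*(1 - H_t) until the
    sequence has effectively reached its fixed point."""
    H = H0
    for _ in range(generations):
        H = (1 - 1 / (2 * N)) * H + 2 * mu * (1 - H)
    return H

N, mu = 1000, 1e-5
theta = 4 * N * mu                 # 0.04
H_hat = theta / (1 + theta)        # predicted equilibrium, ~0.0385
H_iter = iterate_H(N, mu)          # converges to the same value
```

Starting from \(H_0 = 0\) (no variation), heterozygosity climbs until drift's removal of variation exactly balances mutation's input.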

The Expected Heterozygosity, \(\theta\)

The compound parameter \(4 N \mu\) is so important and shows up so often in population genetics that we give it a name: \(\theta = 4 N \mu\)

and hence we can also write4: \[ \hat{H} = \frac{\theta}{1 + \theta} \]

4 Compare with the equation for Expected Heterozygosity above.

The Probability of Allele Fixation

In light of what we have just learned about the decay of heterozygosity due to genetic drift, we now have a conundrum… We know that \[ \E[p^{\prime}] = p \]

BUT without mutation, we know that \(H_t \rightarrow 0\) as \(t \rightarrow \infty\). That is, the allele frequency is expected to stay the same, but heterozygosity is lost over time. This means that, in the long run, one allele will be lost and the other will be fixed. There are some important implications:

  • Since all variation is ultimately lost, eventually one allele will be the ancestor of all alleles in the population.

  • There are \(2N\) alleles, so the chance that any one of them will be the ancestor of all alleles at some point in the future (after \(H_t \rightarrow 0\)) is just \(1/(2N)\).

  • Now, suppose that there are \(j\) copies of the \(A_1\) allele. The chance that any of those \(j\) copies goes on to be the common ancestor of all alleles at some point in the future is just \(j/(2N)\).

Hence, the probability that the \(A_1\) allele will ultimately go to fixation is equal to its current frequency: \[ \Pr[\text{fix} | p] = p \]
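This fixation result is easy to check by brute force (a sketch under W-F assumptions; `runs_to_fixation` is my own helper name):

```python
import random

def runs_to_fixation(p0, N, replicates, seed=2):
    """Run replicate W-F populations to absorption and return the fraction
    in which A1 fixes (reaches 2N copies) rather than being lost."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        count = round(p0 * 2 * N)
        while 0 < count < 2 * N:
            p = count / (2 * N)
            # Binomial resampling of the 2N alleles each generation.
            count = sum(1 for _ in range(2 * N) if rng.random() < p)
        fixed += (count == 2 * N)
    return fixed / replicates

prob = runs_to_fixation(p0=0.3, N=25, replicates=2000)  # should be near 0.3
```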

Let’s use this result to go one step further…

The rate of substitution

Let’s consider the probability that a mutation that is new to the current generation ultimately goes to fixation: \[ \Pr \left[ \text{fix} \mid \text{new mutation} \right] = \rho = \underbrace{2 N \mu}_{\text{\# new mutations}} \times \underbrace{\frac{1}{2N}}_{\Pr \left[ \text{fix} \right]} = \mu \]

Insight!

The expected rate of substitution for neutral mutations is equal to the mutation rate, and independent of \(N\)!!!

  • In a W-F population, a new mutation goes to fixation every \(1/\mu\) generations!

  • We have an evolutionary clock!

  • Neutral mutations are essential tools for studying genetics of populations!

Looking backward in time: The Coalescent

So far we have taken a forward-in-time approach, where estimates are based on models that project forward in time from \(t\) to \(t + 1\). Alternatively, one could develop estimates based on retrospective models that look backward in time. Remember the lineages of different alleles we sketched out in our cartoon of a W-F model? Coalescent approaches are based on the assumption that all alleles at a given locus can be traced back to a single ancestral allele. We then focus on tracing the lineage/genealogy of a sample of alleles backward in time to where they coalesce at the most recent common ancestor.

The Coalescent
the lineage/genealogy of a sample of alleles traced back to their most recent common ancestor.

Bifurcating trees are a natural representation for these genealogies:

(a) The coalescence of gene lineages in our WF model.
(b) The equivalent coalescent tree. \(T_k\) = time with \(k\) alleles segregating (image credit unkn.).
Figure 4: Parallels between the forward-in-time W-F model, and the backward-in-time coalescent.

Within the Coalescent framework, we can define a variety of useful statistical properties of these gene genealogies. For example, we can define \(T_{tot}\), the total time in all branches of the genealogy until all sampled alleles coalesce: \[ T_{tot} = \sum_{k=2}^n k T_k \]

For our example tree in the figure above, \(T_{tot} = 4 T_4 + 3 T_3 + 2 T_2\).

However, not every genealogy will look the same. The actual topology will vary depending on the sample of alleles. This means that the \(T_k\)’s are random variables, as is \(T_{tot}\)! We can therefore calculate the expected value of \(T_k\), just like that of any other random variable. Let’s walk through the logic of calculating \(\E[T_k]\). Consider a sample of \(k\) alleles drawn from a population of \(N\) diploid individuals.

  • The \(1^{st}\) allele has a direct ancestor in the previous generation (obviously).

  • The \(2^{nd}\) allele will have a different ancestor with probability \[ 1 - \frac{1}{2N} = \frac{2N - 1}{2N}. \]

    • The right-hand side of this equation is particularly informative. It shows that there are \(2N\) possible ancestors for the \(2^{nd}\) allele, but only \(2N - 1\) of them are not I.B.D. to the \(1^{st}\) allele.
  • The probability that a \(3^{rd}\) sampled allele will NOT share an ancestor with the previous two, given that the first two do not share an ancestor, is \((2N - 2)/(2N)\).

    • The total probability that the first 3 alleles do not share an ancestor is: \[ \frac{2N - 1}{2N} \times \frac{2N - 2}{2N} = \left( 1 - \frac{1}{2N} \right) \times \left( 1 - \frac{2}{2N} \right). \]
  • With a minor leap of intuition, we can see that the probability that the \(k\) alleles ALL have different ancestors is (ignoring terms of order \(N^{-2}\)): \[ \left( 1 - \frac{1}{2N} \right) \left( 1 - \frac{2}{2N} \right) \cdots \left( 1 - \frac{k-1}{2N} \right) \approx 1 - \frac{1}{2N} - \frac{2}{2N} - \cdots - \frac{k-1}{2N} \]

  • The probability that a coalescence occurs is one minus the probability that it does not, or: \[ \Pr [\geq 1 \text{ coalescence}] = \frac{1 + 2 + \ldots + (k - 1)}{2N} = \frac{k(k-1)}{4N} \]

  • The probability distribution of the time until the first coalescence is a geometric distribution5, with probability of success \(p = k(k - 1)/(4N)\)

5 The geometric distribution is a standard statistical distribution describing the ‘waiting time’ (number of trials) until the first success in a sequence of independent trials with a constant success probability.

  • The mean of the geometric distribution is the reciprocal of the probability of success, which gives us: \[ \E[T_k] = \frac{4 N}{k(k - 1)} \]
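The geometric waiting-time argument can be checked directly by simulation (a sketch; `sample_Tk` is my own name, and the per-generation coalescence probability \(k(k-1)/(4N)\) is taken from the derivation above):

```python
import random

def sample_Tk(k, N, replicates, seed=3):
    """Mean number of generations until the first coalescence among k
    lineages, drawing geometric waiting times with per-generation
    success probability p = k(k-1)/(4N)."""
    rng = random.Random(seed)
    p = k * (k - 1) / (4 * N)
    total = 0
    for _ in range(replicates):
        t = 1
        while rng.random() >= p:  # no coalescence this generation
            t += 1
        total += t
    return total / replicates

N = 100
mean_T2 = sample_Tk(2, N, replicates=5000)  # expect about 2N = 200
```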

NOW… that was a bit of work. But the beauty of the coalescent approach is that all the hard work is now over. We can now quickly calculate several key results:

Some important results from coalescent theory

Expected time interval between coalescent events

For example, the expected length of the interval during which \(i\) lineages remain (i.e., from the coalescence that leaves \(i\) lineages to the one that leaves \(i - 1\)) is just \[ \E[T_i] = \frac{4 N}{i(i - 1)}, \]

and therefore \[ \E[T_2] = \frac{4 N}{2(2 - 1)} = 2 N \]

That is, the expected time to coalescence for any two randomly chosen alleles is equal to \(2N\), the number of haploid genomes in the population!

Time to most recent common ancestor for a sample of alleles

Another important result is the expected time to the most recent common ancestor for a sample of \(n\) alleles: \[ \begin{aligned} \E[T_{\text{MRCA}}] &= \sum_{i=2}^n \E[T_i] \\ &= \sum_{i=2}^n \frac{4 N}{i(i - 1)} \\ &= 4 N \left(1 - \frac{1}{n}\right) \end{aligned} \]
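The last step of this derivation deserves one more line of justification: a partial-fraction expansion makes the sum telescope.

```latex
\sum_{i=2}^{n} \frac{4N}{i(i-1)}
  = 4N \sum_{i=2}^{n} \left( \frac{1}{i-1} - \frac{1}{i} \right)
  = 4N \left( \frac{1}{1} - \frac{1}{n} \right)
  = 4N \left( 1 - \frac{1}{n} \right)
```

Every intermediate \(1/i\) cancels against the \(1/i\) of the next term, leaving only the first and last.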

But notice: As the number of sampled alleles, \(n\), becomes large, \(\E[T_{\text{MRCA}}] \rightarrow 4N\)! Hence, the expected time to the most recent common ancestor in an entire population is \(4N\).

Insight!

Compare the last two results… The expected time for 2 alleles to coalesce is \(T_2 = 2N\)… but the expected time for ALL alleles in the population to coalesce is \(T_{\text{MRCA}} = 4 N\)(!) Hence, no matter how many alleles there are segregating in a population, around half the total time to the most recent common ancestor represents waiting for the last 2 alleles to coalesce. Take another look at Figure 4 (b) !

Expected total time in a coalescent tree

We can also easily find the expected total time in ALL branches of a coalescent tree. We already know that \[ T_{tot} = \sum_{i=2}^n i T_i. \]

So the expected total time is: \[ \E[T_{tot}] = \sum_{i=2}^n i \E[T_i] = 4 N \sum_{i=2}^n \frac{1}{i - 1} \]

The expected number of segregating sites

Now, the expected number of segregating sites, \(S_n\), is equal to the neutral mutation rate, \(\mu\), times the expected total time in the coalescent tree. So, we can easily calculate: \[ \begin{aligned} \E[S_n] &= \mu \E[T_{tot}] \\ &= 4 N \mu \sum_{i=2}^n \frac{1}{i - 1} \\ &= \theta \sum_{i=2}^n \frac{1}{i - 1}. \end{aligned} \]

Remember our old friend \(\theta\)?

Expected heterozygosity(!)

This suggests that \(\theta\) can be well estimated by: \[ \hat{\theta} = \frac{S_n}{1 + \frac{1}{2} + \frac{1}{3} + \ldots + \frac{1}{n-1}} \]
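In code, this estimator (usually called Watterson’s estimator) is a one-liner (a sketch; `watterson_theta` is my own name, and the inputs are made-up illustration values):

```python
def watterson_theta(S, n):
    """Estimate theta from S segregating sites in a sample of n alleles,
    dividing by the harmonic number 1 + 1/2 + ... + 1/(n-1)."""
    a_n = sum(1 / i for i in range(1, n))
    return S / a_n

# e.g., 12 segregating sites in a sample of 10 alleles
theta_w = watterson_theta(S=12, n=10)
```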

But, we already know that the probability that two alleles different by origin are also different by state is \(\hat{H}\). Now, two alleles will differ by state if a mutation occurred on either lineage more recently than their coalescence. The probability that a coalescence occurs in any particular generation is \(1/(2N)\), and the probability that a mutation occurs is \(\Pr[\text{mut.}] = 1 - (1 - \mu)^2 \approx 2 \mu\) per generation. The probability that a mutation is the first event to occur is just its relative probability of occurrence: \[ \begin{aligned} \hat{H} &= \frac{2 \mu}{2 \mu + \frac{1}{2 N}} \\ &= \frac{4 N \mu}{1 + 4 N \mu} \\ &= \frac{\theta}{1 + \theta} \end{aligned} \]

INSIGHT!!!

We have just re-derived the expected heterozygosity, \(H\), using the backward-in-time Coalescent, just as we did using the forward-in-time W-F model, and got the same answer!!!

Surely this is a most satisfying convergence of results!

Hypothesis testing based on The Coalescent (Tajima’s \(D\))

These important results from Coalescent Theory form the basis of most of the hypothesis testing we do in modern genomics and bioinformatics. Here we will briefly connect our earlier results with key estimators.

Recall that the expected value of the sum of the site heterozygosities in a population (NOT a sample) is: \[ \pi = \E \left[ \sum_{i=1}^\infty 2 p_i (1 - p_i) \right] = \theta. \]

Equivalently, we can interpret \(\pi\) as the mean number of nucleotide differences between a pair of alleles drawn at random from a population. We can estimate \(\pi\) using a sample of \(n\) alleles as: \[ \hat{\pi} = \frac{n}{n - 1} \sum_{i=1}^{S_n} 2 \hat{p}_i (1 - \hat{p}_i) \]

where \(\hat{p}_i\) is the frequency of one of the two alleles at the \(i^{th}\) segregating site in the sample. The factor \(n/(n - 1)\) guarantees that \(\E[\hat{\pi}] = \theta\).
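A direct translation of the \(\hat{\pi}\) formula (a sketch; `pi_hat` is my own helper name, and the site frequencies are made-up illustration values):

```python
def pi_hat(freqs, n):
    """Sample nucleotide diversity: the sum of site heterozygosities 2p(1-p)
    with the n/(n-1) correction for sampling bias."""
    return (n / (n - 1)) * sum(2 * p * (1 - p) for p in freqs)

# Three segregating sites in a sample of n = 10 alleles.
est = pi_hat([0.5, 0.1, 0.3], n=10)
```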

Now, since both \(\hat{\theta}\) and \(\hat{\pi}\) are unbiased estimators of \(4 N \mu\), they should be close in value for samples of neutral loci. Tajima’s \(D\) quantifies the deviation between these two estimators. \[ D_T = \frac{\hat{\pi} - \hat{\theta}}{C} \]

where \(C\) is a normalizing constant that you don’t need to know. When these two estimators differ significantly from one another, it suggests that the sites in question are NOT evolving neutrally, for example because they are under selection (demographic processes that distort genealogies can produce similar signals). We will meet Tajima’s \(D\) again in a later lecture on Signatures of Selection.

What we have covered
  • Take-home message 1
  • Take-home message 2
  • Take-home message 3