Genetics Computer Simulation Exercise

BIOR92: Spring 2025

Author

Colin Olito

Published

April 18, 2025

Note

I have tested the code and everything seems to be working correctly, but please let me know if you run into any problems and I will do my best to help troubleshoot. Suggestions for improvements are very ewe.

Background

Population genetics involves both the empirical study of genetic variation as well as building a theoretical foundation for interpreting empirical data and understanding evolutionary processes. Theoretical population genetics is almost entirely mathematical and is almost as old as the broader field of genetics; very soon after the rediscovery of Mendel’s laws of inheritance the question arose as to how these results, which concerned controlled crosses, could be applied to whole populations.

Today, computer simulations are a valuable tool in our genetics toolbox. Computer simulations can be used in many ways. One is to test theoretical results, another is to test the rigor of ideas by giving them a firm mathematical basis. In these short practical exercises we will use simulations to illustrate some fundamental processes, and key results in evolutionary genetics.

The excercises focus on single-locus population genetics models. We will be modeling allele frequency dynamics at a fictional locus \(\mathbf{A}\) with alleles \(A_1\) and \(A_2\) with frequencies \(p\) and \(q\) (where \(q = 1 - p\)). For the most part, we will be tracking the fate of the \(A_2\) allele under a variety of different scenarios.

Instructions

  1. Self-assemble into groups of 2-4 students and - using either your laptop or lab computers - start up python in a terminal, and open up the text-editor of your choice, or via VSCode if that is what you are used to.
    1. You are welcome to use another python interface if you prefer, but note that some of the plotting functions you will be using don’t work in a notebook setting.
  2. You have been provided with the file: Allele-A1-functions.py. Save it into the working directory you will be using for this exercise.
    1. This file defines a variety of basic python functions that you will need to use to answer the problems.
  3. You will use these functions to complete the exercises.
    1. You will need to set and modify the function arguments for each question in order to make plots, do calculations, and generate simulated data.
    2. Be creative! Play around with the functions.
  4. As a group, work through each numbered step in the exercises, then answer the italicized discussion questions in your lab report.
Important
  1. Each member of your group will need to submit the assignment on Canvas. You can all submit the same document, but Please list the names of all group members on your report. Be sure that your names and id numbers are clearly labeled.

Some python preliminaries

In order to complete the exercises, you will need to write and run some python code, but you won’t need to do it all from scratch. I have provided several pre-defined functions that you will use, though you will need to write some minimal code to produce the desired output for the exercises.

As mentioned above, you have been provided with a file titled: Allele-A1-functions.py. Let’s start by saving it into the working directory you will be using for this exercise. This file defines a variety of basic python functions that you will need to use to complete the exercise.

The exercises are designed to be run in an interactive Python session, so let’s fire up Python. Now would also be a good time to open a text editor program of your choice and create a file in your working directory where you will write and save your code for doing the exercises (e.g., myGeneticsSimCode.py).

You should now be able read the file Allele-A1-functions.py directly by running:

import os  
exec(open('../code/Allele-A1-functions.py').read())

You can also copy/paste the contents of the file into your interactive python session. It doesn’t really matter.

Let’s walk through the beginning of this file. The first thing we need to do is import a few packages into python:

from scipy import *
import matplotlib.pyplot as plt
import numpy as np
import random 

Next, we need to define the functions you will use for the exercises. The first function is a simple 1-locus recursion equation, which looks like this:

### Deterministic 1-locus selection model: recursion equation
# args:
# q   : Frequency of A2 allele  (numeric, between 0,1)
# w_1 : Relative fitness of A1A1 genotype (numeric, somewhere between 0 & 1.5)
# w_2 : Relative fitness of A1A2 genotype (numeric, somewhere between 0 & 1.5)
# w_3 : Relative fitness of A2A2 genotype (numeric, somewhere between 0 & 1.5)
# u   : Mutation rate from A1 --> A2 (numeric, usually small, e.g., 0.0001)
def qNext(q, w_1, w_2, w_3, u):
    q = q + (1 - q)*u
    return (q**2 * w_3 + (1 - q)*q*w_2) / ((1 - q)**2 * w_1 + 2*(1 - q)*q * w_2 + q**2 * w_3)

You will see that in Allele-A1-functions.py each function has all of the arguments it accepts described above it. As input, qNext() accepts a frequency for the focal allele (\(q\)) in the current generation, as well as genotypic relative fitness values \(w_1\), \(w_2\), and \(w_3\), as well as a mutation rate from \(A_1 \rightarrow A_2\) (we ignore back-mutation for now). It returns the deterministic allele frequency in the next generation:

qNext(q = 0.1, w_1 = 1, w_2 = 1.05, w_3 = 1.1, u=0)
0.10445544554455446

By itself, qNext() doesn’t do much. However, we can define a wrapper function that iterates this recursion many ties to give a deterministic frequency trajectory for the allele over time:

### Function to generate time series of deterministic allele frequencies
def determTraj(q0, w_1, w_2, w_3, nGen, u=0):

    q    = [0] * nGen # storage array
    q[0] = q0 # initial frequency
    
    # loop over time steps to calculate allele frequency trajectory
    for t in range(1,nGen):
        q[t] = qNext(q=q[t-1], w_1=w_1, w_2=w_2, w_3=w_3, u=u)

    # return allel freq. trajectory
    return np.array(q)

The next one defines a function that makes plotting the results easy:

### Function to plot deterministic trajectory
def plotDetermTraj(q0, w_1, w_2, w_3, nGen, u=0, plotP=False, yLim=[0,1]):
    
    # make data
    q = determTraj(q0 = q0, w_1 = w_1, w_2 = w_2, w_3 = w_3, u = u, nGen = nGen)
    time = np.arange(start=0, stop=nGen, step=1)

    # plot options
    fig,ax=plt.subplots()
    ax.set_xlim([0, nGen])
    ax.set_ylim(yLim)
    ax.set_xlabel('Generations')
    ax.set_ylabel('Frequency of $A_2$ allele ($q$)')

    # make plot
    if plotP:
        plt.plot(time, 1-q, color='dodgerblue', ls='-', lw=2)
    else:
        plt.plot(time, q, color='dodgerblue', ls='-', lw=2)

    plt.show()

Let’s test it out:

### Test the plotting function
plotDetermTraj(q0=0.05, w_1=1, w_2=1.01, w_3=1.05, nGen=500)

If everything worked, python should have produced a plot that looks something like the figure above. If not, then please let me know so that we can troubleshoot and make sure Python is working properly.

Allele-A1-functions.py defines several other functions you will need. I encourage you to refer back to this file to check the proper use of the arguments, etc.

Note

That is about it for the python preliminaries. Below you will find the exercises you will to work through as a group. Code blocks with examples demonstrating how to use the functions appear in callouts boxes. I encourage you to play around with the functions as you are working through the questions… try different parameter values, explore, and have some fun!


Exercises

Exercise 1: Selection in an infinite population

Using the functions we have already defined, set the initial frequency of the \(A_2\) allele to , q_0 = 0.01, and the number of generations to simulate to nGen = 300.

  1. Set the genotypic fitness values to \(w_1 = 0.95\), \(w_2 = 1\), \(w_3 = 0.95\), and run the simulation.
plotDetermTraj(q0=0.01, w_1=1, w_2=1.05, w_3=1.1, nGen=300)
  1. Now, in the plotDetermTraj() function, set the option plotP=true, don’t change anything else. What happens?

  2. Recall that we often express relative fitnesses as: \(w_{1} = 1\), \(w_{2} = 1 + hs\), \(w_{3} = 1 + s\), where \(h\) and \(s\) are the dominance and selection coefficients, respectively. Using these fitness expressions, and setting plotP = false, run simulations for the following fitness values:

    1. \(w_1 = 1.0\), \(w_2 = 1.0\), \(w_3 = 1.04\)
    2. \(w_1 = 1.0\), \(w_2 = 1.01\), \(w_3 = 1.04\)
    3. \(w_1 = 1.0\), \(w_2 = 1.02\), \(w_3 = 1.04\)
    4. \(w_1 = 1.0\), \(w_2 = 1.03\), \(w_3 = 1.04\)
    5. \(w_1 = 1.0\), \(w_2 = 1.04\), \(w_3 = 1.04\)

What parameter are you adjusting here? What happens to the trajectory of allele \(A_2\)? Re-run these simulations using the \(h\) and \(s\) coefficients directly.

h = 1/2
s = 0.1
plotDetermTraj(q0=0.01, w_1 = 1, w_2 = 1 + h*s, w_3 = 1 + s, nGen=300)
  1. Set the fitness values to \(w_1 = 1.00\), \(w_2 = 1.05\), \(w_3 = 1.0\). What happens? How might you write fitness expressions for this situation?

  2. Play around with the fitness values (try a variety of combinations of your choosing). Take note of the outcomes. What happens to the allele you are plotting? Why?

Discussion questions - remember to address each of these in your lab report!
  • What are the possible outcomes of selection in an infinite population?
  • What has to be true for selection to maintain genetic variation?
  • How does dominance alter the evolution of a beneficial mutation?
  • Why is heterozygote fitness so important for the invasion of beneficial mutations?

Exercise 2: Deleterious mutations

Using the plotDetermTraj() function, use the following settings

  • Make sure you are plotting the frequency of the \(A_2\) allele (i,.e., plotP = false).
  • Set nGen = 1000
  • Set the initial frequency of the \(A_2\) allele to q0 = 0.001.
  • Set the mutation rate from \(A_1 \rightarrow A_2\) to \(0.0001\) (i.e., u = 0.0001).

If the frequencies are too low to see clearly, you can modify the y-axis range using the yLim argument in plotDetermTraj(). To get the precise final frequency, use the determTraj() function directly, and inspect the last element.

plotDetermTraj(q0=0.001, w_1=1, w_2=0.99, w_3=0.98, u=0.0001, nGen=1000, plotP = False, yLim=[0,0.025])
test = determTraj(q0=0.001, w_1=1, w_2=0.99, w_3=0.98, u=0.0001, nGen=1000)
test[-1]

np.float64(0.009899565853159188)
  1. Set the fitness values to \(w_1 = 1.0\), \(w_2 = 0.95\), \(w_2 = 0.9\). What are the final allele frequencies?
  2. Set the fitness values to \(w_1 = 1.0\), \(w_2 = 1.0\), \(w_2 = 0.9\). What are the final allele frequencies?
  3. Set the fitness values to \(w_1 = 1.0\), \(w_2 = 1.0\), \(w_2 = 0.0\). What are the final allele frequencies?
Discussion questions
  • What are some important conclusions you can draw from these results?
  • Try to discuss them in relation to disease genetics in your report. e.g., why do well known, deadly, congenital diseases persist in humans?
  • Where are rare (and often deleterious) alleles often found?
  • Why might many deleterious alleles often be strongly recessive?
    • What are some implications for human health?

Exercise 3: Genetic Drift

For the exercises involving genetic drift in finite populations, we will use the following new functions:

  • WFSim() - Runs a single Wright-Fisher simulation for a specified number of generations.
  • plotWFSims() - Visualize results from multiple Wright-Fisher simulations.
  • WFSimToEnd() - Runs a single WF simulation until extinction/fixation
  • WFSimCountFix() - Wrapper function that counts the number of fixations among a specified number of replicate simulations. Also records average time to extinction/fixation.

Refer back to Allele-A1-functions.py for descriptions of the arguments accepted by these functions.

  1. Use the plotWFSims() function, and start with these settings (be sure that u = 0 from now on):
    • Set the initial frequency of the \(A_2\) allele to \(0.5\) (i.e., q0=0.5)
    • Set all the fitness values to \(1\) (i.e., w_1=1, w_2=1, w_3=1).
    • Set the number of generations to run to \(100\) (i.e., nGen=100).
    • Set the population size to \(100\) (i.e., N=100)
    • Set the number of simulations to run to \(50\) (i.e., n_lines=50).
plotWFSims(q0=0.5, w_1=1, w_2=1, w_3=1, nGen=100, N=100, n_lines=50, clrmp='managua')
  1. Now, progressively increase the number of generations from \(100\) to \(200, 300, 500, 1000\). What happens to the allele frequencies in the long run?

  2. Set the initial frequency of \(A_2\) to \(0.9\) and the number of generations to \(1000\). Run \(100\) simulations, and count the number of times allele \(A_2\) goes to fixation (reaches a frequency of \(1\); Hint: it may be easier to count the number of times it went extinct). In what proportion of the simulations did \(A_2\) fix? How many generations did it take on average?

  3. Now use the WFSimCountFix() function to do the counting for you. This function returns a list with the following information: [# of A_2 fixations, Avg. time to ext., Avg. time to fix.]. Use this function to calculate the probability of fixation and average time to fixation among \(1000\) replicate simulations for populations of \(N = 100\) using the following initial frequencies:

    • \(q_0 = 0.5\)
    • \(q_0 = 0.4\)
    • \(q_0 = 0.3\)
    • \(q_0 = 0.2\)
    • \(q_0 = 0.1\)
    • \(q_0 = 1/500\)
WFSimCountFix(q0 = 0.5, w_1 = 1, w_2 = 1, w_3 = 1, N=100, nSims=1000)
  1. Set the initial frequency, q_0 to \(1/(2N)\) (e.g., if N = 100, set q_0 = 0.005). Using the WFSimCountFix() function, estimate how long it takes the \(A_2\) allele to go extinct on average. What is the relation with \(N\)?

Here is another way to to do this where R calculates the initial frequency for you based on your chosen population size:

popSize = 100
initialFreq = 1/(2*popSize)
WFSimCountFix(q0 = initialFreq, w_1 = 1, w_2 = 1, w_3 = 3, N = popSize, nSims = 10000)
  1. In lecture, we discussed the decay of heterozygosity, \(H\), over time due to genetic drift in finite populations, and covered several equations describing the expected heterozygosity over time (e.g., \(H_t = H_0(1 - \frac{1}{2N})^t\)). One of the functions you have been provided, WFSim(), will return as output an array with the allele frequency in each generation from a single WF simulations. Using this function, and starting with an initial frequency of q0 = 0.5, explore how well the analytic equations describe the decay of genetic variation in your simulations. How many simulations you you need to run before the results begin to converge on the prediction?

You will probably want to write some new code of your own to automate the process of saving output from replicate simulations so that you can calculate \(H\) and mean \(H_t\) from the allele frequencies in your simulations.

TRY TO FIGURE THIS OUT AS A GROUP. IF, and only if, you cannot find a way to do this calculation, try looking at the function H_decay_plot in the Allele-A1-function.R file.

H_decay_plot(q0 = 0.5, N = 500, w_1 = 1, w_2 = 1, w_3 = 1, u=0, nGen=5000, n_sims=10)

Discussion questions

  • What are the possible outcomes of genetic drift in a finite population?
  • What is the probability that a neutral allele will go to fixation?
  • How long does it take for a neutral allele to go to fixation on average?
  • How long does it take for a neutral allele to go extinct on average?
    • Is there a relation between these average times and the population size, \(N\)? initial frequency, \(q_0\)? Why?
  • How well did your simulations match analytic predictions for the rate of decay of genetic variation over time due to genetic drift?
    • How many replicates simulations did you need to run before the simulations converged on the predictions?
    • Did you try for different population sizes? If so, how did \(N\) affect the results?

Exercise 4: Selection in finite populations.

Simulate the following scenarios, where we are tracking the outcome of selection on a rare beneficial allele (\(A_2\) in this case).

  1. Use WFSimCountFix() to run \(1000\) simulations with the following settings:
    • Set the initial frequency of \(A_2\) to \(0.02\) and \(N = 1000\).
    • Set the fitness values to \(w_1= 1.0\), \(w_2=1.01\), \(w_3=1.02\)

In what proportion of the simulations did the \(A_2\) allele go to fixation (reach a frequency of 1)? How does this compare to what the fixation probability of a neutral allele?

  1. Now we will investigate the fixation probability of single-copy beneficial mutations. Fill in the following table with the proportion of \(10,000\) simulations that result in fixation of allele \(A_2\). **Be sure to always set q0 = 1/(2*N). The left-most column gives fitness values for \(A_1A_2\) and \(A_2A_2\) genotypes:
Fitnesses (\(w_1\) always equals 1.0) \(N = 50\) \(N = 100\) \(N = 1000\)
\(w_2 = 1.0005\), \(w_3 = 1.001\)
\(w_2 = 1.005\), \(w_3 = 1.01\)
\(w_2 = 1.05\), \(w_3 = 1.1\)
  1. Recall that the selection coefficient, \(s\) is equal to the genotypic fitness values minus \(1\) (i.e., if the fitness of the heterozygote \(A_1A_2\) genotype is \(w_{2} = 1.05\), then \(s_{het} = 0.05\)). From your results above, can you come up with a ‘rule of thumb’ for when the allele frequency dynamics become more deterministic (as in Exercise 1), and when the dynamics are more dominated by genetic drift? Take a careful look at the fixation probabilities as you try to figure this out. Can you also come up with an approximation for the fixation probability of a beneficial mutation from these data?

Try to come up with simple mathematical expressions for both the threshold where selection begins to dominate drift, and for the fixation probability of beneficial mutations using the parameters \(N\) and \(s_{het}\).

Discussion questions
  • What happened as you increased either the population size, the strength of selection, or both?
  • At what point did the dynamics transition from being dominanted by drift vs. selection? What was your ‘rule of thumb’?
  • Could you come up with an approximation for the fixation probability of a single beneficial mutation? What was it?
  • Do beneficial alleles always go to fixation in finite populations?
    • What does this tell you about the idea of natural selection as a ‘hill-climbing’ or optimization algorithm that inevitably finds the global fitness optimum?

Writing up your lab report

Lab Report Instructions
  • Write your lab report as a group, remember to include all names/student numbers.
  • For each Exercise (1-4) include the following in your report:
    • Write about 1/2 page addressing the discussion questions at the end of each exercise.
    • Be sure to refer to and describe any aspects of the exercises that you did which are relevant for answering the discussion questions.
    • Including figures is fine, but don’t go crazy.
    • You don’t need to include your code, though if you’ve used a Jupyter notebook for your work it may be easier to submit the whole notebook.
  • Grading the reports is easier if each member of the group submits a copy of the same report on Canvas (but one last time, it is important to include all names/numbers, so I can keep track of who was in which group!).