Quant. Gen. III: Correlation between relatives

The two traditional concerns of quantitative genetics

During the development of the field of quantitative genetics (i.e., during ‘the battle’), the biometricians focused heavily on finding quantitative approaches to describe:

  1. The correlation between relatives.

  2. Population responses to selection (in the short term).

Today, we will focus on the first of these, which was also the primary concern of the biometricians at the turn of the 20th century.

Quantifying similarity among relatives was an immediate and obvious application of Darwin’s ideas, and a promising approach to solving the mystery of how inheritance worked. Focusing on quantifying correlations between relatives led quickly to a variety of approaches to compare similarity in quantitative traits (e.g., height) between offspring, parents, grandparents, and more distant relations. Some key methods, developed primarily by Weldon and Karl Pearson, included linear regression and variance partitioning. As with many things, these methods were later fully elaborated and perfected by R.A. Fisher between 1916 and 1930.

A simple model from which to build

If genes are important for determining phenotype, we might expect relatives to resemble each other more closely than a randomly chosen pair of individuals in the population would. To assess the relative contribution of genes to phenotype, we require an explicit model that states, however naïvely, how to construct a phenotype from genetic and environmental factors. The figure below shows perhaps the simplest possible model for doing so:

Figure 1: Simple additive model of inheritance for parent & offspring. Note: \(X_m\) and \(X_p\) are the additive effects of the maternally and paternally inherited alleles, and \(\epsilon\) is the environmental effect

Using this model, we can express the phenotype in terms of the expectations (i.e., the expected mean values) of its components, as follows:

\[ \E[P] = \E[X_m] + \E[X_p] + \E[\epsilon]. \]

By convention, and to standardize phenotypes, we will always use mean-standardized allelic effects. By this, we mean that all of the phenotypic effects of the alleles in the model are expressed as deviations from the population mean (i.e., the \(X_i\) and \(\epsilon\) are deviations from the population mean!). The expected value of a deviation from a mean is, by definition, zero, so that

\[ \E[P] = \E[X_m] + \E[X_p] + \E[\epsilon] = 0 + 0 + 0 = 0. \]

Now, let’s assume that the additive allelic effects \(X_m\) and \(X_p\) and the environmental effect \(\epsilon\) are normally distributed random variables, each with an expectation of zero and with variances of:

\[ \begin{aligned} \Var[X_m] &= \Var[X_p] = V_A/2 \\ \Var[\epsilon] &= V_E \end{aligned} \]

Using the rule for the variance of a sum of random variables, we can now calculate the expected phenotypic variance of the offspring, \(P\):

\[ \begin{aligned} \Var[P] = \Var[X_m] &+ \Var[X_p] + \Var[\epsilon] \\ &+ 2 \Cov[X_m,X_p]^{\ast} + 2 \Cov[X_m,\epsilon]^{\dagger} + 2 \Cov[X_p,\epsilon]^{\dagger} \end{aligned} \]

\(^{\ast}\) Parents are assumed to be unrelated, so \(\Cov[X_m,X_p] = 0\).

\(^{\dagger}\) For simplicity, we also assume no genotype-by-environment interactions, so \(\Cov[X_m,\epsilon] = \Cov[X_p,\epsilon] = 0\).

1 Note: G\(\times\)E interactions DO occur in nature. Our only reason for ignoring them is to make the model simpler. In more advanced models of QG, you would estimate these explicitly.

We make two simplifying assumptions here. First, we assume that the parents are unrelated\(^{\ast}\), so that there is no covariance between their genotypes. Second, we assume that there are no genotype-by-environment interactions\(^{\dagger}\), which allows us to set the remaining covariance terms to zero1. Under these two simplifying assumptions, we recover our basic model of genetic variances:

\[ \begin{aligned} \Var[P] &= \Var[X_m] + \Var[X_p] + \Var[\epsilon] \\ V_P &= \frac{V_A}{2} + \frac{V_A}{2} + V_E \\ V_P &= V_A + V_E \\ \end{aligned} \]
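To see these variance rules in action, here is a minimal simulation sketch in Python (assuming NumPy is available; the values chosen for \(V_A\), \(V_E\), and the sample size are arbitrary illustrations, not from the lecture). It draws the two allelic effects and the environmental deviation as independent normal variables and confirms that the phenotypic variance is approximately \(V_A + V_E\):

```python
import numpy as np

rng = np.random.default_rng(1)

V_A, V_E = 1.0, 0.5   # assumed additive-genetic and environmental variances
n = 100_000           # number of simulated individuals

# Mean-standardized effects: each allelic effect has variance V_A / 2
X_m = rng.normal(0.0, np.sqrt(V_A / 2), n)   # maternally inherited allele
X_p = rng.normal(0.0, np.sqrt(V_A / 2), n)   # paternally inherited allele
eps = rng.normal(0.0, np.sqrt(V_E), n)       # environmental deviation

P = X_m + X_p + eps                          # phenotype under the additive model

print(np.var(P))   # ~1.5, i.e. approximately V_A + V_E
```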

Correlation between parents & offspring

Now let’s try to calculate the correlation between a particularly simple pair of relatives: a single parent (a mother in this case) and her offspring:

\[ \begin{aligned} P_P &= X_m + X_m^{\prime} + \epsilon_P \\ P_O &= X_m + X_p + \epsilon_O \end{aligned} \]

where \(X_m^{\prime}\) is the maternal allele that is NOT passed to the offspring.

We can calculate the covariance of the mother’s & offspring’s phenotypes as follows2:

2 Again, this looks horrendous but is conceptually relatively simple.

\[ \begin{aligned} \Cov[P_P,P_O] = \,&\Cov[X_m,X_m] + \Cov[X_m,X_p] + \Cov[X_m^{\prime},X_m] + \Cov[X_m^{\prime},X_p] \\ &+ \Cov[X_m,\epsilon_O] + \Cov[X_m^{\prime},\epsilon_O] + \Cov[X_m,\epsilon_P] + \Cov[X_p,\epsilon_P] \\ &+ \Cov[\epsilon_P,\epsilon_O] \end{aligned} \]

There are nine covariance terms because \(3\) factors contribute to each phenotype. BUT…

  • \(\Cov[X_m,X_m] = \Var[X_m] = V_A/2\). All other covariance terms between genetic contributions are \(0\)’s.3
  • No G\(\times\)E , so all covariances involving genetic and environmental effects are also \(0\)’s.
  • The last term is tricky. In reality there can be covariance between the environmental effects experienced by both parents and offspring4.

3 Due to our assumption that parents are randomly chosen.

4 Can be controlled in lab, and will be assumed to be \(0\) here for simplicity.

After all of these cancellations, we are left with:

\[ \Cov[P_P,P_O] = \frac{V_A}{2} \]

To calculate the correlation coefficient, we divide by \(\sqrt{\Var[P_P]\,\Var[P_O]}\); since both phenotypes have the same total variance, \(V_P\), this gives:

\[ \begin{aligned} \Corr[P_P,P_O] &= \frac{\Cov[P_P,P_O]}{\sqrt{\Var[P_P] \Var[P_O]}} \\ &= \frac{V_A}{2V_P} = \frac{h^2}{2} \end{aligned} \]
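If you want to check this result numerically, here is a hedged simulation sketch of the same setup (again assuming NumPy; the variances and sample size are arbitrary choices, not from the lecture). With \(V_A = V_E\) the heritability is \(h^2 = 0.5\), so the parent-offspring correlation should come out near \(0.25\):

```python
import numpy as np

rng = np.random.default_rng(2)

V_A, V_E = 1.0, 1.0          # assumed variances, so h^2 = 0.5
n = 200_000                  # number of mother-offspring pairs

# The mother's phenotype contains the transmitted allele X_m and the
# non-transmitted allele X_m'; the offspring contains X_m plus a paternal allele.
X_m       = rng.normal(0.0, np.sqrt(V_A / 2), n)   # allele passed to offspring
X_m_prime = rng.normal(0.0, np.sqrt(V_A / 2), n)   # allele NOT passed on
X_p       = rng.normal(0.0, np.sqrt(V_A / 2), n)   # paternal allele in offspring

P_mother    = X_m + X_m_prime + rng.normal(0.0, np.sqrt(V_E), n)
P_offspring = X_m + X_p       + rng.normal(0.0, np.sqrt(V_E), n)

h2 = V_A / (V_A + V_E)
print(np.corrcoef(P_mother, P_offspring)[0, 1])   # ~0.25
print(h2 / 2)                                     # 0.25
```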

Correlation between relatives

The next step is to define the correlation between an arbitrary pair of relatives, \(X\) and \(Y\), whose phenotypes are determined by:

\[ \begin{aligned} P_X &= X_m + X_p + \epsilon_X \\ P_Y &= Y_m + Y_p + \epsilon_Y \end{aligned} \]

Let’s continue assuming there are no G\(\times\)E interactions (and, as before, no shared environmental effects), so that:

\[ \Cov[P_X,P_Y] = \Cov[X_m,Y_m] + \Cov[X_m,Y_p] + \Cov[X_p,Y_m] + \Cov[X_p,Y_p] \]

Recall from our lecture on inbreeding that the values of these covariances depend on the number of shared alleles! Specifically, each pair of alleles that are identical-by-descent yields one non-zero covariance of magnitude \(V_A/2\). So that:

\[ \Cov[P_X,P_Y] = \Pr[0~\text{IBD}] \cdot 0 + \Pr[1~\text{IBD}] \cdot \frac{V_A}{2} + \Pr[2~\text{IBD}] \cdot V_A \]

We can rearrange this equation, using the definition of the coefficient of relatedness (\(\lambda = \Pr[1~\text{IBD}]/2 + \Pr[2~\text{IBD}]\)):

\[ \begin{aligned} \Cov[P_X,P_Y] &= \Pr[1~\text{IBD}] \cdot \frac{V_A}{2} + \Pr[2~\text{IBD}] \cdot V_A \\ &= \frac{\Pr[1~\text{IBD}]}{2} \cdot V_A + \Pr[2~\text{IBD}] \cdot V_A \\ &= \lambda V_A \end{aligned} \]

Finally, dividing both sides of the equation by \(V_P\) gives:

\[ \Corr[P_X,P_Y] = \lambda h^2 \]

Insight!

The correlation between a pair of relatives is equal to the coefficient of relatedness times the heritability! 5

5 Here is a most satisfying answer to the question the biometricians were asking!!!
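As a preview of the next section, here is a small simulation sketch for full siblings (assuming NumPy; all parameter values are illustrative, not from the lecture). Each sibling inherits one randomly chosen allele from each parent, and the observed correlation should be close to \(\lambda h^2\) with \(\lambda = 1/2\):

```python
import numpy as np

rng = np.random.default_rng(3)

V_A, V_E = 1.0, 1.0              # assumed variances, so h^2 = 0.5
n = 200_000                      # number of full-sibling pairs

# Each parent carries two mean-standardized allelic effects with variance V_A/2
mom = rng.normal(0.0, np.sqrt(V_A / 2), (n, 2))
dad = rng.normal(0.0, np.sqrt(V_A / 2), (n, 2))

def make_offspring():
    """One randomly chosen allele from each parent plus an independent environmental deviation."""
    m_allele = mom[np.arange(n), rng.integers(0, 2, n)]
    d_allele = dad[np.arange(n), rng.integers(0, 2, n)]
    return m_allele + d_allele + rng.normal(0.0, np.sqrt(V_E), n)

sib1, sib2 = make_offspring(), make_offspring()

h2, lam = V_A / (V_A + V_E), 0.5                 # lambda = 1/2 for full siblings
print(np.corrcoef(sib1, sib2)[0, 1], lam * h2)   # both ~0.25
```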

The coefficient of relatedness

Understanding the coefficient of relatedness requires that you are comfortable with the concept of identity by descent. Let’s review it again, with an emphasis on making this connection.

Figure 2: Identity By Descent vs. Identity By State
Coefficient of relatedness

We can define the coefficient of relatedness, \(\lambda\), in one of two ways:

  1. One-half the average number of alleles shared identical-by-descent between two individuals.

  2. The total probability of sharing IBD alleles between two individuals, \(\lambda = \Pr[1~\text{IBD}]/2 + \Pr[2~\text{IBD}]\).

Let’s walk through a few examples:

Full Siblings

Figure 3: Calculating \(\lambda\) following Definition \(1\) above.

Alternatively, we can find the coefficient of relatedness by computing the probabilities of inheriting the IBD alleles:

Figure 4: Calculating \(\lambda\) following Definition \(2\) above.
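Putting the pieces together for full siblings, which share one allele IBD with probability \(1/2\) and two alleles IBD with probability \(1/4\):

\[ \lambda = \frac{\Pr[1~\text{IBD}]}{2} + \Pr[2~\text{IBD}] = \frac{1/2}{2} + \frac{1}{4} = \frac{1}{2}, \qquad \Corr[P_X,P_Y] = \lambda h^2 = \frac{h^2}{2} \]

So, under this purely additive model, full siblings are expected to be exactly as correlated as a single parent and its offspring.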

Aunt/Uncle

Figure 5: Calculating \(\lambda\) for aunt-uncle relationships.

Exercise: Work through this example on your own to verify that it is correct.
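As a quick check on these values, here is a tiny Python helper (the function name relatedness is mine, purely illustrative) that applies Definition 2, \(\lambda = \Pr[1~\text{IBD}]/2 + \Pr[2~\text{IBD}]\), to a few standard relationships:

```python
def relatedness(pr_1_ibd, pr_2_ibd):
    """Coefficient of relatedness: lambda = Pr[1 IBD]/2 + Pr[2 IBD] (Definition 2)."""
    return pr_1_ibd / 2 + pr_2_ibd

print(relatedness(1.0, 0.0))     # parent-offspring:          0.5
print(relatedness(0.5, 0.25))    # full siblings:             0.5
print(relatedness(0.5, 0.0))     # aunt/uncle - niece/nephew: 0.25
print(relatedness(0.25, 0.0))    # first cousins:             0.125
```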

Now, let’s come at this from a slightly different angle…

Correlation between continuous variables

First, let’s briefly refresh your memory about regression analyses and correlations between continuous variables.

For a general understanding of the principles of a linear regression analysis (without spending a full day on it), imagine an experiment with two variables, \(x\) and \(y\), where you want to see how the value of \(x\) influences the value of \(y\). The most common way to present the results is a standard two-dimensional scatterplot:

Figure 6: Regression analyses

The linear regression creates a straight line of the form \(y = \alpha + \beta x\), based on the data. For each individual (observed) \(y\) variable (\(y_i\)), the line predicts a value: \(\alpha + \beta x_i\), where \(x_i\) is the \(x\) variable that corresponds to \(y_i\). The individual differences between the observed and predicted values, \(y_i - (\alpha + \beta x_i)\), are a measure of how well the line matches the data.

To find the line of “best fit”, the regression analysis minimizes the squared deviations between the observed and predicted values with respect to \(\alpha\) and \(\beta\). This means that the values of \(\alpha\) and \(\beta\) that minimize the expression \(\sum_{i=1}^n \left( y_i - (\alpha + \beta x_i) \right)^2\) define the “best” straight line.

Note that the observations of the \(y\) variable must be independent of each other. You can also calculate standard errors for the estimates of \(\alpha\) and \(\beta\). If the random deviations of the \(y\) values from the values predicted by the regression line follow a Normal distribution, you can also easily test hypotheses regarding the regression line.

Within this framework (and without getting into the details), we can express the regression slope, \(\beta\), in terms of the sample variances and covariance of \(x\) and \(y\):

\[ \hat{\beta} = \frac{s_{x,y}}{s^2_{x}} = r_{x,y} \frac{s_y}{s_x} \]

where

  • \(r_{x,y}\) is the sample correlation coefficient between \(x\) and \(y\),
  • \(s_x\) and \(s_y\) are the uncorrected sample standard deviations of \(x\) and \(y\).
  • \(s_x^2\) and \(s_{x,y}\) are the sample variance and sample covariance, respectively.
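The equivalence between these expressions is easy to verify numerically. Below is a short Python sketch (assuming NumPy; the intercept, slope, and noise level of the simulated data are arbitrary choices) that computes \(\hat{\beta}\) from the sample covariance and variance, from \(r_{x,y}\, s_y/s_x\), and from an ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data: y depends linearly on x plus noise (alpha = 1.0, beta = 0.6 chosen arbitrarily)
x = rng.normal(0.0, 2.0, 1_000)
y = 1.0 + 0.6 * x + rng.normal(0.0, 1.0, 1_000)

# Slope from the sample covariance and variance of x ...
beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# ... and, equivalently, from the correlation coefficient and the two standard deviations
r = np.corrcoef(x, y)[0, 1]
beta_hat_alt = r * np.std(y, ddof=1) / np.std(x, ddof=1)

print(beta_hat, beta_hat_alt)    # identical, both ~0.6
print(np.polyfit(x, y, 1)[0])    # an ordinary least-squares fit gives the same slope
```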

The BIG reveal!

So this is the “AHA!” moment: using the expression for \(\beta\) given above, whenever the \(x\) and \(y\) variables have the same variance (as they do when both are phenotypes drawn from the same population), the slope of the regression line is equal to the correlation coefficient!

\[ \hat{\beta} = r \dfrac{\sqrt{\sum (y_i-\bar{y})^2}}{\sqrt{\sum(x_i-\bar{x})^2}} \]

Hence, the slope of the regression of an offspring’s phenotype on their relative’s phenotype is:

\[ \text{Slope} = \lambda h^2 = \beta \]

Figure 7: Regression: offspring phenotype on relative’s phenotype.

Note that for a mid-parent regression, where the \(x_i\) value for each offspring trait value, \(y_i\), is the mean of the two parents’ phenotypic values, \(\lambda = 1\), so that \(\hat{\beta}_{\text{mid-parent}} = h^2\):

Figure 8: Mid-parent regression.
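Here is a hedged simulation sketch of a mid-parent regression (assuming NumPy; all parameter values are illustrative). The reason the factor of \(1/2\) disappears is that the variance of the mid-parent value is only \(V_P/2\) under random mating, while its covariance with the offspring phenotype remains \(V_A/2\):

```python
import numpy as np

rng = np.random.default_rng(5)

V_A, V_E = 1.0, 1.0                 # assumed variances, so true h^2 = 0.5
n = 100_000                         # number of families

# Unrelated parents, each built from two allelic effects plus an environmental deviation
mom = rng.normal(0.0, np.sqrt(V_A / 2), (n, 2))
dad = rng.normal(0.0, np.sqrt(V_A / 2), (n, 2))
P_mom = mom.sum(axis=1) + rng.normal(0.0, np.sqrt(V_E), n)
P_dad = dad.sum(axis=1) + rng.normal(0.0, np.sqrt(V_E), n)

# Each offspring inherits one randomly chosen allele from each parent
m_allele = mom[np.arange(n), rng.integers(0, 2, n)]
d_allele = dad[np.arange(n), rng.integers(0, 2, n)]
P_off = m_allele + d_allele + rng.normal(0.0, np.sqrt(V_E), n)

mid_parent = (P_mom + P_dad) / 2
beta_mid = np.cov(mid_parent, P_off)[0, 1] / np.var(mid_parent, ddof=1)
print(beta_mid)                      # ~0.5, i.e. approximately h^2
```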

Accounting for non-independence of parental values

For many traits and populations, the assumption of independence between the maternal and paternal phenotypes will be violated; for example, whenever there is any kind of assortative mating by phenotype. In this case, an estimate of narrow-sense heritability based on a single-parent regression analysis will be inflated. However, we can calculate an adjusted \(h^2\) value which accounts for this, provided that we can obtain an independent estimate of the correlation between parental phenotypes:

\[ \hat{\beta}_{1 par.} = \frac{1}{2} \hat{h}^2 (1 + \rho_{m,f}) \]

Rearranging, we have:

\[ \hat{h}^2_{\text{adj.}} = \frac{\hat{\beta}_{1 par.}} {\frac{1}{2} (1 + \rho_{m,f})} \]

where \(\rho_{m,f}\) is the correlation coefficient between parental phenotypes.
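As a worked example of this adjustment (with made-up numbers, not estimates from any real data set): if a single-parent regression gives \(\hat{\beta}_{1 par.} = 0.30\) and mates’ phenotypes are correlated with \(\rho_{m,f} = 0.20\), the naive estimate \(2\hat{\beta} = 0.60\) overstates the adjusted value \(0.30 / (0.5 \times 1.2) = 0.50\):

```python
# A minimal sketch of the adjustment; beta_1par and rho_mf are made-up example values,
# not estimates from any real data set.
beta_1par = 0.30   # slope of an offspring-on-single-parent regression
rho_mf = 0.20      # independently estimated correlation between mates' phenotypes

h2_naive = 2 * beta_1par                         # assumes random mating (rho = 0)
h2_adjusted = beta_1par / (0.5 * (1 + rho_mf))   # adjusted estimate

print(h2_naive, h2_adjusted)   # 0.6 vs. 0.5: the naive estimate is inflated
```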