5 Mathematical expectation
Upon completion of this chapter, you should be able to:
- understand the concept and definition of mathematical expectation.
- compute the expectations of a random variable, functions of a random variable and linear functions of a random variable.
- compute the variance and other higher moments of a random variable.
- derive the moment-generating function of a random variable and linear functions of a random variable.
- find the moments of a random variable from the moment-generating function.
- state and use Tchebysheff’s inequality.
5.1 Expected values
Because random variables are random, knowing the outcome on any one realisation of the random process is not possible. Instead, we can talk about what we might expect to happen, or what might happen on average.
This is the idea of mathematical expectation. In more familiar terms, the mathematical expectation of a random variable is the mean of the random variable. Mathematical expectation goes far beyond just computing means, but we begin here as the idea of a mean is easily understood.
The definition looks different in detail for discrete, continuous and mixed random variables, but the intention is the same.
Definition 5.1 (Expectation) The expectation or expected value (or mean) of a random variable \(X\) is written \(\operatorname{E}[X]\) (or \(\mu\), or \(\mu_X\) to distinguish between random variables).
For a discrete random variable \(X\) with PMF \(p_X(x)\), the expected value is \[ \operatorname{E}[X] = \sum_{x\in \mathcal{R}_X} x\, p_X(x). \] For a continuous random variable \(X\) with PDF \(f_X(x)\), the expected value is \[ \operatorname{E}[X] = \int_{-\infty}^\infty x\, f_X(x)\, dx. \]
For a mixed random variable \(X\), the expected value is a combination of the two above results, for the discrete and continuous components of \(\mathcal{R}_X\); that is, \[ \operatorname{E}[X] = \sum_{x_i} x_i \, p_X(x_i) + \int_{-\infty}^\infty x \, f_X(x) \, dx, \] where \(p_X(x)\) is the probability mass function (PMF) over the discrete points \(x_i\in \mathcal{R}_X\), and \(f_X(x)\) is the probability density function (PDF) over the continuous regions of \(\mathcal{R}_X\).
In the rest of this chapter, the case of a mixed random variable \(X\) will not be explicitly discussed; however, the results remain a combination of the discrete case for the discrete points in \(\mathcal{R}_X\) and the continuous case for the continuous component of \(\mathcal{R}_X\).
Effectively \(\operatorname{E}[X]\) is a weighted average of the points in \(\mathcal{R}_X\), the weights being the probabilities for each value of \(x\in \mathcal{R}_X\) in the discrete case and probability densities in the continuous case.
Example 5.1 (Expectation for discrete variables) Consider the discrete random variable \(U\) with probability function \[ p_U(u) = \begin{cases} (u^2 + 1)/5 & \text{for $u = -1, 0, 1$};\\ 0 & \text{elsewhere}, \end{cases} \] so that \(\mathcal{R}_U = \{-1, 0, 1\}\). The expected value of \(U\) is \[\begin{align*} \operatorname{E}[U] &= \sum_{\mathcal{R}_U} u\, p_U(u) \\ &= \sum_{\mathcal{R}_U} u \times\left( \frac{u^2 + 1}{5} \right) \\ &= \left( -1 \times \frac{(-1)^2 + 1}{5} \right ) + \left( 0 \times \frac{(0)^2 + 1}{5} \right ) + \left( 1 \times \frac{(1)^2 + 1}{5} \right ) \\ &= -2/5 + 0 + 2/5 = 0. \end{align*}\] The expected value of \(U\) is \(\operatorname{E}[U] = 0\).
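As a quick numerical check, the weighted average in Example 5.1 can be computed directly in R (a minimal sketch using base R only):

```r
# Support and PMF of U from Example 5.1
u <- c(-1, 0, 1)
p_u <- (u^2 + 1) / 5

sum(p_u)      # the probabilities sum to one
sum(u * p_u)  # E[U], a weighted average: 0
```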
Example 5.2 (Expectation for continuous variables) Consider a continuous random variable \(X\) with PDF \[ f_X(x) = \begin{cases} x/4 & \text{for $1 < x < 3$};\\ 0 & \text{elsewhere}. \end{cases} \] The expected value of \(X\) is \[\begin{align*} \operatorname{E}[X] &= \int_{-\infty}^\infty x\, f_X(x) \, dx = \int_1^3 x(x/4)\, dx\\ &= \left.\frac{1}{12} x^3\right|_1^3 = 13/6. \end{align*}\] The expected value of \(X\) is \(\operatorname{E}[X] = 13/6\).
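For continuous random variables the expectation is an integral, so R's integrate() gives a quick numerical check of Example 5.2 (a minimal sketch):

```r
# PDF of X from Example 5.2: f_X(x) = x/4 on (1, 3)
f_X <- function(x) ifelse(x > 1 & x < 3, x / 4, 0)

integrate(function(x) x * f_X(x), lower = 1, upper = 3)$value  # approximately 2.1667
13 / 6                                                         # the exact value
```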
Example 5.3 (Expectation for mixed variables) Consider a continuous random variable \(W\) with probability function \[ f_W(w) = \begin{cases} 1/2 & \text{for $w = 0$};\\ \exp(-2w) & \text{for $w > 0$};\\ 0 & \text{elsewhere}, \end{cases} \] so that \(p_W(w) = 1/2\) for \(w = 0\), and \(f_W(w) = \exp(-2w)\) for \(w > 0\). The expected value of \(W\) is \[\begin{align*} \operatorname{E}[W] &= \overbrace{\sum_{w = 0} w\, p_W(w)}^{\text{Discrete component}} \quad + \quad \overbrace{\int_{-\infty}^\infty w\, f_W(w)\, dw}^{\text{Continuous component}}\\ &= \sum_{w = 0} 0\times (1/2)\quad + \quad \int_{0}^\infty w\times \exp(-2w)\, dw\\ &= 0 \quad + \quad 1/4\\ &= 1/4. \end{align*}\] The expected value of \(W\) is \(\operatorname{E}[W] = 1/4\).
Example 5.4 (Expectation for a coin toss) Consider tossing a coin once and counting the number of tails. Let this random variable be \(T\). The probability function is \[ p_T(t) = \begin{cases} 0.5 & \text{for $t = 0$ or $t = 1$};\\ 0 & \text{otherwise.} \end{cases} \] The expected value of \(T\) is \[\begin{align*} \operatorname{E}[T] &= \sum_{t = 0}^1 t\, p_T(t)\\ &= \Pr(T = 0) \times 0 \quad + \quad \Pr(T = 1) \times 1\\ &= (0.5 \times 0) \qquad + \qquad (0.5 \times 1) = 0.5. \end{align*}\] Of course, \(0.5\) tails can never actually be observed in practice on one toss. But it would be silly to round up (or down) and say that the expected number of tails on one toss of a coin is one (or zero). The expected value of \(0.5\) simply means that over a large number of repetitions of this random process, a tail is expected to occur in half of those repetitions.
Example 5.5 (Mean not defined) Consider the distribution of \(Z\), with the probability density function \[ f_Z(z) = \begin{cases} z^{-2} & \text{for $z \ge 1$};\\ 0 & \text{elsewhere} \end{cases} \] as in Fig. 5.1. The expected value of \(Z\) is \[ \operatorname{E}[Z] = \int_1^{\infty} z \frac{1}{z^2}\, dz = \int_1^\infty \frac{1}{z}\, dz = \log z \Big|_1^\infty. \] However, \(\displaystyle\lim_{z\to\infty} \log z = \infty\), so the integral does not converge: the expected value \(\operatorname{E}[Z]\) is undefined.

FIGURE 5.1: The probability function for the random variable \(Z\). The mean is not defined.
# Define values of z > 1 to plot over
z <- seq(1, 6, length.out = 100)

# Plot the density for z > 1
plot(x = z, y = z^(-2),
     type = "l", lwd = 2, las = 1,
     xlim = c(0, 6), ylim = c(-0.025, 1),
     xlab = expression(italic(z)),
     ylab = "Density",
     main = expression("The probability function for" ~ italic(Z)))

# The density is zero for z < 1
lines(x = c(-1, 1),
      y = c(0, 0),
      lwd = 2)            ### lwd = 2: thicker line width

# Dashed vertical line at z = 1
abline(v = 1, lty = 2,    ### lty = 2: dashed line
       col = "grey")

# Show open point at (1, 0)
points(x = 1, y = 0,
       pch = 1)           ### pch = 1: open circle
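The divergence in Example 5.5 can also be seen numerically: truncating the integral at an upper limit \(B\) gives \(\log B\), which grows without bound as \(B\) increases. A minimal sketch in R:

```r
# Truncated version of E[Z]: integrate z * z^(-2) = 1/z from 1 up to B
B <- 10^(1:4)
truncated <- sapply(B, function(b)
  integrate(function(z) z * z^(-2), lower = 1, upper = b)$value)

cbind(B = B, truncated = truncated, log_B = log(B))  # grows like log(B); no finite limit
```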
5.2 Expectation of a function of a random variable
While the mean can be expressed in terms of mathematical expectation, mathematical expectation is a more general concept.
Let \(X\) be a discrete random variable with a probability function \(p_X(x)\), or a continuous random variable with PDF \(f_X(x)\). Also assume \(g(X)\) is a real-valued function of \(X\). We can then define the expected value of \(g(X)\).
Definition 5.2 (Expectation for function of a random variable) The expected value of some function \(g(\cdot)\) of a random variable \(X\) is written \(\operatorname{E}[ g(X)]\).
For a discrete random variable \(X\) with PMF \(p_X(x)\), the expected value of \(g(X)\) is \[ \operatorname{E}\big[g(X)\big] = \sum_{x\in \mathcal{R}_X} g(x)\, p_X(x). \] For a continuous random variable \(X\) with PDF \(f_X(x)\), the expected value of \(g(X)\) is \[ \operatorname{E}\big[g(X)\big] = \int_{-\infty}^\infty g(x)\, f_X(x)\,dx. \]
Example 5.6 (Expectation for a function of a discrete variable) Consider the discrete random variable \(U\) with probability function shown in Example 5.1: \[ p_U(u) = \begin{cases} (u^2 + 1)/5 & \text{for $u = -1, 0, 1$};\\ 0 & \text{elsewhere}. \end{cases} \] Since \(\mathcal{R}_U = \{-1, 0, 1\}\), \(\mathcal{R}_V = \{ (-1)^2, 0^2, 1^2\} = \{0, 1\}\). The expected value of \(V = U^2\), where \(g(U) = U^2\), is \[\begin{align*} \operatorname{E}[V] = \operatorname{E}[g(U)] &= \sum_{\mathcal{R}_U} g(u)\, p_U(u) \\ &= \left( (-1)^2 \times \frac{(-1)^2 + 1}{5} \right ) + \left( 0^2 \times \frac{(0)^2 + 1}{5} \right ) + \left( 1^2 \times \frac{(1)^2 + 1}{5} \right ) \\ &= 2/5 + 0 + 2/5 = 4/5. \end{align*}\] The expected value of \(V = U^2\) is \(\operatorname{E}[V] = 4/5\).
Example 5.7 (Expectation for a function of a continuous variable) Consider the continuous random variable \(X\) with probability density function shown in Example 5.2: \[ f_X(x) = \begin{cases} x/4 & \text{for $1 < x < 3$};\\ 0 & \text{elsewhere}. \end{cases} \] The expected value of \(Y = \sqrt{X}\), where \(g(X) = \sqrt{X}\), is \[\begin{align*} \operatorname{E}[Y] = \operatorname{E}[ g(X) ] &= \int_{-\infty}^\infty g(x)\, f_X(x) \, dx\\ &= \int_1^3 \sqrt{x}\times \frac{x}{4}\, dx\\ &= \frac{9\sqrt{3} - 1}{10}\approx 1.458... \end{align*}\] The expected value of \(Y = \sqrt{X}\) is \(\operatorname{E}[Y] = (9\sqrt{3} - 1)/10\).
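As with the mean, integrate() gives a quick numerical check of \(\operatorname{E}[\sqrt{X}]\) in Example 5.7 (a minimal sketch):

```r
# E[sqrt(X)] for the PDF f_X(x) = x/4 on (1, 3)
integrate(function(x) sqrt(x) * x / 4, lower = 1, upper = 3)$value  # approximately 1.4588
(9 * sqrt(3) - 1) / 10                                              # the exact value
```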
Importantly, the expectation operator is a linear operator, as stated below.
Theorem 5.1 (Expectation properties) For any random variable \(X\) and constants \(a\) and \(b\), \[ \operatorname{E}[aX + b] = a\operatorname{E}[X] + b. \]
Proof. Assume \(X\) is a discrete random variable with probability function \(p_X(x)\). By Def. 5.2 with \(g(X) = aX + b\), \[ \operatorname{E}[aX + b] = \sum_x (ax + b)\, p_X(x) = a\sum_x x\, p_X(x) + b\sum_x p_X(x) = a\operatorname{E}[X] + b, \] using that \(\sum_x p_X(x) = 1\). (The proof in the continuous case is similar, but the probability function is a PDF and integrals replace summations.)
Example 5.8 (Expectation of a function of a random variable) Consider the random variable \(Z = 2X\) where \(X\) is defined in Example 5.2. Using Theorem 5.1 with \(a = 2\) and \(b = 0\), the value of \(\operatorname{E}[Z]\) is \[ \operatorname{E}[Z] = \operatorname{E}[2X] = 2\operatorname{E}[X] = 2 \times 13/6 = 13/3. \]
5.3 The variance and standard deviation
Apart from the mean, the most important description of a random variable is the variability: quantifying how the values of the random variable are dispersed. The most important measure of variability is the variance.
The variance of a random variable measures how spread out its values are. (More correct is to say ‘the variance of the distribution of the random variable’ rather than ‘the variance of a random variable’, but this language is commonly used.) A small variance means the observed values tend to be close together (i.e., small variation); a large variance means they tend to be quite different. The variance can be expressed as the expectation of a function of the random variable.
Definition 5.3 (Variance) The variance of a random variable \(X\) (or, of the distribution of \(X\)) is \[ \operatorname{var}[X] = \operatorname{E}\big[(X - \mu)^2\big] \] where \(\mu = \operatorname{E}[X]\). The variance of \(X\) is commonly denoted by \(\sigma^2\), or \(\sigma^2_X\) if distinguishing among variables is needed.
The variance is the expected value of the squared distance of the values of the random variable from the mean, weighted by the probability function. The unit of measurement for variance is the original unit of measurement squared. That is, if \(X\) is measured in metres, the variance of \(X\) is in \(\text{metres}^2\).
Describing the variability in the original units of measurement is often more natural; this is achieved by taking the square root of the variance.
Definition 5.4 (Standard deviation) The standard deviation of a random variable \(X\) is defined as the positive square root of the variance (denoted by \(\sigma\)); i.e., \[ \text{sd}[X] = \sigma = +\sqrt{\operatorname{var}[X]} \]
In practice, the standard deviation is more commonly used than the variance to describe variability. In theoretical work, however, the variance is easier to work with than the standard deviation (which involves a square root), and the variance, rather than the standard deviation, features in many results in theoretical statistics.
Example 5.9 (Variance for a die toss) Suppose a fair die is tossed, and \(X\) denotes the number of points showing. Then \(\Pr(X = x) = 1/6\) for \(x = 1, 2, 3, 4, 5, 6\) and \[ \mu = \operatorname{E}[X] = \sum_S x\,\Pr(X = x) = (1 + 2 + 3 + 4 + 5 + 6 )/6 = 7/2. \] The variance of \(X\) is then \[\begin{align*} \sigma^2 &= \operatorname{var}[X] = \sum_x (x - \mu)^2 \Pr(X = x)\\ &= \frac{1}{6}\left[ \left(1 - \frac{7}{2}\right)^2 + \left(2 - \frac{7}{2}\right)^2 + \dots + \left(6 - \frac{7}{2}\right)^2 \right] = \frac{70}{24}. \end{align*}\] The standard deviation is then \(\sigma = \sqrt{70/24} \approx 1.71\).
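A minimal R sketch reproducing the arithmetic in Example 5.9:

```r
# A fair die: outcomes and probabilities
x <- 1:6
p <- rep(1/6, 6)

mu <- sum(x * p)               # E[X] = 3.5
sigma2 <- sum((x - mu)^2 * p)  # var[X] = 35/12 = 70/24
c(mean = mu, variance = sigma2, sd = sqrt(sigma2))
```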
An important result is the computational formula for variance, which is usually easier to use in practice than the formula given in Definition 5.3.
Theorem 5.2 (Computational formula for variance) For any random variable \(X\), \[ \operatorname{var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2. \]
Proof. Let \(\operatorname{E}[X] = \mu\), then (using the properties of expectation in Theorem 5.1): \[\begin{align*} \operatorname{var}[X] = \operatorname{E}\left[(X - \mu)^2\right] &= \operatorname{E}[X^2 - 2X\mu + \mu^2] \\ &= \operatorname{E}[X^2] - \operatorname{E}[2X\mu] + \operatorname{E}[\mu^2]\quad\text{(since $\operatorname{E}[\cdot]$ is a linear operator)}\\ &= \operatorname{E}[X^2] - 2\mu\operatorname{E}[X] + \mu^2\\ &= \operatorname{E}[X^2] - 2\mu^2 + \mu^2 \\ &= \operatorname{E}[X^2] - \mu^2 \\ &= \operatorname{E}[X^2] - \operatorname{E}[X]^2. \end{align*}\]
This formula is often easier to use to compute \(\operatorname{var}[X]\) than using the definition directly.
Example 5.10 (Variance for a die toss) Consider Example 5.9 again. Then \[\begin{align*} \operatorname{E}[X^2] = \sum_S x^2 \Pr(X = x) &= \frac{1}{6}[1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2]\\ &= 91/6, \end{align*}\] and so \(\operatorname{var}[X] = 91/6 - (7/2)^2 = 70/24\), as before.
Example 5.11 (Variance using computational formula) Consider the continuous random variable \(X\) with PDF \[ f_X(x) = \begin{cases} 3x(2 - x)/4 & \text{for $0 < x < 2$};\\ 0 & \text{elsewhere}. \end{cases} \] The variance of \(X\) can be computed in two ways: using \(\operatorname{var}[X] = \operatorname{E}[(X - \mu)^2]\) or using the computational formula. The expected value of \(X\) is \[ \operatorname{E}[X] = \int_0^2 x\times 3x(2 - x)/4\, dx = 1. \] To use the computational formula, also find \[ \operatorname{E}[X^2] = \int_0^2 x^2\times 3x(2 - x)/4\, dx = \frac{6}{5}, \] and so \(\operatorname{var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2 = 1/5\).
Using the definition, \[\begin{align*} \operatorname{var}[X] = \operatorname{E}\big[(X - \operatorname{E}[X])^2\big] &= \operatorname{E}\big[(X - 1)^2\big]\\ &= \int_0^2 (x - 1)^2 \times 3x(2 - x)/4\,dx = 1/5. \end{align*}\] Both methods give the same answer of course, and both methods require initial computation of \(\operatorname{E}[X]\).
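Both routes to the variance in Example 5.11 can be checked numerically with integrate() (a minimal sketch):

```r
# PDF from Example 5.11: f_X(x) = 3x(2 - x)/4 on (0, 2)
f_X <- function(x) 3 * x * (2 - x) / 4

EX  <- integrate(function(x) x   * f_X(x), 0, 2)$value  # E[X]   = 1
EX2 <- integrate(function(x) x^2 * f_X(x), 0, 2)$value  # E[X^2] = 6/5

EX2 - EX^2                                              # computational formula: 1/5
integrate(function(x) (x - EX)^2 * f_X(x), 0, 2)$value  # definition: 1/5
```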
The variance represents the expected value of the squared distance of the values of the random variable from the mean. The variance is never negative, and is only zero when all the values of the random variable are identical (that is, there is no variation).
If most of the probability lies near the mean, the dispersion will be small; if the probability is spread out over a considerable range the dispersion will be large.
Example 5.12 (Variance does not exist) In Example 5.5, \(\operatorname{E}[Z]\) was not defined. For that reason, the variance is also undefined, since computing the variance relies on having a finite value for \(\operatorname{E}[Z]\).
Theorem 5.3 (Variance properties) For any random variable \(X\) and constants \(a\) and \(b\), \[ \operatorname{var}[aX + b] = a^2\operatorname{var}[X]. \]
Proof. Using the computational formula for the variance: \[\begin{align*} \operatorname{var}[aX + b] &= \operatorname{E}[ (aX + b)^2 ] - \left[\operatorname{E}[aX + b] \right] ^2\\ &= \operatorname{E}[a^2 X^2 + 2abX + b^2] - (a\mu + b)^2\\ &= a^2 \operatorname{E}[X^2] + 2ab\mu + b^2 - (a^2\mu^2 + 2ab\mu + b^2)\\ &= a^2 \operatorname{E}[X^2] - a^2\mu^2\\ &= a^2 \operatorname{var}[X]. \end{align*}\]
The special case \(a = 0\) is instructive: \(\operatorname{var}[b] = 0\) when \(b\) is constant; that is, a constant has zero variation, as expected.
Example 5.13 (Variance of a function of a random variable) Consider the random variable \(Y = 4 - 2X\) where \(\operatorname{E}[X] = 1\) and \(\operatorname{var}[X] = 3\). Then: \[\begin{align*} \operatorname{E}[Y] &= \operatorname{E}[4 - 2X] = 4 - 2\times\operatorname{E}[X] = 2;\\ \operatorname{var}[Y] &= \operatorname{var}[4 - 2X] = (-2)^2\operatorname{var}[X] = 12. \end{align*}\]
5.4 Higher moments
5.4.1 Raw and central moments
The ideas of a mean and a variance can be generalised. The mean is a special case of a ‘raw moment’, and the variance is a special case of a ‘central moment’.
Definition 5.5 (Raw moments) The \(r\)th raw moment, or \(r\)th moment about the origin, of a random variable \(X\) (where \(r\) is a positive integer) is denoted \(\mu'_r\) and defined as \(\mu'_r = \operatorname{E}[X^r]\).
For a discrete random variable \(X\), the \(r\)th moment about the origin is \[ \mu'_r = \operatorname{E}[X^r] = \sum_x x^r\, p_X(x). \] For a continuous random variable \(X\), the \(r\)th moment about the origin is \[ \mu'_r = \operatorname{E}[X^r] = \int_{-\infty}^\infty x^r\, f_X(x)\, dx. \]
Definition 5.6 (Central moments) The \(r\)th central moment, or \(r\)th moment about the mean (where \(r\) is a positive integer), is denoted \(\mu_r\) and defined as \(\mu_r = \operatorname{E}[(X - \mu)^r]\).
For a discrete random variable \(X\), the \(r\)th central moment is \[ \mu_r = \operatorname{E}[(X - \mu)^r] = \sum_x (x - \mu)^r\, p_X(x). \] For a continuous random variable \(X\), the \(r\)th central moment is \[ \mu_r = \operatorname{E}\big[(X - \mu)^r\big] = \int_{-\infty}^{\infty} (x - \mu)^r\, f_X(x)\, dx. \]
From these definitions:
- the mean \(\mu'_1 = \mu\) is the first raw moment;
- \(\mu'_2 = \operatorname{E}[X^2]\) is the second raw moment; and
- the variance \(\mu_2 = \sigma^2\) is the second central moment.
5.4.2 Skewness
Higher moments also exist that describe other features of a random variable. The third central moment is related to skewness, a measure of the asymmetry of a distribution.
Definition 5.7 (Symmetry) The distribution of \(X\) is said to be symmetric if, for all \(x\in \mathcal{R}_X\),
- \(p_X(\mu + x) = p_X(\mu - x)\) for a discrete random variable \(X\) with PMF \(p_X(x)\), or
- \(f_X(\mu + x) = f_X(\mu - x)\) for a continuous random variable \(X\) with PDF \(f_X(x)\),
where \(\mu = \operatorname{E}[X]\) is the mean of \(X\).
For a symmetric distribution, the odd central moments are zero (Exercise 5.19). This suggests that the odd central moments (such as the third central moment) can be used to measure the asymmetry of a distribution.
However, rather than using the third central moment explicitly (by applying Def. 5.6), taking the appropriate expected value of a normalised version of the random variable (i.e., one with mean zero and variance one) is preferred. That is, the definition of skewness takes the appropriate expected value of \((X - \mu)/\sigma\) rather than of \(X\) directly. This means that the value of the skewness for the random variable \(X\) is unaffected by a linear transformation of the type \(Y = aX + b\) (for constants \(a > 0\) and \(b\)).
Definition 5.8 (Skewness) The skewness of the distribution of a random variable \(X\) with mean \(\operatorname{E}[X] = \mu\) and variance \(\operatorname{var}[X] = \sigma^2\) is defined as \[\begin{align} \text{skewness} = \gamma_1 &= \operatorname{E}\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]\notag\\ &= \frac{\mu_3}{(\sigma^2)^{3/2}} = \frac{\mu_3}{\mu_2^{3/2}}. \tag{5.1} \end{align}\]
If \(\gamma_1 > 0\) we say the distribution is positively (or right) skewed, and it is ‘stretched’ in the positive direction. Similarly, if \(\gamma_1 < 0\) we say the distribution is negatively (or left) skewed, and it is ‘stretched’ in the negative direction. A symmetric distribution has \(\gamma_1 = 0\) (although zero skewness does not, by itself, guarantee symmetry). For a symmetric distribution whose mean exists, the mean is also a median of the distribution.
Example 5.14 (Skewness) Figure 5.2 shows examples of right-skewed (left panels), symmetric (centre panels) and left-skewed (right panels) distributions, for both a continuous random variable (top panels) and a discrete random variable (bottom panels).
(The top distributions are all beta distributions; the bottom distributions are all binomial distributions.)

FIGURE 5.2: Examples of right-skewed (left panels), symmetric (centre panels) and left-skewed (right panels) distributions. Top: continuous random variable. Bottom: discrete random variable.
Example 5.15 (Skewness) Consider the random variable \(X\) in Example 5.11, where \(f_X(x) = 3x(2 - x)/4\) for \(0 < x < 2\). From that example, \(\operatorname{E}[X] = \mu'_1 = 1\) and \(\operatorname{E}[X^2] = \mu'_2 = 6/5\). Then, \[ \mu_3 = \int_0^2 (x - 1)^3\times 3x(2 - x)/4 \,dx = 0, \] so that the skewness in Eq. (5.1) will be zero. This is expected, since the distribution is symmetric (Fig. 5.3).

FIGURE 5.3: The probability density function for \(X\).
Example 5.16 (Skewness) Consider the random variable \(Y\) with PMF \[ p_Y(y) = \begin{cases} 0.2 & \text{for $y = 5$};\\ 0.3 & \text{for $y = 6$};\\ 0.5 & \text{for $y = 7$};\\ 0 & \text{elsewhere}. \end{cases} \] Then \[ \mu'_1 = \operatorname{E}[Y] = (5\times 0.2) + (6\times 0.3) + (7\times 0.5) = 6.3. \] Likewise, \[\begin{align*} \mu_2 &= \operatorname{E}\big[(Y - 6.3)^2 \big] = (5 - 6.3)^2\times 0.2 + (6 - 6.3)^2\times 0.3 + (7 - 6.3)^2\times 0.5 = 0.61;\quad{\text{and}}\\ \mu_3 &= \operatorname{E}\big[(Y - 6.3)^3 \big] = (5 - 6.3)^3\times 0.2 + (6 - 6.3)^3\times 0.3 + (7 - 6.3)^3\times 0.5 = -0.276. \end{align*}\] Hence, the skewness is \[ \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{-0.276}{0.61^{3/2}} = -0.579\dots, \] so the distribution has slight negative skewness.
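The moment calculations in Example 5.16 are easily scripted; a minimal sketch in R:

```r
# PMF of Y from Example 5.16
y <- c(5, 6, 7)
p <- c(0.2, 0.3, 0.5)

mu  <- sum(y * p)           # E[Y] = 6.3
mu2 <- sum((y - mu)^2 * p)  # second central moment: 0.61
mu3 <- sum((y - mu)^3 * p)  # third central moment: -0.276

mu3 / mu2^(3/2)             # skewness: approximately -0.579
```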
5.4.3 Kurtosis
Another description of a distribution is kurtosis, which measures the heaviness of the tails in a distribution; that is, how much of the probability of the random variable \(X\) is concentrated in the extreme values of \(X\). This is related to the fourth central moment. Again, finding the appropriate expected value of a normalised version of the random variable (i.e., with mean zero and variance one) is preferred. That is, the definition of kurtosis finds the appropriate expected value of \((X - \mu)/\sigma\) rather than of \(X\) directly.
Definition 5.9 (Kurtosis) The kurtosis of a random variable \(X\) with mean \(\mu = \operatorname{E}[X]\) and variance \(\sigma^2 = \operatorname{var}[X]\) is defined as \[ \text{kurtosis} = \operatorname{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = \frac{\mu_4}{\mu^2_2}. \] The excess kurtosis of the distribution of a random variable is defined as \[\begin{equation*} \gamma_2 = \frac{\mu_4}{\mu^2_2} - 3. \end{equation*}\] The excess kurtosis measures the kurtosis relative to that of the bell-shaped normal distribution, which has a kurtosis of \(3\) and hence an excess kurtosis of zero.
Excess kurtosis is so commonly used that it is often just called ‘kurtosis’.
One way to understand kurtosis (from Moors (1986)) is to first define \(Z = (X - \mu)/\sigma\); then the kurtosis is, from Def. 5.9, \(\operatorname{E}[Z^4]\), where \(\operatorname{E}[Z] = 0\) and \(\operatorname{var}[Z] = 1\). Also observe that since \(\operatorname{var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2\) (the computational formula for the variance), we can write \[\begin{equation} \operatorname{E}[X^2] = \operatorname{var}[X] + \operatorname{E}[X]^2. \tag{5.2} \end{equation}\] Then, the kurtosis is \[\begin{align*} \operatorname{E}[Z^4] &= \operatorname{var}[Z^2] + \operatorname{E}[Z^2]^2\quad\text{(using Eq. (5.2))}\\ &= \operatorname{var}[Z^2] + \left\{ \operatorname{var}[Z] + \operatorname{E}[Z]^2\right\}^2\quad\text{(using Eq. (5.2) again)}\\ &= \operatorname{var}[Z^2] + (1 + 0)^2 \\ &= \operatorname{var}[Z^2] + 1. \end{align*}\] Thus, the kurtosis is related to the variance of \(Z^2\) (not \(Z\)) about its mean. That is, kurtosis focuses on the behaviour of the probability function in the extremes of the random variable.
Large values of kurtosis correspond to a greater proportion of the distribution lying in the tails. Then (see Fig. 5.4):
- distributions with negative excess kurtosis are called platykurtic. These distributions have fewer, or less extreme, observations in the tails compared to the normal distribution (‘thinner tails’). Examples include the Bernoulli distribution (Sect. 7.3).
- distributions with positive excess kurtosis are called leptokurtic. These distributions have more, or more extreme, observations in the tails compared to the normal distribution (‘fatter tails’). Examples include the exponential distribution (Sect. 8.4) and Poisson distributions (Sect. 7.7).
- distributions with zero excess kurtosis are called mesokurtic. The normal distribution (Sect. 8.3) is the obvious example.

FIGURE 5.4: Kurtosis for three distributions plotted from \(x = -3\) to \(x = +3\); all plots have mean of \(0\), variance of \(1\) and are symmetric. The grey line shows the middle distribution as a reference, with \(\gamma_2 = 0\) (zero excess kurtosis).
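One way to get a feel for excess kurtosis is to compute a rough sample version, the average of \(z^4\) minus \(3\) for standardised data \(z\), from simulated values. A minimal sketch (the helper function below is for illustration only, and uses the usual sample standard deviation):

```r
# Rough sample excess kurtosis: standardise, average the fourth power, subtract 3
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

set.seed(1)
excess_kurtosis(rnorm(100000))  # normal: near 0 (mesokurtic)
excess_kurtosis(rexp(100000))   # exponential: near 6 (leptokurtic)
```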
Example 5.17 (Uses of skewness and kurtosis) Monypenny and Middleton (1998b) and Monypenny and Middleton (1998a) use the skewness and kurtosis to analyse wind gusts at Sydney airport.
Example 5.18 (Uses of skewness and kurtosis) Galagedera, Henry, and Silvapulle (2002) used higher moments in a capital asset pricing model for Australian stock returns.
Example 5.19 (Skewness and kurtosis) Consider the discrete random variable \(U\) from Example 5.1. The raw moments are \[\begin{align*} \mu'_r = \operatorname{E}[U^r] &= \sum_{u = -1, 0, 1} u^r \frac{u^2 + 1}{5} \\ &= (-1)^r \frac{ (-1)^2 + 1}{5} + (0)^r \frac{ (0)^2 + 1}{5} + (1)^r \frac{ (1)^2 + 1}{5} \\ &= \frac{2(-1)^r}{5} + 0 + \frac{2}{5} \\ &= \frac{2}{5}[ (-1)^r + 1] \end{align*}\] for the \(r\)th raw moment. Then, \[\begin{align*} \operatorname{E}[U] &= \mu'_1 = \frac{2}{5}[ (-1)^1 + 1 ] = 0;\\ \operatorname{E}[U^2] &= \mu'_2 = \frac{2}{5}[ (-1)^2 + 1 ] = 4/5;\\ \operatorname{E}[U^3] &= \mu'_3 = \frac{2}{5}[ (-1)^3 + 1 ] = 0;\\ \operatorname{E}[U^4] &= \mu'_4 = \frac{2}{5}[ (-1)^4 + 1 ] = 4/5. \end{align*}\] Since \(\operatorname{E}[U] = 0\), the \(r\)th central and raw moments are the same: \(\mu'_r = \mu_r\). Notice that once the initial computations to find \(\mu'_r\) are complete, the evaluation of any raw moment is simple.
The skewness is \[ \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{0}{(4/5)^{3/2}} = 0, \] so the distribution is symmetric. The excess kurtosis is \[ \gamma_2 = \frac{\mu_4}{\mu_2^2} -3 = \frac{4/5}{(4/5)^2} -3 = -7/4, \] so the distribution is platykurtic.
5.5 Moment-generating functions
5.5.1 Introduction
By themselves, the mean, variance, skewness and kurtosis do not completely describe a distribution; many different distributions can be found having a given mean, variance, skewness and kurtosis. However, in general, all the moments of a distribution together define the distribution. This leads to the idea of a moment-generating function.
Suppose I asked you to draw the probability density function of a random variable \(X\), with \(\operatorname{E}[X] = 1\). Any of the six distributions in Fig. 5.5 meet this (first moment) criterion, so this information is not sufficient to uniquely define a distribution.
So, suppose a second criterion is added: in addition, we require \(\operatorname{var}[X] = 1\). Any of the first five distributions in Fig. 5.5 meet these two criteria (based on the first two moments), so again this information is not sufficient to uniquely define a distribution.
Suppose a third criterion is added: the distribution must be symmetric. Any of the top four distributions in Fig. 5.5 meet these three criteria (based on the first three moments); again, this information is not sufficient to uniquely define a distribution.
Suppose a fourth criterion is added: the distribution must have zero excess kurtosis. Either of the top two distributions in Fig. 5.5 meets these four criteria (based on the first four moments); again, this information is not sufficient to uniquely define a distribution.
In general, all the moments of a distribution are needed to uniquely define a distribution. However, computing all (or even many) moments of a distribution is usually very tedious. For this reason, the moment-generating function (or MGF) is now introduced: a function that encapsulates all the moments of a distribution.

FIGURE 5.5: Six distributions, all with mean 1 and variance 1. The top four are also symmetric (i.e., \(\gamma_1 = 0\)); the top two also have zero excess kurtosis (i.e., \(\gamma_2=0\)).
5.5.2 Definition
So far, the distribution of a random variable has been described using a probability function or a distribution function. Sometimes, however, working with a different representation is useful (for example, see Sect. 6.4).
In this section, the moment-generating function is used to represent the distribution of the probabilities of a random variable. As the name suggests, this function can be used to generate any moment of a distribution. Other uses of the moment-generating function are seen later (see Sect. 6.4).
Definition 5.10 (Moment-generating function (MGF)) The moment-generating function (or MGF) of the random variable \(X\) defined over a range \(\mathcal{R}_X\) is denoted \(M_X(t)\), and defined as \[ M_X(t) = \operatorname{E}\big[\exp(tX)\big], \] provided the expectation exists for values of \(t\) in some interval that includes \(t = 0\).
When \(X\) is a discrete random variable, \[ M_X(t) = \operatorname{E}\big[\exp(tX)\big] = \sum_{x\in \mathcal{R}_X} \exp(tx)\, p_X(x). \] When \(X\) is a continuous random variable, \[ M_X(t) = \operatorname{E}\big[\exp(tX)\big] = \int_{-\infty}^\infty \exp(tx)\, f_X(x)\, dx. \]
The MGF may not always exist (that is, converge to a finite value) for all values of \(t\), so the MGF may not be defined for all values of \(t\). Note that the MGF always exists for \(t = 0\); in fact \(M_X(0) = 1\).
Provided the MGF is defined for some values of \(t\) other than zero, it uniquely defines a probability distribution, and we can use it to easily generate the moments of the distribution, as described in Theorem 5.4.
Moment-generating functions are closely related to Laplace transforms.
Example 5.20 (Moment-generating function) Consider the random variable \(Y\) with PDF \[ f_Y(y) = \begin{cases} \exp(-y) & \text{for $y > 0$};\\ 0 & \text{elsewhere.} \end{cases} \] The MGF is \[\begin{align*} M_Y(t) = \operatorname{E}[\exp(tY)] &= \int_0^\infty \exp(ty)\,\exp(-y)\, dy \\ &= \int_0^\infty \exp\{ y(t-1) \}\, dy \\ &= (1 - t)^{-1} \end{align*}\] provided \(t - 1 < 0\); that is, for \(t < 1\) (which includes \(t = 0\)). If \(t \ge 1\), the integral does not converge. For example, if \(t = 2\), \[ \left. \frac{1}{2 - 1} \exp(y)\right|_{y = 0}^{y = \infty} = \lim_{y\to\infty} \exp(y) - \exp(0), \] which does not converge.
Example 5.21 (MGF for die rolls) Consider the PMF of \(X\), the outcome of tossing a fair die (Example 5.9). The MGF of \(X\) is \[\begin{align*} M_X(t) &= \operatorname{E}[\exp(tX)] = \sum_{x = 1}^6 \exp(tx)\, p_X(x)\\ &= \frac{1}{6}\left(e^t + e^{2t} + e^{3t} + e^{4t} + e^{5t} + e^{6t}\right), \end{align*}\] which exists for all values of \(t\).
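Since the MGF is just an expectation, it can also be approximated by simulation; a minimal sketch checking the die MGF of Example 5.21 at \(t = 0.5\):

```r
# MGF of a fair die at t = 0.5: exactly, and estimated by simulation
t <- 0.5
M_exact <- mean(exp(t * (1:6)))            # (e^t + e^(2t) + ... + e^(6t))/6

set.seed(1)
rolls <- sample(1:6, size = 100000, replace = TRUE)
M_sim <- mean(exp(t * rolls))              # estimate of E[exp(tX)]

c(exact = M_exact, simulated = M_sim)
```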
Example 5.22 (MGF does not exist) Consider the Cauchy distribution with the PDF \[ f_X(x) = \frac{1}{\pi(1 + x^2)}, \] defined over \(x\in\mathbb{R}\). The moment generating function is \[ \operatorname{E}[\exp(tX)] = \int_{-\infty}^{\infty} e^{tx}\frac{1}{\pi(1 + x^2)}\,dx. \] Consider the integrand \(\exp(tx)/\big(\pi(1 + x^2)\big)\). This integrand does not converge unless \(t = 0\).
For example, consider \(t > 0\): as \(x\to\infty\), we see \(\exp(tx)\to\infty\), while \(1/(1 + x^2)\to 0\) quite slowly; the integrand diverges (see Fig. 5.6 (left panel) for an example when \(t = 1\)). Now consider \(t < 0\): as \(x\to-\infty\), we see \(\exp(tx)\to\infty\), while \(1/(1 + x^2)\to 0\) quite slowly; the integrand again diverges (see Fig. 5.6 (right panel) for an example when \(t = -1\)).
The integral only converges for \(t = 0\). The definition for the MGF states that the MGF exists ‘provided the expectation exists for values of \(t\) in some interval that includes \(t = 0\)’. This is not the case: the integral exists only for \(t = 0\). The MGF does not exist for the Cauchy distribution.

FIGURE 5.6: The MGF cannot be computed, as the integrand diverges for \(t > 0\) (left panel) and for \(t < 0\) (right panel).
5.5.3 Using the MGF to generate moments
Replacing \(\exp(xt)\) by its series expansion (App. B) in the definition of the MGF for a discrete random variable \(X\) gives \[\begin{align*} M_X(t) & = {\sum_x} \left(1 + xt + \frac{x^2t^2}{2!} + \dots\right) \Pr(X = x)\\ & = 1 + \mu'_1t + \mu'_2 \frac{t^2}{2!} +\mu'_3 \frac{t^3}{3!} + \dots \end{align*}\] Then, the \(r\)th moment of a distribution about the origin is seen to be the coefficient of \(t^r/r!\) in the series expansion of \(M_X(t)\): \[\begin{align*} \frac{d M_X(t)}{dt} & = \sum_x x\,e^{xt}\Pr(X = x)\\ \frac{d^2 M_X(t)}{dt^2} & = \sum_x x^2\,e^{xt} \Pr(X = x), \end{align*}\] and, in general, for each positive integer \(r\): \[ \frac{d^r M_X(t)}{dt^r} = \sum_x x^re^{xt}\Pr(X = x). \] On setting \(t = 0\), \[\begin{align*} \left.\frac{d M_X(t)}{dt}\right|_{t = 0} &= \operatorname{E}[X]\\ \left.\frac{d^2M_X(t)}{dt^2}\right|_{t = 0} &= \operatorname{E}[X^2]. \end{align*}\] (The notation to the left means to evaluate the expression at \(t = 0\).) In general, for each positive integer \(r\), \[\begin{equation} \left.\frac{d^r M_X(t)}{dt^r}\right|_{t = 0} = \operatorname{E}[X^r]. \end{equation}\] (Sometimes, \(d^r M_X(t)/dt^r\) evaluated at \(t = 0\) is written as \(M^{(r)}(0)\) for brevity.) This result is summarised in the following theorem.
Theorem 5.4 (Moments) The \(r\)th moment \(\mu'_r\) of the distribution of the random variable \(X\) about the origin is given by either
- the coefficient of \(t^r/r!, r = 1, 2, 3,\dots\) in the power series expansion of \(M_X(t)\); or
- \(\displaystyle \mu'_r = \left.\frac{d^rM(t)}{dt^r}\right|_{t = 0}\) where \(M_X(t)\) is the MGF of \(X\).
Example 5.23 (Mean and variance from a MGF) Continuing Example 5.20, the mean and variance of \(Y\) can be found from the MGF. To find the mean, first find \[ \frac{d}{dt}M_Y(t) = (1 - t)^{-2}. \] Setting \(t = 0\) gives the mean as \(\operatorname{E}[Y] = 1\). Likewise, \[ \frac{d^2}{dt^2}M_Y(t) = 2(1 - t)^{-3}. \] Setting \(t = 0\) gives \(\operatorname{E}[Y^2] = 2\). The variance is therefore \(\operatorname{var}[Y] = 2 - 1^2 = 1\).
Once the moment-generating function has been computed, raw moments can be computed using \[ \operatorname{E}[Y^r] = \mu'_r = \left.\frac{d^r}{dt^r} M_Y(t)\right|_{t = 0}. \]
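R's symbolic derivative function D() can carry out the differentiation in Example 5.23; a minimal sketch, assuming the MGF \(M_Y(t) = (1 - t)^{-1}\) from Example 5.20:

```r
# MGF of Y from Example 5.20, as an R expression in t
M <- expression((1 - t)^(-1))

d1 <- D(M, "t")   # first derivative of the MGF
d2 <- D(d1, "t")  # second derivative

EY  <- eval(d1, list(t = 0))  # E[Y]   = 1
EY2 <- eval(d2, list(t = 0))  # E[Y^2] = 2
c(mean = EY, variance = EY2 - EY^2)
```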
5.5.4 Some useful results
The moment-generating function can be used to derive the distribution of a function of a random variable (see Sect. 6.4). The following theorems are valuable for this task.
Theorem 5.5 (MGF of linear combinations) If the random variable \(X\) has MGF \(M_X(t)\) and \(Y = aX + b\) where \(a\) and \(b\) are constants, then the MGF of \(Y\) is \[ M_Y(t) = \operatorname{E}\big[\exp\{t(aX + b)\}\big] = \exp(bt) M_X(at). \]
Theorem 5.6 (MGF of independent rvs) If \(X_1\), \(X_2\), \(\dots\), \(X_n\) are \(n\) independent random variables, where \(X_i\) has MGF \(M_{X_i}(t)\), then the MGF of \(Y = X_1 + X_2 + \cdots + X_n\) is \[ M_Y(t) = \prod_{i = 1}^n M_{X_i}(t). \]
Proof. The proofs are left as an exercise.
Note that in the special case of Theorem 5.6 where all the random variables are independently and identically distributed with common MGF \(M_X(t)\), we have \[ M_Y(t) = [M_{X}(t)]^n. \]
Example 5.24 (MGF of linear combinations) Consider the random variable \(X\) with pf \[ p_X(x) = 2(1/3)^x \qquad \text{for $x = 1, 2, 3, \dots$} \] and zero elsewhere. The MGF of \(X\) is \[\begin{align*} M_X(t) &= \sum_{x: p(x) > 0} \exp(tx)\, p_X(x) \\ &= \sum_{x = 1}^\infty \exp(tx)\, 2(1/3)^x \\ &= 2\sum_{x = 1}^\infty (\exp(t)/3)^x \\ &= 2\left\{ \frac{\exp(t)}{3} + \left(\frac{\exp(t)}{3}\right)^2 + \left(\frac{\exp(t)}{3}\right)^3 + \dots\right\} \\ &= \frac{2\exp(t)}{3 - \exp(t)} \end{align*}\] where \(\sum_{y = 1}^\infty a^y = a/(1 - a)\) for \(|a| < 1\) has been used (App. B); here \(a = \exp(t)/3\), so the sum converges for \(\exp(t)/3 < 1\); that is, for \(t < \log 3\), an interval that includes \(t = 0\).
Next consider finding the MGF of \(Y = (X - 2)/3\). From Theorem 5.5 with \(a = 1/3\) and \(b = -2/3\), \[ M_Y(t) = \exp(-2t/3) M_X(t/3) = \frac{2\exp\{(-t)/3\}}{3 - \exp(t/3)}. \] In practice, rather than identify \(a\) and \(b\) and remember Theorem 5.5, problems like this are best solved directly from the definition of the MGF: \[\begin{align*} M_Y(t) = \operatorname{E}[\exp(tY)] &= \operatorname{E}[\exp\{t(X - 2)/3\}]\\ &= \operatorname{E}[\exp\{tX/3 - 2t/3\}]\\ &= \exp(-2t/3) M_X(t/3) \\ &= \frac{2\exp\{(-t)/3\}}{3 - \exp(t/3)}. \end{align*}\]
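The result of Example 5.24 can be checked by simulation. Since \(p_X(x) = 2(1/3)^x = (2/3)(1/3)^{x - 1}\), the variable \(X\) can be simulated as one plus a geometric count of failures with success probability \(2/3\); a minimal sketch:

```r
# Check M_Y(t) for Y = (X - 2)/3 at t = 1, by simulation
set.seed(1)
x <- rgeom(100000, prob = 2/3) + 1  # Pr(X = x) = (2/3)(1/3)^(x - 1) = 2(1/3)^x
y <- (x - 2) / 3

t <- 1
c(simulated = mean(exp(t * y)),
  formula   = 2 * exp(-t/3) / (3 - exp(t/3)))
```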
5.5.5 Determining the distribution from the MGF
The MGF (if it exists) completely determines the distribution of a random variable; hence, given an MGF, deducing the probability function should be possible. For some distributions, the PDF cannot be written in closed form (so the PDF can only be evaluated numerically; for example, see Sect. 9.4), but the MGF is relatively simple to write down.
For a discrete random variable \(X\) with PMF \(p_X(x)\), the MGF is \[\begin{equation} M_X(t) = \operatorname{E}[\exp(tX)] = \sum_{x\in\mathcal{R}_X} e^{tx}\, p_X(x). \tag{5.3} \end{equation}\] This can be expressed as \[\begin{align*} M_X(t) &= \exp(t x_1) p_X(x_1) + \exp(t x_2)p_X(x_2) + \dots\\ &= \exp(t x_1) \Pr(X = x_1) + \exp(t x_2)\Pr(X = x_2) + \dots\\ \end{align*}\] and so the probability function of \(X\) can be deduced from the MGF.
Example 5.25 (Distribution from the MGF) Suppose a discrete random variable \(D\) has the MGF \[ M_D(t) = \frac{1}{3} \exp(2t) + \frac{1}{6}\exp(3t) + \frac{1}{12}\exp(6t) + \frac{5}{12}\exp(7t). \] Then, by the definition of the MGF in the discrete case given above, the coefficient of \(t\) inside each exponential indicates a value of \(D\), and the coefficient multiplying that exponential indicates the probability of that value of \(D\): \[\begin{align*} M_D(t) &= \overbrace{\frac{1}{3} \exp(2t)}^{D = 2} + \overbrace{\frac{1}{6}\exp(3t)}^{D = 3} + \overbrace{\frac{1}{12}\exp(6t)}^{D = 6} + \overbrace{\frac{5}{12}\exp(7t)}^{D = 7}\\ &= \Pr(D = 2)\exp(2t) + \Pr(D = 3)\exp(3t) + \\ & \quad \Pr(D = 6)\exp(6t) + \Pr(D = 7)\exp(7t). \end{align*}\] So the PMF is \[ p_D(d) = \begin{cases} 1/3 & \text{for $d=2$};\\ 1/6 & \text{for $d=3$};\\ 1/12 & \text{for $d=6$};\\ 5/12 & \text{for $d=7$};\\ 0 & \text{otherwise.} \end{cases} \] (Of course, it is easy to check by computing the MGF for \(D\) from the pf found above; you should get the original MGF.)
Sometimes, using the results in App. B can be helpful.
Example 5.26 (Distribution from the MGF) Consider the MGF \[ M_X(t) = \frac{\exp(t)}{3 - 2\exp(t)}. \] To find the corresponding probability function, one approach is to write the MGF as \[ M_X(t) = \frac{\exp(t)/3}{1 - 2\exp(t)/3}. \] This is the sum of a geometric series (Eq. (B.5)): \[ a + ar + ar^2 + \ldots + ar^{n - 1} \rightarrow \frac{a}{1 - r} \text{ as $n \rightarrow \infty$, for $|r| < 1$}, \] where \(a = \exp(t)/3\) and \(r = 2\exp(t)/3\). Hence the MGF can be expressed as \[ \frac{1}{3}\exp(t) + \frac{1}{3}\left(\frac{2}{3}\right) \exp(2t) + \frac{1}{3}\left(\frac{2}{3}\right)^2 \exp(3t) + \dots \] so that the probability function can be deduced as \[\begin{align*} \Pr(X = 1) &= \frac{1}{3};\\ \Pr(X = 2) &= \frac{1}{3}\left(\frac{2}{3}\right);\\ \Pr(X = 3) &= \frac{1}{3}\left(\frac{2}{3}\right)^2, \end{align*}\] or, in general, \[ p_X(x) = \frac{1}{3}\left( \frac{2}{3}\right)^{x - 1}\quad\text{for $x = 1, 2, 3, \dots$}. \] (Later, this will be identified as a geometric distribution.)
For a continuous random variable \(X\), the approach is more involved. Suppose the continuous random variable \(X\) has the MGF \(M_X(t)\). Then the probability density function is (see Abramowitz and Stegun (1964), 26.1.10) \[\begin{equation} f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} M_X(it) \exp(-itx)\, dt, \tag{5.4} \end{equation}\] where \(i = \sqrt{-1}\).
Example 5.27 (Distribution from the MGF) Consider the MGF for a continuous random variable \(X\) defined on \(\mathbb{R}\), such that \(M_X(t) = \exp(t^2/2)\) for \(t\in\mathbb{R}\). Then, \(M_X(it) = \exp\left( (it)^2/2 \right) = \exp(-t^2/2)\). Using Eq. (5.4), the PDF is: \[\begin{align*} f_X(x) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-t^2/2) \exp(-itx)\, dt,\\ &= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-t^2/2)\left[ \cos(-tx) + i\sin(-tx)\right]\, dt, \end{align*}\] since \(x\in\mathbb{R}\) and \(t\in\mathbb{R}\). Extracting just the real components: \[\begin{align*} f_X(x) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-t^2/2) \cos(-tx) \, dt\\ &= \frac{1}{2\pi} \left( \sqrt{2\pi} \exp( -x^2/2 ) \right) = \frac{1}{ \sqrt{2\pi} } \exp( -x^2/2 ), \end{align*}\] which will later be identified as a normal distribution with a mean of zero, and standard deviation of one.
In practice, using Eq. (5.4) can become tedious or intractable for producing a closed form expression for the PDF. However, Eq. (5.4) has been used to compute numerical values of the probability density function. For example, Dunn and Smyth (2008) used Eq. (5.4) to evaluate the Tweedie distributions (Dunn and Smyth 2001), for which (in general) the PDF has no closed form, but does have a simple MGF.
5.6 Tchebysheff’s inequality
Tchebysheff’s inequality applies to any probability distribution, and is sometimes useful in theoretical work or to provide bounds on probabilities.
Theorem 5.7 (Tchebysheff's theorem) Let \(X\) be a random variable with finite mean \(\mu\) and variance \(\sigma^2\). Then for any positive \(k\), \[\begin{equation} \Pr\big(|X - \mu| \geq k\sigma \big)\leq \frac{1}{k^2} \tag{5.5} \end{equation}\] or, equivalently \[\begin{equation} \Pr\big(|X - \mu| < k\sigma \big)\geq 1 - \frac{1}{k^2}. \end{equation}\]
Proof. The proof for the continuous case only is given. Let \(X\) be continuous with PDF \(f(x)\). For some \(c > 0\), then \[\begin{align*} \sigma^2 & = \int^\infty_{-\infty} (x - \mu )^2f(x)\,dx\\ & = \int^{\mu -\sqrt{c}}_{-\infty} (x - \mu )^2f(x)\, dx + \int^{\mu + \sqrt{c}}_{\mu-\sqrt{c}}(x - \mu )^2f(x)\,dx + \int^\infty_{\mu + \sqrt{c}}(x - \mu)^2f(x)\,dx\\ & \geq \int^{\mu -\sqrt{c}}_{-\infty} (x - \mu )^2f(x)\,dx + \int^\infty_{\mu + \sqrt{c}}(x - \mu )^2f(x)\,dx, \end{align*}\] since the second integral is non-negative. Now \((x - \mu )^2 \geq c\) if \(x \leq \mu -\sqrt{c}\) or \(x\geq \mu + \sqrt{c}\). So in both the remaining integrals above, replace \((x - \mu )^2\) by \(c\) without altering the direction of the inequality: \[\begin{align*} \sigma^2 &\geq c \int^{\mu -\sqrt{c}}_{-\infty} f(x)\,dx + c\int^\infty_{\mu + \sqrt{c}}f(x)\,dx\\ &= c\,\Pr(X \leq \mu - \sqrt{c}\,) + c\,\Pr(X \geq \mu + \sqrt{c}\,)\\ &= c\,\Pr(|X - \mu| \geq \sqrt{c}\,). \end{align*}\] Putting \(\sqrt{c} = k\sigma\), Eq. (5.5) is obtained.
Given the probability function or PDF of a random variable \(X\), \(\operatorname{E}[X]\) and \(\operatorname{var}[X]\) can be found, but the converse is not true. That is, from a knowledge of \(\operatorname{E}[X]\) and \(\operatorname{var}[X]\) alone we cannot reconstruct the probability distribution of \(X\), and hence cannot compute probabilities such as \(\Pr(|X - \mu| \geq k\sigma)\). Nonetheless, using Tchebysheff’s inequality we can find a useful bound on the probability outside or inside of \(\mu \pm k\sigma\).
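To see how conservative the bound can be, the following minimal sketch compares \(1/k^2\) with the exact two-sided tail probability for one particular distribution, the standard normal (the bound itself holds for any distribution with finite variance):

```r
# Tchebysheff bound versus the exact tail probability for a standard normal
k <- c(1.5, 2, 3)
bound <- 1 / k^2               # Pr(|X - mu| >= k * sigma) <= 1/k^2
exact_normal <- 2 * pnorm(-k)  # exact Pr(|Z| >= k) for Z ~ N(0, 1)

round(cbind(k, bound, exact_normal), 4)
```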
5.7 Mathematical expectation for bivariate distributions
5.7.1 Expected values of a bivariate function
In a manner analogous to the univariate case, the expectation of functions of two random variables can be given.
Definition 5.11 (Expectation for bivariate distributions) Let \((X, Y)\) be a \(2\)-dimensional random variable and let \(u(X, Y)\) be a function of \(X\) and \(Y\).
For a discrete bivariate distribution with probability mass function \(p_{X, Y}(x, y)\) defined over \((x, y) \in R\), the expectation or expected value \(\operatorname{E}[u(X, Y)]\) is \[ \operatorname{E}[u(X, Y)] = \mathop{\sum\sum}_{(x, y)\in R} u(x, y)\, p_{X, Y}(x, y). \] For a continuous bivariate distribution with probability density function \(f_{X, Y}(x, y)\) defined over \((x, y) \in R\), the expectation or expected value \(\operatorname{E}[u(X, Y)]\) is \[ \operatorname{E}[u(X, Y)] = \mathop{\int\!\!\int}_{(x, y)\in R} u(x, y)\, f_{X, Y}(x, y)\, dx\, dy. \]
This definition can be extended to the expectation of a function of any number of random variables.
Example 5.28 (Expectation of function of two rvs (discrete)) Consider the joint distribution of \(X\) and \(Y\) in Example 4.4. Determine \(\operatorname{E}[X + Y]\); i.e., the mean of the number of heads plus the number showing on the die.
From Def. 5.11, write \(u(X, Y) = X + Y\) and so \[\begin{align*} \operatorname{E}[X + Y] &= \sum_{x = 0}^2 \sum_{y = 1}^6 (x + y)\, p_{X, Y}(x, y)\\ &= 1\times(1/24) + 2\times(1/24) + \dots + 6\times(1/24)\\ & \qquad + 2\times(1/12) + 3\times(1/12) + \dots + 7\times(1/12)\\ & \qquad + 3\times(1/24) + 4\times(1/24) + \dots + 8\times(1/24)\\ &= 21/24 + 27/12 + 33/24\\ &= 4.5. \end{align*}\] The answer is just \(\operatorname{E}[X] + \operatorname{E}[Y] = 1 + 3.5 = 4.5\). This is no coincidence, as we see from Theorem 5.8.
Example 5.29 (Expectation of function of two rvs (continuous)) Consider Example 4.6. To determine \(\operatorname{E}[XY]\), write \(u(X, Y) = XY\) and proceed: \[ \operatorname{E}[XY] = \frac{6}{5} \int_0^1\int_0^1 xy(x + y^2)\,dx\,dy = \frac7{20}. \] Unlike the previous example, an alternative simple calculation based on \(\operatorname{E}[X]\) and \(\operatorname{E}[Y]\) is not possible, since \(\operatorname{E}[XY]\neq\operatorname{E}[X] \operatorname{E}[Y]\) in general.
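Numerical double integration gives a quick check of \(\operatorname{E}[XY] = 7/20\) in Example 5.29; a minimal sketch using nested calls to integrate():

```r
# E[XY] for f(x, y) = (6/5)(x + y^2) on the unit square
inner <- function(y) {
  # for each y, integrate x * y * f(x, y) over x in (0, 1)
  sapply(y, function(yy)
    integrate(function(x) x * yy * (6/5) * (x + yy^2), 0, 1)$value)
}

integrate(inner, 0, 1)$value  # approximately 0.35 = 7/20
```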
Theorem 5.8 (Expectations of two rvs) If \(X\) and \(Y\) are any random variables, and \(a\) and \(b\) are any constants, then \[ \operatorname{E}[aX + bY] = a\operatorname{E}[X] + b\operatorname{E}[Y]. \]
This theorem is no surprise after seeing Theorem 5.1, but is powerful and useful. The proof given here is for the discrete case; the continuous case is analogous.
Proof. \[\begin{align*} \operatorname{E}[aX + bY] &= \mathop{\sum\sum}_{(x, y) \in R}(ax + by) \, p_{X, Y}(x, y), \text{ by definition}\\ &= \sum_x \sum_y ax\, p_{X, Y}(x, y) + \sum_x \sum_y by\, p_{X, Y}(x, y)\\ &= a\sum_x x\sum_y p_{X, Y}(x, y) + b\sum_y y\sum_x p_{X, Y}(x, y)\\ &= a\sum_x x \Pr(X = x) + b\sum_y y \Pr(Y = y)\\ &= a\operatorname{E}[X] + b\operatorname{E}[Y]. \end{align*}\]
This result is true whether or not \(X\) and \(Y\) are independent. Theorem 5.8 naturally generalises to the expected value of a linear combination of random variables (see Theorem 11.1).
5.7.2 Moments of a bivariate distribution: covariance
The idea of a moment in the univariate case naturally extends to the bivariate case. Hence, define \(\mu'_{rs} = \operatorname{E}[X^r Y^s]\) or \(\mu_{rs} = \operatorname{E}\big[(X - \mu_X)^r (Y - \mu_Y)^s\big]\) as the raw and central moments for a bivariate distribution.
The most important of these moments is the covariance.
Definition 5.12 (Covariance) The covariance of \(X\) and \(Y\) is defined as \[ \operatorname{Cov}(X, Y) = \operatorname{E}[(X - \mu_X)(Y - \mu_Y)]. \] When \(X\) and \(Y\) are discrete, \[ \operatorname{Cov}(X, Y) = \sum_{x} \sum_{y} (x - \mu_X)(y - \mu_Y)\, p_{X, Y}(x, y). \] When \(X\) and \(Y\) are continuous, \[ \operatorname{Cov}(X, Y) = \int_{-\infty}^\infty\!\int_{-\infty}^\infty (x - \mu_X)(y - \mu_Y)\, f_{X, Y}(x, y)\, dx\, dy. \]
The covariance is a measure of how \(X\) and \(Y\) vary jointly, in the sense that a positive covariance indicates that ‘on average’ \(X\) and \(Y\) increase (or decrease) together, whereas a negative covariance indicates that ‘on average’, as \(X\) increases, \(Y\) decreases (and vice versa). We say that covariance is a measure of linear dependence.
Covariance is best evaluated from the computational formula.
Theorem 5.9 (Covariance) For any random variables \(X\) and \(Y\), \[ \operatorname{Cov}(X, Y) = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y]. \]
Proof. The proof uses Theorems 5.1 and 5.8. \[\begin{align*} \operatorname{Cov}(X, Y) &= \operatorname{E}\big[ (X - \mu_X)(Y-\mu_Y)\big] \\ &= \operatorname{E}[ XY - \mu_X Y - \mu_Y X + \mu_X\mu_Y] \\ &= \operatorname{E}[ XY ] - \mu_X\operatorname{E}[Y] - \mu_Y\operatorname{E}[X] + \mu_X \mu_Y \\ &= \operatorname{E}[ XY ] - \mu_X\mu_Y - \mu_Y\mu_X + \mu_X \mu_Y \\ &= \operatorname{E}[ XY ] - \mu_X \mu_Y. \end{align*}\]
Computing the covariance is tedious: \(\operatorname{E}[X]\), \(\operatorname{E}[Y]\), \(\operatorname{E}[XY]\) need to be computed, and so the joint and marginal distributions of \(X\) and \(Y\) are needed.
Covariance has units given by the product of the units of \(X\) and \(Y\). For example, if \(X\) is measured in metres and \(Y\) is measured in seconds then \(\operatorname{Cov}(X, Y)\) has the units metre-seconds. To compare the strength of covariation amongst pairs of random variables, a unitless measure is useful. Correlation does this by scaling the covariance in terms of the standard deviations of the individual variables.
Definition 5.13 (Correlation) The correlation coefficient between the random variables \(X\) and \(Y\) is denoted by \(\text{Corr}(X, Y)\) or \(\rho_{X, Y}\) and is defined as \[ \rho_{X, Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{ \operatorname{var}[X]\operatorname{var}[Y]}} = \frac{\sigma_{X, Y}}{\sigma_X \sigma_Y}. \]
If there is no confusion over which random variables are involved, we write \(\rho\) rather than \(\rho_{XY}\). It can be shown that \(-1 \leq \rho \leq 1\).
Example 5.30 (Correlation coefficient (discrete rvs)) Consider two discrete random variables \(X\) and \(Y\) with the joint pf given in Table 5.1. To compute the correlation coefficient, the following steps are required.
- \(\text{Corr}(X, Y) = \operatorname{Cov}(X, Y)/\sqrt{ \operatorname{var}[X]\operatorname{var}[Y]}\), so \(\operatorname{Cov}(X, Y)\), \(\operatorname{var}[X]\) and \(\operatorname{var}[Y]\) must be computed;
- To find \(\operatorname{var}[X]\) and \(\operatorname{var}[Y]\), the quantities \(\operatorname{E}[X]\), \(\operatorname{E}[X^2]\), \(\operatorname{E}[Y]\) and \(\operatorname{E}[Y^2]\) are needed, and these require the marginal probability functions of \(X\) and \(Y\).
So first, the marginal pfs are \[ p_X(x) = \sum_{y = -1, 1} p_{X, Y}(x, y) = \begin{cases} 7/24 & \text{for $x = 0$};\\ 8/24 & \text{for $x = 1$};\\ 9/24 & \text{for $x = 2$};\\ 0 & \text{otherwise} \end{cases} \] and \[ p_Y(y) = \sum_{x = 0}^2 p_{X, Y}(x, y) = \begin{cases} 1/2 & \text{for $y = -1$};\\ 1/2 & \text{for $y = 1$};\\ 0 & \text{otherwise.} \end{cases} \] Then, \[\begin{align*} \operatorname{E}[X] &= (7/24 \times 0) + (8/24 \times 1) + (9/24\times 2) = 26/24;\\ \operatorname{E}[X^2] &= (7/24 \times 0^2) + (8/24 \times 1^2) + (9/24\times 2^2) = 44/24;\\ \operatorname{E}[Y] &= (1/2 \times -1) + (1/2 \times 1) = 0;\\ \operatorname{E}[Y^2] &= (1/2 \times (-1)^2) + (1/2 \times 1^2) = 1, \end{align*}\] giving \(\operatorname{var}[X] = 44/24 - (26/24)^2 = 0.6597222\) and \(\operatorname{var}[Y] = 1 - 0^2 = 1\). Then, \[\begin{align*} \operatorname{E}[XY] &= \sum_x\sum_y xy\,p_{X,Y}(x,y) \\ &= (0\times -1 \times 1/8) + (0\times 1 \times 1/6) + \cdots + (2\times 1 \times 1/4) \\ &= 1/12. \end{align*}\] Hence, \[ \operatorname{Cov}(X,Y) = \operatorname{E}[XY] - \operatorname{E}[X] \operatorname{E}[Y] = \frac{1}{12} - \left(\frac{26}{24}\times 0\right) = 1/12, \] and \[ \text{Corr}(X,Y) = \frac{ \operatorname{Cov}(X,Y)}{\sqrt{ \operatorname{var}[X]\operatorname{var}[Y] } } = \frac{1/12}{\sqrt{0.6597222 \times 1}} = 0.1025978, \] so the correlation coefficient is about \(0.10\), and a small positive linear association exists between \(X\) and \(Y\).
TABLE 5.1: The joint probability function for \(X\) and \(Y\) in Example 5.30.

|  | \(x = 0\) | \(x = 1\) | \(x = 2\) | Total |
|---|---|---|---|---|
| \(y = -1\) | \(1/8\) | \(1/4\) | \(1/8\) | \(1/2\) |
| \(y = +1\) | \(1/6\) | \(1/12\) | \(1/4\) | \(1/2\) |
| Total | \(7/24\) | \(1/3\) | \(3/8\) | \(1\) |
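The arithmetic in Example 5.30 can be laid out in R by storing the joint probability function of Table 5.1 as a matrix (a minimal sketch):

```r
# Joint PMF of (X, Y) from Table 5.1: rows are y = -1, 1; columns are x = 0, 1, 2
p <- matrix(c(1/8, 1/4,  1/8,
              1/6, 1/12, 1/4), nrow = 2, byrow = TRUE)
x <- c(0, 1, 2)
y <- c(-1, 1)

p_x <- colSums(p)            # marginal PMF of X: 7/24, 1/3, 3/8
p_y <- rowSums(p)            # marginal PMF of Y: 1/2, 1/2

EX  <- sum(x * p_x);  EX2 <- sum(x^2 * p_x)
EY  <- sum(y * p_y);  EY2 <- sum(y^2 * p_y)
EXY <- sum(outer(y, x) * p)  # E[XY] = 1/12

cov_xy  <- EXY - EX * EY
corr_xy <- cov_xy / sqrt((EX2 - EX^2) * (EY2 - EY^2))
c(covariance = cov_xy, correlation = corr_xy)  # about 0.0833 and 0.1026
```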
5.7.3 Properties of covariance and correlation
- The correlation has no units.
- The covariance has units; if \(X\) is measured in kilograms and \(Y\) in centimetres, then the units of the covariance are kg-cm.
- If the units of measurements change, the numerical value of the covariance changes, but the numerical value of the correlation stays the same. (For example, if \(X\) is changed from kilograms to grams, the numerical value of the correlation will not change in value, but the numerical values of covariance will change.)
- The correlation is a number between \(-1\) and \(1\) (inclusive). When the correlation coefficient (or covariance) is negative, a negative linear relationship is said to exist between the two variables. Likewise, when the correlation coefficient (or covariance) is positive, a positive linear relationship is said to exist between the two variables.
- When the correlation coefficient (or covariance) is zero, no linear dependence is said to exist.
Theorem 5.10 (Properties of the covariance) For random variables \(X\), \(Y\) and \(Z\), and constants \(a\) and \(b\):
- \(\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)\).
- \(\operatorname{Cov}(aX,bY) = ab\,\operatorname{Cov}(X, Y)\).
- \(\operatorname{var}[aX + bY] = a^2\operatorname{var}[X] + b^2\operatorname{var}[Y] + 2ab\,\operatorname{Cov}(X, Y)\).
- If \(X\) and \(Y\) are independent, then \(\operatorname{E}[XY] = \operatorname{E}[X]\operatorname{E}[Y]\) and hence \(\operatorname{Cov}(X,Y) = 0\).
- \(\operatorname{Cov}(X, Y) = 0\) does not imply \(X\) and \(Y\) are independent, except for the special case of the bivariate normal distribution.
A zero correlation coefficient is an indication of no linear dependence only. A relationship may still exist between \(X\) and \(Y\) even if the correlation is zero.
Example 5.31 (Linear dependence and correlation) Consider \(X\) with the pf:
| \(x\) | \(-1\) | \(0\) | \(1\) |
|---|---|---|---|
| \(p_{X}(x)\) | \(1/3\) | \(1/3\) | \(1/3\) |
Then, define \(Y\) to be explicitly related to \(X\): \(Y = X^2\). So, we know a relationship exists between \(X\) and \(Y\) (but the relationship is non-linear). The joint probability function for \((X, Y)\) is shown in Table 5.2. Then \[\begin{equation*} \operatorname{Cov}(X, Y) = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y] = 0 - 0\times 2/3 = 0, \end{equation*}\] so \(\text{Corr}(X, Y) = 0\). But \(X\) and \(Y\) are certainly related, because \(Y\) was explicitly defined as a function of \(X\).
Since the correlation is a measure of the strength of the linear relationship between two random variables, a correlation of zero is simply an indication of no linear relationship between \(X\) and \(Y\). (As is the case in this example, there may be a different relationship between the variables, but no linear relationship.)
TABLE 5.2: The joint probability function for \(X\) and \(Y = X^2\) in Example 5.31.

|  | \(x = -1\) | \(x = 0\) | \(x = 1\) | Total |
|---|---|---|---|---|
| \(y = 0\) | \(0\) | \(1/3\) | \(0\) | \(1/3\) |
| \(y = 1\) | \(1/3\) | \(0\) | \(1/3\) | \(2/3\) |
| Total | \(1/3\) | \(1/3\) | \(1/3\) | \(1\) |
5.7.4 Conditional expectations
Conditional expectations are simply expectations computed from a conditional distribution.
The conditional mean is the expected value computed from a conditional distribution.
Definition 5.14 (Conditional expectation) The conditional expected value or conditional mean of a random variable \(X\) for given \(Y = y\) is denoted by \(\operatorname{E}[X \mid Y = y]\).
If the conditional distribution is discrete with probability mass function \(p_{X\mid Y}(x\mid y)\), then \[ \operatorname{E}[X \mid Y = y] = \displaystyle \sum_{x} x \cdot p_{X\mid Y}(x\mid y). \] If the conditional distribution is continuous with probability density function \(f_{X\mid Y}(x\mid y)\), then \[ \operatorname{E}[X \mid Y = y] = \int_{-\infty}^\infty x \cdot f_{X\mid Y}(x\mid y)\, dx. \]
\(\operatorname{E}[X \mid Y = y]\) is typically denoted \(\mu_{X \mid Y = y}\).
Example 5.32 (Conditional mean (continuous)) Consider the two random variables \(X\) and \(Y\) with joint PDF \[ f_{X, Y}(x, y) = \begin{cases} \frac{3}{5}(x + xy + y^2) & \text{for $0 < x < 1$ and $-1 < y < 1$};\\ 0 & \text{otherwise.} \end{cases} \] To find \(f_{Y \mid X = x}(y\mid x)\), first \(f_X(x)\) is needed: \[ f_X(x) = \int_{-1}^1 f_{X,Y}(x,y) dy = \frac{3}{15}(6x + 2) \] for \(0 < x < 1\). Then, \[ f_{Y \mid X = x}(y \mid x) = \frac{ f_{X, Y}(x, y)}{ f_X(x) } = \frac{3(x + xy + y^2)}{6x + 2} \] for \(-1 < y < 1\) and given \(0 < x < 1\). The expected value of \(Y\) given \(X = x\) is then \[ \operatorname{E}[Y\mid X = x] = \frac{x}{3x + 1}. \] This expression indicates that the conditional expected value of \(Y\) depends on the given value of \(X\); for example, \[\begin{align*} \operatorname{E}[Y\mid X = 0] &= 0;\\ \operatorname{E}[Y\mid X = 0.5] &= 0.2;\\ \operatorname{E}[Y\mid X = 1] &= 1/4. \end{align*}\] Since \(\operatorname{E}[Y\mid X = x]\) depends on the value of \(X\), this means \(X\) and \(Y\) are not independent.
The conditional variance is the variance computed from a conditional distribution.
Definition 5.15 (Conditional variance) The conditional variance of a random variable \(X\) for given \(Y = y\) is denoted by \(\operatorname{var}[X \mid Y = y]\).
If the conditional distribution is discrete with probability mass function \(p_{X\mid Y}(x\mid y)\), then \[ \operatorname{var}[X \mid Y = y] = \displaystyle \sum_{x} (x - \mu_{X\mid y})^2\, p_{X\mid Y}(x\mid y), \] where \(\mu_{X \mid y}\) is the conditional mean of \(X\) given \(Y = y\).
If the conditional distribution is continuous with probability density function \(f_{X\mid Y}(x\mid y)\), then \[ \operatorname{var}[X \mid Y = y] = \int_{-\infty}^\infty (x - \mu_{X\mid y})^2\, f_{X\mid Y}(x\mid y)\, dx, \] where \(\mu_{X \mid y}\) is the conditional mean of \(X\) given \(Y = y\).
For brevity, \(\operatorname{var}[X \mid Y = y]\) is often denoted \(\sigma^2_{X \mid Y = y}\).
Example 5.33 (Conditional variance (continuous)) Refer to Example 5.32. The conditional variance of \(Y\) given \(X = x\) can be found by first computing \(\operatorname{E}[Y^2\mid X = x]\): \[\begin{align*} \operatorname{E}[Y^2\mid X = x] &= \int_{-1}^1 y^2 f_{Y\mid X = x}(y\mid x)\,dy \\ &= \frac{3}{6x + 2} \int_{-1}^1 y^2 (x + xy + y^2)\, dy \\ &= \frac{5x + 3}{5(3x + 1)}. \end{align*}\] So the conditional variance is \[\begin{align*} \operatorname{var}[Y\mid X = x] &= \operatorname{E}[Y^2\mid X = x] - \left( \operatorname{E}[Y\mid X = x] \right)^2 \\ &= \frac{5x+3}{5(3x + 1)} - \left( \frac{x}{3x + 1}\right)^2 \\ &= \frac{10x^2 + 14x + 3}{5(3x + 1)^2} \end{align*}\] for given \(0 < x < 1\). Hence the variance of \(Y\) depends on the value of \(X\) that is given; for example, \[\begin{align*} \operatorname{var}[Y\mid X = 0] &= 3/5 = 0.6\\ \operatorname{var}[Y\mid X = 0.5] &= \frac{10\times (0.5)^2 + (14\times0.5) + 3}{5(3\times0.5 + 1)^2} = 0.4\\ \operatorname{var}[Y\mid X = 1] &= 27/80 = 0.3375. \end{align*}\]
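Similarly, the conditional variance at \(x = 0.5\) can be checked numerically, re-using the f_cond() function from the sketch above:
### E[Y | X = 0.5] and E[Y^2 | X = 0.5] by numerical integration
EY  <- integrate(function(y) y   * f_cond(y, x = 0.5),
                 lower = -1, upper = 1)$value
EY2 <- integrate(function(y) y^2 * f_cond(y, x = 0.5),
                 lower = -1, upper = 1)$value

### The conditional variance
EY2 - EY^2
#> [1] 0.4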
In general, to compute the conditional variance of \(X\mid Y = y\) given a joint probability function, the following steps are required.
- Find the marginal distribution of \(Y\).
- Use this to compute the conditional probability function \(p_{X \mid Y = y}(x \mid y) = p_{X, Y}(x, y)/p_{Y}(y)\).
- Find the conditional mean \(\operatorname{E}[X \mid Y = y]\).
- Find the conditional second raw moment \(\operatorname{E}[X^2 \mid Y = y]\).
- Finally, compute \(\operatorname{var}[X\mid Y = y] = \operatorname{E}[X^2\mid Y=y] - (\operatorname{E}[X\mid Y=y])^2\).
Example 5.34 (Conditional variance (discrete)) Two discrete random variables \(U\) and \(V\) have the joint probability function given in Table 5.3. To find the conditional variance of \(V\) given \(U = 11\), use the steps above.
First, find the marginal distribution of \(U\): \[ p_U(u) = \begin{cases} 4/9 & \text{for $u = 10$};\\ 7/18 & \text{for $u = 11$};\\ 1/6 & \text{for $u = 12$};\\ 0 & \text{otherwise.}\\ \end{cases} \] Secondly, compute the conditional probability function: \[\begin{align*} p_{V\mid U = 11}(v \mid u = 11) &= p_{U, V}(u,v)/p_{U}(u = 11) \\ &= \begin{cases} \frac{1/18}{7/18} = 1/7 & \text{if $v = 0$};\\ \frac{1/3}{7/18} = 6/7 & \text{if $v = 1$} \end{cases} \end{align*}\] using \(p_U(u = 11) = 7/18\) from Step 1.
Thirdly, find the conditional mean: \[ \operatorname{E}[V\mid U = 11] = \sum_v v\, p_{V\mid U}(v\mid u = 11) = \left(\frac{1}{7}\times 0\right) + \left(\frac{6}{7}\times 1\right) = 6/7. \] Fourthly, find the conditional second raw moment: \[ \operatorname{E}[V^2\mid U = 11] = \sum_v v^2\, p_{V\mid U}(v\mid u = 11) = \left(\frac{1}{7}\times 0^2\right) + \left(\frac{6}{7}\times 1^2\right) = 6/7. \] Finally, compute: \[\begin{align*} \operatorname{var}[V\mid U = 11] &= \operatorname{E}[V^2\mid U = 11] - (\operatorname{E}[V\mid U = 11])^2\\ &= (6/7) - (6/7)^2\\ &\approx 0.1224. \end{align*}\]
| | \(u = 10\) | \(u = 11\) | \(u = 12\) | Total |
|---|---|---|---|---|
| \(v = 0\) | \(1/9\) | \(1/18\) | \(1/6\) | \(1/3\) |
| \(v = 1\) | \(1/3\) | \(1/3\) | \(0\) | \(2/3\) |
| Total | \(4/9\) | \(7/18\) | \(1/6\) | \(1\) |
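The steps in Example 5.34 can also be carried out in R directly from the joint probability table in Table 5.3; a minimal sketch follows (the object names are illustrative):
### Joint probability table: rows are v = 0, 1; columns are u = 10, 11, 12
pUV <- rbind(c(1/9, 1/18, 1/6),  # v = 0
             c(1/3, 1/3,  0))    # v = 1
v <- c(0, 1)

### Conditional PMF of V given U = 11 (the second column, rescaled)
p_cond <- pUV[, 2] / sum(pUV[, 2])

### Conditional mean, second raw moment, and conditional variance
EV  <- sum(v   * p_cond)
EV2 <- sum(v^2 * p_cond)
EV2 - EV^2
#> [1] 0.122449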
5.8 Numerical approaches
Some distributions cannot be written in closed form (i.e., as a neat expression in terms of standard mathematical functions), which makes tasks such as evaluating probabilities and computing means difficult. In these cases, numerical methods, such as numerical integration, may be needed.
Consider the random variable \(X\) with density function
\[
f_X(x) = \frac{c}{\sqrt{x}}\exp( -x - \sqrt{2x})\quad \text{for $x > 0$},
\]
for some normalising constant \(c\). This density cannot be integrated in closed form, so the value of \(c\) cannot be found analytically.
However, the value of \(c\) can be found using numerical integration in R, via integrate():
### Define the function, without the constant term
g <- function(x){
  ifelse(x > 0,
         x^(-0.5) * exp(-x - sqrt(2 * x)), # When x > 0
         0)                                # When x <= 0
}

### Integrate between 0 and infinity:
Const <- 1 / integrate(g,
                       lower = 0,
                       upper = Inf)$value

### In cat(), "\n" adds a new line
cat("The value of c is about", round(Const, 4), "\n")
#> The value of c is about 1.0784

# So now define f(x):
f <- function(x){
  ifelse(x > 0,
         x^(-0.5) * exp(-x - sqrt(2 * x)) * Const, # When x > 0
         0)                                        # When x <= 0
}

That is, \(c = 1.0784\dots\). The distribution function can even be found:
F <- function(x){
  F <- array(dim = length(x))
  for (i in 1:length(x)){
    F[i] <- integrate(f,
                      lower = 0,
                      upper = x[i])$value
  }
  return(F)
}

So now the density and the distribution function can be plotted (Fig. 5.7):
### Make room for two plots side-by-side
par( mfrow = c(1, 2))

### Evaluate over these values of x
x <- seq(0.01, 2,
         length = 1000)

### Now plot
plot( f(x) ~ x,
      type = "l",
      las = 1,
      lwd = 3,
      main = "Density function",
      xlab = expression(italic(x)),
      ylab = "Density fn.")
plot( F(x) ~ x,
      type = "l",
      las = 1,
      lwd = 3,
      main = "Distribution function",
      xlab = expression(italic(x)),
      ylab = "Distribution fn.")
FIGURE 5.7: The density function (left) and distribution function (right) of a distribution that cannot be expressed in closed form.
Probabilities can then be computed; for example, we can find \(\Pr(X > 0.5)\):
integrate(f,
          lower = 0.5,
          upper = Inf)
#> 0.1433935 with absolute error < 4.4e-06

And the expected value of \(X\), \(\operatorname{E}[X]\), can be found:
f_Expected <- function(x){
  x * f(x)
}
integrate(f_Expected,
          lower = 0,
          upper = Inf)
#> 0.2374324 with absolute error < 2.1e-07

5.9 Exercises
Selected answers appear in Sect. E.5.
Exercise 5.1 The PDF for the random variable \(Y\) is defined as \[ f_Y(y) = \begin{cases} 2y + k & \text{for $1\le y \le 2$};\\ 0 & \text{elsewhere}. \end{cases} \]
- Find the value of \(k\).
- Plot the PDF of \(Y\).
- Compute \(\operatorname{E}[Y]\).
- Compute \(\operatorname{var}[Y]\).
- Compute \(\Pr(Y > 1.5)\).
Exercise 5.2 The PDF for the random variable \(X\) is defined as \[ f_X(x) = \begin{cases} 2(x + 1)/ 3 & \text{for $-1\le x \le 0$};\\ (2 - x)/ 3 & \text{for $0\le x \le 2$};\\ 0 & \text{elsewhere}. \end{cases} \]
- Plot the PDF of \(X\).
- Compute \(\operatorname{E}[X]\).
- Compute \(\operatorname{var}[X]\).
- Compute \(\Pr(X > 0)\).
Exercise 5.3 The random variable \(T\) has the density function \[ f_T(t) = \begin{cases} k & \text{for $0 < t < 1$};\\ 2k(2 - t) & \text{for $1 < t < 2$}. \end{cases} \]
- Plot the PDF of \(T\).
- Compute \(\operatorname{E}[T]\).
- Compute \(\operatorname{var}[T]\).
- Find and plot the distribution function of \(T\).
- Compute \(\Pr(T \le 1)\).
Exercise 5.4 The random variable \(Z\) has the density function \[ f_Z(z) = \begin{cases} 1 - z & \text{for $0 < z < 1$};\\ 2 + z & \text{for $2 < z < 3$};\\ 0 & \text{elsewhere}. \end{cases} \]
- Plot the PDF of \(Z\).
- Compute \(\operatorname{E}[Z]\).
- Compute \(\operatorname{var}[Z]\).
- Find and plot the distribution function of \(Z\).
- Compute \(\Pr(Z > 2 \mid Z > 1)\).
Exercise 5.5 The PMF for the random variable \(D\) is defined as \[ p_D(d) = \begin{cases} 1/2 & \text{for $d = 1$};\\ 1/4 & \text{for $d = 2$};\\ k & \text{for $d = 3$};\\ 0 & \text{otherwise}, \end{cases} \] for a constant \(k\).
- Plot the probability mass function.
- Compute the mean and variance of \(D\).
- Find the MGF for \(D\).
- Compute the mean and variance of \(D\) from the MGF.
- Compute \(\Pr(D < 3)\).
Exercise 5.6 The PMF for the random variable \(Z\) is defined as \[ p_Z(z) = \begin{cases} c/z^2 & \text{for $z = 1, 2, 3, 4$};\\ 0 & \text{otherwise}, \end{cases} \] for a constant \(c\).
- Plot the probability mass function.
- Compute the mean and variance of \(Z\).
- Find the MGF for \(Z\).
- Compute the mean and variance of \(Z\) from the MGF.
- Compute \(\Pr(Z \ge 2)\).
Exercise 5.7 The MGF of the discrete random variable \(Z\) is \[ M_Z(t) = [0.3\exp(t) + 0.7]^2. \]
- Compute the mean and variance of \(Z\).
- Find the probability function of \(Z\).
Exercise 5.8 The MGF of the discrete random variable \(W\) is \[ M_W(t) = \frac{p}{1 - (1 - p)\exp(t)}\quad\text{for $t < -\log(1 - p)$}. \]
- Compute the mean and variance of \(W\).
- Find the probability mass function of \(W\).
Exercise 5.9 The random variable \(A\) has mean \(13\) and variance \(5\). The random variable \(B\) has mean \(4\) and variance \(2\). Assuming \(A\) and \(B\) are independent, find:
- \(\operatorname{E}[A + B]\).
- \(\operatorname{var}[A + B]\).
- \(\operatorname{E}[2A - 3B]\).
- \(\operatorname{var}[2A - 3B]\).
Exercise 5.10 Repeat Exercise 5.9, but with \(\operatorname{Cov}(A, B) = 0.2\).
Exercise 5.11 The MGF of \(G\) is \(M_G(t) = (1 - \beta t)^{-\alpha}\) (where \(\alpha\) and \(\beta\) are constants). Find the mean and variance of \(G\).
Exercise 5.12 Suppose that the PDF of \(X\) is \[ f_X(x) = \begin{cases} 2(1 - x) & \text{for $0 < x < 1$};\\ 0 & \text{otherwise}. \end{cases} \]
- Find the \(r\)th raw moment of \(X\).
- Find the \(r\)th central moment of \(X\).
- Find \(\operatorname{E}\big[(X + 3)^2\big]\) using the previous answer.
- Find the value of the skewness \(\gamma_1\) using the previous results.
- Find the value of the excess kurtosis \(\gamma_2\) using the previous results.
- Find the variance of \(X\).
Exercise 5.13 Suppose that the PDF of \(X\) is \[ f_X(x) = \begin{cases} 1/x^2 & \text{for $1 < x < \infty$};\\ 0 & \text{otherwise}. \end{cases} \]
- Find the \(r\)th raw moment of \(X\).
- Find the \(r\)th central moment of \(X\).
- Find \(\operatorname{E}\big[(X -1)^2\big]\) using the previous results.
- Find the value of the skewness \(\gamma_1\) using the previous results.
- Find the value of the excess kurtosis \(\gamma_2\) using the previous results.
- Find the variance of \(X\).
Exercise 5.14 Find the MGF for the continuous random variable \(Y\) with probability density function \[ f_Y(y) = 1/2\quad\text{for $3 < y < 5$}. \]
Exercise 5.15 Find the MGF for the continuous random variable \(R\) with probability density function \[ f_R(r) = 6 r (1 - r) \quad\text{for $0 < r < 1$}. \]
Exercise 5.16 Consider the PDF \[ f_Y(y) = \frac{2}{y^2}\qquad y\ge 2. \]
- Show that the mean of the distribution is not defined.
- Show that the variance does not exist.
- Plot the probability density function over a suitable range.
- Plot the distribution function over a suitable range.
- Determine the median of the distribution.
- Determine the interquartile range of the distribution. (The interquartile range is a measure of spread, and is calculated as the difference between the third quartile and the first quartile. The first quartile is the value below which \(25\)% of the data lie; the third quartile is the value below which \(75\)% of the data lie.)
- Find \(\Pr(Y > 4 \mid Y > 3)\).
Exercise 5.17 The Cauchy distribution has the PDF \[\begin{equation} f_X(x) = \frac{1}{\pi(1 + x^2)}\quad\text{for $x\in\mathbb{R}$}. \tag{5.6} \end{equation}\]
- Use R to draw the probability density function.
- Compute the distribution function for \(X\). Again, use R to draw the function.
- Show that the mean of the Cauchy distribution is not defined.
- Find the variance and the mode of the Cauchy distribution.
Exercise 5.18 The exponential distribution has the PDF \[ f_Y(y) = \frac{1}{\lambda}\exp( -y/\lambda) \] (for \(\lambda > 0\)) for \(y > 0\) and is zero elsewhere.
- Determine the moment-generating function of \(Y\).
- Use the moment-generating function to compute the mean and variance of the exponential distribution.
Exercise 5.19 Prove that if a continuous random variable \(X\) has a distribution that is symmetric about \(0\), then \(M_X(t) = M_{-X}(t)\). Hence prove that, for such a random variable, all odd moments about the origin are zero.
Exercise 5.20 The continuous random variable \(X\) is defined with the probability density function \[ f_X(x) = \frac{x + a + 1}{2(2 + a)}\quad \text{for $0 \le x \le 2$} \] for some real value \(a\).
- Find the possible values for \(a\) such that \(f_X(x)\) is a valid probability function.
- If \(\operatorname{E}[X] = 4/3\), find the values of \(a\) such that \(f_X(x)\) is a valid probability function.
Exercise 5.21 The continuous random variable \(X\) is defined with the probability density function \[ f_X(x) = x^{2a} - x^a + 7/6\quad \text{for $0 \le x \le 1$} \] for some real value \(a\).
- Find the possible values for \(a\) such that \(f_X(x)\) is a valid probability function.
- If \(\operatorname{E}[X] > 1/2\), find the values of \(a\) such that \(f_X(x)\) is a valid probability function.
Exercise 5.22 (This exercise follows from Exercise 3.16.) In a study modelling waiting times at a hospital (Khadem et al. 2008), patients are classified into one of three categories:
- Red: Critically ill or injured patients.
- Yellow: Moderately ill or injured patients.
- Green: Minimally injured or uninjured patients.
For ‘Yellow’ patients, the service times of doctors are modelled using a triangular distribution, with a minimum at \(3.5\,\text{mins}\), a maximum at \(30.5\,\text{mins}\) and a mode at \(5\,\text{mins}\).
- Compute the mean of the service times.
- Compute the variance of the service times.
Exercise 5.23 (This exercise follows Ex. 3.18.) Five people, including you and a friend, line up at random. The random variable \(X\) denotes the number of people between yourself and your friend.
Use the probability function of \(X\) found in Ex. 3.18, and find the mean number of people between you and your friend. Simulate this in R to confirm your answer.
Exercise 5.24 The characteristic function of a random variable \(X\), denoted \(\varphi(t)\), is defined as \(\varphi_X(t) = \operatorname{E}[\exp(i t X)]\), where \(i = \sqrt{-1}\). Unlike the MGF, the characteristic function is always defined, so is sometimes preferred over the MGF.
- Show that \(M_X(t) = \varphi_X(-it)\).
- Show that the mean of a random variable \(X\) is given by \(-i\varphi'(0)\) (where, as before, the notation means to compute the derivative of \(\varphi(t)\) with respect to \(t\), and evaluate at \(t = 0\)).
Exercise 5.25 App. B.2 will prove useful.
- Write down the first three terms and general term of the expansion of \((1 - a)^{-1}\).
- Write down the first three terms and general term of the expansion of \(\operatorname{E}\left[(1 - tX)^{-1}\right]\).
- Suppose \(\mathcal{R}_X(t) = \operatorname{E}\left [(1 - tX)^{-1} \right]\), called the geometric generating function of \(X\). Suppose the random variable \(Y\) has a uniform distribution on \((0, 1)\); i.e., \(f_Y(y) = 1\) for \(0 < y < 1\). Determine the geometric generating function of \(Y\) from the definition of the expected value. Your answer will involve a term \(\log(1 - t)\).
- Using the answer in Part 3, expand the term \(\log(1 - t)\) by writing it as an infinite series.
- Equate the two series expansions in Part 2 and Part 4 to determine an expression for \(\operatorname{E}[Y^n]\), \(n = 1, 2, 3,\dots\).
Exercise 5.26 Suppose the random variable \(X\) is defined as \[ f_X(x) = k (3x^2 + 4)\quad\text{for $-c < x < c$}, \] and is zero elsewhere. Solve for \(c\) and \(k\) if \(\operatorname{var}[X] = 28/15\). (Hint: Make sure you use the properties of the given probability distribution before embarking on complicated expressions!)
Exercise 5.27 Suppose the random variable \(Y\) is defined as \[ f_Y(y) = \begin{cases} c & \text{for $0 < y < 1$}\\ k(y - 4)/3 & \text{for $1 < y < 4$};\\ 0 & \text{elsewhere}. \end{cases} \]
- What values of \(c\) and \(k\) are possible?
- If \(c = k\), what are the values of \(c\) and \(k\)?
- If \(k = 2\), what is the value of \(c\)?
Exercise 5.28 Suppose the random variable \(Y\) is defined as \[ f_Y(y) = \begin{cases} \exp(-y^2) & \text{for $0 < y < k$}\\ 0 & \text{elsewhere}, \end{cases} \] and \(\operatorname{E}[Y] = 1/2\). What is the value of \(k\)?
Exercise 5.29 Suppose the random variable \(X\) is defined as \[ f_X(x) = \begin{cases} x^r & \text{for $0 < x < 5$}\\ 0 & \text{elsewhere}, \end{cases} \] and \(\operatorname{E}[X] = 625\). What is the value of \(r\)?
Exercise 5.30 Benford’s law (also see Exercise 3.24) describes the distribution of the leading digits of numbers that span many orders of magnitude (e.g., lengths of rivers) as \[ p_D(d) = \log_{10}\left( \frac{d + 1}{d} \right) \quad\text{for $d\in\{1, 2, \dots, 9\}$}. \] Find the mean of \(D\) (i.e., the mean leading digit).
Exercise 5.31 The von Mises distribution is used to model angular data. The probability density function is \[ f_Y(y) = k \exp\{ \lambda\cos(y - \mu) \} \] for \(0\le y < 2\pi\) and \(0 \le \mu < 2\pi\), where \(\mu\) is the mean, and with \(\lambda > 0\).
- Show that the constant \(k\) is a function of \(\lambda\) only.
- Find the median of the distribution.
- Using R, numerically integrate to find the value \(k\) when \(\mu = \pi/2\) and \(\lambda = 1\).
- The distribution function has no closed form. Use R to plot the distribution function for \(\mu = \pi/2\) with \(\lambda = 1\).
Exercise 5.32 The inverse Gaussian distribution has the PDF \[ f_Y(y) = \frac{1}{\sqrt{2\pi y^3\phi}} \exp\left\{ -\frac{1}{2\phi} \frac{(y - \mu)^2}{y\mu^2}\right\} \] for \(y > 0\), \(\mu > 0\) and \(\phi > 0\).
- Plot the distribution for \(\mu = 1\) for various values of \(\phi\); comment.
- The MGF is \[ M_Y(t) = \exp\left\{ \frac{\lambda}{\mu} \left( 1 - \sqrt{1 - \frac{2\mu^2 t}{\lambda}} \right) \right\}, \] where \(\lambda = 1/\phi\). Use the MGF to deduce the mean and variance of the inverse Gaussian distribution.
Exercise 5.33 Consider the random variable \(W\) such that \[ f_W(w) = \frac{c}{w^3}\quad\text{for $w > c$.} \]
- Find the value of \(c\).
- Find \(\operatorname{E}[W]\).
- Find \(\operatorname{var}[W]\).
Exercise 5.34 The Pareto distribution has the distribution function \[ F_X(x) = \begin{cases} 1 - \left(\frac{k}{x}\right)^\alpha & \text{for $x > k$};\\ 0 & \text{elsewhere}, \end{cases} \] for \(\alpha > 0\) and parameter \(k\).
- What values of \(k\) are possible?
- Find the density function for the Pareto distribution.
- Compute the mean and variance for the Pareto distribution.
- Find the mode of the Pareto distribution.
- Plot the distribution for \(\alpha = 3\) and \(k = 3\).
- For \(\alpha = 3\) and \(k = 3\), compute \(\Pr(X > 4 \mid X < 5)\).
- The Pareto distribution is often used to model incomes. For example, the “\(80\)–\(20\) rule” states that \(20\)% of people receive \(80\)% of all income (and, further, that \(20\)% of the highest-earning \(20\)% receive \(80\)% of that \(80\)%). Find the value of \(\alpha\) for which this rule holds.
Exercise 5.35 A mixture distribution is a mixture of two or more univariate distributions. For example, the heights of all adults may follow a mixture distribution: one normal distribution for adult females, and another for adult males. For a set of probability functions \(p^{(i)}_X(x)\) for \(i = 1, 2, \dots n\) and a set of weights \(w_i\) such that \(\sum w_i = 1\) and \(w_i \ge 0\) for all \(i\), the mixture distribution \(f_X(x)\) is \[ f_X(x) = \sum_{i = 1}^n w_i p^{(i)}_X(x). \]
- Compute the distribution function for \(f_X(x)\).
- Compute the mean and variance of \(f_X(x)\).
- Consider the case where \(p^{(i)}_X(x)\) has a normal distribution for \(i = 1, 2, 3\), where the means are \(-1\), \(2\), and \(4\) respectively, and the variances are \(1\), \(1\) and \(4\) respectively. Plot the probability density function of \(f_X(x)\) for various instructive values of the weights.
- Suppose heights of female adults have a normal distribution with mean \(163\,\text{cm}\) and a standard deviation of \(5\,\text{cm}\), and adult males have heights with a mean of \(175\,\text{cm}\) with standard deviation of \(7\,\text{cm}\), and constitute \(48\)% of the population (Australian Bureau of Statistics 1995). Deduce and plot the probability density of heights of adult Australians.
Exercise 5.36 Consider the distribution such that \[ p_X(x) = \begin{cases} 1/K & \text{for $x = 1$};\\ 1/\left(x(x - 1)\right) & \text{for $x = 2, 3, \dots, K$};\\ 0 & \text{elsewhere} \end{cases} \] for \(K > 2\).
- Find the mean and variance of \(X\) (as well as possible).
- Plot the distribution for various values of \(K\).
- For \(K = 6\), determine the MGF.
Exercise 5.37 The random variable \(V\) has the PMF \[ p_V(v) = (1 - p)^{v - 1} p\quad\text{for $v = 1, 2, \dots$}, \] and zero elsewhere.
- Show that this is a valid PMF.
- Find \(\operatorname{E}[V]\).
- Find \(\operatorname{var}[V]\).
Exercise 5.38 Two dice are rolled. Deduce the PMF for the absolute difference between the two numbers that appear uppermost.
Exercise 5.39 The random variable \(Y\) has the PMF \[ p_Y(y) = \frac{e^{-\lambda}\lambda^y}{y!}\quad\text{for $y = 0, 1, 2, \dots$}, \] where \(\lambda > 0\). Find the MGF of \(Y\), and hence show that \(\operatorname{E}[Y] = \operatorname{var}[Y]\).
Exercise 5.40 In practice, some distributions cannot be written in closed form, but can be given by writing their moment-generating function. To evaluate the density then requires an infinite summation or an infinite integral. Given a moment-generating function \(M(t)\), the probability density function can be reconstructed numerically from the integral using the inversion formula in Eq. (5.4).
The evaluation of the integral generally requires advanced numerical techniques. In this question, we just consider the exponential distribution as a simple example to demonstrate the use of the inversion formula.
- Write down the expression in Eq. (5.4) in the case of the exponential distribution, for which \(M_X(t) = \lambda/(\lambda - t)\) for \(t < \lambda\).
- Only the real part of the integral is needed. Extract the real parts of this expression, and simplify the integrand. (The integrand is the expression to be integrated.)
- Plot the integrand from the last part from \(t = -50\) to \(t = 2\) in the case \(\lambda = 2\) and \(x = 1\).
Exercise 5.41 The density function for the random variable \(X\) is given as \[ f(x) = x e^{-x} \quad\text{for $x > 0$}. \]
- Determine the moment-generating function (MGF) of \(X\).
- Use the MGF to verify that \(\operatorname{E}[X] = \operatorname{var}[X]\).
- Suppose that \(Y = 1 - X\). Determine \(\operatorname{E}[Y]\).
Exercise 5.42 The Gumbel distribution has the cumulative distribution function \[\begin{equation} F(x; \mu, \sigma) = \exp\left[ -\exp\left( -\frac{x - \mu}{\sigma}\right)\right] \tag{5.7} \end{equation}\] (for \(\mu > 0\) and \(\sigma > 0\)) and is often used to model extreme values (such as the distribution of the maximum river height).
- Deduce the probability density function for the Gumbel distribution.
- Plot the Gumbel distribution for a variety of parameters.
- The maximum daily precipitation (in mm) in Oslo, Norway, is well-modelled using a Gumbel distribution with \(\mu = 2.6\,\text{mm}\) and \(\sigma = 1.86\,\text{mm}\). Draw this distribution, and explain what it means.
Exercise 5.43 The density function for the random variable \(X\) is given as \[ f(x) = x e^{-x} \quad\text{for $x > 0$}. \]
- Determine \(\operatorname{E}[X]\).
- Verify that \(\operatorname{E}[X] = \operatorname{var}[X]\).
Exercise 5.44 (This exercise follows from Ex. 3.22.) To detect disease in a population through a blood test, usually every individual is tested. If the disease is uncommon, however, an alternative method is often more efficient.
In the alternative method (called a pooled test), blood from \(n\) individuals is combined, and one test is conducted. If the test returns a negative result, then none of the \(n\) people have the disease; if the test returns a positive result, all \(n\) individuals are then tested individually to identify which individual(s) have the disease.
Suppose a disease occurs in an unknown proportion \(p\) of people. Let \(X\) be the number of tests to be performed for a group of \(n\) individuals.
- What is the expected number of tests needed in a group of \(n\) people using the pooled method?
- What is the variance of the number of tests needed in a group of \(n\) people using the pooled method?
- Explain what happens to the mean and variance as \(p \to 1\) and as \(p \to 0\), and how these results make sense in the context of the question.
- If pooling was not used with a group of \(n\) people, the number of tests would be \(n\): one for each person. Deduce an expression for the value of \(p\) for which the expected number of tests using the pooled approach exceeds the non-pooled approach.
- Produce a well-labelled plot showing the expected number of tests that are saved by using the pooled method when \(p = 0.1\) for values of \(n\) from \(2\) to \(10\), and comment on what this shows practically.
- Suppose a test costs $\(15\). What is the expected cost-saving for using the pooled-testing method with \(n = 4\) and \(p = 0.1\), if \(200\) people must be tested?
Exercise 5.45 Consider rolling a fair, six-sided die. The ‘running total’ is the total of all the numbers rolled on the die.
- Find the probability mass function for \(R\), the number of rolls needed to obtain a running total of \(3\) or more.
- Find the expected number of rolls until the running total reaches \(3\) or more.
Exercise 5.46 Besides the variance, an alternative measure of variation is the mean absolute deviation (MAD), defined as \(\operatorname{E}[\,|X - \mu|\,]\).
Consider the fair die described in Example 5.9.
- Find \(\operatorname{E}[X]\).
- Find \(\operatorname{MAD}[X]\) using the above definition.
Exercise 5.47 Suppose a random variable \(W\) has the probability density function \[ f_W(w) = K \, \exp(-w^4)\quad\text{for $w\in \mathbb{R}$,} \] for some normalising constant \(K\).
- Using a computer, determine a value for \(K\).
- Plot the density function and the distribution function of \(W\).
- Using a computer, find the mean of the distribution.
- Using a computer, find the variance of the distribution.
- Using a computer, find \(\Pr(W < 1 \mid W > -1)\).
Exercise 5.48 Suppose a random variable \(Y\) has the probability density function \[ f_Y(y) = k\, y^{\alpha - 1} (1 + y)^{-\alpha-\beta}\quad\text{for $y > 0$,} \] for some normalising constant \(k\), where \(\alpha > 0\) and \(\beta > 0\).
- Using a computer, determine a value for \(k\) when \(\alpha = 0.5\) and \(\beta = 2.5\).
- Plot the density function and the distribution function of \(Y\).
- Using a computer, find the mean of the distribution. (Compare to the theoretical mean of \(\operatorname{E}[Y] = \alpha/(\beta - 1)\) provided \(\beta > 1\).)
- Using a computer, find the variance of the distribution.
- Using a computer, find \(\Pr(Y < 1)\).
The mean is not defined if \(\beta \le 1\).
- What happens if you use a computer to produce the density function for \(\alpha = 0.5\) and \(\beta = 0.5\)?
- What does the simulation suggest for the value of \(\operatorname{E}[Y]\) for \(\alpha = 0.5\) and \(\beta = 0.5\)?
Exercise 5.49 Consider the random variable \(X\) with the probability density function \[ f_X(x) = 3 x^2/2\quad\text{for $-1 < x < 1$} \] and is zero elsewhere.
- Plot the PDF.
- Compute the mean and the variance.
- Without computation, determine the skewness.
- Compute the kurtosis, and then show that the excess kurtosis is a negative value.
- The text states that distributions with negative excess kurtosis (platykurtic distributions) have fewer, or less extreme, observations in the tail compared to the normal distribution. Explain why this distribution has a negative kurtosis.
Exercise 5.50 Consider the random variable \(X\) with the probability density function \[ f_X(x) = x/2 \quad\text{for $0 < x < 2$} \] and is zero elsewhere.
- Plot the PDF.
- Find an expression for the \(r\)th raw moment.
- Compute the mean and the variance.
- Compute the skewness, and explain what this value means.
- Compute the kurtosis, and explain what this value means.