12 Describing samples

use appropriate techniques to determine the sampling distributions of $t$, $F$, and $\chi^2$ distributions.
explain how the above distributions are related to the normal distribution.
apply the concepts of the Central Limit Theorem in appropriate circumstances.
use the Central Limit Theorem to approximate binomial probabilities by normal probabilities in appropriate circumstances.

12.1 Introduction

The preceding chapters established the theoretical foundations of probability distributions: how random variables behave in populations and how their properties can be characterised mathematically. However, in applied statistics, entire populations are rarely observed. Instead, samples are drawn, and used to make inferences about the unknown parameters that describe the population.

The bridge between samples and populations is built on three distributions that arise naturally when working with normally-distributed data: the chi-squared ($\chi^2$) distribution, the $t$-distribution, and the $F$-distribution. Each one describes a specific type of quantity computed from sample data, and together they underpin the majority of classical hypothesis tests and confidence intervals encountered in practice.

12.2 From populations to observations

Until now, theoretical probability distributions describing ideal distributions of infinite populations have been our focus. In practice, of course, ideal distributions and infinite populations do not exist, and finite (usually relatively small) samples from unknown distributions are observed.

Example 12.1 (Populations and models) Suppose our population of interest is Australian adult females, and their heights are of interest. ‘Height of Australian adult females’ is a continuous random variable, say $X$.

This population of heights can be modelled theoretically using a specified distribution, with probability density function $f_X(x)$. The mean height of Australian adult females is then (Def. 5.1) \[ \mu = \operatorname{E}[X] = \int_{-\infty}^\infty x\cdot f_X(x)\,dx. \]

If the heights are modelled using the normal distribution $N(\mu = 173, \sigma^2 = 9)$, the mean is modelled as $173\,\text{cm}$ and the variance as $9\,\text{cm}^2$.

While height is a continuous random variable, and heights can be modelled using a continuous probability distribution, the heights of individual Australian females cannot in practice be measured with infinite precision: any measuring tool records a rounded version of the height (for example, to the nearest centimetre or the nearest millimetre), and so measured heights are recorded as discrete values. So even if it were possible to measure the height of every Australian adult female, we would in practice have a list of discrete values $\{x_1, x_2, \dots, x_N\}$, where $N$ is the size of the population.

So, the observed heights are actually discrete (rounded) values, and the population mean of the measured heights would be computed (from Def. 5.1) using \[ \mu = \operatorname{E}[X] = \frac{1}{N} \sum_{i = 1}^N x_i. \]
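As a rough illustration of this formula (a sketch only: the population size `N_pop` and the seed are arbitrary assumptions, not values from the text), a finite 'population' of heights can be simulated from the model of Example 12.1, rounded to the nearest centimetre, and averaged:

```r
set.seed(1)                                   # Arbitrary seed, for reproducibility
N_pop <- 1e6                                  # Hypothetical population size (assumption)
heights  <- rnorm(N_pop, mean = 173, sd = 3)  # 'True' heights from the model N(173, 9)
measured <- round(heights)                    # Heights recorded to the nearest centimetre
mu_measured <- sum(measured) / N_pop          # Population mean of the measured heights
mu_measured
```

Under these assumptions, the mean of the rounded values should be very close to the model mean of $173\,\text{cm}$.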

12.3 Random sampling and statistics

In practice, entire populations are very rarely studied: the population is usually too large (and so studying the whole population is prohibitively expensive and time consuming), and some members of the population may be inaccessible. So rather than studying entire populations, a sample of observations from that population is almost always studied.

The task of inferential statistics is to learn about population parameters, based on what is learnt from studying a small subset of that population: the sample.

Definition 12.1 (Sample) A subset of observations from the population is called a sample.

This definition is very broad: any subset from the population is a sample. Importantly, how the sample is drawn determines its statistical properties; the most important type of sample for statistical inference is the random sample.

Definition 12.2 (Random sample) The random variables $X_1, X_2, \dots, X_n$ are a random sample from distribution $D$ if and only if they are independently and identically distributed (iid) with distribution function $F_X(x)$. We write $X_1, X_2, \dots, X_n \overset{\text{iid}}{\sim} D$.

The definition uses the distribution function and so applies to discrete, continuous and mixed random variables. Because the elements of the random sample are assumed to be independent, results from previous chapters concerning independent random variables apply directly.

Once the sample is obtained, the sample data can be used to compute the value of some statistic. A statistic is a function of the sample values.

Definition 12.3 (Statistic) A statistic is any function $G = g(X_1, X_2, \ldots, X_n)$ that is computable from the sample alone; that is, it does not depend on any unknown population parameters.

Since the sample values $X_1, \dots, X_n$ are random variables, any statistic $G = g(X_1, \dots, X_n)$ is itself a random variable, with its own probability distribution called its sampling distribution.

Recall that a parameter is a numerical characteristic of a population or its distribution (Def. 4.7). In statistical inference, parameters are typically unknown, and it is the goal of estimation to infer their values from sample data.

Since the individuals chosen to be in the sample can vary from sample to sample, the value of a statistic can vary from sample to sample.

Example 12.2 (Statistic) For a random sample $X_1, X_2, \dots, X_n$, define the statistic $R$ as \[ R = \max(X_1, X_2, \dots, X_n) - \min(X_1, X_2, \dots, X_n). \] $R$ is the sample range.
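A minimal R sketch of this statistic follows. The helper name `sample_range`, the sample size and the seed are illustrative assumptions; the sample is drawn from the height model of Example 12.1.

```r
set.seed(2)                                  # Arbitrary seed (assumption)
x <- rnorm(10, mean = 173, sd = 3)           # One hypothetical sample of size n = 10

sample_range <- function(x) max(x) - min(x)  # The statistic R (hypothetical helper)

r_obs <- sample_range(x)                     # Observed value of R for this sample
r_obs
all.equal(r_obs, diff(range(x)))             # Same value using base R's range()
```

The function uses only the sample values, and so is a statistic in the sense of Def. 12.3.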

A special type of statistic is an estimator.

Definition 12.4 (Estimator) An estimator $\hat{\theta} = g(X_1, X_2, \dots, X_n)$ is a statistic used to estimate an unknown population parameter $\theta$. As a function of random variables, an estimator is a random variable.

Definition 12.5 (Estimate) An estimate $\hat{\theta}(x_1, x_2, \dots, x_n)$ is the numerical value taken by an estimator upon observing a realisation $x_1, x_2, \dots, x_n$ of the sample.

For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an estimator of the population mean $\mu$. Once data are observed, the computed value $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is the corresponding estimate.

The study of estimators and their properties—including point estimation, interval estimation, and hypothesis testing—is called statistical inference.

Example 12.3 (The sample median) Suppose we take a sample of $n = 12$ Australian adult females, and measure their heights to the nearest centimetre. We could define the statistic \[ M = \text{median}(X_1, X_2, \dots, X_{12}). \] $M$ is a statistic, as it only depends on the sample data. Since $M$ is likely to be different when computed from another sample of $12$ Australian adult females, $M$ is a random variable.

The purpose of computing the sample median is to estimate the unknown median value in the population, so $M$ is an estimator of the population median.

If the median height in the sample is $169\,\text{cm}$, then \[ m = 169 \] is an estimate of the unknown population median. Notice that a capital $M$ is used for the estimator, and a lowercase $m$ for the realised value.
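The distinction between the estimator $M$ and an estimate $m$ can be sketched in R. Here `draw_sample()` is a hypothetical helper that simulates $12$ rounded heights from the model of Example 12.1; the seed is arbitrary.

```r
set.seed(3)                                  # Arbitrary seed (assumption)

# Hypothetical helper: one sample of n heights, rounded to the nearest cm
draw_sample <- function(n = 12) round(rnorm(n, mean = 173, sd = 3))

m1 <- median(draw_sample())                  # Estimate from a first sample
m2 <- median(draw_sample())                  # Estimate from a second sample
c(m1, m2)
```

The rule 'compute the sample median' plays the role of the estimator; `m1` and `m2` are two estimates, which need not be equal.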

While $M$ is an estimator of the population median, we have not studied whether it is an estimator with desirable properties for estimating the population median (or even what properties would be considered desirable).

12.4 Sampling distributions

12.4.1 Sampling distributions: small population example

Since a statistic is a function of the random variables $X_1, X_2, \ldots, X_n$, statistics are random variables. This means that statistics have a distribution that describes how they vary.

Definition 12.6 (Sampling distribution of a statistic) The sampling distribution of a statistic is the theoretical probability distribution of the values taken by the statistic across all possible samples of a particular size drawn from a particular population.

Typically, a population will be very large, and listing all elements of the population is impossible. However, for demonstration purposes, suppose the population contains only $N = 5$ discrete elements: \[ 22\qquad 23\qquad 27\qquad 28\qquad 38. \] The median value in the population (a parameter) is $27$. The sample median is an estimator of this parameter.

For each possible sample of size $n < 5$ (chosen without replacement), we can compute the sample median. Since the population is (artificially) small in this scenario, the exact sampling distribution can be determined by enumerating every possible sample of any given size $n$.

Using R, the population can be defined:

Population <- c(22, 23, 27, 28, 38)      # ALL values in the population
median(Population)                       # Find the median of the population
#> [1] 27

Suppose we consider samples of size $n = 2$; there are $\binom{5}{2} = 10$ such samples:

combn(Population, 2)                     # List all samples
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]   22   22   22   22   23   23   23   27   27    28
#> [2,]   23   27   28   38   27   28   38   28   38    38

The median of each sample is easily computed also:

median_2 <- combn(Population, 2, FUN = median)  # Medians of every sample
median_2
#>  [1] 22.5 24.5 25.0 30.0 25.0 25.5 30.5 27.5 32.5 33.0

The sampling distribution is the distribution of these sample medians:

table(median_2)                          # Tabulate the sample medians
#> median_2
#> 22.5 24.5   25 25.5 27.5   30 30.5 32.5   33 
#>    1    1    2    1    1    1    1    1    1

Since the population median is $27$, the sample medians are often a poor estimate of the population median (the sample medians range from $22.5$ to $33$). Suppose larger samples were taken; say, samples of size $n = 4$:

median_S_4 <- combn(Population, 4, FUN = median)
table(median_S_4)
#> median_S_4
#>   25 25.5 27.5 
#>    2    1    2

So the sample median when $n = 4$ is often a very good estimate of the population median (the sample medians range from $25$ to $27.5$). The larger sample size produces a better estimate, in general, and the value of the statistic varies less from sample to sample.
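The same enumeration can be repeated for samples of size $n = 3$, and the spread of the sample medians compared across sample sizes. This is a sketch using the same small population; the names `median_3` and `spread` are introduced here for illustration only.

```r
Population <- c(22, 23, 27, 28, 38)              # The small population used above
median_3 <- combn(Population, 3, FUN = median)   # Medians of all 10 samples of size 3
table(median_3)                                  # Sampling distribution when n = 3

# Standard deviation of the sample medians for n = 2, 3 and 4
spread <- sapply(2:4, function(n) sd(combn(Population, n, FUN = median)))
spread
```

For this population, the standard deviation of the sample medians decreases as $n$ increases from $2$ to $4$.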

12.4.2 Sampling distributions: large populations

For larger populations, enumerating every possible sample is infeasible, and even obtaining the elements selected for a sample is often impractical. Nonetheless, the sampling distribution of a statistic can be approximated by simulation using R. These sampling distributions, generated from simulated data, are called empirical distributions; that is, they are based on sample data, not theoretical ideals.

Example 12.4 (Empirical sampling distributions) Consider taking a single random sample (with replacement) of $n = 5$ observations from a $N(\mu = 173, \sigma^2 = 9)$ distribution, and computing the median value in the sample:

x_Mn <- 173; x_Sd <- 3        # Population (model) mean and std dev.
sample_Size <- 5              # Sample size is 5
x1 <- rnorm(sample_Size,      # Take a sample of this size...
            mean = x_Mn, sd = x_Sd)     # ... from this distribution
median(x1)                    # Compute the sample median for this sample
#> [1] 172.609

This process can be repeated numerous times in R to simulate selecting many samples of size $n=5$. The median can be computed for each sample (Fig. 12.1, left panel):

num_Sims <- 100000            # Number of simulations

# Generate random values, and place in a suitably-sized array
Xsample_5 <- array(
  rnorm(num_Sims * sample_Size, mean = x_Mn, sd = x_Sd), # The random values
  dim = c(sample_Size, num_Sims)                         # The array size
)

Xmedian_5 <- apply(Xsample_5,     # For this data...
                   MARGIN = 2,    # ... and for each column...
                   FUN = median)  # ... compute the median
hist(Xmedian_5)                   # Plot the histogram of sample medians

Suppose we take larger samples, and find the sampling distribution again (Fig. 12.1, right panel):

sample_Size_100 <- 100

# Generate random values, and place in a suitably-sized array
Xsample_100 <- array(
  rnorm(num_Sims * sample_Size_100, mean = x_Mn, sd = x_Sd),
  dim = c(sample_Size_100, num_Sims) 
)

# Find the median of each sample (i.e., column)
Xmedian_100 <- apply(Xsample_100,   # For this data...
                  MARGIN = 2,       # ... and for each column...
                  FUN = median)     # ... compute the median
hist(Xmedian_100)                   # Plot the histogram of sample medians

FIGURE 12.1: Sampling distributions for the sample medians, for samples of size $5$ (left) and size $100$ (right).

These histograms display the sampling distributions of the sample median for samples of size $n = 5$ and $n = 100$. Again, the value of the statistic varies less from sample to sample for the larger sample size.
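The reduction in variation can also be checked numerically rather than by eye. The following is a self-contained sketch; the helper `sim_median()`, the number of replications `n_rep` and the seed are assumptions made for illustration.

```r
set.seed(4)                      # Arbitrary seed (assumption)
n_rep <- 10000                   # Number of simulated samples (assumption)

# Hypothetical helper: simulate n_rep sample medians for samples of size n
sim_median <- function(n) {
  replicate(n_rep, median(rnorm(n, mean = 173, sd = 3)))
}

sd_5   <- sd(sim_median(5))      # Spread of the sample medians when n = 5
sd_100 <- sd(sim_median(100))    # Spread of the sample medians when n = 100
c(n5 = sd_5, n100 = sd_100)
```

The standard deviation of the simulated medians should be noticeably smaller for $n = 100$ than for $n = 5$.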

12.5 Estimating population parameters

12.5.1 Estimating the population mean

One of the most common reasons for taking a sample is to estimate the population mean $\mu$. For a population of $N$ values $X_1, X_2, \dots, X_N$, the population mean is $\mu = \operatorname{E}[X] = \frac{1}{N}\sum_i X_i$, as noted in Sect. 12.2. The natural estimator of $\mu$ is the sample mean $\overline{X}$, defined as \[ \overline{X} = \frac{1}{n}\sum_{i = 1}^n X_i. \]

We can use R to simulate the selection of many samples of size $n = 5$, and many samples of size $n = 100$, and compute the mean of each sample (Fig. 12.2), using the same samples created in Sect. 12.4.2:

# Mean: n = 5
Xmean   <- apply(Xsample_5, 
                 MARGIN = 2, 
                 FUN = mean) 
hist(Xmean)              # Create the histogram of sample means

# Mean: n = 100
Xmean_100   <- apply(Xsample_100, 
                     MARGIN = 2, 
                     FUN = mean) 
hist(Xmean_100)          # Create the histogram of sample means

FIGURE 12.2: Sampling distributions for the sample mean for samples of size $n = 5$ (left) and $n = 100$ (right). The solid dots represent the population mean, and the cross is the mean of the sample estimates.

The sample means vary around the true population mean $\mu$ (some are smaller than $\mu$; some are larger), but are centred on $\mu$. Larger samples produce estimates that are more tightly concentrated around $\mu$, reflecting the smaller variance of $\overline{X}$ for larger values of $n$.

12.5.2 Unbiased estimators

The sample means $\bar{x}$ are centred on $\mu$: some estimates fall below $\mu$ and some above, but on average the value of the estimator is equal to the parameter it estimates. An estimator with this property is called unbiased.

Definition 12.7 (Unbiased estimator) A statistic $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$ if \[ \operatorname{E}[\hat{\theta}] = \theta. \]

When we say that $\overline{X}$ is an unbiased estimator of $\mu$, we mean that $\operatorname{E}[\overline{X}] = \mu$.

Theorem 12.1 (An unbiased estimator for $\mu$) If $X_1, X_2, \dots, X_n$ are iid with mean $\mu$, then the sample mean \[ \overline{X} = \frac{1}{n} \sum_{i = 1}^n X_i \] is an unbiased estimator of the population mean $\mu$.

Proof. Using the definition of an unbiased estimator and the properties of expectations: \[\begin{align*} \operatorname{E}[\overline{X}] &= \operatorname{E}\left[ \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \right]\\ &= \frac{1}{n} \operatorname{E}[ X_1 + X_2 + \cdots + X_n ]\\ &= \frac{1}{n} \left( \operatorname{E}[ X_1 ] + \operatorname{E}[X_2] + \cdots + \operatorname{E}[X_n] \right) \\ &= \frac{1}{n} (n\times \mu) = \mu. \end{align*}\] That is, $\operatorname{E}[\overline{X}] = \mu$, and so $\overline{X}$ is an unbiased estimator of $\mu$. Notice that no assumption has been made about the distribution of $X$.

Example 12.5 (An unbiased estimator) Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \operatorname{Poisson}(\lambda)$, and consider the estimator $\overline{X}$. Because the expectation is a linear operator, and because $\operatorname{E}[X_i] = \lambda$, \[ \operatorname{E}[\overline{X}] = \operatorname{E}\!\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \operatorname{E}[X_i] = \frac{1}{n} \cdot n\lambda = \lambda. \] Hence $\overline{X}$ is an unbiased estimator of $\lambda$ for a Poisson distribution.
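This result can be checked informally by simulation. In the sketch below, the values $\lambda = 4$ and $n = 10$, the number of replications and the seed are arbitrary choices made for illustration, not values from the example.

```r
set.seed(5)                       # Arbitrary seed (assumption)
lambda_true <- 4                  # Hypothetical Poisson mean (assumption)
n_obs <- 10                       # Hypothetical sample size (assumption)
n_rep <- 20000                    # Number of simulated samples (assumption)

# Sample mean of each simulated Poisson sample
xbar_pois <- replicate(n_rep, mean(rpois(n_obs, lambda = lambda_true)))

mean_xbar <- mean(xbar_pois)      # Should be close to lambda_true
mean_xbar
```

The average of the simulated sample means should be close to $\lambda$, as expected for an unbiased estimator.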

Not all estimators are unbiased estimators; even estimators that appear to be natural estimators of a parameter may not be unbiased estimators.

Example 12.6 (A biased estimator) Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \operatorname{Exponential}(\lambda)$, where $\lambda > 0$ is the rate parameter, so that $\operatorname{E}[X_i] = 1/\lambda$ (Sect. 7.4.2). Consider the natural estimator of $\lambda$: that is, $1/\overline{X}$.

Since the $X_i$ are iid $\operatorname{Exponential}(\lambda)$, the sample mean $\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i$ has a gamma distribution (see Example 7.12): \[ \overline{X} \sim \operatorname{Gamma}(\alpha = n,\, \text{rate} = n\lambda), \] where we use the shape–rate parameterisation. Then $\operatorname{E}[\overline{X}] = n/(n\lambda) = 1/\lambda$ and $\operatorname{var}[\overline{X}] = n/(n\lambda)^2 = 1/(n\lambda^2)$.

$\operatorname{E}[1/\overline{X}]$ can be computed directly: \[ \operatorname{E}\!\left[\frac{1}{\overline{X}}\right] = \int_0^\infty \frac{1}{x} \cdot \frac{(n\lambda)^n}{\Gamma(n)}\,x^{n-1} \exp(-n\lambda x)\,dx = \frac{(n\lambda)^n}{\Gamma(n)} \int_0^\infty x^{n-2} \exp(-n\lambda x)\,dx. \] The integral is related to the probability density function of a gamma distribution (Def. 7.9) with shape $n - 1$ and rate $n\lambda$, and so \[ \int_0^\infty x^{n - 2} \exp(-n\lambda x)\,dx = \frac{\Gamma(n - 1)}{(n\lambda)^{n - 1}}, \] valid for $n > 1$. Substituting then, \[ \operatorname{E}\!\left[\frac{1}{\overline{X}}\right] = \frac{(n\lambda)^n}{\Gamma(n)} \cdot \frac{\Gamma(n - 1)}{(n\lambda)^{n - 1}} = n\lambda \cdot \frac{\Gamma(n - 1)}{\Gamma(n)} = n\lambda \cdot \frac{1}{n-1} = \frac{n\lambda}{n - 1}, \] using $\Gamma(n) = (n - 1)\,\Gamma(n-1)$. Hence \[ \operatorname{E}\!\left[\frac{1}{\overline{X}}\right] = \frac{n}{n - 1}\,\lambda = \lambda + \frac{\lambda}{n - 1}, \] and so the bias is $\lambda/(n - 1) > 0$: the estimator $1/\overline{X}$ systematically overestimates the value of $\lambda$.

Thus, $1/\overline{X}$ is a biased estimator of $\lambda$. Note also that the bias $\lambda/(n - 1) \to 0$ as $n \to \infty$, so $1/\overline{X}$ is asymptotically unbiased.

A simulation (Fig. 12.3) using $\lambda = 2$ shows that, for samples of size $n = 3$, the mean value of $1/\overline{x}$ is $2.93$, a poor estimate of $\lambda = 2$. The theoretical expected value is \[ \frac{n\lambda}{n - 1} = \frac{3\times 2}{3 - 1} = 3, \] which is close to the value in the simulation.

FIGURE 12.3: The sampling distribution of $1/\overline{X}$ shows that this is a biased estimator of $\lambda$. The solid dot shows the value of $\lambda$, and the cross is the mean of the values of $1/\overline{X}$.
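The code behind this simulation is not shown above; one way such a simulation might be written is sketched below. The number of replications and the seed are assumptions, so the simulated mean will not match the value reported above exactly.

```r
set.seed(6)                        # Arbitrary seed (assumption)
rate_true <- 2                     # The rate parameter, lambda = 2
n_obs <- 3                         # Sample size, n = 3
n_rep <- 100000                    # Number of simulated samples (assumption)

# Value of the estimator 1 / xbar for each simulated exponential sample
inv_xbar <- replicate(n_rep, 1 / mean(rexp(n_obs, rate = rate_true)))

sim_mean  <- mean(inv_xbar)                    # Simulated mean of the estimator
theo_mean <- n_obs * rate_true / (n_obs - 1)   # Theoretical value n * lambda / (n - 1)
c(simulated = sim_mean, theoretical = theo_mean)
```

The simulated mean should be well above $\lambda = 2$ and near the theoretical expected value, consistent with the upward bias of $1/\overline{X}$.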

12.5.3 Estimating the population variance

In the same way that a population mean can be defined for a population of $N$ values $X_1, X_2, \dots, X_N$, the population variance is \[ \operatorname{var}[X] = \sigma^2 = \frac{1}{N}\sum_{i = 1}^N (x_i - \mu)^2. \]

Theorem 12.2 (An unbiased estimator for $\sigma^2$) The sample variance \[\begin{equation} S^2_\mu = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \tag{12.1} \end{equation}\] is an unbiased estimator of the population variance $\sigma^2$.

Proof. See Exercise 12.20. Notice that no assumption has been made about the distribution of $X$.

Using the samples found earlier (Sect. 12.4.2), the empirical estimates $S^2_\mu$ can be found for samples of size $n = 5$ and $n = 100$ (Fig. 12.4), where $\mu = 173$:

s2mu <- function(x) { sum((x - x_Mn)^2) / length(x) }  # Function to find S2_mu
Xvar_5 <- apply(Xsample_5,                             # Apply fn to each col
                MARGIN = 2,
                FUN = s2mu)
hist(Xvar_5) 
mean(Xvar_5)
#> [1] 9.069085

Xvar_100 <- apply(Xsample_100,
                  MARGIN = 2,
                  FUN = s2mu)
hist(Xvar_100) 
mean(Xvar_100)
#> [1] 9.008441

FIGURE 12.4: Sampling distributions for the sample variance for samples of size $n = 5$ (left) and $n = 100$ (right), using the unbiased estimator $S^2_\mu$. The solid dots represent the population variance, and the cross is the mean of the empirical values.

The average value of the estimates, from the simulations, appears close to $\sigma^2 = 9$. Again the larger sample size produces estimates with less variation compared to the estimates from the smaller sample size.

While $S^2_\mu$ is an unbiased estimator of $\sigma^2$, Eq. (12.1) has a practical difficulty: its computation requires the population mean $\mu$, which is unknown in practice. This suggests using the estimator \[ S^2_n = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \] to estimate the population variance. To determine if this is an unbiased estimator, find $\operatorname{E}[S^2_n]$: \[\begin{align*} \operatorname{E}[S^2_n] &= \operatorname{E}\left[ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \right]\\ &= \frac{1}{n}\operatorname{E}\left[ \sum_{i=1}^n \left(X^2_i - 2X_i\overline{X} + \overline{X}^2\right)\right]\\ &= \frac{1}{n}\operatorname{E}\left[ \sum_{i=1}^n X^2_i - \overline{X}\sum_{i=1}^n 2X_i + \sum_{i=1}^n \overline{X}^2\right]. \end{align*}\] Since $\overline{X} = \sum_i X_i/n$, then $\sum_i X_i = n \overline{X}$; hence \[\begin{align*} \operatorname{E}[S^2_n] &= \frac{1}{n}\operatorname{E}\left[ \sum_{i=1}^n X^2_i - 2n\overline{X}^2 + n \overline{X}^2\right]\\ &= \frac{1}{n}\operatorname{E}\left[ \sum_{i=1}^n X^2_i - n \overline{X}^2 \right]\\ &= \frac{1}{n}\left(\operatorname{E}\left[ \sum_{i=1}^n X^2_i\right] - n\operatorname{E}\left[ \overline{X}^2 \right]\right). \end{align*}\] This expression can be simplified by noting that \[ \operatorname{var}[\overline{X}] = \operatorname{E}[\overline{X}^2] - \operatorname{E}[\overline{X}]^2 = \operatorname{E}[\overline{X}^2] - \mu^2, \] and so $\operatorname{E}[\overline{X}^2] = \operatorname{var}[\overline{X}] + \mu^2 = \sigma^2/n + \mu^2$. Also, \[ \operatorname{var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2 = \operatorname{E}[X^2] - \mu^2, \] so that $\operatorname{E}[X^2] = \sigma^2 + \mu^2$. Hence, \[\begin{align*} \operatorname{E}[S^2_n] &= \frac{1}{n}\left[ \sum_{i=1}^n (\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right]\\ &= \frac{1}{n}\left[ n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2\right]\\ &= \frac{n - 1}{n} \sigma^2. \end{align*}\] This shows that $\operatorname{E}[S^2_n] \ne\sigma^2$, and so the estimator is a biased estimator of $\sigma^2$. Again, the estimator is asymptotically unbiased, as $\operatorname{E}[S^2_n] \to\sigma^2$ as $n\to\infty$. Because $\overline{X}$ is computed from the same observations, the $X_i$ lie closer, on average, to $\overline{X}$ than to $\mu$ (the sum of squared deviations is smallest when taken about $\overline{X}$), causing $S^2_n$ to systematically underestimate the value of $\sigma^2$.

Using the samples found earlier, the empirical estimates $S^2_n$ can be found for samples of size $n = 5$ and $n = 100$ (Fig. 12.5, top panels). The values of $\operatorname{E}[S^2_n]$ for $n = 5$ and $n = 100$ are \[ \operatorname{E}[S^2_n] = \frac{5 - 1}{5}\times 3^2 = 7.2 \quad\text{and}\quad \operatorname{E}[S^2_n] = \frac{100 - 1}{100}\times 3^2 = 8.91 \] respectively, similar to the means computed from the R simulation. Both suggest that the estimator is biased.

FIGURE 12.5: Sampling distributions for the sample variance for samples of size $n = 5$ (left panels) and $n = 100$ (right panels), using the biased estimator $S^2_n$ (top panels) and the unbiased estimator $S^2$ (bottom panels). The solid dots represent the population variance, and the cross the mean of the sample estimates.

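To obtain an unbiased estimator, we could multiply $S^2_n$ by $n/(n - 1)$ (which is a constant): since $\operatorname{E}[S^2_n] = \sigma^2(n - 1)/n$, the rescaled estimator has expected value $\frac{n}{n - 1}\times\frac{n - 1}{n}\,\sigma^2 = \sigma^2$, and so is unbiased.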

Theorem 12.3 (An unbiased estimator for $\sigma^2$ ($\mu$ unknown)) When the value of $\mu$ is unknown, an unbiased estimator of $\sigma^2$ is \[\begin{equation} S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \overline{X})^2. \tag{12.2} \end{equation}\] Notice that the divisor in front of the summation is $n - 1$ rather than $n$.

Proof. Easily shown by adapting the proof for $S^2_n$ above; see Exercise 12.21.

Using the samples found earlier, the empirical estimates $S^2$ can be found for samples of size $n = 5$ and $n = 100$ (Fig. 12.5, bottom panels). The means of the empirical estimates suggest that the estimator is unbiased.
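The code for Fig. 12.5 is not shown; a self-contained sketch of how the two estimators might be compared for $n = 5$ is given below. The number of replications, the seed and the object names are assumptions. The sketch relies on the fact that R's `var()` uses the divisor $n - 1$.

```r
set.seed(7)                              # Arbitrary seed (assumption)
n_obs <- 5                               # Sample size
n_rep <- 100000                          # Number of simulated samples (assumption)

# One simulated sample per column, from the N(173, 9) model
X_sim <- matrix(rnorm(n_obs * n_rep, mean = 173, sd = 3), nrow = n_obs)

S2_vals  <- apply(X_sim, MARGIN = 2, FUN = var)  # var() uses the divisor n - 1
S2n_vals <- (n_obs - 1) / n_obs * S2_vals        # Rescaled to use the divisor n

c(mean_S2n = mean(S2n_vals), mean_S2 = mean(S2_vals))
```

Under these assumptions, the mean of the $S^2_n$ values should be near $(n - 1)\sigma^2/n = 7.2$, while the mean of the $S^2$ values should be near $\sigma^2 = 9$.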

12.6 The distribution of the sample mean

12.6.1 Introduction

In previous sections, unbiased estimators for $\mu$ and $\sigma^2$ were introduced and their properties studied. All estimators are random variables, and so have distributions. Previously, we used simulations to display the empirical sampling distribution of various statistics.

In this section, theoretical sampling distributions are derived (beginning with the distribution of the sample mean $\overline{X}$) which form the foundation for statistical inference.

12.6.2 The mean and variance of $\overline{X}$

Usually the sample mean $\overline{X}$ is of greatest interest when sampling. Figure 12.2 suggests that larger sample sizes produce sampling distributions with smaller variances. The following theorem quantifies how the variance of $\overline{X}$ decreases with sample size.

Theorem 12.4 (Sampling distribution of the mean) If $X_1, X_2, \dots, X_n$ is a random sample of size $n$ from a population with mean $\operatorname{E}[X] = \mu$ and variance $\operatorname{var}[X] = \sigma^2$, then the sample mean $\overline{X}$ has a sampling distribution with mean $\mu$ and variance $\sigma^2 / n$.

Proof. To show that $\operatorname{E}[\overline{X}] = \mu$, apply Theorem 12.1. Similarly, using the properties of the variance, the variance of $\overline{X}$ is: \[\begin{align*} \operatorname{var}[\overline{X}] &= \operatorname{var}[(X_1 + X_2 + \cdots + X_n)/n]\\ &= \frac{1}{n^2} \operatorname{var}[X_1 + X_2 + \cdots + X_n]\\ &= \frac{1}{n^2} \times n\sigma^2\quad\text{(since the $X_i$ are iid)}\\ &= \frac{\sigma^2}{n}. \end{align*}\] We sometimes write $\sigma^2_{\overline{X}} = \operatorname{var}[\overline{X}]$. The value $\sigma_{\overline{X}} = \operatorname{sd}[\overline{X}]$ is called the standard error of the mean.

Theorem 12.4 applies to random samples as defined in Sect. 12.3, since independence and identical distributions are guaranteed by definition.

Theorem 12.4 shows that the expected value of $\overline{X}$ does not depend on the size of the sample, but the variance of $\overline{X}$ decreases as the sample size increases: larger samples tend to produce more precise estimates of $\mu$. This makes sense intuitively; after all, one reason for preferring larger samples over smaller samples is that they produce more precise estimates.

Importantly, Theorem 12.4 shows how the variance of $\overline{X}$ decreases as $n$ gets larger: inversely proportional to $n$. A sample size twice as large reduces the variance by half or, more intuitively, reduces the standard deviation by a factor of $\sqrt{2}$. In practice this means that to double the precision of $\overline{X}$ (i.e., to halve the size of the standard deviation), the sample size needs to be increased four-fold.
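The arithmetic can be made concrete with a short calculation. The value $\sigma = 3$ is taken from the height model used earlier; the sample sizes and object names are arbitrary choices for illustration.

```r
pop_sd <- 3                              # Population standard deviation (sigma = 3)
n_vals <- c(25, 50, 100)                 # Hypothetical sample sizes (assumption)

var_xbar <- pop_sd^2 / n_vals            # var[Xbar] = sigma^2 / n
se_xbar  <- pop_sd / sqrt(n_vals)        # Standard error of the mean

data.frame(n = n_vals, variance = var_xbar, std_error = se_xbar)

se_xbar[1] / se_xbar[2]                  # Doubling n: ratio is sqrt(2)
se_xbar[1] / se_xbar[3]                  # Quadrupling n: ratio is 2
```

Going from $n = 25$ to $n = 100$ halves the standard error (from $0.6$ to $0.3$), whereas going from $n = 25$ to $n = 50$ only reduces it by a factor of $\sqrt{2}$.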

Another important feature of Theorem 12.4 is that the results do not rely on knowing the distribution of the population $X$. The result applies for random samples drawn from any distribution, provided the $X_i$ are iid. In Sect. 12.6.3, we consider the shape of the sampling distribution when the population specifically has a normal distribution; then in Sect. 12.6.4 we consider the shape of the sampling distribution when the population has any distribution.

In Theorem 12.4, the mean and variance of $\overline{X}$ depend only on $\mu$ and $\sigma^2$, not on the shape of the population distribution.

Example 12.7 (Sampling mean and variance from an exponential distribution) Suppose $X_1, \dots, X_n \overset{\text{iid}}{\sim}\text{Exponential}(\lambda = 2)$.

Hence, $\operatorname{E}[X_i] = 1/2 = 0.5$, and $\operatorname{var}[X_i] = 1/\lambda^2 = 0.25$, so $\operatorname{E}[\overline{X}] = 0.5$, and $\operatorname{var}[\overline{X}] = 0.25/n$.

The empirical distribution of the means of the samples is shown in Fig. 12.6, for samples of size $n = 5$, $n = 20$ and $n = 100$. The empirical estimates of $\operatorname{E}[\overline{X}]$ and $\operatorname{var}[\overline{X}]$ (shown on the figure) are very close to the theoretical values.

set.seed(123)

ns <- c(5, 20, 100)
lambda <- 2
num_Sims <- 10000  # number of simulations

sim_means <- lapply(ns, function(n) {
  replicate(num_Sims, 
            mean(rexp(n, rate = lambda)))
})

hist(sim_means[[1]]) 
hist(sim_means[[2]])
hist(sim_means[[3]])

# The variance when n = 5 should be about twenty times larger
# than the variance for n = 100:
var(sim_means[[1]]) / var(sim_means[[3]])
#> [1] 20.23392

FIGURE 12.6: Sampling distributions, for samples drawn from an exponential distribution with $\lambda = 2$, for various sample sizes. The solid dots represent the population mean $1/\lambda = 0.5$.

12.6.3 Sampling distribution of $\overline{X}$: when $X$ has a normal distribution

The normal distribution is used to model many naturally occurring phenomena (Sect. 7.3). Consequently, we begin by studying the sampling distributions of statistics based on data drawn from normal populations.

The following result establishes that any linear combination of independent normal random variables is itself normally distributed.

Theorem 12.5 (Linear combinations) Let $X_1, X_2, \dots, X_n$ be independent random variables with $X_i\sim N(\mu_i,\sigma^2_i)$ for $i = 1, 2, \dots, n$, and let $a_1, a_2, \dots, a_n$ be constants. Define the linear combination $Y$ as \[ Y = a_1 X_1 + a_2 X_2 + \dots + a_nX_n. \] Then $Y \sim N(\sum_i a_i\mu_i, \sum_i a^2_i\sigma^2_i)$.

Proof. From Theorem 7.2, the MGF of the random variable $X_i$ is \[ M_{X_i} (t) = \exp\left\{\mu_i t + \frac{1}{2} \sigma^2_i t^2\right\}\quad\text{for $i = 1, 2, \dots, n$}. \] So, for a constant $a_i$ (using Theorem 5.2 with $\beta = 0$, $\alpha = a_i$): \[\begin{align*} M_{a_i X_i}(t) &= M_{X_i}(a_it)\\ &= \exp\left\{\mu_i a_i t + \frac{1}{2} \sigma^2_i a^2_i t^2\right\}. \end{align*}\] Since the MGF of a sum of independent random variables is equal to the product of their MGFs, \[\begin{align*} M_Y(t) &= \prod^n_{i = 1} \exp\left( \mu_i a_i t + \frac{1}{2} \sigma^2_i a^2_i t^2 \right)\\ &= \exp\left( t\Sigma a_i\mu_i +\frac{1}{2} t^2 \Sigma a^2_i \sigma^2_i \right). \end{align*}\] This is the MGF of a normal random variable with mean $\operatorname{E}[Y] = \Sigma a_i\mu_i$ and variance $\operatorname{var}[Y] = \Sigma a^2_i\sigma^2_i$.

Example 12.8 (Linear combinations) Define the independent random variables \[ X_1 \sim N(2, \sigma^2 = 5) \quad\text{and}\quad X_2 \sim N(25, \sigma^2 = 1) \] and the constants $a_1 = -8$ and $a_2 = 10$. Then, by Theorem 12.5, $Y = a_1 X_1 + a_2 X_2$ has a normal distribution, with \[\begin{align*} \operatorname{E}[Y] &= a_1\mu_1 + a_2\mu_2 = (-8\times 2) + (10\times 25) = 234;\\ \text{and}\quad \operatorname{var}[Y] &= a^2_1\sigma^2_1 + a_2^2\sigma^2_2 = ((-8)^2\times 5) + (10^2\times 1) = 420. \end{align*}\] This can be demonstrated in R (below); the simulated distribution of $Y$ is in Fig. 12.7.

set.seed(8979704) # For reproducibility

a_1 <- -8; a_2 <- 10
num_Sims <- 10000

X_1 <- rnorm(num_Sims,                 # This many simulations from a Normal...
             mean = 2, sd = sqrt(5) )  # ... with these parameters
X_2 <- rnorm(num_Sims,
             mean = 25, sd = sqrt(1) )

Y <- a_1 * X_1 + a_2 * X_2             # The linear combinations
hist(Y)                                # Produce the histogram

mean(Y); var(Y)
#> [1] 233.8263
#> [1] 423.8184


FIGURE 12.7: Left: The normal distribution of $X_1$. Centre: the normal distribution of $X_2$. Right: the empirical distribution of a linear combination of the two normal random variates is a normal distribution. The solid dot represents the theoretical mean of $Y$, and the solid curve is the theoretical distribution of $Y$.

The sampling distribution of the sum and mean of a random sample from a normal population follows directly from Theorem 12.5.

Theorem 12.6 (Sum and mean of a random sample) Let $X_1, X_2, \dots, X_n \overset{\text{iid}}{\sim} N(\mu,\sigma^2)$. Define the sum $T_n$ and mean $\overline{X}$ respectively as \[\begin{align*} T_n &= X_1 + X_2 + \dots + X_n\\ \overline{X} &= (X_1+X_2 + \dots + X_n)/n. \end{align*}\] Then $T_n\sim N(n\mu, n\sigma^2)$ and $\overline{X}\sim N(\mu, \sigma^2/n)$.

Proof. See Exercise 12.22.

Theorem 12.6 holds for any sample size $n$, provided the population has a normal distribution.

For any size sample drawn from a normal distribution, the sample means have a normal distribution.
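Theorem 12.6 can be illustrated with a short simulation. The following is a minimal sketch only: the sample size $n = 4$, the parameters $\mu = 10$ and $\sigma = 3$, the seed, and the variable names are all illustrative assumptions, not values taken from the text.

```r
set.seed(1)                      # Arbitrary seed, for reproducibility

n <- 4; mu <- 10; sigma <- 3     # Illustrative values (assumed)
num_Sims <- 10000

# Each column holds one sample of size n from N(mu, sigma^2)
samples <- matrix(rnorm(n * num_Sims, mean = mu, sd = sigma),
                  nrow = n)
T_n  <- colSums(samples)         # Sample sums
xbar <- colMeans(samples)        # Sample means

# Empirical moments alongside the theoretical values
c(mean(T_n),  n * mu);  c(var(T_n),  n * sigma^2)
c(mean(xbar), mu);      c(var(xbar), sigma^2 / n)
```

Even with a sample size as small as $n = 4$, the empirical means and variances should be close to the theoretical values, since the result is exact for normal populations.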

Example 12.9 (Sums of rvs) Sugar sachets (for sweetening hot drinks) are typically manufactured so that the weights of sugar (in grams) in each sachet $W$ are iid with a normal distribution having a mean of $4\,\text{g}$ and standard deviation of $0.1\,\text{g}$.

For catering purposes, sachets are sold in cartons of $1000$. What is the probability that a carton contains more than $4020\,\text{g}$ of sugar?

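Since the weights $W$ follow a normal distribution, the weight of the sugar in each carton is \[ T = W_1 + W_2 + \cdots + W_{1000} \sim N(1000\times 4, \sigma^2=1000 \times 0.1^2), \] or $N(4000, 10)$. The standard deviation of $T$ is therefore $\sqrt{10}\approx 3.162\,\text{g}$. Then, we seek $\Pr(T > 4020)$: \[ \Pr(T > 4020) = \Pr\left(Z > \frac{4020 - 4000}{\sqrt{10}}\right) = \Pr(Z > 6.32) = 1 - \Phi(6.32). \] In R:

pnorm(20 / sqrt(10), lower.tail = FALSE)  # Upper-tail probability

The probability is about $1.3\times 10^{-10}$: a carton is extremely unlikely to contain more than $4020\,\text{g}$ of sugar.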

Example 12.10 (CLT) The IQs for a large population of $10$ year-old boys are assumed to be normally distributed, with a mean of $110$ and a variance of $144$. How large a sample is needed to have a probability of $0.9$ that the mean IQ of the sample would not differ from the expected value $110$ by more than $5$?

Let $X_i$ be the IQ of the $i$th boy; then $X_i \sim N(110, 144)$. Consider a sample of size $n$ and let $\overline{X} = \sum^n_{i = 1}X_i/n$; then $\overline{X}\sim N(110, 144/n)$. The question asks to find the value of $n$ such that \[ \Pr(|\overline{X} - 110|\leq 5) = 0.90. \] That is, \[ \Pr\left(\frac{|\overline{X} - 110|}{12/\sqrt{n}} \leq \frac{5}{12/\sqrt{n}}\right) = 0.90 \] hence \[ \Pr(|Z| \leq 5\sqrt{n} /12) = 0.90, \] which, by the symmetry of the standard normal distribution, is equivalent to $\Pr(Z \leq 5\sqrt{n}/12) = 0.95$. Using R, and taking care to use the $0.95$ quantile rather than the $0.90$ quantile (see Fig. 12.8):

z <- qnorm(0.95); z
#> [1] 1.644854

Thus, \[ 5\sqrt{n}/12 = 1.644854 \implies n = (12\times 1.644854/5)^2 = 15.58. \]

The smallest size sample, then, would be a sample of $n = 16$.

FIGURE 12.8: Left: the situation in terms of $Z$. Right: finding the required value of $Z$.
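The choice $n = 16$ can be checked directly by evaluating $\Pr(|\overline{X} - 110| \le 5)$ for sample sizes near the computed value. This is a small sketch; the helper function `coverage()` is a hypothetical name introduced here for illustration.

```r
# Pr(|Xbar - 110| <= 5) when Xbar ~ N(110, 144/n)
coverage <- function(n) {
  2 * pnorm(5 * sqrt(n) / 12) - 1
}

n_values <- 14:17
data.frame(n = n_values, prob = coverage(n_values))

# Smallest n (among those tried) whose probability reaches 0.90
n_min <- min(n_values[coverage(n_values) >= 0.90])
n_min
```

The probability falls just short of $0.90$ for $n = 15$ and exceeds it for $n = 16$, which is why the computed value $15.58$ is rounded up.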

Example 12.11 (Two random variables) A design involves a plunger fitting into a cylindrical tube. The diameter of the plunger can be considered a normal random variable with mean $2.1\,\text{cm}$ and standard deviation $0.1\,\text{cm}$. The inside diameter of the cylindrical tube is a normal random variable with mean $2.3\,\text{cm}$ and standard deviation $0.05\,\text{cm}$. For a plunger and tube chosen randomly from a day’s production run, find the probability that the plunger will not fit into the cylinder.

Let $X$ and $Y$ be the diameter of the plunger and cylinder respectively. Then $X\sim N\big(2.1, (0.1)^2\big)$ and $Y\sim N\big( 2.3, (0.05)^2\big)$, and we seek $\Pr(Y < X)$. The distribution of $Y - X$ is $N(2.3 - 2.1, 0.0025 + 0.01)$ so that \[\begin{align*} \Pr(Y - X < 0) &= \Pr\left(Z <\frac{0 - 0.2}{\sqrt{0.0125}}\right) \quad\text{where $Z\sim N(0,1)$}\\ &= \Pr(Z < -1.78) = 0.0375. \end{align*}\] In R:

pnorm( -1.78 )
#> [1] 0.03753798
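The same probability can be found without rounding the $z$-score, by giving the mean and standard deviation of $Y - X$ directly to pnorm(). The variable names below are illustrative. The unrounded $z$-score is about $-1.789$, so the result is slightly smaller than the value obtained from $z = -1.78$.

```r
mu_D <- 2.3 - 2.1                 # Mean of Y - X
sd_D <- sqrt(0.05^2 + 0.1^2)      # Standard deviation of Y - X

p_no_fit <- pnorm(0, mean = mu_D, sd = sd_D)   # Pr(Y - X < 0)
p_no_fit
```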

12.6.4 The Central Limit Theorem

In Sect. 12.6.2, general results are given describing the mean and variance of the sample mean. These results hold for any population distribution, but say nothing about the shape of the distribution of the sample mean. Theorem 12.6 states that the distribution is normal when the population is normally distributed.

Although these results are important, usually the distribution of the population from which the random sample is drawn is unknown. Remarkably, even when the population is not normally distributed, the sampling distribution of the sample mean is still approximately normal for large sample sizes.

Theorem 12.7 (Central Limit Theorem (CLT)) Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with mean $\mu$ and finite variance $\sigma^2$. Then the random variable \[ Z_n = \frac{\overline{X} - \mu}{\sigma / \sqrt{n}} \] converges in distribution to a standard normal variable as $n\to\infty$. ‘Convergence in distribution’ means that the distribution function of $Z_n$ converges to the standard normal distribution function $\Phi(z)$, for every $z$, as $n\to \infty$.

Proof. The proof is long, so is deferred to Appendix 12.10.

The Central Limit Theorem (Theorem 12.7) is one of the most important theorems in statistics.

The Central Limit Theorem states that when $n$ is ‘large’, the distribution of $Z_n$ is approximately standard normal. Transforming $Z_n$ back to the sample mean, the distribution of $\overline{X}$ is approximately $N(\mu, \sigma^2/n)$.

In practice, the distribution of $\overline{X}$ is sufficiently close to that of a normal distribution when $n$ is larger than about $25$ in most situations. However, if the population distribution is severely skewed, larger sample sizes may be necessary for the approximation to be adequate in practice.

In simple terms, the Central Limit Theorem states that, for a random variable $X$, whatever its distribution, the distribution of the sample mean $\overline{X}$ becomes approximately normal as the sample size $n$ increases.

In particular, for large $n$,

the distribution of $\overline{X}$ is approximately normal,
with mean $\operatorname{E}[\overline{X}] = \mu$ where $\mu = \operatorname{E}[X]$, and
with variance $\operatorname{var}[\overline{X}] = \sigma^2/n$, where $\sigma^2 = \operatorname{var}[X]$.

The approximation generally improves as the sample size $n$ increases.

For many distributions that are not highly skewed, the approximation is reasonably good for sample sizes larger than about $20$ to $30$.

If the data come from a normal distribution, then the distribution of the sample mean is exactly normal for every sample size $n$ (Sect. 12.6.3).

To see the CLT in practice, adjust the sample size in the visualisation below. The sampling distribution of the sample mean appears approximately normal as the sample size gets larger.

FIGURE 12.9: Exact sampling distribution of the mean (blue) overlaid with the asymptotic CLT normal approximation (red dashed). The samples are drawn from a Gamma$(2, 1)$ distribution.

Example 12.12 (CLT and a Poisson distribution) In the R code below, data come from a Poisson distribution (Fig. 12.10, left panel) with $\lambda = 2$. However, the mean of samples of size $n = 10$ (centre panel) and $n = 100$ (right panel) are approximately distributed as a normal distribution. The variance of the sample means for the larger sample size is smaller than that for $n = 10$, and the distribution looks more like a normal distribution for the larger sample size.
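A minimal sketch of such a simulation follows. The seed, the number of simulations, and the variable names are illustrative assumptions, and are not necessarily the settings used to produce Fig. 12.10.

```r
set.seed(2024)                   # Arbitrary seed, for reproducibility

lambda <- 2
num_Sims <- 5000                 # Assumed number of simulated samples

# Sample means for samples of size n = 10 and n = 100
xbar_10  <- replicate(num_Sims, mean(rpois(10,  lambda)))
xbar_100 <- replicate(num_Sims, mean(rpois(100, lambda)))

# Histograms with the asymptotic N(lambda, lambda/n) density overlaid
hist(xbar_10, freq = FALSE)
curve(dnorm(x, mean = lambda, sd = sqrt(lambda / 10)),
      add = TRUE, lty = 2, lwd = 2)

hist(xbar_100, freq = FALSE)
curve(dnorm(x, mean = lambda, sd = sqrt(lambda / 100)),
      add = TRUE, lty = 2, lwd = 2)

var(xbar_10); var(xbar_100)      # Compare with lambda/10 and lambda/100
```

Since a Poisson distribution has variance $\lambda$, the two sample variances should be close to $2/10 = 0.2$ and $2/100 = 0.02$ respectively.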


FIGURE 12.10: The Central Limit Theorem. Left: the distribution of the individual observations that follow a Poisson distribution. Centre: the empirical sampling distribution of the sample mean for samples of size $n = 10$. Right: the empirical sampling distribution of the sample mean for samples of size $n = 100$. The thick dashed lines represent the theoretical asymptotic normal distribution. Note that the scale on the horizontal axis is different for each plot.

Example 12.13 (CLT) When a coffee machine makes a single espresso-style shot, the volume of water in millilitres $W$ that is dispensed has a mean of $30\,\text{mL}$ with a standard deviation of $2.2\,\text{mL}$.

Find the probability that the average volume of water in a random sample of size $36$ shots is less than $31\,\text{mL}$.

The distribution of $W$ is not given, but the mean and standard deviation of $W$ are given. According to the CLT, the distribution of the sample mean $\overline{W}$ can be approximated by the normal distribution with mean $\mu = 30$ and standard error $\sigma_{\overline{W}} = \sigma/\sqrt{n} = 2.2/\sqrt{36} = 0.3667\,\text{mL}$. That is, approximately $\overline{W} \sim N(30, 0.3667^2)$. Now \[ \Pr(\overline{W} < 31) \approx \Pr\left(Z < \frac{31 - 30}{0.3667}\right) = \Pr(Z < 2.73). \] In R:

pnorm(2.73)

Hence, $\Pr(\overline{W} < 31) \approx 0.997$, or approximately $99.7$%.

Example 12.14 (Throwing dice) Consider throwing a fair die $n$ times and observing the sum of the faces showing. For $n = 12$ rolls, find the probability that the sum of the faces is at least $52$.

Let the random variable $X_i$ be the number showing on the $i$th throw. Then define $Y = X_1 + \dots + X_{12}$, so that we seek $\Pr(Y\geq 52)$.

To use Theorem 12.7, see that ‘$Y\geq 52$’ is equivalent to ‘$\overline{X}\geq 52/12$’, where $\overline{X} = Y/12$ is the mean number showing from the $n = 12$ tosses.

Since each $X_i$ has a discrete uniform distribution with $\Pr(X_i = x) = 1/6$ (for $x = 1, 2, \dots, 6$), $\operatorname{E}[X_i] = 7/2$ and $\operatorname{var}[X_i] = 35/12$ (Sect. 6.2.2). Hence, $\operatorname{E}[\overline{X}] = 7/2$ and $\operatorname{var}[\overline{X}] = 35/(12^2)$, and so $\overline{X} \sim N(7/2, \sigma^2 = 35/(12^2))$ approximately.

Then, from Theorem 12.7, \[\begin{align*} \Pr(Y\geq 52) & \simeq \Pr(\overline{X}\geq 52/12)\\ &= \Pr_N \left(Z\geq \frac{52/12 - 7/2}{\sqrt{35/144}}\right)\\ &= 1 - \Phi(1.690) = 0.0455 \end{align*}\] The probability is approximately $4.6$%.
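Because the sum of dice is discrete, the exact probability can be computed and compared with the CLT approximation. The sketch below builds the exact distribution of the sum of $12$ dice by repeated convolution. The continuity-corrected approximation (using $51.5$ rather than $52$) is an additional refinement not used in the example above, and all variable names are illustrative.

```r
# Exact distribution of the sum of 12 fair dice, by repeated convolution
pmf <- rep(1/6, 6)               # Distribution of one die: values 1, ..., 6
for (k in 2:12) {
  # The sum of k dice takes the values k, ..., 6k
  new_pmf <- numeric(length(pmf) + 5)
  for (face in 1:6) {
    idx <- face:(face + length(pmf) - 1)
    new_pmf[idx] <- new_pmf[idx] + pmf / 6
  }
  pmf <- new_pmf
}
totals <- 12:72                  # Possible values of Y

exact     <- sum(pmf[totals >= 52])
approx    <- 1 - pnorm((52/12 - 7/2) / sqrt(35/144))
approx_cc <- 1 - pnorm((51.5 - 42) / sqrt(35))   # Continuity correction
c(exact = exact, approx = approx, approx_cc = approx_cc)
```

Comparing the three values gives a sense of how accurate the normal approximation is for a sum of only $12$ discrete random variables.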

The Central Limit Theorem also applies to the sum \[ Y = \sum_{i=1}^n X_i, \] not just the sample mean. If $X_1,\ldots,X_n$ are iid random variables with \[ \operatorname{E}[X_i] = \mu \qquad \text{and} \qquad \operatorname{var}[X_i] = \sigma^2, \] then for large $n$, \[ Y = \sum_{i=1}^n X_i \approx N(n\mu, n\sigma^2). \] That is, the sum is approximately normally distributed with

mean $\operatorname{E}[Y] = n\mu$, and
variance $\operatorname{var}[Y] = n\sigma^2$.

This result is often useful when modelling totals, such as total sales, total demand, total rainfall, or total waiting time.

Example 12.15 (CLT (Voltages)) Suppose we have a number of independent noise voltages $V_i$ (for $i = 1, 2, \dots, n$). Let $V$ be the sum of the voltages, and suppose each $V_i$ is distributed $U(0, 10)$. For $n = 20$, find $\Pr(V > 105)$.

This is an example of the CLT written as a sum. To find $\Pr(V > 105)$, the distribution of $V$ must be known.

Since $\operatorname{E}[V_i] = 5$ and $\operatorname{var}[V_i] = 100/12$, $V$ has an approximate normal distribution with mean $20\times 5 = 100$ and variance $20\times 100/12$. That is, $\displaystyle{\frac{V - 100}{10\sqrt{5/3}}}$ is distributed $N(0, 1)$ approximately. So \[ \Pr(V > 105) \approx \Pr\left (Z > \frac{105 - 100}{12.91}\right ) = 1 - \Phi(0.387) = 0.349. \] The probability is approximately $35$%.
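The quality of this approximation can be examined by simulating the sum of $20$ uniform voltages many times. This is a sketch only; the seed and the number of simulations are arbitrary assumptions.

```r
set.seed(123)                    # Arbitrary seed, for reproducibility
num_Sims <- 20000

# Each replicate: the sum of n = 20 independent U(0, 10) voltages
V <- replicate(num_Sims, sum(runif(20, min = 0, max = 10)))

p_sim <- mean(V > 105)           # Simulated probability
p_clt <- 1 - pnorm(105, mean = 100, sd = sqrt(20 * 100 / 12))
c(simulated = p_sim, CLT = p_clt)
```

The uniform distribution is symmetric, so the normal approximation to the sum is already good for $n = 20$.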

12.6.5 The normal approximation to the binomial

The normal approximation to the binomial distribution (Sect. 7.3.5) can be seen as an application of the Central Limit Theorem. The essential observation is that a sample proportion is a sample mean. Consider a sequence of independent Bernoulli trials resulting in the random sample $X_1, X_2, \dots, X_n$ where \[ X_i = \begin{cases} 0 & \text{if failure}\\ 1 & \text{if success} \end{cases} \] denotes whether or not the $i$th trial is a success. Then the sum \[ Y = \sum_{i = 1}^n X_i \] represents the number of successes in the $n$ trials and \[ \overline{X} = \frac{1}{n}\sum_{i = 1}^n X_i = \frac{Y}{n} \] is a sample mean representing the proportion or fraction of trials which are successful. In this context, $\overline{X}$ is usually denoted by the sample proportion $\widehat{p}$.

Note that $\operatorname{E}[X_i] = p$ and $\operatorname{var}[X_i] = p(1 - p)$. Therefore \[ \operatorname{E}[Y] = np \quad\text{and}\quad \operatorname{var}[Y] = np(1 - p) \] and \[ \operatorname{E}[\overline{X}] = p \quad\text{and}\quad \operatorname{var}[\overline{X}] = \frac{p(1 - p)}{n}. \] Theorem 12.7 is applicable to $\overline{X}$ and $Y$ respectively. Hence \[\begin{align*} \overline{X} &= \widehat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)\text{ approximately}\\ \text{and}\quad Y &= n\widehat{p} \sim N(np, np(1 - p))\text{ approximately}. \end{align*}\]
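As an illustration, the approximation for $Y$ can be compared with the exact binomial probability in R. The values $n = 50$, $p = 0.3$ and the cut-off $18$ are assumptions chosen for this sketch. The continuity-corrected version (evaluating the normal CDF at $18.5$) is a standard refinement, included here for comparison.

```r
n <- 50; p <- 0.3                # Illustrative values (assumed)
sd_Y <- sqrt(n * p * (1 - p))    # Standard deviation of Y

exact     <- pbinom(18, size = n, prob = p)     # Exact Pr(Y <= 18)
approx    <- pnorm(18,   mean = n * p, sd = sd_Y)
approx_cc <- pnorm(18.5, mean = n * p, sd = sd_Y)

c(exact = exact, approx = approx, approx_cc = approx_cc)
```

The continuity correction matters because $Y$ is discrete while the normal distribution is continuous.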

12.7 The $\chi^2$-distribution

Having discussed the sampling distribution of the sample mean, we now turn to the sampling distribution of the sample variance \[\begin{equation} S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})^2 \tag{12.3} \end{equation}\] for a sample $X_1, \dots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$ (Theorem 12.3). We seek the distribution of $S^2$.

Before deriving the sampling distribution of $S^2$, one foundational result is needed…

Theorem 12.8 (Sample mean and variance are independent) Let $X_1, X_2,\dots, X_n$ be a random sample of size $n$ from $N(\mu, \sigma^2)$. Then the sample mean $\overline{X}$ and the sample variance $S^2$ are independent.

Proof. This proof is not given.

While the proof is not given, see Exercise 12.25. This result relies on normality; for non-normal distributions, $\overline{X}$ and $S^2$ are generally dependent.

The formula for the estimator $S^2$ in Eq. (12.3) involves a $\sum X_i^2$ term. Hence, to find the distribution of $S^2$, we start with the distribution of a squared normal variate. Specifically, consider \[ Z = \frac{X - \mu}{\sigma} \sim N(0,1) \] and the distribution of $Y = Z^2$. The distribution of $Y$ is easy to determine using the techniques in Chap. 9 (see Exercise 12.23), and yields \[ f_Y(y) = \frac{y^{-1/2}\exp(-y/2)}{\sqrt{2\pi}} \quad\text{for $y > 0$}. \] Note that $Y \sim \text{Gamma}(\alpha = 1/2, \beta = 2)$.

Now consider $Z_1, Z_2, \dots Z_n\overset{\text{iid}}{\sim} N(0, 1)$, and the summation $\sum_{i = 1}^n Z_i^2$. Since each $Z_i^2$ has the same gamma distribution, the distribution of the summation can be found, using MGFs (Exercise 12.24), as \[ \sum_{i = 1}^n Z_i^2 \sim \text{Gamma}(\alpha = n/2, \beta = 2). \] This distribution occurs so often in practice that it has its own name. A $\text{Gamma}(\alpha = n/2, \beta = 2)$ distribution is defined as a $\chi^2_n$-distribution, and so \[\begin{equation} \sum_{i = 1}^n Z_i^2 \sim \chi^2_n, \tag{12.4} \end{equation}\] where $n$ here is the degrees of freedom.

Definition 12.8 (Chi-squared distribution) A continuous random variable $X$ with probability density function \[\begin{equation} f_X(x) = \frac{x^{(\nu/2) - 1}\exp(-x/2)}{2^{\nu/2}\,\Gamma(\nu/2)} \quad\text{for $x > 0$} \end{equation}\] is said to have a chi-squared distribution with parameter $\nu > 0$, called the degrees of freedom. We write $X \sim \chi^2_\nu$.

Some plots of $\chi^2$-distributions are shown in Fig. 12.11.


FIGURE 12.11: Some $\chi^2\!$-distribution probability density functions.

Theorem 12.9 (Properties of the chi-squared distribution) If $X\sim\chi^2_\nu$ then

$\operatorname{E}[X] = \nu$.
$\operatorname{var}[X] = 2\nu$.
$M_X(t) = (1 - 2t)^{-\nu/2}$ for $t < 1/2$.

Proof. Since the chi-squared distribution is a special case of the gamma distribution, these properties can be obtained directly from those for the gamma distribution (see Theorem 7.7).
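The mean and variance in Theorem 12.9, and the construction in Eq. (12.4), can be checked by simulation. The sketch below sums $\nu = 6$ squared standard normal variates; the value of $\nu$, the seed, and the number of simulations are illustrative assumptions.

```r
set.seed(99)                     # Arbitrary seed, for reproducibility
nu <- 6                          # Illustrative degrees of freedom
num_Sims <- 20000

# Each replicate: the sum of nu squared standard normal variates
Q <- replicate(num_Sims, sum(rnorm(nu)^2))

c(mean(Q), nu)                   # Compare with E[X] = nu
c(var(Q),  2 * nu)               # Compare with var[X] = 2 nu

# Empirical and theoretical quantiles should be similar
quantile(Q, c(0.5, 0.95))
qchisq(c(0.5, 0.95), df = nu)
```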

Having found the distribution of the sum of $n$ squared standard normal random variables, we can return to the original task: finding the sampling distribution of the sample variance $S^2$ in Eq. (12.3). First, write \[ \sum_{i=1}^n (X_i - \overline{X})^2 = \sum_{i=1}^n (X_i - \mu)^2 - n (\overline{X} - \mu)^2 \] (see Exercise 12.45). Dividing this expression by $\sigma^2$ then gives \[\begin{align*} W &= \sum_{i=1}^n \frac{(X_i - \overline{X})^2}{\sigma^2} \\ &= \sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma}\right)^2 - \frac{n(\overline{X} - \mu)^2}{\sigma^2}. \end{align*}\] Now, write $W = U - V$, where \[\begin{align*} U &= \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n\\ \text{and}\quad V &= \frac{n(\overline{X} - \mu)^2}{\sigma^2} \sim \chi^2_1. \end{align*}\] Here $V \sim \chi^2_1$ because $V$ is the square of $(\overline{X} - \mu)/(\sigma/\sqrt{n}) \sim N(0, 1)$. Since $W$ depends on the sample only through $S^2$ and $V$ depends on the sample only through $\overline{X}$, $W$ and $V$ are independent (a consequence of the independence of sample mean and sample variance under normality; Theorem 12.8). We proceed via MGFs. Since $U = W + V$ and $W$ and $V$ are independent, the MGF of $U$ factorises as \[ M_U(t) = M_W(t)\cdot M_V(t). \] The MGF of a $\chi^2_k$ distribution is $M(t) = (1- 2 t)^{-k/2}$. Hence \[ (1-2t)^{-n/2} = M_W(t)\cdot (1-2t)^{-1/2}, \] which implies \[ M_W(t) = (1 - 2t)^{-(n-1)/2}. \] This is the MGF of a $\chi^2_{n-1}$ distribution, and hence \[ W \sim \chi^2_{n-1}. \] This means that \[ \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \overline{X})^2 \sim \chi^2_{n-1} \] and hence \[ \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n-1} \] after substituting Eq. (12.3) and rearranging. Thus, $S^2$ has a scaled $\chi^2_{n-1}$ distribution.

Theorem 12.10 (Sampling distribution of the variance) If $X_1, X_2, \dots, X_n$ is an iid sample of size $n$ from a $N(\mu, \sigma^2)$ distribution, then \[ \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n - 1}. \]

Proof. The proof has been outlined in the above discussion.

The four R functions for working with the $\chi_\nu^2$-distribution have the form [dpqr]chisq(., df), where df${} = \nu$ is the degrees of freedom (see App. E):

dchisq(x, df) computes the PDF at $X = {}$x;
pchisq(q, df) computes the CDF at $X = {}$q;
qchisq(p, df) computes the quantile for cumulative probability p; and
rchisq(n, df) generates n random observations.

Example 12.16 (Sampling distribution of the variance) Consider taking samples of size $n = 5$ and $n = 20$ from a $N(5, \sigma^2 = 4)$ distribution. The sampling distribution can be found from Theorem 12.10: \[ \frac{n - 1}{4} S^2 \sim \chi^2_{n - 1}. \] Figure 12.12 shows the empirical sampling distribution of the scaled sample variance for both sample sizes. The scaled quantity has variance $2(n - 1)$ (Theorem 12.9), which increases with $n$; it is the sample variance $S^2$ itself, with $\operatorname{var}[S^2] = 2\sigma^4/(n - 1)$, that has a smaller variance for the larger sample size.

set.seed(42)

n_a <- 5; n_b <- 20
mu <- 5; sigma2 <- 4
num_Sims   <- 10000

# Simulate num_Sims values of (n-1)S^2 / sigma^2
scaled_S2_a <- replicate(num_Sims, {
  x <- rnorm(n_a, 
             mean = mu, sd = sqrt(sigma2))
  (n_a - 1) * var(x) / sigma2
})
scaled_S2_b <- replicate(num_Sims, {
  x <- rnorm(n_b, 
             mean = mu, sd = sqrt(sigma2))
  (n_b - 1) * var(x) / sigma2
})

# Plot histogram vs true chi^2(n-1) density
hist(scaled_S2_a, freq = FALSE)
curve(dchisq(x, df = n_a - 1), add = TRUE, lwd = 2)

hist(scaled_S2_b, freq = FALSE)
curve(dchisq(x, df = n_b - 1), add = TRUE, lwd = 2)

FIGURE 12.12: The sampling distribution of the scaled sample variance. The scale on the horizontal axis is different for the two plots.

Example 12.17 (Sample standard deviation) Quality control engineers routinely need to assess the variability of a production process. A process may be centred correctly, yet still produce defective items if the spread is too large.

Suppose a manufacturer produces steel rods whose diameter (in mm) should follow a specified $N(\mu, \sigma^2)$ distribution. The target standard deviation is $\sigma = 0.05\,\text{mm}$; large variances are outside tolerance.

The production line is monitored by a quality engineer by drawing a random sample of $n = 25$ rods each hour and computing the sample variance $S^2$. In one sample of $n = 25$ rods, the engineer finds $s^2 = 0.0031\,\text{mm}^2$. Find the probability that $S^2 > 0.0031$ when $\sigma^2 = 0.0025$.

Using the chi-squared result, \[ \frac{n-1}{\sigma^2} S^2 \sim \chi^2_{n-1} \quad \text{so that} \quad \frac{24}{0.0025} S^2 = 9600\, S^2 \sim \chi^2_{24}. \]

Thus, \[ \Pr(S^2 > 0.0031) = \Pr\left(9600\,S^2 > \frac{24\times 0.0031}{0.0025}\right). \] In R:

1 - pchisq(29.76, 
           df = 24)
#> [1] 0.192834

There is an approximate $19$% probability of observing a sample variance this large or larger purely by chance, if the true variance is $\sigma^2 = 0.0025$.
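The same result can be used in reverse, to find how large the sample variance would need to be before it became unusual. The sketch below finds the value $c$ such that $\Pr(S^2 > c) = 0.05$ when $\sigma^2 = 0.0025$. The $5$% level is an assumption made for illustration, and is not part of the example above.

```r
n <- 25; sigma2 <- 0.0025

# Value c such that Pr(S^2 > c) = 0.05 when sigma^2 = 0.0025
crit <- qchisq(0.95, df = n - 1) * sigma2 / (n - 1)
crit

# Check: the upper-tail probability at crit should be 0.05
tail_prob <- 1 - pchisq((n - 1) * crit / sigma2, df = n - 1)
tail_prob
```

The observed value $s^2 = 0.0031$ lies below this threshold, consistent with the probability of about $19$% found above.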

12.8 The $t$-distribution

In statistical inference, a common goal is to make statements about a population mean $\mu$ using the sample mean $\overline{X}$. For example, Lin et al. (2021) studied whether the average overnight sleep time for pre-school children equals the recommended $10\,\text{h}$.

If the population variance $\sigma^2$ were known, then the Central Limit Theorem implies that \[ Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \] is approximately standard normal for large $n$, regardless of the distribution of $X$.

In practice, however, the population variance $\sigma^2$ is almost never known. Instead, it is estimated using the sample variance $S^2$, leading to the statistic \[ T = \frac{\overline{X} - \mu}{S/\sqrt{n}}. \] Unlike $Z$, this statistic does not have a normal distribution.

The exact distribution of $T$ is the Student’s $t$-distribution, which accounts for the additional uncertainty introduced by estimating $\sigma$. This extra uncertainty is especially important for small samples.

Suppose that the population $X$ has a normal distribution. Then, \[ Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \quad\text{and}\quad V = \frac{(n - 1) S^2}{\sigma^2}\sim \chi^2_{n-1}. \] In addition, $Z$ and $V$ are independent. Hence \[ T = \frac{Z}{\sqrt{V/(n-1)}} \] has a $t_{n - 1}$ distribution, where $(n-1)$ is the degrees of freedom. This ratio occurs so often in statistical inference, and in practice, that it has its own name. More generally, if $Z\sim N(0, 1)$ and $V \sim \chi^2_\nu$ are independent, then $T = Z/\sqrt{V/\nu}$ defines the Student’s $t$-distribution with $\nu$ degrees of freedom, written $T \sim t_\nu$.

Definition 12.9 ($t$-distribution) Suppose $Z \sim N(0,1)$ and $V \sim \chi^2_\nu$ are independent. Then the random variable \[\begin{equation} T = \frac{Z}{\sqrt{V/\nu}} \tag{12.5} \end{equation}\] has a $t$-distribution with $\nu$ degrees of freedom.
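The construction in Def. 12.9 can be mimicked directly in R, by generating independent values of $Z$ and $V$ and forming the ratio. This is a sketch under assumed settings ($\nu = 5$, an arbitrary seed, and $20\,000$ simulations); the quantiles of the simulated values should be close to those given by qt().

```r
set.seed(314)                    # Arbitrary seed, for reproducibility
nu <- 5                          # Illustrative degrees of freedom
num_Sims <- 20000

Z <- rnorm(num_Sims)             # Standard normal variates
V <- rchisq(num_Sims, df = nu)   # Independent chi-squared variates
T_sim <- Z / sqrt(V / nu)        # The ratio in Eq. (12.5)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
rbind(simulated = quantile(T_sim, probs),
      theory    = qt(probs, df = nu))
```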

For non-normal populations, the statistic $T$ is not exactly $t$-distributed. However, for large samples, its distribution is often well approximated by a $t$-distribution, due to the Central Limit Theorem and the fact that $S^2$ converges to $\sigma^2$ as the sample size increases.

Theorem 12.11 ($t$-distribution PDF) A continuous random variable $X$ with probability density function \[\begin{equation} f_X(x) = \frac{\Gamma\left((\nu + 1)/2\right)}{\sqrt{\pi \nu}\,\Gamma(\nu/2)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2} \quad\text{for $x\in\mathbb{R}$} \end{equation}\] is said to have a $t$-distribution with parameter $\nu > 0$. The parameter $\nu$ is called the degrees of freedom. We write $X \sim t_\nu$.

Proof. This proof is not given.

The probability density function of the $t$-distribution is very similar to the standard normal distribution (Fig. 12.13): bell-shaped and symmetric about zero. However, the distribution has heavier tails than the standard normal distribution, reflecting its larger variance (Theorem 12.12).

The four R functions for working with the $t$-distribution have the form [dpqr]t(., df) (see App. E):

dt(x, df) computes the PDF at $X = {}$x;
pt(q, df) computes the CDF at $X = {}$q;
qt(p, df) computes the quantile for cumulative probability p; and
rt(n, df) generates n random numbers,

where df is the degrees of freedom.


FIGURE 12.13: Some $t$-distributions (with normal distributions in lighter-coloured lines), with mean $0$ and variance $1$.

Theorem 12.12 (Properties of the $t$-distribution) If $X\sim t_\nu$ then

For $\nu > 1$, $\operatorname{E}[X] = 0$.
For $\nu > 2$, $\operatorname{var}[X] = \displaystyle{\frac{\nu}{\nu - 2}}$.
The MGF does not exist. Only moments of order less than $\nu$ exist; that is, $\operatorname{E}[|X|^k]$ is finite if and only if $k < \nu$.

Proof. This proof is not given.

Although we won’t prove it, as $\nu\to\infty$ the $t$-distribution converges to the standard normal (which can be seen in Fig. 12.13). In addition, from Theorem 12.12, $\operatorname{var}[X] \to 1$ as $\nu \to \infty$. Notice that $T$ represents a standardised version of the sample mean.
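This convergence can be seen numerically by comparing quantiles of the $t$-distribution with the corresponding standard normal quantile as the degrees of freedom increase. The degrees of freedom and the $0.975$ quantile used below are illustrative choices.

```r
df_values <- c(3, 5, 10, 30, 100, 1000)

q_t   <- qt(0.975, df = df_values)        # 0.975 quantiles of t
var_t <- df_values / (df_values - 2)      # Variances, from Theorem 12.12

data.frame(df = df_values, q_t = q_t, var_t = var_t)
qnorm(0.975)                              # Limiting normal quantile
```

The quantiles decrease towards the normal quantile, and the variances decrease towards $1$, as $\nu$ increases.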

Example 12.18 (Calculating $T$) A random sample $\{21, 18, 16, 24, 16\}$ is drawn from a normal population with mean of $20$.

To find the value of $T$ for this sample, first find (from the sample): $\overline{x} = 19.0$ and $s^2 = 12.0$. Therefore $t = \frac{19.0 - 20}{\sqrt{12.0/5}} = -0.645$. (Again, lower-case symbols are used for specific values of statistics, and upper-case symbols for the random variables.)

We can then ask: In random samples from this population, what is the probability that $T$ is less than the value found above?

Interest here is in $\Pr(T < -0.645)$ where $T\sim t_4$. From R, the answer is approximately $0.277$:

pt(-0.645, df = 4)
#> [1] 0.2770289
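The whole calculation can also be carried out in R from the raw data, avoiding intermediate rounding. The variable names below are illustrative.

```r
x  <- c(21, 18, 16, 24, 16)      # The observed sample
mu <- 20                         # The population mean

# Observed value of T: (sample mean - mu) / (s / sqrt(n))
t_obs <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
t_obs

p_lower <- pt(t_obs, df = length(x) - 1)   # Pr(T < t_obs)
p_lower
```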

Example 12.19 ($T$-statistic) The recommendation for Taiwanese pre-school children is to have at least $10\,\text{h}$ of sleep per night. Lin et al. (2021) sampled $39$ girls and found the sample mean was $\overline{x} = 8.64\,\text{h}$ with a sample standard deviation of $s = 0.37\,\text{h}$. Determine the probability that the sample mean would be less than $8.64\,\text{h}$ if the mean in the population was $10\,\text{h}$, as recommended.

Define the overnight sleep time for each girl as $X$. Since the sample size is moderately large ($n = 39$), the Central Limit Theorem implies that the sample mean $\overline{X}$ is approximately normally distributed.

Because the population variance $\sigma^2$ is unknown, we use $S^2$. Then \[ T = \frac{\overline{X} - \mu}{S/\sqrt{n}} \] has an approximate $t$-distribution, with $\nu = n - 1 = 38$ degrees of freedom.

Using $\mu = 10$, $s^2 = 0.37^2$ and $\overline{X} = 8.64$, compute \[ t = \frac{8.64 - 10}{0.37/\sqrt{39}} \approx -22.97. \] Hence $\Pr(\overline{X} < 8.64) = \Pr(T < -22.97)$; using R:

pt(-22.97, df = 38)
#> [1] 3.470478e-24

If the population mean is $10\,\text{h}$ (as per the recommendation), it is extremely unlikely that the sample mean from a sample of $39$ girls would be lower than $8.64\,\text{h}$.

12.9 The $F$-distribution

Another distribution that arises naturally in statistical inference is the $F$-distribution. It appears as the distribution of a ratio of two independent chi-squared random variables, and is used for comparing variances and in analysis of variance (ANOVA), a standard method for comparing multiple population means.

Suppose we have two independent random samples from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, of sizes $n_1$ and $n_2$, respectively. Then, from Theorem 12.10, \[ U_1 = \frac{(n_1 - 1)S_1^2}{\sigma_1^2} \sim \chi^2_{n_1 - 1} \quad\text{and}\quad U_2 = \frac{(n_2 - 1)S_2^2}{\sigma_2^2} \sim \chi^2_{n_2 - 1}. \] The two sample variances can be compared using \[ \frac{U_1/(n_1-1)}{U_2/(n_2-1)}. \] This ratio occurs so often in practice that its distribution has its own name: an $F$-distribution.

Definition 12.10 ($F$-distribution) A random variable $F$ has an $F$-distribution with degrees of freedom $\nu_1$ and $\nu_2$, written $F \sim F_{\nu_1,\nu_2}$, if it can be expressed as \[ F = \frac{U_1/\nu_1}{U_2/\nu_2}, \] where $U_1 \sim \chi^2_{\nu_1}$, $U_2 \sim \chi^2_{\nu_2}$, and $U_1$ and $U_2$ are independent.

This structure makes the $F$-distribution central to inference on variances and mean comparisons. Some $F$-distributions are shown in Fig. 12.14.

Theorem 12.13 (Properties of the $F$-distribution) If $X\sim F_{\nu_1, \nu_2}$ then

For $\nu_2 > 2$, $\operatorname{E}[X] = \displaystyle{\frac{\nu_2}{\nu_2-2}}$.
For $\nu_2 > 4$, $\operatorname{var}[X] = \displaystyle{\frac{2\nu_2^2(\nu_1 + \nu_2-2)} {\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}}$.
The MGF does not exist.

Proof. Not covered.

The four R functions for working with the $F$-distribution have the form [dpqr]f(., df1, df2), where df1${} = \nu_1$ and df2${} = \nu_2$ are the numerator and denominator degrees of freedom respectively (see App. E):

df(x, df1, df2) computes the PDF at $X = {}$x;
pf(q, df1, df2) computes the CDF at $X = {}$q;
qf(p, df1, df2) computes the quantile for cumulative probability p; and
rf(n, df1, df2) generates n random observations.


FIGURE 12.14: Some $F$-distributions, for various numerator (df${}_1$) and denominator (df$_2$) degrees of freedom.

The main result connecting the $F$-distribution to samples from normal populations is stated in the following Theorem.

Theorem 12.14 (Sampling distribution of the ratio of two sample variances) Let $X_1, X_2, \dots, X_{n_1}$ be a random sample of size $n_1$ from $N(\mu_1, \sigma_1^2)$ and $Y_1, Y_2,\dots, Y_{n_2}$ be an independent random sample of size $n_2$ from $N(\mu_2, \sigma_2^2)$. Then the random variable \[ F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \] follows an $F_{n_1 - 1,\, n_2 - 1}$-distribution.

Proof. A partial proof only is given. By Theorem 12.10, \[ U_1 = \frac{(n_1 - 1)S_1^2}{\sigma_1^2} \sim \chi^2_{n_1 - 1} \qquad\text{and}\qquad U_2 = \frac{(n_2 - 1)S_2^2}{\sigma_2^2} \sim \chi^2_{n_2 - 1}, \] and $U_1$, $U_2$ are independent because the two samples are independent. Setting $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$, the statistic can be written as \[ F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{U_1/\nu_1}{U_2/\nu_2}. \] That this ratio has an $F_{\nu_1, \nu_2}$-distribution follows directly from Def. 12.10; the derivation of the probability density function is not given here.

The $F$-distribution is also related to the $t$-distribution: if $T \sim t_n$, then $T^2 \sim F_{1, n}$.

Proof. From Def. 12.9, $T = Z / \sqrt{V/n}$ where $Z \sim N(0,1)$ and $V \sim \chi^2_n$ are independent. Then \[ T^2 = \frac{Z^2/1}{V/n}. \] Since $Z^2 \sim \chi^2_1$ (Sect. 12.7), this is a ratio of the form $\chi^2_1/1$ over $\chi^2_n/n$, which by Def. 12.10 has an $F_{1,n}$-distribution.
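The relationship between $T^2$ and the $F_{1, n}$-distribution can be confirmed numerically, since $\Pr(T^2 \le c) = \Pr(-\sqrt{c} \le T \le \sqrt{c}) = 2\Pr(T \le \sqrt{c}) - 1$. In the sketch below, the degrees of freedom $n = 7$ and the cut-off values are arbitrary illustrative choices.

```r
n <- 7                           # Illustrative degrees of freedom
cutoffs <- c(0.5, 1, 2.5, 6)     # Arbitrary values of c

lhs <- pf(cutoffs, df1 = 1, df2 = n)        # Pr(F <= c), F ~ F(1, n)
rhs <- 2 * pt(sqrt(cutoffs), df = n) - 1    # Pr(T^2 <= c), T ~ t(n)
all.equal(lhs, rhs)

# Equivalently, squared two-sided t quantiles are F quantiles
all.equal(qt(0.975, df = n)^2, qf(0.95, df1 = 1, df2 = n))
```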

Example 12.20 ($F$-distribution probabilities) Suppose $X\sim F_{2, 10}$.

Then $\Pr(X < 1)$ can be found using R:

pf(1,    df1 = 2, df2 = 10)
#> [1] 0.5981224

We can also find the value of $x$ such that $\Pr(X > x) = 0.01$ using R:

qf(0.99, df1 = 2, df2 = 10)
#> [1] 7.559432

Example 12.21 (Comparing two variances) Suppose two independent random samples are taken from two normal populations:

Sample 1: $n_1 = 12$ and $S_1^2 = 18$;
Sample 2: $n_2 = 10$ and $S_2^2 = 8$.

To compare the population variances, consider the statistic \[ F = \frac{S_1^2}{S_2^2} = \frac{18}{8} = 2.25. \]

If we assume that the two population variances are equal (so that $\sigma_1^2 = \sigma_2^2$), then \[ F \sim F_{11, 9} \] since the numerator and denominator degrees of freedom are \[ \nu_1 = n_1 - 1 = 11 \qquad\text{and}\qquad \nu_2 = n_2 - 1 = 9. \]

Large values of $F$ provide evidence that the first population variance is larger than the second. The probability that we would observe a value of the $F$-statistic as large as $F = 2.25$ or larger is:

1 - pf(2.25, df1 = 11, df2 = 9)
#> [1] 0.1167768

Thus, it is reasonably likely that we would observe an $F$-value as large as $F = 2.25$ or larger, just through random sampling.
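The tail probability in Example 12.21 can be approximated by simulation. The sketch below assumes both populations are standard normal, which is an arbitrary choice: when the population variances are equal, the ratio $S_1^2/S_2^2$ does not depend on the common variance or on the means. The seed and number of simulations are also illustrative.

```r
set.seed(1)           # arbitrary seed, for reproducibility
num_sims <- 50000

# Ratio of sample variances from samples of sizes 12 and 10,
# both drawn from N(0, 1) so that the population variances are equal
ratio <- replicate(num_sims, var(rnorm(12)) / var(rnorm(10)))

# Simulated and theoretical values of Pr(F >= 2.25)
p_sim    <- mean(ratio >= 2.25)
p_theory <- 1 - pf(2.25, df1 = 11, df2 = 9)
c(simulated = p_sim, theoretical = p_theory)
```

The simulated proportion should be close to the value computed with `pf()`, with the difference shrinking as the number of simulations grows.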

12.10 Appendix: Proof of the CLT

This Appendix outlines the proof of the Central Limit Theorem (Theorem 12.7).

Without loss of generality, standardise the $X_i$ by letting $Y_i = (X_i - \mu)/\sigma$, so that $\operatorname{E}[Y_i] = 0$, $\operatorname{var}[Y_i] = 1$, and \[ Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i. \] Let $\varphi(t) = \operatorname{E}[\exp(itY)]$ denote the common characteristic function (CF) of the $Y_i$. Since the $Y_i$ are independent, the CF of $Z_n$ is \[\begin{align*} \varphi_{Z_n}(t) &= \operatorname{E}[\exp(itZ_n)]\\ &= \operatorname{E}\left[\exp\left(\frac{it}{\sqrt{n}}\sum_{i=1}^n Y_i\right)\right]\\ &= \prod_{i=1}^n \operatorname{E}\left[\exp(it Y_i/\sqrt{n})\right]\\ &= \left[\varphi\!\left(t/\sqrt{n}\right)\right]^n. \end{align*}\]

Since $\operatorname{E}[Y^2] = 1 < \infty$, the CF $\varphi(t)$ is twice differentiable at $t = 0$, with \[ \varphi'(t)\big|_{t=0} = i\operatorname{E}[Y] = 0 \qquad\text{and}\qquad \varphi''(t)\big|_{t=0} = i^2 \operatorname{E}[Y^2] = -1. \] The Taylor expansion of $\varphi$ about $t = 0$ is therefore \[\begin{align*} \varphi(t) &= 1 + \varphi'(0)\,t + \frac{1}{2}\varphi''(0)\,t^2 + o(t^2)\\ &= 1 - \frac{1}{2}t^2 + o(t^2) \quad \text{as $t \to 0$}, \end{align*}\] where $o(t^2)$ means $o(t^2)/t^2 \to 0$ as $t\to 0$. Substituting to write as a function of $t/\sqrt{n}$ gives \[ \varphi\!\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right). \]

Taking logs (which is valid for large enough $n$ since $\varphi(t/\sqrt{n}\,) \to 1$) and using $\log(1 + u) = u + o(u)$ as $u \to 0$: \[ \log \varphi\!\left(\frac{t}{\sqrt{n}}\right) = -\frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right). \] Therefore \[\begin{align*} \log \varphi_{Z_n}(t) &= n \log \varphi\!\left(\frac{t}{\sqrt{n}}\right)\\ &= n\left[-\frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]\\ &= -\frac{t^2}{2} + n\cdot o\left(\frac{t^2}{n}\right) \to\, -\frac{t^2}{2} \end{align*}\] as $n \to \infty$, for each fixed $t \in \mathbb{R}$. Taking exponents, \[ \varphi_{Z_n}(t) \to \exp(-t^2/2) \quad \text{for all $t \in \mathbb{R}$}. \]

The function $\exp(-t^2/2)$ is the CF of the $N(0, 1)$ distribution. By Lévy's continuity theorem, if $\varphi_{Z_n}(t) \to \psi(t)$ pointwise for all $t$, and the limit $\psi(t)$ is continuous at $t = 0$, then $Z_n \overset{d}{\to} Z$, where $Z$ has characteristic function $\psi(t)$, and $\overset{d}{\to}$ means ‘converges in distribution’. Since $\exp(-t^2/2)$ is continuous everywhere and is the CF of $N(0, 1)$, conclude that \[ Z_n \overset{d}{\to} N(0, 1). \]

This is a partial proof since the concept of convergence in distribution has not been formally defined. However, the arguments used in the proof are powerful and worth understanding.
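The convergence $\varphi_{Z_n}(t) \to \exp(-t^2/2)$ can be illustrated numerically for a specific non-normal distribution. The sketch below assumes $X \sim \operatorname{Exponential}(1)$ (an illustrative choice, with $\mu = \sigma = 1$), for which $Y = X - 1$ has CF $\varphi(t) = \exp(-it)/(1 - it)$. The function names and the values of $t$ and $n$ are also illustrative.

```r
# CF of Y = X - 1, where X ~ Exponential(rate = 1),
# so that E[Y] = 0 and var[Y] = 1
phi_Y <- function(t) exp(-1i * t) / (1 - 1i * t)

# CF of Z_n = (Y_1 + ... + Y_n) / sqrt(n), using independence
phi_Zn <- function(t, n) phi_Y(t / sqrt(n))^n

t_vals <- c(0.5, 1, 2)                  # illustrative values of t
n_vals <- c(1, 10, 100, 1000, 10000)    # illustrative sample sizes

# Largest absolute gap between phi_Zn(t) and exp(-t^2 / 2) over t_vals
max_gap <- sapply(n_vals, function(n) {
  max(Mod(phi_Zn(t_vals, n) - exp(-t_vals^2 / 2)))
})
data.frame(n = n_vals, max_gap = max_gap)
```

The gap should shrink towards zero as $n$ increases, in line with the limit derived above. This is a numerical illustration for one distribution only, not a substitute for the proof.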

12.11 Exercises

Selected answers appear in Sect. F.11.

Exercise 12.1 When sampling without replacement from a finite population of size $N$, observations are no longer independent. For this reason, when sampling without replacement the standard error is multiplied by the finite population correction factor (FPCF) \[ \sqrt{\frac{N - n}{N - 1}} \] where $n$ is the sample size.

What happens if the sample size is the same as the population size? Explain why this is a sensible result.
Show that, for large $N$, the FPCF is approximately $\sqrt{ 1 - n/N}$.
For what ratio $n/N$ (the sampling fraction) do you think that using the FPCF is unnecessary in practice?

Exercise 12.2 A random sample of size $81$ is taken from a population (with unknown distribution), with mean $128$ and standard deviation $6.3$.

What is the probability that an individual observation will fall between $126.6$ and $129.4$?
What is the probability that the sample mean will fall between $126.6$ and $129.4$?
What is the probability that the sample mean will not fall between $126.6$ and $129.4$?

Exercise 12.3 Let $Y_1$, $Y_2$, $\dots$, $Y_n$ be $n$ independent random variables, each with PDF \[ f_Y(y) = \begin{cases} (2 - y)/2 & \text{for $0\le y \le 2$};\\ 0 & \text{otherwise}. \end{cases} \]

Determine the probability that a single observation will be within one standard deviation of the population mean.
Determine the probability that the sample mean will be within one standard deviation of the population mean, using the Central Limit Theorem.

Exercise 12.4 Suppose the weights of eggs in a carton (of twelve eggs) are normally distributed with mean $59\,\text{g}$ and variance $0.7\,\text{g}^2$.

Find the probability that, in a sample of $20$ cartons, the sample mean weight will exceed $59.5\,\text{g}$.
Find the probability that a sample of twelve eggs will produce a sample variance of greater than $1$.

Exercise 12.5 In a carton of twelve eggs, the number broken has a Poisson distribution with mean $0.2$.

Find the probability that, in a sample of $20$ cartons, the sample mean of the number of broken eggs per carton is more than one. (Use the Central Limit Theorem.)
Find the probability that, in any single carton, more than one egg is broken.

Exercise 12.6 The random variable $M$ has the following probability density function \[ f_M(m) = \begin{cases} 3m^2 & \text{for $0 < m < 1$};\\ 0 & \text{otherwise}. \end{cases} \] A random sample of size $n = 9$ is taken from the distribution, and the sample mean $\overline{M}$ is computed.

Compute the mean of $M$.
Compute the variance of $M$.
State the approximate distribution of $\overline{M}$ including the values of the parameters.
Compute the probability that the sample mean will be within $0.1$ of the true mean.

Exercise 12.7 Generate a random sample of size $n = 9$ from a $N(10, 36)$ distribution hundreds of times. For each sample, compute $\sum_i X_i$ and $\overline{X}$; then obtain the mean and variance of each of these quantities across the samples.

Verify that the distribution of the sample means is approximately normally distributed as expected.
Verify that the distribution of the sum is approximately normally distributed as expected.
Explain why these are expected.

Exercise 12.8 A manufacturing plant produces, on average, $2$ tonnes of waste product per day, with a standard deviation of $0.2$ tonnes per day. Find the probability that, over a $20$-day period, the plant produces less than $25$ tonnes of waste if daily productions can be assumed independent.

Exercise 12.9 Let $Y$ be the change in depth of a river from one day to the next (measured in cm) at a specific location. Assume $Y$ is uniformly distributed for $y \in [-70, 70]$.

Find the probability that the mean change in depth over a period of $30$ days will be greater than $10\,\text{cm}$.
Use simulation to estimate the probability above.

Exercise 12.10 The number of deaths per year due to typhoid fever is assumed to have a Poisson distribution with rate $\lambda = 4.1$ per year.

If deaths from year to year can be assumed to be independent, what is the distribution of the total number of deaths over a $20$-year period?
Find the probability that there will be more than $100$ deaths due to typhoid fever in a period of $20$ years.

Exercise 12.11 The probability that a cell is a lymphocyte is $0.2$.

Write down an exact expression for the probability that, in a sample containing $150$ cells, at least $40$ are lymphocytes. Evaluate this expression using R.
Write down an approximate expression for this probability, and evaluate it.

Exercise 12.12 Suppose the probability of a person aged $80+$ years dying after receiving influenza vaccine is $0.006$. In a sample of $200$ persons aged $80+$ years:

Write down an exact expression for the probability that more than $5$ will die after vaccination for influenza. Evaluate this expression using R.
Write down an approximate expression for this probability and evaluate it.
If $4$ persons died in a sample of $200$, what conclusion would you make about the probability of dying after vaccination? Justify your answer.

Exercise 12.13 Illustrate the Central Limit Theorem for the uniform distribution on $[-1, 1]$ by simulation and repeated sampling, for various sample sizes.

Exercise 12.14 Illustrate the Central Limit Theorem for the exponential distribution with mean $1$, by simulation and repeated sampling, for various sample sizes.

Exercise 12.15 Consider a random variable $Z$ with a standard normal distribution $N(0, 1)$, and a random variable $V$ with a chi-squared distribution on $\nu$ degrees of freedom.

Simulate the distribution of \[ T = \frac{Z}{\sqrt{V/\nu}}, \] then show this is a $t$-distribution with $\nu$ degrees of freedom. Hint: use the R functions for the $t$-distribution (e.g., dt()) and the chi-squared distribution (e.g., dchisq()).

Exercise 12.16 Consider a zero-modified Gamma$(2, 1)$-distribution, with $p = 0.5$.

Derive $\operatorname{E}[\overline{Y}_n]$ and $\operatorname{var}[\overline{Y}_n]$.
Simulate the distribution of $\overline{Y}_n$ for $n \in \{10, 50, 200\}$, overlaid with the appropriate normal approximation. Comment.

Exercise 12.17 Consider a compound Poisson–gamma distribution with $\lambda = 1.5$, $\alpha = 2$ and $\beta = 1$.

Derive $\operatorname{E}[\overline{Y}_n]$ and $\operatorname{var}[\overline{Y}_n]$.
Simulate the distribution of $\overline{Y}_n$ for $n \in \{10, 50, 200\}$, overlaid with the appropriate normal approximation. Comment.

Exercise 12.18 Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \operatorname{Exponential}(\lambda)$, where $\lambda$ is the rate parameter so that $\operatorname{E}[X_i] = 1/\lambda$.

Show that $\overline{X}$ is an unbiased estimator of $1/\lambda$.
Find an unbiased estimator of $\lambda$.

Exercise 12.19 Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \operatorname{Uniform}(0, \theta)$.

Show that $2\overline{X}$ is an unbiased estimator of $\theta$.
Let $M = \max(X_1, \ldots, X_n)$. Derive the distribution of $M$ and show that $M$ is not unbiased for $\theta$.
Find a function of $M$ that is unbiased for $\theta$.

Exercise 12.20 Show that the estimator \[ S^2_\mu = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \] is an unbiased estimator of the population variance $\sigma^2$.

Exercise 12.21 Show that the estimator \[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \overline{X})^2 \] is an unbiased estimator of the population variance $\sigma^2$.

Exercise 12.22 Prove Theorem 12.6.

Exercise 12.23 Consider a random variable $X\sim N(0, 1)$. Show that the distribution of $Z = X^2$ is a $\text{Gamma}(\alpha = 1/2, \beta = 2)$ distribution.

Exercise 12.24 Consider $Z_i \overset{\text{iid}}{\sim} \text{Gamma}(\alpha = 1/2, \beta = 2)$ for $i = 1, 2, \dots, n$. Use the MGF method (Sect. 9.4) to show that the distribution of $\sum_{i=1}^n Z_i$ is a $\text{Gamma}(\alpha = n/2, \beta = 2)$ distribution.

Exercise 12.25 Let $X_1, X_2,\dots, X_n$ be a random sample of size $n$ from $N(\mu, \sigma^2)$. By Theorem 12.8, the sample mean $\overline{X}$ and the sample variance $S^2$ are independent.

To demonstrate using R, generate $1\,000$ simulations of data $X_i \sim N(1, 3)$ for $i = 1, 2, \dots, n$ with $n = 50$. For each of the $1\,000$ simulations, compute the sample mean and sample variance.

Plot the sample means against the sample variances. Comment.
Compute the correlation between the sample means and the sample variances. Comment.
Show that the covariance between the sample mean $\overline{X}$ and the deviations of the observations from the sample mean, $X_i - \overline{X}$, is zero.

Exercise 12.26 The variable $X$ has a chi-squared distribution with $12$ df. Determine the value of $X$ below which lies 90% of the distribution.

Exercise 12.27 A random sample of size $n = 10$ is selected from the $N(20, 5)$ distribution. What is the probability that the sample variance exceeds $10$?

Exercise 12.28 A population consists of the five values $\{1, 3, 5, 7, 9\}$.

Compute the population mean $\mu$ and population variance $\sigma^2$.
List all possible samples of size $n = 2$ (without replacement) and compute the sample mean $\overline{X}$ for each.
Find the mean and variance of the sampling distribution of $\overline{X}$.
Verify that $\operatorname{E}[\overline{X}] = \mu$.
How does the variance of $\overline{X}$ relate to $\sigma^2$ and $n$?

Exercise 12.29 Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \operatorname{Poisson}(\lambda = 3)$.

State $\operatorname{E}[X_i]$ and $\operatorname{var}[X_i]$.
Show algebraically that $\overline{X}$ is an unbiased estimator of $\lambda$.
Using R, simulate $5\,000$ samples of size $n = 20$ from a $\operatorname{Poisson}(3)$ distribution. Compute the sample mean for each and verify empirically that the average of the sample means is close to $3$.
Plot a histogram of the sample means. Comment on the shape.

Exercise 12.30 Using R, simulate $5\,000$ samples of size $n = 5$ from a $N(\mu = 10, \sigma^2 = 4)$ distribution.

For each sample, compute both the biased estimator \[ S^2_n = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2 \] and the unbiased estimator \[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2. \]
Compute the average value of $S^2_n$ and $S^2$ across all simulations.
Compare both averages to the true population variance $\sigma^2 = 4$. Which estimator is unbiased?
Repeat for $n = 50$. Comment on how bias changes with sample size.

Exercise 12.31 Suppose $X \sim \operatorname{Exponential}(\lambda = 0.5)$, so that $\operatorname{E}[X] = 2$ and $\operatorname{var}[X] = 4$.

State the theoretical mean and variance of $\overline{X}$ for samples of size $n = 16$.
Using R, simulate $5\,000$ sample means from samples of size $n = 16$ drawn from $\operatorname{Exponential}(0.5)$.
Compute the empirical mean and variance of these sample means and compare to the theoretical values.
Repeat for $n = 64$. Comment on how sample size affects the variance of $\overline{X}$.

Exercise 12.32 Let $X \sim \operatorname{Uniform}(0, 1)$, so that $\operatorname{E}[X] = 0.5$ and $\operatorname{var}[X] = 1/12$.

For sample sizes $n \in \{1, 5, 30, 100\}$, simulate $5\,000$ sample means using R.
Plot histograms of the sample means for each sample size, in a $2 \times 2$ panel.
Overlay each histogram with the theoretical normal distribution $N(0.5,\; 1/(12n))$.
Comment on how quickly the sampling distribution approaches normality as $n$ increases.

Exercise 12.33 The number of customers arriving at a service desk per hour follows a $\operatorname{Poisson}(\lambda = 4)$ distribution.

State the mean and variance of the total number of customers over $n = 50\,\text{h}$.
Using the CLT, find the approximate probability that the total number of customers over $50\,\text{h}$ exceeds $215$.
Verify your answer in R using ppois() for the exact probability and compare.

Exercise 12.34 Weights of a particular product are normally distributed with mean $500\,\text{g}$ and standard deviation $8\,\text{g}$.

Find the probability that a single item weighs less than $490\,\text{g}$.
Find the probability that the mean weight of a random sample of $n = 25$ items is less than $490\,\text{g}$.
Find the probability that the mean weight of a random sample of $n = 64$ items is less than $490\,\text{g}$.
Explain why the probabilities in Parts 1–3 decrease as the sample size increases.

Exercise 12.35 Let $X_1 \sim N(5, 9)$ and $X_2 \sim N(2, 4)$ be independent. Define $Y = 3X_1 - 2X_2$.

Find $\operatorname{E}[Y]$ and $\operatorname{var}[Y]$.
State the distribution of $Y$.
Find $\Pr(Y > 20)$.
Verify using R simulation: generate $5\,000$ realisations of $Y$ and estimate $\Pr(Y > 20)$.

Exercise 12.36 Let $X \sim \chi^2_{10}$.

State the mean and variance of $X$.
Find $\Pr(X < 15.99)$ using R.
Find the value $x$ such that $\Pr(X > x) = 0.05$ using R.
Simulate $5\,000$ values from $\chi^2_{10}$ and plot a histogram. Overlay the theoretical density curve.

Exercise 12.37

Show by simulation that if $Z_1, Z_2, \ldots, Z_5 \overset{\text{iid}}{\sim} N(0,1)$, then $\sum_{i=1}^5 Z_i^2 \sim \chi^2_5$. Use $5\,000$ simulations, plot the histogram, and overlay the $\chi^2_5$ density.
Compute the empirical mean and variance of $\sum Z_i^2$ and compare to the theoretical values.

Exercise 12.38 Suppose $X_1, \ldots, X_{15} \overset{\text{iid}}{\sim} N(\mu = 20, \sigma^2 = 9)$.

State the distribution of $(n-1)S^2/\sigma^2$.
Find the probability that $S^2 > 14$ using R.
Find the value $s^2$ such that $\Pr(S^2 > s^2) = 0.10$.
Verify Part 2 by simulation: generate $5\,000$ samples of size $15$ and compute the proportion with $S^2 > 14$.

Exercise 12.39 Let $T \sim t_8$.

State the mean and variance of $T$.
Find $\Pr(T < 1.86)$ using R.
Find the value $t$ such that $\Pr(|T| > t) = 0.05$ using R.
Show by simulation that $T = Z / \sqrt{V/8}$ has a $t_8$ distribution, where $Z \sim N(0,1)$ and $V \sim \chi^2_8$ independently. Use $10\,000$ simulations, plot the histogram, and overlay the $t_8$ density.

Exercise 12.40 A sample of $n = 10$ observations is drawn from a normal population with unknown mean and variance. The sample values are: \[14.2,\ 15.8,\ 13.5,\ 16.1,\ 14.9,\ 15.3,\ 13.8,\ 16.4,\ 14.6,\ 15.0.\]

Compute $\overline{x}$ and $s^2$ from the data in R.
Compute the $T$-statistic assuming the population mean is $\mu_0 = 15$.
Find the probability of observing a $T$-statistic as extreme or more extreme than this value (two-sided) under $t_9$.
Comment on whether the data are consistent with $\mu = 15$.

Exercise 12.41 Let $X \sim F_{4, 20}$.

State the mean and variance of $X$.
Find $\Pr(X < 2.87)$ using R.
Find the value $x$ such that $\Pr(X > x) = 0.01$.
Show by simulation that the ratio $\frac{U_1/4}{U_2/20}$, where $U_1 \sim \chi^2_4$ and $U_2 \sim \chi^2_{20}$ independently, follows an $F_{4,20}$ distribution. Use $10\,000$ simulations.

Exercise 12.42

State the result connecting $T \sim t_n$ to the $F$-distribution.
Let $T \sim t_{12}$. Find $\Pr(T^2 > 3)$ using the $t$-distribution in R.
Verify this answer by instead computing the equivalent probability using the $F$-distribution in R.
Confirm the equivalence by simulation: generate $10\,000$ values of $T \sim t_{12}$ and $50\,000$ values of $W \sim F_{1,12}$ and compare histograms of $T^2$ and $W$.

Exercise 12.43 In a large population, $35\%$ of people have a particular characteristic. A random sample of $n = 120$ people is taken.

Let $Y$ be the number of people in the sample with the characteristic. State the exact distribution of $Y$.
Using the CLT, state the approximate distribution of $Y$.
Use the normal approximation to find $\Pr(Y \geq 50)$.
Compute the exact probability using R and compare to the approximation.

Exercise 12.44 Using R, simulate $5\,000$ samples of size $n = 30$ from $N(0, 1)$. For each sample, record the sample mean $\overline{X}$ and sample variance $S^2$.

Produce a scatterplot of $\overline{X}$ against $S^2$.
Compute the correlation between $\overline{X}$ and $S^2$.
Comment on what the plot and correlation suggest about the relationship between the sample mean and sample variance when sampling from a normal distribution.

Exercise 12.45 Show that \[ \sum_{i=1}^n (X_i - \overline{X})^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\overline{X}-\mu)^2. \] Start by writing \[ X_i - \overline{X} = (X_i - \mu) - (\overline{X} - \mu), \] then expand the square and simplify the resulting sums.

Exercise 12.46 Let $X_1, X_2, \dots, X_n$ be a random sample from a $\text{Uniform}(0, \theta)$ distribution, where $\theta > 0$ is unknown and is to be estimated.

Show that $\operatorname{E}[X_i] = \theta/2$, and hence that $\hat{\theta} = 2\overline{X}$ is an unbiased estimator of $\theta$.
Now consider the estimator $\tilde{\theta} = \max(X_1, X_2, \dots, X_n)$. In Chap. 13, we see that $\displaystyle \operatorname{E}[\tilde{\theta}] = \frac{n}{n + 1}\theta$. Show that $\tilde{\theta}$ is biased. Find the bias, and state its direction.
Find a simple constant multiple of $\tilde{\theta}$ that is unbiased.
For large $n$, which estimator would you prefer: $\hat{\theta} = 2\overline{X}$ or the adjusted $\tilde{\theta}$? Give a brief intuitive reason.
Verify your answers by simulation in R using $\theta = 5$ and $n = 10$. Generate $10\,000$ samples and compare the two unbiased estimators $\hat{\theta} = 2\overline{X}$ and $\displaystyle \frac{n + 1}{n}\max(X_1, \dots, X_n)$ in terms of bias and variability.

Exercise 12.47 Let $X_1, X_2, \dots, X_n$ be a random sample from an $\text{Exponential}(\lambda)$ distribution.

Find the median $M$ for an $\text{Exponential}(\lambda)$ distribution.
To estimate the median, consider the estimator $\widehat{M} = \overline{X}$ (i.e., the sample mean). Find $\operatorname{E}[\widehat{M}]$.
Show that $\widehat{M} = \overline{X}$ is a biased estimator of $M$, and find the bias.
Suggest a simple adjustment to $\overline{X}$ that produces an unbiased estimator of $M$. Verify your answer by showing the adjusted estimator is unbiased.

Exercise 12.48 Let $T \sim t_\nu$.

Using R, find $\Pr(T > 2)$ for $\nu = 5, 10, 30, 100$ and $\nu \to \infty$ (i.e., the standard normal distribution). Arrange your results in a table and comment on the pattern.
The $t$-distribution is said to have heavier tails than the normal distribution. Explain what this means in terms of your results in Part 1.
Find the value $t^*$ such that $\Pr(-t^* < T < t^*) = 0.95$ for $\nu = 5, 10, 30$ and the standard normal distribution. Comment on how $t^*$ changes with $\nu$.

Exercise 12.49 Let $F \sim F_{\nu_1, \nu_2}$.

Using R, sketch the PDF of $F_{\nu_1, \nu_2}$ for the four combinations $\nu_1, \nu_2 \in \{5, 30\}$. Comment on how the shape changes with the degrees of freedom.
If $F \sim F_{\nu_1, \nu_2}$, show that $1/F \sim F_{\nu_2, \nu_1}$. (Hint: use the definition of the $F$-distribution in terms of chi-squared random variables.)
Use the result in Part 2 to show that \[ \Pr(F_{\nu_1,\nu_2} \leq x) = \Pr\!\left(F_{\nu_2,\nu_1} \geq \frac{1}{x}\right). \]
Verify Part 3 numerically in R by confirming that $\Pr(F_{5,10} \leq 0.4) = \Pr(F_{10,5} \geq 2.5)$.
