31  First tests about a normal mean

Author

Karl Gregory

Here we consider testing hypotheses about the unknown value of a population mean \(\mu\). We will consider null and alternate hypotheses taking one of these three forms:

  1. \(H_0\): \(\mu \leq \mu_0\) versus \(H_1\): \(\mu > \mu_0\).
  2. \(H_0\): \(\mu \geq \mu_0\) versus \(H_1\): \(\mu < \mu_0\).
  3. \(H_0\): \(\mu = \mu_0\) versus \(H_1\): \(\mu \neq \mu_0\).

In the above, \(\mu_0\) represents a specific value of the unknown parameter \(\mu\) used to define the null hypothesis. We may call \(\mu_0\) the null value of \(\mu\). One may formulate null and alternate hypotheses in other ways, but such formulations are rarely seen.

Consider the following example:

Example 31.1 (Golden ratio data) Recall the golden ratio data in Example 15.2:

gr <- c(1.66, 1.61, 1.62, 1.69, 1.58, 1.43, 1.66, 
        1.69, 1.58, 1.20, 1.52, 1.60, 1.55, 1.67, 
        1.77, 1.50, 1.64, 1.54, 1.40, 1.36, 1.50, 
        1.40, 1.35, 1.48, 1.64, 1.91, 1.70)
xbar <- mean(gr)

From the data we obtain the sample mean \(\bar X_n = 1.5648148\). Here it was of interest to discover whether or not our fingers grow according to the golden ratio, such that the ratio of the distances \(B/A\) has a mean equal to \(1.618\) in the human population. One could formulate the null and alternate hypotheses as \[ \text{$H_0$: }\mu = 1.618 \text{ versus } \text{$H_1$: }\mu \neq 1.618. \] If one rejected \(H_0\) as formulated here, one would conclude that the true mean of the ratio \(B/A\) was not the golden ratio but some other (unknown) value. The sample mean \(\bar X_n\) is somewhat different from the null value \(1.618\), but we must consider that every time a random sample is drawn, a different value of \(\bar X_n\) will be obtained, so we do not expect the sample mean \(\bar X_n\) to ever be exactly equal to the population mean \(\mu\). To decide whether to reject \(H_0\): \(\mu = 1.618\), we would need to consider whether \(\bar X_n = 1.5648148\) is far enough away from \(1.618\) to render the latter an implausible value for the population mean.

If one were particularly interested in showing that the ratio \(B/A\) has a mean greater than the golden ratio, one could consider \[ \text{$H_0$: }\mu \leq 1.618 \text{ versus } \text{$H_1$: }\mu > 1.618. \] If one rejected this version of \(H_0\), one would conclude that the mean of \(B/A\) was greater than the golden ratio. The data, with sample mean \(\bar X_n = 1.5648148\), actually carry evidence in support of \(H_0\): \(\mu \leq 1.618\), so there are no grounds to reject this null hypothesis.

On the other hand, one could test \[ \text{$H_0$: }\mu \geq 1.618 \text{ versus } \text{$H_1$: }\mu < 1.618 \] to see whether one might conclude that the mean of the ratio \(B/A\) is less than the golden ratio. The data, with sample mean \(\bar X_n = 1.5648148\), appear to carry some evidence against this null and in favor of this alternate hypothesis, so it may be appropriate to reject \(H_0\) and conclude that the ratio \(B/A\) has a mean less than the golden ratio.

In all of these formulations of the null and alternate hypothesis, the null value is \(\mu_0 = 1.618\).


Considering the golden ratio example, we see that we will need to come up with some kind of rule for when to reject \(H_0\) based on the sample mean \(\bar X_n\) (taking for granted that basing our decision on \(\bar X_n\) is the right thing to do, as opposed to basing it on some other quantity computed from the data). If testing \(H_0\): \(\mu = \mu_0\) versus \(H_1\): \(\mu \neq \mu_0\), it would make sense to reject \(H_0\) if \(\bar X_n\) is far enough above or below \(\mu_0\), that is lying far enough to the right or to the left of \(\mu_0\); if testing \(H_0\): \(\mu \leq \mu_0\) versus \(H_1\): \(\mu > \mu_0\), it would make sense to reject \(H_0\) when \(\bar X_n\) is far enough above or to the right of \(\mu_0\); and if testing \(H_0\): \(\mu \geq \mu_0\) versus \(H_1\): \(\mu < \mu_0\), to reject \(H_0\) when \(\bar X_n\) is far enough below or to the left of \(\mu_0\). In each case, how far is “far enough” must be carefully considered. To make a good determination of how far is far enough, we will use what we know of the sampling distribution of \(\bar X_n\).

Before entering into these details, however, it will be necessary to consider how statistical inferences may err—that is, in what ways it is possible to make an incorrect decision about rejecting or not rejecting the null hypothesis. There are two ways to err: One may commit what is called a Type I error, which is to reject \(H_0\) when it is true, or one may commit a Type II error, which is to fail to reject \(H_0\) when it is false. So the possible outcomes of a test of hypotheses are as below:

\[ \begin{array}{r|c|c} &\text{$H_0$ true}& \text{$H_0$ false}\\\hline \text{reject $H_0$} & \text{Type I error} & \text{correct decision}\\\hline \text{fail to reject $H_0$}& \text{correct decision} & \text{Type II error} \end{array} \]

For example, in the golden ratio experiment, if the true mean of \(B/A\) is in fact equal to the golden ratio \(1.618\) but we reject \(H_0\): \(\mu = 1.618\), then we will have committed a Type I error. If the true mean is, say, \(1.7\) and we fail to reject \(H_0\): \(\mu = 1.618\), we will have committed a Type II error.

Hypotheses are typically set up so that rejecting \(H_0\) corresponds to making a scientific discovery of some kind; rejecting \(H_0\) thus constitutes a “finding”. A Type I error therefore represents a spurious finding, or a discovery that is false. When an investigator fails to reject a null hypothesis, the investigator usually considers the experiment to have yielded no “findings”. A Type II error is thus a failure to discover or find the “finding”, whatever it was of interest to the investigator to show. A Type I error is generally considered the graver of the two types of errors. If an investigator makes a Type I error, he or she may report a “finding” which is false, whereas if he or she makes a Type II error, he or she will simply miss discovering that which was there to be discovered, and the world will go on as before…

The typical strategy for calibrating a rule or threshold for rejecting \(H_0\) is to decide upon the largest probability with which one is willing to allow a Type I error to occur. We will denote this probability by \(\alpha\), where \(\alpha \in (0,1)\) is generally small; a typical choice is \(\alpha = 0.05\). We will refer to \(\alpha\) as the size of a test. By “test” we mean a rule for rejecting \(H_0\).

Definition 31.1 (Size of a test) The size of a test of hypotheses is the greatest possible probability that the test will result in a Type I error.


Next we will present rejection thresholds implied by this strategy—the strategy of controlling the size—for testing hypotheses about a mean based on a random sample from a normal distribution. To decide whether to reject any of the null hypotheses 1, 2, or 3 from the top of this page, we will begin by comparing the value of \(\bar X_n\) to the null value \(\mu_0\) by computing the test statistic \[ Z_{\operatorname{test}}= \frac{\bar X_n - \mu_0}{\sigma/\sqrt{n}}. \tag{31.1}\] The value \(Z_{\operatorname{test}}\) gives the distance of \(\bar X_n\) from \(\mu_0\) in terms of a number of standard deviations. So this looks like sending \(\bar X_n\) into the Z world, but using \(\mu_0\) as the mean instead of the (unknown) true mean \(\mu\). So, by computing \(Z_{\operatorname{test}}\), we are asking how many standard deviations away \(\bar X_n\) is from \(\mu_0\), and in which direction, where the sign of \(Z_{\operatorname{test}}\) tells us whether \(\bar X_n\) was above \(\mu_0\) or below. We call \(Z_{\operatorname{test}}\) a test statistic because it is based upon this value that we will decide whether or not to reject the null hypothesis.

After computing \(Z_{\operatorname{test}}\) we will compare it to a threshold, depending on which set of hypotheses we are testing, and this comparison will tell us whether or not to reject the null hypothesis. The threshold to which we will compare our test statistic is called the critical value. The result below gives the critical values and rejection rules for the sets of hypotheses 1, 2, and 3 which will ensure that the Type I error probability will not exceed some \(\alpha \in (0,1)\). The critical values are found using the fact that if \(\mu = \mu_0\) we have \(Z_{\operatorname{test}}\sim \mathcal{N}(0,1)\).
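In R, these critical values can be obtained from the standard normal quantile function `qnorm`. A small sketch, not part of the text, using \(\alpha = 0.05\) for illustration:

```r
alpha <- 0.05
qnorm(1 - alpha)      # z_alpha:     upper-tail critical value, about 1.645
qnorm(1 - alpha / 2)  # z_{alpha/2}: two-sided critical value, about 1.960
```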

Proposition 31.1 (First tests of hypotheses for a normal mean) Let \(X_1,\dots,X_n \overset{\text{ind}}{\sim}\mathcal{N}(\mu,\sigma^2)\), where \(\sigma^2\) is known. Then the following tests have size \(\alpha\):

  1. For \(H_0\): \(\mu \leq \mu_0\) versus \(H_1\): \(\mu > \mu_0\), reject \(H_0\) if \(Z_{\operatorname{test}}> z_\alpha\).
  2. For \(H_0\): \(\mu \geq \mu_0\) versus \(H_1\): \(\mu < \mu_0\), reject \(H_0\) if \(Z_{\operatorname{test}}< -z_\alpha\).
  3. For \(H_0\): \(\mu = \mu_0\) versus \(H_1\): \(\mu \neq \mu_0\), reject \(H_0\) if \(|Z_{\operatorname{test}}| > z_{\alpha/2}\).
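The three rejection rules in Proposition 31.1 can be collected into a single R function. This is a sketch of my own; the function and argument names are not from the text:

```r
# Decide whether to reject H0 at size alpha, given a sample x, the null
# value mu0, the known sigma, and the form of the alternate hypothesis.
z_reject <- function(x, mu0, sigma, alpha = 0.05,
                     alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  n <- length(x)
  Z <- (mean(x) - mu0) / (sigma / sqrt(n))        # test statistic (31.1)
  switch(alternative,
         greater   = Z >  qnorm(1 - alpha),        # right-sided test
         less      = Z < -qnorm(1 - alpha),        # left-sided test
         two.sided = abs(Z) > qnorm(1 - alpha / 2)) # two-sided test
}
```

For instance, with the golden ratio data, `z_reject(gr, mu0 = 1.618, sigma = 0.15, alpha = 0.10, alternative = "less")` returns `TRUE`, agreeing with the left-sided decision worked out in Example 31.2.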


Tests 1, 2, and 3 in the above are referred to as the right-sided, the left-sided, and the two-sided test, respectively. The right- and left-sided tests are both called one-sided tests. For the right-sided test, with \(H_1\): \(\mu > \mu_0\), the rejection rule \(Z_{\operatorname{test}}> z_\alpha\) rejects \(H_0\): \(\mu \leq \mu_0\) when \(\bar X_n\) lies more than \(z_\alpha\) standard deviations above \(\mu_0\); for the left-sided test, with \(H_1\): \(\mu < \mu_0\), the rejection rule \(Z_{\operatorname{test}}< -z_\alpha\) rejects \(H_0\): \(\mu \geq \mu_0\) when \(\bar X_n\) lies more than \(z_\alpha\) standard deviations below \(\mu_0\); and the two-sided test, with rejection rule \(|Z_{\operatorname{test}}| > z_{\alpha/2}\), rejects \(H_0\): \(\mu = \mu_0\) when \(\bar X_n\) lies more than \(z_{\alpha/2}\) standard deviations from \(\mu_0\) in either direction.

For all three tests, one can see that if we have \(\mu = \mu_0\), so that the null hypothesis is true (true on the boundary in the case of the one-sided tests), then the probability of rejecting \(H_0\) will be exactly \(\alpha\); recall that to reject any of these null hypotheses when \(\mu = \mu_0\) would be to commit a Type I error. Thus the critical values in these tests are calibrated so that each test has size (maximum Type I error probability) equal to \(\alpha\).
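This calibration can be checked empirically. The following simulation is my own illustration, borrowing the golden ratio example's values of \(\mu_0\), \(\sigma\), and \(n\): it draws many samples with \(\mu = \mu_0\), so that \(H_0\) is true, and records how often the two-sided test rejects.

```r
# Estimate the Type I error rate of the two-sided test when mu = mu0.
set.seed(1)
mu0 <- 1.618; sigma <- 0.15; n <- 27; alpha <- 0.05
S <- 10000  # number of simulated samples
reject <- replicate(S, {
  x <- rnorm(n, mean = mu0, sd = sigma)      # H0 holds: true mean is mu0
  Z <- (mean(x) - mu0) / (sigma / sqrt(n))   # test statistic (31.1)
  abs(Z) > qnorm(1 - alpha / 2)              # two-sided rejection rule
})
mean(reject)  # proportion of Type I errors; should be close to alpha = 0.05
```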

Example 31.2 (Golden ratio data continued) Let’s test the three hypotheses

  1. \(H_0\): \(\mu \leq 1.618\) versus \(H_1\): \(\mu > 1.618\)
  2. \(H_0\): \(\mu \geq 1.618\) versus \(H_1\): \(\mu < 1.618\)
  3. \(H_0\): \(\mu = 1.618\) versus \(H_1\): \(\mu \neq 1.618\)

using \(\alpha = 0.10\), \(\alpha = 0.05\), and \(\alpha = 0.01\) under the assumption that the population standard deviation is \(\sigma = 0.15\). The first step is to compute the value of the test statistic \(Z_{\operatorname{test}}\). We obtain \[ Z_{\operatorname{test}}= \frac{1.5648148 - 1.618}{0.15/\sqrt{27}} = -1.842382. \] Using \(\alpha = 0.10\) the critical values become \(z_{0.10} = 1.282\), \(-z_{0.10} = -1.282\), and \(z_{0.10/2} = 1.645\). Therefore with \(\alpha = 0.10\) we:

  1. Fail to reject \(H_0\): \(\mu \leq 1.618\).
  2. Reject \(H_0\): \(\mu \geq 1.618\) and conclude \(\mu < 1.618\).
  3. Reject \(H_0\): \(\mu = 1.618\) and conclude \(\mu \neq 1.618\).

If we use \(\alpha =0.05\), the critical values become \(z_{0.05} = 1.645\), \(-z_{0.05} = -1.645\), and \(z_{0.05/2} = 1.96\). Therefore with \(\alpha = 0.05\) we:

  1. Fail to reject \(H_0\): \(\mu \leq 1.618\).
  2. Reject \(H_0\): \(\mu \geq 1.618\) and conclude \(\mu < 1.618\).
  3. Fail to reject \(H_0\): \(\mu = 1.618\).

If we use \(\alpha = 0.01\), the critical values become \(z_{0.01} = 2.326\), \(-z_{0.01} = -2.326\), and \(z_{0.01/2} = 2.576\). Therefore with \(\alpha = 0.01\) we:

  1. Fail to reject \(H_0\): \(\mu \leq 1.618\).
  2. Fail to reject \(H_0\): \(\mu \geq 1.618\).
  3. Fail to reject \(H_0\): \(\mu = 1.618\).
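The computations in Example 31.2 can be reproduced in R. This is a sketch; the data and the null value, \(\sigma\), and \(\alpha\) choices are all taken from the text:

```r
# Golden ratio data from Example 15.2
gr <- c(1.66, 1.61, 1.62, 1.69, 1.58, 1.43, 1.66,
        1.69, 1.58, 1.20, 1.52, 1.60, 1.55, 1.67,
        1.77, 1.50, 1.64, 1.54, 1.40, 1.36, 1.50,
        1.40, 1.35, 1.48, 1.64, 1.91, 1.70)
mu0 <- 1.618; sigma <- 0.15; n <- length(gr)

Z <- (mean(gr) - mu0) / (sigma / sqrt(n))  # test statistic (31.1)
Z  # -1.842382

# Decisions for the right-sided, left-sided, and two-sided tests
for (alpha in c(0.10, 0.05, 0.01)) {
  cat("alpha =", alpha,
      "| right:", Z > qnorm(1 - alpha),
      "| left:",  Z < -qnorm(1 - alpha),
      "| two-sided:", abs(Z) > qnorm(1 - alpha / 2), "\n")
}
```

The printed logical values match the nine decisions listed above: the left-sided test rejects at \(\alpha = 0.10\) and \(\alpha = 0.05\), the two-sided test rejects only at \(\alpha = 0.10\), and the right-sided test never rejects.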

From the above example, we see that whether or not we reject a null hypothesis can depend on the choice of \(\alpha\). Studying the critical values in the example, we see that for smaller choices of \(\alpha\) the critical values lie further from zero, such that the threshold for rejection is raised. That is, for smaller choices of \(\alpha\), the value of \(\bar X_n\) must be further away from \(\mu_0\) (in the right direction, for one-sided tests) to cause the test to reject \(H_0\). In summary, choosing smaller \(\alpha\) corresponds to requiring stronger evidence in the data for rejecting the null hypothesis.

Thus \(\alpha\) is also referred to as the significance level of a test (strictly speaking, a test is said to have significance level \(\alpha\) if its size is at most \(\alpha\), but we need not worry about this technical distinction; see page 385 of Casella and Berger (2024)). So, if we use \(\alpha = 0.05\) and we reject a null hypothesis, we will say that we have rejected the null hypothesis “at significance level \(0.05\).” Using a smaller significance level requires greater evidence for the rejection of \(H_0\).