29 Confidence interval for a proportion

Author

Karl Gregory

We now consider constructing a confidence interval for the probability (or “proportion”) \(p\) based on a random sample \(X_1,\dots,X_n\) drawn from the \(\text{Bernoulli}(p)\) distribution. There are several ways in which this can be done, but we will first consider constructing a confidence interval based on the central limit result given in Proposition 22.2, which gives \[ \frac{\hat p_n - p}{\sqrt{p(1-p)/n}} \overset{\operatorname{approx}}{\sim}\mathcal{N}(0,1) \] for large \(n\). Based on the above, one may begin deriving a confidence interval for \(p\) in the following steps:

For any \(\alpha \in (0,1)\), write \[ P\Big(-z_{\alpha/2} \leq \frac{\hat p_n - p}{\sqrt{p(1-p)/n}} \leq z_{\alpha/2}\Big) \approx 1- \alpha \] for large \(n\).
Rearrange the above to obtain \[ P\Big( \hat p_n - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \leq p \leq \hat p_n + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\Big) \approx 1 - \alpha \] for large \(n\).
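The rearrangement in the second step can be spelled out as follows. This is only a sketch of the algebra, in which \(m = z_{\alpha/2}\sqrt{p(1-p)/n}\) is shorthand introduced here for convenience: \[ -z_{\alpha/2} \leq \frac{\hat p_n - p}{\sqrt{p(1-p)/n}} \leq z_{\alpha/2} \iff -m \leq \hat p_n - p \leq m \iff -\hat p_n - m \leq -p \leq -\hat p_n + m \iff \hat p_n - m \leq p \leq \hat p_n + m, \] where the last equivalence multiplies through by \(-1\), which reverses the inequalities.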

From the above, we see that the interval with endpoints \[ \hat p_n \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \] will contain \(p\) with probability approximately \(1-\alpha\) when \(n\) is large. While this is true, notice that although we have followed the same steps we previously used to derive confidence intervals, this time the result is an interval whose endpoints we cannot compute unless we already know the population proportion \(p\). But the reason we are building a confidence interval in the first place is that we do not know the value of \(p\)!

A quick fix, which results in a feasible interval (meaning an interval with endpoints we can compute from our data), is to simply plug \(\hat p_n\) into the expression \(p(1-p)\) appearing in the margin of error. This results in what we call the Wald-type confidence interval for \(p\):

Proposition 29.1 (Wald-type interval for a population proportion) If \(X_1,\dots,X_n \overset{\text{ind}}{\sim}\text{Bernoulli}(p)\) then for any \(\alpha \in (0,1)\) the interval with endpoints \[ \hat p_n \pm z_{\alpha/2} \sqrt{\frac{\hat p_n(1-\hat p_n)}{n}} \tag{29.1}\] will contain \(p\) with probability approximately \(1-\alpha\) for large enough \(n\).
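As an illustration, the interval in Equation 29.1 might be computed along the following lines. This is a minimal sketch, not code from these notes: the function name `wald_interval` and the example data are hypothetical, and the quantile \(z_{\alpha/2}\) is obtained from Python's standard library.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, alpha=0.05):
    """Wald-type interval (Equation 29.1) for a Bernoulli proportion."""
    p_hat = successes / n                     # sample proportion
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 quantile of N(0,1)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical data: 40 successes in 100 trials.
lower, upper = wald_interval(40, 100)
print(f"[{lower:.3f}, {upper:.3f}]")
```

The function takes the number of successes rather than the raw zeros and ones, since \(\hat p_n\) depends on the data only through that count.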

It is interesting to note that when \(X_1,\dots,X_n\) are zeros and ones, the sample variance satisfies \[ S_n^2 = \hat p_n(1-\hat p_n) \Big(\frac{n}{n-1}\Big) \approx \hat p_n (1-\hat p_n), \tag{29.2}\] where the approximation “\(\approx\)” holds when \(n\) is large (so that \(n/(n-1) \approx 1\)). The factor \(n/(n-1)\) arises because \(\sum_{i=1}^n (X_i - \bar X_n)^2 = n\hat p_n(1-\hat p_n)\) for zero-one data, while \(S_n^2\) divides this sum by \(n-1\). From this we can see that the interval in Equation 29.1 is nearly the same as \(\bar X_n \pm z_{\alpha/2}S_n /\sqrt{n}\) (recalling that \(\hat p_n = \bar X_n\)), which we considered in Proposition 28.2.
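A quick numerical check of this closeness is sketched below. The data are hypothetical, and the sample variance is assumed to use the usual \(n-1\) divisor.

```python
from math import sqrt

# Hypothetical 0/1 sample: 30 ones and 70 zeros.
x = [1] * 30 + [0] * 70
n = len(x)
p_hat = sum(x) / n  # equals the sample mean of the zeros and ones

# Sample variance with the n - 1 divisor (an assumption about the definition).
s2 = sum((xi - p_hat) ** 2 for xi in x) / (n - 1)
plug_in = p_hat * (1 - p_hat)

# Half-widths of the two intervals being compared, using z = 1.96.
z = 1.96
half_width_s = z * sqrt(s2 / n)
half_width_wald = z * sqrt(plug_in / n)

print(s2, plug_in)
print(half_width_s, half_width_wald)
```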

Now, the interval in Equation 29.1 actually performs notoriously poorly (meaning that the probability with which it covers its target \(p\) is very different from the advertised probability of \(1-\alpha\)) if \(n\) is too small, so as a rule of thumb one is cautioned to ensure that the condition \[ \min\{n \hat p_n ,n(1-\hat p_n)\} \geq 15 \tag{29.3}\] holds before trusting it. To understand this rule of thumb, note that \(np\) is the expected number of successes in the sample (the number of successes follows exactly a \(\text{Binomial}(n,p)\) distribution, of which the expected value is \(np\)), and \(n(1-p)\) is the expected number of failures (which can be seen by swapping the labels “success” and “failure”). The quantities \(n \hat p_n\) and \(n(1-\hat p_n)\) are the observed numbers of successes and failures, respectively, in the random sample; so the number of “successes” and the number of “failures” in our sample should each be at least \(15\) before we trust the Wald-type interval.
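To see how far the coverage can fall below \(1-\alpha\), one can compute the exact coverage probability of the Wald-type interval by summing binomial probabilities over the outcomes whose interval contains \(p\). The sketch below does this. The function names and the two settings \((n,p)\) are hypothetical choices made for illustration. With \(n = 20\) and \(p = 0.05\), for instance, a sample with no successes (which occurs with probability \(0.95^{20} \approx 0.36\)) yields the degenerate interval \([0,0]\), which misses \(p\).

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), computed on the log scale."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))
    return exp(log_pmf)


def wald_coverage(n, p, alpha=0.05):
    """Exact probability that the Wald-type interval (29.1) contains p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    coverage = 0.0
    for x in range(n + 1):  # every possible number of successes
        p_hat = x / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            coverage += binom_pmf(x, n, p)
    return coverage


# Small sample with few expected successes (np = 1), then a large sample.
print(wald_coverage(20, 0.05))
print(wald_coverage(1000, 0.5))
```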

An interval which in many situations performs better than the Wald-type interval (that is, achieves coverage probability closer to the advertised, or nominal, level) is the Agresti-Coull interval. This interval is constructed in what seems a rather curious fashion: One adds to one's data two “successes” and two “failures” and then computes the Wald-type interval on the augmented data:

Proposition 29.2 (Agresti-Coull interval for a population proportion) If \(X_1,\dots,X_n \overset{\text{ind}}{\sim}\text{Bernoulli}(p)\) then for any \(\alpha \in (0,1)\) the interval with endpoints \[ \tilde p_n \pm z_{\alpha/2} \sqrt{\frac{\tilde p_n(1-\tilde p_n)}{n + 4}}, \quad \text{ where }\quad \tilde p_n = \frac{\#\{\text{successes}\} + 2}{n + 4}, \tag{29.4}\] will contain \(p\) with probability approximately \(1-\alpha\) for large enough \(n\).
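The interval in Equation 29.4 might be computed as in the sketch below, which follows the recipe exactly as stated above (add two successes and two failures, then apply the Wald-type formula). The function name `agresti_coull_interval` and the example data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_interval(successes, n, alpha=0.05):
    """Interval of Equation 29.4: Wald-type formula on the augmented data."""
    n_tilde = n + 4                       # two extra successes, two extra failures
    p_tilde = (successes + 2) / n_tilde   # adjusted proportion
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - margin, p_tilde + margin


# Hypothetical data: 3 successes in 25 trials.
lower, upper = agresti_coull_interval(3, 25)
print(f"[{lower:.3f}, {upper:.3f}]")
```

Note that the interval is centered at \(\tilde p_n\) rather than at \(\hat p_n\), so its center is pulled toward \(1/2\).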

For use of this interval it is advised to ensure that the condition \[ \min\{n \hat p_n ,n(1-\hat p_n)\} \geq 5 \tag{29.5}\] holds, which, when compared to the rule of thumb in Equation 29.3, is seen to allow for smaller sample sizes; we only need to have \(5\) successes and \(5\) failures in the sample in order to trust the Agresti-Coull interval.
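Both rules of thumb compare the smaller of the observed success and failure counts to a threshold, so they can be checked with one small helper. The sketch below uses a hypothetical function name and hypothetical data.

```python
def rule_of_thumb_ok(successes, n, threshold):
    """True if the observed successes and failures are both >= threshold."""
    failures = n - successes
    return min(successes, failures) >= threshold


# Hypothetical sample: 8 successes in 60 trials.
successes, n = 8, 60
print("Wald-type condition (29.3):", rule_of_thumb_ok(successes, n, 15))
print("Agresti-Coull condition (29.5):", rule_of_thumb_ok(successes, n, 5))
```

For this sample the first check fails and the second passes, which is the kind of situation in which the Agresti-Coull interval would be preferred.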

Exercise 29.1 (Political poll) Suppose you draw a random sample of \(1000\) registered voters and \(478\) of them say they will vote for candidate A. Using \(\alpha = 0.05\) construct for the unknown proportion of registered voters who will vote for candidate A the

Wald-type interval.¹
Agresti-Coull interval.²

Exercise 29.2 Suppose you randomly sample \(50\) USC undergraduates and find that \(5\) of them hang-dry their laundry to conserve electricity. Using \(\alpha = 0.05\) construct for the unknown proportion of USC undergraduates who hang-dry their laundry the

Wald-type interval.³
Agresti-Coull interval.⁴

¹ For a \(95\%\) confidence interval, \(\alpha=0.05\) and \(z_{0.025}=1.96\). The Wald-type interval is then \[ \frac{478}{1000} \pm 1.96\sqrt{\frac{(478/1000)(1 - 478/1000)}{1000}} = 0.478 \pm 0.031 = [0.447 , 0.509]. \]
² The Agresti-Coull interval is given by \[ \frac{480}{1004} \pm 1.96\sqrt{\frac{(480/1004)(1 - 480/1004)}{1004}} = 0.478 \pm 0.031 = [0.447 , 0.509]. \] So the Wald-type and the Agresti-Coull intervals in this case have the same endpoints to three decimal places.
³ For a \(95\%\) confidence interval, \(\alpha = 0.05\), so we use \(z_{0.025} = 1.96\). The Wald-type interval is \[ \frac{5}{50} \pm 1.96\sqrt{\frac{(5/50)(1 - 5/50)}{50}} = [0.017 , 0.183]. \]
⁴ The Agresti-Coull interval is \[ \frac{7}{54} \pm 1.96\sqrt{\frac{(7/54)(1 - 7/54)}{54}} = [0.040 , 0.219]. \] The Wald-type and Agresti-Coull intervals in this example are very different. It is better here to use the Agresti-Coull interval: with only \(5\) successes, the sample fails the rule of thumb in Equation 29.3 for the Wald-type interval but satisfies the condition in Equation 29.5.