22 Sample proportion

Author

Karl Gregory

Here we consider the mean of a random sample when the sample consists of independent Bernoulli random variables. Recall that a Bernoulli random variable has support \(\mathcal{X}= \{0,1\}\), taking the value \(1\) with some probability \(p\) and the value \(0\) with probability \(1-p\).

Given a random sample \(X_1,\dots,X_n \overset{\text{ind}}{\sim}\text{Bernoulli}(p)\), we usually use a special notation for the sample mean. Instead of “\(\bar X_n\)”, we use “\(\hat p_n\)” and call this the sample proportion:

Definition 22.1 (Sample proportion) If \(X_1,\dots,X_n \overset{\text{ind}}{\sim}\text{Bernoulli}(p)\), then \[ \hat p_n = \frac{1}{n}(X_1 + \dots + X_n) \] is called the sample proportion.

Since the values in a Bernoulli random sample are all equal to zero or one, the sample mean is equal to the proportion of ones in the sample.
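As a quick check of this claim, the following minimal sketch simulates a small Bernoulli sample and compares its mean with the proportion of ones. The sample size, success probability, and seed are arbitrary illustrative choices, not values from the text.

```r
# Simulate a small Bernoulli sample; n and p are arbitrary illustrative choices.
set.seed(1)
n <- 20
p <- 0.3
x <- rbinom(n, size = 1, prob = p)  # each draw is 0 or 1

p_hat <- mean(x)                    # sample mean
prop_ones <- sum(x == 1) / n        # proportion of ones in the sample

# The two quantities are the same number.
all.equal(p_hat, prop_ones)
```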

Recall from Proposition 9.1 that for \(X \sim \text{Bernoulli}(p)\) we have \(\mathbb{E}X = p\) and \(\operatorname{Var}X = p(1-p)\). This allows us to give the expected value and variance of \(\hat p_n\) based on Proposition 16.1 as below:

Proposition 22.1 (Mean and variance of the sample proportion) If \(X_1,\dots,X_n \overset{\text{ind}}{\sim}\text{Bernoulli}(p)\) and \(\hat p_n = (X_1+\dots+X_n)/n\) then \(\mathbb{E}\hat p_n = p\) and \(\operatorname{Var}\hat p_n = p(1-p)/n\).
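The step behind this result can be spelled out as a short derivation, assuming Proposition 16.1 is the usual statement about the mean and variance of a sample mean. Linearity of expectation gives the mean, and independence lets the variances add: \[ \mathbb{E}\hat p_n = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i = \frac{1}{n}\cdot np = p, \qquad \operatorname{Var}\hat p_n = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}X_i = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}. \]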

We can compute probabilities concerning \(\hat p_n\) using the Binomial distribution by noting that \[ n \hat p_n = X_1 + \dots + X_n \sim \text{Binomial}(n,p), \] since we define a Binomial random variable as the number of successes in \(n\) independent Bernoulli trials with success probability \(p\).
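The relation above turns any probability statement about \(\hat p_n\) into one about a Binomial count. Below is a minimal sketch in R, where `n`, `p`, and `cutoff` are hypothetical values chosen only for illustration and `n * cutoff` is assumed to be a whole number.

```r
# Exact probabilities for the sample proportion via n * p_hat ~ Binomial(n, p).
# n, p, and cutoff are hypothetical values for illustration.
n <- 50
p <- 0.3
cutoff <- 0.2

# P(p_hat <= 0.2) = P(Y <= 10) with Y ~ Binomial(50, 0.3).
# round() guards against floating-point error in n * cutoff
# (this assumes n * cutoff is meant to be a whole number).
k <- round(n * cutoff)
prob_le <- pbinom(k, size = n, prob = p)

# P(p_hat > 0.2) is the complement.
prob_gt <- 1 - prob_le

# The same value obtained by summing the Binomial pmf directly.
prob_le_sum <- sum(dbinom(0:k, size = n, prob = p))
```

Here `pbinom` returns the cumulative probability \(P(Y \leq k)\), so strict inequalities such as \(P(Y < k)\) need `pbinom(k - 1, ...)`.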

Exercise 22.1 (Proportion of heads) Suppose you flip a fair coin one hundred times. Find the probability of observing:

Over \(60\%\) heads.¹
Between \(45\%\) and \(55\%\) heads.²

Repeat the above, supposing you flip the coin two hundred times. ³

If \(n\) is large, we may find it useful to apply the central limit theorem in Proposition 21.1 to the sample proportion. We therefore consider sending \(\hat p_n\) (which is merely \(\bar X_n\)) into the Z world with \[ Z_n = \frac{\hat p_n - p}{\sqrt{p(1-p)/n}}, \] where we subtract the mean \(p\) and divide by the standard deviation \(\sqrt{p(1-p)/n}\). So the central limit theorem applied to the sample proportion becomes:

Proposition 22.2 (Central limit theorem for the sample proportion) If \(X_1,\dots,X_n\overset{\text{ind}}{\sim}\text{Bernoulli}(p)\) and \(\hat p_n = (X_1+\dots+X_n)/n\), then \[ \frac{\hat p_n - p}{\sqrt{p(1-p)/n}} \text{ behaves more and more like } Z \sim \mathcal{N}(0,1) \] for larger and larger \(n\).

In light of the above result, one can also say that for large \(n\) we have \[ \hat p_n \overset{\operatorname{approx}}{\sim}\mathcal{N}\Big(p,\frac{p(1-p)}{n}\Big), \tag{22.1}\] which is merely Equation 21.1 in the special case of a Bernoulli random sample.
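To see how the approximation in Equation 22.1 behaves as \(n\) grows, one can compare it with the exact Binomial probability. The sketch below does this for \(P(\hat p_n \leq 0.55)\); the value of `p`, the cutoff, and the sample sizes are hypothetical choices, and no continuity correction is applied.

```r
# Compare the exact Binomial probability with the normal approximation (22.1).
# p, cutoff, and n_values are hypothetical illustrative choices.
p <- 0.5
cutoff <- 0.55
n_values <- c(20, 100, 500, 2000)

# Exact: P(p_hat <= cutoff) = P(Y <= floor(n * cutoff)), Y ~ Binomial(n, p).
# The small tolerance guards against floating-point error in n * cutoff.
exact <- pbinom(floor(n_values * cutoff + 1e-9), size = n_values, prob = p)

# Approximate: treat p_hat as N(p, p(1 - p)/n).
approx_normal <- pnorm(cutoff, mean = p, sd = sqrt(p * (1 - p) / n_values))

# One row per sample size, with the absolute gap between the two.
comparison <- data.frame(n = n_values, exact = exact, approx = approx_normal,
                         abs_error = abs(exact - approx_normal))
comparison
```

For these particular settings the `abs_error` column should shrink as `n` increases, which is the sense in which the standardized sample proportion "behaves more and more like" a standard normal.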

Exercise 22.2 (Proportion of heads continued) Suppose you flip a fair coin one hundred times. Give an approximation based on the normal distribution to the probability of observing:

Over \(60\%\) heads.⁴
Between \(45\%\) and \(55\%\) heads.⁵

We have \[ \begin{align*} P(\hat p_n > 0.60) &= P(100 \hat p_n > 100 \cdot 0.60) \\ &= P(Y > 100 \cdot 0.60), \quad Y \sim \text{Binomial}(100,1/2) \\ &= 1 - P(Y \leq 100 \cdot 0.60)\\ &= 1 - P(Y \leq 60)\\ &= 1 - \sum_{y=0}^{60} {100\choose y}(1/2)^y(1 - 1/2)^{100 - y}\\ &= 0.0176, \end{align*} \] which we can compute with 1 - pbinom(60,100,1/2).↩︎
We have \[ \begin{align*} P(0.45 < \hat p_n < 0.55) &= P( 45 < 100 \hat p_n < 55) \\ &= P(45 < Y < 55), \quad Y \sim \text{Binomial}(100,1/2) \\ &= P(Y \leq 54) - P(Y \leq 45)\\ &= 0.6318, \end{align*} \] which we can compute with pbinom(54,100,0.5) - pbinom(45,100,0.5).↩︎
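We will get \(0.0018\) using 1 - pbinom(120,200,1/2) for the first probability. For the second, \(P(0.45 < \hat p_n < 0.55) = P(90 < Y < 110) = P(Y \leq 109) - P(Y \leq 90)\) with \(Y \sim \text{Binomial}(200,1/2)\), which is about \(0.821\) using pbinom(109,200,0.5) - pbinom(90,200,0.5).↩︎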
We use the fact that \(\hat p_n \overset{\operatorname{approx}}{\sim}\mathcal{N}\big(1/2,(1/2)(1-1/2)/100\big)\) to write \[ \begin{align*} P(\hat p_n > 0.60) &= P\Big(\frac{\hat p_n - 0.50 }{\sqrt{0.50(1-0.50)/100}} >\frac{0.60 - 0.50 }{\sqrt{0.50(1-0.50)/100}} \Big) \\ &\approx P\Big(Z > \frac{0.60 - 0.50 }{\sqrt{0.50(1-0.50)/100}} \Big), \quad Z \sim \mathcal{N}(0,1)\\ &= P(Z > 2 )\\ & = 0.02275, \end{align*} \] which we can compute with 1 - pnorm(2).↩︎
By similar steps we have \[ \begin{align*} P(0.45 < \hat p_n < 0.55) &\approx P( -1 < Z < 1), \quad Z \sim \mathcal{N}(0,1) \\ &= P(Z < 1) - P(Z < -1)\\ &= 0.6827, \end{align*} \] which we can compute with pnorm(1) - pnorm(-1).↩︎