11  Sampling with versus without replacement

Author: Karl Gregory

As in Chapter 10, consider drawing a sample of size \(n\) from a population with \(N\) members, \(M\) of which may be considered “successes” or “red marbles” and the rest “failures” or “non-red marbles” (see Proposition 10.1). As before, let \(X\) be the number of “successes” or “red marbles” one obtains in the sample of size \(n\). Here we will consider what happens to the probability distribution of \(X\) as the size of the population, \(N\), grows.

A reason for considering this is that in many situations, for example in political polling, where one is interested in how many members of a population plan to vote for a particular candidate (call them “successes”), the population size \(N\) is quite large (possibly in the millions). The probability mass function (PMF) of the hypergeometric distribution involves, for one, the quantity \({N \choose n}\). Try evaluating this for, say, \(N = 100\) and \(n=10\); it is massive, and it becomes more massive very quickly as \(N\) increases. If we put \(N\) equal to a million and \(n\) equal to, say, a thousand, we will not be able to compute \({N \choose n}\) at all. If we try to use R, we obtain

choose(1000000,1000)
[1] Inf

So, if the population size \(N\) is too large, we will not be able to compute hypergeometric probabilities by evaluating the binomial coefficients directly.
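Incidentally, if one really needed these probabilities, one could sidestep the overflow by working on the log scale with R's lchoose() function, which returns the logarithm of a binomial coefficient. Here is a minimal sketch, in which the population size, number of successes, and sample size are hypothetical values chosen only for illustration:

N <- 1000000   # hypothetical population size
M <- 400000    # hypothetical number of successes in the population
n <- 1000      # sample size
x <- 400       # number of successes in the sample
# hypergeometric PMF assembled on the log scale to avoid overflow
exp(lchoose(M, x) + lchoose(N - M, n - x) - lchoose(N, n))

In this chapter, however, we take a different route: we approximate the hypergeometric distribution with a simpler one.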

In order to make a guess at how these hypergeometric probabilities will behave when the population size \(N\) is large, we can think about whether our draws from the population are dependent or independent. The outcomes of the draws are, in fact, dependent, because if we draw a “success”, then on the next draw (since we are sampling without replacement, not allowing ourselves to draw the same population member twice), there will be one fewer “success” in the population we are drawing from. So each draw changes the composition of the population, inducing a dependence among the draws. However, if the population size \(N\) is large enough relative to the sample size \(n\), the change in the composition of the population due to drawing out \(n\) members is negligible, so that it is safe to regard the draws as independent. The next example illustrates this.

Example 11.1 (Shirt or hat) The merch department of a sports franchise will mass produce either a new shirt or a new hat. To decide which, they will randomly sample some fans during the next match and offer each one the shirt or the hat for free, taking note of which is more often chosen.

Suppose \(24830\) fans attend the next match, among whom \(9932\) would choose the shirt. Now, let \(S_1\) be the event that the first fan sampled chooses the shirt and \(S_2\) be the event that the second fan sampled chooses the shirt. We know the events are dependent, because when the second fan is sampled, the first sampled fan has been removed from consideration. However, let’s investigate the strength of the dependence between these two events.

First, we have \[ P(S_1) = \frac{9932}{24830} = 0.40. \] To obtain \(P(S_2)\) we write \[ \begin{align*} P(S_2) &= P(S_2 \cap S_1) + P(S_2 \cap S_1^c) \\ &= P(S_2 |S_1)P(S_1) + P(S_2 | S_1^c)P(S_1^c) \\ &= \frac{9931}{24829}\cdot\frac{9932}{24830} + \frac{9932}{24829}\Big(1 - \frac{9932}{24830}\Big) \\ &=0.1599903 + 0.2400097 \\ &= 0.40. \end{align*} \] In the above we have already computed \[ P(S_2 \cap S_1) = P(S_2 |S_1)P(S_1) = 0.1599903. \] Now we compare this to \[ P(S_1)P(S_2) = 0.40 \times 0.40 = 0.16. \] Since \(P(S_2 \cap S_1)\) is very nearly equal to \(P(S_1)P(S_2)\), the events \(S_1\) and \(S_2\) are very nearly independent.
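As a quick numerical check of the calculation above, we can let R carry out the arithmetic (a verification sketch; the object names are ours):

M <- 9932; N <- 24830
pS1 <- M/N                                          # P(S1)
pS2 <- (M - 1)/(N - 1)*pS1 + M/(N - 1)*(1 - pS1)    # law of total probability
pS2_and_S1 <- (M - 1)/(N - 1)*pS1                   # P(S2 and S1)
c(pS1, pS2, pS2_and_S1, pS1*pS2)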

In the shirt or hat example, imagine a sampling scheme in which we draw a fan, write down whether he or she chooses a shirt, and then place him or her back in the population, so that he or she could potentially be chosen again. Suppose we repeat this \(n\) times. This is called sampling with replacement. Under sampling with replacement, each draw can be regarded as a Bernoulli trial, and all the draws/trials are independent. Under sampling without replacement, which is what we actually do in practice, each draw can still be regarded as a Bernoulli trial, but our entire sample represents a sequence of dependent Bernoulli trials. However, if the population has a large number of successes and failures in it, the draws when sampling without replacement will be very nearly independent, as illustrated in the shirt or hat example, so that the draws can safely be regarded as independent Bernoulli trials even though they are not truly independent!
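Before making this precise, here is a minimal simulation sketch, using the population from Example 11.1, which draws many samples of size \(n = 6\) in both ways and compares the empirical distributions of the number of “successes”; the number of replications and the seed are arbitrary choices for illustration.

set.seed(1)                          # arbitrary seed, for reproducibility
N <- 24830; M <- 9932; n <- 6
pop <- c(rep(1, M), rep(0, N - M))   # 1 = would choose the shirt
x_wor <- replicate(10000, sum(sample(pop, n, replace = FALSE)))  # without replacement
x_wr <- replicate(10000, sum(sample(pop, n, replace = TRUE)))    # with replacement
round(table(factor(x_wor, levels = 0:n))/10000, 4)
round(table(factor(x_wr, levels = 0:n))/10000, 4)

The two empirical distributions come out nearly indistinguishable, anticipating the exact comparison in the next example.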

The advantage of regarding the draws under sampling without replacement as independent Bernoulli trials is that it allows one to compute probabilities using the binomial PMF instead of the hypergeometric PMF, so that in our calculations we will be able to handle large population sizes. The next example shows that the probability distribution of the number of successes drawn in the sample is very nearly the same under sampling with and without replacement, provided the population is large (or more precisely, that both the number of successes \(M\) and the number of non-successes \(N - M\) in the population are large).

Example 11.2 (Shirt or hat continued) Suppose the merch department samples \(6\) fans at the next match without replacement, and suppose, as before, that there are \(24830\) fans at the match among whom \(9932\) would choose the shirt. Defining \(X\) as the number of fans in the sample who choose the shirt, we have \[ X \sim \text{Hypergeometric}(N = 24830, M = 9932, n = 6). \] We can tabulate this probability distribution using the hypergeometric PMF (see Equation 10.1) as \[ \begin{array}{c|ccccccc} x & 0 & 1 & 2 & 3 & 4 & 5 & 6\\ \hline P(X = x) & 0.0466 & 0.1866 & 0.3111 & 0.2765 & 0.1382 & 0.0368 & 0.0041 \\ P(X \leq x) & 0.0466 & 0.2332 & 0.5443 & 0.8208 & 0.9591 & 0.9959 & 1.0000 \end{array} \] using the code

N <- 24830   # population size
M <- 9932    # number of "successes" (fans who would choose the shirt)
n <- 6       # sample size
x <- 0:6     # possible values of X
px <- choose(M,x)*choose(N-M,n-x)/choose(N,n)   # hypergeometric PMF
Px <- cumsum(px)                                # cumulative probabilities
round(px,4)
round(Px,4)
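As a sanity check, the same values can be obtained from R's built-in hypergeometric functions dhyper() and phyper(), whose second and third arguments are the numbers of successes and failures in the population (reusing the objects N, M, n, and x defined above):

round(dhyper(x, M, N - M, n), 4)   # hypergeometric PMF
round(phyper(x, M, N - M, n), 4)   # hypergeometric CDF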

Now suppose the merch department samples \(n = 6\) fans with replacement and let \(X\) be the number that choose the shirt (so the same fan may be selected more than once). Then \(X\) is equal to the number of successes in \(n\) independent Bernoulli trials, each with success probability \(p = 9932/24830 = 0.40\), so that \[ X \sim \text{Binomial}(n= 6, p = 0.40). \] This probability distribution can be tabulated using the binomial PMF (see Equation 9.2) as \[ \begin{array}{c|ccccccc} x & 0 & 1 & 2 & 3 & 4 & 5 & 6\\ \hline P(X = x) & 0.0467 & 0.1866 & 0.3110 & 0.2765 & 0.1382 & 0.0369 & 0.0041\\ P(X \leq x) & 0.0467 & 0.2333 & 0.5443 & 0.8208 & 0.9590 & 0.9959 & 1.0000 \end{array} \] using the code

n <- 6       # number of trials (fans sampled)
p <- 0.4     # success probability on each draw
x <- 0:6     # possible values of X
px <- choose(n,x)*p^x*(1 - p)^(n-x)   # binomial PMF
Px <- cumsum(px)                      # cumulative probabilities
round(px,4)
round(Px,4)
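Likewise, R's built-in binomial functions dbinom() and pbinom() reproduce these values (reusing n, p, and x from above):

round(dbinom(x, n, p), 4)   # binomial PMF
round(pbinom(x, n, p), 4)   # binomial CDF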

We see that the two probability distributions are almost exactly the same.

In light of the above discussion, when we are sampling “successes” and “failures” from a population, we will typically assume that \(X\), the number of “successes” in our sample, has the \(\text{Binomial}(n,p)\) distribution, where \(p\) is the proportion of “successes” in the population; it is safe to do this when we are sampling from a large population, which is very often the case.
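To close the loop on the polling scenario from the beginning of the chapter, here is a sketch of the payoff; the population size and number of “successes” below are hypothetical numbers chosen only for illustration.

N <- 1000000; M <- 400000   # hypothetical population size and number of successes
n <- 1000; p <- M/N
# the direct hypergeometric computation fails: the binomial coefficients
# overflow to Inf, and Inf/Inf returns NaN
choose(M, 400)*choose(N - M, n - 400)/choose(N, n)
# the binomial PMF is perfectly computable at the same point
dbinom(400, n, p)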