10 Hypergeometric distribution

Author

Karl Gregory

Here we consider another distribution, called the hypergeometric distribution, which arises when we draw a random sample from a population of finite size. (Recall that one of the main goals of statistics is to study what we can learn about a population based on the values in a random sample.) Let us suppose that we have a population with \(N\) elements (people, critters, financial transactions, experimental units, etc.), and we draw a sample of size \(n\). Further, consider the particular case in which we are only interested in a binary (taking two values) trait of the members of the population, so that each member may be labeled as a “success” or a “failure.” Then, when we draw our sample of size \(n\), some number of its members will be successes, and the rest will be failures. If the goal is to learn the proportion of successes in the population, we will need to understand how the number of successes in our sample will behave. More precisely, letting \(X\) be the number of successes in our sample of size \(n\) drawn from a population of size \(N\), and supposing there are \(M\) successes in the population, we would like to know the probability distribution of \(X\).

Here is an example.

Exercise 10.1 (Finite population of smokers and non-smokers) Suppose we draw a sample of size \(5\) from a population with \(100\) people of whom \(10\) are smokers, letting \(X\) be the number of smokers in the sample. We can tabulate the probability distribution of \(X\) by noting that for each \(x \in \mathcal{X}= \{0,1,\dots,5\}\) we have the expression \[ P(X = x) = \frac{{ 10 \choose x}{100 - 10 \choose 5 - x}}{{100 \choose 5}}. \] We obtain the above as follows: The denominator is equal to the total number of ways we can draw \(5\) people without replacement from the population of size \(100\). The numerator is equal to the number of ways we can draw \(x\) people from among the \(10\) smokers and \(5-x\) people from among the \(100 - 10\) non-smokers. Evaluating the expression for all \(x\) in the support results in the table \[ \begin{array}{c|cccccc} x & 0 & 1 & 2 &3 &4 & 5\\ \hline P(X = x) & 0.5838& 0.3394 & 0.0702 & 0.0064 & 0.0003 & 0.0000\\ P(X \leq x) &0.5838 & 0.9231& 0.9934 & 0.9997 & 1.0000 & 1.0000 \end{array} \] after rounding to four decimal places. We can obtain the values in the table above with the following R code:

x <- 0:5
N <- 100
n <- 5
M <- 10
px <- choose(M,x)*choose(N - M,n - x)/choose(N,n) # probabilities
Px <- cumsum(px) # cumulative probabilities
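
As an informal check on the table, one can also simulate the sampling scheme directly. The sketch below is an illustration only: it assumes a population vector coded 1 for smokers and 0 for non-smokers, and the seed and the number of replications are arbitrary choices. It compares the simulated relative frequencies with the exact probabilities from the counting formula.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
N <- 100; n <- 5; M <- 10
pop <- c(rep(1, M), rep(0, N - M))   # 1 = smoker, 0 = non-smoker (assumed coding)
reps <- 20000                        # number of simulated samples (arbitrary)

# draw n people without replacement and count the smokers, reps times
xsim <- replicate(reps, sum(sample(pop, n)))

# relative frequency of each value 0,...,n among the simulated counts
phat <- as.numeric(table(factor(xsim, levels = 0:n))) / reps

# exact probabilities from the counting formula
px <- choose(M, 0:n) * choose(N - M, n - 0:n) / choose(N, n)

round(rbind(simulated = phat, exact = px), 4)
```

With this many replications the two rows should agree to roughly two decimal places; the agreement is only approximate because the first row is random.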

We now generalize the smokers and non-smokers example with the introduction of the hypergeometric distribution.

Proposition 10.1 (Hypergeometric probability mass function) Suppose we draw \(n\) marbles at random without replacement from a bag of \(N\) marbles, \(M\) of which are red, and let \(X\) be the number of red marbles drawn. Then \(X\) has probability mass function given by \[ p(x) = \frac{ {M \choose x}{ N - M \choose n - x }}{ {N \choose n}} \tag{10.1}\] for \(x \in \mathcal{X}= \{ \max\{n -(N-M),0\},\dots,\min\{M,n\} \}\).

A random variable having the PMF in Equation 10.1 is said to have the hypergeometric distribution, and we write \[ X \sim \text{Hypergeometric}(N,M,n). \]
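
R also provides this PMF and the corresponding CDF through the built-in functions `dhyper()` and `phyper()`. Their parameterization differs from the notation used here: the arguments are the number of successes in the population (\(M\)), the number of failures in the population (\(N - M\)), and the sample size (\(n\)). The following sketch, which reuses the numbers from the smokers example, checks that the built-in functions agree with Equation 10.1:

```r
N <- 100; M <- 10; n <- 5
x <- 0:5

# PMF computed from Equation 10.1
p_formula <- choose(M, x) * choose(N - M, n - x) / choose(N, n)

# built-in PMF: dhyper(x, m, n, k) with m = M, n = N - M, k = sample size
p_builtin <- dhyper(x, m = M, n = N - M, k = n)

# built-in CDF, with the same parameterization
P_builtin <- phyper(x, m = M, n = N - M, k = n)

all.equal(p_formula, p_builtin)           # compare the PMF values
all.equal(cumsum(p_formula), P_builtin)   # compare the cumulative probabilities
```

Note that the argument named `n` in `dhyper()` is the number of failures in the population, not the sample size, so naming the arguments explicitly helps avoid confusion.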

One may have expected the support of a hypergeometric random variable to be \(\{0,1,\dots,n\}\), since we are drawing \(n\) marbles. The reason that the support has maximum value equal to \(\min\{M,n\}\) is to accommodate the case in which there are fewer than \(n\) red marbles in the bag (in which case one cannot draw \(n\) red marbles, but can draw only as many as \(M\) red marbles). Likewise, the reason the support has minimum value equal to \(\max\{n - (N-M),0\}\) is that there may be fewer than \(n\) non-red marbles in the bag (in which case one can draw at most \(N-M\) non-red marbles or, equivalently, at least \(n - (N - M)\) red marbles).
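
To see the lower bound at work, consider a hypothetical bag (the numbers are chosen only for illustration) with \(N = 10\) marbles, \(M = 7\) of them red, from which \(n = 5\) are drawn. Only \(3\) marbles are non-red, so at least \(2\) of the \(5\) drawn must be red. The sketch below computes the endpoints of the support and evaluates the formula in Equation 10.1 at every value in \(\{0,1,\dots,n\}\); it relies on the fact that R's `choose()` returns \(0\) when asked to choose more items than are available, so values outside the support receive probability zero.

```r
N <- 10; M <- 7; n <- 5          # illustrative numbers, not from the text

lo <- max(n - (N - M), 0)        # smallest possible number of red marbles
hi <- min(M, n)                  # largest possible number of red marbles
c(lo, hi)

# evaluate the formula on all of 0,...,n, including values outside the support
x <- 0:n
px <- choose(M, x) * choose(N - M, n - x) / choose(N, n)
names(px) <- x
px

# the probabilities on the support lo,...,hi already sum to one
sum(px[x >= lo & x <= hi])
```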

Proposition 10.2 (Hypergeometric mean and variance) For a random variable \(X\) having the \(\text{Hypergeometric}(N,M,n)\) distribution we have \[ \begin{align*} \mathbb{E}X &= n\frac{M}{N} \\ \operatorname{Var}X &= n\frac{M}{N}\Big(1 - \frac{M}{N}\Big)\frac{N-n}{N-1}. \end{align*} \]

The above expressions for \(\mathbb{E}X\) and \(\operatorname{Var}X\) come from applying Proposition 8.3 with the hypergeometric PMF.
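
As a numerical sanity check (not a proof), the sketch below computes the mean and variance directly from the PMF, as \(\sum_x x\,p(x)\) and \(\sum_x (x - \mathbb{E}X)^2 p(x)\), and compares them with the closed-form expressions in Proposition 10.2. It reuses the numbers from the smokers example.

```r
N <- 100; M <- 10; n <- 5

# support and PMF from Equation 10.1
x <- max(n - (N - M), 0):min(M, n)
px <- choose(M, x) * choose(N - M, n - x) / choose(N, n)

# mean and variance as weighted sums over the support
EX_sum <- sum(x * px)
VarX_sum <- sum((x - EX_sum)^2 * px)

# closed-form expressions from Proposition 10.2
EX_formula <- n * M / N
VarX_formula <- n * (M / N) * (1 - M / N) * (N - n) / (N - 1)

c(EX_sum, EX_formula)        # the two means
c(VarX_sum, VarX_formula)    # the two variances
```

The factor \((N-n)/(N-1)\) is at most \(1\) whenever \(n \geq 1\), so the variance is never larger than \(n\frac{M}{N}\big(1 - \frac{M}{N}\big)\).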

Exercise 10.2 (Twenty-eight teeth) Suppose that in a class of \(20\) students, \(12\) of the students have \(28\) teeth. Consider drawing a sample of \(6\) students and letting \(X\) be the number of students among the \(6\) who have \(28\) teeth.

Tabulate the probability distribution of \(X\). Include cumulative probabilities.¹
Give the expected value and standard deviation of \(X\).²

¹ We can modify the code from the smokers/non-smokers example to obtain \[ \begin{array}{c|ccccccc} x & 0 & 1 & 2 &3 &4 & 5 &6\\ \hline P(X = x) & 0.001 &0.017 &0.119& 0.318& 0.358 &0.163 & 0.024\\ P(X \leq x) & 0.001 &0.018 &0.137 &0.455 &0.813 &0.976 &1.000 \end{array} \]
² We obtain \(\mathbb{E}X = 3.6\) and \(\sqrt{\operatorname{Var}X} = 1.0300741\).
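
The answers to Exercise 10.2 can be reproduced along the following lines. This is one possible adaptation of the code from the smokers/non-smokers example, with \(N = 20\), \(M = 12\), and \(n = 6\); the variable names `EX` and `sdX` are chosen here for illustration.

```r
N <- 20   # students in the class
M <- 12   # students with 28 teeth
n <- 6    # sample size
x <- 0:n

px <- choose(M, x) * choose(N - M, n - x) / choose(N, n)  # probabilities
Px <- cumsum(px)                                          # cumulative probabilities
round(rbind(x, px, Px), 3)

# mean and standard deviation from Proposition 10.2
EX <- n * M / N
sdX <- sqrt(n * (M / N) * (1 - M / N) * (N - n) / (N - 1))
c(EX, sdX)
```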