14  Bootstrap for the sample mean

Throughout this section, let \(X_1,\dots,X_n\) be independent, identically distributed random variables with mean \(\mu\) and variance \(\sigma^2 < \infty\). We will consider using the bootstrap to construct a confidence interval for \(\mu\).

Let \(\bar X_n = n^{-1}\sum_{i=1}^n X_i\) and consider the quantity

\[ Y_{n} \equiv \sqrt{n}(\bar X_n - \mu) \overset{\text{d}}{\longrightarrow}\mathcal{N}(0,\sigma^2) \]

as \(n \to \infty\), where the convergence in distribution is given by the central limit theorem. Making use of the asymptotic distribution of \(Y_n\), we see that the interval with endpoints

\[ \bar X_n \pm z_{\alpha/2}\sigma/\sqrt{n} \]

will have coverage probability approaching \(1-\alpha\) as \(n \to \infty\) for any \(\alpha \in (0,1)\). However, this interval is infeasible without knowledge of the variance \(\sigma^2\). If we substitute an estimator for \(\sigma\), say the sample standard deviation \(S_n\), where \(S_n^2 = (n-1)^{-1}\sum_{i=1}^n(X_i - \bar X_n)^2\), the central limit theorem together with Slutsky's theorem gives that the interval with endpoints

\[ \bar X_n \pm z_{\alpha/2} S_n/\sqrt{n} \tag{14.1}\]

will have coverage probability approaching \(1-\alpha\) as \(n \to \infty\) (consistency of \(S_n\) for \(\sigma\) follows from the law of large numbers under the finite-variance assumption). However, the interval in Equation 14.1 can have notoriously poor coverage (that is, a coverage probability far from the nominal level \(1-\alpha\)) in some settings, even when the sample size \(n\) is moderately large. The idea of the bootstrap is to estimate the sampling distribution of the quantity \(Y_n\) under the sample size \(n\) at hand and to use this estimated sampling distribution to make inferences about \(\mu\).
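
To make the preceding discussion concrete, here is a minimal Python sketch (assuming NumPy and SciPy; the function name `normal_ci`, the exponential example, and all settings are purely illustrative) that computes the interval in Equation 14.1 and estimates its coverage by simulation for a skewed population:

```python
# Minimal sketch: compute the interval in Equation 14.1 and estimate its
# coverage by Monte Carlo for a skewed population (exponential data, mean mu = 2).
# Assumes NumPy and SciPy; all names and settings here are illustrative.
import numpy as np
from scipy import stats

def normal_ci(x, alpha=0.05):
    """Interval with endpoints  xbar +/- z_{alpha/2} * S_n / sqrt(n)  (Equation 14.1)."""
    n, xbar, s = x.size, x.mean(), x.std(ddof=1)   # S_n uses the (n-1)^{-1} normalization
    z = stats.norm.ppf(1 - alpha / 2)              # z_{alpha/2}
    return xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n)

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 15, 20000
covered = 0
for _ in range(reps):
    lo, hi = normal_ci(rng.exponential(scale=mu, size=n))
    covered += (lo <= mu <= hi)
print(covered / reps)   # often noticeably below the nominal 0.95 for small n
```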

So, for each \(n\geq 1\), define the cumulative distribution function \(G_{Y_n}\) as \[ G_{Y_n}(x) = \mathbb{P}(Y_n \leq x) \] for all \(x \in \mathbb{R}\), and denote the corresponding quantile function by \(G^{-1}_{Y_n}\). Now, if one knew the function \(G^{-1}_{Y_n}\), one could construct a confidence interval for \(\mu\) having coverage probability exactly equal to \(1-\alpha\) as

\[ \big[\bar X_n - G_{Y_n}^{-1}(1-\alpha/2)/\sqrt{n}, \bar X_n - G_{Y_n}^{-1}(\alpha/2)/\sqrt{n}\big]. \tag{14.2}\]

To see why, note that since \(Y_n \sim G_{Y_n}\), we may write \[ \mathbb{P}(G_{Y_n}^{-1}(\alpha/2) \leq \sqrt{n}(\bar X_n - \mu) \leq G_{Y_n}^{-1}(1-\alpha/2)) = 1-\alpha. \] Rearranging the event inside the probability gives \[ \mathbb{P}(\bar X_n - G_{Y_n}^{-1}(1-\alpha/2)/\sqrt{n} \leq \mu \leq \bar X_n - G_{Y_n}^{-1}(\alpha/2)/\sqrt{n}) = 1-\alpha, \] which yields the lower and upper endpoints of the interval in Equation 14.2.
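
As a sanity check on this exact-coverage claim, consider a case in which \(G_{Y_n}\) is available in closed form: if \(X_1,\dots,X_n \sim \mathcal{N}(\mu,\sigma^2)\) with \(\sigma\) known, then \(Y_n \sim \mathcal{N}(0,\sigma^2)\) exactly for every \(n\), so \(G_{Y_n}^{-1}(p) = \sigma\Phi^{-1}(p)\). A minimal simulation sketch (assuming NumPy and SciPy; all settings are illustrative):

```python
# Minimal sketch: check the exact coverage of Equation 14.2 in a case where
# G_{Y_n} is known.  For N(mu, sigma^2) data, Y_n ~ N(0, sigma^2) exactly, so
# G_{Y_n}^{-1}(p) = sigma * Phi^{-1}(p).  Assumes NumPy and SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, alpha, reps = 3.0, 2.0, 10, 0.10, 20000
q_lo = sigma * stats.norm.ppf(alpha / 2)        # G_{Y_n}^{-1}(alpha/2)
q_hi = sigma * stats.norm.ppf(1 - alpha / 2)    # G_{Y_n}^{-1}(1 - alpha/2)

covered = 0
for _ in range(reps):
    xbar = rng.normal(mu, sigma, size=n).mean()
    covered += (xbar - q_hi / np.sqrt(n) <= mu <= xbar - q_lo / np.sqrt(n))
print(covered / reps)   # close to the nominal 1 - alpha = 0.90
```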

The bootstrap provides a method for estimating the distribution \(G_{Y_n}\), which is as follows:

Definition 14.1 (Bootstrap estimator of \(G_{Y_n}\)) Conditional on \(X_1,\dots,X_n\), introduce random variables \(X_1^*,\dots,X_n^*\) such that \(X_1^*,\dots,X_n^*|X_1,\dots,X_n \overset{\text{iid}}{\sim}\hat F_n\), where \(\hat F_n\) is the empirical distribution of \(X_1,\dots,X_n\), and define the bootstrap version \(Y_n^*\) of \(Y_n = \sqrt{n}(\bar X_n - \mu)\) as

\[ Y_n^* \equiv \sqrt{n}(\bar X_n^* - \bar X_n), \]

where \(\bar X_n^* = n^{-1}\sum_{i=1}^n X_i^*\). Then the bootstrap estimator \(\hat G_{Y_n}(x)\) of \(G_{Y_n}(x)\) is defined as \[ \hat G_{Y_n}(x) = \mathbb{P}(Y_n^* \leq x | X_1,\dots,X_n) \] for all \(x \in \mathbb{R}\).
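
One way to approximate the conditional probability in Definition 14.1 is by Monte Carlo: draw many samples of size \(n\) with replacement from the observed data (that is, from \(\hat F_n\)), compute \(Y_n^*\) for each, and take the empirical proportion of draws at or below \(x\). A minimal sketch, assuming NumPy (the name `boot_cdf` and the choice of \(B\) are illustrative):

```python
# Minimal sketch: approximate \hat G_{Y_n}(x) by Monte Carlo.  Each row of idx
# indexes one sample of size n drawn with replacement from the data (i.e., from
# \hat F_n); y_star holds the corresponding draws of Y_n^*.
# Assumes NumPy; boot_cdf and B are illustrative.
import numpy as np

def boot_cdf(data, x, B=2000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n, xbar = data.size, data.mean()
    idx = rng.integers(0, n, size=(B, n))                   # B resamples of size n
    y_star = np.sqrt(n) * (data[idx].mean(axis=1) - xbar)   # B draws of Y_n^*
    return np.mean(y_star <= x)                             # proportion of draws <= x

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=30)
print(boot_cdf(data, 0.0, rng=rng))
```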

Once one has obtained the bootstrap estimate \(\hat G_{Y_n}(x)\) for all \(x \in \mathbb{R}\), one can use the corresponding quantile function to construct the bootstrap confidence interval

\[ [\bar X_n - \hat G_{Y_n}^{-1}(1-\alpha/2)/\sqrt{n}, \bar X_n - \hat G_{Y_n}^{-1}(\alpha/2)/\sqrt{n}]. \tag{14.3}\]
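
For illustration, the quantiles \(\hat G_{Y_n}^{-1}(\alpha/2)\) and \(\hat G_{Y_n}^{-1}(1-\alpha/2)\) can be approximated by the empirical quantiles of Monte Carlo draws of \(Y_n^*\), using the same resampling device as in the sketch above. A minimal sketch of the resulting interval in Equation 14.3 (again assuming NumPy; `boot_ci` and \(B\) are illustrative):

```python
# Minimal sketch: the bootstrap interval in Equation 14.3, with the quantiles of
# \hat G_{Y_n} approximated by empirical quantiles of Monte Carlo draws of Y_n^*.
# Assumes NumPy; boot_ci and B are illustrative.
import numpy as np

def boot_ci(data, alpha=0.05, B=2000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n, xbar = data.size, data.mean()
    idx = rng.integers(0, n, size=(B, n))                   # B resamples from \hat F_n
    y_star = np.sqrt(n) * (data[idx].mean(axis=1) - xbar)   # B draws of Y_n^*
    q_lo, q_hi = np.quantile(y_star, [alpha / 2, 1 - alpha / 2])
    return xbar - q_hi / np.sqrt(n), xbar - q_lo / np.sqrt(n)

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=30)
print(boot_ci(data, rng=rng))
```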

How do we actually obtain the bootstrap estimators \(\hat G_{Y_n}^{-1}(1-\alpha/2)\) and \(\hat G_{Y_n}^{-1}(\alpha/2)\) with which to construct the bootstrap interval in Equation 14.3? We describe this next.