13 A variance estimator in nonparametric regression
Again, suppose we observe covariate and response data \((x_1,Y_1),\dots,(x_n,Y_n)\in [0,1] \times \mathbb{R}\), where \[ Y_i = m(x_i) + \varepsilon_i, \quad i = 1,\dots,n, \] with \(m:[0,1] \to \mathbb{R}\) an unknown function and \(\varepsilon_1,\dots,\varepsilon_n\) independent random variables with mean \(0\) and variance \(\sigma^2\).
So far we have focused on estimating the unknown function \(m\). Now we will consider the estimation of the error term variance \(\sigma^2\), which will be necessary if we wish to make inferences about the function \(m\), for example via the construction of pointwise confidence intervals for \(m(x)\) for each \(x \in [0,1]\) or of a confidence band \(\{[L(x),U(x)], x\in [0,1]\}\) for the set of values \(\{m(x),x \in [0,1]\}\).
While it is typical in regression contexts to base estimates of an error term variance on residuals, which in this case would be the values \(\hat \varepsilon_i = Y_i - \hat m_n(x_i)\), \(i=1,\dots,n\), we will consider an estimator which does not depend on an estimate of the unknown function \(m\). Such an estimator may not seem possible, but if the function \(m\) changes by only small increments between adjacent design points among \(x_1,\dots,x_n\), this smoothness can be exploited to define a very simple estimator of \(\sigma^2\) based on differencing re-ordered response values.
Definition 13.1 (Differencing estimator of the error term variance) Let \((x_{(1)},Y_{(1)}),\dots,(x_{(n)},Y_{(n)})\) represent the data pairs \((x_1,Y_1),\dots,(x_n,Y_n)\) after re-ordering these such that \(x_{(1)} \leq \dots \leq x_{(n)}\). Then the differencing estimator of the error term variance is defined as \[ \hat \sigma_n^2 = \frac{1}{2(n-1)}\sum_{i=1}^{n-1}(Y_{(i+1)} - Y_{(i)})^2. \]
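As a concrete illustration, here is a minimal sketch of the differencing estimator in Python; the function name `diff_variance_estimator` and the use of NumPy are choices made here for illustration and are not part of the text.

```python
import numpy as np

def diff_variance_estimator(x, y):
    """Differencing estimator of sigma^2 from Definition 13.1.

    Sorts the data pairs by the covariate values and averages the squared
    successive differences of the responses, dividing by 2(n-1) since each
    difference of errors has variance 2 * sigma^2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)          # indices giving x_(1) <= ... <= x_(n)
    y_sorted = y[order]
    diffs = np.diff(y_sorted)      # Y_(i+1) - Y_(i), i = 1, ..., n-1
    return np.sum(diffs ** 2) / (2 * (len(y) - 1))
```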
The estimator \(\hat \sigma_n^2\), which is recommended, for example, in Rice (1984), is based on writing \[ Y_{(i+1)} - Y_{(i)} = m(x_{(i+1)}) - m(x_{(i)}) + \varepsilon_{(i+1)} - \varepsilon_{(i)}, \] and assuming that the difference \(m(x_{(i+1)}) - m(x_{(i)})\) will be small (if \(m\) is “smooth” and the design points \(x_{(i+1)}\) and \(x_{(i)}\) are close to each other), where we have re-indexed the error terms \(\varepsilon_1,\dots,\varepsilon_n\) as \(\varepsilon_{(1)},\dots,\varepsilon_{(n)}\) according to the same re-ordering of the data pairs as in Definition 13.1.
Proposition 13.1 (Consistency of differencing estimator of variance under Lipschitz smoothness) If \(\varepsilon_1,\dots,\varepsilon_n\) are identically distributed and if \(\mathbb{E}\varepsilon_1^4 < \infty\) and \(m \in \text{Lipschitz}(L,[0,1])\) then \[ \hat \sigma_n^2 \stackrel{p}{\rightarrow}\sigma^2 \] as \(n \to \infty\).
We begin by writing \[ \begin{align*} \hat \sigma_n^2 &= \frac{1}{2(n-1)}\sum_{i=1}^{n-1}[m(x_{(i+1)}) - m(x_{(i)}) + \varepsilon_{(i+1)} - \varepsilon_{(i)}]^2 \\ &= \frac{1}{2(n-1)}\sum_{i=1}^{n-1} (\varepsilon_{(i+1)} - \varepsilon_{(i)})^2+ \frac{1}{2(n-1)}\sum_{i=1}^{n-1} [m(x_{(i+1)}) - m(x_{(i)})]^2\\ & \quad \quad \quad + \frac{2}{2(n-1)}\sum_{i=1}^{n-1} (\varepsilon_{(i+1)} - \varepsilon_{(i)})[m(x_{(i+1)}) - m(x_{(i)})], \end{align*} \] where we notice the first term has expectation \(\sigma^2\), since \(\mathbb{E}(\varepsilon_{(i+1)} - \varepsilon_{(i)})^2 = 2\sigma^2\) for each \(i\), the last term has expectation \(0\), and the second term may be bounded as \[ \begin{align} \frac{1}{2(n-1)}\sum_{i=1}^{n-1} [m(x_{(i+1)}) - m(x_{(i)})]^2 &\leq \frac{1}{2(n-1)}\sum_{i=1}^{n-1} L^2(x_{(i+1)} - x_{(i)})^2 \\ &\leq \frac{1}{2(n-1)}\sum_{i=1}^{n-1} L^2|x_{(i+1)} - x_{(i)}|\\ &\leq \frac{L^2}{2(n-1)}, \end{align} \] where we have used the fact that \(x_1,\dots,x_n \in [0,1]\) implies \(|x_{(i+1)} - x_{(i)}| \leq 1\) for each \(i=1,\dots,n-1\), so that \((x_{(i+1)} - x_{(i)})^2 \leq |x_{(i+1)} - x_{(i)}|\), as well as the telescoping bound \(\sum_{i=1}^{n-1}|x_{(i+1)} - x_{(i)}| = x_{(n)} - x_{(1)} \leq 1\). It follows that \(\mathbb{E}\hat \sigma_n^2 \to \sigma^2\) as \(n \to \infty\).
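To give a sense of the size of this bias term in a concrete setting, suppose the design points are equally spaced (an assumption made here only for illustration and not required by the proposition), say \(x_i = i/n\) for \(i=1,\dots,n\). Then \(x_{(i+1)} - x_{(i)} = 1/n\) and \[ \frac{1}{2(n-1)}\sum_{i=1}^{n-1} [m(x_{(i+1)}) - m(x_{(i)})]^2 \leq \frac{1}{2(n-1)}\sum_{i=1}^{n-1} \frac{L^2}{n^2} = \frac{L^2}{2n^2}, \] which vanishes even faster than the generic bound \(L^2/(2(n-1))\) used above.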
It remains to show that the variance of \(\hat \sigma_n^2\) converges to \(0\) as \(n \to \infty\). Since the differences \(Y_{(i+1)} - Y_{(i)}\) and \(Y_{(j+1)} - Y_{(j)}\) are independent whenever \(|i-j| \geq 2\), as they share no error term, only the covariances between adjacent differences appear, and we have \[ \begin{align} \mathbb{V}\hat \sigma_n^2 &= \frac{1}{4(n-1)^2}\Big[\sum_{i=1}^{n-1} \mathbb{V}( (Y_{(i+1)} - Y_{(i)})^2) + \sum_{|i-j|=1}\mathbb{C}( (Y_{(i+1)} - Y_{(i)})^2,(Y_{(j+1)} - Y_{(j)})^2)\Big] \\ &\leq\frac{1}{4(n-1)^2}\Big[\sum_{i=1}^{n-1} \mathbb{V}( (Y_{(i+1)} - Y_{(i)})^2) + \sum_{|i-j|=1}\sqrt{\mathbb{V}((Y_{(i+1)} - Y_{(i)})^2)\mathbb{V}((Y_{(j+1)} - Y_{(j)})^2)}\Big], \end{align} \] where we have used the Cauchy-Schwarz inequality. Further, for each \(i\) we have \[ \begin{align} \mathbb{V}((Y_{(i+1)} - Y_{(i)})^2) &= \mathbb{V}((m(x_{(i+1)}) - m(x_{(i)}) + \varepsilon_{(i+1)} - \varepsilon_{(i)})^2) \\ &= \mathbb{V}( (\varepsilon_{(i+1)} - \varepsilon_{(i)})^2 + 2(\varepsilon_{(i+1)} - \varepsilon_{(i)}) (m(x_{(i+1)}) - m(x_{(i)}))) \\ &\leq 2 \mathbb{V}((\varepsilon_{(i+1)} - \varepsilon_{(i)})^2 ) + 2 \mathbb{V}( 2(\varepsilon_{(i+1)} - \varepsilon_{(i)}) (m(x_{(i+1)}) - m(x_{(i)}))) \\ & \leq 2 \mathbb{E}(\varepsilon_{(i+1)} - \varepsilon_{(i)})^4 + 8 |m(x_{(i+1)}) - m(x_{(i)})|^2 \mathbb{E}(\varepsilon_{(i+1)} - \varepsilon_{(i)})^2 \\ &\leq 2(2\mu_4 + 6 \sigma^4) + 16 \sigma^2 L^2 \end{align} \] for all \(i = 1,\dots,n-1\), where \(\mu_4 = \mathbb{E}\varepsilon_1^4\) and the second equality holds because the nonrandom term \([m(x_{(i+1)}) - m(x_{(i)})]^2\) does not affect the variance, since \[ \begin{align*} \mathbb{E}(\varepsilon_{(i+1)} - \varepsilon_{(i)})^4 &= \mathbb{E}[\varepsilon_{(i+1)}^4 - 4 \varepsilon_{(i+1)}^3 \varepsilon_{(i)} + 6 \varepsilon_{(i+1)}^2\varepsilon_{(i)}^2 - 4 \varepsilon_{(i+1)}\varepsilon_{(i)}^3 + \varepsilon_{(i)}^4] \\ &= 2\mu_4 + 6 \sigma^4, \end{align*} \] using the independence and zero mean of the error terms, and \(|m(x_{(i+1)}) - m(x_{(i)})|^2 \leq L^2|x_{(i+1)} - x_{(i)}|^2 \leq L^2\). Since there are at most \(2(n-1)\) pairs \((i,j)\) with \(|i-j|=1\), we may therefore write \[ \begin{align} \mathbb{V}\hat \sigma_n^2 &\leq \frac{1}{4(n-1)^2}[(n-1) + 2(n-1)][2(2\mu_4 + 6 \sigma^4) + 16 \sigma^2 L^2] \\ & =\frac{3}{2(n-1)}[(2\mu_4 + 6 \sigma^4) + 8 \sigma^2 L^2] \\ & \to 0 \end{align} \] as \(n \to \infty\). This completes the proof.
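For instance, if the errors are Gaussian (an assumption made here purely for illustration, not required by Proposition 13.1), then \(\mu_4 = 3\sigma^4\) and the bound specializes to \[ \mathbb{V}\hat \sigma_n^2 \leq \frac{3}{2(n-1)}\big[(2 \cdot 3\sigma^4 + 6 \sigma^4) + 8 \sigma^2 L^2\big] = \frac{18\sigma^4 + 12 \sigma^2 L^2}{n-1}, \] so the variance of the estimator shrinks at the rate \(1/n\).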
The reader is encouraged to run a simulation to assess the performance of the estimator \(\hat \sigma_n^2\) in various settings.
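As a starting point, the following Python sketch compares \(\hat \sigma_n^2\) with the true \(\sigma^2\) over repeated samples; the regression function \(m(x) = \sin(2\pi x)\), the Gaussian errors, and the sample sizes are arbitrary choices made for illustration, and the function `diff_variance_estimator` from the sketch above is assumed to be in scope.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                  # true error standard deviation
m = lambda x: np.sin(2 * np.pi * x)          # a smooth (Lipschitz) regression function

for n in [50, 200, 1000]:
    estimates = []
    for _ in range(2000):                    # Monte Carlo replications
        x = rng.uniform(0, 1, size=n)        # design points in [0, 1]
        y = m(x) + rng.normal(0, sigma, size=n)
        estimates.append(diff_variance_estimator(x, y))
    estimates = np.array(estimates)
    print(f"n = {n:5d}: mean = {estimates.mean():.4f}, "
          f"sd = {estimates.std():.4f}, true sigma^2 = {sigma**2:.4f}")
```

With these choices the Monte Carlo average of the estimates should settle near \(\sigma^2 = 0.25\), and the spread of the estimates should shrink roughly like \(1/\sqrt{n}\), in line with the variance bound in the proof.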