9  Mean squared error of the Nadaraya-Watson estimator

In this section we will establish a bound on the \(\operatorname{MSE}\) of the Nadaraya-Watson estimator in Definition 8.1. This will, as in the case of kernel density estimation, give insight into how the bandwidth should be chosen. We will proceed assuming \(m\) belongs to a Lipschitz class, that is, \(m \in \text{Lipschitz}(L,[0,1])\).

First, it will be useful to re-express the Nadaraya-Watson estimator as \[ \hat m_{n,h}^\operatorname{NW}(x) = \sum_{i=1}^n W_{ni}(x)Y_i, \] for each \(x\), where, for each \(i=1,\dots,n\), we define the weight \[ W_{ni}(x) = \frac{K((x_i - x)/h)}{\sum_{j=1}^nK((x_j - x)/h)}. \] The weights \(W_{ni}(x)\) will play a role in many of our results; note that \(\sum_{i=1}^n W_{ni}(x) =1\) for all \(x\).
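As a concrete illustration, here is a minimal Python sketch (not from the text) of the estimator in this weighted-sum form. The kernel choice (Epanechnikov), the simulated data, and the names `nw_weights` and `nw_estimate` are purely illustrative assumptions.

```python
import numpy as np

def K(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_weights(x, xs, h):
    """Weights W_ni(x) = K((x_i - x)/h) / sum_j K((x_j - x)/h)."""
    k = K((xs - x) / h)
    return k / k.sum()

def nw_estimate(x, xs, ys, h):
    """Nadaraya-Watson estimate: sum_i W_ni(x) * Y_i."""
    return nw_weights(x, xs, h) @ ys

rng = np.random.default_rng(0)
n, h = 200, 0.1
xs = np.sort(rng.uniform(0, 1, n))                  # design points in [0, 1]
ys = np.sin(2 * np.pi * xs) + 0.3 * rng.normal(size=n)

print(nw_weights(0.5, xs, h).sum())                 # weights sum to 1
print(nw_estimate(0.5, xs, ys, h))                  # estimate of m(0.5)
```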

Proposition 9.1 (Variance of Nadaraya-Watson estimator) Let \(K\) be a kernel such that \(0 \leq K(u) \leq K_\max < \infty\) for all \(u \in \mathbb{R}\) and suppose \(\hat f_{n,h}(x) \geq f_\min > 0\) for all \(x \in [0,1]\). Then \[ \mathbb{V}\hat m_{n,h}^\operatorname{NW}(x) \leq \frac{1}{nh} \frac{\sigma^2K_\max}{f_\min} \] for all \(x \in [0,1]\).

In the above, \(\hat f_{n,h}\) is as defined in Equation 8.1.

Note that \(K(u) \geq 0\) for all \(u \in \mathbb{R}\) implies \(W_{ni}(x) \geq 0\) for all \(x \in [0,1]\). Since the weights \(W_{ni}(x)\) are non-random and \(Y_1,\dots,Y_n\) are independent with variance \(\sigma^2\), we may write \[ \begin{align} \mathbb{V}\hat m_{n,h}^\operatorname{NW}(x) &= \sigma^2 \sum_{i=1}^n W^2_{ni}(x) \\ &\leq \sigma^2 \Big(\max_{1 \leq i \leq n} W_{ni}(x) \Big)\sum_{i=1}^nW_{ni}(x) \\ &=\frac{\sigma^2}{nh}\frac{nh}{\sum_{j=1}^nK((x_j - x)/h)}\Big(\max_{1 \leq i \leq n} K((x_i - x)/h) \Big) \\ & \leq \frac{1}{nh} \frac{\sigma^2 K_\max}{f_\min}, \end{align} \] for all \(x \in [0,1]\), where the third line uses \(\sum_{i=1}^n W_{ni}(x) = 1\) and the final inequality uses \(K((x_i - x)/h) \leq K_\max\) together with \(\frac{1}{nh}\sum_{j=1}^nK((x_j - x)/h) = \hat f_{n,h}(x) \geq f_\min\).
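As a sanity check, the following sketch evaluates the exact variance \(\sigma^2\sum_{i=1}^n W_{ni}^2(x)\) on a grid and confirms that it stays below the bound \(\sigma^2 K_\max/(nh f_\min)\). It reuses `K`, `nw_weights`, `xs`, `n`, and `h` from the sketch above; `sigma2`, `K_max`, `f_hat`, and the evaluation grid are new, illustrative names.

```python
# Continuing the sketch above: check Proposition 9.1 numerically.
sigma2 = 0.3**2                                     # error variance used to simulate ys
K_max = 0.75                                        # maximum of the Epanechnikov kernel

def f_hat(x, xs, h):
    """Kernel density estimate of the design points, as in Equation 8.1."""
    return K((xs - x) / h).sum() / (len(xs) * h)

grid = np.linspace(0, 1, 101)
f_min = min(f_hat(x, xs, h) for x in grid)

exact_var = max(sigma2 * (nw_weights(x, xs, h) ** 2).sum() for x in grid)
bound = sigma2 * K_max / (n * h * f_min)
print(exact_var, "<=", bound)                       # the bound holds at every grid point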

In the bound of Proposition 9.1, we see that the density of the covariate values \(x_i\), as captured by the function \[ \hat f_{n,h}(x) =\frac{1}{nh}\sum_{i=1}^nK\Big(\frac{x_i - x}{h}\Big) \] over \(x\in [0,1]\), plays a role: \(f_\min\) appears in the denominator of the variance bound. A small value of \(\hat f_{n,h}(x)\) indicates that there are few \(x_i\) values close to \(x\), so that only a small number of observations are available for estimating \(m(x)\), which leads to high variance. Designs \(x_1,\dots,x_n\) closer to the uniform distribution, which spreads the covariate values evenly over \([0,1]\), correspond to larger values of \(f_\min\) and lead to smaller bounds on the variance.
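To illustrate the role of \(f_\min\), the following sketch (again illustrative, reusing `f_hat`, `grid`, `sigma2`, `K_max`, `n`, and `h` from above) compares the uniform design with a design that piles most of its points into \([0.5, 1]\); the thinly covered left half drives \(f_\min\) down and inflates the variance bound.

```python
# A design clustered in [0.5, 1] leaves the left half sparsely covered.
xs_clustered = np.sort(np.concatenate([np.linspace(0, 0.5, 20),
                                       np.linspace(0.5, 1, 180)]))
for name, design in [("uniform", xs), ("clustered", xs_clustered)]:
    fmin = min(f_hat(x, design, h) for x in grid)
    print(f"{name:9s}  f_min = {fmin:.3f}  "
          f"variance bound = {sigma2 * K_max / (n * h * fmin):.4f}")
```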

The upper bound on the variance also shows us that we can reduce the variance of the Nadaraya-Watson estimator by increasing the bandwidth \(h\).
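The following lines (illustrative, reusing `sigma2`, `nw_weights`, and `xs` from the sketches above) show this effect at a single point: as \(h\) grows, the weight is spread over more observations and the exact variance \(\sigma^2\sum_{i=1}^n W_{ni}^2(x)\) typically shrinks.

```python
# Larger bandwidths average over more observations, reducing the variance.
for h_try in (0.05, 0.1, 0.2, 0.4):
    v = sigma2 * (nw_weights(0.5, xs, h_try) ** 2).sum()
    print(f"h = {h_try:.2f}   variance at x = 0.5: {v:.5f}")
```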

To consider the bias, it will be useful to re-write it as \[ \mathbb{E}\hat m_{n,h}^\operatorname{NW}(x) - m(x) = \sum_{i=1}^n W_{ni}(x)[m(x_i) - m(x)]. \tag{9.1}\]

We have \[ \begin{align} \mathbb{E}\hat m_{n,h}^\operatorname{NW}(x) - m(x) &= \mathbb{E}\sum_{i=1}^n W_{ni}(x)Y_i - m(x) \\ &= \sum_{i=1}^n W_{ni}(x)m(x_i) - m(x) \\ &= \sum_{i=1}^n W_{ni}(x)[m(x_i) - m(x)], \end{align} \] where the second equality uses \(\mathbb{E}Y_i = m(x_i)\) and the third uses \(\sum_{i=1}^n W_{ni}(x) = 1\).

To analyze the bias it is necessary to make an assumption about the smoothness of the function \(m\), in particular about how different \(m(x_i)\) may be from \(m(x)\) when \(x_i\) is in a neighborhood of \(x\). Placing \(m\) in a Lipschitz class of functions allows us to control this difference.

Proposition 9.2 (Bias of the Nadaraya-Watson estimator) Let \(K\) be a kernel with support on \([-1,1]\) such that \(K(u) \geq 0\) for all \(u \in \mathbb{R}\). Then, if \(m \in \text{Lipschitz}(L,[0,1])\), we have \[ |\mathbb{E}\hat m_{n,h}^\operatorname{NW}(x) - m(x)| \leq L h \] for all \(x \in [0,1]\).

Noting that \(W_{ni}(x) \geq 0\) and starting from Equation 9.1, we may write \[ \begin{align*} |\mathbb{E}\hat m_{n,h}^\operatorname{NW}(x) - m(x)| &\leq \sum_{i=1}^n W_{ni}(x)|m(x_i) - m(x)| \\ &\leq \sum_{i=1}^n W_{ni}(x) L |x_i - x| \\ &=\sum_{i=1}^n W_{ni}(x) L |x_i - x| \mathbb{I}(|x_i - x| \leq h)\\ &\leq \sum_{i=1}^n W_{ni}(x) L h \\ &= Lh, \end{align*} \] where the second inequality uses the Lipschitz condition, the indicator can be inserted because \(W_{ni}(x) = 0\) whenever \((x_i - x)/h \notin [-1,1]\) (since \(K\) has support only on \([-1,1]\)), and the final equality uses \(\sum_{i=1}^n W_{ni}(x) = 1\).
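A quick numeric check of the bias bound (illustrative; it reuses `nw_weights`, `xs`, `grid`, and `h` from the sketches above and takes \(m(x) = \sin(2\pi x)\), which is Lipschitz on \([0,1]\) with constant \(L = 2\pi\)):

```python
# Check Proposition 9.2: the bias, computed via Equation 9.1, never exceeds L*h.
def m(x):
    return np.sin(2 * np.pi * x)                    # Lipschitz with constant L = 2*pi

L = 2 * np.pi
worst_bias = max(abs(nw_weights(x, xs, h) @ (m(xs) - m(x))) for x in grid)
print(worst_bias, "<=", L * h)                      # L*h = 2*pi*0.1, about 0.63
```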

Proposition 9.2 shows that we can reduce the bias of the Nadaraya-Watson estimator by choosing a smaller bandwidth. Putting the variance and bias results together gives a bound on the \(\operatorname{MSE}\).

Proposition 9.3 (MSE of Nadaraya-Watson estimator under Lipschitz smoothness) Under the assumptions of Proposition 9.1 and Proposition 9.2 we have \[ \operatorname{MSE}\hat m_{n,h}^\operatorname{NW}(x) \leq h^2 L^2 + \frac{1}{nh} \frac{\sigma^2K_\max}{f_\min} \tag{9.2}\] for all \(x \in [0,1]\), with \(\operatorname{MSE}\)-optimal bandwidth given by \(h_{\operatorname{opt}}= c^* n^{-1/3}\) such that \[ \operatorname{MSE}\hat m_{n,h_{\operatorname{opt}}}^\operatorname{NW}(x) \leq C^* n^{-2/3} \tag{9.3}\] for all \(x \in [0,1]\), where \(c^*>0\) and \(C^*>0\) depend on \(f_\min\), \(L\), \(K_\max\), and \(\sigma^2\).

The bound in Equation 9.2 follows immediately from the bias-variance decomposition of the \(\operatorname{MSE}\) together with Proposition 9.1 and Proposition 9.2. To find the \(\operatorname{MSE}\)-optimal choice of the bandwidth, we minimize the right-hand side of Equation 9.2 in \(h\). Setting \[ \frac{d}{dh}\Big( h^2 L^2 + \frac{1}{nh} \frac{\sigma^2K_\max}{f_\min} \Big) = 2hL^2 - \frac{1}{nh^2}\frac{\sigma^2K_\max}{f_\min}=0 \] and solving for \(h\) gives \[ h_{\operatorname{opt}} = \Big(\frac{\sigma^2K_\max}{2L^2f_\min}\Big)^{1/3} n^{-1/3}, \] so that \(c^* = (\sigma^2K_\max/(2L^2f_\min))^{1/3}\); plugging \(h_{\operatorname{opt}}\) back into Equation 9.2 gives the bound in Equation 9.3 with \(C^* = 3\big(L\sigma^2K_\max/(2f_\min)\big)^{2/3}\).
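To make the last step concrete, here is a short sketch (illustrative, reusing `L`, `sigma2`, `K_max`, `f_min`, and `n` from the sketches above) that compares the closed-form \(h_{\operatorname{opt}}\) obtained by setting the derivative to zero with a brute-force minimization of the bound over a grid of bandwidths.

```python
# Minimize the MSE bound h^2*L^2 + sigma^2*K_max/(n*h*f_min) over h.
def mse_bound(h):
    return h**2 * L**2 + sigma2 * K_max / (n * h * f_min)

h_opt = (sigma2 * K_max / (2 * L**2 * f_min)) ** (1 / 3) * n ** (-1 / 3)

h_grid = np.linspace(0.005, 0.5, 2000)
h_best = h_grid[np.argmin(mse_bound(h_grid))]
print(h_opt, h_best)                                # agree up to grid resolution
print(mse_bound(h_opt))                             # of order n^(-2/3)
```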