Definition 2.1 (Multivariate Nadaraya-Watson estimator) A multivariate version of the Nadaraya-Watson estimator is given by \[
\hat m_{n}^{\text{NW}}(\mathbf{x}) = \frac{\sum_{i=1}^n Y_i K(h^{-1}(\mathbf{x}_i - \mathbf{x}))}{\sum_{i'=1}^nK(h^{-1}(\mathbf{x}_{i'} - \mathbf{x}))}
\] for all \(\mathbf{x}\in [0,1]^p\) for some kernel function \(K:\mathbb{R}^p \to \mathbb{R}\) and bandwidth \(h > 0\).
The kernel is often of the form \(K(\mathbf{u}) = \prod_{j=1}^p G(u_j)\), where \(G\) is a univariate kernel such as \[
G(z) = \phi(z) \quad \text{ or } \quad G(z) = \frac{3}{4}(1 - z^2)\mathbf{1}(|z| \leq 1).
\]
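The product form makes the estimator straightforward to compute for any \(p\). Below is a minimal sketch of such an implementation; the function name nw_est and the choice of a Gaussian product kernel are illustrative, not taken from the text.

# Sketch: multivariate Nadaraya-Watson estimator with a product Gaussian kernel.
# X is an n x p design matrix, Y a length-n response vector, x a length-p point.
nw_est <- function(x, X, Y, h) {
  U <- sweep(X, 2, x) / h            # rows of U are (x_i - x)/h
  w <- apply(dnorm(U), 1, prod)      # product kernel K(h^{-1}(x_i - x))
  sum(w * Y) / sum(w)                # kernel-weighted average of the Y_i
}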
Here is an example of what the estimator \(\hat m_{n}^{\text{NW}}(\mathbf{x})\) can look like when \(p=2\).
Code
m <- function(x){
  u <- sqrt(sum(x^2))
  val1 <- 32/3 * u^3 * ((0 <= u) & (u < 1/4))
  val2 <- (32*u^2 - 32*u^3 - 8*u + 2/3) * ((1/4 <= u) & (u < 2/4))
  val3 <- (32*(1 - u)^2 - 32*(1 - u)^3 - 8*(1 - u) + 2/3) * ((2/4 <= u) & (u < 3/4))
  val4 <- 32/3 * (1 - u)^3 * ((3/4 <= u) & (u < 1))
  val <- val1 + val2 + val3 + val4
  return(val)
}

# simulate a design on the unit disk and noisy responses
n <- 500
X <- matrix(0, n, 2)
theta <- runif(n, 0, 2*pi)
r <- rbeta(n, 2, 2)
X[,1] <- r*cos(theta)
X[,2] <- r*sin(theta)
Y <- apply(X, 1, m) + rnorm(n, 0, .1)

Ker <- dnorm

# bivariate Nadaraya-Watson estimator with a product kernel
bivNW <- function(x, data = list(X = X, Y = Y), K, h){
  X <- data$X
  Y <- data$Y
  num <- sum(Y * K((X[,1] - x[1])/h) * K((X[,2] - x[2])/h))
  den <- sum(K((X[,1] - x[1])/h) * K((X[,2] - x[2])/h))
  val <- num / den
  return(val)
}

# evaluate the true function and the estimator on a grid for three bandwidths
grid <- 100
x1 <- x2 <- seq(-1, 1, length = grid)
hh <- c(0.05, 0.1, 0.2)
mx <- matrix(0, grid, grid)
mNWx <- array(0, dim = c(grid, grid, length(hh)))
for(i in 1:grid) for(j in 1:grid){
  mx[i,j] <- m(c(x1[i], x2[j]))
  for(k in 1:length(hh)){
    mNWx[i,j,k] <- bivNW(c(x1[i], x2[j]), data = list(X = X, Y = Y), K = dnorm, h = hh[k])
  }
}

# plot the true surface with the data, then the three estimated surfaces
par(mfrow = c(2,2), mar = c(0,0,0,0))
persp_out <- persp(x = x1, y = x2, z = mx, theta = 30, phi = 30, r = sqrt(1), d = 1,
                   box = FALSE, scale = FALSE, border = "dark gray")
points(trans3d(x = X[,1], y = X[,2], z = Y, pmat = persp_out),
       pch = 19, cex = .6, col = rgb(.545, 0, 0, .1))
for(k in 1:length(hh)){
  persp_out <- persp(x = x1, y = x2, z = mNWx[,,k], theta = 30, phi = 30, r = sqrt(1), d = 1,
                     box = FALSE, scale = FALSE, border = rgb(.545, 0, 0, .5))
}
Figure 2.1: Nadaraya-Watson estimator in the \(p=2\) setting at three bandwidth choices. The top left panel shows the true function \(m\).
The following is a result about the variance \(\mathbb{V}\hat m_n^{\text{NW}}(\mathbf{x})\).
Proposition 2.1 (Bounds on the variance of the multivariate Nadaraya-Watson estimator) Given a kernel function \(K\) and a bandwidth \(h\), suppose the design points \(\mathbf{x}_1,\dots,\mathbf{x}_n \in [0,1]^p\) are deterministic and that there exist constants \(c_1, c_2 > 0\) and \(n_0 > 0\) such that \[
0 < c_1 \leq \frac{1}{nh^p}\sum_{i=1}^n K(h^{-1}(\mathbf{x}_i - \mathbf{x})) \leq c_1^{-1}
\] and \[
0 < c_2 \leq \frac{1}{nh^p}\sum_{i=1}^n K^2(h^{-1}(\mathbf{x}_i - \mathbf{x})) \leq c_2^{-1}
\] for all \(\mathbf{x} \in [0,1]^p\) and all \(n \geq n_0\). Then for all \(n \geq n_0\), we have \[
\frac{\sigma^2}{nh^p} c_1^2 c_2 \leq \mathbb{V}\hat m_n^{\text{NW}}(\mathbf{x}) \leq \frac{\sigma^2}{nh^p} \frac{1}{c_1^2 c_2}
\tag{2.1}\]
for all \(\mathbf{x}\in [0,1]^p\), where \(\sigma^2\) denotes the common variance of the errors \(Y_i - m(\mathbf{x}_i)\).
Just as in the case of multivariate kernel density estimation, we encounter the curse of dimensionality: the dimension \(p\) enters the variance bounds through the factor \(1/(nh^p)\), so the variance can only be small when \(nh^p\) is large.
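To get a sense of the effect, consider how the factor \(nh^p\) behaves as \(p\) grows; the values \(n = 500\) and \(h = 0.1\) below are assumptions chosen only for illustration.

# The local sample size nh^p, which controls the variance, collapses as p grows.
n <- 500; h <- 0.1
sapply(1:5, function(p) n * h^p)   # 50, 5, 0.5, 0.05, 0.005

Keeping the variance small at larger \(p\) therefore requires either a much larger sample size or a much larger bandwidth, the latter at the cost of bias.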
Proof. Write \(\hat m_n^\text{NW}(\mathbf{x}) = \sum_{i=1}^n W_{ni}(\mathbf{x})Y_i\) with \[
W_{ni}(\mathbf{x}) = \frac{K(h^{-1}(\mathbf{x}_i - \mathbf{x}))}{\sum_{i'=1}^nK(h^{-1}(\mathbf{x}_{i'} - \mathbf{x}))}.
\] Then, since the weights \(W_{ni}(\mathbf{x})\) are non-random and the \(Y_i\) are independent with common variance \(\sigma^2\), we have for all \(n \geq n_0\) \[\begin{align}
\mathbb{V}\hat m_n^{\text{NW}}(\mathbf{x}) &= \sum_{i=1}^n W^2_{ni}(\mathbf{x}) \sigma^2 \\
&= \sigma^2 \sum_{i=1}^n \frac{K^2(h^{-1}(\mathbf{x}_i - \mathbf{x}))}{[\sum_{i'=1}^nK(h^{-1}(\mathbf{x}_{i'} - \mathbf{x}))]^2} \\
&= \frac{\sigma^2}{nh^p}\frac{(nh^p)^{-1}\sum_{i=1}^n K^2(h^{-1}(\mathbf{x}_i - \mathbf{x}))}{[(nh^p)^{-1}\sum_{i'=1}^n K(h^{-1}(\mathbf{x}_{i'} - \mathbf{x}))]^2}
\end{align}\] The result follows by applying the assumed bounds on the two kernel sums to the numerator and the denominator of the last expression.
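As a sanity check, the first equality in the proof can be verified numerically by fixing a design, simulating many response vectors, and comparing the Monte Carlo variance of \(\hat m_n^{\text{NW}}(\mathbf{x})\) with \(\sigma^2\sum_{i=1}^n W^2_{ni}(\mathbf{x})\). The sketch below reuses the function m from the code above; the fixed design, \(\sigma = 0.1\), \(h = 0.2\), and the evaluation point are illustrative assumptions, not from the text.

# Verify numerically that Var(m_hat(x0)) = sigma^2 * sum_i W_ni(x0)^2 for a fixed design.
set.seed(1)
n <- 500; h <- 0.2; sigma <- 0.1
Xfix <- matrix(runif(2 * n, -1, 1), n, 2)      # deterministic design, held fixed
x0 <- c(0, 0)                                  # evaluation point

w <- dnorm((Xfix[,1] - x0[1]) / h) * dnorm((Xfix[,2] - x0[2]) / h)
W <- w / sum(w)                                # weights W_ni(x0)
sigma^2 * sum(W^2)                             # exact variance from the proof

mhat <- replicate(2000, {
  Y <- apply(Xfix, 1, m) + rnorm(n, 0, sigma)  # new errors, same design
  sum(W * Y)                                   # Nadaraya-Watson estimate at x0
})
var(mhat)                                      # close to the exact value above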
In order to analyze the bias, one would need, just as in the case of multivariate kernel density estimation, to make an assumption about the smoothness of the unknown function \(m\). If one, for example, places \(m\) in a \(p\)-dimensional Hölder class \(\mathcal{H}_p(\beta,L)\) and chooses a \(p\)-dimensional kernel \(K\) of corresponding order, one could bound the magnitude of the bias by a constant times \(h^{\beta}\). We omit the details.
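For completeness, combining such a bias bound with Proposition 2.1 gives the usual tradeoff (a sketch with generic constants \(C_1, C_2 > 0\), not worked out in the text): \[
\mathbb{E}\big[\hat m_n^{\text{NW}}(\mathbf{x}) - m(\mathbf{x})\big]^2 \leq C_1^2 h^{2\beta} + \frac{C_2}{nh^p},
\] which is minimized by choosing \(h \asymp n^{-1/(2\beta + p)}\), giving a mean squared error of order \(n^{-2\beta/(2\beta + p)}\); the dimension \(p\) in the exponent is the curse of dimensionality again.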
One way to mitigate the curse of dimensionality is to make assumptions about \(m\) that place it in a smaller class of functions. A very commonly used assumption is additivity.