6 Bias and variance of multivariate kernel density estimators
Here we consider the variance and the bias of the multivariate kernel density estimator (KDE) given by \[ \hat f_{n,h}(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^n K_d(h^{-1}(\mathbf{X}_i - \mathbf{x})) \] for all \(\mathbf{x}\in \mathbb{R}^d\), which was introduced in Chapter 5.
We first consider the variance, since an upper bound on it can be obtained without any assumptions on the smoothness of the true density \(f\). We have the following result:
Proposition 6.1 (Multivariate KDE variance bound) Suppose \(f(\mathbf{x}) \leq f_\max < \infty\) for all \(\mathbf{x}\in \mathbb{R}^d\) and let \(\hat f_{n,h}(\mathbf{x})\) be a multivariate KDE with kernel \(K_d\) satisfying \(\int_{\mathbb{R}^d}K^2_d(\mathbf{u})d\mathbf{u}\leq \kappa^2 < \infty\). Then for any \(\mathbf{x}\in \mathbb{R}^d\) we have \[ \mathbb{V}\hat f_{n,h}(\mathbf{x}) \leq \frac{1}{nh^d} \kappa^2 f_\max. \]
We have \[\begin{align*} \mathbb{V}\hat f_{n,h}(\mathbf{x}) & = \mathbb{V}\Big( \frac{1}{nh^d}\sum_{i=1}^n K_d(h^{-1}(\mathbf{X}_i - \mathbf{x}))\Big) \\ & = \frac{1}{n^2h^{2d}} \sum_{i=1}^n\mathbb{V}K_d\big(h^{-1}(\mathbf{X}_i - \mathbf{x})\big)\\ &= \frac{1}{nh^{2d}}\mathbb{V}K_d\Big(h^{-1}(\mathbf{X}_1 - \mathbf{x})\Big) \\ &\leq \frac{1}{nh^{2d}}\mathbb{E}K^2_d\Big(h^{-1}(\mathbf{X}_1 - \mathbf{x})\Big) \\ &= \frac{1}{nh^{2d}} \int_{\mathbb{R}^d} K^2_d\Big(h^{-1}(\mathbf{z}- \mathbf{x})\Big) f(\mathbf{z})d\mathbf{z}\\ &= \frac{1}{nh^d} \int_{\mathbb{R}^d} K^2_d(\mathbf{u}) f(\mathbf{x}+ h\mathbf{u}) d\mathbf{u}\quad \text{ with } \mathbf{u}= h^{-1}(\mathbf{z}- \mathbf{x})\\ & \leq \frac{1}{nh^d} \kappa^2 f_\max, \end{align*}\] where the second equality uses the independence of \(\mathbf{X}_1,\dots,\mathbf{X}_n\), the third uses their identical distribution, and the substitution \(\mathbf{u}= h^{-1}(\mathbf{z}- \mathbf{x})\) has Jacobian \(|\partial \mathbf{z}/ \partial\mathbf{u}|=|h\mathbf{I}_d| = h^d\).
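As a sanity check on Proposition 6.1 (not part of the original argument), the following Python sketch simulates repeated samples from a standard normal density in \(d = 2\) dimensions, evaluates a Gaussian-product-kernel KDE at the origin, and compares the empirical variance with the bound \(\kappa^2 f_\max/(nh^d)\); the function names are chosen here for illustration only.

```python
import numpy as np

# Monte Carlo sanity check of Proposition 6.1 for a standard normal density in
# d dimensions and a Gaussian product kernel (illustrative names throughout).

rng = np.random.default_rng(0)
d, n, h = 2, 500, 0.5
x0 = np.zeros(d)          # evaluation point
n_reps = 2000             # number of simulated datasets

def gauss_kernel(u):
    """Product Gaussian kernel K_d(u) = prod_j phi(u_j)."""
    return np.prod(np.exp(-u**2 / 2) / np.sqrt(2 * np.pi), axis=-1)

def kde_at(X, x, h):
    """Multivariate KDE (n h^d)^{-1} sum_i K_d((X_i - x) / h) at a single point x."""
    return gauss_kernel((X - x) / h).sum() / (len(X) * h**d)

estimates = np.array([
    kde_at(rng.standard_normal((n, d)), x0, h) for _ in range(n_reps)
])

kappa2 = (4 * np.pi) ** (-d / 2)   # int K_d^2(u) du for the Gaussian product kernel
f_max = (2 * np.pi) ** (-d / 2)    # sup of the standard normal density
bound = kappa2 * f_max / (n * h**d)

print(f"empirical variance:    {estimates.var(ddof=1):.2e}")
print(f"Proposition 6.1 bound: {bound:.2e}")
```

The empirical variance should fall below the bound, which is not tight here because \(f(\mathbf{x}+ h\mathbf{u})\) is replaced by \(f_\max\) in the final step of the proof.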
As before, bounding the bias requires that we put some assumptions on the smoothness of the true density \(f\). In the univariate setting we began our analysis of the bias by establishing the identity in Equation 2.1. Similarly, in the multivariate setting we have \[ \mathbb{E}\hat f_{n,h}(\mathbf{x}) - f(\mathbf{x}) = \int_{\mathbb{R}^d} K_d(\mathbf{u})[f(\mathbf{x}+ h \mathbf{u}) - f(\mathbf{x})]d\mathbf{u} \tag{6.1}\] for all \(\mathbf{x}\in \mathbb{R}^d\), provided \(\int_{\mathbb{R}^d}K_d(\mathbf{u})d\mathbf{u}= 1\).
We have \[\begin{align*} \mathbb{E}\hat f_{n,h}(\mathbf{x}) - f(\mathbf{x}) &= \mathbb{E}\frac{1}{nh^d}\sum_{i=1}^n K_d(h^{-1}(\mathbf{X}_i - \mathbf{x})) - f(\mathbf{x}) \\ &= \frac{1}{h^d} \mathbb{E}K_d(h^{-1}(\mathbf{X}_1 - \mathbf{x})) - f(\mathbf{x})\\ &= \frac{1}{h^d} \int_{\mathbb{R}^d} K_d(h^{-1}(\mathbf{z}- \mathbf{x}))f(\mathbf{z})d\mathbf{z}- f(\mathbf{x})\\ &= \int_{\mathbb{R}^d} K_d(\mathbf{u})f(\mathbf{x}+ h \mathbf{u})d\mathbf{u}- f(\mathbf{x}) \quad \text{ with } \mathbf{u}= h^{-1}(\mathbf{z}- \mathbf{x})\\ &=\int_{\mathbb{R}^d} K_d(\mathbf{u})[f(\mathbf{x}+ h \mathbf{u}) - f(\mathbf{x})] d \mathbf{u}, \end{align*}\] where the last equality follows from the assumption that \(\int_{\mathbb{R}^d}K_d(\mathbf{u})d\mathbf{u}= 1\).
To control the bias, we will make an assumption governing how different the value of \(f(\mathbf{x}+ h\mathbf{u})\) may be from that of \(f(\mathbf{x})\) for small values of \(h > 0\). To do so, we will introduce a multivariate class of Hölder functions. Defining such functions will require so-called multi-index notation, which we introduce next.
Definition 6.1 (Multi-index notation) For a \(d\times 1\) vector of real numbers \(\mathbf{x}= (x_1,\dots,x_d)^T\) and a vector of nonnegative integers (a multi-index) \(\boldsymbol{\upsilon}= (\upsilon_1,\dots,\upsilon_d)^T\), let \(\mathbf{x}^{\boldsymbol{\upsilon}} = \prod_{j=1}^dx_j^{\upsilon_j}\), \(|\mathbf{x}|^{\boldsymbol{\upsilon}} = \prod_{j=1}^d|x_j|^{\upsilon_j}\), \(|\boldsymbol{\upsilon}| =\sum_{j=1}^d\upsilon_j\), and \(\boldsymbol{\upsilon}! = \prod_{j=1}^d \upsilon_j!\). Moreover define \(\mathbf{D}^\boldsymbol{\upsilon}\) as \[ \mathbf{D}^{\boldsymbol{\upsilon}} = \frac{\partial^{\boldsymbol{\upsilon}}}{\partial \mathbf{x}^\boldsymbol{\upsilon}}= \frac{\partial^{|\boldsymbol{\upsilon}|}}{\partial x_1^{\upsilon_1}\cdots \partial x_d^{\upsilon_d}}. \]
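The following short Python helpers (with names chosen here for illustration only) make the notation of Definition 6.1 concrete; the last line evaluates the combinatorial constant \(\sum_{|\boldsymbol{\upsilon}| = m} 1/\boldsymbol{\upsilon}!\) that will appear in the bias bound below.

```python
import math
from itertools import product
import numpy as np

# Helpers illustrating the multi-index notation of Definition 6.1
# (function names are ad hoc, chosen for this sketch only).

def x_pow(x, v):
    """x^v = prod_j x_j^{v_j}."""
    return float(np.prod(np.asarray(x, dtype=float) ** np.asarray(v)))

def abs_x_pow(x, v):
    """|x|^v = prod_j |x_j|^{v_j}."""
    return float(np.prod(np.abs(np.asarray(x, dtype=float)) ** np.asarray(v)))

def multi_factorial(v):
    """v! = prod_j v_j!."""
    return math.prod(math.factorial(vj) for vj in v)

def multi_indices(d, m):
    """All multi-indices v in {0, 1, 2, ...}^d with |v| = sum_j v_j = m."""
    return [v for v in product(range(m + 1), repeat=d) if sum(v) == m]

# Example: the constant sum_{|v| = m} 1/v! for d = 2 and m = 2.
print(sum(1 / multi_factorial(v) for v in multi_indices(2, 2)))  # 0.5 + 1.0 + 0.5 = 2.0
```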
Now we are ready to define a multivariate class of Hölder functions.
Definition 6.2 (Multivariate Hölder class) For a subset \(T \subset \mathbb{R}^d\) and constants \(C \geq 0\), \(\alpha \in (0,1]\), and \(m\) a nonnegative integer, the Hölder class of \(d\)-dimensional functions \(\text{Hölder}_d(C,m,\alpha,T)\) is the set of functions \(f:\mathbb{R}^d \to \mathbb{R}\) having all partial derivatives up to order \(m\) and satisfying \[ |\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x}) -\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x}') | \leq C \|\mathbf{x}- \mathbf{x}'\|_2^\alpha \quad \text{ for all } \quad \mathbf{x}, \mathbf{x}' \in T \] for all vectors of nonnegative integers \(\boldsymbol{\upsilon}\) such that \(|\boldsymbol{\upsilon}|= m\).
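As an empirical spot-check of membership in this class (an illustration, not a proof, and not taken from the text): the standard normal density in \(d\) dimensions should belong to \(\text{Hölder}_d(C,1,1,\mathbb{R}^d)\) for some finite \(C\), since its first partial derivatives are Lipschitz. The sketch below samples random pairs of points and confirms that the observed Hölder ratios stay bounded.

```python
import numpy as np

# Empirical spot-check (illustration only) that the standard normal density in
# d dimensions satisfies the m = 1, alpha = 1 condition of Definition 6.2:
# each first partial derivative D^v f with |v| = 1 should be Lipschitz, i.e.
# |D^v f(x) - D^v f(x')| <= C ||x - x'||_2 for some finite C.

rng = np.random.default_rng(1)
d = 3

def density(x):
    """Standard normal density N(0, I_d) at x."""
    return np.exp(-np.sum(x**2, axis=-1) / 2) / (2 * np.pi) ** (d / 2)

def partials(x):
    """All first partial derivatives: d/dx_j f(x) = -x_j f(x)."""
    return -x * density(x)

ratios = []
for _ in range(20_000):
    x, xp = rng.normal(size=d), rng.normal(size=d)
    num = np.max(np.abs(partials(x) - partials(xp)))
    ratios.append(num / np.linalg.norm(x - xp))

print(f"largest observed Hölder ratio: {max(ratios):.3f}")  # stays bounded
```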
Lastly, before stating our result on the bias, we must define a \(d\)-dimensional kernel of order \(m\):
Definition 6.3 (Kernel of order \(m\)) For a positive integer \(m\), a function \(K_d: \mathbb{R}^d \to \mathbb{R}\) is a \(d\)-dimensional kernel of order \(m\) if \[ \int_{\mathbb{R}^d} K_d(\mathbf{u})d\mathbf{u}= 1 \quad \text{ and } \quad \int_{\mathbb{R}^d}\mathbf{u}^\boldsymbol{\upsilon}K_d(\mathbf{u})d\mathbf{u}= 0 \] for all \(d \times 1\) vectors \(\boldsymbol{\upsilon}\) of nonnegative integers with \(|\boldsymbol{\upsilon}| \in\{ 1,\dots,m\}\).
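As an illustration (not from the text), the product Gaussian kernel \(K_d(\mathbf{u}) = \prod_{j=1}^d \phi(u_j)\) can be checked against Definition 6.3 numerically: its moments factorize over coordinates, every first-order moment vanishes, and some second-order moments do not, so it is a kernel of order \(m = 1\) but not of order 2. A minimal Python sketch, with ad hoc function names:

```python
import numpy as np
from itertools import product

# Numerical check that the product Gaussian kernel satisfies Definition 6.3
# with m = 1 (order 1) but not with m = 2. Names are ad hoc for this sketch.

d = 3

def gauss_moment_1d(p, half_width=8.0, n_points=8001):
    """int u^p phi(u) du, approximated by a Riemann sum on [-half_width, half_width]."""
    u, du = np.linspace(-half_width, half_width, n_points, retstep=True)
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return float(np.sum(u**p * phi) * du)

def kernel_moment(v):
    """int u^v K_d(u) du factorizes across coordinates for a product kernel."""
    return float(np.prod([gauss_moment_1d(p) for p in v]))

def multi_indices(d, m):
    """All multi-indices v with |v| = m."""
    return [v for v in product(range(m + 1), repeat=d) if sum(v) == m]

print(round(kernel_moment((0,) * d), 8))                           # ~1: integrates to one
print([round(kernel_moment(v), 8) for v in multi_indices(d, 1)])   # all ~0: order at least 1
print([round(kernel_moment(v), 8) for v in multi_indices(d, 2)])   # (2,0,0) etc. give ~1: not order 2
```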
Proposition 6.2 (Multivariate KDE bias bound for a Hölder density) Suppose \(f \in \text{Hölder}_d(C,m,\alpha,\mathbb{R}^d)\) and set \(\beta = m+\alpha\). Then if \(K_d\) is a \(d\)-dimensional kernel of order \(m\) satisfying \(\int_{\mathbb{R}^d} |\mathbf{u}|^\boldsymbol{\upsilon}\|\mathbf{u}\|_2^\alpha |K_d(\mathbf{u})|d\mathbf{u}\leq \kappa_\beta < \infty\) for all \(d\times 1\) vectors \(\boldsymbol{\upsilon}\) of nonnegative integers with \(|\boldsymbol{\upsilon}|=m\), we have \[ |\mathbb{E}\hat f_{n,h}(\mathbf{x}) - f(\mathbf{x})| \leq h^\beta\Big(\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!} \Big)C \kappa_\beta \] for all \(\mathbf{x}\in \mathbb{R}^d\).
Compare this with the result in Proposition 3.1, noting that the results match in the case \(d = 1\).
According to the Taylor expansion in Proposition 38.1 we may write \[ f(\mathbf{x}+ h\mathbf{u}) = f(\mathbf{x}) + \sum_{1 \leq |\boldsymbol{\upsilon}| \leq m-1} \frac{\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x})}{\boldsymbol{\upsilon}!}(h\mathbf{u})^\boldsymbol{\upsilon}+ \sum_{|\boldsymbol{\upsilon}| = m} \frac{\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x}+\tau h\mathbf{u})}{\boldsymbol{\upsilon}!}(h\mathbf{u})^\boldsymbol{\upsilon}, \] for some \(\tau \in [0,1]\) (depending on \(\mathbf{x}\) and \(h\mathbf{u}\)), where the first sum on the right side appears only when \(m > 1\).
Now, beginning from the expression in Equation 6.1, we may use the assumption that \(K_d\) is a \(d\)-dimensional kernel of order \(m\) to write \[\begin{align*} \mathbb{E}\hat f_{n,h}(\mathbf{x}) - f(\mathbf{x}) &= \int_{\mathbb{R}^d} K_d(\mathbf{u})[f(\mathbf{x}+ h \mathbf{u}) - f(\mathbf{x})]d\mathbf{u}\\ &= \int_{\mathbb{R}^d} K_d(\mathbf{u})\sum_{|\boldsymbol{\upsilon}| = m} \frac{\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x}+\tau h\mathbf{u})}{\boldsymbol{\upsilon}!}(h\mathbf{u})^\boldsymbol{\upsilon}d\mathbf{u}\\ &= h^m\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\int_{\mathbb{R}^d} \mathbf{u}^\boldsymbol{\upsilon}K_d(\mathbf{u})[\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x}+\tau h\mathbf{u}) - \mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x})] d\mathbf{u}, \end{align*}\] where the second equality uses \(\int_{\mathbb{R}^d}\mathbf{u}^\boldsymbol{\upsilon}K_d(\mathbf{u})d\mathbf{u}= 0\) for \(1 \leq |\boldsymbol{\upsilon}| \leq m-1\) to drop the lower-order Taylor terms, and the third uses the same property with \(|\boldsymbol{\upsilon}| = m\) to subtract \(\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x})\) at no cost. From here, using the definition of the Hölder class, we have \[\begin{align*} |\mathbb{E}\hat f_{n,h}(\mathbf{x}) - f(\mathbf{x})| &\leq h^m \sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\int_{\mathbb{R}^d} |\mathbf{u}|^\boldsymbol{\upsilon}|K_d(\mathbf{u})||\mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x}+\tau h\mathbf{u}) - \mathbf{D}^\boldsymbol{\upsilon}f(\mathbf{x})| d\mathbf{u}\\ &\leq h^m \sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\int_{\mathbb{R}^d} |\mathbf{u}|^\boldsymbol{\upsilon}|K_d(\mathbf{u})| C\|\tau h \mathbf{u}\|_2^\alpha d\mathbf{u}\\ &\leq h^{m + \alpha} C\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\int_{\mathbb{R}^d} |\mathbf{u}|^\boldsymbol{\upsilon}\|\mathbf{u}\|_2^\alpha |K_d(\mathbf{u})| d\mathbf{u}\\ &\leq h^\beta C \kappa_\beta \sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}, \end{align*}\] where the third inequality uses \(\tau \leq 1\).
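As a concrete illustration of Proposition 6.2 (this special case is not from the text): when \(f\) is the standard normal density and \(K_d\) is the Gaussian product kernel, the smoothed density \(\mathbb{E}\hat f_{n,h}\) is available in closed form as the \(N(\mathbf{0}, (1+h^2)\mathbf{I}_d)\) density, so the bias can be evaluated exactly. The Gaussian kernel has order \(m = 1\) and \(f\) has Lipschitz first partial derivatives (\(\alpha = 1\)), so the proposition predicts \(|\mathbb{E}\hat f_{n,h}(\mathbf{x}) - f(\mathbf{x})| = O(h^\beta)\) with \(\beta = 2\). A short Python check:

```python
import numpy as np

# Exact bias of a Gaussian-product-kernel KDE when the target density is
# standard normal: E f_hat_{n,h} is the N(0, (1 + h^2) I_d) density, so the
# bias needs no simulation. We check that |bias| / h^2 stabilizes as h -> 0,
# consistent with the h^beta rate of Proposition 6.2 for beta = 2.

d = 2
x0 = np.zeros(d)

def normal_density(x, var):
    """N(0, var * I_d) density evaluated at x."""
    return np.exp(-np.sum(x**2) / (2 * var)) / (2 * np.pi * var) ** (d / 2)

for h in [0.4, 0.2, 0.1, 0.05]:
    bias = normal_density(x0, 1 + h**2) - normal_density(x0, 1.0)
    print(f"h = {h:4.2f}   |bias| = {abs(bias):.2e}   |bias| / h^2 = {abs(bias) / h**2:.3f}")
```

For \(d = 2\) and \(\mathbf{x}= \mathbf{0}\) the ratio \(|\text{bias}|/h^2\) settles near \(1/(2\pi) \approx 0.16\) as \(h\) decreases, consistent with the \(h^\beta\) rate.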
Combining the variance bound in Proposition 6.1 with the bias bound in Proposition 6.2 gives a bound on the MSE.
Proposition 6.3 (MSE of multivariate KDE under Hölder smoothness) Under the assumptions of Proposition 6.1 and Proposition 6.2 we have \[ \operatorname{MSE}\hat f_{n,h}(\mathbf{x}) \leq h^{2\beta}\Big(C \kappa_\beta\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!} \Big)^2 + \frac{1}{nh^d} \kappa^2 f_\max \tag{6.2}\] for each \(\mathbf{x}\in \mathbb{R}^d\), and the MSE-optimal bandwidth \(h_{\operatorname{opt}}= c^*n^{-1/(2\beta + d)}\) yields \[ \operatorname{MSE}\hat f_{n,h_{\operatorname{opt}}}(\mathbf{x}) \leq C^* n^{-2\beta/(2\beta + d)} \tag{6.3}\] for all \(\mathbf{x}\in \mathbb{R}^d\), where the constants \(c^*,C^* > 0\) depend on \(d\), \(f_\max\), \(C\), \(m\), \(\alpha\), \(\kappa^2\), and \(\kappa_\beta\).
The bound in Equation 6.2 follows immediately from Proposition 6.1 and Proposition 6.2. To find the MSE-optimal choice of the bandwidth, we minimize the right side of Equation 6.2 in \(h\). Setting \[\begin{align*} \frac{d}{dh}\Big(h^{2\beta} &\Big(C \kappa_\beta\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\Big)^2 + \frac{1}{nh^d} \kappa^2 f_\max\Big) \\ &= 2\beta h^{2\beta - 1}\Big(C \kappa_\beta\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\Big)^2 - \frac{d}{nh^{d+1}}\kappa^2 f_\max=0 \end{align*}\] and solving for \(h\) gives \(h_{\operatorname{opt}}\); plugging \(h_{\operatorname{opt}}\) back into Equation 6.2 then gives the bound in Equation 6.3.
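For completeness, and using a shorthand introduced here only, write \(A = C\kappa_\beta\sum_{|\boldsymbol{\upsilon}| = m} \frac{1}{\boldsymbol{\upsilon}!}\) for the constant in the bias bound. The first-order condition above then solves to \[ h_{\operatorname{opt}} = \Big(\frac{d\,\kappa^2 f_\max}{2\beta A^2}\Big)^{1/(2\beta + d)} n^{-1/(2\beta + d)}, \] so that \(c^* = \big(d\,\kappa^2 f_\max/(2\beta A^2)\big)^{1/(2\beta + d)}\), and at \(h = h_{\operatorname{opt}}\) both terms on the right side of Equation 6.2 are of order \(n^{-2\beta/(2\beta + d)}\).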
The main thing to note about the MSE bound in Proposition 6.3, and in particular the bound in Equation 6.3 obtained under the MSE-optimal choice of \(h\), is its dependence on the dimension \(d\): for larger values of \(d\) the bound approaches \(0\) more slowly as \(n \to \infty\). We will next see just how dramatically increasing the dimension can affect the variance of the multivariate kernel density estimator (note that \(d\) enters the variance bound through the factor \(1/(nh^d)\), whereas the \(h^\beta\) rate of the bias bound does not involve \(d\)). It turns out that achieving a given variance bound as the dimension \(d\) increases requires an exponentially increasing sample size \(n\). This effect of the dimension \(d\) is known as the curse of dimensionality.
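To make this concrete (the values \(\beta = 2\), \(C^* = 1\), and target MSE \(\varepsilon = 0.01\) below are arbitrary choices for illustration only): inverting the bound in Equation 6.3 shows that guaranteeing \(\operatorname{MSE} \leq \varepsilon\) requires \(n \geq (C^*/\varepsilon)^{(2\beta + d)/(2\beta)}\), which grows exponentially in \(d\).

```python
# Required sample size to guarantee MSE <= eps under the bound in Equation 6.3,
# n >= (C_star / eps)^{(2 beta + d) / (2 beta)}. The values of beta, C_star,
# and eps below are arbitrary and serve only to illustrate the growth in d.

beta, C_star, eps = 2.0, 1.0, 0.01

for d in [1, 2, 5, 10, 20]:
    n_required = (C_star / eps) ** ((2 * beta + d) / (2 * beta))
    print(f"d = {d:2d}   required n ~ {n_required:.2e}")

# Each additional dimension multiplies the required n by (C_star / eps)^{1 / (2 beta)},
# here a factor of about 3.2.
```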