3  Kernel density estimation of Hölder densities

We now introduce a class of functions more general than the Lipschitz functions, namely the Hölder classes, which we may think of as describing functions that are “smoother” than Lipschitz functions. The Lipschitz classes of Definition 2.1 are special cases of Hölder classes.

Definition 3.1 (Hölder class of functions) For an interval \(T\subseteq \mathbb{R}\) and constants \(C \geq 0\), \(\alpha \in (0,1]\), and \(k\) a nonnegative integer, the Hölder class of functions \(\text{Hölder}(C,k,\alpha,T)\) is the set of functions \(f:T\to \mathbb{R}\) which have \(k\) continuous derivatives on \(T\) and for which \[ |f^{(k)}(x) - f^{(k)}(x')| \leq C |x - x'|^{\alpha} \quad \text{ for all } \quad x,x' \in T. \]

If \(\alpha = 1\), then \(f \in \text{Hölder}(C,k,1,T)\) implies that the \(k\)-th derivative of \(f\) belongs to the class \(\text{Lipschitz}(C,T)\). In particular, \(\text{Hölder}(C,0,1,\mathbb{R}) = \text{Lipschitz}(C,\mathbb{R})\), so that when \(k=0\) and \(\alpha=1\) the Hölder class is exactly a Lipschitz class. We may think of larger values of \(k\) as corresponding to smoother functions.
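
As a concrete illustration of Definition 3.1, the following sketch numerically approximates the Hölder seminorm \(\sup_{x \neq x'} |f^{(k)}(x) - f^{(k)}(x')|/|x - x'|^\alpha\) over a finite grid. The helper `holder_seminorm` and the example \(f(x) = |x|^{3/2}\), whose first derivative is \(\tfrac12\)-Hölder but not Lipschitz, are illustrative choices rather than part of the text.

```python
import numpy as np

def holder_seminorm(g, grid, alpha):
    """Approximate sup_{x != x'} |g(x) - g(x')| / |x - x'|^alpha over a finite grid."""
    vals = g(grid)
    diffs = np.abs(vals[:, None] - vals[None, :])
    dists = np.abs(grid[:, None] - grid[None, :])
    mask = dists > 0
    return np.max(diffs[mask] / dists[mask] ** alpha)

# f(x) = |x|^{3/2} has f'(x) = 1.5 * sign(x) * sqrt(|x|); f' is (1/2)-Hölder but not
# Lipschitz, so f lies in a Hölder class with k = 1 and alpha = 1/2 on [-1, 1].
f_prime = lambda x: 1.5 * np.sign(x) * np.sqrt(np.abs(x))
grid = np.linspace(-1.0, 1.0, 1001)

print(holder_seminorm(f_prime, grid, alpha=0.5))  # stays bounded (about 2.1) as the grid is refined
print(holder_seminorm(f_prime, grid, alpha=1.0))  # keeps growing as the grid is refined: f' is not Lipschitz
```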

We will consider the bias of the kernel density estimator (KDE) \(\hat f_{n,h}\) of a density \(f \in \text{Hölder}(C,k,\alpha,T)\). Before proceeding, we will require one more definition.

Definition 3.2 (Kernel of order \(k\)) For a positive integer \(k\), a function \(K: \mathbb{R}\to \mathbb{R}\) is a kernel of order \(k\) if \[ \int_\mathbb{R}K(u)du = 1 \quad \text{ and } \quad \int_{\mathbb{R}}u^j K(u)du = 0 \quad \text{ for } j = 1,\dots,k. \]
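
To make Definition 3.2 concrete, here is a small numerical check of the moment conditions (assuming `scipy` is available; the helper `kernel_order` is an illustrative name, not from the text). The standard Gaussian density is a kernel of order 1, since its second moment does not vanish, while the Gaussian-based kernel \(K(u) = \tfrac{1}{2}(3 - u^2)\varphi(u)\), a common higher-order construction which takes negative values, has vanishing moments of orders 1 through 3 and is therefore a kernel of order 3 in the sense above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kernel_order(K, max_order=6, tol=1e-6):
    """Largest k such that the moments of orders 1..k of K all vanish numerically,
    after checking that K integrates to 1."""
    total, _ = quad(K, -np.inf, np.inf)
    assert abs(total - 1) < tol, "K does not integrate to 1"
    k = 0
    for j in range(1, max_order + 1):
        moment, _ = quad(lambda u: u ** j * K(u), -np.inf, np.inf)
        if abs(moment) > tol:
            break
        k = j
    return k

gauss = norm.pdf                                      # standard Gaussian kernel
higher = lambda u: 0.5 * (3 - u ** 2) * norm.pdf(u)   # Gaussian-based higher-order kernel

print(kernel_order(gauss))   # 1: the second moment of the Gaussian does not vanish
print(kernel_order(higher))  # 3: moments of orders 1, 2 and 3 all vanish
```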

If \(f\) has \(k\) continuous derivatives and if \(K\) is a kernel of order \(k\), we can use a Taylor expansion together with the above properties of the kernel to express the bias of \(\hat f_{n,h}(x)\) as \[ \mathbb{E}\hat f_{n,h}(x) - f(x) = \frac{h^k}{k!}\int_\mathbb{R}u^kK(u) \big[f^{(k)}(x + \tau(u)uh) - f^{(k)}(x)\big] du, \tag{3.1}\] where \(\tau(u) \in [0,1]\) for all \(u \in \mathbb{R}\).

Recall that Equation 2.1 gives \[ \mathbb{E}\hat f_{n,h}(x) - f(x) = \int_\mathbb{R}K(u) [f(x + uh) - f(x)] du. \]

Now consider a Taylor expansion of \(f\) around the point \(x\), evaluated at \(x + uh\). For each \(u \in \mathbb{R}\), using the Lagrange form of the remainder, we may write \[ f(x + uh) - f(x) = \left\{\begin{array}{ll} f^{(1)}(x + \tau(u)uh)uh,& k = 1\\ \displaystyle \sum_{j=1}^{k-1}\frac{f^{(j)}(x)(uh)^j}{j!} + \frac{f^{(k)}(x+ \tau(u) uh)}{k!}(uh)^k,& k > 1,\end{array}\right. \] for some \(\tau(u) \in[0,1]\).

Now, in the case \(k =1\), we have \[\begin{align*} \int_\mathbb{R}K(u) [f(x + uh) - f(x)] du &= \int_\mathbb{R}K(u) f^{(1)}(x + \tau(u)uh) uh\, du \\ &= h\int_\mathbb{R}uK(u) [f^{(1)}(x + \tau(u)uh) - f^{(1)}(x)] du, \end{align*}\] by the assumption \(\int_\mathbb{R}u K(u)du = 0\), whereby \(\int_\mathbb{R}u K(u)f^{(1)}(x)du = 0\).

In the case \(k > 1\), using the properties of a kernel of order \(k\), we have \[\begin{align*} \int_\mathbb{R}K(u) [f(x + uh) - f(x)] du &= \int_\mathbb{R}K(u)\Big[ \sum_{j=1}^{k-1}\frac{f^{(j)}(x)(uh)^j}{j!} + \frac{f^{(k)}(x+ \tau(u)uh)}{k!}(uh)^k\Big] du \\ &= \frac{h^k}{k!}\int_\mathbb{R}u^k K(u) [f^{(k)}(x + \tau(u)uh) - f^{(k)}(x)] du, \end{align*}\] since \(\int_\mathbb{R}u^jK(u)du = 0\) for \(j = 1,\dots,k-1\) makes each term of the sum vanish, and \(\int_\mathbb{R}u^kK(u)du = 0\) allows us to subtract \(f^{(k)}(x)\) inside the remaining integral. This completes the derivation of Equation 3.1.

From the expression for the bias in Equation 3.1 we can obtain the bias bound presented in the following result:

Proposition 3.1 (KDE bias bound for a Hölder density) Suppose \(f \in \text{Hölder}(C,k,\alpha,\mathbb{R})\) and set \(\beta = k+\alpha\). Then if \(K\) is a kernel of order \(k\) satisfying \(\int_\mathbb{R}|u|^\beta |K(u)| du \leq \kappa_\beta < \infty\) we have \[ |\mathbb{E}\hat f_{n,h}(x) - f(x)| \leq h^{\beta} \frac{1}{k!} C \kappa_{\beta} \] for all \(x \in \mathbb{R}\).

Equation 3.1 implies \[\begin{align*} |\mathbb{E}\hat f_{n,h}(x) - f(x)| &\leq \frac{h^k}{k!}\int_\mathbb{R}|u|^k |K(u)| |f^{(k)}(x + \tau(u)uh) - f^{(k)}(x)| du \\ &\leq \frac{h^k}{k!}\int_\mathbb{R}|u|^k |K(u)| C|\tau(u)uh|^\alpha du \\ &\leq \frac{h^{k+\alpha}}{k!} C \int_\mathbb{R}|u|^{k+\alpha} |K(u)| du \\ &\leq h^{\beta} \frac{C \kappa_{\beta}}{k!}. \end{align*}\]
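
As a numerical sanity check of Proposition 3.1, the sketch below evaluates the exact pointwise bias \(\int_\mathbb{R} K(u)[f(x + uh) - f(x)]\,du\) by numerical integration and verifies that it shrinks like \(h^\beta\). The choices below are assumptions for illustration: \(f\) is the standard normal density (whose first derivative is Lipschitz, so \(k = 1\), \(\alpha = 1\), \(\beta = 2\)), \(K\) is the Gaussian kernel of order 1, and \(x = 0.5\) is an arbitrary evaluation point.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = norm.pdf   # density to estimate: standard normal, so f' is Lipschitz (k = 1, alpha = 1)
K = norm.pdf   # Gaussian kernel of order 1
x = 0.5        # arbitrary evaluation point

def bias(h):
    """Exact pointwise bias: integral of K(u) * (f(x + u h) - f(x)) over the real line."""
    val, _ = quad(lambda u: K(u) * (f(x + u * h) - f(x)), -np.inf, np.inf)
    return val

for h in [0.4, 0.2, 0.1, 0.05]:
    print(f"h = {h:4.2f}   |bias| = {abs(bias(h)):.2e}   |bias| / h^2 = {abs(bias(h)) / h ** 2:.3f}")
# The ratio |bias| / h^2 stabilizes as h decreases, consistent with the O(h^beta) bound, beta = 2.
```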

Having obtained a bias bound, we may easily formulate a bound for the mean squared error of the KDE when the density to be estimated lies in a Hölder class; the variance bound in Proposition 2.1 still holds, as the variance is unaffected by the smoothness properties of the density.

Proposition 3.2 (MSE of KDE under Hölder smoothness) Under the assumptions of Proposition 2.1 and Proposition 3.1 we have \[ \operatorname{MSE}\hat f_{n,h}(x) \leq h^{2\beta} \Big(\frac{1}{k!}C\kappa_\beta\Big)^2 + \frac{1}{nh} \kappa^2 f_\max \tag{3.2}\] for each \(x \in \mathbb{R}\) with \(\operatorname{MSE}\)-optimal bandwidth \(h\) given by \(h_{\operatorname{opt}} = c^*n^{-1/(2\beta + 1)}\) such that \[ \operatorname{MSE}\hat f_{n,h_{\operatorname{opt}}}(x) \leq C^*n^{-2\beta/(2\beta + 1)}, \tag{3.3}\] for all \(x \in \mathbb{R}\), where \(c^* > 0\) and \(C^*>0\) depend on \(f_\max\), \(C\), \(k\), \(\kappa^2\), and \(\kappa_\beta\).

The bound in Equation 3.2 follows immediately from Proposition 2.1 and Proposition 3.1. To find the MSE-optimal choice of the bandwidth, we minimize the right side of Equation 3.2 in \(h\). Setting \[ \frac{d}{dh}\Big(h^{2\beta} \Big(\frac{1}{k!}C\kappa_\beta\Big)^2 + \frac{1}{nh} \kappa^2 f_\max\Big) = 2\beta h^{2\beta - 1}\Big(\frac{1}{k!}C\kappa_\beta\Big)^2 - \frac{1}{nh^2}\kappa^2 f_\max=0 \] and solving for \(h\) gives \(h_{\operatorname{opt}}\); plugging \(h_{\operatorname{opt}}\) back into Equation 3.2 gives the bound in Equation 3.3.
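
Spelling out the algebra, the first-order condition rearranges to \(h^{2\beta+1} = \kappa^2 f_\max (k!)^2 / (2\beta C^2 \kappa_\beta^2 n)\), so that \[ h_{\operatorname{opt}} = \Big(\frac{\kappa^2 f_\max (k!)^2}{2\beta C^2 \kappa_\beta^2}\Big)^{1/(2\beta + 1)} n^{-1/(2\beta + 1)}, \] which identifies the constant \(c^*\). Substituting \(h_{\operatorname{opt}}\) into Equation 3.2 makes the squared-bias and variance terms both of order \(n^{-2\beta/(2\beta + 1)}\), giving Equation 3.3 with a constant \(C^*\) depending only on \(f_\max\), \(C\), \(k\), \(\kappa^2\), and \(\kappa_\beta\).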

We see again in Proposition 3.2 the bias-variance trade-off entailed in the choice of the bandwidth \(h\). Note that with \(\beta = 1\) (that is, \(k=0\) and \(\alpha=1\)), the MSE bounds become equal to those in Proposition 2.3 under Lipschitz smoothness.
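
To see the trade-off numerically, the sketch below evaluates the right-hand side of Equation 3.2 over a grid of bandwidths for some made-up values of the constants (all numbers here are illustrative, not from the text) and compares the grid minimizer with the closed-form minimizer obtained from the first-order condition in the proof above.

```python
import math
import numpy as np

# Illustrative (made-up) constants for the bound in Equation 3.2.
C, k, alpha = 1.0, 1, 1.0                    # Hölder parameters; beta = k + alpha
beta = k + alpha
kappa_beta, kappa_sq, f_max = 1.0, 0.3, 0.4  # kernel and density constants
n = 10_000

def mse_bound(h):
    """Right-hand side of Equation 3.2: squared-bias term plus variance term."""
    bias_sq = h ** (2 * beta) * (C * kappa_beta / math.factorial(k)) ** 2
    var = kappa_sq * f_max / (n * h)
    return bias_sq + var

# Closed-form minimizer from the first-order condition.
h_opt = (kappa_sq * f_max * math.factorial(k) ** 2
         / (2 * beta * C ** 2 * kappa_beta ** 2 * n)) ** (1 / (2 * beta + 1))

hs = np.linspace(0.01, 0.5, 2000)
h_grid = hs[np.argmin([mse_bound(h) for h in hs])]
print(f"grid minimizer ~ {h_grid:.4f}, closed-form h_opt ~ {h_opt:.4f}")
print(f"bound at h_opt: {mse_bound(h_opt):.2e}; n^(-2 beta / (2 beta + 1)) = {n ** (-2 * beta / (2 * beta + 1)):.2e}")
```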

We will come to recognize \(n^{-2\beta/(2\beta +1)}\) as the typical “convergence rate” of nonparametric estimators. As \(\beta = k + \alpha \to \infty\), the class \(\text{Hölder}(C,k,\alpha,\mathbb{R})\) excludes more and more functions of lesser smoothness (more and more “wiggly” functions), and \(n^{-2\beta/(2\beta +1)}\) approaches the “parametric rate” \(n^{-1}\) which obtains when one estimates a finite number of parameters.