3  Kernel density estimation of Hölder densities

We now introduce a more general class of functions than Lipschitz functions, namely Hölder classes of functions, which we may think of as grading functions by their degree of smoothness; the classes with \(m \geq 1\) in Definition 3.2 below describe functions which are “smoother” than Lipschitz functions. We begin by presenting the Hölder condition, which is a generalization of the Lipschitz condition from Definition 2.1.

Definition 3.1 (Hölder condition) Let \(T \subseteq \mathbb{R}\) be an interval. A function \(f\) satisfies the Hölder condition with constants \(C \geq 0\) and \(\alpha \in (0,1]\) on \(T\) if \[ |f(x) - f(x')| \leq C |x - x'|^{\alpha} \quad \text{ for all } \quad x,x' \in T. \]

Note that when \(\alpha = 1\), the Hölder condition is equivalent to the Lipschitz condition with Lipschitz constant \(L = C\).
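As a simple example, the function \(f(x) = \sqrt{|x|}\) satisfies the Hölder condition on \(\mathbb{R}\) with \(C = 1\) and \(\alpha = 1/2\), since \[ \big|\sqrt{|x|} - \sqrt{|x'|}\big|^2 \leq \big(\sqrt{|x|} + \sqrt{|x'|}\big)\big|\sqrt{|x|} - \sqrt{|x'|}\big| = \big||x| - |x'|\big| \leq |x - x'|, \] while \(f\) is not Lipschitz on any interval containing \(0\).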

We now use the Hölder condition to define Hölder function classes.

Definition 3.2 (Hölder class of functions) For an interval \(T \subseteq \mathbb{R}\) and constants \(C \geq 0\), \(\alpha \in (0,1]\), and \(m\) a nonnegative integer, the Hölder class of functions \(\text{Hölder}(C,m,\alpha,T)\) is the set of functions \(f\) which have \(m\) continuous derivatives on \(T\) and for which \[ |f^{(m)}(x) - f^{(m)}(x')| \leq C |x - x'|^{\alpha} \quad \text{ for all } \quad x,x' \in T, \] i.e. the \(m\)-th derivative satisfies the Hölder condition with Hölder constants \(C\) and \(\alpha\).

If \(\alpha = 1\), then \(f \in \text{Hölder}(C,m,\alpha,T)\) means that the \(m\)-th derivative of \(f\) satisfies the Lipschitz condition with Lipschitz constant \(L = C\). Moreover \(\text{Hölder}(C,0,1,T) = \text{Lipschitz}(C,T)\), so that if \(m=0\) and \(\alpha=1\), the Hölder class is a Lipschitz class. We may think of larger values of \(m\) as corresponding to smoother functions.
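Continuing the example above, on \(T = [0,\infty)\) the function \(f(x) = x^{3/2}\) has one continuous derivative \(f^{(1)}(x) = \tfrac{3}{2}\sqrt{x}\), and since \(|\sqrt{x} - \sqrt{x'}| \leq |x - x'|^{1/2}\) for \(x, x' \geq 0\), we have \(f \in \text{Hölder}(\tfrac{3}{2}, 1, \tfrac{1}{2}, [0,\infty))\).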

We will consider the bias of the kernel density estimator (KDE) \(\hat f_n\) of a density \(f \in \text{Hölder}(C,m,\alpha,T)\). Before proceeding, we will require one more definition.

Definition 3.3 (Kernel of order \(m\)) For a positive integer \(m\), a function \(K: \mathbb{R}\to \mathbb{R}\) is a kernel of order \(m\) if \[ \int_\mathbb{R}K(u)du = 1 \quad \text{ and } \quad \int_{\mathbb{R}}u^j K(u)du = 0 \quad \text{ for } j = 1,\dots,m. \]
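To make this definition concrete, the following sketch (assuming NumPy and SciPy are available; the kernel \(K(u) = \tfrac{3}{8}(3 - 5u^2)\mathbf{1}\{|u| \leq 1\}\) is a standard example from the literature, not one defined above) numerically checks the moment conditions for two candidate kernels.

```python
import numpy as np
from scipy.integrate import quad

# Two candidate kernels supported on [-1, 1]:
#   Epanechnikov: K(u) = 3/4 (1 - u^2)   -- nonnegative
#   K2:           K(u) = 3/8 (3 - 5u^2)  -- takes negative values for |u| > sqrt(3/5)
epanechnikov = lambda u: 0.75 * (1.0 - u**2)
K2 = lambda u: 0.375 * (3.0 - 5.0 * u**2)

for name, K in [("Epanechnikov", epanechnikov), ("K2", K2)]:
    # j = 0 checks the normalization int K = 1; j >= 1 checks int u^j K = 0.
    moments = [quad(lambda u: u**j * K(u), -1.0, 1.0)[0] for j in range(4)]
    print(f"{name:12s} moments j=0..3: {np.round(moments, 6)}")
# Epanechnikov: [1, 0, 0.2, 0]  -> a kernel of order 1 (nonzero second moment)
# K2:           [1, 0, 0,   0]  -> a kernel of order 2 (in fact order 3, by symmetry)
```

Note that a kernel of order \(m \geq 2\) cannot be nonnegative, since a nonnegative \(K\) integrating to one must have \(\int_\mathbb{R}u^2 K(u)du > 0\).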

If \(f\) has \(m\) continuous derivatives and \(K\) is a kernel of order \(m\), we can use a Taylor expansion together with the above properties of the kernel to express the bias of \(\hat f_n(x)\) as \[ \mathbb{E}\hat f_n(x) - f(x) = \frac{h^m}{m!}\int_\mathbb{R}u^mK(u) \big[f^{(m)}(x + \tau(u)uh) - f^{(m)}(x)\big] du, \tag{3.1}\] where \(\tau(u) \in [0,1]\) for all \(u \in \mathbb{R}\).

Recall that Equation 2.1 gives \[ \mathbb{E}\hat f_n(x) - f(x) = \int_\mathbb{R}K(u) [f(x + uh) - f(x)] du. \]

Now consider a Taylor expansion of \(f\) around the point \(x\), evaluated at \(x + uh\). For each \(u \in \mathbb{R}\), using the Lagrange form of the remainder, we may write \[ f(x + uh) - f(x) = \left\{\begin{array}{ll} f^{(1)}(x + \tau(u)uh)uh,& m = 1\\ \displaystyle \sum_{j=1}^{m-1}\frac{f^{(j)}(x)(uh)^j}{j!} + \frac{f^{(m)}(x+ \tau(u)uh)}{m!}(uh)^m,& m > 1,\end{array}\right. \] for some \(\tau(u) \in[0,1]\).

Now, in the case \(m = 1\), we have \[\begin{align*} \int_\mathbb{R}K(u) [f(x + uh) - f(x)]\, du &= \int_\mathbb{R}K(u) f^{(1)}(x + \tau(u)uh)\, uh\, du \\ &= h\int_\mathbb{R}uK(u) [f^{(1)}(x + \tau(u)uh) - f^{(1)}(x)]\, du, \end{align*}\] where the second equality uses the assumption \(\int_\mathbb{R}u K(u)du = 0\), whereby \(h\int_\mathbb{R}u K(u)f^{(1)}(x)du = 0\).

In the case \(m > 1\), we have \[\begin{align*} \int_\mathbb{R}K(u) [f(x + uh) - f(x)]\, du &= \int_\mathbb{R}K(u)\Big[ \sum_{j=1}^{m-1}\frac{f^{(j)}(x)(uh)^j}{j!} + \frac{f^{(m)}(x+ \tau(u)uh)}{m!}(uh)^m\Big] du \\ &= \frac{h^m}{m!}\int_\mathbb{R}u^m K(u) [f^{(m)}(x + \tau(u)uh) - f^{(m)}(x)]\, du, \end{align*}\] where the second equality uses the properties of a kernel of order \(m\): each term of the sum vanishes since \(\int_\mathbb{R}u^j K(u)du = 0\) for \(j = 1,\dots,m-1\), and since \(\int_\mathbb{R}u^m K(u)du = 0\) we may also subtract \(f^{(m)}(x)\) inside the brackets. This completes the derivation of Equation 3.1.

From the expression for the bias in Equation 3.1 we can obtain the bias bound presented in the following result:

Proposition 3.1 (KDE bias bound for a Hölder density) Suppose \(f \in \text{Hölder}(C,m,\alpha,\mathbb{R})\) and set \(\beta = m+\alpha\). Then if \(K\) is a kernel of order \(m\) satisfying \(\int_\mathbb{R}|u|^\beta |K(u)| du \leq \kappa_\beta < \infty\) we have \[ |\mathbb{E}\hat f_n(x) - f(x)| \leq h^{\beta} \frac{1}{m!} C \kappa_{\beta} \] for all \(x \in \mathbb{R}\).

Equation 3.1 implies \[\begin{align*} |\mathbb{E}\hat f_n(x) - f(x)| &\leq \frac{h^m}{m!}\int_\mathbb{R}|u|^m |K(u)| |f^{(m)}(x + \tau(u)uh) - f^{(m)}(x)| du \\ &\leq \frac{h^m}{m!}\int_\mathbb{R}|u|^m |K(u)| C|\tau(u)uh|^\alpha du \\ &\leq \frac{h^{m+\alpha}}{m!} C \int_\mathbb{R}|u|^{m+\alpha} |K(u)| du \\ &\leq h^{\beta} \frac{C \kappa_{\beta}}{m!}. \end{align*}\]
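As a numerical illustration of this bound (a sketch under assumed choices: the standard normal density for \(f\), the Epanechnikov kernel, which has order \(m = 1\), and the evaluation point \(x = 0.5\), so that \(\beta = 2\)), one can check that the exact bias computed from Equation 2.1 shrinks like \(h^{\beta}\):

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density (f' is Lipschitz)
K = lambda u: 0.75 * (1.0 - u**2)                      # Epanechnikov kernel on [-1, 1], order m = 1

def bias(x, h):
    # E f_hat_n(x) - f(x) = int K(u) [f(x + uh) - f(x)] du   (Equation 2.1)
    return quad(lambda u: K(u) * (f(x + u * h) - f(x)), -1.0, 1.0)[0]

x, beta = 0.5, 2.0
for h in (0.2, 0.1, 0.05):
    b = bias(x, h)
    print(f"h = {h:4.2f}   bias = {b: .3e}   bias / h^beta = {b / h**beta: .4f}")
# The ratio bias / h^beta stabilizes as h decreases, consistent with the h^beta bound.
```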

Having obtained a bias bound, we may easily formulate a bound for the mean squared error of the KDE when the density to be estimated lies in a Hölder class; the variance bound in Proposition 2.1 still holds, as the variance is unaffected by the smoothness properties of the density.

Proposition 3.2 (MSE of KDE under Hölder smoothness) Under the assumptions of Proposition 2.1 and Proposition 3.1 we have \[ \operatorname{MSE}\hat f_n(x) \leq h^{2\beta} \Big(\frac{1}{m!}C\kappa_\beta\Big)^2 + \frac{1}{nh} \kappa^2 f_\max \tag{3.2}\] for each \(x \in \mathbb{R}\) with \(\operatorname{MSE}\)-optimal bandwidth \(h\) given by \(h_\operatorname{opt} = c^*n^{-1/(2\beta + 1)}\) such that \[ \operatorname{MSE}\hat f^{h_{\operatorname{opt}}}_n(x) \leq C^*n^{-2\beta/(2\beta + 1)}, \tag{3.3}\] for all \(x \in \mathbb{R}\), where \(c^* > 0\) and \(C^*>0\) depend on \(f_\max\), \(C\), \(m\), \(\kappa^2\), and \(\kappa_\beta\).

The bound in Equation 3.2 follows immediately from Proposition 2.1 and Proposition 3.1. To find the MSE-optimal choice of the bandwidth, we minimize the right side of Equation 3.2 in \(h\). Setting \[ \frac{d}{dh}\Big(h^{2\beta} \Big(\frac{1}{m!}C\kappa_\beta\Big)^2 + \frac{1}{nh} \kappa^2 f_\max\Big) = 2\beta h^{2\beta - 1}\Big(\frac{1}{m!}C\kappa_\beta\Big)^2 - \frac{1}{nh^2}\kappa^2 f_\max=0 \] and solving for \(h\) gives \[ h_{\operatorname{opt}} = \Big(\frac{\kappa^2 f_\max (m!)^2}{2\beta C^2 \kappa_\beta^2}\Big)^{1/(2\beta+1)} n^{-1/(2\beta + 1)}, \] which identifies \(c^*\); plugging \(h_{\operatorname{opt}}\) back into Equation 3.2 gives the bound in Equation 3.3.
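As a quick sanity check on this calculation (a sketch with arbitrary placeholder constants, not values taken from the text), one can compare the closed-form minimizer above with a numerical minimization of the right side of Equation 3.2:

```python
import math
from scipy.optimize import minimize_scalar

# Placeholder constants chosen purely for illustration.
C, m, alpha = 1.0, 1, 1.0
kappa_beta, kappa_sq, f_max, n = 0.2, 0.6, 0.5, 10_000
beta = m + alpha

A2 = (C * kappa_beta / math.factorial(m)) ** 2   # (C kappa_beta / m!)^2
B = kappa_sq * f_max                             # kappa^2 f_max
bound = lambda h: A2 * h ** (2 * beta) + B / (n * h)   # right side of Equation 3.2

h_closed = (B / (2 * beta * A2 * n)) ** (1 / (2 * beta + 1))  # h_opt from the first-order condition
h_numeric = minimize_scalar(bound, bounds=(1e-4, 1.0), method="bounded").x

print(h_closed, h_numeric)                                 # agree up to optimizer tolerance
print(bound(h_closed), n ** (-2 * beta / (2 * beta + 1)))  # MSE bound at h_opt vs the n^{-2 beta/(2 beta+1)} rate
```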

We see again in Proposition 3.2 the bias-variance trade-off entailed in the choice of the bandwidth \(h\). Note that with \(\beta = 1\) (i.e. \(m = 0\) and \(\alpha = 1\)), the bounds above reduce to those in Proposition 2.3 under Lipschitz smoothness.

We will come to recognize \(n^{-2\beta/(2\beta +1)}\) as the typical “convergence rate” of nonparametric estimators. As \(\beta \to \infty\), the function \(f\) is constrained to be increasingly smooth, and \(n^{-2\beta/(2\beta +1)}\) approaches the “parametric rate” of \(n^{-1}\), which obtains when one estimates a finite number of parameters.
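For instance, the exponent \(2\beta/(2\beta+1)\) climbs toward \(1\) as \(\beta\) grows, as the following small arithmetic illustration shows:

```python
# Rate exponent 2*beta/(2*beta + 1) for increasing smoothness beta.
for beta in (0.5, 1, 2, 5, 10):
    print(f"beta = {beta:4}   rate = n^(-{2 * beta / (2 * beta + 1):.3f})")
# beta = 0.5 gives n^(-0.500), beta = 1 gives n^(-0.667), beta = 10 gives n^(-0.952).
```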