7  Getting to the Normal means model

Suppose we observe \(Y_1,\dots,Y_n\) arising as \[ Y_i = m(i/n) + \sigma \varepsilon_i, \quad i = 1,\dots,n \] for some function \(m:[0,1]\to\mathbb{R}\), where \(\varepsilon_1,\dots,\varepsilon_n\) are independent \(\mathcal{N}(0,1)\) random variables, and \(\sigma > 0\). We have set the design points \(x_i = i/n\) for \(i=1,\dots,n\) for the sake of later convenience.

Recall the quote from Wasserman (2006):

The basic idea of nonparametric inference is to use data to infer an unknown quantity while making as few assumptions as possible. Usually, this means using statistical models that are infinite-dimensional.

Following the quotation, we will equate the estimation of \(m\) with that of an infinite-dimensional parameter \(\boldsymbol{\theta}= (\theta_1,\theta_2,\dots)\). First, let \[ L_2([0,1]) = \Big\{g:[0,1]\to \mathbb{R}: \int_0^1 |g(x)|^2dx < \infty\Big\} \] be the set of square-integrable functions on \([0,1]\) and assume \(m \in L_2([0,1])\). Our strategy will be to express \(m\) as a weighted sum of an infinite number of basis functions, so that knowing the weights is the same as knowing the function \(m\). We first define an orthonormal basis for \(L_2([0,1])\).

Definition 7.1 (Orthonormal basis for \(L_2([0,1])\)) Let \(\varphi_1,\varphi_2,\dots~:[0,1]\to \mathbb{R}\) be a collection of functions such that \[ \int_0^1 \varphi_i(x)\varphi_j(x)dx = \left\{\begin{array}{ll} 1& i = j\\ 0& i \neq j, \end{array}\right. \] for all \(i\) and \(j\) and such that for any function \(g \in L_2([0,1])\) \[ \int_0^1 g(x)\varphi_j(x)dx = 0 \quad \forall j \iff g(x) = 0. \] Then the functions \(\varphi_1,\varphi_2,\dots\) comprise an orthonormal basis for \(L_2([0,1])\).

Theorem 7.1 (Weights for an orthonormal basis and Parseval’s identity) For \(g \in L_2([0,1])\) we may write \[ g(x) = \sum_{j=1}^\infty \theta_j \varphi_j(x), \] where \[ \theta_j = \int_0^1 g(x)\varphi_j(x)dx, \quad j = 1,2,\dots \] Moreover we have \(\int_0^1|g(x)|^2dx = \sum_{j=1}^\infty \theta_j^2\), which is called Parseval’s identity.

For any integer \(N \geq 1\), let \(\boldsymbol{\varphi}_1,\dots,\boldsymbol{\varphi}_N\) be vectors in \(\mathbb{R}^N\) such that \[ \boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_j = \left\{\begin{array}{ll} 1& i = j\\ 0& i \neq j, \end{array}\right. \] for all \(i\) and \(j\) and such that \(\boldsymbol{\varphi}_j^T\mathbf{g}= 0\) for all \(j\) if and only if \(\mathbf{g}= \mathbf{0}\). Then \(\boldsymbol{\varphi}_1,\dots,\boldsymbol{\varphi}_N\) comprise an orthonormal basis for \(\mathbb{R}^N\). In consequence, for any \(\mathbf{g}\in \mathbb{R}^N\) we have \[ \mathbf{g}= \sum_{j=1}^N (\boldsymbol{\varphi}_j^T\mathbf{g}) \boldsymbol{\varphi}_j, \] where the right hand side is the orthogonal projection of \(\mathbf{g}\) onto the space spanned by \(\boldsymbol{\varphi}_1,\dots,\boldsymbol{\varphi}_N\). Setting \(\theta_j = \boldsymbol{\varphi}_j^T\mathbf{g}\) allows us to write \(\mathbf{g}= \sum_{j=1}^N \theta_j \boldsymbol{\varphi}_j\). Moreover, we have \(\mathbf{g}^T\mathbf{g}= \sum_{j=1}^N \theta_j^2\). Theorem 7.1 is essentially a limiting version of these same arguments as \(N \to \infty\).
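
To make the finite-dimensional statement concrete, here is a small numerical sketch in R (constructing the basis from the Q factor of a QR decomposition is just one convenient choice): we build an orthonormal basis of \(\mathbb{R}^N\), recover the coefficients \(\theta_j = \boldsymbol{\varphi}_j^T\mathbf{g}\), and check the reconstruction and the identity \(\mathbf{g}^T\mathbf{g}= \sum_{j=1}^N \theta_j^2\).

Code
# orthonormal basis of R^N: the columns of the Q factor of a QR decomposition
set.seed(1)
N   <- 10
Phi <- qr.Q(qr(matrix(rnorm(N^2), N, N)))

g     <- rnorm(N)                 # an arbitrary vector in R^N
theta <- as.vector(t(Phi) %*% g)  # theta_j = phi_j^T g

# reconstruction: g = sum_j theta_j phi_j (up to rounding error)
max(abs(Phi %*% theta - g))

# finite-dimensional Parseval identity: g^T g = sum_j theta_j^2
c(sum(g^2), sum(theta^2))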

The Fourier basis in Definition 7.2 is such an orthonormal basis.

Definition 7.2 (Fourier basis) The Fourier basis on \([0,1]\) is the collection of functions \(\varphi_1(x) = 1\) and \[\begin{align} \varphi_{2k}(x) &= \sqrt{2}\cos(2\pi k x) \\ \varphi_{2k+1}(x) &= \sqrt{2}\sin(2\pi k x) \end{align}\] for \(k=1,2,\dots\)

The panels of Figure 7.1 plot the Fourier basis functions \(\varphi_1,\dots,\varphi_{2K+1}\) for \(K=1,2,3,4\).

Code
# Plot the Fourier basis functions phi_1, ..., phi_{2K+1} on [0,1]
plotFourier0to1 <- function(K){

  # evaluate the basis functions on a fine grid
  x <- seq(0, 1, length = 201)
  Bx <- matrix(0, length(x), 2*K + 1)
  Bx[, 1] <- 1
  for(j in 1:K){
    Bx[, 2*j]     <- sqrt(2)*cos(2*pi*j*x)
    Bx[, 2*j + 1] <- sqrt(2)*sin(2*pi*j*x)
  }

  # empty plotting region spanning the range of the basis functions
  plot(NA,
       ylim = c(-sqrt(2), sqrt(2)),
       xlim = range(x),
       xlab = "",
       ylab = "")

  # panel label "K = ..." placed near the top of the figure region
  xpos <- grconvertX(.5, from = "nfc", to = "user")
  ypos <- grconvertY(.92, from = "nfc", to = "user")
  text(x = xpos, y = ypos, labels = paste("K = ", K, sep = ""), xpd = NA)

  # draw each basis function in a different color
  for(k in 1:(2*K + 1)){
    lines(Bx[, k] ~ x, col = k)
  }

}

# 2 x 2 grid of panels for K = 1, 2, 3, 4
par(mfrow = c(2, 2), mar = c(2.1, 2.1, 2.1, 1.1))
for(k in 1:4){
  plotFourier0to1(K = k)
}
Figure 7.1: Fourier basis functions \(\varphi_1,\dots,\varphi_{2K+1}\) for \(K=1,2,3,4\).
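
As a quick numerical sanity check of Definition 7.1 and Theorem 7.1 for the Fourier basis, the following sketch (the test function is an arbitrary, illustrative choice) verifies the orthonormality relations and Parseval's identity by numerical integration.

Code
# Fourier basis functions of Definition 7.2
phi <- function(j, x){
  if(j == 1) rep(1, length(x))
  else if(j %% 2 == 0) sqrt(2)*cos(2*pi*(j/2)*x)
  else sqrt(2)*sin(2*pi*((j - 1)/2)*x)
}

# orthonormality: the integral of phi_i * phi_j over [0,1] is 1 if i = j and 0 otherwise
ip <- function(i, j) integrate(function(x) phi(i, x)*phi(j, x), 0, 1)$value
round(outer(1:5, 1:5, Vectorize(ip)), 6)

# Parseval's identity for a test function (arbitrary choice)
g     <- function(x) x^2*(1 - x)
theta <- sapply(1:51, function(j)
  integrate(function(x) g(x)*phi(j, x), 0, 1, subdivisions = 500L)$value)
c(integrate(function(x) g(x)^2, 0, 1)$value, sum(theta^2))  # nearly equal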

Now, suppose we can write \[ m(x) = \sum_{j=1}^\infty \theta_j \varphi_j(x), \] where \(\theta_j = \int_0^1m(x)\varphi_j(x)dx\) for \(j=1,2,\dots\), that is, suppose \(m\) lies in the space of functions spanned by \(\varphi_1,\varphi_2,\dots\) Moreover, suppose (we will make this more precise later) that for large enough \(n\) we have \[ m(x) \approx \sum_{j=1}^n \theta_j\varphi_j(x), \] so that the prominent features of the function \(m\) can be resolved with coefficients \(\theta_j\) corresponding to small indices \(j=1,\dots,n\). In this setup, consider estimating \(\theta_1,\dots,\theta_n\) based on \(Y_1,\dots,Y_n\) with \[ \hat \theta_j = \frac{1}{n}\sum_{i=1}^n Y_i \varphi_j(i/n) \] for \(j = 1,\dots,n\). Then we have \[ \hat \theta_j = \frac{1}{n}\sum_{i=1}^n m(i/n)\varphi_j(i/n) + \frac{\sigma}{n}\sum_{i=1}^n \varepsilon_i \varphi_j(i/n), \] where \[ \frac{1}{n}\sum_{i=1}^n m(i/n)\varphi_j(i/n) \to \int_0^1 m(x)\varphi_j(x)dx = \theta_j \] as \(n \to \infty\) and the second term is Normally distributed with mean zero and variance satisfying \[ n\mathbb{V}\Big(\frac{\sigma}{n}\sum_{i=1}^n \varepsilon_i \varphi_j(i/n)\Big) = \frac{\sigma^2}{n}\sum_{i=1}^n\varphi^2_j(i/n) \to \sigma^2 \int_0^1\varphi_j^2(x)dx = \sigma^2 \] as \(n \to \infty\). Thus for large \(n\) we may write \[ \hat \theta_j \overset{d}{\approx} \theta_j + n^{-1/2}\sigma \xi_j, \] for \(j=1,\dots,n\), where \(\xi_1,\dots,\xi_n\) are independent \(\mathcal{N}(0,1)\) random variables.
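
As a quick numerical illustration (with arbitrary, illustrative choices of \(m\), \(n\), and \(\sigma\)), the following sketch simulates the regression model repeatedly and checks that \(\hat\theta_j\) is centered near \(\theta_j\) with standard deviation close to \(n^{-1/2}\sigma\).

Code
set.seed(1)
n     <- 500
sigma <- 1/2
m     <- function(x) sin(2*pi*x) + x      # illustrative regression function
phi   <- function(j, x){                  # Fourier basis of Definition 7.2
  if(j == 1) rep(1, length(x))
  else if(j %% 2 == 0) sqrt(2)*cos(2*pi*(j/2)*x)
  else sqrt(2)*sin(2*pi*((j - 1)/2)*x)
}
x <- (1:n)/n                              # design points

# Monte Carlo distribution of theta-hat_j for a small index j
j      <- 3
thetaj <- integrate(function(u) m(u)*phi(j, u), 0, 1)$value  # true theta_j
hatj   <- replicate(2000, {
  Y <- m(x) + sigma*rnorm(n)
  mean(Y*phi(j, x))                       # (1/n) sum_i Y_i phi_j(i/n)
})
c(mean(hatj), thetaj)                     # approximately equal
c(sd(hatj), sigma/sqrt(n))                # approximately equal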

Moreover, setting \(\hat m_n(x) = \sum_{j=1}^n \hat \theta_j\varphi_j(x)\), Parseval's identity gives the integrated squared error of the estimator \(\hat m_n\) as \[ \int_0^1|\hat m_n(x) - m(x)|^2dx = \sum_{j=1}^n(\hat \theta_j - \theta_j)^2 + \sum_{j > n}\theta_j^2 \approx \sum_{j=1}^n(\hat \theta_j - \theta_j)^2, \] where the approximation holds when the tail coefficients are negligible; its expectation is the mean integrated squared error \(\operatorname{MISE}\hat m_n\).
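
The next sketch (again with arbitrary, illustrative choices of \(m\), \(n\), and \(\sigma\)) forms \(\hat m_n\) from the estimated coefficients and checks numerically that its integrated squared error is close to \(\sum_{j=1}^n(\hat\theta_j - \theta_j)^2\).

Code
set.seed(2)
n     <- 100
sigma <- 1/2
m     <- function(x) sin(2*pi*x) + x      # illustrative regression function
phi   <- function(j, x){                  # Fourier basis of Definition 7.2
  if(j == 1) rep(1, length(x))
  else if(j %% 2 == 0) sqrt(2)*cos(2*pi*(j/2)*x)
  else sqrt(2)*sin(2*pi*((j - 1)/2)*x)
}

x   <- (1:n)/n
Y   <- m(x) + sigma*rnorm(n)
Phi <- sapply(1:n, phi, x = x)            # n x n matrix with entries phi_j(i/n)

theta.hat <- as.vector(crossprod(Phi, Y))/n   # hat theta_j = (1/n) sum_i Y_i phi_j(i/n)

# fine grid for numerical integration and for evaluating m-hat
xx         <- (1:10000)/10000
Phixx      <- sapply(1:n, phi, x = xx)
theta.true <- as.vector(crossprod(Phixx, m(xx)))/length(xx)  # theta_j, j = 1,...,n
m.hat      <- as.vector(Phixx %*% theta.hat)

# integrated squared error vs. sum of squared coefficient errors
c(mean((m.hat - m(xx))^2), sum((theta.hat - theta.true)^2))  # nearly equal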

This brings us to the so-called Normal means model:

Definition 7.3 (Normal means model) Let \[ Z_i = \theta_i + n^{-1/2}\sigma \xi_i \] for \(i = 1,\dots,n\), where \(\boldsymbol{\theta}= (\theta_1,\dots,\theta_n) \in \Theta \subset \mathbb{R}^n\) is unknown, \(\xi_1,\dots,\xi_n\) are independent \(\mathcal{N}(0,1)\) random variables, and \(\sigma > 0\).

Because the approximations outlined above reduce the nonparametric regression model to the Normal means model, what we learn about estimation in the Normal means model can, to some extent, be carried over to nonparametric regression. That is, studying how well we can estimate \(\theta_1,\theta_2,\dots\) in the Normal means model helps us understand how well we can estimate the unknown function \(m\) in the original nonparametric regression model.

One might imagine that assumptions about the smoothness of the unknown function \(m\) correspond to assumptions about the parameters \(\theta_1,\theta_2,\dots\) This is indeed the case, as we will see in the next section, where we introduce the periodic Sobolev class of functions. Functions in this class are characterized, via the Fourier basis, by constraints on their coefficients \(\theta_1,\theta_2,\dots\) We will see that the smoother the function, the stricter the constraints its coefficients must satisfy, and the “easier” the function is to estimate.