8  Minimax risk in the Normal means model

Consider the so-called Normal means model from Definition 7.3, in which we observe \[ Z_i = \theta_i + n^{-1/2}\sigma \xi_i, \quad i = 1,\dots,n, \] where \(\boldsymbol{\theta}= (\theta_1,\dots,\theta_n) \in \Theta \subset \mathbb{R}^n\) is unknown, \(\xi_1,\dots,\xi_n\) are independent \(\mathcal{N}(0,1)\) random variables, and \(\sigma > 0\). We will consider how to find the best estimator \(\hat{\boldsymbol{\theta}} = (\hat \theta_1,\dots,\hat \theta_n)\).
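To make the setup concrete, here is a minimal simulation sketch of this model in Python (the particular values of \(n\), \(\sigma\), and \(\boldsymbol{\theta}\) below are illustrative choices, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 100, 1.0                  # illustrative choices
theta = np.linspace(-1.0, 1.0, n)    # a fixed mean vector (unknown to the analyst in practice)

# Z_i = theta_i + n^{-1/2} * sigma * xi_i, with xi_i ~ N(0, 1) independent
Z = theta + sigma / np.sqrt(n) * rng.standard_normal(n)
```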

In order to identify “the best” estimator of \(\boldsymbol{\theta}\), we must define what we mean by “the best”. We will say that the best estimator is one (I will not say the one, since it may not be unique) which has the smallest worst-case estimation error over all possible values of \(\boldsymbol{\theta}\). Such an estimator is called a minimax estimator.

More formally, let \(L(\cdot,\cdot)\) be a function such that the value \(L(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})\) is interpreted as the cost of estimating \(\boldsymbol{\theta}\) with \(\hat{\boldsymbol{\theta}}\). Such a function is called a loss function, and is often defined as a distance between the estimator \(\hat{\boldsymbol{\theta}}\) and its estimation target \(\boldsymbol{\theta}\); we will consider the squared error loss function, defined as \[ L(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}) = \|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2 = \sum_{i=1}^n(\hat \theta_i - \theta_i)^2. \] The expected value of the loss an estimator will incur is called its risk, which may depend on the true value of \(\boldsymbol{\theta}\). We define the risk of an estimator \(\hat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) as the function \[ R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}) = \mathbb{E}L(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}), \] where the expectation is taken over the data \(Z_1,\dots,Z_n\) with \(\boldsymbol{\theta}\) held fixed. Having defined the risk of an estimator, we can now define the minimax risk.
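Before turning to that definition, a brief aside: the risk just defined can be approximated by Monte Carlo simulation, by repeatedly simulating the model at a fixed \(\boldsymbol{\theta}\) and averaging the realized loss. The sketch below uses the estimator \(\hat{\boldsymbol{\theta}} = (Z_1,\dots,Z_n)\) and illustrative values of \(\boldsymbol{\theta}\) and \(\sigma\); its estimated risk comes out close to \(\sigma^2\) no matter which \(\boldsymbol{\theta}\) is used, a fact verified in Proposition 8.1 below.

```python
import numpy as np

def mc_risk(estimator, theta, sigma=1.0, n_reps=20_000, seed=0):
    """Monte Carlo approximation of R(theta_hat, theta) = E ||theta_hat - theta||^2."""
    rng = np.random.default_rng(seed)
    n = theta.size
    losses = np.empty(n_reps)
    for r in range(n_reps):
        Z = theta + sigma / np.sqrt(n) * rng.standard_normal(n)  # draw from the model
        losses[r] = np.sum((estimator(Z) - theta) ** 2)          # squared error loss
    return losses.mean()

theta = np.linspace(-1.0, 1.0, 50)
print(mc_risk(lambda Z: Z, theta, sigma=1.0))  # close to sigma^2 = 1 for any theta
```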

Definition 8.1 (Minimax risk) Let \(\Theta \subset \mathbb{R}^n\) be a parameter space. The minimax risk over \(\Theta\) is defined as \[ M(\Theta) = \inf_{\hat{\boldsymbol{\theta}}} \sup_{\boldsymbol{\theta}\in \Theta} R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}), \] where the infimum is taken over all estimators of \(\boldsymbol{\theta}\).

The minimax lens evaluates an estimator on the basis of its worst-possible performance. One could choose a restaurant with the minimax approach by choosing the one whose worst dish is tastier (or not less tasty) than the worst dish of each other restaurant under consideration.

If one evaluates estimators in the minimax way, then as soon as one finds an estimator which achieves the minimax risk in a given setup, one does not need to spend time looking for a better estimator.

In the Normal means model we will derive the minimax risk for a few choices of the parameter space \(\Theta\). Our strategy for finding the minimax risk will require a couple more definitions.

First, the integrated risk of an estimator \(\hat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) with respect to a prior distribution \(\pi\) is defined as \[ I_\pi(\hat{\boldsymbol{\theta}}) = \int R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}) d\pi(\boldsymbol{\theta}). \] The notation \(\int\) is intended to mean integration over all of \(\mathbb{R}^n\), and we allow that the support of the prior \(\pi\) may differ from the set \(\Theta\). The Bayes estimator under the prior \(\pi\) is defined as \[ \hat{\boldsymbol{\theta}}_\pi = \underset{\hat{\boldsymbol{\theta}}}{\operatorname{argmin}} \int R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}), \] which is the estimator achieving the smallest integrated risk with respect to \(\pi\) (the Bayes estimator is the posterior mean in the case of squared-error loss). In addition, we define the integrated Bayes risk as \[ I_\pi = I_\pi(\hat{\boldsymbol{\theta}}_\pi) = \int R(\hat{\boldsymbol{\theta}}_\pi,\boldsymbol{\theta})d\pi(\boldsymbol{\theta}). \]

Now, the integrated Bayes risk can be used to establish a lower bound on the minimax risk via the following argument. For any estimator \(\hat{\boldsymbol{\theta}}\) we may write \[\begin{align} \sup_{\boldsymbol{\theta}\in \Theta} R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}) &\geq \int_\Theta R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}) \\ &= \int R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}) - \int_{\Theta^c} R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}) \\ &\geq I_\pi - \int_{\Theta^c} R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}), \end{align}\] where the last inequality holds because the integrated risk of any estimator is at least the integrated Bayes risk \(I_\pi\). Taking the infimum of both sides of the above over all estimators \(\hat{\boldsymbol{\theta}}\) gives \[ M(\Theta) \geq I_\pi - \sup_{\hat{\boldsymbol{\theta}}}\int_{\Theta^c} R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}), \tag{8.1}\] where the second term will be equal to zero if \(\pi\) has support only on \(\Theta\) (or will converge to zero under a sequence of priors which concentrate on \(\Theta\)). The inequality in Equation 8.1 informs the following two-step strategy for finding the minimax risk:

To find the minimax risk \(M(\Theta)\), propose a candidate value \(M^*\) and then:

  1. Find an estimator which has worst-case risk equal to (or bounded above by) \(M^*\). This shows \(M(\Theta) \leq M^*\).
  2. Find a prior (or a sequence of priors) such that the integrated Bayes risk over \(\Theta\) is equal to (or converges to) \(M^*\). This will show \(M(\Theta) \geq M^*\).

Steps 1 and 2 give \(M(\Theta) = M^*\).

It is important to note that the inequality in Equation 8.1 holds for any choice of the prior \(\pi\). The trick in Step 2 is to choose a prior \(\pi\) such that the integrated Bayes risk \(I_\pi\) is equal to (or converges to) the candidate \(M^*\) for the minimax risk and such that the second term is equal to (or converges to) zero; then we can “squeeze” \(M(\Theta)\) to establish \(M^*\) as the minimax risk.
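Both propositions below use the Bayes estimator under an independent Gaussian prior, and its form is stated there without derivation; as a reminder, here is a sketch of the standard conjugate-Normal computation for a single coordinate, with generic prior variance \(\tau^2/n\). Since \(Z_i \mid \theta_i \sim \mathcal{N}(\theta_i, \sigma^2/n)\) and \(\theta_i \sim \mathcal{N}(0, \tau^2/n)\), completing the square in the exponent gives \[\begin{align} \pi(\theta_i \mid Z_i) &\propto \exp\Big(-\frac{n}{2\sigma^2}(Z_i - \theta_i)^2\Big)\exp\Big(-\frac{n}{2\tau^2}\theta_i^2\Big) \\ &\propto \exp\Big(-\frac{n(\sigma^2 + \tau^2)}{2\sigma^2\tau^2}\Big(\theta_i - Z_i\frac{\tau^2}{\sigma^2 + \tau^2}\Big)^2\Big), \end{align}\] so the posterior of \(\theta_i\) given \(Z_i\) is Normal with mean \(Z_i\tau^2/(\sigma^2 + \tau^2)\), which under squared-error loss is the \(i\)th coordinate of the Bayes estimator.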

Proposition 8.1 (Minimax risk in Normal means model with \(\Theta = \mathbb{R}^n\)) In the Normal means model the minimax risk over \(\mathbb{R}^n\) under squared error loss is given by \[ M(\mathbb{R}^n) = \sigma^2. \]

Implement the two-step strategy:

  1. Propose \(M^* = \sigma^2\) as a candidate for the minimax risk and consider the estimator \(\hat{\boldsymbol{\theta}} = (Z_1,\dots,Z_n)\). We have \[ R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta}) = \mathbb{E}\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2 = \sum_{i=1}^n\mathbb{E}(Z_i - \theta_i)^2 = n \cdot \frac{\sigma^2}{n} = \sigma^2, \] since each \(Z_i - \theta_i\) is \(\mathcal{N}(0,\sigma^2/n)\). So \(M^* = \sigma^2\) can serve as an upper bound for the minimax risk. That is, we have \(M(\mathbb{R}^n) \leq \sigma^2\).

  2. Choose for the prior \(\pi\) the \(\mathcal{N}(\mathbf{0},n^{-1}\tau^2\mathbf{I}_n)\) distribution, where \(\tau^2 > 0\). Then the Bayes estimator, i.e. the posterior mean of \(\boldsymbol{\theta}\) given \(Z_1,\dots,Z_n\), is given by \[ \hat{\boldsymbol{\theta}}_\pi = \Big(Z_1 \frac{\tau^2}{\sigma^2 + \tau^2},\dots,Z_n\frac{\tau^2}{\sigma^2 + \tau^2} \Big). \] The risk of the Bayes estimator is given by \[\begin{align} R(\hat{\boldsymbol{\theta}}_\pi,\boldsymbol{\theta}) &= \mathbb{E}\Big[\sum_{i=1}^n\Big(Z_i \frac{\tau^2}{\sigma^2 + \tau^2} - \theta_i\Big)^2 \Big| \boldsymbol{\theta}\Big]\\ &=\mathbb{E}\Big[\sum_{i=1}^n\Big(\frac{\tau^2}{\sigma^2 + \tau^2}(Z_i - \theta_i) - \theta_i\frac{\sigma^2}{\sigma^2 + \tau^2}\Big)^2 \Big| \boldsymbol{\theta}\Big]\\ &= \Big(\frac{\tau^2}{\sigma^2 + \tau^2}\Big)^2 \sigma^2 + \|\boldsymbol{\theta}\|^2\Big(\frac{\sigma^2}{\sigma^2 + \tau^2}\Big)^2, \end{align}\] so that the integrated Bayes risk is \[\begin{align} I_\pi &= \int_{\mathbb{R}^n}R(\hat{\boldsymbol{\theta}}_\pi,\boldsymbol{\theta})d\pi(\boldsymbol{\theta}) \\ &= \Big(\frac{\tau^2}{\sigma^2 + \tau^2}\Big)^2 \sigma^2 + \int_{\mathbb{R}^n}\|\boldsymbol{\theta}\|^2d\pi(\boldsymbol{\theta})\Big(\frac{\sigma^2}{\sigma^2 + \tau^2}\Big)^2 \\ &= \Big(\frac{\tau^2}{\sigma^2 + \tau^2}\Big)^2 \sigma^2 + \tau^2\Big(\frac{\sigma^2}{\sigma^2 + \tau^2}\Big)^2\\ &= \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}, \end{align}\] where we used \(\int_{\mathbb{R}^n}\|\boldsymbol{\theta}\|^2d\pi(\boldsymbol{\theta}) = n\cdot \tau^2/n = \tau^2\). From here, we can use Equation 8.1 (in which the second term is equal to zero, since the support of \(\pi\) is equal to \(\Theta = \mathbb{R}^n\)) to write \[ M(\mathbb{R}^n) \geq \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2} \] for all \(\tau^2 > 0\). So we have \(\sigma^2\tau^2/(\sigma^2 + \tau^2) \leq M(\mathbb{R}^n) \leq \sigma^2\) for every \(\tau^2 > 0\); letting \(\tau^2 \to \infty\) in the lower bound gives \(M(\mathbb{R}^n) = \sigma^2\).
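As a sanity check of Step 2 (purely illustrative, with arbitrary values of \(n\), \(\sigma\), and \(\tau\)), the integrated Bayes risk can be approximated by simulation: draw \(\boldsymbol{\theta}\) from the prior, draw \(Z_1,\dots,Z_n\) from the model given \(\boldsymbol{\theta}\), apply the Bayes estimator, and average the loss. The result should be close to \(\sigma^2\tau^2/(\sigma^2 + \tau^2)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, tau = 200, 1.0, 2.0    # illustrative choices
n_reps = 50_000

losses = np.empty(n_reps)
for r in range(n_reps):
    theta = tau / np.sqrt(n) * rng.standard_normal(n)        # theta ~ N(0, n^{-1} tau^2 I_n)
    Z = theta + sigma / np.sqrt(n) * rng.standard_normal(n)  # Z | theta from the model
    theta_hat = Z * tau**2 / (sigma**2 + tau**2)             # Bayes estimator (posterior mean)
    losses[r] = np.sum((theta_hat - theta) ** 2)

print(losses.mean())                                # Monte Carlo integrated Bayes risk
print(sigma**2 * tau**2 / (sigma**2 + tau**2))      # theoretical value: 0.8
```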

Proposition 8.2 (Minimax risk in Normal means model with \(\Theta\) a ball in \(\mathbb{R}^n\)) Let \[ \Theta_n(c) = \Big\{\boldsymbol{\theta}\in \mathbb{R}^n : \|\boldsymbol{\theta}\| \leq c\Big\}. \] Then in the Normal means model the minimax risk over \(\Theta_n(c)\) under squared error loss satisfies \[ \liminf_{n\to \infty} M(\Theta_n(c)) = \frac{\sigma^2c^2}{\sigma^2 + c^2}. \]

Implement the two-step strategy (we follow very closely the proof on pages 174–176 of Wasserman (2006)):

  1. Propose \(M^* = \sigma^2c^2/(\sigma^2 + c^2)\) as a candidate value for the minimax risk and consider any estimator of the form \(\hat{\boldsymbol{\theta}}_\lambda = (\lambda Z_1,\dots,\lambda Z_n)\), where \(\lambda > 0\). The risk of the estimator \(\hat{\boldsymbol{\theta}}_\lambda\) is given by \[ R(\hat{\boldsymbol{\theta}}_\lambda,\boldsymbol{\theta}) = \sum_{j=1}^n \mathbb{E}(\lambda Z_j - \theta_j)^2 = \lambda^2\sigma^2 + (1-\lambda)^2\|\boldsymbol{\theta}\|^2. \] For such an estimator the worst-case risk is given by \[ \sup_{\boldsymbol{\theta}\in \Theta_n(c)} R(\hat{\boldsymbol{\theta}}_\lambda,\boldsymbol{\theta}) = \lambda^2\sigma^2 + (1-\lambda)^2 c^2. \] The value of \(\lambda\) which minimizes the worst-case risk (take the derivative with respect to \(\lambda\) and set this equal to zero) is \(\lambda = c^2 / (\sigma^2 + c^2)\). Plugging this in, we find \[ \inf_{\lambda} \sup_{\boldsymbol{\theta}\in \Theta_n(c)} R(\hat{\boldsymbol{\theta}}_\lambda,\boldsymbol{\theta}) = \frac{c^4}{(\sigma^2 + c^2)^2}\sigma^2 + \frac{\sigma^4}{(\sigma^2 + c^2)^2}c^2 = \frac{\sigma^2 c^2}{\sigma^2 + c^2}, \] which shows \(M(\Theta_n(c)) \leq M^*\) (a numerical check of this calculation appears at the end of the section). That is, since an estimator of the form \(\hat{\boldsymbol{\theta}}_\lambda = (\lambda Z_1,\dots,\lambda Z_n)\) achieves worst-case risk \(M^*\), the smallest worst-case risk achievable by any estimator is at most \(M^*\).

  2. Choose for the prior \(\pi\) the \(\mathcal{N}(\mathbf{0},n^{-1}c^2\delta^2\mathbf{I}_n)\) distribution, where \(0 < \delta < 1\). Note that this prior has support on all of \(\mathbb{R}^n\), some of which lies outside of \(\Theta_n(c)\), so the second term in Equation 8.1 is not equal to zero and will have to be handled. As \(n\) grows, however, most of the mass of the \(\mathcal{N}(\mathbf{0},n^{-1}c^2\delta^2\mathbf{I}_n)\) distribution falls within the ball \(\Theta_n(c)\), so this term will vanish in the limit. This is shown rigorously in the proof given by Wasserman (2006), but we omit the details here. Under this prior the Bayes estimator (the posterior mean) of \(\boldsymbol{\theta}\) is given by \[ \hat{\boldsymbol{\theta}}_\pi = \Big(Z_1 \frac{c^2\delta^2}{\sigma^2 + c^2\delta^2},\dots,Z_n\frac{c^2\delta^2}{\sigma^2 + c^2\delta^2}\Big). \] The risk of the Bayes estimator is given by \[ R(\hat{\boldsymbol{\theta}}_\pi,\boldsymbol{\theta}) = \frac{(c^2\delta^2)^2}{(\sigma^2 + c^2\delta^2)^2}\sigma^2 + \frac{\sigma^4}{(\sigma^2 + c^2\delta^2)^2}\|\boldsymbol{\theta}\|^2, \] and the integrated Bayes risk by \[\begin{align} I_\pi &= \int R(\hat{\boldsymbol{\theta}}_\pi,\boldsymbol{\theta})d\pi(\boldsymbol{\theta})\\ &=\frac{(c^2\delta^2)^2}{(\sigma^2 + c^2\delta^2)^2}\sigma^2 + \frac{\sigma^4}{(\sigma^2 + c^2\delta^2)^2}c^2\delta^2\\ &=\frac{\sigma^2c^2\delta^2}{\sigma^2 + c^2\delta^2} \end{align}\] since \(\mathbb{E}\|\boldsymbol{\theta}\|^2 = c^2\delta^2\) according to the prior \(\pi\). Now, according to Equation 8.1, we may write \[ M(\Theta_n(c)) \geq \frac{\sigma^2c^2\delta^2}{\sigma^2 + c^2\delta^2} - \sup_{\hat{\boldsymbol{\theta}}} \int_{\Theta^c_n(c)}R(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})d\pi(\boldsymbol{\theta}). \]

It can be shown that the second term on the right hand side vanishes as \(n \to \infty\), due to the concentration of \(\pi\) on the ball \(\Theta_n(c)\) (see pages 174–176 of Wasserman (2006)), so that \[ \liminf_{n\to\infty} M(\Theta_n(c)) \geq \frac{\sigma^2c^2\delta^2}{\sigma^2 + c^2\delta^2} \] for every \(0 < \delta < 1\). The right hand side approaches \(M^*\) as \(\delta \uparrow 1\), so \(\liminf_{n\to\infty} M(\Theta_n(c)) \geq M^*\); combined with the upper bound \(M(\Theta_n(c)) \leq M^*\) from Step 1, this gives the result.
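To illustrate Step 1 of the last proof numerically (an illustrative sketch, not part of the argument), one can evaluate the worst-case risk \(\lambda^2\sigma^2 + (1-\lambda)^2c^2\) of the linear estimators \(\hat{\boldsymbol{\theta}}_\lambda\) over a grid of shrinkage factors and confirm the minimizing \(\lambda\) and minimum value.

```python
import numpy as np

sigma, c = 1.0, 2.0                       # illustrative choices
lam = np.linspace(0.0, 1.0, 100_001)      # grid of shrinkage factors

# Worst-case risk of theta_hat_lambda over the ball ||theta|| <= c
worst_case = lam**2 * sigma**2 + (1 - lam)**2 * c**2
i = worst_case.argmin()

print(lam[i], worst_case[i])                    # approx 0.8 and 0.8
print(c**2 / (sigma**2 + c**2),                 # exact minimizer: c^2 / (sigma^2 + c^2)
      sigma**2 * c**2 / (sigma**2 + c**2))      # exact minimum:  sigma^2 c^2 / (sigma^2 + c^2)
```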