30  Asymptotic distribution of the Wilcoxon rank sum statistic

Here we will state and prove a result describing the asymptotic behavior of the Wilcoxon rank sum statistic \(W_R\) as the sample sizes \(n\) and \(m\) grow. We begin with a result giving its range, mean, and variance.

Proposition 30.1 (Mean and variance of the rank-sum statistic) The rank-sum statistic \(W_R\) must take a value between \(m(m+1)/2\) and \(mn + m(m+1)/2\). Moreover, if \(F = G\) we have \[\begin{align} \mathbb{E}W_R &= \frac{1}{2}m(N + 1)\\ \mathbb{V}W_R & = \frac{1}{12}mn(N+1). \end{align}\]

First, \(W_R\) takes its smallest value when \(Y_1,\dots,Y_m\) are the smallest \(m\) observations among \(Z_1,\dots,Z_{n+m}\). In this case the set of ranks \(\{R_1,\dots,R_m\}\) is equal to the set of integers \(\{1,\dots,m\}\), so that \(\sum_{j=1}^m R_j = m(m+1)/2\). On the other hand, \(W_R\) takes its largest value when \(Y_1,\dots,Y_m\) are the largest \(m\) observations among \(Z_1,\dots,Z_{n+m}\), in which case the set of ranks \(\{R_1,\dots,R_m\}\) is equal to the set of integers \(\{n+1,\dots,n+m\}\), the sum of which is \(mn + m(m+1)/2\).

To compute the expected value and variance of \(W_R\), we first note that under \(H_0\) the position of a data point \(Y_j\) in the ordering of the \(N = n + m\) data points is a random draw from the integers \(1,\dots,N\) such that each position occurs with probability \(1/N\). That is, for any \(r_j =1,\dots,N\) we have \(\mathbb{P}(R_j = r_j) = 1/N\), so that \[ \mathbb{E}R_j = \frac{N+1}{2} \quad \text{ and } \quad \mathbb{V}R_j = \frac{N^2 - 1}{12}, \] where these are simply the mean and variance of the discrete uniform distribution on the integers \(1,\dots,N\).
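
These moments are easy to verify numerically; the following quick check, with an arbitrary illustrative choice of \(N\), is not part of the proof:

# Check the mean and (population) variance of the discrete uniform on 1,...,N
N <- 25
r <- 1:N
c(mean(r), (N + 1)/2)                    # both equal 13
c(mean((r - mean(r))^2), (N^2 - 1)/12)   # both equal 52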

From here we obtain \[ \mathbb{E}W_R = \mathbb{E}\sum_{j=1}^m R_j = \frac{1}{2}m(N+1). \]

To find \(\mathbb{V}W_R\) we write \[\begin{align} \mathbb{V}W_R &= \mathbb{V}\Big(\sum_{j=1}^m R_j\Big) \\ &= \sum_{j=1}^m \mathbb{V}R_j + 2\sum_{j < k} \mathbb{C}(R_j,R_k) \\ &= m \mathbb{V}R_1 + m(m-1)\mathbb{C}(R_1,R_2), \end{align}\] since \(\mathbb{C}(R_j,R_k)\) is the same for every \(j \neq k\). In order to find \(\mathbb{C}(R_1,R_2)\), assume for a moment that \(m = N\). In this case we must have \(W_R = N(N+1)/2\) so \(\mathbb{V}W_R = 0\), giving \[ N \mathbb{V}R_1 + N(N-1)\mathbb{C}(R_1,R_2) = 0, \] so that \[ \mathbb{C}(R_1,R_2) = -\frac{N \mathbb{V}R_1 }{N(N-1)} = -\frac{N^2-1}{12(N-1)} = -\frac{N+1}{12}. \] Plugging this into our earlier expression for \(\mathbb{V}W_R\), we obtain \[\begin{align} \mathbb{V}W_R &= m \frac{N^2-1}{12} - m(m-1)\frac{N+1}{12} \\ &= m \frac{(N+1)(N-1)}{12} - m(m-1)\frac{N+1}{12} \\ &= \frac{1}{12}mn(N+1). \end{align}\]
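
As a numerical sanity check on Proposition 30.1, we can compute these moments from the exact null distribution of \(W_{XY} = W_R - m(m+1)/2\), whose probabilities are returned by dwilcox() in R; the sample sizes below are arbitrary.

# Check Proposition 30.1 against the exact null pmf of W_XY (arbitrary m, n)
m <- 7; n <- 12; N <- m + n
w <- 0:(m*n)                          # support of W_XY
p <- dwilcox(w, m = m, n = n)         # exact null probabilities
mean_WR <- sum(w*p) + m*(m + 1)/2     # shift back to W_R
var_WR <- sum((w - sum(w*p))^2 * p)   # variance is unaffected by the shift
c(mean_WR, m*(N + 1)/2)               # both equal 70
c(var_WR, m*n*(N + 1)/12)             # both equal 140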

Now we state a central limit result for \(W_R\).

Theorem 30.1 (Asymptotic Normality of rank sum statistic under the null) Under \(H_0\): \(F = G\) we have \[ \frac{W_R - \mathbb{E}W_R}{\sqrt{\mathbb{V}W_R}} \overset{\text{d}}{\longrightarrow}\mathcal{N}(0,1) \tag{30.1}\] as \(n,m \to \infty\).
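
Before turning to the proof, a small simulation can make Theorem 30.1 concrete; the sample sizes and the choice of common distribution \(F = G\) below are arbitrary.

# Simulate the standardized rank-sum statistic under H0: F = G
set.seed(1)
n <- 60; m <- 40; N <- n + m
W_R <- replicate(1e4, {
  z <- rnorm(N)                      # X's are z[1:n], Y's are z[(n+1):N]
  sum(rank(z)[(n + 1):N])            # rank sum of the Y's
})
W_std <- (W_R - m*(N + 1)/2)/sqrt(m*n*(N + 1)/12)
c(mean(W_std), sd(W_std))            # should be close to 0 and 1
# hist(W_std, freq = FALSE) lines up well with the standard normal density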

The main complication in proving the result is that \(W_R = R_1 + \dots + R_m\) is a sum of dependent rather than independent random variables. Our strategy will be to

  1. Introduce an approximation \(\tilde W_R\) to \(W_R\) which is a sum of independent random variables and such that we can show \[ \frac{\tilde W_R - \mathbb{E}\tilde W_R}{\sqrt{ \mathbb{V}\tilde W_R}} \overset{\text{d}}{\longrightarrow}\mathcal{N}(0,1) \tag{30.2}\] as \(n,m \to \infty\).
  2. Show that \(\tilde W_R\) approximates \(W_R\) closely enough that the convergence in Equation 30.2 implies the convergence in Equation 30.1.

Here we go. Under the null hypothesis \(H_0\): \(F = G\), the ranks \(R_1,\dots,R_m\) are simply \(m\) integers sampled without replacement from \(1,\dots,N\). One way to draw a sample of size \(m\) without replacement from the integers \(1,\dots,N\) is to (i) generate \(N\) independent realizations \(U_1,\dots,U_N\) from the \(\mathcal{U}(0,1)\) distribution, (ii) denote their order statistics by \(U_{(1)} <\dots <U_{(N)}\), and (iii) keep the integers in the set \(\{i: U_i \leq U_{(m)}\}\). Under this sampling scheme we may write \[ W_R = \sum_{i=1}^N i J_i, \] where, for each \(i=1,\dots,N\), \(J_i = \mathbf{1}(U_i \leq U_{(m)})\) is an indicator equal to \(1\) if the integer \(i\) was drawn and equal to \(0\) otherwise. Note that \(J_1,\dots,J_N\) are dependent random variables.
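
A minimal sketch of this construction in R (with arbitrary \(N\) and \(m\)) shows that the kept integers are indeed \(m\) draws without replacement and that \(\sum_i i J_i\) is their sum:

# Draw m of the integers 1,...,N without replacement via uniforms
set.seed(2)
N <- 20; m <- 8
U <- runif(N)
J <- as.numeric(U <= sort(U)[m])   # J_i = 1(U_i <= U_(m))
sum(J)                             # exactly m integers are kept
sum((1:N)*J)                       # the rank sum W_R of the kept integers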

Now, to construct an approximation \(\tilde W_R\) to \(W_R\), we consider replacing the dependent random variables \(J_i\) with independent analogues \(K_i\) defined as \(K_i = \mathbf{1}(U_i \leq m/N)\) for \(i=1,\dots,N\), where \(U_1,\dots,U_N\) are the same \(\mathcal{U}(0,1)\) realizations with which the \(J_i\) were defined. Then we set \[ \tilde W_R = \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)K_i + \frac{1}{2}m(N+1). \] Now, noting that \(\mathbb{E}K_i = m/N\) and \(\mathbb{V}K_i = (m/N)(1 - m/N)\), we have \[\begin{align} \mathbb{E}\tilde W_R &=\sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)\frac{m}{N} + \frac{1}{2}m(N+1) \\ &= \frac{1}{2}m(N+1) \end{align}\] and \[\begin{align} \mathbb{V}\tilde W_R &= \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)^2 \frac{m}{N}\Big(1 - \frac{m}{N}\Big) =\frac{N-1}{N}\mathbb{V}W_R, \end{align}\] where the second equality will be useful later; it follows from the identity \(\sum_{i=1}^N (i - (N+1)/2)^2 = N(N^2-1)/12\), which takes some work to show. Now we may write \[\begin{align} \frac{\tilde W_R - \mathbb{E}\tilde W_R}{\sqrt{\mathbb{V}\tilde W_R}} = \frac{\sum_{i=1}^N a_i \xi_i}{\sqrt{\sum_{i=1}^N a_i^2}}, \end{align}\] where \[ a_i = i - \frac{N+1}{2} \quad \text{ and } \quad \xi_i = \frac{K_i - m/N}{\sqrt{(m/N)(1-m/N)}} \] for \(i=1,\dots,N\). Note that \(\xi_1,\dots,\xi_N\) are independent random variables with zero mean and unit variance, so according to the corollary to the Lindeberg central limit theorem stated in Corollary 32.1, the convergence in Equation 30.2 holds provided \[ \frac{\max_{1\leq i \leq N} |a_i|}{\sqrt{\sum_{i=1}^Na_i^2}} \to 0 \tag{30.3}\] as \(n,m \to \infty\). We have \[ \max_{1 \leq i \leq N} |a_i| = \Big( \Big|1 - \frac{N+1}{2}\Big| \vee \Big|N - \frac{N+1}{2}\Big|\Big) = \frac{N-1}{2} \] and, by the identity noted above, \[ \sum_{i=1}^N a_i^2 = \frac{N(N^2 - 1)}{12}. \] The ratio in Equation 30.3 is therefore of order \(N^{-1/2}\), so Equation 30.3 holds and the convergence in Equation 30.2 is established.
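
The two "some work" claims above, namely the sum-of-squares identity and the relation \(\mathbb{V}\tilde W_R = N^{-1}(N-1)\mathbb{V}W_R\), are easy to confirm numerically (arbitrary \(N\) and \(m\); a check, not a proof):

# Check the sum-of-squares identity, the variance relation, and Equation 30.3
N <- 100; m <- 40; n <- N - m
a <- (1:N) - (N + 1)/2
c(sum(a^2), N*(N^2 - 1)/12)                # equal
V_tilde <- sum(a^2)*(m/N)*(1 - m/N)
c(V_tilde, (N - 1)/N * m*n*(N + 1)/12)     # equal
max(abs(a))/sqrt(sum(a^2))                 # about sqrt(3/N), small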

Now we consider the quality of the approximation of \(\tilde W_R\) to \(W_R\). We may write \[ \frac{W_R - \mathbb{E}W_R}{\sqrt{\mathbb{V}W_R}} = \Big(\frac{\tilde W_R - \mathbb{E}\tilde W_R}{\sqrt{\mathbb{V}\tilde W_R}} + \frac{W_R - \tilde W_R}{\sqrt{\mathbb{V}\tilde W_R}} \Big)\sqrt{\frac{\mathbb{V}\tilde W_R}{\mathbb{V}W_R}}, \] since \(\mathbb{E}\tilde W_R = \mathbb{E}W_R\). Now, since \(\mathbb{V}\tilde W_R = N^{-1}(N-1)\mathbb{V}W_R\), the factor \(\sqrt{\mathbb{V}\tilde W_R/\mathbb{V}W_R} = \sqrt{(N-1)/N}\) converges to \(1\), so by Chebyshev's inequality and Slutsky's theorem, a sufficient condition for the convergence in Equation 30.2 to imply the convergence in Equation 30.1 is \[ \frac{\mathbb{E}(W_R - \tilde W_R)^2}{\mathbb{V}\tilde W_R} \to 0 \tag{30.4}\] as \(n,m \to \infty\).

First, it will be useful to write \[\begin{align} W_R - \tilde W_R &= \sum_{i=1}^N i J_i - \sum_{i=1}^N\Big(i - \frac{N+1}{2}\Big)K_i - \frac{m(N+1)}{2} \\ &= \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)(J_i - K_i), \end{align}\] where we have used the fact that \(\sum_{i=1}^N J_i = m\).

Now, since \(\mathbb{E}W_R = \mathbb{E}\tilde W_R\) we have \(\mathbb{E}(W_R - \tilde W_R)^2 = \mathbb{V}(W_R - \tilde W_R)\), so \[ \mathbb{E}(W_R - \tilde W_R)^2 = \mathbb{V}(\mathbb{E}[W_R - \tilde W_R | U_{(1)},\dots,U_{(N)}]) + \mathbb{E}(\mathbb{V}[W_R - \tilde W_R | U_{(1)},\dots,U_{(N)}]), \tag{30.5}\] where \[\begin{align} \mathbb{E}[W_R - \tilde W_R | U_{(1)},\dots,U_{(N)}] &= \mathbb{E}\Big[\sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)(J_i - K_i)\Big| U_{(1)},\dots,U_{(N)}\Big]\\ &= \mathbb{E}[J_1 - K_1| U_{(1)},\dots,U_{(N)}]\sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)\\ &= 0, \end{align}\] since, conditionally on the order statistics, \(\mathbb{E}[J_i - K_i| U_{(1)},\dots,U_{(N)}]\) is the same for every \(i\) by exchangeability, and \(\sum_{i=1}^N (i - (N+1)/2) = 0\). In addition, \[ \mathbb{V}[W_R - \tilde W_R | U_{(1)},\dots,U_{(N)}] = \mathbb{V}\Big[ \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)(J_i - K_i) | U_{(1)},\dots,U_{(N)}\Big]. \] To find the above variance, it helps to note that after conditioning on \(U_{(1)},\dots,U_{(N)}\) the values \(U_1,\dots,U_N\) on which \(K_1,\dots,K_N\) and \(J_1,\dots,J_N\) are based are a random permutation of \(U_{(1)},\dots,U_{(N)}\). We can therefore use Lemma 32.2 to write \[\begin{align} \mathbb{V}[W_R - \tilde W_R | U_{(1)},\dots,U_{(N)}] &= \frac{1}{N-1} \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)^2\sum_{i=1}^N((J_i - K_i) - (\bar J - \bar K))^2 \\ &\leq \frac{1}{N-1} \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)^2\sum_{i=1}^N(J_i - K_i)^2, \end{align}\] where \(\bar J = N^{-1}\sum_{i=1}^N J_i\) and \(\bar K =N^{-1}\sum_{i=1}^N K_i\), since the mean minimizes the least-squares criterion. Plugging the above result into Equation 30.5, we have \[ \mathbb{E}(W_R - \tilde W_R)^2 \leq \frac{1}{N-1} \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)^2 \mathbb{E}\sum_{i=1}^N (J_i - K_i)^2. \] In order to take the expectation, let \(K_{(1)},\dots,K_{(N)}\) and \(J_{(1)},\dots,J_{(N)}\) denote \(K_1,\dots,K_N\) and \(J_1,\dots,J_N\) when re-ordered according to the order of \(U_1,\dots,U_N\), so that \(K_{(i)} = \mathbf{1}(U_{(i)}\leq m/N)\) and \(J_{(i)} = \mathbf{1}(U_{(i)}\leq U_{(m)})\). Now, letting \(D = \sum_{i=1}^N\mathbf{1}(U_i \leq m/N)\), we may write \[ K_{(i)} = \left\{\begin{array}{ll} 1,& i = 1,\dots,D\\ 0,& i = D+1,\dots,N \end{array}\right. \quad \text{ and } \quad J_{(i)} = \left\{\begin{array}{ll} 1,& i = 1,\dots,m\\ 0,& i = m+1,\dots,N, \end{array}\right. \] so if \(D < m\) we have \[ \sum_{i=1}^N(J_{(i)} - K_{(i)})^2 =\sum_{i=1}^D(J_{(i)} - K_{(i)})^2+\sum_{i=D+1}^m(J_{(i)} - K_{(i)})^2+\sum_{i=m+1}^N(J_{(i)} - K_{(i)})^2 = m - D \] and if \(D > m\) we have \[ \sum_{i=1}^N(J_{(i)} - K_{(i)})^2 =\sum_{i=1}^m(J_{(i)} - K_{(i)})^2+\sum_{i=m+1}^D(J_{(i)} - K_{(i)})^2+\sum_{i=D+1}^N(J_{(i)} - K_{(i)})^2 = D - m, \] and if \(D = m\) we have \(\sum_{i=1}^N(J_{(i)} - K_{(i)})^2 = 0\). Putting these cases together and using the fact \(\sum_{i=1}^N (J_i - K_i)^2 = \sum_{i=1}^N (J_{(i)} - K_{(i)})^2\) gives \[ \sum_{i=1}^N (J_i - K_i)^2 = |D - m|, \] where \(D \sim \mathcal{B}(N,m/N)\), so, using Jensen’s inequality, we have \[ \mathbb{E}\sum_{i=1}^N (J_i - K_i)^2 = \mathbb{E}|D - m| \leq \sqrt{\mathbb{E}(D - m)^2}=\sqrt{N(m/N)(1-m/N)}. \] From here we write \[\begin{align} \frac{\mathbb{E}(W_R - \tilde W_R)^2}{\mathbb{V}\tilde W_R} &\leq \frac{(N-1)^{-1} \sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)^2 \sqrt{N(m/N)(1-m/N)}}{\sum_{i=1}^N \Big(i - \frac{N+1}{2}\Big)^2 (m/N)\Big(1 - \frac{m}{N}\Big) }\\ &= \frac{N}{N-1}\sqrt{\frac{N}{m(N-m)}}\\ &\leq \left\{\begin{array}{ll} N(N-1)^{-1}\sqrt{2/m},& m \leq N/2\\ N(N-1)^{-1}\sqrt{2/n},& m > N/2. \end{array}\right.\\ &\to 0 \end{align}\] as \(n,m\to \infty\). This completes the proof.
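
The identity \(\sum_{i=1}^N (J_i - K_i)^2 = |D - m|\) at the heart of this argument is easy to check by simulation (arbitrary \(N\) and \(m\)):

# Verify sum (J_i - K_i)^2 = |D - m|, where D = #{i: U_i <= m/N}
set.seed(3)
N <- 30; m <- 10
same <- replicate(1000, {
  U <- runif(N)
  J <- as.numeric(U <= sort(U)[m])   # dependent indicators
  K <- as.numeric(U <= m/N)          # independent approximations
  sum((J - K)^2) == abs(sum(K) - m)
})
all(same)                            # TRUE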

According to Theorem 30.1, the rule that rejects \(H_0\): \(F = G\) when \[ W_R \geq \mathbb{E}W_R + z_\alpha \sqrt{\mathbb{V}W_R}, \] or equivalently when \[ W_{XY} \geq \frac{1}{2}nm + z_\alpha \sqrt{\frac{1}{12}m(N - m)(N+1)}, \tag{30.6}\] will have Type I error rate converging to \(\alpha\) as \(n,m\to \infty\). (Recall from the previous section that \(W_{XY} = W_R - m(m+1)/2\), so \(\mathbb{E}W_{XY} = nm/2\) and \(\mathbb{V}W_{XY} = \mathbb{V}W_R\).) A “continuity correction” is often employed, whereby one adds \(1/2\) to the right-hand side of Equation 30.6.
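
A short sketch of the rejection rule in Equation 30.6 with the continuity correction is below; the function name and interface are illustrative rather than from this text.

# Illustrative implementation of the asymptotic level-alpha rejection rule
reject_rank_sum <- function(x, y, alpha = 0.05, correct = TRUE) {
  n <- length(x); m <- length(y); N <- n + m
  W_R <- sum(rank(c(x, y))[(n + 1):N])   # rank sum of the y's
  W_XY <- W_R - m*(m + 1)/2
  cc <- if (correct) 1/2 else 0
  W_XY >= n*m/2 + cc + qnorm(1 - alpha)*sqrt(m*(N - m)*(N + 1)/12)
}
set.seed(4)
reject_rank_sum(rnorm(30), rnorm(40) + 1)  # typically TRUE under this shift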

The corresponding p-value is given by \[ \text{p-value }\approx 1 - \Phi\Big(\frac{W_{XY} - nm/2}{\sqrt{m(N-m)(N+1)/12}}\Big), \] to which one can apply the continuity correction by subtracting \(1/2\) from the numerator of the argument of \(\Phi(\cdot)\).
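
In the absence of ties, this approximate p-value should coincide with what wilcox.test() reports when asked for its normal approximation; a quick comparison on illustrative data:

# Compare the hand-computed approximate p-value with wilcox.test();
# with no ties in the data the two should agree
set.seed(5)
x <- rnorm(25); y <- rnorm(35) + 0.5
n <- length(x); m <- length(y); N <- n + m
W_XY <- sum(rank(c(x, y))[(n + 1):N]) - m*(m + 1)/2
p_hand <- 1 - pnorm((W_XY - n*m/2 - 1/2)/sqrt(m*(N - m)*(N + 1)/12))
p_wt <- wilcox.test(y, x, alternative = "greater",
                    exact = FALSE, correct = TRUE)$p.value
c(p_hand, p_wt)                       # equal up to rounding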

Figure 30.1 gives an idea of how quickly the convergence in Theorem 30.1 takes place as \(n,m \to \infty\). It plots the exact Type I error rate of the rejection rule in Equation 30.6 over an increasing sequence of \(N\) values, with \(m = \lfloor N/3 \rfloor\) for each \(N\), at \(\alpha = 0.01,0.05,0.10\), with and without the continuity correction. We see that as \(N\) grows large, the rejection rule based on the asymptotic distribution has Type I error rates approaching the nominal levels. The continuity correction brings the Type I error rate noticeably closer to the nominal level for smaller values of \(N\). Note that this plot is not the result of a simulation; each curve is computed from the exact distribution of the statistic \(W_{XY}\) under \(H_0\): \(F = G\) using the function pwilcox().

# Exact Type I error rate of the asymptotic rejection rule in Equation 30.6,
# with and without the continuity correction, over a grid of values of N
NN <- 6:50
typ1_01_exact <- numeric(length(NN))
typ1_05_exact <- numeric(length(NN))
typ1_10_exact <- numeric(length(NN))
typ1_corr_01_exact <- numeric(length(NN))
typ1_corr_05_exact <- numeric(length(NN))
typ1_corr_10_exact <- numeric(length(NN))
for(i in 1:length(NN)){

  N <- NN[i]
  m <- floor(N/3)                  # size of the Y sample
  n <- N - m                       # size of the X sample
  mu <- n*m/2                      # null mean of W_XY
  sigma <- sqrt(m*(N-m)*(N+1)/12)  # null sd of W_XY

  # asymptotic critical values: reject H0 when W_XY >= c
  c01_asymp <- mu + qnorm(0.99)*sigma
  c05_asymp <- mu + qnorm(0.95)*sigma
  c10_asymp <- mu + qnorm(0.90)*sigma
  # since W_XY is integer-valued, P(W_XY >= c) = 1 - P(W_XY <= ceiling(c) - 1)
  typ1_01_exact[i] <- 1 - pwilcox(ceiling(c01_asymp) - 1, m = m, n = n)
  typ1_05_exact[i] <- 1 - pwilcox(ceiling(c05_asymp) - 1, m = m, n = n)
  typ1_10_exact[i] <- 1 - pwilcox(ceiling(c10_asymp) - 1, m = m, n = n)

  # critical values with the continuity correction (add 1/2)
  c01_asymp_corr <- mu + 1/2 + qnorm(0.99)*sigma
  c05_asymp_corr <- mu + 1/2 + qnorm(0.95)*sigma
  c10_asymp_corr <- mu + 1/2 + qnorm(0.90)*sigma
  typ1_corr_01_exact[i] <- 1 - pwilcox(ceiling(c01_asymp_corr) - 1, m = m, n = n)
  typ1_corr_05_exact[i] <- 1 - pwilcox(ceiling(c05_asymp_corr) - 1, m = m, n = n)
  typ1_corr_10_exact[i] <- 1 - pwilcox(ceiling(c10_asymp_corr) - 1, m = m, n = n)

}

# solid: no correction; dashed: with correction; dotted: nominal level
plot(NA,
     xlim = range(NN),
     ylim = range(typ1_01_exact, typ1_05_exact, typ1_10_exact,
                  typ1_corr_01_exact, typ1_corr_05_exact, typ1_corr_10_exact),
     ylab = "Type I error rate",
     xlab = "N")

lines(typ1_01_exact ~ NN)
lines(typ1_corr_01_exact ~ NN, lty = 2)
abline(h = 0.01, lty = 3)

lines(typ1_05_exact ~ NN, col = "blue")
lines(typ1_corr_05_exact ~ NN, col = "blue", lty = 2)
abline(h = 0.05, lty = 3, col = "blue")

lines(typ1_10_exact ~ NN, col = "red")
lines(typ1_corr_10_exact ~ NN, col = "red", lty = 2)
abline(h = 0.10, lty = 3, col = "red")

legend("topright",
       legend = c("0.10","0.10 with correction","0.05","0.05 with correction",
                  "0.01","0.01 with correction"),
       col = c("red","red","blue","blue","black","black"),
       lty = c(1,2,1,2,1,2),
       bty = "n")
Figure 30.1: The exact Type I error rate of the rejection rule in Equation 30.6 over increasing \(N\) with \(m = \lfloor N / 3 \rfloor\) at each \(\alpha = 0.01,0.05,0.10\). Dashed lines trace the Type I error rates when the continuity correction is used.