29  The Wilcoxon rank sum test

The Wilcoxon rank sum test is the quintessential traditional nonparametric test. We take a deep dive into this one; other rank-based methods can be developed along similar lines.

Suppose we collect random samples from a “control” and “treatment” population such that \[\begin{align} X_1,\dots,X_n &\overset{\text{ind}}{\sim}F \quad \text{``control''} \\ Y_1,\dots,Y_m &\overset{\text{ind}}{\sim}G \quad \text{``treatment''} \end{align}\] with \(N = n + m\) the total number of observations.

The Wilcoxon rank sum test was devised as a way to test the effectiveness of a treatment (say we draw \(N\) subjects from a population and randomly assign \(n\) to the control and \(m\) to the treatment group). For our development, suppose the treatment is effective if it tends to increase the measured outcomes, that is, if it tends to make the \(Y_i\) greater than the \(X_i\).

There are various senses in which a treatment could “tend to make the \(Y_i\) greater than the \(X_i\).” The Wilcoxon rank sum test essentially estimates the probability \(P(X \leq Y)\), where \(X \sim F\) and \(Y \sim G\), and concludes that the treatment is effective if this estimate is significantly greater than \(1/2\). Other ways of ordering two random variables \(X\) and \(Y\) appear in the literature, but this is the one on which the Wilcoxon rank sum test is built.

The Wilcoxon rank sum test calls for the rejection of the null hypothesis \(H_0\): \(F = G\) when the test statistic \[ W_{XY} = \sum_{i=1}^n\sum_{j=1}^m \mathbf{1}(X_i \leq Y_j) \] is large, that is, when \(W_{XY} \geq c\) for some critical value \(c\) calibrated to control the Type I error rate. Note that \((nm)^{-1}W_{XY}\) is an unbiased estimator of \(P(X \leq Y)\), which has the value \(1/2\) under \(H_0\): \(F = G\).
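As a quick illustration of the double-sum definition, here is a minimal R sketch (the data and variable names are purely illustrative) that computes \(W_{XY}\) directly and forms the estimate \((nm)^{-1}W_{XY}\) of \(P(X \leq Y)\):

# compute WXY directly from its double-sum definition (illustrative data)
set.seed(1)
x <- rnorm(6)                 # small "control" sample
y <- rnorm(4, mean = 0.5)     # small "treatment" sample
WXY <- sum(outer(x, y, "<=")) # counts the pairs with x_i <= y_j
WXY / (length(x) * length(y)) # unbiased estimate of P(X <= Y)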

The more effective the treatment at “making the \(Y_i\) greater than the \(X_i\),” the greater we expect the value \(W_{XY}\) to be, so we reject the null hypothesis when \(W_{XY}\) meets or exceeds some threshold.

We can write \(W_{XY}\) in the form of a sum of ranks as follows:

  1. Let \((Z_1,\dots,Z_N) = (X_1,\dots,X_n,Y_1,\dots,Y_m)\) denote the complete set of data values.
  2. Obtain the ranks of the treatment observations as \(R_j = \sum_{i=1}^N\mathbf{1}(Z_i \leq Y_j)\), \(j = 1,\dots,m\) (assume there are no tied data values, so that no two of the ranks \(R_1,\dots,R_m\) are equal).
  3. Set \(W_R = \sum_{j=1}^m R_j\).

Then one can show \[ W_{XY} = W_R - m(m+1)/2. \tag{29.1}\] Hence the name “rank sum test”.

To see why, let \(Y_{(1)} < \cdots < Y_{(m)}\) denote the ordered treatment observations and write \[\begin{align} W_{XY} &= \sum_{i=1}^n\sum_{j=1}^m\mathbf{1}(X_i \leq Y_j) \\ &= \sum_{j=1}^m \sum_{i=1}^n \mathbf{1}(X_i \leq Y_{(j)}) \\ &= \sum_{j=1}^m \Big[\sum_{i=1}^n \mathbf{1}(X_i \leq Y_{(j)}) + \sum_{k=1}^m \mathbf{1}(Y_k \leq Y_{(j)}) - \sum_{k=1}^m \mathbf{1}(Y_k \leq Y_{(j)}) \Big]\\ &= \sum_{j=1}^m R_j - \sum_{j=1}^m \sum_{k=1}^m \mathbf{1}(Y_k \leq Y_{(j)})\\ &= W_R - \frac{m(m+1)}{2}, \end{align}\] where the second and fourth equalities hold because summing over the ordered values \(Y_{(1)},\dots,Y_{(m)}\) merely rearranges the terms of a sum over the treatment observations, and the last equality holds because, in the absence of ties, \(\sum_{k=1}^m \mathbf{1}(Y_k \leq Y_{(j)}) = j\), so the subtracted double sum equals \(1 + 2 + \cdots + m = m(m+1)/2\).
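The identity (29.1) is easy to check numerically; here is a small R sketch (with illustrative data and variable names), assuming no tied values:

# numerically verify WXY = WR - m(m+1)/2 on a small data set with no ties
set.seed(2)
x <- rnorm(5)
y <- rnorm(7)
n <- length(x); m <- length(y)
WXY <- sum(outer(x, y, "<="))            # double-sum definition
R <- rank(c(x, y))[(n + 1):(n + m)]      # ranks of the y's among all N = n + m values
WR <- sum(R)
c(WXY, WR - m * (m + 1) / 2)             # the two values agree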

Since \(W_{XY}\) is merely a shifted version of \(W_R\), the two statistics carry equivalent information. We will find it more convenient to analyze the rank sum \(W_R\). Here is as good a place as any to mention that the Wilcoxon rank sum test is equivalent to a test called the Mann-Whitney U test.

Interestingly, under \(H_0\): \(F = G\) one can tabulate the exact distribution of the statistics \(W_{XY}\) and \(W_R\). If \(F\) and \(G\) are both continuous distributions (so that ties occur with probability zero), the set of ranks occupied by the treatment observations is uniformly distributed over all size-\(m\) subsets of \(\{1,\dots,N\}\); that is, \[ P\big( \{R_1,\dots,R_m\} = \{r_1,\dots,r_m\}\big) = \frac{1}{{ N \choose m}} \] for each set of \(m\) distinct ranks \(r_1,\dots,r_m \in \{1,\dots,N\}\). This allows us to tabulate the distributions of \(W_{XY}\) and \(W_R\) for any \(m\) and \(N\), as in Example 29.1. When tied data values are allowed, writing down the exact distribution of \(W_{XY}\) or \(W_R\) becomes cumbersome; see Lehmann (1975).

Example 29.1 (Tabulating the distribution of the rank-sum statistic) For \(N = 5\) and \(m = 2\), the values of \(W_R\) and \(W_{XY}\) for all possible pairs of ranks \((r_1,r_2)\) with \(r_1 < r_2\) are enumerated in Table 29.1.

Table 29.1: Values of \(W_R\) and \(W_{XY}\) for all possible pairs of ranks \((r_1,r_2)\) when \(N = 5\) and \(m = 2\).

| \(r_1\) | \(r_2\) | \(W_R\) | \(W_{XY}\) |
|---------|---------|---------|------------|
| 1 | 2 | 3 | 0 |
| 1 | 3 | 4 | 1 |
| 1 | 4 | 5 | 2 |
| 1 | 5 | 6 | 3 |
| 2 | 3 | 5 | 2 |
| 2 | 4 | 6 | 3 |
| 2 | 5 | 7 | 4 |
| 3 | 4 | 7 | 4 |
| 3 | 5 | 8 | 5 |
| 4 | 5 | 9 | 6 |

Since each pair of ranks occurs with equal probability (under the null hypothesis), we can tabulate the null distribution of the statistic \(W_{XY}\) as in Table 29.2.

Table 29.2: Distribution of \(W_{XY}\) when \(N = 5\) and \(m = 2\).

| \(w\) | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|---|
| \(P(W_{XY} = w)\) | 0.1 | 0.1 | 0.2 | 0.2 | 0.2 | 0.1 | 0.1 |

From Table 29.2, we see that the test which rejects \(H_0\) when \(W_{XY} \geq 6\) has a Type I error rate of \(0.1\). Since \(W_{XY}\) is discrete, one may not be able to identify a critical value \(c\) such that the Type I error rate of the rule \(W_{XY} \geq c\) is exactly equal to the desired rate. The distribution of \(W_R\) can be tabulated in the same manner.
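As an illustration, the null distribution in Table 29.2 can be reproduced by brute-force enumeration; the R sketch below (written just for this small case) lists all \(\binom{N}{m}\) equally likely sets of treatment ranks and tallies the resulting values of \(W_{XY}\):

# enumerate all C(N, m) equally likely rank sets for N = 5, m = 2
N <- 5
m <- 2
rank_sets <- combn(N, m)            # each column is one possible set of treatment ranks
WR_all <- colSums(rank_sets)        # rank sum WR for each set
WXY_all <- WR_all - m * (m + 1) / 2 # corresponding values of WXY
table(WXY_all) / choose(N, m)       # null pmf of WXY; matches Table 29.2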

Lehmann (1975) has several pages of tables in the back giving values of \(P(W_{XY} \leq c)\) for different (small) values of \(n\) and \(m\). When \(m\) and \(N\) are both large, it can be computationally costly to compute probabilities for \(W_{XY}\) and \(W_R\) from their exact distributions, as one must essentially count rows of a table like that in Example 29.1. Therefore, when \(m\) and \(N\) are large, one may prefer to use approximations to the distributions of \(W_{XY}\) and \(W_R\).

Below is a quick illustration in R showing how to compute the rank sum statistic \(W_R\) as well as the statistic \(W_{XY}\). The code uses the function pwilcox(), which evaluates the exact cdf of \(W_{XY}\); the documentation warns that it is slow when \(n\) and \(m\) are large. One can compare the p-value obtained from pwilcox() to that returned by wilcox.test(), which also makes use of the exact distribution of \(W_{XY}\) so long as \(n\) and \(m\) are not too large.

# generate some data (both samples drawn from the same N(1,1) distribution)
n <- 20
m <- 25
X <- rnorm(n,1,1)   # "control" sample
Y <- rnorm(m,1,1)   # "treatment" sample

# compute WR and WXY
Z <- c(X,Y)                    # pooled sample of size N = n + m
id <- c(rep(1,n),rep(2,m))     # 1 = control, 2 = treatment
R <- rank(Z)[id == 2]          # ranks of the treatment observations
WR <- sum(R)                   # rank sum statistic
WXY <- WR - m*(m+1)/2          # via the identity (29.1)
WXY
[1] 315
pval <- 1 - pwilcox(WXY-1,m = m, n = n) # must subtract 1, since we reject when WXY >= c
pval
[1] 0.07091832
# check that the wilcox.test() function gives the same values
wilcox.test(x=Y,y=X,alternative="greater",exact = TRUE) # treatment sample given first so that W matches WXY

    Wilcoxon rank sum exact test

data:  Y and X
W = 315, p-value = 0.07092
alternative hypothesis: true location shift is greater than 0

In the next section we will study a Normal approximation to the distribution of \(W_R\).