Backfitting is a simple and fast algorithm for computing estimators of \(m_1,\dots,m_p\) in the additive model.
To understand the backfitting algorithm, consider for a moment a tuple of random variables \((X_1,\dots,X_p,Y) \in [0,1]^p\times\mathbb{R}\) such that \[
Y = m_1(X_1) + \dots + m_p(X_p) + \varepsilon,
\] where \(\varepsilon\) is independent of \(X_1,\dots,X_p\) and has mean \(0\). Taking the conditional expectation \(\mathbb{E}[\,\cdot\,|X_j]\) of both sides and rearranging (noting that \(\mathbb{E}[\varepsilon|X_j] = 0\)) gives \[
m_j(X_j) = \mathbb{E}[Y|X_j] - \sum_{k\neq j}\mathbb{E}[m_k(X_k)|X_j]
\] for \(j=1,\dots,p\). Now, if we let \(\Pi_j\) denote the operator which takes the conditional expectation \(\mathbb{E}[\,\cdot\,|X_j]\), and if we write \(m_j(X_j)\) simply as \(m_j\), we can represent the above system of equations symbolically as \[
\left[\begin{array}{cccc}
I&\Pi_1& \dots & \Pi_1 \\
\Pi_2& I & \dots & \Pi_2\\
\vdots & \vdots& \ddots & \vdots \\
\Pi_p & \Pi_p & \dots & I
\end{array}\right] \left[\begin{array}{c}
m_1 \\
m_2\\
\vdots \\
m_p
\end{array}\right] = \left[\begin{array}{c}
\Pi_1 Y \\
\Pi_2 Y\\
\vdots \\
\Pi_p Y
\end{array}\right],
\] where \(I\) is the identity operator (such that \(I m_j = m_j\)).
The strategy of backfitting is to solve an empirical version of this system of equations. Let \(\mathbf{S}_1,\dots,\mathbf{S}_p\) be \(n \times n\) smoother matrices associated with univariate smoothers, and let \(\mathbf{Y}= (Y_1,\dots,Y_n)^T\) be the \(n\times 1\) vector of response values. Backfitting solves, for \(\hat{\mathbf{m}}_1,\dots,\hat{\mathbf{m}}_p\), the system \[
\left[\begin{array}{cccc}
\mathbf{I}_n&\mathbf{S}_1& \dots & \mathbf{S}_1 \\
\mathbf{S}_2& \mathbf{I}_n & \dots & \mathbf{S}_2\\
\vdots & \vdots& \ddots & \vdots \\
\mathbf{S}_p & \mathbf{S}_p & \dots & \mathbf{I}_n
\end{array}\right] \left[\begin{array}{c}
\hat{\mathbf{m}}_1 \\
\hat{\mathbf{m}}_2\\
\vdots \\
\hat{\mathbf{m}}_p
\end{array}\right] = \left[\begin{array}{c}
\mathbf{S}_1 \mathbf{Y} \\
\mathbf{S}_2 \mathbf{Y}\\
\vdots \\
\mathbf{S}_p \mathbf{Y}
\end{array}\right],
\] where \(\mathbf{I}_n\) denotes the \(n\times n\) identity matrix, using the Gauss-Seidel method (see Buja, Hastie, and Tibshirani (1989)). The solutions \(\hat{\mathbf{m}}_1,\dots,\hat{\mathbf{m}}_p\) are \(n\times 1\) vectors whose entries are the evaluations of the estimated additive components at the design points; that is, \(\hat{\mathbf{m}}_j = (\hat m_j(x_{1j}),\dots,\hat m_j(x_{nj}))^T\) for \(j = 1,\dots,p\).
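To connect this system with the algorithm below, note that its \(j\)th block row reads \[
\hat{\mathbf{m}}_j + \mathbf{S}_j \sum_{k \neq j}\hat{\mathbf{m}}_k = \mathbf{S}_j \mathbf{Y}, \qquad \text{i.e.} \qquad \hat{\mathbf{m}}_j = \mathbf{S}_j\Big(\mathbf{Y} - \sum_{k \neq j}\hat{\mathbf{m}}_k\Big),
\] and the Gauss-Seidel method simply cycles through \(j = 1,\dots,p\), solving each block row for \(\hat{\mathbf{m}}_j\) with the other components held at their current values.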
The steps of the backfitting algorithm are as follows:
Definition 4.1 (Backfitting algorithm) Initialize \(\hat{\mathbf{m}}_1,\dots,\hat{\mathbf{m}}_p = \mathbf{0}\). Then do: for \(j =1,\dots,p\), update \(\hat{\mathbf{m}}_j\) as \[
\hat{\mathbf{m}}_j \leftarrow \mathbf{S}_j\Big(\mathbf{Y} - \sum_{k \neq j}\hat{\mathbf{m}}_k\Big),
\] and then center \(\hat{\mathbf{m}}_j\) by subtracting its mean (as is done in the code below). Repeat until convergence (until the change in \(\hat{\mathbf{m}}_1,\dots,\hat{\mathbf{m}}_p\) is negligible).
To implement the backfitting algorithm, we just need to construct the smoother matrices, for example:
Least-squares splines: \(\mathbf{S}_j = \mathbf{B}_j(\mathbf{B}_j^T\mathbf{B}_j)^{-1}\mathbf{B}_j^T\), where \(\mathbf{B}_j = (b_{k}(x_{ij}))_{1\leq i \leq n, 1 \leq k \leq d}\) for \(j=1,\dots,p\), where \(b_1,\dots,b_d\) are, say, cubic B-spline basis functions.
Penalized splines: \(\mathbf{S}_j = \mathbf{B}_j(\mathbf{B}_j^T\mathbf{B}_j + \lambda \boldsymbol{\Omega})^{-1}\mathbf{B}_j^T\) for \(j=1,\dots,p\), where \(\boldsymbol{\Omega}= (\int_0^1b''_k(x)b''_{k'}(x)dx)_{1\leq k,k' \leq d}\) and \(\lambda > 0\) tunes the smoothness.
Nadaraya-Watson: \(\mathbf{S}_j = \left( \dfrac{K((x_{i'j} - x_{ij})/h)}{\sum_{k=1}^nK((x_{k j} - x_{ij})/h)}\right)_{1 \leq i \leq n, 1 \leq i' \leq n}\) for each \(j=1,\dots,p\), where \(K\) is a kernel function and \(h > 0\) is a bandwidth.
The following code uses the backfitting estimator to estimate each \(m_j\) in an additive model, with data generated according to a setup considered in Meier et al. (2009).
# generate data
p <- 8
rho <- 0.5
n <- 200
m1 <- function(x){ -2*sin(x*2) }
m2 <- function(x){ x^2 - 25/12 }
m3 <- function(x){ x }
m4 <- function(x){ (exp(-x) - 2/5*sinh(5/2))/2 }
# generate covariates marginally uniform and having correlation matrix R
R <- rho^abs(outer(1:p, 1:p, "-"))
P <- 2*sin(R * pi / 6)
X <- (pnorm(matrix(rnorm(n*p), ncol = p) %*% chol(P)) - .5) * 5
# make the true functions satisfy the identifiability condition
m1X <- m1(X[,1]) - mean(m1(X[,1]))
m2X <- m2(X[,2]) - mean(m2(X[,2]))
m3X <- m3(X[,3]) - mean(m3(X[,3]))
m4X <- m4(X[,4]) - mean(m4(X[,4]))
mu <- 4
Y <- mu + m1X + m2X + m3X + m4X + rnorm(n, 0, 1)
# build smoother matrices for least-squares splines
library(splines)
K <- 8
d <- K + 3 # number of functions
u <- ((0:K)/K - 1/2)*5
urep <- c(rep(u[1], 3), u, rep(u[K+1], 3)) # knots
S <- array(0, dim = c(n, n, p)) # store smoother matrices in an array
for(j in 1:p){
  B <- splineDesign(knots = urep, x = X[,j], ord = 3 + 1)
  S[,,j] <- B %*% solve(t(B) %*% B) %*% t(B)
}
# define backfitting function
backfit <- function(Y, S, tol = 1e-6){
  n <- length(Y)
  p <- dim(S)[3]
  conv <- FALSE
  M <- matrix(0, n, p)
  while(conv == FALSE){
    M0 <- M
    for(j in 1:p){
      M[,j] <- S[,,j] %*% (Y - apply(M[,-j], 1, sum))
      M[,j] <- M[,j] - mean(M[,j])
    }
    conv <- max(abs(M - M0)) < tol
  }
  return(M)
}
# center the response and obtain the backfitting estimates
Ycent <- Y - mean(Y)
M <- backfit(Ycent, S)
par(mfrow = c(2, 4), mar = c(2.1, 2.1, 1.1, 1.1), oma = c(0, 0, 2, 0))
for(j in 1:p){
  Yj <- Ycent - apply(M[,-j], 1, sum)
  plot(Yj ~ X[,j], col = "gray")
  lines(M[order(X[,j]), j] ~ sort(X[,j]), col = rgb(.545, 0, 0))
}
Backfitting estimators of additive components using least-squares spline smoothers
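As a quick sanity check (just a sketch, assuming the objects created above, in particular M and m1X, m2X, m3X, m4X, are still in the workspace), one can compare the estimated components at the design points with the true centered component functions; the remaining components, whose true functions are zero, should be estimated as roughly zero.

# compare fitted and true centered components at the design points
round(c(cor(M[,1], m1X), cor(M[,2], m2X), cor(M[,3], m3X), cor(M[,4], m4X)), 3)
# components 5 through 8 have true function zero, so their fits should be small
round(apply(abs(M[,5:8]), 2, max), 3)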
The same estimation can be carried out with penalized spline smoothers:
K <- 20
lam <- 1
d <- K + 3 # number of functions
u <- ((0:K)/K - 1/2)*5
urep <- c(rep(u[1], 3), u, rep(u[K+1], 3)) # knots
# obtain 2nd derivatives of the B-spline basis functions at the knots
ddBu <- spline.des(knots = urep, x = u, derivs = rep(2, length(u)), ord = 4)$design
# compute the matrix Omega of integrated products of 2nd derivatives;
# the 2nd derivatives of cubic B-splines are piecewise linear between knots,
# so the integral over each knot interval is computed exactly from its endpoint values
Om <- matrix(NA, K + 3, K + 3)
for(l in 1:(K+3))
  for(j in 1:(K+3)){
    tk <- ddBu[-(K+1), l]
    gk <- ddBu[-(K+1), j]
    tkp1 <- ddBu[-1, l]
    gkp1 <- ddBu[-1, j]
    Om[l,j] <- sum((u[-1] - u[-(K+1)])*((1/2)*(gk*tkp1 + gkp1*tk) + (1/3)*(tkp1 - tk)*(gkp1 - gk)))
  }
Spen <- array(0, dim = c(n, n, p)) # store smoother matrices in an array
for(j in 1:p){
  B <- splineDesign(knots = urep, x = X[,j], ord = 3 + 1)
  Spen[,,j] <- B %*% solve(t(B) %*% B + lam * Om) %*% t(B)
}
# center the response and obtain the backfitting estimates
Ycent <- Y - mean(Y)
M <- backfit(Ycent, Spen)
par(mfrow = c(2, 4), mar = c(2.1, 2.1, 1.1, 1.1), oma = c(0, 0, 2, 0))
for(j in 1:p){
  Yj <- Ycent - apply(M[,-j], 1, sum)
  plot(Yj ~ X[,j], col = "gray")
  lines(M[order(X[,j]), j] ~ sort(X[,j]), col = rgb(0, 0, .545))
}
Backfitting estimators of additive components using penalized spline smoothers
And with Nadaraya-Watson smoothers:
h <- 0.5
Snw <- array(0, dim = c(n, n, p)) # store smoother matrices in an array
for(j in 1:p){
  Sj <- dnorm(outer(X[,j], X[,j], "-")/h)
  Sj <- Sj / apply(Sj, 2, sum)
  Snw[,,j] <- Sj
}
# center the response and obtain the backfitting estimates
Ycent <- Y - mean(Y)
M <- backfit(Ycent, Snw)
par(mfrow = c(2, 4), mar = c(2.1, 2.1, 1.1, 1.1), oma = c(0, 0, 2, 0))
for(j in 1:p){
  Yj <- Ycent - apply(M[,-j], 1, sum)
  plot(Yj ~ X[,j], col = "gray")
  lines(M[order(X[,j]), j] ~ sort(X[,j]), col = rgb(0, 0.545, 0))
}
Backfitting estimators of additive components using Nadaraya-Watson smoothers
Note that the backfitting algorithm only returns the values of the estimated additive components at the design points, that is, at the observed values of the predictors.
In order to evaluate an estimated additive component \(\hat m_j\) at a value \(x\) which is not a design point, one can re-apply the univariate smoother corresponding to the smoother matrix \(\mathbf{S}_j\) to the data points \((x_{1j},\hat Y_{1j}),\dots,(x_{nj},\hat Y_{nj})\), where \(\hat Y_{1j},\dots,\hat Y_{nj}\) are the entries of the vector \(\mathbf{Y}- \sum_{k\neq j}\hat{\mathbf{m}}_k\).
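For instance, with the Nadaraya-Watson smoother used above, a minimal sketch of this evaluation step could look as follows (it assumes X, Ycent, M, and the bandwidth h from the code above are in the workspace; the function predict_nw is a hypothetical helper, not part of the code above):

# evaluate the j-th estimated component at new points x0 by re-applying the
# Nadaraya-Watson smoother to the partial residuals Ycent minus the other fits
predict_nw <- function(x0, j, X, Ycent, M, h){
  Yj <- Ycent - apply(M[,-j], 1, sum)   # partial residuals for component j
  W <- dnorm(outer(x0, X[,j], "-")/h)   # kernel weights, one row per new point
  W <- W / apply(W, 1, sum)             # normalize each row of weights to sum to one
  as.numeric(W %*% Yj)                  # weighted averages of the partial residuals
}
# example: evaluate the first estimated component on a grid of new values
x0 <- seq(-2.4, 2.4, length.out = 100)
m1hat_grid <- predict_nw(x0, j = 1, X = X, Ycent = Ycent, M = M, h = h)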
Note that one may construct the smoother matrix differently for each \(j\). For example, in the case of spline smoothing one might, for each predictor, position the knots at equally spaced quantiles of the observed values, resulting in a different set of knots for each \(j\).
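For example, here is a minimal sketch of building least-squares spline smoother matrices with the knots placed at equally spaced quantiles of each predictor (this reuses splineDesign from the code above; the choice of K and the object names are illustrative assumptions):

library(splines)
K <- 8                                     # number of knot intervals (assumed)
Squant <- array(0, dim = c(n, n, p))       # one smoother matrix per predictor
for(j in 1:p){
  u_j <- quantile(X[,j], probs = (0:K)/K)  # knots at equally spaced quantiles of X[,j]
  urep_j <- c(rep(u_j[1], 3), u_j, rep(u_j[K+1], 3)) # repeat boundary knots for cubic splines
  B <- splineDesign(knots = urep_j, x = X[,j], ord = 4)
  Squant[,,j] <- B %*% solve(t(B) %*% B) %*% t(B)
}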
Buja, Andreas, Trevor Hastie, and Robert Tibshirani. 1989. “Linear Smoothers and Additive Models.” The Annals of Statistics, 453–510.
Meier, Lukas, Sara Van de Geer, and Peter Bühlmann. 2009. “High-Dimensional Additive Modeling.” The Annals of Statistics 37 (6B): 3779–3821.