6 minute read

Recall that if there is a non-linear relationship between predictor and response, we can attempt to transform the predictor using a known function (log, reciprocal, polynomial, etc.) to improve the model structure and fit. What if the relationship is more complex and is not well captured with a known function? Generalized additive models may be used in these cases.

Recall that a linear model takes the form:

\[y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+...+\varepsilon\]

Additive models replace the linear terms (the $\beta$s) with flexible smoothing functions and take the form:

\[y=\beta_{0}+f_{1}(x_{1})+f_{2}(x_{2})+...+\varepsilon\]

There are many techniques and options for selecting the smoothing functions, but for this tutorial, we’ll discuss two: locally weighted error sum of squares (lowess and also commonly abbreviated as loess) and smoothing splines.

Loess

For the theory behind loess smoothing, please read this page on the NIST website. This chapter will focus on implementing loess smoothing in R.

All smoothers have a tuning parameter that controls how smooth the smoother is. The tuning parameter in loess is referred to as the span with larger values producing more smoothness.

library(tidyverse)
set.seed(42)

df = tibble(
  x = runif(100, 1.5, 5.5),
  y = sin(x*pi) + 2 + runif(100, 0, 0.5)
)

ex1.ls = loess(y~x, span=0.25, data=df)
ex2.ls = loess(y~x, span=0.5, data=df)
ex3.ls = loess(y~x, span=0.75, data=df)
xseq = seq(1.6, 5.4,length=100)

df = df %>% mutate(
  span25 = predict(ex1.ls, newdata=tibble(x=xseq)),
  span50 = predict(ex2.ls, newdata=tibble(x=xseq)),
  span75 = predict(ex3.ls, newdata=tibble(x=xseq))
  )

ggplot(df) +
  geom_point(aes(x, y)) +
  geom_line(aes(x=xseq, span25, linetype='span = 0.25'), color='red') +
  geom_line(aes(x=xseq, span50, linetype='span = 0.50'), color='red') +
  geom_line(aes(x=xseq, span75, linetype='span = 0.75'), color='red') +
  scale_linetype_manual(name="Legend", values=c(1,2,3)) +
  ggtitle("Loess Smoother Example") +
  theme_bw()

From this plot, a span of 0.75 provided too much smoothness, whereas the lower values of span we tested appear to be a better fit. Now let’s apply this to the airqaulity data set from the previous chapter. Initially, we’ll just consider the response(Ozone) and one predictor (Solar.R).

aq1.ls = loess(Ozone ~ Solar.R, span=0.25, data=airquality)
aq2.ls = loess(Ozone ~ Solar.R, span=0.5, data=airquality)
aq3.ls = loess(Ozone ~ Solar.R, span=0.75, data=airquality)
srseq = seq(10, 330, length=nrow(airquality))

aq = airquality %>% mutate(
  span25 = predict(aq1.ls, newdata=tibble(Solar.R=srseq)),
  span50 = predict(aq2.ls, newdata=tibble(Solar.R=srseq)),
  span75 = predict(aq3.ls, newdata=tibble(Solar.R=srseq))
  )

ggplot(aq) +
  geom_point(aes(Solar.R, Ozone)) +
  geom_line(aes(x=srseq, span25, linetype='span = 0.25'), color='red') +
  geom_line(aes(x=srseq, span50, linetype='span = 0.50'), color='red') +
  geom_line(aes(x=srseq, span75, linetype='span = 0.75'), color='red') +
  scale_linetype_manual(name="Legend", values=c(1,2,3)) +
  ggtitle("Loess Smoother Example") +
  theme_bw()

Here we can see that the higher span values appear to provide a better fit. In this case, choosing a low span value would be akin to over fitting a linear model with too high of a degree of polynomial. We can repeat this process to determine appropriate values of span for the other predictors.

Including loess smoothers in a GAM is as simple as including the non-linear terms within lo(). The gam package provides the needed functionality. The script below applies loess smoothers to three of the predictors and displays the model summary (note that the default value for span is 0.5).

library(gam)

aq.gam = gam(Ozone ~ lo(Solar.R, span=0.75) + lo(Wind) + lo(Temp), data=airquality, na=na.gam.replace)
summary(aq.gam)
##
## Call: gam(formula = Ozone ~ lo(Solar.R, span = 0.75) + lo(Wind) + lo(Temp),
##     data = airquality, na.action = na.gam.replace)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -47.076  -9.601  -2.721   8.977  76.583
##
## (Dispersion Parameter for gaussian family taken to be 319.3603)
##
##     Null Deviance: 125143.1 on 115 degrees of freedom
## Residual Deviance: 33679.85 on 105.4604 degrees of freedom
## AIC: 1010.117
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
##                              Df Sum Sq Mean Sq F value    Pr(>F)    
## lo(Solar.R, span = 0.75)   1.00  14248   14248  44.615 1.160e-09 ***
## lo(Wind)                   1.00  35734   35734 111.894 < 2.2e-16 ***
## lo(Temp)                   1.00  15042   15042  47.099 4.794e-10 ***
## Residuals                105.46  33680     319                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                          Npar Df Npar F     Pr(F)    
## (Intercept)                                          
## lo(Solar.R, span = 0.75)     1.2 2.9766 0.0804893 .  
## lo(Wind)                     2.8 9.2752 2.617e-05 ***
## lo(Temp)                     2.5 6.8089 0.0006584 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,3))
plot(aq.gam, se=TRUE)

Splines

Spline smoothing can be conceptualized by imagining that your task is to bend a strip of soft metal into a curved shape. One way to do this would be to place pegs on a board (referred to as “knots” in non-linear regression parlance) to control the bends, and then guide the strip of metal over and under the pegs. Mathematically, this is accomplished by combining cubic regression at each knot with calculus to smoothly join the individual bends. The tuning parameter in the smooth.splines function is spar.

aq = aq %>% drop_na()

ss25 = smooth.spline(aq$Solar.R,aq$Ozone,spar=0.25)
ss50 = smooth.spline(aq$Solar.R,aq$Ozone,spar=0.5)
ss75 = smooth.spline(aq$Solar.R,aq$Ozone,spar=0.75)

ggplot() +
  geom_point(data=aq, aes(Solar.R, Ozone)) +
  geom_line(aes(x=ss25$x, ss25$y, linetype='spar = 0.25'), color='red') +
  geom_line(aes(x=ss50$x, ss50$y, linetype='spar = 0.50'), color='red') +
  geom_line(aes(x=ss75$x, ss75$y, linetype='spar = 0.75'), color='red') +
  scale_linetype_manual(name="Legend", values=c(1,2,3)) +
  ggtitle("Spline Smoother Example") +
  theme_bw()

Cross Validation

Comparing the spline smoother plot to the one generated with loess smoothers, we can see that the two methods essentially accomplish the same thing. It’s just a matter of finding the right amount of smoothness, which can be done through cross validation. The fANCOVA package contains a function loess.aq() that includes a criterion parameter that we can set to gcv for generalized cross validation, which is an approximation for leave-one-out cross-validation (Hastie1). Applying this function to the airquality data with Solar.R as the predictor and Ozone as the response, we can obtain a cross validated value for span.

library(fANCOVA)

aq.solar.cv = loess.as(aq$Solar.R, aq$Ozone, criterion="gcv")
summary(aq.solar.cv)
## Call:
## loess(formula = y ~ x, data = data.bind, span = span1, degree = degree,
##     family = family)
##
## Number of Observations: 111
## Equivalent Number of Parameters: 3.09
## Residual Standard Error: 29.42
## Trace of smoother matrix: 3.56  (exact)
##
## Control settings:
##   span     :  0.6991628
##   degree   :  1
##   family   :  gaussian
##   surface  :  interpolate	  cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

loess.as also includes a plot method so we can visualize the loess smoother.

loess.as(aq$Solar.R, aq$Ozone, criterion="gcv", plot=TRUE)

## Call:
## loess(formula = y ~ x, data = data.bind, span = span1, degree = degree,
##     family = family)
##
## Number of Observations: 111
## Equivalent Number of Parameters: 3.09
## Residual Standard Error: 29.42

Cross validation is also built in to smooth.spline() and is set to generalized cross validation by default. Instead of specifying spar in the call to smooth.spline(), we just leave it out to invoke cross validation.

aq.spl = smooth.spline(aq$Solar.R, aq$Ozone)
aq.spl
## Call:
## smooth.spline(x = aq$Solar.R, y = aq$Ozone)
##
## Smoothing Parameter  spar= 0.9837718  lambda= 0.01867197 (12 iterations)
## Equivalent Degrees of Freedom (Df): 4.060081
## Penalized Criterion (RSS): 66257.74
## GCV: 892.29

Plotting the cross validated spline smoother, we get a line that looks very similar to the lasso smoother.

ggplot() +
  geom_point(data=aq, aes(Solar.R, Ozone)) +
  geom_line(aes(x=aq.spl$x, aq.spl$y), color='red') +
  ggtitle("CV Spline Smoother") +
  theme_bw()

  1. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2008. “The Elements of Statistical Learning.” Springer. 

Comments