Generalized Additive Models
Recall that if there is a non-linear relationship between predictor and response, we can attempt to transform the predictor using a known function (log, reciprocal, polynomial, etc.) to improve the model structure and fit. What if the relationship is more complex and is not well captured with a known function? Generalized additive models may be used in these cases.
Recall that a linear model takes the form:
\[y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+...+\varepsilon\]Additive models replace the linear terms (the $\beta$s) with flexible smoothing functions and take the form:
\[y=\beta_{0}+f_{1}(x_{1})+f_{2}(x_{2})+...+\varepsilon\]There are many techniques and options for selecting the smoothing functions, but for this tutorial, we’ll discuss two: locally weighted error sum of squares (lowess and also commonly abbreviated as loess) and smoothing splines.
Loess
For the theory behind loess smoothing, please read this page on the NIST website. This chapter will focus on implementing loess smoothing in R.
All smoothers have a tuning parameter that controls how smooth the smoother is. The tuning parameter in loess is referred to as the span with larger values producing more smoothness.
library(tidyverse)
set.seed(42)
df = tibble(
x = runif(100, 1.5, 5.5),
y = sin(x*pi) + 2 + runif(100, 0, 0.5)
)
ex1.ls = loess(y~x, span=0.25, data=df)
ex2.ls = loess(y~x, span=0.5, data=df)
ex3.ls = loess(y~x, span=0.75, data=df)
xseq = seq(1.6, 5.4,length=100)
df = df %>% mutate(
span25 = predict(ex1.ls, newdata=tibble(x=xseq)),
span50 = predict(ex2.ls, newdata=tibble(x=xseq)),
span75 = predict(ex3.ls, newdata=tibble(x=xseq))
)
ggplot(df) +
geom_point(aes(x, y)) +
geom_line(aes(x=xseq, span25, linetype='span = 0.25'), color='red') +
geom_line(aes(x=xseq, span50, linetype='span = 0.50'), color='red') +
geom_line(aes(x=xseq, span75, linetype='span = 0.75'), color='red') +
scale_linetype_manual(name="Legend", values=c(1,2,3)) +
ggtitle("Loess Smoother Example") +
theme_bw()
From this plot, a span of 0.75 provided too much smoothness, whereas the lower values of span we tested appear to be a better fit. Now let’s apply this to the airqaulity
data set from the previous chapter. Initially, we’ll just consider the response(Ozone
) and one predictor (Solar.R
).
aq1.ls = loess(Ozone ~ Solar.R, span=0.25, data=airquality)
aq2.ls = loess(Ozone ~ Solar.R, span=0.5, data=airquality)
aq3.ls = loess(Ozone ~ Solar.R, span=0.75, data=airquality)
srseq = seq(10, 330, length=nrow(airquality))
aq = airquality %>% mutate(
span25 = predict(aq1.ls, newdata=tibble(Solar.R=srseq)),
span50 = predict(aq2.ls, newdata=tibble(Solar.R=srseq)),
span75 = predict(aq3.ls, newdata=tibble(Solar.R=srseq))
)
ggplot(aq) +
geom_point(aes(Solar.R, Ozone)) +
geom_line(aes(x=srseq, span25, linetype='span = 0.25'), color='red') +
geom_line(aes(x=srseq, span50, linetype='span = 0.50'), color='red') +
geom_line(aes(x=srseq, span75, linetype='span = 0.75'), color='red') +
scale_linetype_manual(name="Legend", values=c(1,2,3)) +
ggtitle("Loess Smoother Example") +
theme_bw()
Here we can see that the higher span values appear to provide a better fit. In this case, choosing a low span value would be akin to over fitting a linear model with too high of a degree of polynomial. We can repeat this process to determine appropriate values of span for the other predictors.
Including loess smoothers in a GAM is as simple as including the non-linear terms within lo()
. The gam
package provides the needed functionality. The script below applies loess smoothers to three of the predictors and displays the model summary (note that the default value for span is 0.5).
library(gam)
aq.gam = gam(Ozone ~ lo(Solar.R, span=0.75) + lo(Wind) + lo(Temp), data=airquality, na=na.gam.replace)
summary(aq.gam)
##
## Call: gam(formula = Ozone ~ lo(Solar.R, span = 0.75) + lo(Wind) + lo(Temp),
## data = airquality, na.action = na.gam.replace)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -47.076 -9.601 -2.721 8.977 76.583
##
## (Dispersion Parameter for gaussian family taken to be 319.3603)
##
## Null Deviance: 125143.1 on 115 degrees of freedom
## Residual Deviance: 33679.85 on 105.4604 degrees of freedom
## AIC: 1010.117
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## lo(Solar.R, span = 0.75) 1.00 14248 14248 44.615 1.160e-09 ***
## lo(Wind) 1.00 35734 35734 111.894 < 2.2e-16 ***
## lo(Temp) 1.00 15042 15042 47.099 4.794e-10 ***
## Residuals 105.46 33680 319
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## lo(Solar.R, span = 0.75) 1.2 2.9766 0.0804893 .
## lo(Wind) 2.8 9.2752 2.617e-05 ***
## lo(Temp) 2.5 6.8089 0.0006584 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,3))
plot(aq.gam, se=TRUE)
Splines
Spline smoothing can be conceptualized by imagining that your task is to bend a strip of soft metal into a curved shape. One way to do this would be to place pegs on a board (referred to as “knots” in non-linear regression parlance) to control the bends, and then guide the strip of metal over and under the pegs. Mathematically, this is accomplished by combining cubic regression at each knot with calculus to smoothly join the individual bends. The tuning parameter in the smooth.splines
function is spar
.
aq = aq %>% drop_na()
ss25 = smooth.spline(aq$Solar.R,aq$Ozone,spar=0.25)
ss50 = smooth.spline(aq$Solar.R,aq$Ozone,spar=0.5)
ss75 = smooth.spline(aq$Solar.R,aq$Ozone,spar=0.75)
ggplot() +
geom_point(data=aq, aes(Solar.R, Ozone)) +
geom_line(aes(x=ss25$x, ss25$y, linetype='spar = 0.25'), color='red') +
geom_line(aes(x=ss50$x, ss50$y, linetype='spar = 0.50'), color='red') +
geom_line(aes(x=ss75$x, ss75$y, linetype='spar = 0.75'), color='red') +
scale_linetype_manual(name="Legend", values=c(1,2,3)) +
ggtitle("Spline Smoother Example") +
theme_bw()
Cross Validation
Comparing the spline smoother plot to the one generated with loess smoothers, we can see that the two methods essentially accomplish the same thing. It’s just a matter of finding the right amount of smoothness, which can be done through cross validation. The fANCOVA
package contains a function loess.aq()
that includes a criterion parameter that we can set to gcv
for generalized cross validation, which is an approximation for leave-one-out cross-validation (Hastie1). Applying this function to the airquality
data with Solar.R
as the predictor and Ozone
as the response, we can obtain a cross validated value for span.
library(fANCOVA)
aq.solar.cv = loess.as(aq$Solar.R, aq$Ozone, criterion="gcv")
summary(aq.solar.cv)
## Call:
## loess(formula = y ~ x, data = data.bind, span = span1, degree = degree,
## family = family)
##
## Number of Observations: 111
## Equivalent Number of Parameters: 3.09
## Residual Standard Error: 29.42
## Trace of smoother matrix: 3.56 (exact)
##
## Control settings:
## span : 0.6991628
## degree : 1
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
loess.as
also includes a plot method so we can visualize the loess smoother.
loess.as(aq$Solar.R, aq$Ozone, criterion="gcv", plot=TRUE)
## Call:
## loess(formula = y ~ x, data = data.bind, span = span1, degree = degree,
## family = family)
##
## Number of Observations: 111
## Equivalent Number of Parameters: 3.09
## Residual Standard Error: 29.42
Cross validation is also built in to smooth.spline()
and is set to generalized cross validation by default. Instead of specifying spar
in the call to smooth.spline()
, we just leave it out to invoke cross validation.
aq.spl = smooth.spline(aq$Solar.R, aq$Ozone)
aq.spl
## Call:
## smooth.spline(x = aq$Solar.R, y = aq$Ozone)
##
## Smoothing Parameter spar= 0.9837718 lambda= 0.01867197 (12 iterations)
## Equivalent Degrees of Freedom (Df): 4.060081
## Penalized Criterion (RSS): 66257.74
## GCV: 892.29
Plotting the cross validated spline smoother, we get a line that looks very similar to the lasso smoother.
ggplot() +
geom_point(data=aq, aes(Solar.R, Ozone)) +
geom_line(aes(x=aq.spl$x, aq.spl$y), color='red') +
ggtitle("CV Spline Smoother") +
theme_bw()
-
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2008. “The Elements of Statistical Learning.” Springer. ↩
Comments