R scripts for the lecture course
Machine Learning, pattern recognition and statistical data modelling
Coryn A.L. Bailer-Jones, 2007

Kernel methods and additive models
----------------------------------


***** Kernel smooters

# After Bishop ch. 9
x <- seq(0,1,0.05)
h <- function(x){0.5 + 0.4*sin(2*pi*x)}             # define a function
set.seed(10)
h_noisy <- h(x) + rnorm(n=x, sd=0.05)
plot(x, h_noisy, pch=16)
x_temp <- seq(0,1,0.01)
lines(x_temp, h(x_temp), lty=2, lwd=2)

# Predict using ksmooth
library(stats)
ksmooth(x, h_noisy, x.points=x, bandwidth=0.2, kernel='normal')
lines(ksmooth(x, h_noisy, x.points=x, bandwidth=0.2, kernel='normal'), col='red', lwd=2)
# Repeat with bandwith values:
# 0.5  too smooth
# 0.1  okay 
# 0.05 overfit

# Predict using locpoly
# With locpoly we cannot define a prediction grid (only number of points spread uniformly)
library(KernSmooth)
# replot data and true function
plot(x, h_noisy, pch=16)
lines(x_temp, h(x_temp), lty=2, lwd=2)
locpoly(x, h_noisy, bandwidth=0.05)
lines(locpoly(x, h_noisy, bandwidth=0.05), col='blue', lwd=2)
# Repeat with bandwith values:
# 0.15 too smooth
# 0.02 overfit

Note that this application of locpoly is using a linear fit over the bandwidth
region about each point, yet appears to produce a nonlinear
function. Actually, it doesn't produce a function at all: it is just doing
point estimations at 401 points equally spread over the x-axis (this value is
fixed by the parameter gridsize). That is, at each of these points it does a
linear fit over the data lying within the box of size +/- bandwidth/2 and this
is the predicted y-value.


***** Generalized additive models


Relevant R functions are gam{mgcv} and gam{gam}. Latter is a clone of the S-PLUS function and is the one I use below.

# Example from section 8.8 of MASS4
library(gam)
rock.lm <- lm(log(perm) ~ area + peri + shape, data = rock)
summary(rock.lm)
# The s() function indicates that  cubic smoothing spline is used
# The alternative is lo() which indicates loess
# Do ?s and ?lo to get details
(rock.gam <- gam(log(perm) ~ s(area) + s(peri) + s(shape), data=rock))
#summary(rock.gam)
#anova(rock.lm, rock.gam)
# gam provides a plot method for objects of class gam
?plot.gam
# se is used to plot 2sigma error bars
par(mfrow = c(2, 3), pty = "s", cex.lab=1.5, cex.axis=1.5)
plot(rock.gam, se = TRUE, lw=2)
# Note what we see here: rug plot at bottom, 2sigma error bars. Mean of each function
# (in y) is zero, just as we set the GAM model up to be (mean is taken care of by alpha)
# Looks like a linear dependence on area and perimeter would be fine. Let's test this
rock.gam1 <- gam(log(perm) ~ area + peri + s(shape), data = rock)
plot(rock.gam1, se = TRUE, lw=2)
# Question: why do the error bars go to zero at one point for each of the linear terms?
# Answer: in the set-up of the GAM model we set the mean of each GAM function (in y) to 
# be zero. As the linear function is forced to go through this point (unlike the smoothers)
# the estimated errors are zero.

# Can plot again but now also showing the residuals
plot(rock.gam, se = TRUE, lw=2, residuals=TRUE)
# kill device window so that we plot panels together after this

We can compare the variance between the linear model and the full additive model using anova (for interpretation see the section on F-test below)

# Look at effect of varying smoothing parameters. Default is df=4
# Raising df reduces smoothing
(rock.gam2 <- gam(log(perm) ~ s(area, df=8) + s(peri) + s(shape), data=rock))
# We see that this has used four more d.o.f
rock.gam
par(mfrow = c(2, 3), pty = "s", cex.lab=1.5, cex.axis=1.5)
plot(rock.gam2, se = TRUE, lw=2)
plot(rock.gam, se = TRUE, lw=2)

# Now try with a loess() smoother
# default has span=0.5 and degree=1
(rock.gam3 <- gam(log(perm) ~ lo(area) + lo(peri) + lo(shape), data=rock))
plot(rock.gam3, se = TRUE, lw=2, main='loess() smoother')
plot(rock.gam, se = TRUE, lw=2, main='spline smoother')
# Again can play with parameters, e.g. lo(area, span=0.2) or lo(peri, degree=2)