GAM Plot with 95% Confidence Shade

GAM plot with conficence "shade" and customized rugs
The lightblue shade denoting the 95% pointwise confidence limits of the GAM estimate is a polygon() object in R.

Usually one would plot the GAM model with the default termplot() function and specifiy se=T to get the confidence limits as lines. I showed in a recent post how to plot the fitted GAM smooth and the confidence limits manually.

The same approach can be used to have the confidence limits as a shade. In order to achieve this:

  1. Fit the model
  2. library(mgcv)
    model <- gam( MyResponse ~ MyCovariate , data = MyDataset)

  3. get the estimated values of the fitted smoothing spline
  4. fit <- predict( model , se = TRUE )$fit

  5. and pointwise standard errors
  6. se <- predict( model , se = TRUE)$se.fit

  7. calculate the values of the upper and lower 95%-confidence limits
  8. lcl <- fit - 1.96 * se
    ucl <- fit + 1.96 * se

  9. plot an empty coordinate system with right scaling, labels , titles and so on. I prefer to include axes = FALSE and add the axes manually. See here for an example.
  10. plot( 0 , type = "n" , bty = "n" , xlab = "The x-axis lable" , ylab = "The y-axis lable" , main = "The Title" , xlim = c(0,5) , ylim = c(0,3) )

  11. No it gets a bit tricky with sorting the coordinates in the right order. R provides a very effective way to achieve this, though it is not easily understandable at once. We create two indices on the dataset i.for and i.back which number the dataset according to the independend variable on the x-axis – forward and backward respectively:
  12. i.for <- order( MyDataset$MyCovariate )
    i.back <- order( MyDataset$MyCovariate , decreasing = TRUE )

  13. The points of the shade polygon follow here first the upper confidence line (ucl) forward and then the lower confidence line backward. The coordinates of the shade-polygon are
  14. x.polygon <- c( MyDataset$MyCovariate[i.for] , MyDataset$MyCovariate[i.back] )
    y.polygon <- c( ucl[i.for] , lcl[i.back] )

  15. First plot the polygon with the confidence shade – otherwise the GAM plot vanishes behind it.
  16. polygon( x.polygon , y.polygon , col = "#A6CEE3" , border = NA )

  17. now the mainplot afterwards
  18. lines( MyDataset$MyCovariate[i.for] , fit[i.for], col = "#1F78B4" , lwd = 3 )

  19. add a horizontal line marking the intercept ( the mean of the covariate) – note: univariate only. I will post a remark on this later…
  20. abline( h = mean(MyDataset$MyCovariate) , lty = 2 )

      Hope no typos. Comment if you find some.

      Advertisement

Fancy Rugs in Regression Plots

Additive Regression Model with Customized Rug
Additive Regression Model with Customized Rug

Rugplots along the axes show the distribution of the underlying data in regression model plots. This is particulary useful in connection with additive (nonparametric) models where the plotted smooth function is the exclusive representation of the model in order to assess how much data contributed to the model fit at the different values of the exlanatory variable.

The custom plot.gam() function includes the possibility of such rugs and pointwise conficence intervalls by default.

Adding quartiles to the rugs requires some customization, though. I included the complete code to produce the above plot underneath.

The example for this GAM is borrowed from the excellent book of Alain Zuur et. al. Mixed Effects Models and Extensions in Ecology with R p.55ff. The ISIT data to run the code above is included in the R package AED which can be downloaded from the books website.

Note: the package AED is needed for the example dataset only. It is NOT necessary to use the example code on ones own dataset.

The main points concerning the rugs and quantile lables on the x-axis are:

  1. Plot the coordinate system without lables
  2. plot( ... , axes = FALSE )

  3. Plot the x-axis with x-lables
  4. axis(side = 1 , line = 0.3 , at = 0:5*1000 , tick = TRUE)

  5. Plot rugs – the jitter() is necessary since a lot of datapoints sit on the same values of SampleDepth – so shake them a bit.
  6. axis(side = 1 , line = -0.9 , at = jitter(ISIT$SampleDepth) , labels = F , tick = T , tcl = 0.8 , lwd.ticks = 0.1 , lwd = 0)

  7. Print the lables “1Q”, “Median” and “3Q” or whatever you like to call them on the right position. The line and padj parameter set the position of the text and cex.axis the textsize.
  8. axis(side = 1 , line = -0.8 , at = fivenum(ISIT$SampleDepth)[2:4], lwd = 0 , tick = F , labels = c("1Q","median","3Q"), cex.axis = 0.7, col.axis = "black" , padj = -2.8)

  9. Plot thick tickmarks crossing through the rug cloud at 1Q, median and 3Q
  10. axis(side = 1 , line = -0.8 , at = fivenum(ISIT$SampleDepth)[2:4], lwd = 0 , tick = T, tcl = 1.1 , lwd.ticks = 1 , col.ticks = "black", labels = FALSE)

  11. and finally short thick tickmarks under the text touching the x-axis
  12. axis(side = 1 , line = 0.3, at = fivenum(ISIT$SampleDepth)[2:4], lwd = 0 , tick = T, tcl = 0.2 , lwd.ticks = 1 , col.ticks = "black", labels = FALSE)

Here goes the complete code:
library(mgcv)
library(AED)
data(ISIT)
#
# Fit a univariate GAM model
model <- gam(Sources ~ s(SampleDepth) , data = ISIT)
fit <- predict(model, se = T)$fit
se <- predict(model, se = T)$se.fit
lcl <- fit - 1.96 * se
ucl <- fit + 1.96 * se
#
# open a jpeg
jpeg("FancyRugs.jpg" , width=400, height=400)
#
# set plotting options: 1 plot per page, horizontal labels and textsize
par(mfrow = c(1,1) , las = 1 , cex = 1)
#
# plot coordinatesystem and labels
plot(0 , bty = "n" , type = "n" , xlim = c(0,5000) , ylim = c(-10,50) , xlab = "Depth (m)" , ylab = expression(paste("Number of sources (" , m^-3 , ")")) , axes = FALSE)
#
title(main="Association between number of sources of\nbioluminescent organisms and ocean depth" , cex.main = 0.8)
#
## _____ X-AXIS ______
# x-axis values
axis(side = 1 , line = 0.3 , at = 0:5*1000 , tick = TRUE)
#
# rugs at datapoints
axis(side = 1 , line = -0.9 , at = jitter(ISIT$SampleDepth) , labels = F , tick = T , tcl = 0.8 , lwd.ticks = 0.1 , lwd = 0)
#
# labels at 1Q, median and 3Q
axis(side = 1 , line = -0.8 , at = fivenum(ISIT$SampleDepth)[2:4], lwd = 0 , tick = F , labels = c("1Q","median","3Q"), cex.axis = 0.7, col.axis = "black" , padj = -2.8)
#
# tick marks at 1Q, median and 3Q
axis(side = 1 , line = 0.3, at = fivenum(ISIT$SampleDepth)[2:4], lwd = 0 , tick = T, tcl = 0.2 , lwd.ticks = 1 , col.ticks = "black", labels = FALSE)
#
axis(side = 1 , line = -0.8 , at = fivenum(ISIT$SampleDepth)[2:4], lwd = 0 , tick = T, tcl = 1.1 , lwd.ticks = 1 , col.ticks = "black", labels = FALSE)
#
## _____ Y-AXIS ______
# y-axis values
axis(side = 2 , at = 0:5*10)
#
# rugs at datapoints
axis(side = 2 , line = -0.9 , at = jitter(ISIT$Sources) , labels = F , tick = T , tcl = 0.8 , lwd.ticks = 0.1 , lwd = 0)
#
# labels at 1Q, median and 3Q
axis(side = 2 , line = -0.7 , at = fivenum(ISIT$Sources)[2:4], lwd = 0 , tick = F , labels = c("1Q","median","3Q"), cex.axis = 0.7, col.axis = "black")
#
# thicker tick marks at 1Q, median and 3Q
axis(side = 2 , line = 0.3, at = fivenum(ISIT$Sources)[2:4], lwd = 0 , tick = T, tcl = 0.3 , lwd.ticks = 1 , col.ticks = "black", labels = FALSE , padj = -2)
axis(side = 2 , line = -0.7 , at = fivenum(ISIT$Sources)[2:4], lwd = 0 , tick = T, tcl = 1.1 , lwd.ticks = 1 , col.ticks = "black" , labels = FALSE)
#
# horizontal line marking the intercept = mean(Sources) (for univariate model only)
abline(h=mean(ISIT$Sources), lty=3)
#
# Scatterplot
lines(ISIT$SampleDepth , ISIT$Source , type = "p" , cex = 0.4 , lwd = 0.2 , col = "grey")
#
# plot main figure
lines(ISIT$SampleDepth[order(ISIT$SampleDepth)] , fit[order(ISIT$SampleDepth)] , col = "black" , lwd = 2)
#
# plot lower confidence limit (lcl)
lines(ISIT$SampleDepth[order(ISIT$SampleDepth)] , lcl[order(ISIT$SampleDepth)] , col = "grey" , lwd = 1)
#
# plot upper confidence limit (ucl)
lines(ISIT$SampleDepth[order(ISIT$SampleDepth)] , ucl[order(ISIT$SampleDepth)] , col = "grey" , lwd = 1)
#
# closing the jpg file
dev.off()

Installing RODBC with Ubuntu/Debian amd64

Trying to install RODBC in Ubuntu with
install.packages("RODBC")
failed throwing an error message
ODBC headers sql.h and sqlext.h not found

A glance at the r-help showed that it had to do with something called unixODBC – an ODBC driver manager.

The package was installed, but not the development package and thus not the headers which R complained about.

Again a not-so-obvious-for-the-newbee-Linux-Unix-Shell-goblish thing. The fix is
sudo aptitude install unixodbc-dev

Open Access .mdb Files with RODBC

Getting data into R can be done by reading colon separated files (.csv) via the read.table() function. It is also possible to access databases directly and send SQL queries directly from R to the database. This has some advantages: Using Sweave the queries get documented in the analysis report, variable formats are retained.

To install the RODBC package:
install.packages("RODBC")

Open a database connection to an Microsoft Access database file, e.g. “MyDataBase.mdb” sitting in the Folder “C:\ MyPath\MyDataBase.mdb”:
channel <- odbcConnectAccess("C:/MyPath/MyDataBase")
note that the Windows backslashes “\” become slashes “/” in R and the extension “.mdb” is omitted.

Getting the database table “MyTable” into the R dataframe “R.Table”
R.Table <- sqlQuery( channel , paste ("select * from MyTable"))
MyTable can also be a sql query in the Access database.

Correlation matrices

Correlation matrices are a great way to visualize covariation in large datasets. Thanks to Sarkar (2008) and Friendly (2002), we can borrow these examples. First we need to load a few packages:

library(lattice)
library(ellipse)

We then go on to importing our data:

MyData <- read.table("/FILE/LOCATION/FILE.txt", header=TRUE,sep='\t',quote='')

Then on for some tedious work. We’ll have to manually type in the name (as it named i the dataset!) of each of the variables we’re interested in (unless off course you’ll want to analyze the whole shebang – which often isn’t desirable).

var <- c("var1","var2","var3")

We then proceed to make a correlation matrix with the selected variables vector “var”:

cor.MyData <- cor(MyData[,var],use="pair")

And here’s the money shot; setting the order of the variables in the matrix based on the result of a cluster analysis. This is for easier visual interpretation.

ord <- order.dendrogram(as.dendrogram(hclust(dist(cor.MyData))))

We then can make our Corrgrams. First an easy one:

print(levelplot(cor.MyData[ord,ord],xlab=NULL,ylab=NULL,
at=do.breaks(c(-1.01,1.01),101),scales=list(x=list(rot=90)),
colorkey=list(space="top"),
col.regions=colorRampPalette(c("red","white","blue"))))

Simple correlation matrix

Then on to the more elaborate ones, first a informative one using the ellipse package. We’ll have to write our own panel function here (again, thanks to Sarkar 2008), but this is completly generic, so cut ‘n’ paste should suffice.

panel.corrgram<-function(x,y,z,subscripts,at,
level=0.9,label=FALSE,...
{require('ellipse',quietly=TRUE)
x<-as.numeric(x)[subscripts]
y<-as.numeric(y)[subscripts]
z<-as.numeric(z)[subscripts]
zcol<-level.colors(z,at=at,...)
for(i in seq(along=z)){
ell<-ellipse(z[i],level=level,npoints=50,
scale=c(.2,.2),centre=c(x[i],y[i]))
panel.polygon(ell,col=zcol[i],border=zcol[i],...)
}
if (label)
panel.text(x=x,y=y,lab=100*round(z,2),
cex=0.8,col=ifelse(z<0,'white','black'))
}

To create a *.pdf of your output we’ll run following code (remember your working directory!):

pdf('corrgram_ellipse.pdf',10,10)
levelplot(cor.MyData[ord,ord],
at=do.breaks(c(-1.01,1.01),20),xlab=NULL,
ylab=NULL,colorkey=list(space='top'),
scales=list(x=list(rot=90)),
panel=panel.corrgram,label=TRUE)
dev.off()

Correlation matrix - ellipse

However, if you’re more in the pacman thing, this next one might be a better choice. We’ll start by writing a new panel function:

panel.corrgram.2<-function(x,y,z,
subscripts,at=pretty(z),scale=0.8,...)
{
require('grid',quietly=TRUE)
x<-as.numeric(x)[subscripts]
y<-as.numeric(y)[subscripts]
z<-as.numeric(z)[subscripts]
zcol<-level.colors(z,at=at,...)
for(i in seq(along=z))
{
lims<-range(0,z[i])
tval<-2*base::pi*
seq(from=lims[1],to=lims[2],by=0.01)
grid.polygon(x=x[i]+.5*scale*c(0,sin(tval)),
y=y[i]+.5*scale*c(0,cos(tval)),
default.units='native',
gp=gpar(fill=zcol[i]))
grid.circle(x=x[i],y=y[i],r=.5*scale,
default.units='native')
}
}

Which we’ll export:

pdf('corrgram_pacman.pdf',10,10)
levelplot(cor.MyData[ord,ord],
at=do.breaks(c(-1.01,1.01),101),xlab=NULL,
ylab=NULL,colorkey=list(space='top'),
scales=list(x=list(rot=90)),
panel=panel.corrgram.2,
col.regions=colorRampPalette(c('red','white','blue')))
dev.off()

Correlation matrix - pacman

Of course, changing the output size (and type) is easy. Changing pdf –> png (or the format of your choosing). Size were in these examples 10 by 10 inches.

Installing and Updating R

These are the recommendations from CRAN on how to install and update R. I have streamlined it for my situtation.

To obtain the latest R packages, add an entry like
deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu hardy/
in your /etc/apt/sources.list file.

In my case (using Ubuntu Hardy 8.04 LTS in Bergen, Norway) this is
deb http://cran.ii.uib.no/bin/linux/ubuntu hardy/

See http://cran.r-project.org/mirrors.html
for the list of CRAN mirrors.

To install the complete R system, use
sudo aptitude update
sudo aptitude install r-recommended ess

ESS is a package to use R via the emacs editor. It is good! As an added convenience to Ubuntu users it is provided as an up-to-date version in the cran repositories.

SECURE APT

The Ubuntu archives on CRAN are signed with the key of “Vincent Goulet
<vincent.goulet@act.ulaval.ca>” with key ID E2A11821. You can fetch
this key with

gpg --keyserver subkeys.pgp.net --recv-key E2A11821

and then feed it to apt-key with
gpg -a --export E2A11821 | sudo apt-key add -

ADMINISTRATION AND MAINTENANCE OF R PACKAGES

The R packages part of the Ubuntu r-base and r-recommended packages
are installed into the directory /usr/lib/R/library. These can be
updated using aptitude with

sudo aptitude update
sudo aptitude upgrade

The other r-cran-* packages shipped with Ubuntu are installed into the
directory /usr/lib/R/site-library.

Installing R packages not provided with Ubuntu first requires tools to
compile the packages from source. These tools are installed via the R
development package with
sudo aptitude install r-base-dev

Then a site administrator can install R packages into the directory
/usr/local/lib/R/site-library by running R as root and using the install.packages() function:
sudo R
install.packages()

A routine update can then be undertaken from R using
update.packages(lib.loc = "/usr/local/lib/R/site-library")

Create LaTex tables from R output

Creating reasonable layouted LaTeX tables from R output was easier then expected. I should have googled it long ago…

install.packages("xtable")

Lets say you created a tabular output in R called “tab1”, e.g by doing:
data(CO2)
tab1 <- with(CO2, table(Treatment , Type))
tab1

your text output in R would look like

Type
Treatment Quebec Mississippi
nonchilled 21 21
chilled 21 21

Now you would like this or whatever table or data.frame as a nice LaTeX-table, the only thing to do is:
library(xtable)
xtable(tab1)

and the output will be:

% latex table generated in R 2.9.0 by xtable 1.5-5 package
% Wed Jul 08 16:20:54 2009
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrr}
\hline
& Quebec & Mississippi \\
\hline
nonchilled & 21 & 21 \\
chilled & 21 & 21 \\
\hline
\end{tabular}
\end{center}
\end{table}

If you are using Sweave the usage becomes
<< echo = FALSE , results = tex >>
library(xtable)
data(CO2)
with(CO2, xtable(table(Treatment , Type)))
@

and a the result of the R run is a LaTeX document. Another post will give a hint how to paste R graphics into the same Sweave or LaTeX document… later…

The full usage of xtable() is
xtable(x, caption=NULL, label=NULL, align=NULL, digits=NULL, display=NULL)
.. add table captions
xtable(table , caption = "My table caption")
and labels
xtable(table , label = " MyLaTeXlable")
to the LaTeX tables.

Install R packages without internet connection

You might have no internet connection or as in my case you have one, but the firewall prevents anything beside Windows IE to access the net. Grrr.

Ok. It takes more time, but is possible. Lets have an example:
Installing the package ‘gam’ for R-2.9 on Windows being in Norway (with cran.ii.uib.no as the nearest and fastest “Comprehensive R Archive Network”.

  1. Find another place with internet connection or as in my case, the browser works, so use the browser.
  2. Find the package on the repository corresponding to your operating system and your version of R in our above example here
  3. klick on the link gam_1.0.zip, download it and save it (e.g. using an USB disk). Remember the path to the file (in our example in the folder H:/DATA/).
  4. Open R and write install.packages("H:/DATA/gam_1.0.zip", repos=NULL) at the command line.

Thats it.