Update:I have written a much more detailed static page about the additive COX model: http://rforge.org/plothr/
The page has a download link to the function plotHR() which does all the fuzz. It is extensively commented. It should be easy to understand the syntax and modify it for individual purposes.
Therneau et al. refer to the proportional hazards model or COX-regression model as “the workhorse of regression analysis for censored data”. They show how to implement the additive form of this model in SAS and S-pluss; already mentioned by Hastie and Tibshirany in 1986 when introducing Generalized Additive Models (GAM).
I found modelling the functional form of the covariates in a regression model for rightcensored survival times with smoothing splines extremely useful. And the implementation is absolutely straightforward in R.
The only thing needed is the installation of the R-libraries “survival” and “pspline”:
In the following code I will refer to a dataset “MyData” with a binary status variable “death” and a time-to-event variable “days2death”.
The status variable “death” should be (not necessarily) 1 if the event of interesst occured to the subject and “days2death” gives then the time to this event.
Viualizing the functional form of a covariate takes the following steps:
- create the survival object of interesst
- fit a proportional hazards model with smoothing splines,
- predict the functional form of the covariate of interesst and
- plot it!
Note that there is the termplot() function in R which gives you the GAM plots after the modelfit, so step 3 would not be necessary – BUT: it has a bug and fails plotting a single covariate; and it does not allow all to much customizing.
This is the R code to achieve the analysis:
1 Create survival object:
surv.death <- Surv(MyData$days2death, MyData$death)
2 Fit proportional hazards model with smoothing splines for continuous covariates:
pham.fit <- coxph( surv.death ~ pspline(EF, df=4) + pspline(Age, df=4) + strata (Sex, df=4) , data = MyData)
The model above includes the continuous covariates “EF” (ejection fraction) and “Age” and stratifies for “Sex”.
3 Produce the fitted smoothing spline for the first covariate in the above model formula with standard errors
predicted <- predict(pham.fit , type = "terms" , se.fit = TRUE , terms = 1)
“terms=1” refers to “pspline(EF,df=4)”
4 Plot it
First plotting axes and labels
plot(0 , xlab="Ejection Fraction" , ylab = "Hazard Ratio" , main = "All-cause Death" , type = "n" , xlim=c(0,100) , ylim=c(0,3))
the range of values on the x-axis (“xlim=c(0,100)”) is chosen manually for this specific covariate; of course it is possible to use something like
ylim = c( 0 , max(MyData$EF) ).
Now plot the fitted smoothing spline using the lines() function:
lines( sm.spline(MyData$EF , exp(predicted$fit)) , col = "red" , lwd = 0.8)
Note that the term prediction gives log-hazard-ratios; therefore exp(predicted$fit) is plotted against the values of the covariate. The sm.spline() function is necessary since the points of the plot appear in random order and density, according to the underlying dataset; a plain lines() function would produce just a chaotic pattern. Alternative:
plot(MyData$EF , exp(predicted$fit) , col = "red" , cex = 0.2)
produces a scattered plot that reflects the distribution of the underlying data – I do prefer adding a rug-plot on the bottom of the graph to illustrate this (see under).
… upper and lower confidence limits with dashed thinner lines
lines(sm.spline(MyData$EF , exp(predicted$fit + 1.96 * predicted$se)) , col = "orange" , lty = 2 , lwd = 0.4)
lines(sm.spline(MyData$EF , exp(predicted$fit - 1.96 * predicted$se)) , col = "orange" , lty = 2 , lwd = 0.4)
… a tiny horizontal line at hazard level 1, do see where the confidence limits cross:
abline( h = 1 , col = "lightgrey" , lty = 2 , lwd = 0.4)
… tiny tickmarks on the x-axes to reflect the distribution of the underlying data:
axis( side = 1 , at = MyData$EF, labels = F , tick = T , tcl = 0.4 , lwd.ticks = 0.1)
… and some fancy red tickmarks to mark minimum, lower hinge, median, upper hinge and maximum of the covariate in the dataset:
axis( side = 1 , at = fivenum(MyData$EF), labels = F , tick = T , tcl = -0.2 , lwd.ticks = 1 , col.ticks = "red")
4b) The easy way (works ONLY with MORE then 1 continous covariate) – predicting the terms can be omitted:
termplot(pham.fit, se=T, rug=T)
Resulting in …
13 thoughts on “Additive COX-regression”
But I have a question. How to handle situation in which the two arguments of lines(sm.spline(MYDATA$x,predict$fit) have different dimensions, due to missing values of some covariates on the coxph? The dimension of predicted$fit is less than MYDATA$x.
Yes, thats often the case. The coxph() function simply excludes missing data. The brute force solution is to exclude them in the MYDATA$x values:
Say you have a model
coxph( survival ~ pspline(cov1) + pspline (cov2) , data = MYDATA)
Then all observations with missing covariate for cov1 and cov2 are excluded. To mask those observations from the for plotting you might create an index variable which is TRUE when the covariates are not missing:
i.complete <- with( MYDATA , !is.na(cov1) & !is.na(cov2))
and then plot
lines( sm.spline( MYDATA$x[i.complete] , predict$fit ))
I know, that is very unelegant… using a smooth to plot the line is not a good programming practice, but I am quite new to R, too.
I have some improvements for the code. I will post them soon. It would be better to sort the MYDATA$x and predict$fit according to the x-values and then just plot a line. Something like:
i.x <- order( MYDATA$x )
lines(MYDATA$x[i.x] , predict$fit[i.x])
I posted it in a recent post concerning usual regression GAMs.
This would save the sm.spline() and you would not even need to load the library(pspline).
Hope that works and no typos…
Update: I have rewritten the function, so that observation with missing covariates are omitted in the plot-function (as they are in the cox-model anyway). Look at the static “PLOTHR” page following the link in the page banner.
thanks a lot.
Congratulations on the excellent R package. I would appreciate if you could give me a statistical reference (methodology paper) to this R package.
It is based on
Therneau TM, Grambsch PM: Modeling Survival Data: Extending the Cox Model. Springer 2000.
The relevant part is on page 108ff, chapter 4, Modeling the functional form of covariates.
Therneau is the maintainer of the “survival” package in R. I wonder that this plot method is not implemented more explicitly in “library(survival)”. As I mentioned in one of the additive survival posts, you can get some crude additive plots with the termplot() function…
Very glad to understand how to make a smoothing spline plot for a cox model.
But I am very eager to know is there any similar programme to make a smoothing spline plot for a logistic model.
Glad, you could use it 🙂
Of course it is possible. You might want to look at library(mgcv) of Simon Wood.
I wrote a post long ago about it, which I consider now outdated – one of my first approaches, but you might get some idea.
a additive logistic regression model in mgcv would look like:
model <- gam( Event ~ s( Covariate1) + s(Covariate2) + factor(Cofactor1) , data = MyData , family = "binomial")
Note that family = “binomial” specifies the logit-link-function aka the logistic regression model.
Hope no typos, quite drafty written.
First of all, millions of thanks for your help, I have been able to get the plot.
While I would like to know does the Y axis of plot(model) represents Ln(OR)
I get some program on internet as follows
test <- gam(participant ~ s(age,fx=FALSE,bs='cr'), family=binomial(logit))
test.pred <- predict(test,newdata=data,se.fit=TRUE,type='response',
plot(age[I1], test.pred$fit[I1],lty=1, type="l")
plot(age[I1], test.pred$fit[I1] , type="l")
could you tell me the difference between plot(test, shade=TRUE) and plot(age[I1], test.pred$fit[I1],lty=1, type="l") , especially the meaning of test.pred$fit.
If I focus on the relation between age and Ln(OR), which plot is I need.
Millions of thanks
The plots should be the same.
Doing it manually gives you more customization.
There is a typo in the syntax for the lines of the confidence limits. The model predictions
test.pred <- predict(test,newdata=data,se.fit=TRUE,type='response', na.action=na.omit)
are stored in test.pred.
Therefore the lines should be
lines( age[I1] , test.pred$fit[I1] + 1.96 * test.pred$se[I1] , lty = 2 )
Note that it is 1.96 times standard error for the 95% CI ( not 2 times). Ok, thats not so serious. But it has to be 1.96 * test.pred$se.
For the logistic regression model you would use
type = "link"
type = "response"
as far as I recall for the logit aka ln(OR).
Have a look at
before you start.
Then both plots should be exactly the same.
You mean that the value of Y axis of plot(model) indeed represents ln(odds ratio), is it?
Thank you for this very good program. Could we used it for a quantitative time dependent variable? thank you
Sorry, the reply is all too late, but in case someone hits your comment on google. Time dependent variables as described in the vignette are handled by R exactly as usual variables. The difference is that the rows of the dataset describe person_i,timepoint_j this means each row is treated as an independent person at a certain time. When the linear estimates are valid this way, the smooth is too. So: yes, the smoothing spline could be used in my opinion. You might validate this on the R-help mailing list. Note that plotHR just plots the term in a publication oriented polished way. The modelling is slightly extended coxph() functionality and described by Therneau (the package maintainer) himself in the monograph quoted in the blogpost (again: Therneau T, Grambsch P: Extending the Cox Model. Springer 2000. p 108ff).