---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd,ipynb
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---
# Chapter 11

# Survival Analysis

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch11-surv-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch11-surv-lab.ipynb)

In this lab, we perform survival analyses on three separate data
sets. In Section 11.8.1 we analyze the `BrainCancer`
data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
data from Section 11.5.4. Finally, Section 11.8.3 explores
a simulated call-center data set.

We begin by importing some of our libraries at this top
level.

```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from ISLP.models import ModelSpec as MS
from ISLP import load_data
```

We also collect the new imports
needed for this lab.

```{python}
from lifelines import \
     (KaplanMeierFitter,
      CoxPHFitter)
from lifelines.statistics import \
     (logrank_test,
      multivariate_logrank_test)
from ISLP.survival import sim_time
```

```{python}
BrainCancer = load_data('BrainCancer')
BrainCancer.columns
```

The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data.
```{python}
BrainCancer['sex'].value_counts()
```

```{python}
BrainCancer['diagnosis'].value_counts()
```

```{python}
BrainCancer['status'].value_counts()
```
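As an aside, `value_counts()` is essentially a frequency table sorted by count; a stdlib-only sketch of the same idea (toy values, not the actual column):

```{python}
from collections import Counter

sex_column = ['Female', 'Male', 'Female', 'Female']  # toy data
counts = Counter(sex_column).most_common()           # most frequent first
```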
Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an
uncensored observation, and a `status` of 0 indicates a censored
observation. But some scientists might use the opposite coding. For
the `BrainCancer` data set 35 patients died before the end of
the study, so we are using the conventional coding.
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main
package we will use for survival analysis
is `lifelines`.
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
death).

```{python}
fig, ax = subplots(figsize=(8, 8))
km = KaplanMeierFitter()
km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
```
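For intuition about what `km.fit()` computes: the Kaplan-Meier estimate is a running product over the distinct event times. A minimal NumPy sketch (toy data, not the `BrainCancer` variables):

```{python}
import numpy as np

def km_estimate(times, status):
    # Kaplan-Meier: at each distinct observed event time t, multiply the
    # running survival estimate by (1 - d_t / n_t), where d_t counts
    # events at t and n_t counts subjects still at risk just before t.
    times = np.asarray(times, float)
    status = np.asarray(status)
    S, curve = 1.0, {}
    for t in np.unique(times[status == 1]):
        n_t = (times >= t).sum()
        d_t = ((times == t) & (status == 1)).sum()
        S *= 1 - d_t / n_t
        curve[float(t)] = S
    return curve

curve = km_estimate([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored observations (status 0) never appear as event times but do shrink the risk sets, which is exactly how censoring enters the estimator.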
Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure 11.3.
We do this using the `groupby()` method of a dataframe.
This method returns a generator that can
be iterated over in the `for` loop. In this case,
the items in the `for` loop are the two groups of observations defined by `sex`.

```{python}
fig, ax = subplots(figsize=(8, 8))
by_sex = {}
for sex, df in BrainCancer.groupby('sex'):
    by_sex[sex] = df
    km_sex = km.fit(df['time'], df['status'])
    km_sex.plot(label='Sex=%s' % sex, ax=ax)
```
As discussed in Section 11.4, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
The first two arguments are the event times, with the second
two arguments the corresponding censoring indicators.

```{python}
logrank_test(by_sex['Male']['time'],
             by_sex['Female']['time'],
             by_sex['Male']['status'],
             by_sex['Female']['status'])
```

The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes.
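The statistic behind that $p$-value can be sketched by hand: at each event time, compare the observed number of events in one group to its hypergeometric expectation given the risk sets. A NumPy sketch (no tie corrections beyond the standard variance; `lifelines` is the authoritative implementation):

```{python}
import numpy as np

def logrank_statistic(time_a, status_a, time_b, status_b):
    # Two-sample log-rank chi-square statistic.
    times = np.concatenate([time_a, time_b]).astype(float)
    status = np.concatenate([status_a, status_b]).astype(int)
    in_a = np.concatenate([np.ones(len(time_a)),
                           np.zeros(len(time_b))]).astype(bool)
    O_minus_E, V = 0.0, 0.0
    for t in np.unique(times[status == 1]):
        at_risk = times >= t
        n, n_a = at_risk.sum(), (at_risk & in_a).sum()
        d = ((times == t) & (status == 1)).sum()
        d_a = ((times == t) & (status == 1) & in_a).sum()
        O_minus_E += d_a - d * n_a / n            # observed minus expected
        if n > 1:
            V += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return O_minus_E**2 / V

stat = logrank_statistic([1, 2], [1, 1], [1.5, 3], [1, 1])
```

Under the null hypothesis of equal survival, the statistic follows a $\chi^2_1$ distribution, from which the $p$-value is computed.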
Next, we fit Cox proportional hazards models using the `CoxPHFitter()` estimator
from `lifelines`. To begin, we consider a model that uses `sex` as the only predictor.

```{python}
coxph = CoxPHFitter  # shorthand
sex_df = BrainCancer[['time', 'status', 'sex']]
model_df = MS(['time', 'status', 'sex'],
              intercept=False).fit_transform(sex_df)
cox_fit = coxph().fit(model_df,
                      'time',
                      'status')
cox_fit.summary[['coef', 'se(coef)', 'p']]
```

The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case).
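Under the hood, `fit` maximizes Cox's partial likelihood over the coefficients. A hand-rolled sketch for a single covariate (illustrative only; `lifelines` handles ties and uses more stable numerics):

```{python}
import numpy as np

def cox_partial_loglik(beta, time, status, x):
    # Cox partial log-likelihood for one covariate: each observed event
    # contributes its own risk score minus the log of the summed risk
    # scores of everyone still at risk at that event time.
    time, x = np.asarray(time, float), np.asarray(x, float)
    status = np.asarray(status)
    ll = 0.0
    for t, s, xi in zip(time, status, x):
        if s == 1:
            at_risk = time >= t
            ll += beta * xi - np.log(np.exp(beta * x[at_risk]).sum())
    return ll

ll0 = cox_partial_loglik(0.0, [1, 2, 3], [1, 1, 1], [0.5, -0.2, 1.0])
```

At $\beta=0$ every subject has risk score 1, so each event contributes $-\log n_t$; with three uncensored times the value is $-\log 6$.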
It is possible to obtain the likelihood ratio test comparing this model to the one
with no features as follows:

```{python}
cox_fit.log_likelihood_ratio_test()
```

Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is
exactly equal to the log-rank test statistic!
Now we fit a model that makes use of additional predictors. We first note
that one of our `diagnosis` values is missing, hence we drop that observation
before continuing.

```{python}
cleaned = BrainCancer.dropna()
all_MS = MS(cleaned.columns, intercept=False)
all_df = all_MS.fit_transform(cleaned)
fit_all = coxph().fit(all_df,
                      'time',
                      'status')
fit_all.summary[['coef', 'se(coef)', 'p']]
```

The `diagnosis` variable has been coded so that the baseline
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
with meningioma.
To plot survival curves for each diagnosis while adjusting for the other
predictors, we set the other predictors equal to the mean for quantitative
variables and the mode for categorical ones, using a function `representative`
applied across the columns.

```{python}
levels = cleaned['diagnosis'].unique()
def representative(series):
    if hasattr(series.dtype, 'categories'):
        return pd.Series.mode(series)
    else:
        return series.mean()
modal_data = cleaned.apply(representative, axis=0)
```
We make four
copies of the column means and assign the `diagnosis` column to be the four different
diagnoses.

```{python}
modal_df = pd.DataFrame(
              [modal_data.iloc[0]] * len(levels))
modal_df['diagnosis'] = levels
modal_df
```
We then construct the model matrix based on the model specification `all_MS` used to fit
the model, and name the rows according to the levels of `diagnosis`.

```{python}
modal_X = all_MS.transform(modal_df)
modal_X.index = levels
fig, ax = subplots(figsize=(8, 8))
# predict_survival_function() returns the estimated survival curve
# for each row of the model matrix
predicted_survival = fit_all.predict_survival_function(modal_X)
predicted_survival.plot(ax=ax);
```
## Publication Data
The `Publication` data presented in Section 11.5.4 can be
found in the `ISLP` package.
We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves
stratified on the `posres` variable, which records whether the
study had a positive or negative result.

```{python}
Publication = load_data('Publication')
fig, ax = subplots(figsize=(8, 8))
by_result = {}
for result, df in Publication.groupby('posres'):
    by_result[result] = df
    km_result = km.fit(df['time'], df['status'])
    km_result.plot(label='Result=%d' % result, ax=ax)
```
As discussed previously, the $p$-values from fitting Cox's
proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication
between studies with positive versus negative results.

```{python}
posres_df = MS(['posres',
                'time',
                'status'],
               intercept=False).fit_transform(Publication)
posres_fit = coxph().fit(posres_df,
                         'time',
                         'status')
posres_fit.summary[['coef', 'se(coef)', 'p']]
```
However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism
variable.

```{python}
model = MS(Publication.columns.drop('mech'),
           intercept=False)
coxph().fit(model.fit_transform(Publication),
            'time',
            'status').summary[['coef', 'se(coef)', 'p']]
```

We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results.
## Call Center Data

In this section, we will simulate survival data using the relationship
between cumulative hazard and
the survival function explored in Exercise 8.
Our simulated data will represent the observed
wait times (in seconds) for 2,000 customers who have phoned a call
center. In this context, censoring occurs if a customer hangs up
before his or her call is answered.

```{python}
# simulate predictors for N = 2,000 calls
rng = np.random.default_rng(10)
N = 2000
Operators = rng.choice(np.arange(5, 16), N)
Center = rng.choice(['A', 'B', 'C'], N)
Time = rng.choice(['Morn.', 'After.', 'Even.'], N)
D = pd.DataFrame({'Operators': Operators,
                  'Center': pd.Categorical(Center),
                  'Time': pd.Categorical(Time)})
model = MS(['Operators',
            'Center',
            'Time'],
           intercept=False)
X = model.fit_transform(D)
```
It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
of the variable is dropped.

```{python}
X[:5]
```
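The coding rule just described can be sketched in a few lines of plain Python (hypothetical toy levels, not the columns of `D`):

```{python}
def one_hot_drop_first(values):
    # Mirror the coding described above: levels sorted alphabetically,
    # first (baseline) level dropped from the one-hot columns.
    levels = sorted(set(values))
    kept = levels[1:]
    return [[int(v == level) for level in kept] for v in values], kept

rows, kept = one_hot_drop_first(['B', 'A', 'C', 'A'])
```

Here `'A'` becomes the baseline: its rows are all zeros, and each remaining level gets its own indicator column.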
Next, we specify the coefficients and the hazard function.

```{python}
# coefficients for the columns of X; the value -0.3 for Center[B]
# is implied by the hazard ratio of 0.74 discussed below
true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2])
true_linpred = X.dot(true_beta)
cum_hazard = lambda t: 1e-5 * t**2 / 2   # cumulative hazard H(t)
```

Here the coefficient of $-0.3$ for Center B means
that the risk of a call being answered at Center B is 0.74 times the
risk that it will be answered at Center A; in other words, the wait
times are a bit longer at Center B.
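The 0.74 ratio is just the exponentiated Cox coefficient; assuming a coefficient of $-0.3$ for Center B (a value consistent with the ratio quoted above):

```{python}
import math

beta_center_b = -0.3   # assumed value, implied by the 0.74 ratio above
hazard_ratio = math.exp(beta_center_b)
```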
Recall from Section 2.3.7 the use of `lambda`
for creating short functions on the fly.
We use the function
`sim_time()` from the `ISLP.survival` package. This function
simulates an event time, given a linear predictor, a cumulative hazard
function, and a random number generator.

```{python}
W = np.array([sim_time(l, cum_hazard, rng)
              for l in true_linpred])
D['Wait time'] = np.clip(W, 0, 1000)
```
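The exact implementation of `sim_time()` is not shown here, but under the cumulative hazard used above, $H(t)=10^{-5}t^2/2$, inverse-transform sampling gives a one-line recipe: draw $E \sim \mathrm{Exp}(1)$ and solve $e^{\eta}H(T)=E$ for $T$. A sketch under that assumption:

```{python}
import numpy as np

def sim_time_sketch(linpred, rng):
    # If S(t) = exp(-exp(linpred) * H(t)) with H(t) = 1e-5 * t**2 / 2,
    # then T = H^{-1}(E * exp(-linpred)) for E ~ Exp(1).
    E = rng.exponential()
    u = E * np.exp(-linpred)
    return np.sqrt(2 * u / 1e-5)   # H^{-1}(u)

t_low = sim_time_sketch(0.0, np.random.default_rng(1))
t_high = sim_time_sketch(1.0, np.random.default_rng(1))  # higher risk, same draw
```

With the same exponential draw, a larger linear predictor (higher risk) always yields a shorter simulated wait time.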
We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`).

```{python}
D['Failed'] = rng.choice([1, 0],
                         N,
                         p=[0.9, 0.1])
D[:5]
```
```{python}
D['Failed'].mean()
```
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.

```{python}
fig, ax = subplots(figsize=(8, 8))
by_center = {}
for center, df in D.groupby('Center'):
    by_center[center] = df
    km_center = km.fit(df['Wait time'], df['Failed'])
    km_center.plot(label='Center=%s' % center, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
Next, we stratify by `Time`.

```{python}
fig, ax = subplots(figsize=(8, 8))
by_time = {}
for time, df in D.groupby('Time'):
    by_time[time] = df
    km_time = km.fit(df['Wait time'], df['Failed'])
    km_time.plot(label='Time=%s' % time, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a
log-rank test to assess the significance of these differences, using the
`multivariate_logrank_test()` function. First, we consider the effect of `Center`.

```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Center'],
                          D['Failed'])
```
Next, we consider the effect of `Time`.

```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Time'],
                          D['Failed'])
```
As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we
look at the results for `Center`.

```{python}
X = MS(['Wait time',
        'Failed',
        'Center'],
       intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
Next, we look at the results for `Time`.

```{python}
X = MS(['Wait time',
        'Failed',
        'Time'],
       intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
We find that differences between centers are highly significant, as
are differences between times of day.

Finally, we fit Cox's proportional hazards model using all of the predictors.

```{python}
X = MS(D.columns,
       intercept=False).fit_transform(D)
fit_queuing = coxph().fit(
                  X,
                  'Wait time',
                  'Failed')
fit_queuing.summary[['coef', 'se(coef)', 'p']]
```
The $p$-values for Center B and evening time
are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be
answered --- increases with the number of operators.