v2.2 versions of labs except Ch10

This commit is contained in:
Jonathan Taylor
2024-06-04 18:07:35 -07:00
parent e5bbb1a5bc
commit 29526fb7bc
25 changed files with 19373 additions and 10042 deletions

---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd,ipynb
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---
# Chapter 11
# Survival Analysis
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch11-surv-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch11-surv-lab.ipynb)
# Lab: Survival Analysis
In this lab, we perform survival analyses on three separate data
sets. In Section 11.8.1 we analyze the `BrainCancer`
data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
data from Section 11.5.4. Finally, Section 11.8.3 explores
a simulated call-center data set.
We begin by importing some of our libraries at this top level, and also collect the new imports needed for this lab.
```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from ISLP.models import ModelSpec as MS
from ISLP import load_data
from lifelines import (KaplanMeierFitter,
                       CoxPHFitter)
from lifelines.statistics import (logrank_test,
                                  multivariate_logrank_test)
from ISLP.survival import sim_time
```
## Brain Cancer Data
We begin with the `BrainCancer` data set from the `ISLP` package.
```{python}
BrainCancer = load_data('BrainCancer')
BrainCancer.columns
```
The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data.
```{python}
BrainCancer['sex'].value_counts()
```
```{python}
BrainCancer['diagnosis'].value_counts()
```
```{python}
BrainCancer['status'].value_counts()
```
Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an uncensored
observation, and a `status` of 0 indicates a censored
observation. But some scientists might use the opposite coding. For
the `BrainCancer` data set 35 patients died before the end of
the study, so we are using the conventional coding.
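As a tiny illustration of this convention (hypothetical status values, not the `BrainCancer` data), counting events is just summing the indicator:

```python
import pandas as pd

# Hypothetical status column: 1 = uncensored (event observed), 0 = censored
status = pd.Series([1, 0, 1, 1, 0, 0, 1])

n_events = int(status.sum())           # observed events
n_censored = int((status == 0).sum())  # censored observations
```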
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main
package we will use for survival analysis
is `lifelines`.
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
death).
```{python}
fig, ax = subplots(figsize=(8, 8))
km = KaplanMeierFitter()
km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
```
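Under the hood, the Kaplan-Meier estimate multiplies one-step survival fractions over the event times. A minimal hand computation on toy data (a sketch independent of `lifelines`; the times and statuses are made up):

```python
import numpy as np

# Toy sample: status 1 = event observed, 0 = censored
times = np.array([3., 5., 5., 8., 10.])
status = np.array([1, 0, 1, 1, 0])

surv = 1.0
km_est = {}
for t in np.unique(times[status == 1]):
    at_risk = np.sum(times >= t)                   # n_k: still under observation
    events = np.sum((times == t) & (status == 1))  # d_k: events at time t
    surv *= 1 - events / at_risk                   # S(t) = prod_k (1 - d_k / n_k)
    km_est[t] = surv
```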
Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure 11.3.
We do this using the `groupby()` method of a dataframe.
This method returns a generator that can
be iterated over in the `for` loop. In this case,
each iteration yields a 2-tuple: the value of `sex` defining the group,
and the corresponding sub-dataframe, which we also store in the
dictionary `by_sex` for use below.
```{python}
fig, ax = subplots(figsize=(8, 8))
by_sex = {}
for sex, df in BrainCancer.groupby('sex'):
    by_sex[sex] = df
    km_sex = km.fit(df['time'], df['status'])
    km_sex.plot(label='Sex=%s' % sex, ax=ax)
```
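The `groupby()` pattern above can be seen on a toy dataframe (made-up values): each iteration yields a group label paired with the matching sub-dataframe.

```python
import pandas as pd

df = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                   'time': [10., 20., 30.]})

# Each iteration yields a (label, sub-dataframe) pair
groups = {sex: sub for sex, sub in df.groupby('sex')}
```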
As discussed in Section 11.4, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
The first two arguments are the event times, with the second
argument giving the event times of the second group; the next two
arguments are the corresponding censoring indicators.
```{python}
logrank_test(by_sex['Male']['time'],
             by_sex['Female']['time'],
             by_sex['Male']['status'],
             by_sex['Female']['status'])
```
The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes.
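The log-rank statistic itself is simple enough to compute by hand: at each event time we compare the observed events in one group with the count expected under a common hazard. A toy sketch (made-up data, not the `BrainCancer` comparison):

```python
import numpy as np

# Toy two-group data: status 1 = event, 0 = censored
time_a, status_a = np.array([2., 4., 6.]), np.array([1, 1, 0])
time_b, status_b = np.array([3., 5., 7.]), np.array([1, 0, 1])

times = np.concatenate([time_a, time_b])
status = np.concatenate([status_a, status_b])
group = np.array([0] * 3 + [1] * 3)   # 0 = group A, 1 = group B

O = E = V = 0.0
for t in np.unique(times[status == 1]):
    at_risk = times >= t
    n = at_risk.sum()                             # total still at risk
    n_a = (at_risk & (group == 0)).sum()          # at risk in group A
    d = ((times == t) & (status == 1)).sum()      # events at time t
    d_a = ((times == t) & (status == 1) & (group == 0)).sum()
    O += d_a                                      # observed events in A
    E += d * n_a / n                              # expected under H0
    if n > 1:                                     # hypergeometric variance
        V += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

chi_sq = (O - E) ** 2 / V                         # compare to chi-squared(1)
```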
Next we fit a Cox proportional hazards model to these data, using the
`CoxPHFitter()` estimator from `lifelines`.
```{python}
coxph = CoxPHFitter   # shorthand
sex_df = BrainCancer[['time', 'status', 'sex']]
model_df = MS(['time', 'status', 'sex'],
              intercept=False).fit_transform(sex_df)
cox_fit = coxph().fit(model_df,
                      'time',
                      'status')
cox_fit.summary[['coef', 'se(coef)', 'p']]
```
The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case).
We can obtain a likelihood ratio test comparing this model to the one
with no features as follows:
```{python}
cox_fit.log_likelihood_ratio_test()
```
Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is
exactly equal to the log-rank test statistic. Next we fit a model
that makes use of all of the predictors, first dropping rows with
missing values.
```{python}
cleaned = BrainCancer.dropna()
all_MS = MS(cleaned.columns, intercept=False)
all_df = all_MS.fit_transform(cleaned)
fit_all = coxph().fit(all_df,
                      'time',
                      'status')
fit_all.summary[['coef', 'se(coef)', 'p']]
```
The `diagnosis` variable has been coded so that the baseline
corresponds to meningioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
with meningioma.
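The quoted hazard ratio is just the exponentiated coefficient; note that 8.62 corresponds to the unrounded coefficient (approximately 2.154), while the displayed 2.15 exponentiates to about 8.58:

```python
import numpy as np

# exp(coef) turns a Cox coefficient into a hazard ratio
hr_rounded = np.exp(2.15)    # from the coefficient as rounded for display
hr_quoted = np.exp(2.154)    # approximately the 8.62 quoted in the text
```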
To plot estimated survival curves for each diagnosis, we first choose
representative values for the remaining predictors: the mean for
quantitative variables and the modal value for categorical ones.
```{python}
levels = cleaned['diagnosis'].dropna().unique()
def representative(series):
    if hasattr(series.dtype, 'categories'):
        return pd.Series.mode(series)
    else:
        return series.mean()
modal_data = cleaned.apply(representative, axis=0)
```
We make four
copies of the column means and assign the `diagnosis` column to be the four different
diagnoses.
```{python}
modal_df = pd.DataFrame(
              [modal_data.iloc[0]] * len(levels))
modal_df['diagnosis'] = levels
modal_df
```
We then construct the model matrix based on the model specification `all_MS` used to fit
the model, name the rows according to the levels of `diagnosis`, and
compute the predicted survival curves.
```{python}
modal_X = all_MS.transform(modal_df)
modal_X.index = levels
predicted_survival = fit_all.predict_survival_function(modal_X)
fig, ax = subplots(figsize=(8, 8))
predicted_survival.plot(ax=ax);
```
## Publication Data
The `Publication` data presented in Section 11.5.4 can be
found in the `ISLP` package.
We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves
stratified on the `posres` variable, which records whether the
study had a positive or negative result.
```{python}
fig, ax = subplots(figsize=(8, 8))
Publication = load_data('Publication')
for result, df in Publication.groupby('posres'):
    km_result = km.fit(df['time'], df['status'])
    km_result.plot(label='Result=%d' % result, ax=ax)
```
As discussed previously, the $p$-values from fitting Cox's
proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication
```{python}
posres_df = MS(['posres',
                'time',
                'status'],
               intercept=False).fit_transform(Publication)
posres_fit = coxph().fit(posres_df,
                         'time',
                         'status')
posres_fit.summary[['coef', 'se(coef)', 'p']]
```
However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism
variable.
```{python}
model = MS(Publication.columns.drop('mech'),
           intercept=False)
coxph().fit(model.fit_transform(Publication),
            'time',
            'status').summary[['coef', 'se(coef)', 'p']]
```
We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results.
## Call Center Data
In this section, we will simulate survival data using the relationship
between cumulative hazard and
the survival function explored in Exercise 8.
Our simulated data will represent the observed
wait times (in seconds) for 2,000 customers who have phoned a call
center. In this context, censoring occurs if a customer hangs up
before the call is answered. We first generate a data frame `D`
containing the covariates `Operators` (the number of operators on
duty), the call `Center`, and the `Time` of day, and then build the
model matrix.
```{python}
rng = np.random.default_rng(10)
N = 2000
Operators = rng.choice(np.arange(5, 16), N)
Center = rng.choice(['A', 'B', 'C'], N)
Time = rng.choice(['Morn.', 'After.', 'Even.'], N)
D = pd.DataFrame({'Operators': Operators,
                  'Center': pd.Categorical(Center),
                  'Time': pd.Categorical(Time)})
model = MS(['Operators',
            'Center',
            'Time'],
           intercept=False)
X = model.fit_transform(D)
```
It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
of the variable is dropped.
```{python}
X[:5]
```
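The drop-first convention can be seen with `pandas.get_dummies`, used here as a stand-in for the `ModelSpec` encoding (toy column):

```python
import pandas as pd

df = pd.DataFrame({'Center': ['A', 'B', 'C', 'B']})
# Levels are sorted (A, B, C); the first becomes the baseline and is dropped
X_demo = pd.get_dummies(df, drop_first=True)
```

A row of all zeros in the encoded columns therefore corresponds to the baseline level A.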
Next, we specify the coefficients and the hazard function.
```{python}
true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2])
true_linpred = X.dot(true_beta)
hazard = lambda t: 1e-5 * t
cum_hazard = lambda t: 1e-5 * t**2 / 2
```
Here, the coefficient of $-0.3$ for Center B indicates
that the risk of a call being answered at Center B is 0.74 times the
risk that it will be answered at Center A; in other words, the wait
times are a bit longer at Center B.
Recall from Section 2.3.7 the use of `lambda`
for creating short functions on the fly.
We use the function
`sim_time()` from the `ISLP.survival` package. This function
uses the relationship between the survival function and the cumulative
hazard, $S(t) = \exp(-H(t))$, to generate an event time for each
value of the linear predictor.
```{python}
W = np.array([sim_time(l, cum_hazard, rng)
              for l in true_linpred])
D['Wait time'] = np.clip(W, 0, 1000)
```
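The idea behind `sim_time()` can be sketched with inverse-transform sampling (a toy stand-in, not the `ISLP.survival` implementation): since $S(t)=\exp(-H(t))$, drawing $U \sim \mathrm{Uniform}(0,1)$ and solving $H(T) = -\log U$ gives a $T$ with survival function $S$.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_sim_time(cum_hazard_inv, rng):
    # H(T) = -log(U)  =>  T = H^{-1}(-log U)
    u = rng.uniform()
    return cum_hazard_inv(-np.log(u))

# Constant hazard lam gives H(t) = lam * t, so H^{-1}(x) = x / lam,
# i.e. Exponential(rate=lam) draws with mean 1 / lam
lam = 0.1
draws = np.array([toy_sim_time(lambda x: x / lam, rng)
                  for _ in range(20000)])
```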
We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`).
```{python}
D['Failed'] = rng.choice([1, 0],
                         size=2000,
                         p=[0.9, 0.1])
D[:5]
```
```{python}
D['Failed'].mean()
```
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.
```{python}
fig, ax = subplots(figsize=(8, 8))
for center, df in D.groupby('Center'):
    km_center = km.fit(df['Wait time'], df['Failed'])
    km_center.plot(label='Center=%s' % center, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
Next, we stratify by `Time`.
```{python}
fig, ax = subplots(figsize=(8, 8))
for time, df in D.groupby('Time'):
    km_time = km.fit(df['Wait time'], df['Failed'])
    km_time.plot(label='Time=%s' % time, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a
log-rank test to compare these groups more formally. We begin with
`Center`.
```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Center'],
                          D['Failed'])
```
Next, we consider the effect of `Time`.
```{python}
@@ -498,8 +490,8 @@ multivariate_logrank_test(D['Wait time'],
D['Failed'])
```
As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we
look at the results for `Center`.
```{python}
X = MS(['Center'], intercept=False).fit_transform(D)
X['Wait time'] = D['Wait time']
X['Failed'] = D['Failed']
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
Next, we look at the results for `Time`.
```{python}
X = MS(['Time'], intercept=False).fit_transform(D)
X['Wait time'] = D['Wait time']
X['Failed'] = D['Failed']
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
We find that differences between centers are highly significant, as
are differences between times of day.
Finally, we fit Cox's proportional hazards model using all of the
predictors.
```{python}
X = MS(['Operators',
        'Center',
        'Time'],
       intercept=False).fit_transform(D)
X['Wait time'] = D['Wait time']
X['Failed'] = D['Failed']
fit_queuing = coxph().fit(
                  X,
                  'Wait time',
                  'Failed')
fit_queuing.summary[['coef', 'se(coef)', 'p']]
```
The $p$-values for Center B and evening time
are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be