diff --git a/Ch14-surv-lab.ipynb b/Ch14-surv-lab.ipynb deleted file mode 100644 index 01921a7..0000000 --- a/Ch14-surv-lab.ipynb +++ /dev/null @@ -1,1046 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "c1af8696", - "metadata": {}, - "source": [ - "# Dummy\n", - "\n", - "## Lab: Survival Analysis\n", - "\n", - "### Some preliminary details\n", - "\n", - "We will want to be able to see plots in the notebook. We\n", - "will therefore include the following as introduced in\n", - "Chapter 2:\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0bce7ad7", - "metadata": {}, - "outputs": [], - "source": [ - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "id": "98e95e67", - "metadata": {}, - "source": [ - "#### New imports\n", - "\n", - "We again collect the new imports\n", - "needed for this lab.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3741ca31", - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "from lifelines import \\\n", - " (KaplanMeierFitter,\n", - " CoxPHFitter)\n", - "from lifelines.statistics import \\\n", - " (logrank_test,\n", - " multivariate_logrank_test)\n", - "from ISLP.survival import sim_time" - ] - }, - { - "cell_type": "markdown", - "id": "c1dff6d8", - "metadata": {}, - "source": [ - "In this lab, we perform survival analyses on three separate data\n", - "sets. In the first section we analyze the `BrainCancer`\n", - "data .
\n", - "Next, we examine the `Publication`\n", - "data . Finally, we explore explores\n", - "a simulated call center data set.\n", - "\n", - "We begin by importing some of our libraries at this top\n", - "level. This makes the code more readable, as scanning the first few\n", - "lines of the notebook tell us what libraries are used in this\n", - "notebook.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6402fd76", - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "from ISLP.models import ModelSpec as MS\n", - "from ISLP import load_data" - ] - }, - { - "cell_type": "markdown", - "id": "90eb295f", - "metadata": {}, - "source": [ - "### Brain Cancer Data" - ] - }, - { - "cell_type": "markdown", - "id": "26338f8e", - "metadata": {}, - "source": [ - "We begin with the `BrainCancer` data set, contained in the `ISLP` package.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b43f8c6f", - "metadata": {}, - "outputs": [], - "source": [ - "BrainCancer = load_data('BrainCancer')\n", - "BrainCancer.columns" - ] - }, - { - "cell_type": "markdown", - "id": "61eed2f0", - "metadata": {}, - "source": [ - "The rows index the 88 patients, while the columns contain the 8 predictors.\n", - "We first briefly examine the data.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4307ef61", - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [], - "source": [ - "BrainCancer['sex'].value_counts()" - ] - }, - { - "cell_type": "markdown", - "id": "1a83e4f7", - "metadata": {}, - "source": [ - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "19c25475", - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [], - "source": [ - "BrainCancer['diagnosis'].value_counts()" - ] - }, - { - "cell_type": "markdown", - "id": "7fcd779d", - "metadata": {}, - "source": [ - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d9fa6491", - "metadata": {}, - "outputs": [], - "source": [ - "BrainCancer['status'].value_counts()" - ] - }, - { - "cell_type": "markdown", - "id": "bd1e89cb", - "metadata": {}, - "source": [ - "Before beginning an analysis, it is important to know how the\n", - "`status` variable has been coded. Most software\n", - "uses the convention that a `status` of 1 indicates an\n", - "uncensored observation, and a `status` of 0 indicates a censored\n", - "observation. But some scientists might use the opposite coding. For\n", - "the `BrainCancer` data set 35 patients died before the end of\n", - "the study.\n", - "\n", - "To begin the analysis, we re-create a Kaplan-Meier survival curve shown in the book. The main\n", - "package we will use for survival analysis\n", - "is `lifelines`.\n", - "The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or\n", - "death). The first argument to `km.fit` are the event times with the\n", - "second argument being the censoring variable with a 1 indicating an observed\n", - "failure time. The `plot` method plots a curve with pointwise confidence\n", - "intervals. By default, they produce 90% confidence intervals. This can be changed\n", - "by setting the `alpha` argument to 1 minus the desired\n", - "confidence level.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "16b708ff", - "metadata": {}, - "outputs": [], - "source": [ - "km = KaplanMeierFitter()\n", - "km_brain = km.fit(BrainCancer['time'],\n", - " BrainCancer['status'])\n", - "km_brain.plot(label='Kaplan Meier estimate')" - ] - }, - { - "cell_type": "markdown", - "id": "71d6ef47", - "metadata": {}, - "source": [ - "Next we create Kaplan-Meier survival curves that are stratified by\n", - "`sex`, in order to reproduce a second figure from the book.\n", - "We do this by using the `groupby` method of a DataFrame.\n", - "This method returns a generator that can\n", - "be iterated over in the `for` loop. In this case,\n", - "the items in the `for` loop are 2-tuples representing\n", - "the groups: the first entry is the value\n", - "of the grouping column `sex` while the 2nd value\n", - "is the DataFrame consisting of all rows in the\n", - "DataFrame matching that value of `sex`.\n", - "We will want to use this data below\n", - "in the log-rank test, hence we store this\n", - "information in the dictionary `by_sex`. Finally,\n", - "we have also used the notion of\n", - " to automatically\n", - "label the different lines in the plot. String\n", - "interpolation is a powerful technique to format strings –\n", - "`python` has many ways to facilitate such operations.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cd07dfd6", - "metadata": {}, - "outputs": [], - "source": [ - "by_sex = {}\n", - "for sex, df in BrainCancer.groupby('sex'):\n", - " by_sex[sex] = df\n", - " km_sex = km.fit(df['time'],\n", - " df['status'])\n", - " km_sex.plot(label='Sex=%s' % sex)" - ] - }, - { - "cell_type": "markdown", - "id": "8d40bd2f", - "metadata": {}, - "source": [ - "As discussed in the book, we can perform a\n", - "log-rank test to compare the survival of males to females. We use\n", - "the `logrank_test` function from the `lifelines.statistics` module.\n", - "The first two arguments are the event times, with the second\n", - "denoting the corresponding (optional) censoring indicators.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f4e80bc9", - "metadata": {}, - "outputs": [], - "source": [ - "logrank_test(by_sex['Male']['time'],\n", - " by_sex['Female']['time'],\n", - " by_sex['Male']['status'],\n", - " by_sex['Female']['status'])" - ] - }, - { - "cell_type": "markdown", - "id": "3e025210", - "metadata": {}, - "source": [ - "The resulting $p$-value is $0.23$, indicating no evidence of a\n", - "difference in survival between the two sexes.\n", - "\n", - "Next, we fit Cox proportional hazards models using the `CoxPHFitter` estimator\n", - "from `lifelines`.\n", - "To begin, we consider a model that uses `sex` as the only predictor.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b57ac5fa", - "metadata": {}, - "outputs": [], - "source": [ - "coxph = CoxPHFitter # shorthand\n", - "sex_df = BrainCancer[['time', 'status', 'sex']]\n", - "model_df = MS(['time', 'status', 'sex'],\n", - " intercept=False).fit_transform(sex_df)\n", - "cox_fit = coxph().fit(model_df,\n", - " 'time',\n", - " 'status')\n", - "cox_fit.summary[['coef', 'se(coef)', 'p']]" - ] - }, - { - "cell_type": "markdown", - "id": "da8b9eff", - "metadata": {}, - "source": [ - "The first argument to `fit` should be a data frame containing\n", - "at least the event time (the second argument `time` in this case)\n", - "as well as an optional censoring variable (the argument `status` in this case).\n", - "Note also that the Cox model does not include an intercept, which is why\n", - "we used the `intercept=False` argument to `ModelSpec` above.\n", - "It is possible to obtain the likelihood ratio test comparing this model to the one\n", - "with no features as follows:\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9991b523", - "metadata": {}, - "outputs": [], - "source": [ - "cox_fit.log_likelihood_ratio_test()" - ] - }, - { - "cell_type": "markdown", - "id": "fc76fd07", - "metadata": {}, - "source": [ - "Regardless of which test we use, we see that there is no clear\n", - "evidence for a difference in survival between males and females. As\n", - "we learned in this chapter, the score test from the Cox model is\n", - "exactly equal to the log rank test statistic!\n", - "\n", - "Now we fit a model that makes use of additional predictors. We first note\n", - "that one of our `diagnosis` values is missing, hence\n", - "we drop that observation before continuing.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "391157a9", - "metadata": {}, - "outputs": [], - "source": [ - "cleaned = BrainCancer.dropna()\n", - "all_MS = MS(cleaned.columns,\n", - " intercept=False)\n", - "all_df = all_MS.fit_transform(cleaned)\n", - "fit_all = coxph().fit(all_df, \n", - " 'time', \n", - " 'status')\n", - "fit_all.summary[['coef', 'se(coef)', 'p']]" - ] - }, - { - "cell_type": "markdown", - "id": "18659bcd", - "metadata": {}, - "source": [ - "The `diagnosis` variable has been coded so that the baseline\n", - "corresponds to HG glioma. The results indicate that the risk associated with HG glioma\n", - "is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated\n", - "with meningioma. In other words, after adjusting for the other\n", - "predictors, patients with HG glioma have much worse survival compared\n", - "to those with meningioma. In addition, larger values of the Karnofsky\n", - "index, `ki`, are associated with lower risk, i.e. longer survival.\n", - "\n", - "Finally, we plot estimated survival curves for each diagnosis category,\n", - "adjusting for the other predictors. To make these plots, we set the\n", - "values of the other predictors equal to the mean for quantitative variables\n", - "and equal to the mode for categorical. To do this, we use the\n", - "`apply()` method across rows (i.e. `axis=0`) with a function\n", - "`representative` that checks if a column is categorical\n", - "or not.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c2c32bdf", - "metadata": {}, - "outputs": [], - "source": [ - "levels = cleaned['diagnosis'].unique()\n", - "def representative(series):\n", - " if hasattr(series.dtype, 'categories'):\n", - " return pd.Series.mode(series)\n", - " else:\n", - " return series.mean()\n", - "modal_data = cleaned.apply(representative, \n", - " axis=0)" - ] - }, - { - "cell_type": "markdown", - "id": "2f5e9767", - "metadata": {}, - "source": [ - "We make four\n", - "copies of the column means and assign the `diagnosis` column to be the four different\n", - "diagnoses.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7a6fc4b9", - "metadata": {}, - "outputs": [], - "source": [ - "modal_df = pd.DataFrame([modal_data.iloc[0] for _ in range(len(levels))])\n", - "modal_df['diagnosis'] = levels\n", - "modal_df" - ] - }, - { - "cell_type": "markdown", - "id": "dd013465", - "metadata": {}, - "source": [ - "We then construct the design matrix based on the model specification `all_MS` used to fit\n", - "the model, dropping the `time` and `status` variables as these are not\n", - "used to predict the survival functioin.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f1e6cb80", - "metadata": {}, - "outputs": [], - "source": [ - "modal_X = all_MS.transform(modal_df)#.drop(columns=['time', 'status'])\n", - "modal_X.index = levels\n", - "modal_X" - ] - }, - { - "cell_type": "markdown", - "id": "d98548ee", - "metadata": {}, - "source": [ - "The estimated survival\n", - "function can be found using the `predict_survival_function()` method.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8753eaa2", - "metadata": {}, - "outputs": [], - "source": [ - "predicted_survival = fit_all.predict_survival_function(modal_X)\n", - "predicted_survival" - ] - }, - { - "cell_type": "markdown", - "id": "a814ddee", - "metadata": {}, - "source": [ - "This returns a data frame,\n", - "whose plot methods yields the different survival curves. To avoid clutter in\n", - "the plots, we do not display confidence intervals.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ddccc3ac", - "metadata": {}, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(1, 1, figsize=(10, 10))\n", - "predicted_survival.plot(ax=ax);" - ] - }, - { - "cell_type": "markdown", - "id": "53997de5", - "metadata": {}, - "source": [ - "### Publication Data" - ] - }, - { - "cell_type": "markdown", - "id": "47103217", - "metadata": {}, - "source": [ - "The `Publication` data can be\n", - "found in the `ISLP` package.\n", - "We begin by plotting the Kaplan-Meier curves\n", - "stratified on the `posres` variable, which records whether the\n", - "study had a positive or negative result.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "01217472", - "metadata": {}, - "outputs": [], - "source": [ - "Publication = load_data('Publication')\n", - "by_result = {}\n", - "for result, df in Publication.groupby('posres'):\n", - " by_result[result] = df\n", - " km_result = km.fit(df['time'], df['status'])\n", - " km_result.plot(label='Result=%d' % result)" - ] - }, - { - "cell_type": "markdown", - "id": "7929b6e9", - "metadata": {}, - "source": [ - "As discussed previously, the $p$-values from fitting Cox’s\n", - "proportional hazards model to the `posres` variable are quite\n", - "large, providing no evidence of a difference in time-to-publication\n", - "between studies with positive versus negative results.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2e966bd6", - "metadata": {}, - "outputs": [], - "source": [ - "posres_df = MS(['posres',\n", - " 'time',\n", - " 'status'], intercept=False).fit_transform(Publication)\n", - "posres_fit = coxph().fit(posres_df,\n", - " 'time',\n", - " 'status')\n", - "posres_fit.summary[['coef', 'se(coef)', 'p']]" - ] - }, - { - "cell_type": "markdown", - "id": "1020fe19", - "metadata": {}, - "source": [ - "As expected, the log-rank test provides an identical conclusion.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "42572496", - "metadata": {}, - "outputs": [], - "source": [ - "logrank_test(by_result[0]['time'],\n", - " by_result[1]['time'],\n", - " by_result[0]['status'],\n", - " by_result[1]['status'])" - ] - }, - { - "cell_type": "markdown", - "id": "c06a3106", - "metadata": {}, - "source": [ - "However, the results change dramatically when we include other\n", - "predictors in the model. Here we have excluded the funding mechanism\n", - "variable.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7c22a546", - "metadata": {}, - "outputs": [], - "source": [ - "model = MS(Publication.columns.drop('mech'),\n", - " intercept=False)\n", - "coxph().fit(model.fit_transform(Publication),\n", - " 'time',\n", - " 'status').summary[['coef', 'se(coef)', 'p']]" - ] - }, - { - "cell_type": "markdown", - "id": "2efd14bf", - "metadata": {}, - "source": [ - "We see that there are a number of statistically significant variables,\n", - "including whether the trial focused on a clinical endpoint, the impact\n", - "of the study, and whether the study had positive or negative results.\n", - "\n", - "### Call Center Data" - ] - }, - { - "cell_type": "markdown", - "id": "9dbf298f", - "metadata": {}, - "source": [ - "In this section, we will simulate survival data using the relationship\n", - "between cumulative hazard and\n", - "the survival function explored in an exercise.\n", - "Our simulated data will represent the observed\n", - "wait times (in seconds) for 2,000 customers who have phoned a call\n", - "center. In this context, censoring occurs if a customer hangs up\n", - "before his or her call is answered.\n", - "\n", - "There are three covariates: `Operators` (the number of call\n", - "center operators available at the time of the call, which can range\n", - "from $5$ to $15$), `Center` (either A, B, or C), and\n", - "`Time` of day (Morning, Afternoon, or Evening). We generate data\n", - "for these covariates so that all possibilities are equally likely: for\n", - "instance, morning, afternoon and evening calls are equally likely, and\n", - "any number of operators from $5$ to $15$ is equally likely. We use the\n", - "`ModelSpec` function from the `ISLP.models` package introduced\n", - "in Chapter 3.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "72e4c0c1", - "metadata": {}, - "outputs": [], - "source": [ - "np.random.seed(10)\n", - "N = 2000\n", - "Operators = np.random.choice(np.arange(5, 16),\n", - " N,\n", - " replace=True)\n", - "Center = np.random.choice(['A', 'B', 'C'], \n", - " N,\n", - " replace=True)\n", - "Time = np.random.choice(['Morn.', 'After.', 'Even.'], \n", - " N,\n", - " replace=True)\n", - "D = pd.DataFrame({'Operators': Operators,\n", - " 'Center': pd.Categorical(Center),\n", - " 'Time': pd.Categorical(Time)}) \n", - "model = MS(['Operators',\n", - " 'Center',\n", - " 'Time'],\n", - " intercept=False)\n", - "X = model.fit_transform(D)" - ] - }, - { - "cell_type": "markdown", - "id": "a3f7c728", - "metadata": {}, - "source": [ - "It is worthwhile to take a peek at the design matrix `X`, so\n", - "that we can be sure that we understand how the variables have been coded. By default,\n", - "the levels of categorical variables are sorted and the first column of the usual one-hot encoding\n", - "of the variable is dropped.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4b8f4b9f", - "metadata": {}, - "outputs": [], - "source": [ - "X[:5]" - ] - }, - { - "cell_type": "markdown", - "id": "06cd05ca", - "metadata": {}, - "source": [ - "Next, we specify the coefficients and the hazard function.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "982ae944", - "metadata": {}, - "outputs": [], - "source": [ - "true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2])\n", - "true_linpred = X.dot(true_beta)\n", - "hazard = lambda t: 1e-5 * t" - ] - }, - { - "cell_type": "markdown", - "id": "5054aa25", - "metadata": {}, - "source": [ - "We use the function\n", - "`sim_time` from the `ISLP.survival` package. This function\n", - "uses the relationship between the survival function\n", - "and cumulative hazard $S(t) = \\exp(-H(t))$ and the specific\n", - "form of the cumulative hazard function in the Cox model\n", - "to simulate data based on values of the linear predictor\n", - "`true_linpred` and the cumulative hazard. We can generate\n", - "if we know the inverse of the cumulative hazard function.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e8939727", - "metadata": {}, - "outputs": [], - "source": [ - "cum_hazard = lambda t: 1e-5 * t**2 / 2" - ] - }, - { - "cell_type": "markdown", - "id": "c0ed581d", - "metadata": {}, - "source": [ - "Here, we have set the coefficient associated with `Operators` to\n", - "equal $0.04$; in other words, each additional operator leads to a\n", - "$e^{0.04}=1.041$-fold increase in the “risk” that the call will be\n", - "answered, given the `Center` and `Time` covariates. This\n", - "makes sense: the greater the number of operators at hand, the shorter\n", - "the wait time! The coefficient associated with `Center == B` is\n", - "$-0.3$, and `Center == A` is treated as the baseline. This means\n", - "that the risk of a call being answered at Center B is 0.74 times the\n", - "risk that it will be answered at Center A; in other words, the wait\n", - "times are a bit longer at Center B.\n", - "\n", - "We are now ready to generate data under the Cox proportional hazards\n", - "model. We’ll truncate the maximum time to 1000 seconds to keep\n", - "simulated wait times reasonable.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a850809e", - "metadata": {}, - "outputs": [], - "source": [ - "W = np.array([sim_time(l, cum_hazard)\n", - " for l in true_linpred])\n", - "D['Wait time'] = np.clip(W, 0, 1000)" - ] - }, - { - "cell_type": "markdown", - "id": "3f481113", - "metadata": {}, - "source": [ - "We now simulate our censoring variable, for which we assume\n", - "90% of calls were answered (`Failed==1`) before the\n", - "customer hungup (`Failed==0`).\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c93cd8c3", - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [], - "source": [ - "D['Failed'] = np.random.choice([1, 0],\n", - " N,\n", - " p=[0.9, 0.1])\n", - "D[:5]" - ] - }, - { - "cell_type": "markdown", - "id": "aa55f285", - "metadata": {}, - "source": [ - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bac8ff69", - "metadata": {}, - "outputs": [], - "source": [ - "D['Failed'].mean()" - ] - }, - { - "cell_type": "markdown", - "id": "75699d8c", - "metadata": {}, - "source": [ - "We now plot Kaplan-Meier survival curves. First, we stratify by `Center`\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cafb21e9", - "metadata": {}, - "outputs": [], - "source": [ - "by_center = {}\n", - "for center, df in D.groupby('Center'):\n", - " by_center[center] = df\n", - " km_center = km.fit(df['Wait time'], df['Failed'])\n", - " km_center.plot(label='Center=%s' % center)\n", - "ax = plt.gca()\n", - "ax.set_title(\"Probability of Still Being on Hold\")" - ] - }, - { - "cell_type": "markdown", - "id": "6c477316", - "metadata": {}, - "source": [ - "Next, we stratify by `Time`.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d404c129", - "metadata": {}, - "outputs": [], - "source": [ - "by_time = {}\n", - "for time, df in D.groupby('Time'):\n", - " by_time[time] = df\n", - " km_time = km.fit(df['Wait time'], df['Failed'])\n", - " km_time.plot(label='Time=%s' % time)\n", - "ax = plt.gca()\n", - "ax.set_title(\"Probability of Still Being on Hold\")" - ] - }, - { - "cell_type": "markdown", - "id": "dcca1dbd", - "metadata": {}, - "source": [ - "It seems that calls at Call Center B take longer to be answered than\n", - "calls at Centers A and C. Similarly, it appears that wait times are\n", - "longest in the morning and shortest in the evening hours. We can use a\n", - "log-rank test to determine whether these differences are statistically\n", - "significant using `multivariate_logrank_test`.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b928758d", - "metadata": {}, - "outputs": [], - "source": [ - "multivariate_logrank_test(D['Wait time'],\n", - " D['Center'],\n", - " D['Failed'])" - ] - }, - { - "cell_type": "markdown", - "id": "285f5644", - "metadata": {}, - "source": [ - "Next, we consider the effect of `Time`.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "976558e5", - "metadata": {}, - "outputs": [], - "source": [ - "multivariate_logrank_test(D['Wait time'],\n", - " D['Time'],\n", - " D['Failed'])" - ] - }, - { - "cell_type": "markdown", - "id": "e56b7df7", - "metadata": {}, - "source": [ - "As in the case of a categorical variable with 2 outcomes, these\n", - "results are similar to the likelihood ratio test\n", - "from the Cox proportional hazards model. First, let’s\n", - "look at the results for `Center`.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36f7fa13", - "metadata": {}, - "outputs": [], - "source": [ - "X = MS(['Wait time',\n", - " 'Failed',\n", - " 'Center'],\n", - " intercept=False).fit_transform(D)\n", - "F = coxph().fit(X, 'Wait time', 'Failed')\n", - "F.log_likelihood_ratio_test()" - ] - }, - { - "cell_type": "markdown", - "id": "68ab32c8", - "metadata": {}, - "source": [ - "Next, the results for `Time`.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e03b31e6", - "metadata": {}, - "outputs": [], - "source": [ - "X = MS(['Wait time',\n", - " 'Failed',\n", - " 'Time'],\n", - " intercept=False).fit_transform(D)\n", - "F = coxph().fit(X, 'Wait time', 'Failed')\n", - "F.log_likelihood_ratio_test()" - ] - }, - { - "cell_type": "markdown", - "id": "fda76031", - "metadata": {}, - "source": [ - "We find that differences between centers are highly significant, as\n", - "are differences between times of day.\n", - "\n", - "Finally, we fit a complete Cox’s proportional hazards model to the data.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e5022c12", - "metadata": {}, - "outputs": [], - "source": [ - "X = MS(D.columns,\n", - " intercept=False).fit_transform(D)\n", - "fit_queuing = coxph().fit(\n", - " X,\n", - " 'Wait time',\n", - " 'Failed')\n", - "fit_queuing.summary[['coef', 'se(coef)', 'p']]" - ] - }, - { - "cell_type": "markdown", - "id": "847ebaa7", - "metadata": {}, - "source": [ - "The $p$-values for Center B and evening time\n", - "are very small. It is also clear that the\n", - "hazard — that is, the instantaneous risk that a call will be\n", - "answered — increases with the number of operators. Since we\n", - "generated the data ourselves, we know that the true coefficients for\n", - "Center B, Center C, Operators\n", - "evening time, and morning time are $-0.3$,\n", - "$0$, $0.04$, and $0.2$, $-0.2$, respectively. The coefficient estimates\n", - "resulting from the Cox model are fairly accurate." - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "all", - "formats": "ipynb,Rmd", - "notebook_metadata_filter": "all" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}