v2.1 notebooks excluding 10,13
This commit is contained in:
@@ -1,24 +1,11 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 2

# Lab: Introduction to Python

## Getting Started
@@ -74,21 +61,21 @@ inputs. For example, the
print('fit a model with', 11, 'variables')
```

The following command will provide information about the `print()` function.

```{python}
# print?
print?
```

Adding two integers in `Python` is pretty intuitive.

```{python}
3 + 5
```

In `Python`, textual data is handled using
*strings*. For instance, `"hello"` and
`'hello'`
@@ -99,7 +86,7 @@ We can concatenate them using the addition `+` symbol.
"hello" + " " + "world"
```

A string is actually a type of *sequence*: this is a generic term for an ordered list.
The three most important types of sequences are lists, tuples, and strings.
We introduce lists now.
@@ -115,7 +102,7 @@ x = [3, 4, 5]
x
```

Note that we used the brackets
`[]` to construct this list.
@@ -127,14 +114,14 @@ y = [4, 9, 7]
x + y
```

The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists
element-by-element?
In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*.
In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`.
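The concatenation behavior above is worth contrasting with `numpy` arrays, which do add element-by-element. A quick sketch, reusing the lists from this section:

```python
import numpy as np

x = [3, 4, 5]
y = [4, 9, 7]

# Plain Python lists concatenate under `+` ...
print(x + y)                      # [3, 4, 5, 4, 9, 7]

# ... while numpy arrays add elementwise.
print(np.array(x) + np.array(y))  # [ 7 13 12]
```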
This example reflects the fact that
`Python` is a general-purpose programming language. Much of `Python`'s data-specific
functionality comes from other packages, notably `numpy`
@@ -149,8 +136,8 @@ See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/n
As mentioned earlier, this book makes use of functionality that is contained in the `numpy`
*library*, or *package*. A package is a collection of modules that are not necessarily included in
the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*.

To access `numpy`, we must first `import` it.

```{python}
@@ -194,7 +181,7 @@ x

The object `x` has several
*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`
@@ -204,7 +191,7 @@ For instance, we can access the `ndim` attribute of `x` as follows.

```{python}
x.ndim
```

The output indicates that `x` is a two-dimensional array.
Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is
comprised of 64-bit integers:
@@ -228,7 +215,7 @@ documentation associated with the function `fun`, if it exists.
We can try this for `np.array()`.

```{python}
# np.array?
np.array?
```

This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`.
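As a small sketch of that `dtype` argument (the entries here are illustrative, not from the lab):

```python
import numpy as np

# Integer literals, but dtype=float forces a floating point array.
x = np.array([3, 4, 5], dtype=float)
print(x.dtype)   # float64
```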
@@ -246,7 +233,7 @@ at its `shape` attribute.
x.shape
```

A *method* is a function that is associated with an
object.
@@ -283,10 +270,10 @@ x_reshape = x.reshape((2, 3))
print('reshaped x:\n', x_reshape)
```

The previous output reveals that `numpy` arrays are specified as a sequence
of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*.
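Row-major filling can be seen directly; a minimal sketch (values illustrative), with `order='F'` used only to show the column-major alternative:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# Row-major (the numpy default, order='C'): fill one row at a time.
print(x.reshape((2, 3)))             # [[1 2 3], [4 5 6]]

# Column-major (order='F'): fill one column at a time.
print(x.reshape((2, 3), order='F'))  # [[1 3 5], [2 4 6]]
```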
`Python` (and hence `numpy`) uses 0-based
indexing. This means that to access the top left element of `x_reshape`,
@@ -316,13 +303,13 @@ print('x_reshape after we modify its top left element:\n', x_reshape)
print('x after we modify top left element of x_reshape:\n', x)
```

Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory.
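When that shared memory is unwanted, `.copy()` produces an independent array. A small sketch (values illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
view = x.reshape((2, 3))         # shares memory with x
copy = x.reshape((2, 3)).copy()  # owns its own memory

view[0, 0] = 99   # writes through to x
copy[0, 1] = -1   # does not touch x

print(x[0])   # 99
print(x[1])   # 2
```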
We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces
an *exception*, or error.
@@ -331,8 +318,8 @@ my_tuple = (3, 4, 5)
my_tuple[0] = 2
```

We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple.
The `ndim` attribute yields the number of dimensions, and `T` provides its transpose.
@@ -340,7 +327,7 @@ The `ndim` attribute yields the number of dimensions, and `T` provides its tran
x_reshape.shape, x_reshape.ndim, x_reshape.T
```

Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple.

We will often want to apply functions to arrays.
@@ -351,22 +338,22 @@ square root of the entries using the `np.sqrt()` function:
np.sqrt(x)
```

We can also square the elements:

```{python}
x**2
```

We can compute the square roots using the same notation, raising to the power of $1/2$ instead of $2$.

```{python}
x**0.5
```

Throughout this book, we will often want to generate random data.
The `np.random.normal()` function generates a vector of random
normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`.
@@ -383,7 +370,7 @@ x = np.random.normal(size=50)
x
```

We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`.

```{python}
@@ -395,7 +382,7 @@ correlation between `x` and `y`.
```{python}
np.corrcoef(x, y)
```

If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few
commands. In particular,
each
@@ -408,7 +395,7 @@ print(np.random.normal(scale=5, size=2))
```

In order to ensure that our code provides exactly the same results
each time it is run, we can set a *random seed*
using the
@@ -424,7 +411,7 @@ print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))
```

Throughout the labs in this book, we use `np.random.default_rng()` whenever we
perform calculations involving random quantities within `numpy`. In principle, this
should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible
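The reproducibility that `np.random.default_rng()` provides can be sketched directly: two generators built from the same seed produce identical draws.

```python
import numpy as np

rng = np.random.default_rng(1303)
rng2 = np.random.default_rng(1303)

# Same seed, same stream of values.
print(np.allclose(rng.normal(scale=5, size=2),
                  rng2.normal(scale=5, size=2)))   # True
```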
@@ -447,7 +434,7 @@ np.mean(y), y.mean()

```{python}
np.var(y), y.var(), np.mean((y - y.mean())**2)
```

Notice that by default `np.var()` divides by the sample size $n$ rather
than $n-1$; see the `ddof` argument in `np.var?`.
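A numerical sketch of the `ddof` argument (data illustrative):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])

# Default ddof=0: divide by n.
print(np.var(y))           # 1.25
# ddof=1: divide by n - 1, the usual sample variance.
print(np.var(y, ddof=1))   # 1.666...
```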
@@ -456,7 +443,7 @@ than $n-1$; see the `ddof` argument in `np.var?`.

```{python}
np.sqrt(np.var(y)), np.std(y)
```

The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix.
To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums.
@@ -470,14 +457,14 @@ Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its
```{python}
X.mean(axis=0)
```

The following yields the same result.

```{python}
X.mean(0)
```
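How the `axis` argument picks the dimension to collapse can be seen with a tiny matrix (values illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# axis=0 collapses the rows: one mean per column.
print(X.mean(axis=0))   # [2. 3.]
# axis=1 collapses the columns: one mean per row.
print(X.mean(axis=1))   # [1.5 3.5]
```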
## Graphics

In `Python`, common practice is to use the library
@@ -543,7 +530,7 @@ As an alternative, we could use the `ax.scatter()` function to create a scatter
fig, ax = subplots(figsize=(8, 8))
ax.scatter(x, y, marker='o');
```

Notice that in the code blocks above, we have ended
the last line with a semicolon. This prevents `ax.plot(x, y)` from printing
text to the notebook. However, it does not prevent a plot from being produced.
@@ -584,7 +571,7 @@ fig.set_size_inches(12,3)
fig
```

Occasionally we will want to create several plots within a figure. This can be
achieved by passing additional arguments to `subplots()`.
@@ -613,8 +600,8 @@ Type `subplots?` to learn more about

To save the output of `fig`, we call its `savefig()`
method. The argument `dpi` is the dots per inch, used
to determine how large the figure will be in pixels.
@@ -624,7 +611,7 @@ fig.savefig("Figure.png", dpi=400)
fig.savefig("Figure.pdf", dpi=200);
```

We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it.
@@ -676,7 +663,7 @@ fig, ax = subplots(figsize=(8, 8))
ax.imshow(f);
```

## Sequences and Slice Notation
@@ -690,8 +677,8 @@ seq1 = np.linspace(0, 10, 11)
seq1
```

The function `np.arange()`
returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence
that starts at $0$ and ends at $10$.
@@ -701,7 +688,7 @@ seq2 = np.arange(0, 10)
seq2
```

Why isn't $10$ output above? This has to do with *slice* notation in `Python`.
Slice notation
is used to index sequences such as lists, tuples and arrays.
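The half-open convention behind this can be sketched quickly: the stop value is always excluded, for `np.arange()` and slices alike.

```python
import numpy as np

# arange(0, 10) stops just before 10.
print(np.arange(0, 10)[-1])   # 9

# Slices of any sequence follow the same convention.
print("hello world"[0:5])     # hello
print([3, 4, 5][1:])          # [4, 5]
```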
@@ -743,7 +730,7 @@ See the documentation `slice?` for useful options in creating slices.

## Indexing Data

To begin, we create a two-dimensional `numpy` array.
@@ -753,7 +740,7 @@ A = np.array(np.arange(16)).reshape((4, 4))
A
```

Typing `A[1,2]` retrieves the element corresponding to the second row and third
column. (As usual, `Python` indexes from $0.$)
@@ -761,7 +748,7 @@ column. (As usual, `Python` indexes from $0.$)
A[1,2]
```

The first number after the open-bracket symbol `[`
refers to the row, and the second number refers to the column.
@@ -773,7 +760,7 @@ The first number after the open-bracket symbol `[`
A[[1,3]]
```

To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets.
In this case we need to supply the first argument `:`
which selects all rows.
@@ -782,7 +769,7 @@ which selects all rows.
A[:,[0,2]]
```

Now, suppose that we want to select the submatrix made up of the second and fourth
rows as well as the first and third columns. This is where
indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:
@@ -791,21 +778,21 @@ indexing gets slightly tricky. It is natural to try to use lists to retrieve th
A[[1,3],[0,2]]
```

Oops --- what happened? We got a one-dimensional array of length two identical to

```{python}
np.array([A[1,0],A[3,2]])
```

Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns:

```{python}
A[[1,3],[0,2,3]]
```

We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.

One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns.
@@ -816,7 +803,7 @@ A[[1,3]][:,[0,2]]
```

There are more efficient ways of achieving the same result.
@@ -828,7 +815,7 @@ idx = np.ix_([1,3],[0,2,3])
A[idx]
```

Alternatively, we can subset matrices efficiently using slices.
@@ -842,7 +829,7 @@ A[1:4:2,0:3:2]
```

Why are we able to retrieve a submatrix directly using slices but not using lists?
It's because they are different `Python` types, and
are treated differently by `numpy`.
@@ -858,7 +845,7 @@ Slices can be used to extract objects from arbitrary sequences, such as strings,

### Boolean Indexing

In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively).
@@ -875,7 +862,7 @@ keep_rows[[1,3]] = True
keep_rows
```

Note that the elements of `keep_rows`, when viewed as integers, are the same as the
values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When
applied to two arrays, the `==` operation is applied elementwise.
@@ -884,7 +871,7 @@ applied to two arrays, the `==` operation is applied elementwise.
np.all(keep_rows == np.array([0,1,0,1]))
```

(Here, the function `np.all()` has checked whether
all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.)
@@ -896,14 +883,14 @@ The former retrieves the first, second, first, and second rows of `A`.
A[np.array([0,1,0,1])]
```

By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `True`.

```{python}
A[keep_rows]
```

This example shows that Booleans and integers are treated differently by `numpy`.
@@ -927,7 +914,7 @@ A[idx_mixed]
```

For more details on indexing in `numpy`, readers are referred
to the `numpy` tutorial mentioned earlier.
@@ -980,7 +967,7 @@ files. Before loading data into `Python`, it is a good idea to view it using
a text editor or other software, such as Microsoft Excel.

We now take a look at the column of `Auto` corresponding to the variable `horsepower`:
@@ -1001,7 +988,7 @@ We see the culprit is the value `?`, which is being used to encode missing value

To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`.
Now, each instance of `?` in the file is replaced with the
value `np.nan`, which means *not a number*:
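A self-contained sketch of that `na_values` usage, with a tiny in-memory CSV standing in for `Auto.data`:

```python
import io
import pandas as pd

# '?' marks a missing horsepower value, as in Auto.data.
csv = io.StringIO("mpg,horsepower\n18.0,130\n25.0,?\n")

df = pd.read_csv(csv, na_values=['?'])
print(df['horsepower'].isna().sum())   # 1 -- the '?' became np.nan
print(df['horsepower'].sum())          # 130.0 -- sum() skips NaN
```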
@@ -1013,8 +1000,8 @@ Auto = pd.read_csv('Auto.data',
Auto['horsepower'].sum()
```

The `Auto.shape` attribute tells us that the data has 397
observations, or rows, and nine variables, or columns.
@@ -1022,7 +1009,7 @@ observations, or rows, and nine variables, or columns.
Auto.shape
```

There are
various ways to deal with missing data.
In this case, since only five of the rows contain missing
@@ -1033,7 +1020,7 @@ Auto_new = Auto.dropna()
Auto_new.shape
```

### Basics of Selecting Rows and Columns
@@ -1044,7 +1031,7 @@ Auto = Auto_new # overwrite the previous value
Auto.columns
```

Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array.
Recall that the first argument to the `[]` method
@@ -1303,8 +1290,8 @@ The plot methods of a data frame return a familiar object:
an axes. We can use it to update the plot as we did previously:

```{python}
ax = Auto.plot.scatter('horsepower', 'mpg');
ax.set_title('Horsepower vs. MPG')
ax = Auto.plot.scatter('horsepower', 'mpg')
ax.set_title('Horsepower vs. MPG');
```

If we want to save
the figure that contains a given axes, we can find the relevant figure
@@ -1329,8 +1316,8 @@ Auto.plot.scatter('horsepower', 'mpg', ax=axes[1]);
```

Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`.

We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable.
However, since there is only a small number of possible values for this variable, we may wish to treat it as
qualitative. Below, we replace
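One way to carry out such a replacement is sketched below on a toy frame standing in for `Auto`; treat the exact recipe as an assumption, since the lab's own code follows in the diff:

```python
import pandas as pd

# Toy stand-in for the Auto data frame.
df = pd.DataFrame({'cylinders': [8, 4, 6, 4, 8]})
print(df.cylinders.dtype)   # quantitative (integer) dtype

# Recast as a categorical (qualitative) variable.
df.cylinders = pd.Series(df.cylinders, dtype='category')
print(df.cylinders.dtype)   # category
```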
@@ -1349,7 +1336,7 @@ fig, ax = subplots(figsize=(8, 8))
Auto.boxplot('mpg', by='cylinders', ax=ax);
```

The `hist()` method can be used to plot a *histogram*.

```{python}

File diff suppressed because one or more lines are too long
@@ -1,22 +1,7 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 3

# Lab: Linear Regression

## Importing packages
@@ -29,7 +14,7 @@ import pandas as pd
from matplotlib.pyplot import subplots
```

### New imports

Throughout this lab we will introduce new functions and libraries. However,
@@ -105,7 +90,7 @@ A.sum()
```

## Simple Linear Regression

In this section we will construct model
@@ -127,7 +112,7 @@ Boston = load_data("Boston")
Boston.columns
```

Type `Boston?` to find out more about these data.

We start by using the `sm.OLS()` function to fit a
@@ -142,7 +127,7 @@ X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
X[:4]
```

We extract the response, and fit the model.

```{python}
@@ -164,7 +149,7 @@ method, and returns such a summary.
summarize(results)
```

Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix `X`.

### Using Transformations: Fit and Transform
@@ -235,8 +220,8 @@ The fitted coefficients can also be retrieved as the
results.params
```

The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
prediction intervals for the prediction of `medv` for given values of `lstat`.
@@ -277,7 +262,7 @@ value of 25.05 for `medv` when `lstat` equals
10), but the latter are substantially wider.

Next we will plot `medv` and `lstat`
using `DataFrame.plot.scatter()`,
using `DataFrame.plot.scatter()`, \definelongblankMR{plot.scatter()}{plot.slashslashscatter()}
and wish to
add the regression line to the resulting plot.
@@ -399,14 +384,14 @@ Notice how we have compacted the first line into a succinct expression describin

The `Boston` data set contains 12 variables, and so it would be cumbersome
to have to type all of these in order to perform a regression using all of the predictors.
Instead, we can use the following short-hand:
Instead, we can use the following short-hand:\definelongblankMR{columns.drop()}{columns.slashslashdrop()}

```{python}
terms = Boston.columns.drop('medv')
terms
```

We can now fit the model with all the variables in `terms` using
the same model matrix builder.
@@ -417,7 +402,7 @@ results = model.fit()
summarize(results)
```

What if we would like to perform a regression using all of the variables but one? For
example, in the above regression output, `age` has a high $p$-value.
So we may wish to run a regression excluding this predictor.
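A hedged sketch of building that reduced term list (the toy `pd.Index` stands in for `Boston.columns`; the name `minus_age` is illustrative):

```python
import pandas as pd

# Toy stand-in for Boston.columns.
columns = pd.Index(['crim', 'age', 'lstat', 'medv'])

# Drop the response and the high p-value predictor in one call.
minus_age = columns.drop(['medv', 'age'])
print(list(minus_age))   # ['crim', 'lstat']
```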
@@ -492,7 +477,7 @@ model2 = sm.OLS(y, X)
summarize(model2.fit())
```

## Non-linear Transformations of the Predictors

The model matrix builder can include terms beyond
@@ -567,7 +552,7 @@ there is little discernible pattern in the residuals.
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
to `poly()`.

## Qualitative Predictors

Here we use the `Carseats` data, which is included in the

File diff suppressed because it is too large
@@ -1,16 +1,3 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 4
@@ -807,7 +794,7 @@ feature_std.std()
```

Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 103.
Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 200.
In this case it does not matter, as long as the variables are all on the same scale.
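The $1/n$ versus $1/(n-1)$ mismatch can be reproduced with plain `numpy` standing in for the lab's `scaler()` (data illustrative):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])

# Standardize with the 1/n (population) convention, as scaler() does.
Z = (X - X.mean()) / X.std()   # ndarray .std() defaults to ddof=0

print(Z.std())          # 1.0 -- matches the convention it was built with
print(Z.std(ddof=1))    # ~1.155 -- the 1/(n-1) convention: not quite 1
```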
Using the function `train_test_split()` we now split the observations into a test set,

File diff suppressed because it is too large
@@ -1,16 +1,3 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 5
@@ -518,13 +505,13 @@ slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
{Equation 3.8 on page 80}
{Equation 3.8 on page 82}
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formulas for the standard errors do not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
{in Figure 3.8 on page 106} that there is
{in Figure 3.8 on page 108} that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "3a3f2f85",
"id": "85ad9863",
"metadata": {},
"source": [
"\n",
@@ -12,7 +12,7 @@
},
{
"cell_type": "markdown",
"id": "bb22af17",
"id": "ac8b08af",
"metadata": {},
"source": [
"# Lab: Cross-Validation and the Bootstrap\n",
@@ -26,13 +26,13 @@
{
"cell_type": "code",
"execution_count": 1,
"id": "60fad148",
"id": "e7712cfe",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:37.622425Z",
"iopub.status.busy": "2023-08-07T00:18:37.621828Z",
"iopub.status.idle": "2023-08-07T00:18:38.459128Z",
"shell.execute_reply": "2023-08-07T00:18:38.458689Z"
"iopub.execute_input": "2023-08-21T02:29:01.252458Z",
"iopub.status.busy": "2023-08-21T02:29:01.251970Z",
"iopub.status.idle": "2023-08-21T02:29:02.044045Z",
"shell.execute_reply": "2023-08-21T02:29:02.043730Z"
},
"lines_to_next_cell": 2
},
@@ -49,7 +49,7 @@
},
{
"cell_type": "markdown",
"id": "78fcfe7a",
"id": "784a2ba3",
"metadata": {},
"source": [
"There are several new imports needed for this lab."
@@ -58,13 +58,13 @@
{
"cell_type": "code",
"execution_count": 2,
"id": "2478aeb4",
"id": "21c2ed4f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.461290Z",
"iopub.status.busy": "2023-08-07T00:18:38.461070Z",
"iopub.status.idle": "2023-08-07T00:18:38.463158Z",
"shell.execute_reply": "2023-08-07T00:18:38.462899Z"
"iopub.execute_input": "2023-08-21T02:29:02.045927Z",
"iopub.status.busy": "2023-08-21T02:29:02.045761Z",
"iopub.status.idle": "2023-08-21T02:29:02.047761Z",
"shell.execute_reply": "2023-08-21T02:29:02.047491Z"
},
"lines_to_next_cell": 2
},
@@ -81,7 +81,7 @@
},
{
"cell_type": "markdown",
"id": "713d30db",
"id": "9ac3acd5",
"metadata": {},
"source": [
"## The Validation Set Approach\n",
@@ -102,13 +102,13 @@
{
"cell_type": "code",
"execution_count": 3,
"id": "99c95faf",
"id": "8af59641",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.464725Z",
"iopub.status.busy": "2023-08-07T00:18:38.464616Z",
"iopub.status.idle": "2023-08-07T00:18:38.472566Z",
"shell.execute_reply": "2023-08-07T00:18:38.472315Z"
"iopub.execute_input": "2023-08-21T02:29:02.049239Z",
"iopub.status.busy": "2023-08-21T02:29:02.049145Z",
"iopub.status.idle": "2023-08-21T02:29:02.055524Z",
"shell.execute_reply": "2023-08-21T02:29:02.055162Z"
}
},
"outputs": [],
@@ -121,7 +121,7 @@
},
{
"cell_type": "markdown",
"id": "57be35df",
"id": "e76383f0",
"metadata": {},
"source": [
"Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`."
@@ -130,13 +130,13 @@
{
"cell_type": "code",
"execution_count": 4,
"id": "41b0717d",
"id": "d9b0b7c8",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.474061Z",
"iopub.status.busy": "2023-08-07T00:18:38.473957Z",
"iopub.status.idle": "2023-08-07T00:18:38.477686Z",
"shell.execute_reply": "2023-08-07T00:18:38.477432Z"
"iopub.execute_input": "2023-08-21T02:29:02.057278Z",
"iopub.status.busy": "2023-08-21T02:29:02.057182Z",
"iopub.status.idle": "2023-08-21T02:29:02.062537Z",
"shell.execute_reply": "2023-08-21T02:29:02.062265Z"
}
},
"outputs": [],
@@ -150,7 +150,7 @@
},
{
"cell_type": "markdown",
"id": "7f1bef95",
"id": "d196dd08",
"metadata": {},
"source": [
"We now use the `predict()` method of `results` evaluated on the model matrix for this model\n",
@@ -160,13 +160,13 @@
{
"cell_type": "code",
"execution_count": 5,
"id": "d7ea3c0d",
"id": "3e77d831",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.479141Z",
"iopub.status.busy": "2023-08-07T00:18:38.479053Z",
"iopub.status.idle": "2023-08-07T00:18:38.483270Z",
"shell.execute_reply": "2023-08-07T00:18:38.483037Z"
"iopub.execute_input": "2023-08-21T02:29:02.064056Z",
"iopub.status.busy": "2023-08-21T02:29:02.063966Z",
"iopub.status.idle": "2023-08-21T02:29:02.068279Z",
"shell.execute_reply": "2023-08-21T02:29:02.068024Z"
}
},
"outputs": [
@@ -190,7 +190,7 @@
},
{
"cell_type": "markdown",
"id": "6dba5d55",
"id": "f4369ee6",
"metadata": {},
"source": [
"Hence our estimate for the validation MSE of the linear regression\n",
@@ -204,13 +204,13 @@
{
"cell_type": "code",
"execution_count": 6,
"id": "a02a2d05",
"id": "0aa4bfcc",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.484782Z",
"iopub.status.busy": "2023-08-07T00:18:38.484689Z",
"iopub.status.idle": "2023-08-07T00:18:38.486891Z",
"shell.execute_reply": "2023-08-07T00:18:38.486642Z"
"iopub.execute_input": "2023-08-21T02:29:02.069789Z",
"iopub.status.busy": "2023-08-21T02:29:02.069682Z",
"iopub.status.idle": "2023-08-21T02:29:02.071953Z",
"shell.execute_reply": "2023-08-21T02:29:02.071703Z"
}
},
"outputs": [],
@@ -235,7 +235,7 @@
},
{
"cell_type": "markdown",
"id": "39ab59b1",
"id": "0271dc50",
"metadata": {},
"source": [
"Let’s use this function to estimate the validation MSE\n",
@@ -247,13 +247,13 @@
{
"cell_type": "code",
"execution_count": 7,
"id": "51d93dea",
"id": "a0dbd55f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.488297Z",
"iopub.status.busy": "2023-08-07T00:18:38.488205Z",
"iopub.status.idle": "2023-08-07T00:18:38.497955Z",
"shell.execute_reply": "2023-08-07T00:18:38.497708Z"
"iopub.execute_input": "2023-08-21T02:29:02.073322Z",
"iopub.status.busy": "2023-08-21T02:29:02.073229Z",
"iopub.status.idle": "2023-08-21T02:29:02.088464Z",
"shell.execute_reply": "2023-08-21T02:29:02.088192Z"
}
},
"outputs": [
@@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
"id": "936e168a",
"id": "a7401536",
"metadata": {},
"source": [
"These error rates are $23.62, 18.76$, and $18.80$, respectively. If we\n",
@@ -291,13 +291,13 @@
{
"cell_type": "code",
"execution_count": 8,
"id": "83432f06",
"id": "885136a4",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.499478Z",
"iopub.status.busy": "2023-08-07T00:18:38.499391Z",
"iopub.status.idle": "2023-08-07T00:18:38.509735Z",
"shell.execute_reply": "2023-08-07T00:18:38.509466Z"
"iopub.execute_input": "2023-08-21T02:29:02.089889Z",
"iopub.status.busy": "2023-08-21T02:29:02.089804Z",
"iopub.status.idle": "2023-08-21T02:29:02.105353Z",
"shell.execute_reply": "2023-08-21T02:29:02.105089Z"
}
},
"outputs": [
@@ -327,7 +327,7 @@
},
{
"cell_type": "markdown",
"id": "f5ceb357",
"id": "00785402",
"metadata": {},
"source": [
"Using this split of the observations into a training set and a validation set,\n",
@@ -341,7 +341,7 @@
},
{
"cell_type": "markdown",
"id": "6d624a5c",
"id": "21c071b8",
"metadata": {},
"source": [
"## Cross-Validation\n",
@@ -374,13 +374,13 @@
{
"cell_type": "code",
"execution_count": 9,
"id": "bcfc433f",
"id": "6d957d8c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-07T00:18:38.511210Z",
"iopub.status.busy": "2023-08-07T00:18:38.511122Z",
"iopub.status.idle": "2023-08-07T00:18:39.069624Z",
"shell.execute_reply": "2023-08-07T00:18:39.069325Z"
"iopub.execute_input": "2023-08-21T02:29:02.106979Z",
"iopub.status.busy": "2023-08-21T02:29:02.106884Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.184550Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.184259Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -410,7 +410,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5b0f6f30",
|
||||
"id": "c17e2bc8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The arguments to `cross_validate()` are as follows: an\n",
|
||||
@@ -426,7 +426,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b527f67f",
|
||||
"id": "5c7901f2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can repeat this procedure for increasingly complex polynomial fits.\n",
|
||||
@@ -442,13 +442,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "f951ffc8",
|
||||
"id": "e2b5ce95",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.071240Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.071138Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.674084Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.673774Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.186226Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.186108Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.782413Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.782122Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -480,7 +480,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "792f1304",
|
||||
"id": "03706248",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and\n",
|
||||
@@ -499,13 +499,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "e3610b5a",
|
||||
"id": "1dda1bd7",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.675725Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.675614Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.678046Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.677762Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.783997Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.783886Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.786132Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.785881Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -530,7 +530,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "983625b2",
|
||||
"id": "f5092f1b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the CV example above, we used $K=n$, but of course we can also use $K<n$. The code is very similar\n",
|
||||
@@ -541,13 +541,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "1627460d",
|
||||
"id": "fb25fa70",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.679517Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.679423Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.701200Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.700946Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.787622Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.787525Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.809671Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.809398Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -580,7 +580,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "32bf6662",
|
||||
"id": "c4ec6afb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Notice that the computation time is much shorter than that of LOOCV.\n",
|
||||
@@ -595,7 +595,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1e89127b",
|
||||
"id": "5edf407f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `cross_validate()` function is flexible and can take\n",
|
||||
@@ -606,13 +606,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "8a636468",
|
||||
"id": "d78795cd",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.702802Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.702718Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.708140Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.707865Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.811123Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.811046Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.817840Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.817582Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -641,7 +641,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2c0fb0d5",
|
||||
"id": "a081be63",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One can estimate the variability in the test error by running the following:"
|
||||
@@ -650,13 +650,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "746aeccd",
|
||||
"id": "0407ad56",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.709627Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.709548Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.729721Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.729428Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.819308Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.819228Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.851921Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.851658Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -684,7 +684,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3310fe80",
|
||||
"id": "b66db3cb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that this standard deviation is not a valid estimate of the\n",
|
||||
@@ -724,13 +724,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "daa53d0c",
|
||||
"id": "f04f15bd",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.731264Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.731179Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.734494Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.734221Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.853415Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.853334Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.857370Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.857115Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -745,7 +745,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fd439170",
|
||||
"id": "c88bd6a4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This function returns an estimate for $\\alpha$\n",
|
||||
@@ -758,13 +758,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "578c9564",
|
||||
"id": "f98c0323",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.736147Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.736062Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.738776Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.738545Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.858828Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.858753Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.861443Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.861198Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -785,7 +785,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cc18244c",
|
||||
"id": "58a78f00",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next we randomly select\n",
|
||||
@@ -797,13 +797,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "5754d6d5",
|
||||
"id": "bcd40175",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.740183Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.740108Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.743599Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.743267Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.862933Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.862830Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.865766Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.865514Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -829,7 +829,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0e97e132",
|
||||
"id": "e6058be4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This process can be generalized to create a simple function `boot_SE()` for\n",
|
||||
@@ -840,13 +840,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"id": "8320a49c",
|
||||
"id": "ab6602cd",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.745013Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.744924Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:39.747163Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:39.746928Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.867170Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.867072Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:03.869326Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:03.869094Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -872,7 +872,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a1d25cfe",
|
||||
"id": "d94d383e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Notice the use of `_` as a loop variable in `for _ in range(B)`. This is often used if the value of the counter is\n",
|
||||
@@ -885,13 +885,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"id": "e656aa1f",
|
||||
"id": "4a323513",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:39.748642Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:39.748543Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:40.034488Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:40.034215Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:03.870755Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:03.870664Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:04.157907Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:04.157623Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -916,7 +916,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "258ccf67",
|
||||
"id": "22343f53",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The final output shows that the bootstrap estimate for ${\\rm SE}(\\hat{\\alpha})$ is $0.0912$.\n",
|
||||
@@ -951,13 +951,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"id": "c5d14195",
|
||||
"id": "0220f3af",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:40.036061Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:40.035977Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:40.037907Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:40.037662Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:04.159500Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:04.159419Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:04.161332Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:04.161073Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -972,7 +972,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89a6fb3e",
|
||||
"id": "df0c7f05",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is not quite what is needed as the first argument to\n",
|
||||
@@ -986,13 +986,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"id": "7e0523f0",
|
||||
"id": "62037dcb",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:40.039299Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:40.039208Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:40.040837Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:40.040599Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:04.162950Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:04.162849Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:04.164486Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:04.164241Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -1003,7 +1003,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4d8f9f61",
|
||||
"id": "61fbe248",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Typing `hp_func?` will show that it has two arguments `D`\n",
|
||||
@@ -1019,13 +1019,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"id": "32836e93",
|
||||
"id": "b8bdb7a4",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:40.042164Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:40.042091Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:40.056730Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:40.056480Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:04.165879Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:04.165798Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:04.194029Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:04.193764Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -1060,7 +1060,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "aa8cae71",
|
||||
"id": "2a831036",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we use the `boot_SE()` {} function to compute the standard\n",
|
||||
@@ -1070,13 +1070,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"id": "14ce3afa",
|
||||
"id": "36808258",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:40.058168Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:40.058092Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:41.197103Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:41.196820Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:04.195612Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:04.195529Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:06.747175Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:06.746638Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -1104,7 +1104,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1d0db4c6",
|
||||
"id": "38c65fbf",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This indicates that the bootstrap estimate for ${\\rm SE}(\\hat{\\beta}_0)$ is\n",
|
||||
@@ -1120,13 +1120,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"id": "6b1213ac",
|
||||
"id": "c9aea297",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:41.198611Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:41.198528Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:41.257926Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:41.257642Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:06.749614Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:06.749433Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:06.812583Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:06.812298Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -1152,7 +1152,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2b158ef6",
|
||||
"id": "d870ad6b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The standard error estimates for $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$\n",
|
||||
@@ -1164,13 +1164,13 @@
|
||||
"obtained using the bootstrap. Does this indicate a problem with the\n",
|
||||
"bootstrap? In fact, it suggests the opposite. Recall that the\n",
|
||||
"standard formulas given in\n",
|
||||
" {Equation 3.8 on page 80}\n",
|
||||
" {Equation 3.8 on page 82}\n",
|
||||
"rely on certain assumptions. For example,\n",
|
||||
"they depend on the unknown parameter $\\sigma^2$, the noise\n",
|
||||
"variance. We then estimate $\\sigma^2$ using the RSS. Now although the\n",
|
||||
"formula for the standard errors do not rely on the linear model being\n",
|
||||
"correct, the estimate for $\\sigma^2$ does. We see\n",
|
||||
" {in Figure 3.8 on page 106} that there is\n",
|
||||
" {in Figure 3.8 on page 108} that there is\n",
|
||||
"a non-linear relationship in the data, and so the residuals from a\n",
|
||||
"linear fit will be inflated, and so will $\\hat{\\sigma}^2$. Secondly,\n",
|
||||
"the standard formulas assume (somewhat unrealistically) that the $x_i$\n",
|
||||
@@ -1192,13 +1192,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"id": "af99b778",
|
||||
"id": "79c56529",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:41.259623Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:41.259482Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:43.037184Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:43.036911Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:06.814267Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:06.814125Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:10.162177Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:10.161855Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1226,7 +1226,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1badcfd1",
|
||||
"id": "9fccbbbd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We compare the results to the standard errors computed using `sm.OLS()`."
|
||||
@@ -1235,13 +1235,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"id": "0206281e",
|
||||
"id": "4d0b4edc",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:18:43.038778Z",
|
||||
"iopub.status.busy": "2023-08-07T00:18:43.038680Z",
|
||||
"iopub.status.idle": "2023-08-07T00:18:43.046810Z",
|
||||
"shell.execute_reply": "2023-08-07T00:18:43.046545Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:10.163852Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:10.163742Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:10.173834Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:10.173578Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -1268,7 +1268,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0c11a71f",
|
||||
"id": "9a86ff6e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
@@ -1279,8 +1279,8 @@
|
||||
"metadata": {
|
||||
"jupytext": {
|
||||
"cell_metadata_filter": "-all",
|
||||
"formats": "ipynb,Rmd",
|
||||
"main_language": "python"
|
||||
"main_language": "python",
|
||||
"notebook_metadata_filter": "-all"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -1,16 +1,3 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---


# Chapter 6

@@ -45,7 +32,7 @@ from ISLP.models import \
(Stepwise,
sklearn_selected,
sklearn_selection_path)
# !pip install l0bnb
!pip install l0bnb
from l0bnb import fit_path

```
@@ -74,7 +61,7 @@ Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()

```

We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`).
@@ -84,8 +71,8 @@ Hitters = Hitters.dropna();
Hitters.shape

```

We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence
@@ -119,7 +106,7 @@ neg_Cp = partial(nCp, sigma2)
```
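Since the definition of the scorer itself falls outside this excerpt, here is a minimal sketch of a negative-$C_p$ scorer matching the `partial(nCp, sigma2)` usage above (the exact signature in the lab may differ):

```{python}
import numpy as np

def nCp(sigma2, estimator, X, Y):
    """Negative Cp statistic: -(RSS + 2*d*sigma2)/n.

    Negated because sklearn maximizes scores, while Cp is minimized.
    """
    n, d = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat)**2)
    return -(RSS + 2 * d * sigma2) / n
```

Freezing `sigma2` with `functools.partial` leaves a callable with the `(estimator, X, Y)` signature that `sklearn` expects of a scorer.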

We can now use `neg_Cp()` as a scorer for model selection.

Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
@@ -133,7 +120,7 @@ strategy = Stepwise.first_peak(design,
max_terms=len(design.terms))

```

We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its
@@ -147,7 +134,7 @@ hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_

```

Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.

```{python}
@@ -158,7 +145,7 @@ hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_

```

### Choosing Among Models Using the Validation Set Approach and Cross-Validation

As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
@@ -180,7 +167,7 @@ strategy = Stepwise.fixed_steps(design,
full_path = sklearn_selection_path(OLS, strategy)

```

We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.

```{python}
@@ -189,8 +176,8 @@ Yhat_in = full_path.predict(Hitters)
Yhat_in.shape

```

This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves each step we take,
indicating we must use either the validation or cross-validation
@@ -279,7 +266,7 @@ ax.legend()
mse_fig

```
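The per-step in-sample MSE discussed above is a one-liner once the `(n, 20)` array of fitted values is available; a self-contained sketch, with synthetic stand-ins for the actual `Yhat_in` and `Y`:

```{python}
import numpy as np

rng = np.random.default_rng(0)
Y_demo = rng.normal(size=100)            # stand-in for the response vector
Yhat_demo = rng.normal(size=(100, 20))   # stand-in for the (n, 20) fitted-value array

# one MSE per step of the forward-selection path
insample_mse = ((Yhat_demo - Y_demo[:, None])**2).mean(0)
```

Broadcasting `Y_demo[:, None]` against the 20 columns gives one squared-error column per step, and averaging over axis 0 yields the 20 in-sample MSEs.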

To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and training. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation. `skm.ShuffleSplit()`
@@ -309,7 +296,7 @@ ax.legend()
mse_fig

```
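The splitter described above can be set up as follows (a sketch; `skm` is `sklearn.model_selection`, as imported earlier in the labs):

```{python}
import sklearn.model_selection as skm

# one random 80/20 train/test split, passed as the `cv` argument
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.2,
                              random_state=0)
```

With `n_splits=1` this behaves like a classic validation-set split rather than repeated random splitting.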

### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
@@ -337,7 +324,7 @@ path = fit_path(X,
max_nonzeros=X.shape[1])

```

The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.

```{python}
@@ -405,7 +392,7 @@ soln_path.index.name = 'negative log(lambda)'
soln_path

```

We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`.
@@ -429,14 +416,14 @@ beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat

```

Let’s compute the $\ell_2$ norm of the standardized coefficients.

```{python}
np.linalg.norm(beta_hat)

```

In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$.
@@ -490,7 +477,7 @@ results = skm.cross_validate(ridge,
-results['test_score']

```

The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the
@@ -527,7 +514,7 @@ grid.best_params_['ridge__alpha']
grid.best_estimator_

```

Alternatively, we can use 5-fold cross-validation.

```{python}
@@ -540,7 +527,7 @@ grid.best_params_['ridge__alpha']
grid.best_estimator_

```
Recall we set up the `kfold` object for 5-fold cross-validation on page 296. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
Recall we set up the `kfold` object for 5-fold cross-validation on page 298. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
to right.
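A `kfold` object of the kind referred to here is typically constructed as below (a sketch; the seed is illustrative):

```{python}
import sklearn.model_selection as skm

# 5 shuffled folds; fixing random_state makes the folds reproducible
kfold = skm.KFold(n_splits=5,
                  shuffle=True,
                  random_state=0)
```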

```{python}
@@ -553,7 +540,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);

```

One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$.
Let’s compare $R^2$ to MSE for cross-validation here.
@@ -565,7 +552,7 @@ grid_r2 = skm.GridSearchCV(pipe,
grid_r2.fit(X, Y)

```

Finally, let’s plot the results for cross-validated $R^2$.

```{python}
@@ -577,7 +564,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);

```

### Fast Cross-Validation for Solution Paths
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
@@ -597,7 +584,7 @@ pipeCV = Pipeline(steps=[('scaler', scaler),
pipeCV.fit(X, Y)

```

Let’s produce a plot again of the cross-validation error to see that
it is similar to using `skm.GridSearchCV`.

@@ -613,7 +600,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);

```

We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE
@@ -623,7 +610,7 @@ associated with this value of $\lambda$?
np.min(tuned_ridge.mse_path_.mean(1))

```

This represents a further improvement over the test MSE that we got
using $\lambda=4$. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set
@@ -679,7 +666,7 @@ results = skm.cross_validate(pipeCV,

```

### The Lasso
We saw that ridge regression with a wise choice of $\lambda$ can
@@ -728,13 +715,13 @@ ax.set_ylabel('Standardized coefficients', fontsize=20);
```
The smallest cross-validated error is lower than the test set MSE of the null model
and of least squares, and very similar to the test MSE of 115526.71 of ridge
regression (page 303) with $\lambda$ chosen by cross-validation.
regression (page 305) with $\lambda$ chosen by cross-validation.

```{python}
np.min(tuned_lasso.mse_path_.mean(1))

```

Let’s again produce a plot of the cross-validation error.

@@ -759,7 +746,7 @@ variables.
tuned_lasso.coef_

```

As in ridge regression, we could evaluate the test error
of cross-validated lasso by first splitting into
test and training sets and internally running
@@ -770,7 +757,7 @@ this as an exercise.
## PCR and PLS Regression

### Principal Components Regression

Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition`
@@ -791,7 +778,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_

```
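The `pipe` object used here follows the usual `sklearn` Pipeline pattern; a self-contained sketch, with synthetic data standing in for the `Hitters` model matrix and response (the two-component choice is illustrative):

```{python}
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# PCR: project the predictors onto principal components, then regress on them
pipe = Pipeline(steps=[('pca', PCA(n_components=2)),
                       ('linreg', LinearRegression())])

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))   # stand-in for the model matrix
Y_demo = rng.normal(size=50)        # stand-in for the response
pipe.fit(X_demo, Y_demo)
```

The fitted regression then has one coefficient per component, accessible via `pipe.named_steps['linreg'].coef_`, as in the lab code above.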

When performing PCA, the results vary depending
on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished
@@ -805,7 +792,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_

```

We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this
case fixing the parameters to vary the
@@ -820,7 +807,7 @@ grid = skm.GridSearchCV(pipe,
grid.fit(X, Y)

```

Let’s plot the results as we have for other methods.

```{python}
@@ -835,7 +822,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);

```

We see that the smallest cross-validation error occurs when
17
components are used. However, from the plot we also see that the
@@ -859,8 +846,8 @@ cv_null = skm.cross_validate(linreg,
-cv_null['test_score'].mean()

```

The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using
different numbers of components. This concept is discussed in greater
@@ -870,7 +857,7 @@ detail in Section 12.2.
pipe.named_steps['pca'].explained_variance_ratio_

```
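The running total of these ratios gives the fraction of predictor variance captured by the first $M$ components; a small self-contained illustration on synthetic data (not the `Hitters` pipeline):

```{python}
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
pca = PCA().fit(X_demo)

# cumulative fraction of variance explained by the first M components
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

With all components retained, the cumulative ratios increase monotonically to 1.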
|
||||
|
||||
|
||||
Briefly, we can think of
|
||||
this as the amount of information about the predictors
|
||||
that is captured using $M$ principal components. For example, setting
|
||||
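The behavior of `explained_variance_ratio_` can be checked on toy data. This is an illustrative sketch under assumed synthetic inputs, not the Hitters fit: the ratios are ordered from largest to smallest and, when every component is retained, sum to one.

```python
# explained_variance_ratio_: decreasing, and sums to 1 over all components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=200)  # make two columns correlated

ratios = PCA().fit(X).explained_variance_ratio_   # all 4 components kept
print(ratios.sum())   # ~1.0
# ratios[0] >= ratios[1] >= ... : components are ordered by variance captured
```

The cumulative sum of these ratios is what one inspects when deciding how many components capture "enough" of the predictors.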
@@ -893,7 +880,7 @@ pls = PLSRegression(n_components=2,
pls.fit(X, Y)

```

As was the case in PCR, we will want to
use CV to choose the number of components.

@@ -906,7 +893,7 @@ grid = skm.GridSearchCV(pls,
grid.fit(X, Y)

```

As for our other methods, we plot the MSE.

```{python}
@@ -921,7 +908,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);

```

CV error is minimized at 12,
though there is little noticeable difference between this point and a much lower number like 2 or 3 components.

File diff suppressed because it is too large
@@ -1,16 +1,3 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---

# Chapter 7

@@ -30,7 +17,7 @@ from ISLP.models import (summarize,
ModelSpec as MS)
from statsmodels.stats.anova import anova_lm
```

We again collect the new imports
needed for this lab. Many of these are developed specifically for the
`ISLP` package.
@@ -51,7 +38,7 @@ from ISLP.pygam import (approx_lam,
anova as anova_gam)

```

## Polynomial Regression and Step Functions
We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.
@@ -62,7 +49,7 @@ y = Wage['wage']
age = Wage['age']

```

Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above.
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
@@ -74,8 +61,8 @@ M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)

```

This polynomial is constructed using the function `poly()`,
which creates
a special *transformer* `Poly()` (using `sklearn` terminology
@@ -83,7 +70,7 @@ for feature transformations such as `PCA()` seen in Section 6.5.3) which
allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
the
discussion of transformations on
page 127.
page 129.

In the code above, the first line executes the `fit()` method
using the dataframe
@@ -96,7 +83,7 @@ on the second line, as well as in the plotting function developed below.

We now create a grid of values for `age` at which we want
predictions.

@@ -164,7 +151,7 @@ plot_wage_fit(age_df,

With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
@@ -195,7 +182,7 @@ anova_lm(*[sm.OLS(y, X_).fit()
for X_ in Xs])

```

Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be
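The `*` unpacking used in that call can be shown in isolation. A minimal sketch, with an assumed toy function standing in for `anova_lm()`:

```python
# `*` spreads a list into separate positional arguments.
def f(*models):
    # f receives however many positional arguments were passed
    return len(models)

fits = ['model1', 'model2', 'model3']   # stand-ins for fitted models
print(f(*fits))   # 3: three separate arguments
print(f(fits))    # 1: the list itself is one argument
```

Without the `*`, `anova_lm()` would see a single list rather than three fitted models.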
@@ -220,8 +207,8 @@ that `poly()` creates orthogonal polynomials.
summarize(M)

```

Notice that the p-values are the same, and in fact the squares of
the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example:
@@ -230,8 +217,8 @@ the t-statistics are equal to the F-statistics from the
(-11.983)**2

```

However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three
@@ -246,8 +233,8 @@ XEs = [model.fit_transform(Wage)
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])

```

As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter 5.

@@ -267,8 +254,8 @@ B = glm.fit()
summarize(B)

```

Once again, we make predictions using the `get_prediction()` method.

```{python}
@@ -277,7 +264,7 @@ preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)

```

We now plot the estimated relationship.

```{python}
@@ -319,8 +306,8 @@ cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())

```

Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
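The contrast between quantile-based and value-based cuts can be sketched on a toy series. The ages below are illustrative, not the Wage data:

```python
# pd.qcut: bins with roughly equal counts; pd.cut: bins at chosen cutpoints.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 51, 57, 63])
by_quantile = pd.qcut(ages, 4)              # 4 bins, ~2 observations each
by_value = pd.cut(ages, [20, 40, 60, 80])   # bins at fixed cutpoints

print(sorted(by_quantile.value_counts().tolist()))  # [2, 2, 2, 2]
print(pd.get_dummies(by_quantile).shape)            # (8, 4): one column per bin
```

Feeding the dummy matrix to `sm.OLS` is exactly the step-function regression shown above.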
@@ -377,7 +364,7 @@ M = sm.OLS(y, Xbs).fit()
summarize(M)

```

Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
@@ -435,7 +422,7 @@ deciding bin membership.

In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results.
@@ -453,7 +440,7 @@ plot_wage_fit(age_df,
'Natural spline, df=5');

```

## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the
@@ -472,7 +459,7 @@ gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)

```

The `pygam` library generally expects a matrix of features, so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to infer the
size of that dimension based on the remaining entries of the shape tuple.
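That reshaping step is easy to demonstrate on its own; the vector below is illustrative:

```python
# -1 lets numpy infer one dimension from the array's total size:
# an n-vector becomes an (n, 1) feature matrix.
import numpy as np

age_vec = np.array([18, 25, 33, 47, 52, 60])
X_age2 = age_vec.reshape(-1, 1)

print(age_vec.shape)  # (6,)
print(X_age2.shape)   # (6, 1)
```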
@@ -495,7 +482,7 @@ ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='$\lambda$');

```

The `pygam` package can perform a search for an optimal smoothing parameter.

```{python}
@@ -508,7 +495,7 @@ ax.legend()
fig

```

Alternatively, we can fix the degrees of freedom of the smoothing
spline using a function included in the `ISLP.pygam` package. Below we
find a value of $\lambda$ that gives us roughly four degrees of
@@ -523,8 +510,8 @@ age_term.lam = lam_4
degrees_of_freedom(X_age, age_term)

```

Let’s vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
as the desired degrees of freedom plus one to account for the fact that these smoothing
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
@@ -636,7 +623,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);

```

We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (7.16) are fit simultaneously, taking each other
@@ -728,7 +715,7 @@ gam_linear = LinearGAM(age_term +
gam_linear.fit(Xgam, y)

```

Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom.

@@ -748,7 +735,6 @@ ANOVA, $\mathcal{M}_2$ is preferred.

We can repeat the same process for `age` as well. We see there is very clear evidence that
a non-linear term is required for `age`.
\newpage

```{python}
gam_0 = LinearGAM(year_term +
@@ -776,7 +762,7 @@ We can make predictions from `gam` objects, just like from
Yhat = gam_full.predict(Xgam)

```

In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`.

@@ -787,7 +773,7 @@ gam_logit = LogisticGAM(age_term +
gam_logit.fit(Xgam, high_earn)

```

```{python}
fig, ax = subplots(figsize=(8, 8))
@@ -839,8 +825,8 @@ gam_logit_ = LogisticGAM(age_term +
gam_logit_.fit(Xgam_, high_earn_)

```

Let’s look at the effect of `education`, `year` and `age` on high earner status now that we’ve
removed those observations.

@@ -873,7 +859,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);

```

## Local Regression
We illustrate the use of local regression using the `lowess()`
File diff suppressed because it is too large
@@ -1,16 +1,3 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---

# Chapter 8

@@ -46,10 +33,10 @@ from sklearn.ensemble import \
from ISLP.bart import BART

```

## Fitting Classification Trees

We first use classification trees to analyze the `Carseats` data set.
In these data, `Sales` is a continuous variable, and so we begin
@@ -65,7 +52,7 @@ High = np.where(Carseats.Sales > 8,
"No")

```

We now use `DecisionTreeClassifier()` to fit a classification tree in
order to predict `High` using all variables but `Sales`.
To do so, we must form a model matrix as we did when fitting regression
@@ -93,8 +80,8 @@ clf = DTC(criterion='entropy',
clf.fit(X, High)

```

In our discussion of qualitative features in Section 3.3,
we noted that for a linear regression model such a feature could be
represented by including a matrix of dummy variables (one-hot-encoding) in the model
@@ -110,8 +97,8 @@ advantage of this approach; instead it simply treats the one-hot-encoded levels
accuracy_score(High, clf.predict(X))

```

With only the default arguments, the training error rate is 21%.
For classification trees, we can
@@ -129,7 +116,7 @@ resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
resid_dev

```

This is closely related to the *entropy*, defined in (8.7).
A small deviance indicates a
tree that provides a good fit to the (training) data.
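The deviance-style quantity can be sketched without a fitted tree. A minimal illustration with assumed labels and probabilities: `log_loss` with `normalize=False` sums $-\log \hat{p}$ of the observed class over the observations.

```python
# Total log loss equals the sum of -log(probability of the observed class).
import numpy as np
from sklearn.metrics import log_loss

y = ['No', 'No', 'Yes', 'Yes']
proba = np.array([[0.9, 0.1],   # columns in alphabetical label order: No, Yes
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.4, 0.6]])

total = log_loss(y, proba, normalize=False)
manual = -np.log([0.9, 0.8, 0.7, 0.6]).sum()   # prob. of each observed label
print(total)  # matches the manual sum
```

A well-fit tree assigns high probability to the observed classes, so this sum (the deviance) is small.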
@@ -161,7 +148,7 @@ print(export_text(clf,
show_weights=True))

```

In order to properly evaluate the performance of a classification tree
on these data, we must estimate the test error rather than simply
computing the training error. We split the observations into a
@@ -264,8 +251,8 @@ confusion = confusion_table(best_.predict(X_test),
confusion

```

Now 72.0% of the test observations are correctly classified, which is slightly worse than the error for the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at a cost of a slightly worse error. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.

@@ -283,7 +270,7 @@ feature_names = list(D.columns)
X = np.asarray(D)

```

First, we split the data into training and test sets, and fit the tree
to the training data. Here we use 30% of the data for the test set.

@@ -298,7 +285,7 @@ to the training data. Here we use 30% of the data for the test set.
random_state=0)

```

Having formed our training and test data sets, we fit the regression tree.

```{python}
@@ -310,7 +297,7 @@ plot_tree(reg,
ax=ax);

```

The variable `lstat` measures the percentage of individuals with
lower socioeconomic status. The tree indicates that lower
values of `lstat` correspond to more expensive houses.
@@ -334,7 +321,7 @@ grid = skm.GridSearchCV(reg,
G = grid.fit(X_train, y_train)

```

In keeping with the cross-validation results, we use the pruned tree
to make predictions on the test set.

@@ -343,8 +330,8 @@ best_ = grid.best_estimator_
np.mean((y_test - best_.predict(X_test))**2)

```

In other words, the test set MSE associated with the regression tree
is 28.07. The square root of
the MSE is therefore around
@@ -367,7 +354,7 @@ plot_tree(G.best_estimator_,

## Bagging and Random Forests

Here we apply bagging and random forests to the `Boston` data, using
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
@@ -380,8 +367,8 @@ bag_boston = RF(max_features=X_train.shape[1], random_state=0)
bag_boston.fit(X_train, y_train)

```

The argument `max_features` indicates that all 12 predictors should
be considered for each split of the tree --- in other words, that
bagging should be done. How well does this bagged model perform on
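The point that bagging is a special case of a random forest can be sketched on synthetic data. This is illustrative only; the data below is not the Boston set:

```python
# Bagging = random forest with max_features equal to the number of predictors;
# a "true" random forest restricts the candidate features at each split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)

bag = RandomForestRegressor(max_features=X.shape[1], random_state=0)  # bagging
bag.fit(X, y)
rf = RandomForestRegressor(max_features=2, random_state=0)            # random forest
rf.fit(X, y)

# Both ensembles fit the training data closely on this easy problem
print(bag.score(X, y), rf.score(X, y))
```

The only difference between the two fits is `max_features`; everything else about the ensemble is identical.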
@@ -394,7 +381,7 @@ ax.scatter(y_hat_bag, y_test)
np.mean((y_test - y_hat_bag)**2)

```

The test set MSE associated with the bagged regression tree is
14.63, about half that obtained using an optimally-pruned single
tree. We could change the number of trees grown from the default of
@@ -425,8 +412,8 @@ y_hat_RF = RF_boston.predict(X_test)
np.mean((y_test - y_hat_RF)**2)

```

The test set MSE is 20.04;
this indicates that random forests did somewhat worse than bagging
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
@@ -450,7 +437,7 @@ house size (`rm`) are by far the two most important variables.

## Boosting

Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
to fit boosted regression trees to the `Boston` data
@@ -469,7 +456,7 @@ boost_boston = GBR(n_estimators=5000,
boost_boston.fit(X_train, y_train)

```

We can see how the training error decreases with the `train_score_` attribute.
To get an idea of how the test error decreases we can use the
`staged_predict()` method to get the predicted values along the path.
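The `staged_predict()` idiom can be sketched on a small synthetic problem; the data and the 50-tree ensemble below are assumptions for illustration:

```python
# staged_predict() yields one prediction vector per boosting iteration,
# letting us trace the error along the boosting path.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)
gbr.fit(X, y)

train_mse = [np.mean((y - pred) ** 2) for pred in gbr.staged_predict(X)]
print(len(train_mse))                 # one entry per boosting iteration
print(train_mse[-1] < train_mse[0])   # training error shrinks along the path
```

Evaluating the same list on held-out data instead of `X` is how the test-error curve in the plot above is built.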
@@ -492,7 +479,7 @@ ax.plot(plot_idx,
ax.legend();

```

We now use the boosted model to predict `medv` on the test set:

```{python}
@@ -500,7 +487,7 @@ y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)

```

The test MSE obtained is 14.48,
similar to the test MSE for bagging. If we want to, we can
perform boosting with a different value of the shrinkage parameter
@@ -518,8 +505,8 @@ y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)

```

In this case, using $\lambda=0.2$ leads to almost the same test MSE
as when using $\lambda=0.001$.

@@ -527,7 +514,7 @@ as when using $\lambda=0.001$.

## Bayesian Additive Regression Trees

In this section we demonstrate a `Python` implementation of BART found in the
`ISLP.bart` package. We fit a model
@@ -540,8 +527,8 @@ bart_boston = BART(random_state=0, burnin=5, ndraw=15)
bart_boston.fit(X_train, y_train)

```

On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest.

```{python}
@@ -549,8 +536,8 @@ yhat_test = bart_boston.predict(X_test.astype(np.float32))
np.mean((y_test - yhat_test)**2)

```

We can check how many times each variable appeared in the collection of trees.
This gives a summary similar to the variable importance plot for boosting and random forests.

File diff suppressed because one or more lines are too long
@@ -1,21 +1,8 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---

# Chapter 9

# Lab: Support Vector Machines
In this lab, we use the `sklearn.svm` library to demonstrate the support
@@ -39,7 +26,7 @@ from ISLP.svm import plot as plot_svm
from sklearn.metrics import RocCurveDisplay

```

We will use the function `RocCurveDisplay.from_estimator()` to
produce several ROC plots, using a shorthand `roc_curve`.

@@ -84,8 +71,8 @@ svm_linear = SVC(C=10, kernel='linear')
svm_linear.fit(X, y)

```

The support vector classifier with two features can
be visualized by plotting values of its *decision function*.
We have included a function for this in the `ISLP` package (inspired by a similar
@@ -99,7 +86,7 @@ plot_svm(X,
ax=ax)

```

The decision
boundary between the two classes is linear (because we used the
argument `kernel='linear'`). The support vectors are marked with `+`
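The decision function itself can be sketched on separable toy data (the points below are assumptions, not the lab's simulated data): for a linear kernel it is simply $X w + b$, and its sign gives the predicted class.

```python
# For a linear kernel, decision_function(X) == X @ coef_ + intercept_,
# and its sign classifies the points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [3., 3.], [3., 4.], [4., 3.]])
y = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(C=10, kernel='linear').fit(X, y)
f = svm.decision_function(X)

print(np.allclose(f, X @ svm.coef_.ravel() + svm.intercept_))  # True
print((np.sign(f) == y).all())                                 # True: separable data
```

Plotting `f` over a grid of points is exactly what `plot_svm()` does to draw the boundary and margins.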
@@ -126,8 +113,8 @@ coefficients of the linear decision boundary as follows:
svm_linear.coef_

```

Since the support vector machine is an estimator in `sklearn`, we
can use the usual machinery to tune it.

@@ -144,8 +131,8 @@ grid.fit(X, y)
grid.best_params_

```

We can easily access the cross-validation errors for each of these models
in `grid.cv_results_`. This prints out a lot of detail, so we
extract the accuracy results only.
@@ -166,7 +153,7 @@ y_test = np.array([-1]*10+[1]*10)
X_test[y_test==1] += 1

```

Now we predict the class labels of these test observations. Here we
use the best model selected by cross-validation in order to make the
predictions.
@@ -177,7 +164,7 @@ y_test_hat = best_.predict(X_test)
confusion_table(y_test_hat, y_test)

```

Thus, with this value of `C`,
70% of the test
observations are correctly classified. What if we had instead used
@@ -190,7 +177,7 @@ y_test_hat = svm_.predict(X_test)
confusion_table(y_test_hat, y_test)

```

In this case 60% of test observations are correctly classified.

We now consider a situation in which the two classes are linearly
@@ -205,7 +192,7 @@ fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);

```

Now the observations are just barely linearly separable.

```{python}
@@ -214,7 +201,7 @@ y_hat = svm_.predict(X)
confusion_table(y_hat, y)

```

We fit the
support vector classifier and plot the resulting hyperplane, using a
very large value of `C` so that no observations are
@@ -240,7 +227,7 @@ y_hat = svm_.predict(X)
confusion_table(y_hat, y)

```

Using `C=0.1`, we again do not misclassify any training observations, but we
also obtain a much wider margin and make use of twelve support
vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test
@@ -254,7 +241,7 @@ plot_svm(X,
ax=ax)

```

## Support Vector Machine
In order to fit an SVM using a non-linear kernel, we once again use
@@ -277,7 +264,7 @@ X[100:150] -= 2
y = np.array([1]*150+[2]*50)

```

Plotting the data makes it clear that the class boundary is indeed non-linear.

```{python}
@@ -288,8 +275,8 @@ ax.scatter(X[:,0],
cmap=cm.coolwarm)

```

The data is randomly split into training and testing groups. We then
fit the training data using the `SVC()` estimator with a
radial kernel and $\gamma=1$:
@@ -306,7 +293,7 @@ svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
svm_rbf.fit(X_train, y_train)

```

The plot shows that the resulting SVM has a decidedly non-linear
boundary.

@@ -318,7 +305,7 @@ plot_svm(X_train,
ax=ax)

```

We can see from the figure that there are a fair number of training
errors in this SVM fit. If we increase the value of `C`, we
can reduce the number of training errors. However, this comes at the
@@ -335,7 +322,7 @@ plot_svm(X_train,
ax=ax)

```

We can perform cross-validation using `skm.GridSearchCV()` to select the
best choice of $\gamma$ and `C` for an SVM with a radial
kernel:
@@ -354,7 +341,7 @@ grid.fit(X_train, y_train)
grid.best_params_

```

The best choice of parameters under five-fold CV is achieved at `C=1`
and `gamma=0.5`, though several other values also achieve the same
value.
@@ -371,7 +358,7 @@ y_hat_test = best_svm.predict(X_test)
confusion_table(y_hat_test, y_test)

```

With these parameters, 12% of test
observations are misclassified by this SVM.

@@ -431,7 +418,7 @@ roc_curve(svm_flex,
ax=ax);

```

However, these ROC curves are all on the training data. We are really
more interested in the level of prediction accuracy on the test
data. When we compute the ROC curves on the test data, the model with
@@ -447,7 +434,7 @@ roc_curve(svm_flex,
fig;

```

Let’s look at our tuned SVM.

```{python}
@@ -466,7 +453,7 @@ for (X_, y_, c, name) in zip(
color=c)

```

## SVM with Multiple Classes

If the response is a factor containing more than two levels, then the
@@ -485,7 +472,7 @@ fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);

```

We now fit an SVM to the data:

```{python}
@@ -521,7 +508,7 @@ Khan = load_data('Khan')
Khan['xtrain'].shape, Khan['xtest'].shape

```

This data set consists of expression measurements for 2,308
genes. The training and test sets consist of 63 and 20
observations, respectively.
@@ -540,7 +527,7 @@ confusion_table(khan_linear.predict(Khan['xtrain']),
Khan['ytrain'])

```

We see that there are *no* training
errors. In fact, this is not surprising, because the large number of
variables relative to the number of observations implies that it is
@@ -553,7 +540,7 @@ confusion_table(khan_linear.predict(Khan['xtest']),
Khan['ytest'])

```

We see that using `C=10` yields two test set errors on these data.
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.550485Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.578952Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.578657Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:59.883852Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:59.883749Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:59.910535Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:59.910272Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -378,7 +378,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d390528c",
|
||||
"id": "611e76a6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can easily access the cross-validation errors for each of these models\n",
|
||||
@@ -389,13 +389,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "bba8fad7",
|
||||
"id": "d3ab343e",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.580977Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.580845Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.583558Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.583239Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:59.912005Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:59.911925Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:59.914189Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:59.913943Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -417,7 +417,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "703e2d43",
|
||||
"id": "41d85a2a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We see that `C=1` results in the highest cross-validation\n",
|
||||
@@ -430,13 +430,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "ad64269d",
|
||||
"id": "6aba117e",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.585087Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.584981Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.586995Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.586714Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:59.915563Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:59.915487Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:59.917323Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:59.917078Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
@@ -448,7 +448,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "db41f5e2",
|
||||
"id": "ddbda9de",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we predict the class labels of these test observations. Here we\n",
|
||||
@@ -459,13 +459,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "5107fca1",
|
||||
"id": "dbe7d737",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.588685Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.588519Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.595768Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.595341Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:59.918744Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:59.918666Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:59.925361Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:59.925039Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -534,7 +534,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bbfc8005",
|
||||
"id": "7f002ea6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Thus, with this value of `C`,\n",
|
||||
@@ -546,13 +546,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "0320d9e0",
|
||||
"id": "ab1697c2",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.597509Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.597387Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.602346Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.601964Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:59.927158Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:59.927027Z",
|
||||
"iopub.status.idle": "2023-08-21T02:29:59.931558Z",
|
||||
"shell.execute_reply": "2023-08-21T02:29:59.931228Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -622,7 +622,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "427d775f",
|
||||
"id": "7574703a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case 60% of test observations are correctly classified.\n",
|
||||
@@ -637,13 +637,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "84d7e778",
|
||||
"id": "0fd42b1e",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.604018Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.603879Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.734586Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.734264Z"
|
||||
"iopub.execute_input": "2023-08-21T02:29:59.933100Z",
|
||||
"iopub.status.busy": "2023-08-21T02:29:59.933001Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.054738Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.054338Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -666,7 +666,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ff7bdad1",
|
||||
"id": "4bdaf415",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now the observations are just barely linearly separable."
|
||||
@@ -675,13 +675,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "abb1f8be",
|
||||
"id": "09c15299",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.736388Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.736251Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.741179Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.740886Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.056655Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.056526Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.061096Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.060792Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -750,7 +750,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c44297cc",
|
||||
"id": "d987eecc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We fit the\n",
|
||||
@@ -762,13 +762,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "2e4ed2f5",
|
||||
"id": "d5fd2ff9",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.742864Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.742750Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.860686Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.860305Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.062673Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.062585Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.199860Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.199129Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -794,7 +794,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2836d70d",
|
||||
"id": "0834d471",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Indeed no training errors were made and only three support vectors were used.\n",
|
||||
@@ -807,13 +807,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "164a611c",
|
||||
"id": "39aff1b1",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.862647Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.862496Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.867261Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.866916Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.202380Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.202233Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.207886Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.207493Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -882,7 +882,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39a432d1",
|
||||
"id": "d0684844",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Using `C=0.1`, we again do not misclassify any training observations, but we\n",
|
||||
@@ -894,13 +894,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"id": "c67591a1",
|
||||
"id": "63a9d752",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.868821Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.868723Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.990207Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.989921Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.209907Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.209781Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.340803Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.340433Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -926,7 +926,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "25e61f65",
|
||||
"id": "a70d84f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Support Vector Machine\n",
|
||||
@@ -947,13 +947,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"id": "322be574",
|
||||
"id": "2fee8df5",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.991910Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.991799Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:29.993907Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:29.993635Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.342773Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.342626Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.345094Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.344774Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
@@ -966,7 +966,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "22fe2182",
|
||||
"id": "d5c7545e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Plotting the data makes it clear that the class boundary is indeed non-linear."
|
||||
@@ -975,13 +975,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"id": "04fda182",
|
||||
"id": "48f01abe",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:29.995558Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:29.995406Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.089596Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.089130Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.347053Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.346902Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.440453Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.440153Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -989,7 +989,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<matplotlib.collections.PathCollection at 0x17f2b35d0>"
|
||||
"<matplotlib.collections.PathCollection at 0x28b7c65d0>"
|
||||
]
|
||||
},
|
||||
"execution_count": 20,
|
||||
@@ -1017,7 +1017,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "64913fe3",
|
||||
"id": "7c0bc32b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The data is randomly split into training and testing groups. We then\n",
|
||||
@@ -1028,13 +1028,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"id": "0c2690d1",
|
||||
"id": "4acc3246",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.091605Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.091498Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.095614Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.095347Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.442257Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.442156Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.446674Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.446369Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1066,7 +1066,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5da9efdb",
|
||||
"id": "b2c7e95e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The plot shows that the resulting SVM has a decidedly non-linear\n",
|
||||
@@ -1076,13 +1076,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"id": "3eb171e8",
|
||||
"id": "e9852a28",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.097178Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.097088Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.357131Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.356847Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.448268Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.448160Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:00.828511Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:00.828128Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1107,7 +1107,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ab5b1446",
|
||||
"id": "acfa4bed",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see from the figure that there are a fair number of training\n",
|
||||
@@ -1120,13 +1120,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"id": "9a6b905b",
|
||||
"id": "01232fc9",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.358811Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.358698Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.513702Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.513395Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:00.830365Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:00.830226Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.132677Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.132224Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1153,7 +1153,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "300c1b8b",
|
||||
"id": "5bc77e3f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can perform cross-validation using `skm.GridSearchCV()` to select the\n",
|
||||
@@ -1164,13 +1164,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"id": "5ab01d6c",
|
||||
"id": "bcbd15a4",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.515803Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.515668Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.612245Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.611940Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.134616Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.134486Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.243519Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.243203Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1201,7 +1201,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1bb987ae",
|
||||
"id": "997bbfbd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The best choice of parameters under five-fold CV is achieved at `C=1`\n",
|
||||
@@ -1212,13 +1212,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"id": "166a6acb",
|
||||
"id": "28ca551e",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.614152Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.614029Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.850984Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.850653Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.245550Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.245377Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.600896Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.600574Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1303,7 +1303,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39ee6f32",
|
||||
"id": "48e671f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With these parameters, 12% of test\n",
|
||||
@@ -1312,7 +1312,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f0ea699d",
|
||||
"id": "eaed0a87",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ROC Curves\n",
|
||||
@@ -1346,13 +1346,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"id": "0607fc41",
|
||||
"id": "68ac9421",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.853079Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.852934Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:30.948570Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:30.948252Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.602740Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.602614Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.698620Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.698322Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -1380,7 +1380,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "54446e71",
|
||||
"id": "0c35d32a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" In this example, the SVM appears to provide accurate predictions. By increasing\n",
|
||||
@@ -1391,13 +1391,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"id": "5211a882",
|
||||
"id": "f79a9e0a",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:30.950213Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:30.950106Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.095103Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.094737Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.700479Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.700347Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.837479Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.837102Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1428,7 +1428,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "de7e4be8",
|
||||
"id": "7bd1a22b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"However, these ROC curves are all on the training data. We are really\n",
|
||||
@@ -1440,13 +1440,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"id": "12acc4ff",
|
||||
"id": "bdb9e503",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.096951Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.096805Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.101372Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.101097Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.839390Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.839243Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.843595Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.843287Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
@@ -1462,7 +1462,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eb5c8aeb",
|
||||
"id": "8a42e924",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let’s look at our tuned SVM."
|
||||
@@ -1471,13 +1471,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"id": "21c81913",
|
||||
"id": "329f5d2c",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.103089Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.102993Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.204133Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.203835Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.845300Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.845201Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:01.944073Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:01.943763Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1510,7 +1510,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b9fefe9f",
|
||||
"id": "bac19279",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## SVM with Multiple Classes\n",
|
||||
@@ -1526,13 +1526,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"id": "2fff4fa8",
|
||||
"id": "267e113d",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.205816Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.205709Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.294925Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.294593Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:01.945725Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:01.945611Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:02.034378Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:02.034069Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1558,7 +1558,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b7adc87d",
|
||||
"id": "a9f4297c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We now fit an SVM to the data:"
|
||||
@@ -1567,13 +1567,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"id": "5396f2df",
|
||||
"id": "64cbebd0",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.296594Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.296472Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.880175Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.879674Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:02.036083Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:02.035963Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:03.015535Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:03.014798Z"
|
||||
},
|
||||
"lines_to_next_cell": 0
|
||||
},
|
||||
@@ -1605,7 +1605,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "837644f5",
|
||||
"id": "62c5d16e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `sklearn.svm` library can also be used to perform support vector\n",
|
||||
@@ -1614,7 +1614,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a6bc0cbc",
|
||||
"id": "5c0824b6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Application to Gene Expression Data\n",
|
||||
@@ -1631,13 +1631,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"id": "f63c575e",
|
||||
"id": "b6e6f12b",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.882095Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.881962Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.959079Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.958769Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:03.017430Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:03.017293Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:03.099156Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:03.098760Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1659,7 +1659,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bfd6492c",
|
||||
"id": "e3fbaa58",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This data set consists of expression measurements for 2,308\n",
|
||||
@@ -1677,13 +1677,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"id": "32091338",
|
||||
"id": "273a10b2",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.960641Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.960528Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:31.990176Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:31.989868Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:03.101069Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:03.100881Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:03.130224Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:03.129845Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1777,7 +1777,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "23043ab0",
|
||||
"id": "31cad43a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We see that there are *no* training\n",
|
||||
@@ -1791,13 +1791,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"id": "d9058023",
|
||||
"id": "bc3079a7",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:31.991754Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:31.991636Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:32.002452Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:32.002189Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:03.132111Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:03.131975Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:03.143298Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:03.142948Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
@@ -1889,7 +1889,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0d5aba4",
|
||||
"id": "0d059312",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We see that using `C=10` yields two test set errors on these data.\n",
|
||||
@@ -1900,8 +1900,8 @@
|
||||
"metadata": {
|
||||
"jupytext": {
|
||||
"cell_metadata_filter": "-all",
|
||||
"formats": "ipynb,Rmd",
|
||||
"main_language": "python"
|
||||
"main_language": "python",
|
||||
"notebook_metadata_filter": "-all"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -1,16 +1,3 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---


# Chapter 11

@@ -37,7 +24,7 @@ from ISLP.models import ModelSpec as MS
from ISLP import load_data

```


We also collect the new imports
needed for this lab.

@@ -61,7 +48,7 @@ BrainCancer = load_data('BrainCancer')
BrainCancer.columns

```


The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data.

@@ -69,20 +56,20 @@ We first briefly examine the data.
BrainCancer['sex'].value_counts()

```


```{python}
BrainCancer['diagnosis'].value_counts()

```


```{python}
BrainCancer['status'].value_counts()

```



Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an
@@ -109,7 +96,7 @@ km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax)

```
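
The Kaplan-Meier estimate fitted above can also be computed directly from its product-limit formula. The sketch below is a minimal hand-rolled version on toy data, independent of the `lifelines` estimator used in the lab; the function name and data are illustrative only.

```python
import numpy as np

def kaplan_meier(time, status):
    """Product-limit (Kaplan-Meier) survival estimate.

    time   -- observed times (event or censoring)
    status -- 1 if the event was observed, 0 if censored
    Returns (event_times, survival), where survival[i] is S(t) just
    after the i-th distinct event time.
    """
    time = np.asarray(time, float)
    status = np.asarray(status, int)
    event_times = np.unique(time[status == 1])
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(time >= t)            # still under observation at t
        d = np.sum((time == t) & (status == 1))  # events occurring at t
        s *= 1 - d / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: with no censoring, KM reduces to the empirical survival curve.
t, s = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
print(s)  # [0.75 0.5  0.25 0.  ]
```

With censored observations the at-risk counts shrink between event times, which is exactly what distinguishes the estimate from the raw empirical curve.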


Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure 11.3.
We do this using the `groupby()` method of a dataframe.
@@ -138,7 +125,7 @@ for sex, df in BrainCancer.groupby('sex'):
    km_sex.plot(label='Sex=%s' % sex, ax=ax)

```


As discussed in Section 11.4, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
@@ -152,8 +139,8 @@ logrank_test(by_sex['Male']['time'],
             by_sex['Female']['status'])

```



The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes.
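
The statistic behind `logrank_test()` can be sketched from its textbook form: at each event time, compare the observed events in one group with the count expected under a common hazard, then pool. This is a hand-rolled illustration on synthetic data, not the `lifelines` API; for one degree of freedom the chi-squared tail probability reduces to `erfc`.

```python
import math
import numpy as np

def logrank(time1, status1, time2, status2):
    """Two-sample log-rank test (textbook sketch)."""
    time = np.concatenate([time1, time2])
    status = np.concatenate([status1, status2]).astype(int)
    group = np.concatenate([np.zeros(len(time1)), np.ones(len(time2))])
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[status == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 0)).sum()
        d = ((time == t) & (status == 1)).sum()
        d1 = ((time == t) & (status == 1) & (group == 0)).sum()
        o_minus_e += d1 - d * n1 / n       # observed minus expected in group 1
        if n > 1:                          # hypergeometric variance at this time
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = o_minus_e**2 / var
    p = math.erfc(math.sqrt(stat / 2))     # chi-squared tail, 1 df
    return stat, p

rng = np.random.default_rng(0)
t1, t2 = rng.exponential(10, 30), rng.exponential(10, 30)
stat, p = logrank(t1, np.ones(30, int), t2, np.ones(30, int))
```

Here both samples come from the same distribution, so the test should typically not reject.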

@@ -172,7 +159,7 @@ cox_fit = coxph().fit(model_df,
cox_fit.summary[['coef', 'se(coef)', 'p']]

```


The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case).
@@ -186,7 +173,7 @@ with no features as follows:
cox_fit.log_likelihood_ratio_test()

```


Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is
@@ -206,7 +193,7 @@ fit_all = coxph().fit(all_df,
fit_all.summary[['coef', 'se(coef)', 'p']]

```


The `diagnosis` variable has been coded so that the baseline
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
||||
@@ -233,7 +220,7 @@ def representative(series):
|
||||
modal_data = cleaned.apply(representative, axis=0)
|
||||
|
||||
```
|
||||
|
||||
|
||||
We make four
|
||||
copies of the column means and assign the `diagnosis` column to be the four different
|
||||
diagnoses.
|
||||
@@ -245,7 +232,7 @@ modal_df['diagnosis'] = levels
|
||||
modal_df
|
||||
|
||||
```
|
||||
|
||||
|
||||
We then construct the model matrix based on the model specification `all_MS` used to fit
|
||||
the model, and name the rows according to the levels of `diagnosis`.
|
||||
|
||||
@@ -272,7 +259,7 @@ fig, ax = subplots(figsize=(8, 8))
|
||||
predicted_survival.plot(ax=ax);
|
||||
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Publication Data
The `Publication` data presented in Section 11.5.4 can be
@@ -291,7 +278,7 @@ for result, df in Publication.groupby('posres'):
km_result.plot(label='Result=%d' % result, ax=ax)

```


As discussed previously, the $p$-values from fitting Cox’s
proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication
@@ -308,8 +295,8 @@ posres_fit = coxph().fit(posres_df,
posres_fit.summary[['coef', 'se(coef)', 'p']]

```

However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism
variable.
@@ -322,7 +309,7 @@ coxph().fit(model.fit_transform(Publication),
            'status').summary[['coef', 'se(coef)', 'p']]

```

We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results.
@@ -372,7 +359,7 @@ model = MS(['Operators',
           intercept=False)
X = model.fit_transform(D)
```

It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
@@ -382,7 +369,7 @@ of the variable is dropped.
X[:5]

```

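This encoding convention is easy to see in a small analogy using `pandas.get_dummies` rather than the `MS()` transform: levels are sorted alphabetically and the first becomes the baseline whose column is dropped.

```python
import pandas as pd

# Levels sort to A, B, C; drop_first removes the baseline column (A),
# mirroring the convention described for the model matrix X.
centers = pd.Series(['B', 'A', 'C', 'A'], name='Center')
X_demo = pd.get_dummies(centers, prefix='Center', drop_first=True)
print(X_demo.columns.tolist())
```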
Next, we specify the coefficients and the hazard function.

```{python}
@@ -431,7 +418,7 @@ W = np.array([sim_time(l, cum_hazard, rng)
D['Wait time'] = np.clip(W, 0, 1000)

```

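A sketch of what a helper like `sim_time` might do; the actual ISLP function may differ. Under proportional hazards with relative risk `l`, $P(T>t)=\exp(-l\,H(t))$, so a survival time can be drawn by inverting the cumulative hazard at a unit-exponential deviate (the grid and horizon below are assumptions made for this sketch):

```python
import numpy as np

def sim_time(l, cum_hazard, rng, horizon=100, n_grid=10_001):
    # P(T > t) = exp(-l * H(t)), so T = H^{-1}(E / l) with
    # E ~ Exponential(1).  H is inverted numerically on a grid.
    e = rng.exponential(1.0)
    grid = np.linspace(0, horizon, n_grid)
    H = l * cum_hazard(grid)          # cum_hazard assumed vectorized
    idx = np.searchsorted(H, e)
    return grid[min(idx, n_grid - 1)]

rng = np.random.default_rng(0)
# Sanity check: with H(t) = t and l = 1, T should be Exponential(1).
draws = np.array([sim_time(1.0, lambda t: t, rng) for _ in range(2000)])
```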
We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`).
@@ -443,13 +430,13 @@ D['Failed'] = rng.choice([1, 0],
D[:5]

```


```{python}
D['Failed'].mean()

```

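The censoring step can be reproduced in miniature with `rng.choice` and the stated 90/10 split:

```python
import numpy as np

rng = np.random.default_rng(10)
# Each simulated call is answered (Failed == 1) with probability 0.9,
# and censored by a hang-up (Failed == 0) with probability 0.1.
failed = rng.choice([1, 0], size=2000, p=[0.9, 0.1])
print(failed.mean())   # close to 0.9 by construction
```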
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.

```{python}
@@ -462,7 +449,7 @@ for center, df in D.groupby('Center'):
ax.set_title("Probability of Still Being on Hold")

```

Next, we stratify by `Time`.

```{python}
@@ -475,7 +462,7 @@ for time, df in D.groupby('Time'):
ax.set_title("Probability of Still Being on Hold")

```

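The stratified curves come from fitted Kaplan-Meier estimators; the product-limit computation behind them can be written from scratch in a few lines (a sketch, not the lifelines implementation):

```python
import numpy as np

def kaplan_meier(time, event):
    # Product-limit estimate: at each distinct event time, multiply in
    # the fraction of at-risk subjects who survive that instant.
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    t_events = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in t_events:
        at_risk = np.sum(time >= t)                # still waiting just before t
        d = np.sum((time == t) & (event == 1))     # events exactly at t
        s *= 1 - d / at_risk
        surv.append(s)
    return t_events, np.array(surv)

# Tiny example: events at t=1 and t=2, one censored observation at t=3.
t_ev, s_hat = kaplan_meier([1, 2, 3], [1, 1, 0])
```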
It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a
@@ -488,8 +475,8 @@ multivariate_logrank_test(D['Wait time'],
                          D['Failed'])

```

Next, we consider the effect of `Time`.

```{python}
@@ -498,8 +485,8 @@ multivariate_logrank_test(D['Wait time'],
                          D['Failed'])

```

As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we
@@ -514,8 +501,8 @@ F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()

```

Next, we look at the results for `Time`.

```{python}
@@ -527,8 +514,8 @@ F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()

```

We find that differences between centers are highly significant, as
are differences between times of day.

@@ -544,8 +531,8 @@ fit_queuing = coxph().fit(
fit_queuing.summary[['coef', 'se(coef)', 'p']]

```

The $p$-values for Center B and evening time
are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be

|
||||
"cell_type": "markdown",
|
||||
"id": "64b2bc33",
|
||||
"id": "0946d3ef",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As in the case of a categorical variable with 2 levels, these\n",
|
||||
@@ -2366,13 +2366,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"id": "026e9ff8",
|
||||
"id": "107cedad",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:36.084076Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:36.083964Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:36.208409Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:36.208076Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:07.032008Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:07.031887Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:07.160931Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:07.160639Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -2462,7 +2462,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ed54fe0",
|
||||
"id": "10f2a0c1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we look at the results for `Time`."
|
||||
@@ -2471,13 +2471,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"id": "7cab3789",
|
||||
"id": "334eb331",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:36.210101Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:36.209985Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:36.334146Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:36.333801Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:07.162793Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:07.162651Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:07.291875Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:07.291550Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -2567,7 +2567,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2d250dc9",
|
||||
"id": "774963d4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We find that differences between centers are highly significant, as\n",
|
||||
@@ -2579,13 +2579,13 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"id": "5cc4b898",
|
||||
"id": "421811c5",
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2023-08-07T00:19:36.336025Z",
|
||||
"iopub.status.busy": "2023-08-07T00:19:36.335898Z",
|
||||
"iopub.status.idle": "2023-08-07T00:19:36.561174Z",
|
||||
"shell.execute_reply": "2023-08-07T00:19:36.559597Z"
|
||||
"iopub.execute_input": "2023-08-21T02:30:07.293545Z",
|
||||
"iopub.status.busy": "2023-08-21T02:30:07.293433Z",
|
||||
"iopub.status.idle": "2023-08-21T02:30:07.532213Z",
|
||||
"shell.execute_reply": "2023-08-21T02:30:07.531293Z"
|
||||
},
|
||||
"lines_to_next_cell": 2
|
||||
},
|
||||
@@ -2684,7 +2684,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bec9d61d",
|
||||
"id": "3c65063f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The $p$-values for Center B and evening time\n",
|
||||
@@ -2703,8 +2703,8 @@
|
||||
"metadata": {
|
||||
"jupytext": {
|
||||
"cell_metadata_filter": "-all",
|
||||
"formats": "ipynb,Rmd",
|
||||
"main_language": "python"
|
||||
"main_language": "python",
|
||||
"notebook_metadata_filter": "-all"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -1,20 +1,7 @@
---
jupyter:
jupytext:
cell_metadata_filter: -all
formats: ipynb,Rmd
main_language: python
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.14.7
---


# Chapter 12

# Lab: Unsupervised Learning
# Lab: Unsupervised Learning
In this lab we demonstrate PCA and clustering on several datasets.
As in other labs, we import some of our libraries at this top
level. This makes the code more readable, as scanning the first few
@@ -44,7 +31,7 @@ from scipy.cluster.hierarchy import \
from ISLP.cluster import compute_linkage

```


## Principal Components Analysis
In this lab, we perform PCA on `USArrests`, a data set in the
`R` computing environment.
@@ -58,22 +45,22 @@ USArrests = get_rdataset('USArrests').data
USArrests

```


The columns of the data set contain the four variables.

```{python}
USArrests.columns

```


We first briefly examine the data. We notice that the variables have vastly different means.

```{python}
USArrests.mean()

```




Dataframes have several useful methods for computing
column-wise summaries. We can also examine the
variance of the four variables using the `var()` method.
@@ -82,7 +69,7 @@ variance of the four variables using the `var()` method.
USArrests.var()

```


Not surprisingly, the variables also have vastly different variances.
The `UrbanPop` variable measures the percentage of the population
in each state living in an urban area, which is not a comparable
@@ -132,7 +119,7 @@ of the variables. In this case, since we centered and scaled the data with
pcaUS.mean_

```


The scores can be computed using the `transform()` method
of `pcaUS` after it has been fit.

@@ -150,7 +137,7 @@ principal component loading vector.
pcaUS.components_

```


The `biplot` is a common visualization method used with
PCA. It is not built in as a standard
part of `sklearn`, though there are python
@@ -191,14 +178,14 @@ for k in range(pcaUS.components_.shape[1]):
USArrests.columns[k])

```


The standard deviations of the principal component scores are as follows:

```{python}
scores.std(0, ddof=1)
```



The variance of each score can be extracted directly from the `pcaUS` object via
the `explained_variance_` attribute.
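As an aside on the relationship the lab text describes: for PCA, the variance of each score column (with `ddof=1`) equals the corresponding entry of `explained_variance_`. A minimal sketch of that check, assuming `sklearn` and using synthetic stand-in data (since `USArrests` is not loaded here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the USArrests matrix: 50 rows, 4 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Center and scale, then fit PCA, mirroring the lab's treatment of USArrests.
X_std = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(X_std)

# The variance of each score column equals explained_variance_,
# and its square root is the standard deviation reported by scores.std(0, ddof=1).
print(np.allclose(scores.var(0, ddof=1), pca.explained_variance_))
```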

@@ -220,7 +207,7 @@ We can plot the PVE explained by each component, as well as the cumulative PVE.
plot the proportion of variance explained.

```{python}
# %%capture
%%capture
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcaUS.n_components_)+1
ax = axes[0]
@@ -320,7 +307,7 @@ Xna = X.copy()
Xna[r_idx, c_idx] = np.nan

```


Here the array `r_idx`
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
@@ -348,7 +335,7 @@ Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx]

```


Before we begin Step 2, we set ourselves up to measure the progress of our
iterations:

@@ -387,7 +374,7 @@ while rel_err > thresh:
.format(count, mss, rel_err))

```


We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
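The iteration the hunks above refer to can be sketched in full. This is not the lab's exact code (the names `ismiss`, `Xhat`, and the rank-1 truncation are illustrative, and the data here is synthetic), but it follows the same pattern: mean-impute the missing cells, then alternate a low-rank SVD approximation with refilling until the error stabilizes:

```python
import numpy as np

# Synthetic 50x4 matrix with 20 missing entries in distinct rows.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
ismiss = np.zeros(X.shape, dtype=bool)
ismiss[rng.choice(50, 20, replace=False), rng.integers(0, 4, 20)] = True

# Step 1: initialize missing entries with column means of the observed values.
Xhat = X.copy()
Xhat[ismiss] = np.nan
col_means = np.nanmean(Xhat, axis=0)
Xhat[ismiss] = np.take(col_means, np.nonzero(ismiss)[1])

# Step 2: alternate a rank-1 SVD approximation with refilling the missing
# cells, until the observed-cell error stops changing (relative tolerance).
thresh = 1e-7
rel_err = 1.0
mss_old = np.mean(Xhat[~ismiss] ** 2)
count = 0
while rel_err > thresh and count < 100:
    count += 1
    U, S, Vt = np.linalg.svd(Xhat, full_matrices=False)
    approx = U[:, :1] @ np.diag(S[:1]) @ Vt[:1]   # rank-1 reconstruction
    Xhat[ismiss] = approx[ismiss]                 # refill only the missing cells
    mss = np.mean((X[~ismiss] - approx[~ismiss]) ** 2)
    rel_err = abs(mss - mss_old) / mss_old
    mss_old = mss
```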

Finally, we compute the correlation between the 20 imputed values
@@ -397,8 +384,8 @@ and the actual values:
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]

```




In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python` implementations.


@@ -444,7 +431,7 @@ ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2");

```
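The hunk above shows only the tail of the lab's $K$-means cell. A self-contained sketch of the same idea, with synthetic two-cluster data and illustrative names (not the lab's exact code):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two-dimensional data with a built-in split: shift the first 25 points.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
X[:25, 0] += 3

# n_init repeats the algorithm from multiple random starts and keeps the best.
kmeans = KMeans(n_clusters=2, n_init=20, random_state=2).fit(X)
labels = kmeans.labels_   # one cluster label (0 or 1) per observation
```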


Here the observations can be easily plotted because they are
two-dimensional. If there were more than two variables then we could
instead perform PCA and plot the first two principal component score
@@ -519,7 +506,7 @@ hc_comp = HClust(distance_threshold=0,
hc_comp.fit(X)

```


This computes the entire dendrogram.
We could just as easily perform hierarchical clustering with average or single linkage instead:

@@ -534,7 +521,7 @@ hc_sing = HClust(distance_threshold=0,
hc_sing.fit(X);

```


To use a precomputed distance matrix, we provide an additional
argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.

@@ -550,7 +537,7 @@ hc_sing_pre = HClust(distance_threshold=0,
hc_sing_pre.fit(D)

```


We use
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
`dendrogram()` expects a so-called *linkage-matrix representation*
@@ -573,7 +560,7 @@ dendrogram(linkage_comp,
**cargs);

```


We may want to color branches of the tree above
and below a cut-threshold differently. This can be achieved
by changing the `color_threshold`. Let’s cut the tree at a height of 4,
@@ -587,7 +574,7 @@ dendrogram(linkage_comp,
above_threshold_color='black');

```


To determine the cluster labels for each observation associated with a
given cut of the dendrogram, we can use the `cut_tree()`
function from `scipy.cluster.hierarchy`:

@@ -607,7 +594,7 @@ or `height` to `cut_tree()`.
cut_tree(linkage_comp, height=5)

```
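The `cut_tree()` usage the hunk above references can be shown end to end. In this sketch the linkage matrix is built directly with `scipy.cluster.hierarchy.linkage` on synthetic data, rather than converted from a fitted `sklearn` object as the lab does:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Illustrative data: 30 observations, 4 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))

Z = linkage(X, method='complete')

# Either request a fixed number of clusters...
labels4 = cut_tree(Z, n_clusters=4)
# ...or cut the dendrogram at a given height.
labels_h = cut_tree(Z, height=5)
```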


To scale the variables before performing hierarchical clustering of
the observations, we use `StandardScaler()` as in our PCA example:

@@ -651,7 +638,7 @@ dendrogram(linkage_cor, ax=ax, **cargs)
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");

```
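The correlation-based dissimilarity behind the cell above can be computed directly: between two observations it is one minus the correlation of their feature vectors. A minimal sketch on synthetic data (the names are illustrative, and this only makes sense with several features per observation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Illustrative data: 40 observations, 20 features.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 20))

# 40x40 dissimilarity matrix: 1 - cor(x_i, x_j) between observations.
corD = 1 - np.corrcoef(X)

# linkage() expects a condensed distance vector, not a square matrix.
Z = linkage(squareform(corD, checks=False), method='complete')
```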



## NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic
@@ -666,7 +653,7 @@ nci_labs = NCI60['labels']
nci_data = NCI60['data']

```


Each cell line is labeled with a cancer type. We do not make use of
the cancer types in performing PCA and clustering, as these are
unsupervised techniques. But after performing PCA and clustering, we
@@ -679,8 +666,8 @@ The data has 64 rows and 6830 columns.
nci_data.shape

```




We begin by examining the cancer types for the cell lines.


@@ -688,7 +675,7 @@ We begin by examining the cancer types for the cell lines.
nci_labs.value_counts()

```



### PCA on the NCI60 Data

@@ -703,7 +690,7 @@ nci_pca = PCA()
nci_scores = nci_pca.fit_transform(nci_scaled)

```


We now plot the first few principal component score vectors, in order
to visualize the data. The observations (cell lines) corresponding to
a given cancer type will be plotted in the same color, so that we can
@@ -739,7 +726,7 @@ to have pretty similar gene expression levels.




We can also plot the percent variance
explained by the principal components as well as the cumulative percent variance explained.
This is similar to the plots we made earlier for the `USArrests` data.
@@ -798,7 +785,7 @@ def plot_nci(linkage, ax, cut=-np.inf):
return hc

```


Let’s plot our results.

```{python}
@@ -819,7 +806,7 @@ linkage. Clearly cell lines within a single cancer type do tend to
cluster together, although the clustering is not perfect. We will use
complete linkage hierarchical clustering for the analysis that
follows.


We can cut the dendrogram at the height that will yield a particular
number of clusters, say four:

@@ -830,7 +817,7 @@ pd.crosstab(nci_labs['label'],
pd.Series(comp_cut.reshape(-1), name='Complete'))

```



There are some clear patterns. All the leukemia cell lines fall in
one cluster, while the breast cancer cell lines are spread out over
@@ -844,7 +831,7 @@ plot_nci('Complete', ax, cut=140)
ax.axhline(140, c='r', linewidth=4);

```


The `axhline()` function draws a horizontal line on top of any
existing set of axes. The argument `140` plots a horizontal
line at height 140 on the dendrogram; this is a height that
@@ -866,7 +853,7 @@ pd.crosstab(pd.Series(comp_cut, name='HClust'),
pd.Series(nci_kmeans.labels_, name='K-means'))

```


We see that the four clusters obtained using hierarchical clustering
and $K$-means clustering are somewhat different. First we note
that the labels in the two clusterings are arbitrary. That is, swapping

File diff suppressed because it is too large