v2.2 versions of labs except Ch10
@@ -1,23 +1,16 @@
|
|||||||
---
|
# Introduction to Python
|
||||||
jupyter:
|
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
|
||||||
formats: Rmd,ipynb
|
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch02-statlearn-lab.ipynb">
|
||||||
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
|
</a>
|
||||||
|
|
||||||
# Chapter 2
|
[Binder](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch02-statlearn-lab.ipynb)
|
||||||
|
|
||||||
# Lab: Introduction to Python
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Getting Started
|
## Getting Started
|
||||||
|
|
||||||
|
|
||||||
@@ -73,21 +66,21 @@ inputs. For example, the
|
|||||||
print('fit a model with', 11, 'variables')
|
print('fit a model with', 11, 'variables')
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The following command will provide information about the `print()` function.
|
The following command will provide information about the `print()` function.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
# print?
|
print?
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Adding two integers in `Python` is pretty intuitive.
|
Adding two integers in `Python` is pretty intuitive.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
3 + 5
|
3 + 5
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
In `Python`, textual data is handled using
|
In `Python`, textual data is handled using
|
||||||
*strings*. For instance, `"hello"` and
|
*strings*. For instance, `"hello"` and
|
||||||
`'hello'`
|
`'hello'`
|
||||||
@@ -98,7 +91,7 @@ We can concatenate them using the addition `+` symbol.
|
|||||||
"hello" + " " + "world"
|
"hello" + " " + "world"
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
A string is actually a type of *sequence*: this is a generic term for an ordered list.
|
A string is actually a type of *sequence*: this is a generic term for an ordered list.
|
||||||
The three most important types of sequences are lists, tuples, and strings.
|
The three most important types of sequences are lists, tuples, and strings.
|
||||||
We introduce lists now.
|
We introduce lists now.
|
||||||
@@ -114,7 +107,7 @@ x = [3, 4, 5]
|
|||||||
x
|
x
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that we used the brackets
|
Note that we used the brackets
|
||||||
`[]` to construct this list.
|
`[]` to construct this list.
|
||||||
|
|
||||||
@@ -126,14 +119,14 @@ y = [4, 9, 7]
|
|||||||
x + y
|
x + y
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists
|
The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists
|
||||||
element-by-element?
|
element-by-element?
|
||||||
In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*.
|
In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*.
|
||||||
In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`.
|
In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`.
|
||||||
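As a small illustration of the distinction just described, the following sketch contrasts list concatenation with element-by-element addition written out by hand. The names `concatenated` and `elementwise` are ours and do not appear in the lab.

```python
x = [3, 4, 5]
y = [4, 9, 7]

# `+` on two lists concatenates them into one longer list.
concatenated = x + y

# Element-by-element addition has to be spelled out explicitly in base Python.
elementwise = [a + b for a, b in zip(x, y)]

print(concatenated)
print(elementwise)
```

The explicit loop works, but it is exactly the kind of bookkeeping that array libraries are meant to remove.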
|
|
||||||
|
|
||||||
|
|
||||||
This example reflects the fact that
|
This example reflects the fact that
|
||||||
`Python` is a general-purpose programming language. Much of `Python`'s data-specific
|
`Python` is a general-purpose programming language. Much of `Python`'s data-specific
|
||||||
functionality comes from other packages, notably `numpy`
|
functionality comes from other packages, notably `numpy`
|
||||||
@@ -148,8 +141,8 @@ See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/n
|
|||||||
As mentioned earlier, this book makes use of functionality that is contained in the `numpy`
|
As mentioned earlier, this book makes use of functionality that is contained in the `numpy`
|
||||||
*library*, or *package*. A package is a collection of modules that are not necessarily included in
|
*library*, or *package*. A package is a collection of modules that are not necessarily included in
|
||||||
the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*.
|
the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*.
|
||||||
|
|
||||||
|
|
||||||
To access `numpy`, we must first `import` it.
|
To access `numpy`, we must first `import` it.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -193,7 +186,7 @@ x
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
The object `x` has several
|
The object `x` has several
|
||||||
*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`
|
*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`
|
||||||
@@ -203,7 +196,7 @@ For instance, we can access the `ndim` attribute of `x` as follows.
|
|||||||
```{python}
|
```{python}
|
||||||
x.ndim
|
x.ndim
|
||||||
```
|
```
|
||||||
|
|
||||||
The output indicates that `x` is a two-dimensional array.
|
The output indicates that `x` is a two-dimensional array.
|
||||||
Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is
|
Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is
|
||||||
comprised of 64-bit integers:
|
comprised of 64-bit integers:
|
||||||
@@ -227,7 +220,7 @@ documentation associated with the function `fun`, if it exists.
|
|||||||
We can try this for `np.array()`.
|
We can try this for `np.array()`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
# np.array?
|
np.array?
|
||||||
|
|
||||||
```
|
```
|
||||||
This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`.
|
This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`.
|
||||||
@@ -245,7 +238,7 @@ at its `shape` attribute.
|
|||||||
x.shape
|
x.shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
A *method* is a function that is associated with an
|
A *method* is a function that is associated with an
|
||||||
object.
|
object.
|
||||||
@@ -282,10 +275,10 @@ x_reshape = x.reshape((2, 3))
|
|||||||
print('reshaped x:\n', x_reshape)
|
print('reshaped x:\n', x_reshape)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The previous output reveals that `numpy` arrays are specified as a sequence
|
The previous output reveals that `numpy` arrays are specified as a sequence
|
||||||
of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*.
|
of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*.
|
||||||
|
|
||||||
|
|
||||||
`Python` (and hence `numpy`) uses 0-based
|
`Python` (and hence `numpy`) uses 0-based
|
||||||
indexing. This means that to access the top left element of `x_reshape`,
|
indexing. This means that to access the top left element of `x_reshape`,
|
||||||
@@ -315,13 +308,13 @@ print('x_reshape after we modify its top left element:\n', x_reshape)
|
|||||||
print('x after we modify top left element of x_reshape:\n', x)
|
print('x after we modify top left element of x_reshape:\n', x)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory.
|
Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory.
|
||||||
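One way to check whether two arrays share memory is `np.shares_memory()`. The sketch below assumes a six-element array like the one reshaped above; the names `x_view` and `x_copy` are introduced here only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# For a contiguous array like this one, reshape() returns a view on the same data.
x_view = x.reshape((2, 3))
# An explicit copy() gets its own memory.
x_copy = x.reshape((2, 3)).copy()

x_view[0, 0] = 5     # also changes x[0]
x_copy[0, 1] = 99    # leaves x untouched

print(np.shares_memory(x, x_view), np.shares_memory(x, x_copy))
print(x)
```

Calling `copy()` is the usual choice when the reshaped array should be modified without affecting the original.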
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces
|
We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces
|
||||||
an *exception*, or error.
|
an *exception*, or error.
|
||||||
|
|
||||||
@@ -330,8 +323,8 @@ my_tuple = (3, 4, 5)
|
|||||||
my_tuple[0] = 2
|
my_tuple[0] = 2
|
||||||
|
|
||||||
```
|
```
|
||||||
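If we want to see the exception without halting the notebook, we can catch it. This is a minimal sketch; the names `message` and `new_tuple` are ours.

```python
my_tuple = (3, 4, 5)
message = None

try:
    my_tuple[0] = 2
except TypeError as err:
    # Tuples are immutable, so item assignment raises a TypeError.
    message = str(err)
    print('caught:', message)

# A changed version has to be built as a new tuple.
new_tuple = (2,) + my_tuple[1:]
print(new_tuple)
```

The original tuple is left as it was; only `new_tuple` carries the changed first entry.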
|
|
||||||
|
|
||||||
We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple.
|
We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple.
|
||||||
The `ndim` attribute yields the number of dimensions, and `T` provides its transpose.
|
The `ndim` attribute yields the number of dimensions, and `T` provides its transpose.
|
||||||
|
|
||||||
@@ -339,7 +332,7 @@ The `ndim` attribute yields the number of dimensions, and `T` provides its tran
|
|||||||
x_reshape.shape, x_reshape.ndim, x_reshape.T
|
x_reshape.shape, x_reshape.ndim, x_reshape.T
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple.
|
Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple.
|
||||||
|
|
||||||
We will often want to apply functions to arrays.
|
We will often want to apply functions to arrays.
|
||||||
@@ -350,22 +343,22 @@ square root of the entries using the `np.sqrt()` function:
|
|||||||
np.sqrt(x)
|
np.sqrt(x)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can also square the elements:
|
We can also square the elements:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
x**2
|
x**2
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2.
|
We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
x**0.5
|
x**0.5
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Throughout this book, we will often want to generate random data.
|
Throughout this book, we will often want to generate random data.
|
||||||
The `np.random.normal()` function generates a vector of random
|
The `np.random.normal()` function generates a vector of random
|
||||||
normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`.
|
normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`.
|
||||||
@@ -382,7 +375,7 @@ x = np.random.normal(size=50)
|
|||||||
x
|
x
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`.
|
We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -394,7 +387,7 @@ correlation between `x` and `y`.
|
|||||||
```{python}
|
```{python}
|
||||||
np.corrcoef(x, y)
|
np.corrcoef(x, y)
|
||||||
```
|
```
|
||||||
|
|
||||||
If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few
|
If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few
|
||||||
commands. In particular,
|
commands. In particular,
|
||||||
each
|
each
|
||||||
@@ -407,7 +400,7 @@ print(np.random.normal(scale=5, size=2))
|
|||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
In order to ensure that our code provides exactly the same results
|
In order to ensure that our code provides exactly the same results
|
||||||
each time it is run, we can set a *random seed*
|
each time it is run, we can set a *random seed*
|
||||||
using the
|
using the
|
||||||
@@ -423,7 +416,7 @@ print(rng.normal(scale=5, size=2))
|
|||||||
rng2 = np.random.default_rng(1303)
|
rng2 = np.random.default_rng(1303)
|
||||||
print(rng2.normal(scale=5, size=2))
|
print(rng2.normal(scale=5, size=2))
|
||||||
```
|
```
|
||||||
|
|
||||||
Throughout the labs in this book, we use `np.random.default_rng()` whenever we
|
Throughout the labs in this book, we use `np.random.default_rng()` whenever we
|
||||||
perform calculations involving random quantities within `numpy`. In principle, this
|
perform calculations involving random quantities within `numpy`. In principle, this
|
||||||
should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible
|
should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible
|
||||||
@@ -446,7 +439,7 @@ np.mean(y), y.mean()
|
|||||||
```{python}
|
```{python}
|
||||||
np.var(y), y.var(), np.mean((y - y.mean())**2)
|
np.var(y), y.var(), np.mean((y - y.mean())**2)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Notice that by default `np.var()` divides by the sample size $n$ rather
|
Notice that by default `np.var()` divides by the sample size $n$ rather
|
||||||
than $n-1$; see the `ddof` argument in `np.var?`.
|
than $n-1$; see the `ddof` argument in `np.var?`.
|
||||||
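To make the role of `ddof` concrete, the sketch below compares the default with `ddof=1` on a small simulated sample. The seed and sample size are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for this sketch
y = rng.normal(size=10)
n = y.size

var_n = np.var(y)                    # divides by n (ddof=0, the default)
var_n_minus_1 = np.var(y, ddof=1)    # divides by n - 1

# The n - 1 version written out by hand.
manual = np.sum((y - y.mean())**2) / (n - 1)

print(var_n, var_n_minus_1, manual)
```

The two estimates differ by the factor $n/(n-1)$, which matters mostly for small samples.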
@@ -455,7 +448,7 @@ than $n-1$; see the `ddof` argument in `np.var?`.
|
|||||||
```{python}
|
```{python}
|
||||||
np.sqrt(np.var(y)), np.std(y)
|
np.sqrt(np.var(y)), np.std(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix.
|
The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix.
|
||||||
To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums.
|
To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums.
|
||||||
|
|
||||||
@@ -469,14 +462,14 @@ Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its
|
|||||||
```{python}
|
```{python}
|
||||||
X.mean(axis=0)
|
X.mean(axis=0)
|
||||||
```
|
```
|
||||||
|
|
||||||
The following yields the same result.
|
The following yields the same result.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
X.mean(0)
|
X.mean(0)
|
||||||
```
|
```
|
||||||
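A quick way to keep the `axis` argument straight is to look at the shape of the result: the axis we name is the one that disappears. The sketch below uses a freshly simulated $10 \times 3$ matrix with an arbitrary seed; `col_means`, `row_means`, and `row_sums` are names chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed for this sketch
X = rng.standard_normal((10, 3))

col_means = X.mean(axis=0)   # collapses the rows: one value per column
row_means = X.mean(axis=1)   # collapses the columns: one value per row
row_sums = X.sum(axis=1)     # row sums, again one value per row

print(col_means.shape, row_means.shape, row_sums.shape)
```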
|
|
||||||
|
|
||||||
|
|
||||||
## Graphics
|
## Graphics
|
||||||
In `Python`, common practice is to use the library
|
In `Python`, common practice is to use the library
|
||||||
@@ -542,7 +535,7 @@ As an alternative, we could use the `ax.scatter()` function to create a scatter
|
|||||||
fig, ax = subplots(figsize=(8, 8))
|
fig, ax = subplots(figsize=(8, 8))
|
||||||
ax.scatter(x, y, marker='o');
|
ax.scatter(x, y, marker='o');
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice that in the code blocks above, we have ended
|
Notice that in the code blocks above, we have ended
|
||||||
the last line with a semicolon. This prevents `ax.plot(x, y)` from printing
|
the last line with a semicolon. This prevents `ax.plot(x, y)` from printing
|
||||||
text to the notebook. However, it does not prevent a plot from being produced.
|
text to the notebook. However, it does not prevent a plot from being produced.
|
||||||
@@ -583,7 +576,7 @@ fig.set_size_inches(12,3)
|
|||||||
fig
|
fig
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Occasionally we will want to create several plots within a figure. This can be
|
Occasionally we will want to create several plots within a figure. This can be
|
||||||
achieved by passing additional arguments to `subplots()`.
|
achieved by passing additional arguments to `subplots()`.
|
||||||
@@ -612,8 +605,8 @@ Type `subplots?` to learn more about
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
To save the output of `fig`, we call its `savefig()`
|
To save the output of `fig`, we call its `savefig()`
|
||||||
method. The argument `dpi` is the dots per inch, used
|
method. The argument `dpi` is the dots per inch, used
|
||||||
to determine how large the figure will be in pixels.
|
to determine how large the figure will be in pixels.
|
||||||
@@ -623,7 +616,7 @@ fig.savefig("Figure.png", dpi=400)
|
|||||||
fig.savefig("Figure.pdf", dpi=200);
|
fig.savefig("Figure.pdf", dpi=200);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it.
|
We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it.
|
||||||
|
|
||||||
@@ -675,7 +668,7 @@ fig, ax = subplots(figsize=(8, 8))
|
|||||||
ax.imshow(f);
|
ax.imshow(f);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Sequences and Slice Notation
|
## Sequences and Slice Notation
|
||||||
|
|
||||||
@@ -689,8 +682,8 @@ seq1 = np.linspace(0, 10, 11)
|
|||||||
seq1
|
seq1
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The function `np.arange()`
|
The function `np.arange()`
|
||||||
returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence
|
returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence
|
||||||
that starts at $0$ and ends at $10$.
|
that starts at $0$ and ends at $10$.
|
||||||
@@ -700,7 +693,7 @@ seq2 = np.arange(0, 10)
|
|||||||
seq2
|
seq2
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Why isn't $10$ output above? This has to do with *slice* notation in `Python`.
|
Why isn't $10$ output above? This has to do with *slice* notation in `Python`.
|
||||||
Slice notation
|
Slice notation
|
||||||
is used to index sequences such as lists, tuples and arrays.
|
is used to index sequences such as lists, tuples and arrays.
|
||||||
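The common thread is that the stop value is excluded, both in `np.arange()` and in slices. A minimal sketch follows; `seq3`, `digits`, and `middle` are names introduced here for illustration.

```python
import numpy as np

seq2 = np.arange(0, 10)    # the stop value 10 is excluded
seq3 = np.arange(0, 11)    # raise the stop value by one to include 10

digits = "0123456789"
middle = digits[3:6]       # characters at positions 3, 4, and 5

print(seq2[-1], seq3[-1], middle)
```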
@@ -742,7 +735,7 @@ See the documentation `slice?` for useful options in creating slices.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Indexing Data
|
## Indexing Data
|
||||||
To begin, we create a two-dimensional `numpy` array.
|
To begin, we create a two-dimensional `numpy` array.
|
||||||
@@ -752,7 +745,7 @@ A = np.array(np.arange(16)).reshape((4, 4))
|
|||||||
A
|
A
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Typing `A[1,2]` retrieves the element corresponding to the second row and third
|
Typing `A[1,2]` retrieves the element corresponding to the second row and third
|
||||||
column. (As usual, `Python` indexes from $0.$)
|
column. (As usual, `Python` indexes from $0.$)
|
||||||
|
|
||||||
@@ -760,7 +753,7 @@ column. (As usual, `Python` indexes from $0.$)
|
|||||||
A[1,2]
|
A[1,2]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The first number after the open-bracket symbol `[`
|
The first number after the open-bracket symbol `[`
|
||||||
refers to the row, and the second number refers to the column.
|
refers to the row, and the second number refers to the column.
|
||||||
|
|
||||||
@@ -772,7 +765,7 @@ The first number after the open-bracket symbol `[`
|
|||||||
A[[1,3]]
|
A[[1,3]]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets.
|
To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets.
|
||||||
In this case we need to supply the first argument `:`
|
In this case we need to supply the first argument `:`
|
||||||
which selects all rows.
|
which selects all rows.
|
||||||
@@ -781,7 +774,7 @@ which selects all rows.
|
|||||||
A[:,[0,2]]
|
A[:,[0,2]]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Now, suppose that we want to select the submatrix made up of the second and fourth
|
Now, suppose that we want to select the submatrix made up of the second and fourth
|
||||||
rows as well as the first and third columns. This is where
|
rows as well as the first and third columns. This is where
|
||||||
indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:
|
indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:
|
||||||
@@ -790,21 +783,21 @@ indexing gets slightly tricky. It is natural to try to use lists to retrieve th
|
|||||||
A[[1,3],[0,2]]
|
A[[1,3],[0,2]]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Oops --- what happened? We got a one-dimensional array of length two identical to
|
Oops --- what happened? We got a one-dimensional array of length two identical to
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
np.array([A[1,0],A[3,2]])
|
np.array([A[1,0],A[3,2]])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns:
|
Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
A[[1,3],[0,2,3]]
|
A[[1,3],[0,2,3]]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.
|
We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.
|
||||||
|
|
||||||
One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns.
|
One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns.
|
||||||
@@ -815,7 +808,7 @@ A[[1,3]][:,[0,2]]
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
There are more efficient ways of achieving the same result.
|
There are more efficient ways of achieving the same result.
|
||||||
|
|
||||||
@@ -827,7 +820,7 @@ idx = np.ix_([1,3],[0,2,3])
|
|||||||
A[idx]
|
A[idx]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Alternatively, we can subset matrices efficiently using slices.
|
Alternatively, we can subset matrices efficiently using slices.
|
||||||
|
|
||||||
@@ -841,7 +834,7 @@ A[1:4:2,0:3:2]
|
|||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Why are we able to retrieve a submatrix directly using slices but not using lists?
|
Why are we able to retrieve a submatrix directly using slices but not using lists?
|
||||||
It's because they are different `Python` types, and
|
It's because they are different `Python` types, and
|
||||||
are treated differently by `numpy`.
|
are treated differently by `numpy`.
|
||||||
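The three behaviors can be placed side by side. The sketch below rebuilds the same $4 \times 4$ array; the names `pairs`, `block`, `block_slices`, and `mismatch_error` are ours.

```python
import numpy as np

A = np.array(np.arange(16)).reshape((4, 4))

pairs = A[[1, 3], [0, 2]]            # two entries: A[1,0] and A[3,2]
block = A[np.ix_([1, 3], [0, 2])]    # 2 x 2 submatrix built from index lists
block_slices = A[1:4:2, 0:3:2]       # the same submatrix, using slices

# Lists of unequal length cannot be paired up, so numpy raises an IndexError.
try:
    A[[1, 3], [0, 2, 3]]
    mismatch_error = None
except IndexError as err:
    mismatch_error = err

print(pairs)
print(block)
```

Slices only cover evenly spaced rows and columns, so `np.ix_()` remains the more general tool when the indices are irregular.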
@@ -857,7 +850,7 @@ Slices can be used to extract objects from arbitrary sequences, such as strings,
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Boolean Indexing
|
### Boolean Indexing
|
||||||
In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively).
|
In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively).
|
||||||
@@ -874,7 +867,7 @@ keep_rows[[1,3]] = True
|
|||||||
keep_rows
|
keep_rows
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that the elements of `keep_rows`, when viewed as integers, are the same as the
|
Note that the elements of `keep_rows`, when viewed as integers, are the same as the
|
||||||
values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When
|
values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When
|
||||||
applied to two arrays, the `==` operation is applied elementwise.
|
applied to two arrays, the `==` operation is applied elementwise.
|
||||||
@@ -883,7 +876,7 @@ applied to two arrays, the `==` operation is applied elementwise.
|
|||||||
np.all(keep_rows == np.array([0,1,0,1]))
|
np.all(keep_rows == np.array([0,1,0,1]))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
(Here, the function `np.all()` has checked whether
|
(Here, the function `np.all()` has checked whether
|
||||||
all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.)
|
all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.)
|
||||||
|
|
||||||
@@ -895,14 +888,14 @@ The former retrieves the first, second, first, and second rows of `A`.
|
|||||||
A[np.array([0,1,0,1])]
|
A[np.array([0,1,0,1])]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `True`.
|
By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `True`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
A[keep_rows]
|
A[keep_rows]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This example shows that Booleans and integers are treated differently by `numpy`.
|
This example shows that Booleans and integers are treated differently by `numpy`.
|
||||||
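If integer positions are what we need, a Boolean mask can be converted with `np.flatnonzero()` rather than by casting it to integers. This is a sketch under the assumption that `keep_rows` was created as a Boolean array of zeros; `as_ints`, `positions`, and the `by_*` names are introduced here for illustration.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))
keep_rows = np.zeros(A.shape[0], bool)
keep_rows[[1, 3]] = True

as_ints = keep_rows.astype(int)          # the values 0, 1, 0, 1
positions = np.flatnonzero(keep_rows)    # positions of the True entries

by_mask = A[keep_rows]         # second and fourth rows
by_ints = A[as_ints]           # first, second, first, second rows
by_positions = A[positions]    # second and fourth rows again

print(by_mask.shape, by_ints.shape, by_positions.shape)
```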
|
|
||||||
|
|
||||||
@@ -926,7 +919,7 @@ A[idx_mixed]
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
For more details on indexing in `numpy`, readers are referred
|
For more details on indexing in `numpy`, readers are referred
|
||||||
to the `numpy` tutorial mentioned earlier.
|
to the `numpy` tutorial mentioned earlier.
|
||||||
@@ -979,7 +972,7 @@ files. Before loading data into `Python`, it is a good idea to view it using
|
|||||||
a text editor or other software, such as Microsoft Excel.
|
a text editor or other software, such as Microsoft Excel.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
We now take a look at the column of `Auto` corresponding to the variable `horsepower`:
|
We now take a look at the column of `Auto` corresponding to the variable `horsepower`:
|
||||||
|
|
||||||
@@ -1000,7 +993,7 @@ We see the culprit is the value `?`, which is being used to encode missing value
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`.
|
To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`.
|
||||||
Now, each instance of `?` in the file is replaced with the
|
Now, each instance of `?` in the file is replaced with the
|
||||||
value `np.nan`, which means *not a number*:
|
value `np.nan`, which means *not a number*:
|
||||||
@@ -1012,8 +1005,8 @@ Auto = pd.read_csv('Auto.data',
|
|||||||
Auto['horsepower'].sum()
|
Auto['horsepower'].sum()
|
||||||
|
|
||||||
```
|
```
|
||||||
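The effect of `na_values` is easy to see on a tiny file. The sketch below uses a made-up, comma-separated, three-row table held in memory rather than the actual `Auto.data` file; `raw_text`, `as_read`, and `cleaned` are names chosen for this illustration.

```python
import io
import pandas as pd

# A made-up three-row file; '?' marks a missing horsepower value.
raw_text = "mpg,horsepower\n18.0,130\n25.0,?\n15.0,165\n"

as_read = pd.read_csv(io.StringIO(raw_text))
cleaned = pd.read_csv(io.StringIO(raw_text), na_values=['?'])

print(as_read['horsepower'].dtype)    # non-numeric, because of the '?'
print(cleaned['horsepower'].dtype)    # numeric, with NaN for the missing entry
print(cleaned.dropna().shape)
```

Once the column is numeric, methods such as `sum()` skip the missing entry by default.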
|
|
||||||
|
|
||||||
The `Auto.shape` attribute tells us that the data has 397
|
The `Auto.shape` attribute tells us that the data has 397
|
||||||
observations, or rows, and nine variables, or columns.
|
observations, or rows, and nine variables, or columns.
|
||||||
|
|
||||||
@@ -1021,7 +1014,7 @@ observations, or rows, and nine variables, or columns.
|
|||||||
Auto.shape
|
Auto.shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
There are
|
There are
|
||||||
various ways to deal with missing data.
|
various ways to deal with missing data.
|
||||||
In this case, since only five of the rows contain missing
|
In this case, since only five of the rows contain missing
|
||||||
@@ -1032,7 +1025,7 @@ Auto_new = Auto.dropna()
|
|||||||
Auto_new.shape
|
Auto_new.shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Basics of Selecting Rows and Columns
|
### Basics of Selecting Rows and Columns
|
||||||
|
|
||||||
@@ -1043,7 +1036,7 @@ Auto = Auto_new # overwrite the previous value
|
|||||||
Auto.columns
|
Auto.columns
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array.
|
Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array.
|
||||||
Recall that the first argument to the `[]` method
|
Recall that the first argument to the `[]` method
|
||||||
@@ -1328,8 +1321,8 @@ Auto.plot.scatter('horsepower', 'mpg', ax=axes[1]);
|
|||||||
```
|
```
|
||||||
|
|
||||||
Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`.
|
Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`.
|
||||||
|
|
||||||
|
|
||||||
We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable.
|
We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable.
|
||||||
However, since there is only a small number of possible values for this variable, we may wish to treat it as
|
However, since there is only a small number of possible values for this variable, we may wish to treat it as
|
||||||
qualitative. Below, we replace
|
qualitative. Below, we replace
|
||||||
@@ -1348,7 +1341,7 @@ fig, ax = subplots(figsize=(8, 8))
|
|||||||
Auto.boxplot('mpg', by='cylinders', ax=ax);
|
Auto.boxplot('mpg', by='cylinders', ax=ax);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The `hist()` method can be used to plot a *histogram*.
|
The `hist()` method can be used to plot a *histogram*.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -1395,7 +1388,7 @@ Auto['cylinders'].describe()
|
|||||||
Auto['mpg'].describe()
|
Auto['mpg'].describe()
|
||||||
|
|
||||||
```
|
```
|
||||||
To exit `Jupyter`, select `File / Close and Halt`.
|
To exit `Jupyter`, select `File / Shut Down`.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,21 +1,13 @@
|
|||||||
---
|
|
||||||
jupyter:
|
# Linear Regression
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch03-linreg-lab.ipynb">
|
||||||
formats: Rmd,ipynb
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
main_language: python
|
</a>
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
[Binder](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch03-linreg-lab.ipynb)
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 3
|
|
||||||
|
|
||||||
|
|
||||||
# Lab: Linear Regression
|
|
||||||
|
|
||||||
## Importing packages
|
## Importing packages
|
||||||
We import our standard libraries at this top
|
We import our standard libraries at this top
|
||||||
@@ -27,7 +19,7 @@ import pandas as pd
|
|||||||
from matplotlib.pyplot import subplots
|
from matplotlib.pyplot import subplots
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### New imports
|
### New imports
|
||||||
Throughout this lab we will introduce new functions and libraries. However,
|
Throughout this lab we will introduce new functions and libraries. However,
|
||||||
@@ -103,7 +95,7 @@ A.sum()
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Simple Linear Regression
|
## Simple Linear Regression
|
||||||
In this section we will construct model
|
In this section we will construct model
|
||||||
@@ -125,7 +117,7 @@ Boston = load_data("Boston")
|
|||||||
Boston.columns
|
Boston.columns
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Type `Boston?` to find out more about these data.
|
Type `Boston?` to find out more about these data.
|
||||||
|
|
||||||
We start by using the `sm.OLS()` function to fit a
|
We start by using the `sm.OLS()` function to fit a
|
||||||
@@ -140,7 +132,7 @@ X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
|
|||||||
X[:4]
|
X[:4]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We extract the response, and fit the model.
|
We extract the response, and fit the model.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -162,7 +154,7 @@ method, and returns such a summary.
|
|||||||
summarize(results)
|
summarize(results)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix~`X`.
|
Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix~`X`.
|
||||||
### Using Transformations: Fit and Transform
|
### Using Transformations: Fit and Transform
|
||||||
@@ -233,8 +225,8 @@ The fitted coefficients can also be retrieved as the
|
|||||||
results.params
|
results.params
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
|
The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
|
||||||
prediction intervals for the prediction of `medv` for given values of `lstat`.
|
prediction intervals for the prediction of `medv` for given values of `lstat`.
|
||||||
|
|
||||||
@@ -339,7 +331,7 @@ As mentioned above, there is an existing function to add a line to a plot --- `a
|
|||||||
|
|
||||||
|
|
||||||
Next we examine some diagnostic plots, several of which were discussed
|
Next we examine some diagnostic plots, several of which were discussed
|
||||||
in Section 3.3.3.
|
in Section~\ref{Ch3:problems.sec}.
|
||||||
We can find the fitted values and residuals
|
We can find the fitted values and residuals
|
||||||
of the fit as attributes of the `results` object.
|
of the fit as attributes of the `results` object.
|
||||||
Various influence measures describing the regression model
|
Various influence measures describing the regression model
|
||||||
@@ -404,7 +396,7 @@ terms = Boston.columns.drop('medv')
|
|||||||
terms
|
terms
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can now fit the model with all the variables in `terms` using
|
We can now fit the model with all the variables in `terms` using
|
||||||
the same model matrix builder.
|
the same model matrix builder.
|
||||||
|
|
||||||
@@ -415,7 +407,7 @@ results = model.fit()
|
|||||||
summarize(results)
|
summarize(results)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
What if we would like to perform a regression using all of the variables but one? For
|
What if we would like to perform a regression using all of the variables but one? For
|
||||||
example, in the above regression output, `age` has a high $p$-value.
|
example, in the above regression output, `age` has a high $p$-value.
|
||||||
So we may wish to run a regression excluding this predictor.
|
So we may wish to run a regression excluding this predictor.
|
||||||
@@ -436,7 +428,7 @@ We can access the individual components of `results` by name
|
|||||||
and
|
and
|
||||||
`np.sqrt(results.scale)` gives us the RSE.
|
`np.sqrt(results.scale)` gives us the RSE.
|
||||||
|
|
||||||
Variance inflation factors (section 3.3.3) are sometimes useful
|
Variance inflation factors (section~\ref{Ch3:problems.sec}) are sometimes useful
|
||||||
to assess the effect of collinearity in the model matrix of a regression model.
|
to assess the effect of collinearity in the model matrix of a regression model.
|
||||||
We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.
|
We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.
|
||||||
|
|
||||||
@@ -490,7 +482,7 @@ model2 = sm.OLS(y, X)
|
|||||||
summarize(model2.fit())
|
summarize(model2.fit())
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Non-linear Transformations of the Predictors
|
## Non-linear Transformations of the Predictors
|
||||||
The model matrix builder can include terms beyond
|
The model matrix builder can include terms beyond
|
||||||
@@ -557,7 +549,7 @@ ax = subplots(figsize=(8,8))[1]
|
|||||||
ax.scatter(results3.fittedvalues, results3.resid)
|
ax.scatter(results3.fittedvalues, results3.resid)
|
||||||
ax.set_xlabel('Fitted value')
|
ax.set_xlabel('Fitted value')
|
||||||
ax.set_ylabel('Residual')
|
ax.set_ylabel('Residual')
|
||||||
ax.axhline(0, c='k', ls='--')
|
ax.axhline(0, c='k', ls='--');
|
||||||
|
|
||||||
```
|
```
|
||||||
We see that when the quadratic term is included in the model,
|
We see that when the quadratic term is included in the model,
|
||||||
@@ -565,7 +557,7 @@ there is little discernible pattern in the residuals.
|
|||||||
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
|
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
|
||||||
to `poly()`.
|
to `poly()`.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Qualitative Predictors
|
## Qualitative Predictors
|
||||||
Here we use the `Carseats` data, which is included in the
|
Here we use the `Carseats` data, which is included in the
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,24 +1,16 @@
|
|||||||
---
|
|
||||||
jupyter:
|
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
|
||||||
formats: Rmd,ipynb
|
|
||||||
main_language: python
|
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 4
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Logistic Regression, LDA, QDA, and KNN
|
||||||
|
|
||||||
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch04-classification-lab.ipynb">
|
||||||
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
|
</a>
|
||||||
|
|
||||||
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch04-classification-lab.ipynb)
|
||||||
|
|
||||||
|
|
||||||
# Lab: Logistic Regression, LDA, QDA, and KNN
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@@ -404,7 +396,7 @@ lda.fit(X_train, L_train)
|
|||||||
|
|
||||||
```
|
```
|
||||||
Here we have used the list comprehensions introduced
|
Here we have used the list comprehensions introduced
|
||||||
in Section 3.6.4. Looking at our first line above, we see that the right-hand side is a list
|
in Section~\ref{Ch3-linreg-lab:multivariate-goodness-of-fit}. Looking at our first line above, we see that the right-hand side is a list
|
||||||
of length two. This is because the code `for M in [X_train, X_test]` iterates over a list
|
of length two. This is because the code `for M in [X_train, X_test]` iterates over a list
|
||||||
of length two. While here we loop over a list,
|
of length two. While here we loop over a list,
|
||||||
the list comprehension method works when looping over any iterable object.
|
the list comprehension method works when looping over any iterable object.
|
||||||
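To illustrate the remark that a list comprehension works over any iterable, here is a minimal sketch. The small nested lists are hypothetical stand-ins for `X_train` and `X_test`, not the lab's data.

```python
# Hypothetical stand-ins for X_train and X_test: two small "matrices" stored as lists of rows.
X_train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X_test = [[7.0, 8.0]]

# Looping over a list of length two produces a list of length two.
n_rows = [len(M) for M in [X_train, X_test]]

# The same syntax works for other iterables: a range, a string, a generator.
squares = [k**2 for k in range(1, 6)]
letters = [ch.upper() for ch in 'knn']
doubled = [2 * v for v in (x for x in (1, 2, 3))]

print(n_rows, squares, letters, doubled)
```

The result is always a list, whatever kind of iterable is looped over.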
@@ -453,7 +445,7 @@ lda.scalings_
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
These values provide the linear combination of `Lag1` and `Lag2` that is used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (4.24).
|
These values provide the linear combination of `Lag1` and `Lag2` that is used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (\ref{Ch4:bayes.multi}).
|
||||||
If $-0.64\times `Lag1` - 0.51 \times `Lag2` $ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
|
If $-0.64\times `Lag1` - 0.51 \times `Lag2` $ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -462,7 +454,7 @@ lda_pred = lda.predict(X_test)
|
|||||||
```
|
```
|
||||||
|
|
||||||
As we observed in our comparison of classification methods
|
As we observed in our comparison of classification methods
|
||||||
(Section 4.5), the LDA and logistic
|
(Section~\ref{Ch4:comparison.sec}), the LDA and logistic
|
||||||
regression predictions are almost identical.
|
regression predictions are almost identical.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -521,7 +513,7 @@ The LDA classifier above is the first classifier from the
|
|||||||
`sklearn` library. We will use several other objects
|
`sklearn` library. We will use several other objects
|
||||||
from this library. The objects
|
from this library. The objects
|
||||||
follow a common structure that simplifies tasks such as cross-validation,
|
follow a common structure that simplifies tasks such as cross-validation,
|
||||||
which we will see in Chapter 5. Specifically,
|
which we will see in Chapter~\ref{Ch5:resample}. Specifically,
|
||||||
the methods first create a generic classifier without
|
the methods first create a generic classifier without
|
||||||
referring to any data. This classifier is then fit
|
referring to any data. This classifier is then fit
|
||||||
to data with the `fit()` method and predictions are
|
to data with the `fit()` method and predictions are
|
||||||
@@ -807,7 +799,7 @@ feature_std.std()
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 200.
|
Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page~\pageref{Ch4-varformula}.
|
||||||
In this case it does not matter, as long as the variables are all on the same scale.
|
In this case it does not matter, as long as the variables are all on the same scale.
|
||||||
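The effect of the two variance conventions can be seen in a small sketch using only the standard library. The values in `x` are made up for illustration. `statistics.pstdev()` plays the role of the $1/n$ convention and `statistics.stdev()` the role of the $1/(n-1)$ convention; neither is the lab's own code.

```python
import math
import statistics

# Hypothetical feature values, used only for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
mean = statistics.fmean(x)

# Standardize using the 1/n convention (population standard deviation).
sd_n = statistics.pstdev(x)
z = [(v - mean) / sd_n for v in x]

# Measure the spread of the standardized values with each convention.
sd_z_n = statistics.pstdev(z)        # 1/n convention
sd_z_n_minus_1 = statistics.stdev(z)  # 1/(n-1) convention

print(sd_z_n, sd_z_n_minus_1, math.sqrt(n / (n - 1)))
```

Under these assumptions the second value exceeds the first by the factor $\sqrt{n/(n-1)}$, which is close to $1$ when $n$ is large.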
|
|
||||||
Using the function `train_test_split()` we now split the observations into a test set,
|
Using the function `train_test_split()` we now split the observations into a test set,
|
||||||
@@ -874,7 +866,7 @@ This is double the rate that one would obtain from random guessing.
|
|||||||
The number of neighbors in KNN is referred to as a *tuning parameter*, also referred to as a *hyperparameter*.
|
The number of neighbors in KNN is referred to as a *tuning parameter*, also referred to as a *hyperparameter*.
|
||||||
We do not know *a priori* what value to use. It is therefore of interest
|
We do not know *a priori* what value to use. It is therefore of interest
|
||||||
to see how the classifier performs on test data as we vary these
|
to see how the classifier performs on test data as we vary these
|
||||||
parameters. This can be achieved with a `for` loop, described in Section 2.3.8.
|
parameters. This can be achieved with a `for` loop, described in Section~\ref{Ch2-statlearn-lab:for-loops}.
|
||||||
Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase
|
Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase
|
||||||
insurance as we vary the number of neighbors from 1 to 5:
|
insurance as we vary the number of neighbors from 1 to 5:
|
||||||
|
|
||||||
@@ -901,7 +893,7 @@ As a comparison, we can also fit a logistic regression model to the
|
|||||||
data. This can also be done
|
data. This can also be done
|
||||||
with `sklearn`, though by default it fits
|
with `sklearn`, though by default it fits
|
||||||
something like the *ridge regression* version
|
something like the *ridge regression* version
|
||||||
of logistic regression, which we introduce in Chapter 6. This can
|
of logistic regression, which we introduce in Chapter~\ref{Ch6:varselect}. This can
|
||||||
be modified by appropriately setting the argument `C` below. Its default
|
be modified by appropriately setting the argument `C` below. Its default
|
||||||
value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)
|
value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)
|
||||||
logistic regression estimator discussed above.
|
logistic regression estimator discussed above.
|
||||||
@@ -945,7 +937,7 @@ confusion_table(logit_labels, y_test)
|
|||||||
|
|
||||||
```
|
```
|
||||||
## Linear and Poisson Regression on the Bikeshare Data
|
## Linear and Poisson Regression on the Bikeshare Data
|
||||||
Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section 4.6.
|
Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section~\ref{Ch4:sec:pois}.
|
||||||
The response `bikers` measures the number of bike rentals per hour
|
The response `bikers` measures the number of bike rentals per hour
|
||||||
in Washington, DC in the period 2010--2012.
|
in Washington, DC in the period 2010--2012.
|
||||||
|
|
||||||
@@ -986,7 +978,7 @@ variables constant, there are on average about 7 more riders in
|
|||||||
February than in January. Similarly there are about 16.5 more riders
|
February than in January. Similarly there are about 16.5 more riders
|
||||||
in March than in January.
|
in March than in January.
|
||||||
|
|
||||||
The results seen in Section 4.6.1
|
The results seen in Section~\ref{sec:bikeshare.linear}
|
||||||
used a slightly different coding of the variables `hr` and `mnth`, as follows:
|
used a slightly different coding of the variables `hr` and `mnth`, as follows:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -1040,7 +1032,7 @@ np.allclose(M_lm.fittedvalues, M2_lm.fittedvalues)
|
|||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
To reproduce the left-hand side of Figure 4.13
|
To reproduce the left-hand side of Figure~\ref{Ch4:bikeshare}
|
||||||
we must first obtain the coefficient estimates associated with
|
we must first obtain the coefficient estimates associated with
|
||||||
`mnth`. The coefficients for January through November can be obtained
|
`mnth`. The coefficients for January through November can be obtained
|
||||||
directly from the `M2_lm` object. The coefficient for December
|
directly from the `M2_lm` object. The coefficient for December
|
||||||
@@ -1080,7 +1072,7 @@ ax_month.set_ylabel('Coefficient', fontsize=20);
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Reproducing the right-hand plot in Figure 4.13 follows a similar process.
|
Reproducing the right-hand plot in Figure~\ref{Ch4:bikeshare} follows a similar process.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
coef_hr = S2[S2.index.str.contains('hr')]['coef']
|
coef_hr = S2[S2.index.str.contains('hr')]['coef']
|
||||||
@@ -1115,7 +1107,7 @@ M_pois = sm.GLM(Y, X2, family=sm.families.Poisson()).fit()
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure 4.15. We first complete these coefficients as before.
|
We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure~\ref{Ch4:bikeshare.pois}. We first complete these coefficients as before.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
S_pois = summarize(M_pois)
|
S_pois = summarize(M_pois)
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,23 +1,15 @@
|
|||||||
---
|
|
||||||
jupyter:
|
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
|
||||||
formats: Rmd,ipynb
|
|
||||||
main_language: python
|
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 5
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Cross-Validation and the Bootstrap
|
||||||
|
|
||||||
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb">
|
||||||
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
|
</a>
|
||||||
|
|
||||||
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch05-resample-lab.ipynb)
|
||||||
|
|
||||||
|
|
||||||
# Lab: Cross-Validation and the Bootstrap
|
|
||||||
In this lab, we explore the resampling techniques covered in this
|
In this lab, we explore the resampling techniques covered in this
|
||||||
chapter. Some of the commands in this lab may take a while to run on
|
chapter. Some of the commands in this lab may take a while to run on
|
||||||
your computer.
|
your computer.
|
||||||
@@ -235,7 +227,7 @@ for i, d in enumerate(range(1,6)):
|
|||||||
cv_error
|
cv_error
|
||||||
|
|
||||||
```
|
```
|
||||||
As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
|
As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and
|
||||||
quadratic fits, but then no clear improvement from using higher-degree polynomials.
|
quadratic fits, but then no clear improvement from using higher-degree polynomials.
|
||||||
|
|
||||||
Above we introduced the `outer()` method of the `np.power()`
|
Above we introduced the `outer()` method of the `np.power()`
|
||||||
@@ -276,7 +268,7 @@ cv_error
|
|||||||
Notice that the computation time is much shorter than that of LOOCV.
|
Notice that the computation time is much shorter than that of LOOCV.
|
||||||
(In principle, the computation time for LOOCV for a least squares
|
(In principle, the computation time for LOOCV for a least squares
|
||||||
linear model should be faster than for $K$-fold CV, due to the
|
linear model should be faster than for $K$-fold CV, due to the
|
||||||
availability of the formula (5.2) for LOOCV;
|
availability of the formula~(\ref{Ch5:eq:LOOCVform}) for LOOCV;
|
||||||
however, the generic `cross_validate()` function does not make
|
however, the generic `cross_validate()` function does not make
|
||||||
use of this formula.) We still see little evidence that using cubic
|
use of this formula.) We still see little evidence that using cubic
|
||||||
or higher-degree polynomial terms leads to a lower test error than simply
|
or higher-degree polynomial terms leads to a lower test error than simply
|
||||||
@@ -322,7 +314,7 @@ incurred by picking different random folds.
|
|||||||
|
|
||||||
## The Bootstrap
|
## The Bootstrap
|
||||||
We illustrate the use of the bootstrap in the simple example
|
We illustrate the use of the bootstrap in the simple example
|
||||||
{of Section 5.2,} as well as on an example involving
|
{of Section~\ref{Ch5:sec:bootstrap},} as well as on an example involving
|
||||||
estimating the accuracy of the linear regression model on the `Auto`
|
estimating the accuracy of the linear regression model on the `Auto`
|
||||||
data set.
|
data set.
|
||||||
### Estimating the Accuracy of a Statistic of Interest
|
### Estimating the Accuracy of a Statistic of Interest
|
||||||
@@ -337,8 +329,8 @@ in a dataframe.
|
|||||||
To illustrate the bootstrap, we
|
To illustrate the bootstrap, we
|
||||||
start with a simple example.
|
start with a simple example.
|
||||||
The `Portfolio` data set in the `ISLP` package is described
|
The `Portfolio` data set in the `ISLP` package is described
|
||||||
in Section 5.2. The goal is to estimate the
|
in Section~\ref{Ch5:sec:bootstrap}. The goal is to estimate the
|
||||||
sampling variance of the parameter $\alpha$ given in formula (5.7). We will
|
sampling variance of the parameter $\alpha$ given in formula~(\ref{Ch5:min.var}). We will
|
||||||
create a function
|
create a function
|
||||||
`alpha_func()`, which takes as input a dataframe `D` assumed
|
`alpha_func()`, which takes as input a dataframe `D` assumed
|
||||||
to have columns `X` and `Y`, as well as a
|
to have columns `X` and `Y`, as well as a
|
||||||
@@ -357,7 +349,7 @@ def alpha_func(D, idx):
|
|||||||
```
|
```
|
||||||
This function returns an estimate for $\alpha$
|
This function returns an estimate for $\alpha$
|
||||||
based on applying the minimum
|
based on applying the minimum
|
||||||
variance formula (5.7) to the observations indexed by
|
variance formula (\ref{Ch5:min.var}) to the observations indexed by
|
||||||
the argument `idx`. For instance, the following command
|
the argument `idx`. For instance, the following command
|
||||||
estimates $\alpha$ using all 100 observations.
|
estimates $\alpha$ using all 100 observations.
|
||||||
|
|
||||||
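As a rough sketch of what such a function computes, the following applies the minimum variance formula, recalled here from the text as $\alpha = (\sigma_Y^2 - \sigma_{XY}) / (\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY})$, to plain lists. The helper names `sample_cov()` and `alpha_sketch()` and the data are hypothetical. This is not the lab's `alpha_func()`, which works on a dataframe.

```python
def sample_cov(a, b):
    """Sample covariance of two equal-length lists, using the 1/(n-1) convention."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)

def alpha_sketch(x, y, idx):
    """Minimum variance weight estimated from the observations at positions idx."""
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    var_x = sample_cov(xs, xs)
    var_y = sample_cov(ys, ys)
    cov_xy = sample_cov(xs, ys)
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

# Made-up returns, for illustration only (not the Portfolio data).
x = [0.5, -1.0, 1.5, 0.0, -0.5, 2.0]
y = [1.0, -0.5, 0.5, -1.5, 0.0, 1.0]

# Using every position corresponds to estimating alpha from all observations.
alpha_all = alpha_sketch(x, y, range(len(x)))
print(alpha_all)
```

Passing a different `idx`, for example one drawn with replacement, gives the estimate for a bootstrap sample.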
@@ -427,7 +419,7 @@ intercept and slope terms for the linear regression model that uses
|
|||||||
`horsepower` to predict `mpg` in the `Auto` data set. We
|
`horsepower` to predict `mpg` in the `Auto` data set. We
|
||||||
will compare the estimates obtained using the bootstrap to those
|
will compare the estimates obtained using the bootstrap to those
|
||||||
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
|
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
|
||||||
${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
|
${\rm SE}(\hat{\beta}_1)$ described in Section~\ref{Ch3:secoefsec}.
|
||||||
|
|
||||||
To use our `boot_SE()` function, we must write a function (its
|
To use our `boot_SE()` function, we must write a function (its
|
||||||
first argument)
|
first argument)
|
||||||
@@ -474,7 +466,7 @@ demonstrate its utility on 10 bootstrap samples.
|
|||||||
```{python}
|
```{python}
|
||||||
rng = np.random.default_rng(0)
|
rng = np.random.default_rng(0)
|
||||||
np.array([hp_func(Auto,
|
np.array([hp_func(Auto,
|
||||||
rng.choice(392,
|
rng.choice(Auto.index,
|
||||||
392,
|
392,
|
||||||
replace=True)) for _ in range(10)])
|
replace=True)) for _ in range(10)])
|
||||||
|
|
||||||
@@ -496,7 +488,7 @@ This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
|
|||||||
0.85, and that the bootstrap
|
0.85, and that the bootstrap
|
||||||
estimate for ${\rm SE}(\hat{\beta}_1)$ is
|
estimate for ${\rm SE}(\hat{\beta}_1)$ is
|
||||||
0.0074. As discussed in
|
0.0074. As discussed in
|
||||||
Section 3.1.2, standard formulas can be used to compute
|
Section~\ref{Ch3:secoefsec}, standard formulas can be used to compute
|
||||||
the standard errors for the regression coefficients in a linear
|
the standard errors for the regression coefficients in a linear
|
||||||
model. These can be obtained using the `summarize()` function
|
model. These can be obtained using the `summarize()` function
|
||||||
from `ISLP.sm`.
|
from `ISLP.sm`.
|
||||||
@@ -510,7 +502,7 @@ model_se
|
|||||||
|
|
||||||
|
|
||||||
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
|
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
|
||||||
obtained using the formulas from Section 3.1.2 are
|
obtained using the formulas from Section~\ref{Ch3:secoefsec} are
|
||||||
0.717 for the
|
0.717 for the
|
||||||
intercept and
|
intercept and
|
||||||
0.006 for the
|
0.006 for the
|
||||||
@@ -518,13 +510,13 @@ slope. Interestingly, these are somewhat different from the estimates
|
|||||||
obtained using the bootstrap. Does this indicate a problem with the
|
obtained using the bootstrap. Does this indicate a problem with the
|
||||||
bootstrap? In fact, it suggests the opposite. Recall that the
|
bootstrap? In fact, it suggests the opposite. Recall that the
|
||||||
standard formulas given in
|
standard formulas given in
|
||||||
{Equation 3.8 on page 82}
|
{Equation~\ref{Ch3:se.eqn} on page~\pageref{Ch3:se.eqn}}
|
||||||
rely on certain assumptions. For example,
|
rely on certain assumptions. For example,
|
||||||
they depend on the unknown parameter $\sigma^2$, the noise
|
they depend on the unknown parameter $\sigma^2$, the noise
|
||||||
variance. We then estimate $\sigma^2$ using the RSS. Now although the
|
variance. We then estimate $\sigma^2$ using the RSS. Now although the
|
||||||
formulas for the standard errors do not rely on the linear model being
|
formulas for the standard errors do not rely on the linear model being
|
||||||
correct, the estimate for $\sigma^2$ does. We see
|
correct, the estimate for $\sigma^2$ does. We see
|
||||||
{in Figure 3.8 on page 108} that there is
|
{in Figure~\ref{Ch3:polyplot} on page~\pageref{Ch3:polyplot}} that there is
|
||||||
a non-linear relationship in the data, and so the residuals from a
|
a non-linear relationship in the data, and so the residuals from a
|
||||||
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
|
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
|
||||||
the standard formulas assume (somewhat unrealistically) that the $x_i$
|
the standard formulas assume (somewhat unrealistically) that the $x_i$
|
||||||
@@ -537,7 +529,7 @@ the results from `sm.OLS`.
|
|||||||
Below we compute the bootstrap standard error estimates and the
|
Below we compute the bootstrap standard error estimates and the
|
||||||
standard linear regression estimates that result from fitting the
|
standard linear regression estimates that result from fitting the
|
||||||
quadratic model to the data. Since this model provides a good fit to
|
quadratic model to the data. Since this model provides a good fit to
|
||||||
the data (Figure 3.8), there is now a better
|
the data (Figure~\ref{Ch3:polyplot}), there is now a better
|
||||||
correspondence between the bootstrap estimates and the standard
|
correspondence between the bootstrap estimates and the standard
|
||||||
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
|
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
|
||||||
${\rm SE}(\hat{\beta}_2)$.
|
${\rm SE}(\hat{\beta}_2)$.
|
||||||
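The idea of a bootstrap standard error for a regression coefficient can be sketched with the standard library alone. Everything here is an assumption made for illustration: the simulated data, the helper names `slope()` and `boot_se_sketch()`, and the use of the sample standard deviation of the bootstrap replicates. It is not the lab's `boot_SE()` function.

```python
import random
import statistics

def slope(x, y, idx):
    """Least squares slope of y on x, using the observations at positions idx."""
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((u - mean_x) ** 2 for u in xs)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(xs, ys))
    return sxy / sxx

def boot_se_sketch(func, x, y, B=200, seed=0):
    """Standard deviation of func over B bootstrap samples of the positions."""
    rng = random.Random(seed)
    n = len(x)
    replicates = []
    for _ in range(B):
        # Draw n positions with replacement, then recompute the statistic.
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(func(x, y, idx))
    return statistics.stdev(replicates)

# Simulated data with true slope 2 (illustration only, not the Auto data).
data_rng = random.Random(1)
x = [i / 2 for i in range(30)]
y = [3 + 2 * v + data_rng.gauss(0, 1) for v in x]

slope_all = slope(x, y, range(len(x)))
se_slope = boot_se_sketch(slope, x, y)
print(slope_all, se_slope)
```

Because the random number generator is seeded inside `boot_se_sketch()`, repeated calls with the same arguments return the same estimate.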
|
|||||||
@@ -2,20 +2,29 @@
|
|||||||
"cells": [
|
"cells": [
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "85ad9863",
|
"id": "dc2d635a",
|
||||||
|
"metadata": {},
|
||||||
|
"source": []
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "6dde3cef",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
|
"# Cross-Validation and the Bootstrap\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# Chapter 5\n",
|
"<a target=\"_blank\" href=\"https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb\">\n",
|
||||||
"\n"
|
"<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
|
||||||
|
"</a>\n",
|
||||||
|
"\n",
|
||||||
|
"[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch05-resample-lab.ipynb)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "ac8b08af",
|
"id": "a9fd4324",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"# Lab: Cross-Validation and the Bootstrap\n",
|
|
||||||
"In this lab, we explore the resampling techniques covered in this\n",
|
"In this lab, we explore the resampling techniques covered in this\n",
|
||||||
"chapter. Some of the commands in this lab may take a while to run on\n",
|
"chapter. Some of the commands in this lab may take a while to run on\n",
|
||||||
"your computer.\n",
|
"your computer.\n",
|
||||||
@@ -26,9 +35,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 1,
|
"execution_count": 1,
|
||||||
"id": "e7712cfe",
|
"id": "f1deb5cc",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:13.493284Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:13.492950Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.143174Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.142882Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 2
|
"lines_to_next_cell": 2
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
@@ -44,7 +58,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "784a2ba3",
|
"id": "afa08b62",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"There are several new imports needed for this lab."
|
"There are several new imports needed for this lab."
|
||||||
@@ -53,9 +67,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 2,
|
"execution_count": 2,
|
||||||
"id": "21c2ed4f",
|
"id": "268c41b3",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.144884Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.144773Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.146541Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.146330Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 2
|
"lines_to_next_cell": 2
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
@@ -71,7 +90,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "9ac3acd5",
|
"id": "1c04f8e4",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## The Validation Set Approach\n",
|
"## The Validation Set Approach\n",
|
||||||
@@ -92,9 +111,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 3,
|
"execution_count": 3,
|
||||||
"id": "8af59641",
|
"id": "22f44ae0",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.147809Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.147730Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.152606Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.152414Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -106,7 +130,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "e76383f0",
|
"id": "318fe69f",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`."
|
"Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`."
|
||||||
@@ -115,9 +139,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 4,
|
"execution_count": 4,
|
||||||
"id": "d9b0b7c8",
|
"id": "0c32e917",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.153847Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.153757Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.157537Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.157339Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -130,7 +159,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "d196dd08",
|
"id": "7e883b8f",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We now use the `predict()` method of `results` evaluated on the model matrix for this model\n",
|
"We now use the `predict()` method of `results` evaluated on the model matrix for this model\n",
|
||||||
@@ -140,9 +169,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 5,
|
"execution_count": 5,
|
||||||
"id": "3e77d831",
|
"id": "86ce4f85",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.158717Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.158637Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.162177Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.161910Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -165,7 +199,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "f4369ee6",
|
"id": "f2ecdee6",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Hence our estimate for the validation MSE of the linear regression\n",
|
"Hence our estimate for the validation MSE of the linear regression\n",
|
||||||
@@ -179,9 +213,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 6,
|
"execution_count": 6,
|
||||||
"id": "0aa4bfcc",
|
"id": "50a66a97",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.163466Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.163397Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.165323Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.165076Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -205,7 +244,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "0271dc50",
|
"id": "a255779c",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Let’s use this function to estimate the validation MSE\n",
|
"Let’s use this function to estimate the validation MSE\n",
|
||||||
@@ -217,9 +256,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 7,
|
"execution_count": 7,
|
||||||
"id": "a0dbd55f",
|
"id": "d49b6999",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.166563Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.166497Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.177198Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.176975Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -245,7 +289,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "a7401536",
|
"id": "9d7b8fc1",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"These error rates are $23.62, 18.76$, and $18.80$, respectively. If we\n",
|
"These error rates are $23.62, 18.76$, and $18.80$, respectively. If we\n",
|
||||||
@@ -256,9 +300,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 8,
|
"execution_count": 8,
|
||||||
"id": "885136a4",
|
"id": "dac8bd54",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.178405Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.178321Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.188650Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.188432Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -287,7 +336,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "00785402",
|
"id": "61f2c12d",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Using this split of the observations into a training set and a validation set,\n",
|
"Using this split of the observations into a training set and a validation set,\n",
|
||||||
@@ -301,7 +350,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "21c071b8",
|
"id": "f22daa51",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Cross-Validation\n",
|
"## Cross-Validation\n",
|
||||||
@@ -334,9 +383,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 9,
|
"execution_count": 9,
|
||||||
"id": "6d957d8c",
|
"id": "601ae443",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.189993Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.189906Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:14.876368Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:14.876129Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -365,7 +419,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "c17e2bc8",
|
"id": "ebadc35f",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The arguments to `cross_validate()` are as follows: an\n",
|
"The arguments to `cross_validate()` are as follows: an\n",
|
||||||
@@ -381,7 +435,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "5c7901f2",
|
"id": "25f47b99",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We can repeat this procedure for increasingly complex polynomial fits.\n",
|
"We can repeat this procedure for increasingly complex polynomial fits.\n",
|
||||||
@@ -397,9 +451,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 10,
|
"execution_count": 10,
|
||||||
"id": "e2b5ce95",
|
"id": "11226c85",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:14.877800Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:14.877726Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.384419Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.384193Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -430,10 +489,10 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "03706248",
|
"id": "a3a920ae",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and\n",
|
"As in Figure~\\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and\n",
|
||||||
"quadratic fits, but then no clear improvement from using higher-degree polynomials.\n",
|
"quadratic fits, but then no clear improvement from using higher-degree polynomials.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Above we introduced the `outer()` method of the `np.power()`\n",
|
"Above we introduced the `outer()` method of the `np.power()`\n",
|
||||||
@@ -449,9 +508,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 11,
|
"execution_count": 11,
|
||||||
"id": "1dda1bd7",
|
"id": "64b64d97",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.385768Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.385690Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.387686Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.387484Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -475,7 +539,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "f5092f1b",
|
"id": "71385c1b",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"In the CV example above, we used $K=n$, but of course we can also use $K<n$. The code is very similar\n",
|
"In the CV example above, we used $K=n$, but of course we can also use $K<n$. The code is very similar\n",
|
||||||
@@ -486,9 +550,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 12,
|
"execution_count": 12,
|
||||||
"id": "fb25fa70",
|
"id": "ca0f972f",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.389014Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.388934Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.407438Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.407194Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -520,13 +589,13 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "c4ec6afb",
|
"id": "8b234093",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Notice that the computation time is much shorter than that of LOOCV.\n",
|
"Notice that the computation time is much shorter than that of LOOCV.\n",
|
||||||
"(In principle, the computation time for LOOCV for a least squares\n",
|
"(In principle, the computation time for LOOCV for a least squares\n",
|
||||||
"linear model should be faster than for $K$-fold CV, due to the\n",
|
"linear model should be faster than for $K$-fold CV, due to the\n",
|
||||||
"availability of the formula (5.2) for LOOCV;\n",
|
"availability of the formula~(\\ref{Ch5:eq:LOOCVform}) for LOOCV;\n",
|
||||||
"however, the generic `cross_validate()` function does not make\n",
|
"however, the generic `cross_validate()` function does not make\n",
|
||||||
"use of this formula.) We still see little evidence that using cubic\n",
|
"use of this formula.) We still see little evidence that using cubic\n",
|
||||||
"or higher-degree polynomial terms leads to a lower test error than simply\n",
|
"or higher-degree polynomial terms leads to a lower test error than simply\n",
|
||||||
@@ -535,7 +604,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "5edf407f",
|
"id": "fb4487a4",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The `cross_validate()` function is flexible and can take\n",
|
"The `cross_validate()` function is flexible and can take\n",
|
||||||
@@ -546,9 +615,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 13,
|
"execution_count": 13,
|
||||||
"id": "d78795cd",
|
"id": "080cdb29",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.408750Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.408677Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.413979Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.413762Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 2
|
"lines_to_next_cell": 2
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -576,7 +650,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "a081be63",
|
"id": "b2f4b4cf",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"One can estimate the variability in the test error by running the following:"
|
"One can estimate the variability in the test error by running the following:"
|
||||||
@@ -585,9 +659,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 14,
|
"execution_count": 14,
|
||||||
"id": "0407ad56",
|
"id": "7c46de2b",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.415225Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.415158Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.437526Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.437302Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -614,7 +693,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "b66db3cb",
|
"id": "07165f0e",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Note that this standard deviation is not a valid estimate of the\n",
|
"Note that this standard deviation is not a valid estimate of the\n",
|
||||||
@@ -625,7 +704,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"## The Bootstrap\n",
|
"## The Bootstrap\n",
|
||||||
"We illustrate the use of the bootstrap in the simple example\n",
|
"We illustrate the use of the bootstrap in the simple example\n",
|
||||||
" {of Section 5.2,} as well as on an example involving\n",
|
" {of Section~\\ref{Ch5:sec:bootstrap},} as well as on an example involving\n",
|
||||||
"estimating the accuracy of the linear regression model on the `Auto`\n",
|
"estimating the accuracy of the linear regression model on the `Auto`\n",
|
||||||
"data set.\n",
|
"data set.\n",
|
||||||
"### Estimating the Accuracy of a Statistic of Interest\n",
|
"### Estimating the Accuracy of a Statistic of Interest\n",
|
||||||
@@ -640,8 +719,8 @@
|
|||||||
"To illustrate the bootstrap, we\n",
|
"To illustrate the bootstrap, we\n",
|
||||||
"start with a simple example.\n",
|
"start with a simple example.\n",
|
||||||
"The `Portfolio` data set in the `ISLP` package is described\n",
|
"The `Portfolio` data set in the `ISLP` package is described\n",
|
||||||
"in Section 5.2. The goal is to estimate the\n",
|
"in Section~\\ref{Ch5:sec:bootstrap}. The goal is to estimate the\n",
|
||||||
"sampling variance of the parameter $\\alpha$ given in formula (5.7). We will\n",
|
"sampling variance of the parameter $\\alpha$ given in formula~(\\ref{Ch5:min.var}). We will\n",
|
||||||
"create a function\n",
|
"create a function\n",
|
||||||
"`alpha_func()`, which takes as input a dataframe `D` assumed\n",
|
"`alpha_func()`, which takes as input a dataframe `D` assumed\n",
|
||||||
"to have columns `X` and `Y`, as well as a\n",
|
"to have columns `X` and `Y`, as well as a\n",
|
||||||
@@ -654,9 +733,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 15,
|
"execution_count": 15,
|
||||||
"id": "f04f15bd",
|
"id": "a4b6d9b3",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.438786Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.438714Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.441484Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.441268Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
@@ -670,12 +754,12 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "c88bd6a4",
|
"id": "9d50058e",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This function returns an estimate for $\\alpha$\n",
|
"This function returns an estimate for $\\alpha$\n",
|
||||||
"based on applying the minimum\n",
|
"based on applying the minimum\n",
|
||||||
" variance formula (5.7) to the observations indexed by\n",
|
" variance formula (\\ref{Ch5:min.var}) to the observations indexed by\n",
|
||||||
"the argument `idx`. For instance, the following command\n",
|
"the argument `idx`. For instance, the following command\n",
|
||||||
"estimates $\\alpha$ using all 100 observations."
|
"estimates $\\alpha$ using all 100 observations."
|
||||||
]
|
]
|
||||||
@@ -683,9 +767,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 16,
|
"execution_count": 16,
|
||||||
"id": "f98c0323",
|
"id": "81498a11",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.442843Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.442765Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.445171Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.444944Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -705,7 +794,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "58a78f00",
|
"id": "4f5d0aab",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Next we randomly select\n",
|
"Next we randomly select\n",
|
||||||
@@ -717,9 +806,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 17,
|
"execution_count": 17,
|
||||||
"id": "bcd40175",
|
"id": "64fe1cb6",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.446422Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.446354Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.448793Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.448579Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 2
|
"lines_to_next_cell": 2
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -744,7 +838,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "e6058be4",
|
"id": "91a635fe",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This process can be generalized to create a simple function `boot_SE()` for\n",
|
"This process can be generalized to create a simple function `boot_SE()` for\n",
|
||||||
@@ -755,9 +849,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 18,
|
"execution_count": 18,
|
||||||
"id": "ab6602cd",
|
"id": "dd16bbae",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.450062Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.449992Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.451958Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.451742Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
@@ -782,7 +881,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "d94d383e",
|
"id": "ac4e17ed",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Notice the use of `_` as a loop variable in `for _ in range(B)`. This is often used if the value of the counter is\n",
|
"Notice the use of `_` as a loop variable in `for _ in range(B)`. This is often used if the value of the counter is\n",
|
||||||
@@ -795,9 +894,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 19,
|
"execution_count": 19,
|
||||||
"id": "4a323513",
|
"id": "b42b4585",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.453190Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.453118Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.631597Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.631370Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
@@ -821,7 +925,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "22343f53",
|
"id": "6c5464d7",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The final output shows that the bootstrap estimate for ${\\rm SE}(\\hat{\\alpha})$ is $0.0912$.\n",
|
"The final output shows that the bootstrap estimate for ${\\rm SE}(\\hat{\\alpha})$ is $0.0912$.\n",
|
||||||
@@ -835,7 +939,7 @@
|
|||||||
"`horsepower` to predict `mpg` in the `Auto` data set. We\n",
|
"`horsepower` to predict `mpg` in the `Auto` data set. We\n",
|
||||||
"will compare the estimates obtained using the bootstrap to those\n",
|
"will compare the estimates obtained using the bootstrap to those\n",
|
||||||
"obtained using the formulas for ${\\rm SE}(\\hat{\\beta}_0)$ and\n",
|
"obtained using the formulas for ${\\rm SE}(\\hat{\\beta}_0)$ and\n",
|
||||||
"${\\rm SE}(\\hat{\\beta}_1)$ described in Section 3.1.2.\n",
|
"${\\rm SE}(\\hat{\\beta}_1)$ described in Section~\\ref{Ch3:secoefsec}.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"To use our `boot_SE()` function, we must write a function (its\n",
|
"To use our `boot_SE()` function, we must write a function (its\n",
|
||||||
"first argument)\n",
|
"first argument)\n",
|
||||||
@@ -856,9 +960,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 20,
|
"execution_count": 20,
|
||||||
"id": "0220f3af",
|
"id": "6bc11784",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.632802Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.632725Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.634450Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.634222Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
@@ -872,7 +981,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "df0c7f05",
|
"id": "2a6ea3ce",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This is not quite what is needed as the first argument to\n",
|
"This is not quite what is needed as the first argument to\n",
|
||||||
@@ -886,9 +995,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 21,
|
"execution_count": 21,
|
||||||
"id": "62037dcb",
|
"id": "740cd50c",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.635644Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.635575Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.637097Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.636867Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
@@ -898,7 +1012,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "61fbe248",
|
"id": "ed6d19e2",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Typing `hp_func?` will show that it has two arguments `D`\n",
|
"Typing `hp_func?` will show that it has two arguments `D`\n",
|
||||||
@@ -914,25 +1028,30 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 22,
|
"execution_count": 22,
|
||||||
"id": "b8bdb7a4",
|
"id": "ffb3ec50",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.638287Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.638220Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:15.656475Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:15.656261Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"array([[39.88064456, -0.1567849 ],\n",
|
"array([[39.12226577, -0.1555926 ],\n",
|
||||||
" [38.73298691, -0.14699495],\n",
|
" [37.18648613, -0.13915813],\n",
|
||||||
" [38.31734657, -0.14442683],\n",
|
" [37.46989244, -0.14112749],\n",
|
||||||
" [39.91446826, -0.15782234],\n",
|
" [38.56723252, -0.14830116],\n",
|
||||||
" [39.43349349, -0.15072702],\n",
|
" [38.95495707, -0.15315141],\n",
|
||||||
" [40.36629857, -0.15912217],\n",
|
" [39.12563927, -0.15261044],\n",
|
||||||
" [39.62334517, -0.15449117],\n",
|
" [38.45763251, -0.14767251],\n",
|
||||||
" [39.0580588 , -0.14952908],\n",
|
" [38.43372587, -0.15019447],\n",
|
||||||
" [38.66688437, -0.14521037],\n",
|
" [37.87581142, -0.1409544 ],\n",
|
||||||
" [39.64280792, -0.15555698]])"
|
" [37.95949036, -0.1451333 ]])"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 22,
|
"execution_count": 22,
|
||||||
@@ -943,14 +1062,14 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"rng = np.random.default_rng(0)\n",
|
"rng = np.random.default_rng(0)\n",
|
||||||
"np.array([hp_func(Auto,\n",
|
"np.array([hp_func(Auto,\n",
|
||||||
" rng.choice(392,\n",
|
" rng.choice(Auto.index,\n",
|
||||||
" 392,\n",
|
" 392,\n",
|
||||||
" replace=True)) for _ in range(10)])\n"
|
" replace=True)) for _ in range(10)])\n"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "2a831036",
|
"id": "c6d09d96",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Next, we use the `boot_SE()` function to compute the standard\n",
|
"Next, we use the `boot_SE()` function to compute the standard\n",
|
||||||
@@ -960,17 +1079,22 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 23,
|
"execution_count": 23,
|
||||||
"id": "36808258",
|
"id": "7d561f70",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:15.657733Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:15.657659Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:17.204871Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:17.204614Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 2
|
"lines_to_next_cell": 2
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"intercept 0.848807\n",
|
"intercept 0.731176\n",
|
||||||
"horsepower 0.007352\n",
|
"horsepower 0.006092\n",
|
||||||
"dtype: float64"
|
"dtype: float64"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -989,14 +1113,14 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "38c65fbf",
|
"id": "a834f240",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This indicates that the bootstrap estimate for ${\\rm SE}(\\hat{\\beta}_0)$ is\n",
|
"This indicates that the bootstrap estimate for ${\\rm SE}(\\hat{\\beta}_0)$ is\n",
|
||||||
"0.85, and that the bootstrap\n",
|
"0.85, and that the bootstrap\n",
|
||||||
"estimate for ${\\rm SE}(\\hat{\\beta}_1)$ is\n",
|
"estimate for ${\\rm SE}(\\hat{\\beta}_1)$ is\n",
|
||||||
"0.0074. As discussed in\n",
|
"0.0074. As discussed in\n",
|
||||||
"Section 3.1.2, standard formulas can be used to compute\n",
|
"Section~\\ref{Ch3:secoefsec}, standard formulas can be used to compute\n",
|
||||||
"the standard errors for the regression coefficients in a linear\n",
|
"the standard errors for the regression coefficients in a linear\n",
|
||||||
"model. These can be obtained using the `summarize()` function\n",
|
"model. These can be obtained using the `summarize()` function\n",
|
||||||
"from `ISLP.sm`."
|
"from `ISLP.sm`."
|
||||||
@@ -1005,9 +1129,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 24,
|
"execution_count": 24,
|
||||||
"id": "c9aea297",
|
"id": "3888aa0a",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:17.206302Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:17.206223Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:17.221631Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:17.221444Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 2
|
"lines_to_next_cell": 2
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -1032,11 +1161,11 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "d870ad6b",
|
"id": "aefc0575",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The standard error estimates for $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$\n",
|
"The standard error estimates for $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$\n",
|
||||||
"obtained using the formulas from Section 3.1.2 are\n",
|
"obtained using the formulas from Section~\\ref{Ch3:secoefsec} are\n",
|
||||||
"0.717 for the\n",
|
"0.717 for the\n",
|
||||||
"intercept and\n",
|
"intercept and\n",
|
||||||
"0.006 for the\n",
|
"0.006 for the\n",
|
||||||
@@ -1044,13 +1173,13 @@
|
|||||||
"obtained using the bootstrap. Does this indicate a problem with the\n",
|
"obtained using the bootstrap. Does this indicate a problem with the\n",
|
||||||
"bootstrap? In fact, it suggests the opposite. Recall that the\n",
|
"bootstrap? In fact, it suggests the opposite. Recall that the\n",
|
||||||
"standard formulas given in\n",
|
"standard formulas given in\n",
|
||||||
" {Equation 3.8 on page 82}\n",
|
" {Equation~\\ref{Ch3:se.eqn} on page~\\pageref{Ch3:se.eqn}}\n",
|
||||||
"rely on certain assumptions. For example,\n",
|
"rely on certain assumptions. For example,\n",
|
||||||
"they depend on the unknown parameter $\\sigma^2$, the noise\n",
|
"they depend on the unknown parameter $\\sigma^2$, the noise\n",
|
||||||
"variance. We then estimate $\\sigma^2$ using the RSS. Now although the\n",
|
"variance. We then estimate $\\sigma^2$ using the RSS. Now although the\n",
|
||||||
"formulas for the standard errors do not rely on the linear model being\n",
|
"formulas for the standard errors do not rely on the linear model being\n",
|
||||||
"correct, the estimate for $\\sigma^2$ does. We see\n",
|
"correct, the estimate for $\\sigma^2$ does. We see\n",
|
||||||
" {in Figure 3.8 on page 108} that there is\n",
|
" {in Figure~\\ref{Ch3:polyplot} on page~\\pageref{Ch3:polyplot}} that there is\n",
|
||||||
"a non-linear relationship in the data, and so the residuals from a\n",
|
"a non-linear relationship in the data, and so the residuals from a\n",
|
||||||
"linear fit will be inflated, and so will $\\hat{\\sigma}^2$. Secondly,\n",
|
"linear fit will be inflated, and so will $\\hat{\\sigma}^2$. Secondly,\n",
|
||||||
"the standard formulas assume (somewhat unrealistically) that the $x_i$\n",
|
"the standard formulas assume (somewhat unrealistically) that the $x_i$\n",
|
||||||
@@ -1063,7 +1192,7 @@
|
|||||||
"Below we compute the bootstrap standard error estimates and the\n",
|
"Below we compute the bootstrap standard error estimates and the\n",
|
||||||
"standard linear regression estimates that result from fitting the\n",
|
"standard linear regression estimates that result from fitting the\n",
|
||||||
"quadratic model to the data. Since this model provides a good fit to\n",
|
"quadratic model to the data. Since this model provides a good fit to\n",
|
||||||
"the data (Figure 3.8), there is now a better\n",
|
"the data (Figure~\\ref{Ch3:polyplot}), there is now a better\n",
|
||||||
"correspondence between the bootstrap estimates and the standard\n",
|
"correspondence between the bootstrap estimates and the standard\n",
|
||||||
"estimates of ${\\rm SE}(\\hat{\\beta}_0)$, ${\\rm SE}(\\hat{\\beta}_1)$ and\n",
|
"estimates of ${\\rm SE}(\\hat{\\beta}_0)$, ${\\rm SE}(\\hat{\\beta}_1)$ and\n",
|
||||||
"${\\rm SE}(\\hat{\\beta}_2)$."
|
"${\\rm SE}(\\hat{\\beta}_2)$."
|
||||||
@@ -1072,17 +1201,22 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 25,
|
"execution_count": 25,
|
||||||
"id": "79c56529",
|
"id": "acc3e32c",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {}
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:17.222887Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:17.222785Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:19.351574Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:19.351317Z"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"intercept 2.067840\n",
|
"intercept 1.538641\n",
|
||||||
"poly(horsepower, degree=2, raw=True)[0] 0.033019\n",
|
"poly(horsepower, degree=2, raw=True)[0] 0.024696\n",
|
||||||
"poly(horsepower, degree=2, raw=True)[1] 0.000120\n",
|
"poly(horsepower, degree=2, raw=True)[1] 0.000090\n",
|
||||||
"dtype: float64"
|
"dtype: float64"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -1101,7 +1235,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "9fccbbbd",
|
"id": "e8a2fd2b",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We compare the results to the standard errors computed using `sm.OLS()`."
|
"We compare the results to the standard errors computed using `sm.OLS()`."
|
||||||
@@ -1110,9 +1244,14 @@
|
|||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 26,
|
"execution_count": 26,
|
||||||
"id": "4d0b4edc",
|
"id": "dca5340c",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"execution": {},
|
"execution": {
|
||||||
|
"iopub.execute_input": "2024-06-04T23:19:19.352904Z",
|
||||||
|
"iopub.status.busy": "2024-06-04T23:19:19.352827Z",
|
||||||
|
"iopub.status.idle": "2024-06-04T23:19:19.360147Z",
|
||||||
|
"shell.execute_reply": "2024-06-04T23:19:19.359948Z"
|
||||||
|
},
|
||||||
"lines_to_next_cell": 0
|
"lines_to_next_cell": 0
|
||||||
},
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
@@ -1138,7 +1277,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "9a86ff6e",
|
"id": "e98297be",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"\n",
|
"\n",
|
||||||
@@ -1149,8 +1288,13 @@
|
|||||||
"metadata": {
|
"metadata": {
|
||||||
"jupytext": {
|
"jupytext": {
|
||||||
"cell_metadata_filter": "-all",
|
"cell_metadata_filter": "-all",
|
||||||
"formats": "Rmd,ipynb",
|
"main_language": "python",
|
||||||
"main_language": "python"
|
"notebook_metadata_filter": "-all"
|
||||||
|
},
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
},
|
},
|
||||||
"language_info": {
|
"language_info": {
|
||||||
"codemirror_mode": {
|
"codemirror_mode": {
|
||||||
@@ -1162,7 +1306,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.10.12"
|
"version": "3.12.3"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
|
|||||||
@@ -1,21 +1,13 @@
|
|||||||
---
|
|
||||||
jupyter:
|
# Linear Models and Regularization Methods
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch06-varselect-lab.ipynb">
|
||||||
formats: Rmd,ipynb
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
main_language: python
|
</a>
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch06-varselect-lab.ipynb)
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 6
|
|
||||||
|
|
||||||
|
|
||||||
# Lab: Linear Models and Regularization Methods
|
|
||||||
In this lab we implement many of the techniques discussed in this chapter.
|
In this lab we implement many of the techniques discussed in this chapter.
|
||||||
We import some of our libraries at this top
|
We import some of our libraries at this top
|
||||||
level.
|
level.
|
||||||
@@ -35,7 +27,7 @@ from functools import partial
|
|||||||
```
|
```
|
||||||
|
|
||||||
We again collect the new imports
|
We again collect the new imports
|
||||||
needed for this lab.
|
needed for this lab. Readers will also have to have installed `l0bnb` using `pip install l0bnb`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
from sklearn.pipeline import Pipeline
|
from sklearn.pipeline import Pipeline
|
||||||
@@ -45,11 +37,10 @@ from ISLP.models import \
|
|||||||
(Stepwise,
|
(Stepwise,
|
||||||
sklearn_selected,
|
sklearn_selected,
|
||||||
sklearn_selection_path)
|
sklearn_selection_path)
|
||||||
# !pip install l0bnb
|
|
||||||
from l0bnb import fit_path
|
from l0bnb import fit_path
|
||||||
|
|
||||||
```
|
```
|
||||||
We have installed the package `l0bnb` on the fly. Note the escaped `!pip install` --- this is run as a separate system command.
|
|
||||||
## Subset Selection Methods
|
## Subset Selection Methods
|
||||||
Here we implement methods that reduce the number of parameters in a
|
Here we implement methods that reduce the number of parameters in a
|
||||||
model by restricting the model to a subset of the input variables.
|
model by restricting the model to a subset of the input variables.
|
||||||
@@ -74,7 +65,7 @@ Hitters = load_data('Hitters')
|
|||||||
np.isnan(Hitters['Salary']).sum()
|
np.isnan(Hitters['Salary']).sum()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that `Salary` is missing for 59 players. The
|
We see that `Salary` is missing for 59 players. The
|
||||||
`dropna()` method of data frames removes all of the rows that have missing
|
`dropna()` method of data frames removes all of the rows that have missing
|
||||||
values in any variable (by default --- see `Hitters.dropna?`).
|
values in any variable (by default --- see `Hitters.dropna?`).
|
||||||
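As a small illustration of the default behavior described above, here is a sketch on a made-up data frame (the name `toy` and its values are hypothetical, not taken from `Hitters`). It counts the missing `Salary` entries and contrasts dropping rows with any missing value against dropping only rows where `Salary` is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Salary, one missing Hits.
toy = pd.DataFrame({'Salary': [475.0, np.nan, 500.0, 91.5],
                    'Hits': [81, 66, np.nan, 87]})

# Count the missing responses, as done for Hitters above.
n_missing = np.isnan(toy['Salary']).sum()

# Default: drop every row with a missing value in any column.
complete = toy.dropna()

# Alternative: only require Salary to be present.
salary_only = toy.dropna(subset=['Salary'])

print(n_missing, complete.shape, salary_only.shape)
```

The `subset` argument is worth knowing when only the response must be complete, although the lab itself uses the default.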
@@ -84,9 +75,9 @@ Hitters = Hitters.dropna();
|
|||||||
Hitters.shape
|
Hitters.shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
We first choose the best model using forward selection based on $C_p$ (6.2). This score
|
We first choose the best model using forward selection based on $C_p$ (\ref{Ch6:eq:cp}). This score
|
||||||
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
|
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
|
||||||
it as a scorer. By default, `sklearn` tries to maximize a score, hence
|
it as a scorer. By default, `sklearn` tries to maximize a score, hence
|
||||||
our scoring function computes the negative $C_p$ statistic.
|
our scoring function computes the negative $C_p$ statistic.
|
||||||
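For reference, and assuming the $C_p$ statistic takes its usual form for a least squares model with $d$ predictors, the quantity being negated can be written as follows, where $\hat{\sigma}^2$ is an estimate of the error variance and $n$ is the number of observations.

$$
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\, d\, \hat{\sigma}^2\right),
\qquad
\text{score} = -C_p .
$$

Maximizing the score is then the same as minimizing $C_p$.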
@@ -111,7 +102,7 @@ sigma2 = OLS(Y,X).fit().scale
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$.
|
The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section~\ref{Ch5-resample-lab:the-bootstrap} to freeze the first argument with our estimate of $\sigma^2$.
|
||||||
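The effect of `partial()` can be seen in a stripped-down sketch. The helper `neg_cp_from_summary()` below is hypothetical: unlike `nCp()`, it takes summary numbers (RSS, sample size, number of predictors) rather than an estimator and data, purely to keep the example self-contained.

```python
from functools import partial

def neg_cp_from_summary(sigma2, rss, n, d):
    """Negative C_p from summary numbers (hypothetical helper for illustration)."""
    return -(rss + 2 * d * sigma2) / n

# Freeze the first positional argument, leaving a three-argument function.
frozen = partial(neg_cp_from_summary, 4.0)

# Both calls evaluate -(100 + 2 * 3 * 4) / 50.
print(frozen(100.0, 50, 3))
print(neg_cp_from_summary(4.0, 100.0, 50, 3))
```

Because `partial()` fills arguments from the left, the variance estimate has to be the first parameter of the scorer for this pattern to work.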
|
|
||||||
```{python}
|
```{python}
|
||||||
neg_Cp = partial(nCp, sigma2)
|
neg_Cp = partial(nCp, sigma2)
|
||||||
@@ -119,7 +110,7 @@ neg_Cp = partial(nCp, sigma2)
|
|||||||
```
|
```
|
||||||
We can now use `neg_Cp()` as a scorer for model selection.
|
We can now use `neg_Cp()` as a scorer for model selection.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Along with a score we need to specify the search strategy. This is done through the object
|
Along with a score we need to specify the search strategy. This is done through the object
|
||||||
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
|
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
|
||||||
@@ -133,7 +124,7 @@ strategy = Stepwise.first_peak(design,
|
|||||||
max_terms=len(design.terms))
|
max_terms=len(design.terms))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now fit a linear regression model with `Salary` as outcome using forward
|
We now fit a linear regression model with `Salary` as outcome using forward
|
||||||
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
|
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
|
||||||
a model from `statsmodels` along with a search strategy and selects a model with its
|
a model from `statsmodels` along with a search strategy and selects a model with its
|
||||||
@@ -147,7 +138,7 @@ hitters_MSE.fit(Hitters, Y)
|
|||||||
hitters_MSE.selected_state_
|
hitters_MSE.selected_state_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.
|
Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -158,7 +149,7 @@ hitters_Cp.fit(Hitters, Y)
|
|||||||
hitters_Cp.selected_state_
|
hitters_Cp.selected_state_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Choosing Among Models Using the Validation Set Approach and Cross-Validation
|
### Choosing Among Models Using the Validation Set Approach and Cross-Validation
|
||||||
|
|
||||||
As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
|
As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
|
||||||
@@ -180,7 +171,7 @@ strategy = Stepwise.fixed_steps(design,
|
|||||||
full_path = sklearn_selection_path(OLS, strategy)
|
full_path = sklearn_selection_path(OLS, strategy)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.
|
We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -189,8 +180,8 @@ Yhat_in = full_path.predict(Hitters)
|
|||||||
Yhat_in.shape
|
Yhat_in.shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
|
This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
|
||||||
in-sample MSE. As expected, the in-sample MSE improves each step we take,
|
in-sample MSE. As expected, the in-sample MSE improves each step we take,
|
||||||
indicating we must use either the validation or cross-validation
|
indicating we must use either the validation or cross-validation
|
||||||
@@ -279,7 +270,7 @@ ax.legend()
|
|||||||
mse_fig
|
mse_fig
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To repeat the above using the validation set approach, we simply change our
|
To repeat the above using the validation set approach, we simply change our
|
||||||
`cv` argument to a validation set: one random split of the data into a test set and a training set. We choose a test size
|
`cv` argument to a validation set: one random split of the data into a test set and a training set. We choose a test size
|
||||||
of 20%, similar to the size of each test set in 5-fold cross-validation. `skm.ShuffleSplit()`
|
of 20%, similar to the size of each test set in 5-fold cross-validation. `skm.ShuffleSplit()`
|
||||||
@@ -309,7 +300,7 @@ ax.legend()
|
|||||||
mse_fig
|
mse_fig
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Best Subset Selection
|
### Best Subset Selection
|
||||||
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
|
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
|
||||||
@@ -337,7 +328,7 @@ path = fit_path(X,
|
|||||||
max_nonzeros=X.shape[1])
|
max_nonzeros=X.shape[1])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.
|
The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -363,7 +354,7 @@ Since we
|
|||||||
standardize first, in order to find coefficient
|
standardize first, in order to find coefficient
|
||||||
estimates on the original scale, we must *unstandardize*
|
estimates on the original scale, we must *unstandardize*
|
||||||
the coefficient estimates. The parameter
|
the coefficient estimates. The parameter
|
||||||
$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
|
$\lambda$ in (\ref{Ch6:ridge}) and (\ref{Ch6:LASSO}) is called `alphas` in `sklearn`. In order to
|
||||||
be consistent with the rest of this chapter, we use `lambdas`
|
be consistent with the rest of this chapter, we use `lambdas`
|
||||||
rather than `alphas` in what follows. {At the time of publication, ridge fits like the one in code chunk [22] issue unwarranted convergence warning messages; we expect these to disappear as this package matures.}
|
rather than `alphas` in what follows. {At the time of publication, ridge fits like the one in code chunk [22] issue unwarranted convergence warning messages; we expect these to disappear as this package matures.}
|
||||||
|
|
||||||
@@ -405,7 +396,7 @@ soln_path.index.name = 'negative log(lambda)'
|
|||||||
soln_path
|
soln_path
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We plot the paths to get a sense of how the coefficients vary with $\lambda$.
|
We plot the paths to get a sense of how the coefficients vary with $\lambda$.
|
||||||
To control the location of the legend we first set `legend` to `False` in the
|
To control the location of the legend we first set `legend` to `False` in the
|
||||||
plot method, adding it afterward with the `legend()` method of `ax`.
|
plot method, adding it afterward with the `legend()` method of `ax`.
|
||||||
@@ -429,14 +420,14 @@ beta_hat = soln_path.loc[soln_path.index[39]]
|
|||||||
lambdas[39], beta_hat
|
lambdas[39], beta_hat
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Let’s compute the $\ell_2$ norm of the standardized coefficients.
|
Let’s compute the $\ell_2$ norm of the standardized coefficients.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
np.linalg.norm(beta_hat)
|
np.linalg.norm(beta_hat)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
|
In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
|
||||||
Note the much larger $\ell_2$ norm of the
|
Note the much larger $\ell_2$ norm of the
|
||||||
coefficients associated with this smaller value of $\lambda$.
|
coefficients associated with this smaller value of $\lambda$.
|
||||||
@@ -490,7 +481,7 @@ results = skm.cross_validate(ridge,
|
|||||||
-results['test_score']
|
-results['test_score']
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The test MSE is 1.342e+05. Note
|
The test MSE is 1.342e+05. Note
|
||||||
that if we had instead simply fit a model with just an intercept, we
|
that if we had instead simply fit a model with just an intercept, we
|
||||||
would have predicted each test observation using the mean of the
|
would have predicted each test observation using the mean of the
|
||||||
@@ -527,7 +518,7 @@ grid.best_params_['ridge__alpha']
|
|||||||
grid.best_estimator_
|
grid.best_estimator_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Alternatively, we can use 5-fold cross-validation.
|
Alternatively, we can use 5-fold cross-validation.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -540,7 +531,7 @@ grid.best_params_['ridge__alpha']
|
|||||||
grid.best_estimator_
|
grid.best_estimator_
|
||||||
|
|
||||||
```
|
```
|
||||||
Recall we set up the `kfold` object for 5-fold cross-validation on page 298. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
|
Recall we set up the `kfold` object for 5-fold cross-validation on page~\pageref{line:choos-among-models}. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
|
||||||
to right.
|
to right.
|
||||||
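To see why shrinkage decreases from left to right on this axis, consider a small sketch with a hypothetical grid of $\lambda$ values (the grid below is made up and much coarser than the one used in the lab).

```python
import math

# Hypothetical grid, ordered from the strongest penalty to the weakest.
exponents = [4, 2, 0, -2]
lambdas = [10.0 ** e for e in exponents]

# The plotting coordinate: natural log, negated.
neg_log_lambdas = [-math.log(lam) for lam in lambdas]

for lam, x in zip(lambdas, neg_log_lambdas):
    print(f"lambda = {lam:g}, -log(lambda) = {x:.2f}")
```

Large values of $\lambda$ map to the left of the axis and small values to the right, so moving right corresponds to a weaker penalty.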
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -553,7 +544,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
|
|||||||
ax.set_ylabel('Cross-validated MSE', fontsize=20);
|
ax.set_ylabel('Cross-validated MSE', fontsize=20);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
One can cross-validate different metrics to choose a parameter. The default
|
One can cross-validate different metrics to choose a parameter. The default
|
||||||
metric for `skl.ElasticNet()` is test $R^2$.
|
metric for `skl.ElasticNet()` is test $R^2$.
|
||||||
Let’s compare $R^2$ to MSE for cross-validation here.
|
Let’s compare $R^2$ to MSE for cross-validation here.
|
||||||
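The two metrics are closely related. On a single fixed test fold, $R^2$ is one minus the residual sum of squares divided by the total sum of squares, so it decreases as MSE increases. The sketch below uses made-up numbers and hand-written helpers (`mse()` and `r_squared()` are illustrative, not `sklearn` functions).

```python
def mse(y, y_hat):
    """Mean squared error of predictions y_hat for responses y."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS, computed on the same observations."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return 1 - ss_res / ss_tot

# Hypothetical test responses and predictions.
y_test = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

print(mse(y_test, y_pred), r_squared(y_test, y_pred))
```

Averaged over several folds the two criteria need not pick exactly the same value of $\lambda$, since each fold has its own total sum of squares.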
@@ -565,7 +556,7 @@ grid_r2 = skm.GridSearchCV(pipe,
|
|||||||
grid_r2.fit(X, Y)
|
grid_r2.fit(X, Y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Finally, let’s plot the results for cross-validated $R^2$.
|
Finally, let’s plot the results for cross-validated $R^2$.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -577,7 +568,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
|
|||||||
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
|
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Fast Cross-Validation for Solution Paths
|
### Fast Cross-Validation for Solution Paths
|
||||||
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
|
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
|
||||||
@@ -597,7 +588,7 @@ pipeCV = Pipeline(steps=[('scaler', scaler),
|
|||||||
pipeCV.fit(X, Y)
|
pipeCV.fit(X, Y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Let’s produce a plot again of the cross-validation error to see that
|
Let’s produce a plot again of the cross-validation error to see that
|
||||||
it is similar to using `skm.GridSearchCV`.
|
it is similar to using `skm.GridSearchCV`.
|
||||||
|
|
||||||
@@ -613,7 +604,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
|
|||||||
ax.set_ylabel('Cross-validated MSE', fontsize=20);
|
ax.set_ylabel('Cross-validated MSE', fontsize=20);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that the value of $\lambda$ that results in the
|
We see that the value of $\lambda$ that results in the
|
||||||
smallest cross-validation error is 1.19e-02, available
|
smallest cross-validation error is 1.19e-02, available
|
||||||
as the value `tuned_ridge.alpha_`. What is the test MSE
|
as the value `tuned_ridge.alpha_`. What is the test MSE
|
||||||
@@ -623,7 +614,7 @@ associated with this value of $\lambda$?
|
|||||||
np.min(tuned_ridge.mse_path_.mean(1))
|
np.min(tuned_ridge.mse_path_.mean(1))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This represents a further improvement over the test MSE that we got
|
This represents a further improvement over the test MSE that we got
|
||||||
using $\lambda=4$. Finally, `tuned_ridge.coef_`
|
using $\lambda=4$. Finally, `tuned_ridge.coef_`
|
||||||
has the coefficients fit on the entire data set
|
has the coefficients fit on the entire data set
|
||||||
@@ -640,7 +631,7 @@ not perform variable selection!
|
|||||||
### Evaluating Test Error of Cross-Validated Ridge
|
### Evaluating Test Error of Cross-Validated Ridge
|
||||||
Choosing $\lambda$ using cross-validation provides a single regression
|
Choosing $\lambda$ using cross-validation provides a single regression
|
||||||
estimator, similar to fitting a linear regression model as we saw in
|
estimator, similar to fitting a linear regression model as we saw in
|
||||||
Chapter 3. It is therefore reasonable to estimate what its test error
|
Chapter~\ref{Ch3:linreg}. It is therefore reasonable to estimate what its test error
|
||||||
is. We run into a problem here in that cross-validation will have
|
is. We run into a problem here in that cross-validation will have
|
||||||
*touched* all of its data in choosing $\lambda$, hence we have no
|
*touched* all of its data in choosing $\lambda$, hence we have no
|
||||||
further data to estimate test error. A compromise is to do an initial
|
further data to estimate test error. A compromise is to do an initial
|
||||||
@@ -679,7 +670,7 @@ results = skm.cross_validate(pipeCV,
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### The Lasso
|
### The Lasso
|
||||||
We saw that ridge regression with a wise choice of $\lambda$ can
|
We saw that ridge regression with a wise choice of $\lambda$ can
|
||||||
@@ -728,13 +719,13 @@ ax.set_ylabel('Standardized coefficiients', fontsize=20);
|
|||||||
```
|
```
|
||||||
The smallest cross-validated error is lower than the test set MSE of the null model
|
The smallest cross-validated error is lower than the test set MSE of the null model
|
||||||
and of least squares, and very similar to the test MSE of 115526.71 of ridge
|
and of least squares, and very similar to the test MSE of 115526.71 of ridge
|
||||||
regression (page 305) with $\lambda$ chosen by cross-validation.
|
regression (page~\pageref{page:MSECVRidge}) with $\lambda$ chosen by cross-validation.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
np.min(tuned_lasso.mse_path_.mean(1))
|
np.min(tuned_lasso.mse_path_.mean(1))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Let’s again produce a plot of the cross-validation error.
|
Let’s again produce a plot of the cross-validation error.
|
||||||
|
|
||||||
|
|
||||||
@@ -759,7 +750,7 @@ variables.
|
|||||||
tuned_lasso.coef_
|
tuned_lasso.coef_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As in ridge regression, we could evaluate the test error
|
As in ridge regression, we could evaluate the test error
|
||||||
of cross-validated lasso by first splitting into
|
of cross-validated lasso by first splitting into
|
||||||
test and training sets and internally running
|
test and training sets and internally running
|
||||||
@@ -770,17 +761,17 @@ this as an exercise.
|
|||||||
## PCR and PLS Regression
|
## PCR and PLS Regression
|
||||||
|
|
||||||
### Principal Components Regression
|
### Principal Components Regression
|
||||||
|
|
||||||
|
|
||||||
Principal components regression (PCR) can be performed using
|
Principal components regression (PCR) can be performed using
|
||||||
`PCA()` from the `sklearn.decomposition`
|
`PCA()` from the `sklearn.decomposition`
|
||||||
module. We now apply PCR to the `Hitters` data, in order to
|
module. We now apply PCR to the `Hitters` data, in order to
|
||||||
predict `Salary`. Again, ensure that the missing values have
|
predict `Salary`. Again, ensure that the missing values have
|
||||||
been removed from the data, as described in Section 6.5.1.
|
been removed from the data, as described in Section~\ref{Ch6-varselect-lab:lab-1-subset-selection-methods}.
|
||||||
|
|
||||||
We use `LinearRegression()` to fit the regression model
|
We use `LinearRegression()` to fit the regression model
|
||||||
here. Note that it fits an intercept by default, unlike
|
here. Note that it fits an intercept by default, unlike
|
||||||
the `OLS()` function seen earlier in Section 6.5.1.
|
the `OLS()` function seen earlier in Section~\ref{Ch6-varselect-lab:lab-1-subset-selection-methods}.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
pca = PCA(n_components=2)
|
pca = PCA(n_components=2)
|
||||||
@@ -791,7 +782,7 @@ pipe.fit(X, Y)
|
|||||||
pipe.named_steps['linreg'].coef_
|
pipe.named_steps['linreg'].coef_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
When performing PCA, the results vary depending
|
When performing PCA, the results vary depending
|
||||||
on whether the data has been *standardized* or not.
|
on whether the data has been *standardized* or not.
|
||||||
As in the earlier examples, this can be accomplished
|
As in the earlier examples, this can be accomplished
|
||||||
@@ -805,7 +796,7 @@ pipe.fit(X, Y)
|
|||||||
pipe.named_steps['linreg'].coef_
|
pipe.named_steps['linreg'].coef_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can of course use CV to choose the number of components, by
|
We can of course use CV to choose the number of components, by
|
||||||
using `skm.GridSearchCV`, in this
|
using `skm.GridSearchCV`, in this
|
||||||
case fixing the parameters to vary the
|
case fixing the parameters to vary the
|
||||||
@@ -820,7 +811,7 @@ grid = skm.GridSearchCV(pipe,
|
|||||||
grid.fit(X, Y)
|
grid.fit(X, Y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Let’s plot the results as we have for other methods.
|
Let’s plot the results as we have for other methods.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -835,7 +826,7 @@ ax.set_xticks(n_comp[::2])
|
|||||||
ax.set_ylim([50000,250000]);
|
ax.set_ylim([50000,250000]);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that the smallest cross-validation error occurs when
|
We see that the smallest cross-validation error occurs when
|
||||||
17
|
17
|
||||||
components are used. However, from the plot we also see that the
|
components are used. However, from the plot we also see that the
|
||||||
@@ -859,18 +850,18 @@ cv_null = skm.cross_validate(linreg,
|
|||||||
-cv_null['test_score'].mean()
|
-cv_null['test_score'].mean()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The `explained_variance_ratio_`
|
The `explained_variance_ratio_`
|
||||||
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors using
|
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors using
|
||||||
different numbers of components. This concept is discussed in greater
|
different numbers of components. This concept is discussed in greater
|
||||||
detail in Section 12.2.
|
detail in Section~\ref{Ch10:sec:pca}.
|
||||||
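As a rough sketch of what such ratios mean, suppose the principal components had the made-up variances below. Each ratio is a component's share of the total predictor variance, and the running sum gives the share captured by the first $M$ components.

```python
from itertools import accumulate

# Hypothetical component variances, largest first.
component_variances = [4.0, 3.0, 2.0, 1.0]

total = sum(component_variances)
ratios = [v / total for v in component_variances]

# Share of variance captured by the first M components, M = 1, 2, ...
cumulative = list(accumulate(ratios))

print(ratios)
print(cumulative)
```

With all components included the cumulative share reaches one, up to floating-point rounding.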
|
|
||||||
```{python}
|
```{python}
|
||||||
pipe.named_steps['pca'].explained_variance_ratio_
|
pipe.named_steps['pca'].explained_variance_ratio_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Briefly, we can think of
|
Briefly, we can think of
|
||||||
this as the amount of information about the predictors
|
this as the amount of information about the predictors
|
||||||
that is captured using $M$ principal components. For example, setting
|
that is captured using $M$ principal components. For example, setting
|
||||||
@@ -893,7 +884,7 @@ pls = PLSRegression(n_components=2,
|
|||||||
pls.fit(X, Y)
|
pls.fit(X, Y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As was the case in PCR, we will want to
|
As was the case in PCR, we will want to
|
||||||
use CV to choose the number of components.
|
use CV to choose the number of components.
|
||||||
|
|
||||||
@@ -906,7 +897,7 @@ grid = skm.GridSearchCV(pls,
|
|||||||
grid.fit(X, Y)
|
grid.fit(X, Y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As for our other methods, we plot the MSE.
|
As for our other methods, we plot the MSE.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -921,7 +912,7 @@ ax.set_xticks(n_comp[::2])
|
|||||||
ax.set_ylim([50000,250000]);
|
ax.set_ylim([50000,250000]);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
CV error is minimized at 12,
|
CV error is minimized at 12,
|
||||||
though there is little noticeable difference between this point and a much lower number like 2 or 3 components.
|
though there is little noticeable difference between this point and a much lower number like 2 or 3 components.
|
||||||
|
|
||||||
|
|||||||
11389
Ch06-varselect-lab.ipynb
File diff suppressed because one or more lines are too long
@@ -1,22 +1,14 @@
|
|||||||
---
|
# Non-Linear Modeling
|
||||||
jupyter:
|
|
||||||
jupytext:
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch07-nonlin-lab.ipynb">
|
||||||
cell_metadata_filter: -all
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
formats: Rmd,ipynb
|
</a>
|
||||||
main_language: python
|
|
||||||
text_representation:
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch07-nonlin-lab.ipynb)
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 7
|
|
||||||
|
|
||||||
# Lab: Non-Linear Modeling
|
|
||||||
In this lab, we demonstrate some of the nonlinear models discussed in
|
In this lab, we demonstrate some of the nonlinear models discussed in
|
||||||
this chapter. We use the `Wage` data as a running example, and show that many of the complex non-linear fitting procedures discussed can easily be implemented in \Python.
|
this chapter. We use the `Wage` data as a running example, and show that many of the complex non-linear fitting procedures discussed can easily be implemented in `Python`.
|
||||||
|
|
||||||
As usual, we start with some of our standard imports.
|
As usual, we start with some of our standard imports.
|
||||||
|
|
||||||
@@ -30,7 +22,7 @@ from ISLP.models import (summarize,
|
|||||||
ModelSpec as MS)
|
ModelSpec as MS)
|
||||||
from statsmodels.stats.anova import anova_lm
|
from statsmodels.stats.anova import anova_lm
|
||||||
```
|
```
|
||||||
|
|
||||||
We again collect the new imports
|
We again collect the new imports
|
||||||
needed for this lab. Many of these are developed specifically for the
|
needed for this lab. Many of these are developed specifically for the
|
||||||
`ISLP` package.
|
`ISLP` package.
|
||||||
@@ -51,9 +43,9 @@ from ISLP.pygam import (approx_lam,
|
|||||||
anova as anova_gam)
|
anova as anova_gam)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Polynomial Regression and Step Functions
|
## Polynomial Regression and Step Functions
|
||||||
We start by demonstrating how Figure 7.1 can be reproduced.
|
We start by demonstrating how Figure~\ref{Ch7:fig:poly} can be reproduced.
|
||||||
Let's begin by loading the data.
|
Let's begin by loading the data.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -62,10 +54,10 @@ y = Wage['wage']
|
|||||||
age = Wage['age']
|
age = Wage['age']
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Throughout most of this lab, our response is `Wage['wage']`, which
|
Throughout most of this lab, our response is `Wage['wage']`, which
|
||||||
we have stored as `y` above.
|
we have stored as `y` above.
|
||||||
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
|
As in Section~\ref{Ch3-linreg-lab:non-linear-transformations-of-the-predictors}, we will use the `poly()` function to create a model matrix
|
||||||
that will fit a $4$th degree polynomial in `age`.
|
that will fit a $4$th degree polynomial in `age`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -74,16 +66,16 @@ M = sm.OLS(y, poly_age.transform(Wage)).fit()
|
|||||||
summarize(M)
|
summarize(M)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
This polynomial is constructed using the function `poly()`,
|
This polynomial is constructed using the function `poly()`,
|
||||||
which creates
|
which creates
|
||||||
a special *transformer* `Poly()` (using `sklearn` terminology
|
a special *transformer* `Poly()` (using `sklearn` terminology
|
||||||
for feature transformations such as `PCA()` seen in Section 6.5.3) which
|
for feature transformations such as `PCA()` seen in Section \ref{Ch6-varselect-lab:principal-components-regression}) which
|
||||||
allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
|
allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
|
||||||
the
|
the
|
||||||
discussion of transformations on
|
discussion of transformations on
|
||||||
page 129.
|
page~\pageref{Ch3-linreg-lab:using-transformations-fit-and-transform}.
|
||||||
|
|
||||||
In the code above, the first line executes the `fit()` method
|
In the code above, the first line executes the `fit()` method
|
||||||
using the dataframe
|
using the dataframe
|
||||||
@@ -96,7 +88,7 @@ on the second line, as well as in the plotting function developed below.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
We now create a grid of values for `age` at which we want
|
We now create a grid of values for `age` at which we want
|
||||||
predictions.
|
predictions.
|
||||||
|
|
||||||
@@ -146,10 +138,10 @@ def plot_wage_fit(age_df,
|
|||||||
We include an argument `alpha` to `ax.scatter()`
|
We include an argument `alpha` to `ax.scatter()`
|
||||||
to add some transparency to the points. This provides a visual indication
|
to add some transparency to the points. This provides a visual indication
|
||||||
of density. Notice the use of the `zip()` function in the
|
of density. Notice the use of the `zip()` function in the
|
||||||
`for` loop above (see Section 2.3.8).
|
`for` loop above (see Section~\ref{Ch2-statlearn-lab:for-loops}).
|
||||||
We have three lines to plot, each with different colors and line
|
We have three lines to plot, each with different colors and line
|
||||||
types. Here `zip()` conveniently bundles these together as
|
types. Here `zip()` conveniently bundles these together as
|
||||||
iterators in the loop. {In `Python` speak, an "iterator" is an object with a finite number of values, that can be iterated on, as in a loop.}
|
iterators in the loop. {In `Python`{} speak, an "iterator" is an object with a finite number of values, that can be iterated on, as in a loop.}
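
As a small illustration of how `zip()` bundles parallel sequences in a loop, the sketch below pairs three series labels with colors and line styles. The labels and values are hypothetical, chosen only for illustration; they are not the ones used in `plot_wage_fit()`.

```python
# Hypothetical labels, colors and line styles (illustrative values only).
labels = ['fit', 'lower', 'upper']
colors = ['b', 'r', 'r']
styles = ['-', '--', '--']

# zip() yields one tuple per position and stops at the shortest input.
bundled = list(zip(labels, colors, styles))

# Each pass through the loop unpacks one tuple into three names.
for label, color, style in zip(labels, colors, styles):
    print(label, color, style)
```

Because `zip()` stops at the shortest input, the three lists should have the same length, otherwise trailing entries are silently dropped.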
|
||||||
|
|
||||||
We now plot the fit of the fourth-degree polynomial using this
|
We now plot the fit of the fourth-degree polynomial using this
|
||||||
function.
|
function.
|
||||||
@@ -164,7 +156,7 @@ plot_wage_fit(age_df,
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
With polynomial regression we must decide on the degree of
|
With polynomial regression we must decide on the degree of
|
||||||
the polynomial to use. Sometimes we just wing it, and decide to use
|
the polynomial to use. Sometimes we just wing it, and decide to use
|
||||||
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
|
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
|
||||||
@@ -195,7 +187,7 @@ anova_lm(*[sm.OLS(y, X_).fit()
|
|||||||
for X_ in Xs])
|
for X_ in Xs])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice the `*` in the `anova_lm()` line above. This
|
Notice the `*` in the `anova_lm()` line above. This
|
||||||
function takes a variable number of non-keyword arguments, in this case fitted models.
|
function takes a variable number of non-keyword arguments, in this case fitted models.
|
||||||
When these models are provided as a list (as is done here), it must be
|
When these models are provided as a list (as is done here), it must be
|
||||||
@@ -220,8 +212,8 @@ that `poly()` creates orthogonal polynomials.
|
|||||||
summarize(M)
|
summarize(M)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Notice that the p-values are the same, and in fact the squares of
|
Notice that the p-values are the same, and in fact the squares of
|
||||||
the t-statistics are equal to the F-statistics from the
|
the t-statistics are equal to the F-statistics from the
|
||||||
`anova_lm()` function; for example:
|
`anova_lm()` function; for example:
|
||||||
@@ -230,8 +222,8 @@ the t-statistics are equal to the F-statistics from the
|
|||||||
(-11.983)**2
|
(-11.983)**2
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
However, the ANOVA method works whether or not we used orthogonal
|
However, the ANOVA method works whether or not we used orthogonal
|
||||||
polynomials, provided the models are nested. For example, we can use
|
polynomials, provided the models are nested. For example, we can use
|
||||||
`anova_lm()` to compare the following three
|
`anova_lm()` to compare the following three
|
||||||
@@ -246,10 +238,10 @@ XEs = [model.fit_transform(Wage)
|
|||||||
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
|
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
As an alternative to using hypothesis tests and ANOVA, we could choose
|
As an alternative to using hypothesis tests and ANOVA, we could choose
|
||||||
the polynomial degree using cross-validation, as discussed in Chapter 5.
|
the polynomial degree using cross-validation, as discussed in Chapter~\ref{Ch5:resample}.
|
||||||
|
|
||||||
Next we consider the task of predicting whether an individual earns
|
Next we consider the task of predicting whether an individual earns
|
||||||
more than $250,000 per year. We proceed much as before, except
|
more than $250,000 per year. We proceed much as before, except
|
||||||
@@ -267,8 +259,8 @@ B = glm.fit()
|
|||||||
summarize(B)
|
summarize(B)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Once again, we make predictions using the `get_prediction()` method.
|
Once again, we make predictions using the `get_prediction()` method.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -277,7 +269,7 @@ preds = B.get_prediction(newX)
|
|||||||
bands = preds.conf_int(alpha=0.05)
|
bands = preds.conf_int(alpha=0.05)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now plot the estimated relationship.
|
We now plot the estimated relationship.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -308,7 +300,7 @@ value do not cover each other up. This type of plot is often called a
|
|||||||
*rug plot*.
|
*rug plot*.
|
||||||
|
|
||||||
In order to fit a step function, as discussed in
|
In order to fit a step function, as discussed in
|
||||||
Section 7.2, we first use the `pd.qcut()`
|
Section~\ref{Ch7:sec:scolstep-function}, we first use the `pd.qcut()`
|
||||||
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
|
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
|
||||||
columns of the model matrix for this categorical variable. Note that this function will
|
columns of the model matrix for this categorical variable. Note that this function will
|
||||||
include *all* columns for a given categorical, rather than the usual approach which drops one
|
include *all* columns for a given categorical, rather than the usual approach which drops one
|
||||||
@@ -319,8 +311,8 @@ cut_age = pd.qcut(age, 4)
|
|||||||
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
|
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
|
Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
|
||||||
quantiles directly instead of the argument `4`. For cuts not based
|
quantiles directly instead of the argument `4`. For cuts not based
|
||||||
on quantiles we would use the `pd.cut()` function.
|
on quantiles we would use the `pd.cut()` function.
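
To make the difference between quantile-based and value-based cuts concrete, here is a minimal sketch on a toy series. The ages and the hand-picked cutpoints are invented for illustration and are not taken from the `Wage` data.

```python
import pandas as pd

# Toy ages, invented for illustration (not the Wage data).
toy_age = pd.Series([18, 22, 25, 31, 38, 44, 52, 60])

# Quantile-based bins: four groups with equal counts.
by_quantile = pd.qcut(toy_age, 4)

# Hand-picked cutpoints: bins are (17, 30], (30, 45], (45, 65].
by_value = pd.cut(toy_age, bins=[17, 30, 45, 65])

# One indicator column per bin; every row has exactly one indicator set.
dummies = pd.get_dummies(by_quantile)

print(by_quantile.value_counts().sort_index())
print(by_value.value_counts().sort_index())
```

With `pd.qcut()` the bin edges adapt to the data so that the counts are balanced, whereas with `pd.cut()` the edges are fixed and the counts vary.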
|
||||||
@@ -340,7 +332,7 @@ evaluation functions are in the `scipy.interpolate` package;
|
|||||||
we have simply wrapped them as transforms
|
we have simply wrapped them as transforms
|
||||||
similar to `Poly()` and `PCA()`.
|
similar to `Poly()` and `PCA()`.
|
||||||
|
|
||||||
In Section 7.4, we saw
|
In Section~\ref{Ch7:sec:scolr-splin}, we saw
|
||||||
that regression splines can be fit by constructing an appropriate
|
that regression splines can be fit by constructing an appropriate
|
||||||
matrix of basis functions. The `BSpline()` function generates the
|
matrix of basis functions. The `BSpline()` function generates the
|
||||||
entire matrix of basis functions for splines with the specified set of
|
entire matrix of basis functions for splines with the specified set of
|
||||||
@@ -355,7 +347,7 @@ bs_age.shape
|
|||||||
```
|
```
|
||||||
This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
|
This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
|
||||||
We can form this same matrix using the `bs()` object,
|
We can form this same matrix using the `bs()` object,
|
||||||
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.
|
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section~\ref{Ch7-nonlin-lab:polynomial-regression-and-step-functions}.
|
||||||
|
|
||||||
We now fit a cubic spline model to the `Wage` data.
|
We now fit a cubic spline model to the `Wage` data.
|
||||||
|
|
||||||
@@ -377,7 +369,7 @@ M = sm.OLS(y, Xbs).fit()
|
|||||||
summarize(M)
|
summarize(M)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice that there are 6 spline coefficients rather than 7. This is because, by default,
|
Notice that there are 6 spline coefficients rather than 7. This is because, by default,
|
||||||
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
|
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
|
||||||
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
|
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
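
The column counts quoted here follow the usual rule that a degree-$d$ spline with $K$ interior knots has $K + d + 1$ basis functions, one of which is dropped when the model carries its own intercept. The helper below is a hypothetical illustration of that bookkeeping and is not part of `ISLP`.

```python
def n_spline_columns(n_interior_knots, degree=3, intercept=False):
    """Count basis columns for a regression spline.

    Hypothetical helper for illustration only; not an ISLP function.
    """
    # Full basis: one function per interior knot plus degree + 1.
    full_basis = n_interior_knots + degree + 1
    # Without an intercept column, one basis function is discarded.
    return full_basis if intercept else full_basis - 1

# Cubic spline with 3 interior knots, with and without the intercept column.
print(n_spline_columns(3, intercept=True), n_spline_columns(3))
```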
|
||||||
@@ -435,7 +427,7 @@ deciding bin membership.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
In order to fit a natural spline, we use the `NaturalSpline()`
|
In order to fit a natural spline, we use the `NaturalSpline()`
|
||||||
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
|
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
|
||||||
degrees of freedom (excluding the intercept) and plot the results.
|
degrees of freedom (excluding the intercept) and plot the results.
|
||||||
@@ -453,7 +445,7 @@ plot_wage_fit(age_df,
|
|||||||
'Natural spline, df=5');
|
'Natural spline, df=5');
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Smoothing Splines and GAMs
|
## Smoothing Splines and GAMs
|
||||||
A smoothing spline is a special case of a GAM with squared-error loss
|
A smoothing spline is a special case of a GAM with squared-error loss
|
||||||
and a single feature. To fit GAMs in `Python` we will use the
|
and a single feature. To fit GAMs in `Python` we will use the
|
||||||
@@ -464,7 +456,7 @@ of a model matrix with a particular smoothing operation:
|
|||||||
`s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables.
|
`s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables.
|
||||||
The argument `0` passed to `s` below indicates that this smoother will
|
The argument `0` passed to `s` below indicates that this smoother will
|
||||||
apply to the first column of a feature matrix. Below, we pass it a
|
apply to the first column of a feature matrix. Below, we pass it a
|
||||||
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.
|
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section~\ref{Ch7:sec5.2}.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
X_age = np.asarray(age).reshape((-1,1))
|
X_age = np.asarray(age).reshape((-1,1))
|
||||||
@@ -472,7 +464,7 @@ gam = LinearGAM(s_gam(0, lam=0.6))
|
|||||||
gam.fit(X_age, y)
|
gam.fit(X_age, y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead
|
The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead
|
||||||
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the
|
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the
|
||||||
size of that dimension based on the remaining entries of the shape tuple.
|
size of that dimension based on the remaining entries of the shape tuple.
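
A minimal sketch of what `reshape((-1, 1))` does, using a toy array rather than the `age` column (the values are made up for illustration):

```python
import numpy as np

# A one-dimensional array with three made-up ages: shape (3,).
toy_age = np.array([25, 40, 61])

# -1 asks numpy to infer the number of rows from the array size,
# given that there is exactly one column: shape (3, 1).
X_toy = toy_age.reshape((-1, 1))

print(toy_age.shape, X_toy.shape)
```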
|
||||||
@@ -495,7 +487,7 @@ ax.set_ylabel('Wage', fontsize=20);
|
|||||||
ax.legend(title='$\lambda$');
|
ax.legend(title='$\lambda$');
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The `pygam` package can perform a search for an optimal smoothing parameter.
|
The `pygam` package can perform a search for an optimal smoothing parameter.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -508,7 +500,7 @@ ax.legend()
|
|||||||
fig
|
fig
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Alternatively, we can fix the degrees of freedom of the smoothing
|
Alternatively, we can fix the degrees of freedom of the smoothing
|
||||||
spline using a function included in the `ISLP.pygam` package. Below we
|
spline using a function included in the `ISLP.pygam` package. Below we
|
||||||
find a value of $\lambda$ that gives us roughly four degrees of
|
find a value of $\lambda$ that gives us roughly four degrees of
|
||||||
@@ -523,8 +515,8 @@ age_term.lam = lam_4
|
|||||||
degrees_of_freedom(X_age, age_term)
|
degrees_of_freedom(X_age, age_term)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Let’s vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
|
Let’s vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
|
||||||
as the desired degrees of freedom plus one to account for the fact that these smoothing
|
as the desired degrees of freedom plus one to account for the fact that these smoothing
|
||||||
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
|
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
|
||||||
@@ -554,7 +546,7 @@ The strength of generalized additive models lies in their ability to fit multiva
|
|||||||
|
|
||||||
We now fit a GAM by hand to predict
|
We now fit a GAM by hand to predict
|
||||||
`wage` using natural spline functions of `year` and `age`,
|
`wage` using natural spline functions of `year` and `age`,
|
||||||
treating `education` as a qualitative predictor, as in (7.16).
|
treating `education` as a qualitative predictor, as in (\ref{Ch7:nsmod}).
|
||||||
Since this is just a big linear regression model
|
Since this is just a big linear regression model
|
||||||
using an appropriate choice of basis functions, we can simply do this
|
using an appropriate choice of basis functions, we can simply do this
|
||||||
using the `sm.OLS()` function.
|
using the `sm.OLS()` function.
|
||||||
@@ -636,10 +628,10 @@ ax.set_ylabel('Effect on wage')
|
|||||||
ax.set_title('Partial dependence of year on wage', fontsize=20);
|
ax.set_title('Partial dependence of year on wage', fontsize=20);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now fit the model (7.16) using smoothing splines rather
|
We now fit the model (\ref{Ch7:nsmod}) using smoothing splines rather
|
||||||
than natural splines. All of the
|
than natural splines. All of the
|
||||||
terms in (7.16) are fit simultaneously, taking each other
|
terms in (\ref{Ch7:nsmod}) are fit simultaneously, taking each other
|
||||||
into account to explain the response. The `pygam` package only works with matrices, so we must convert
|
into account to explain the response. The `pygam` package only works with matrices, so we must convert
|
||||||
the categorical series `education` to its array representation, which can be found
|
the categorical series `education` to its array representation, which can be found
|
||||||
with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we
|
with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we
|
||||||
@@ -728,7 +720,7 @@ gam_linear = LinearGAM(age_term +
|
|||||||
gam_linear.fit(Xgam, y)
|
gam_linear.fit(Xgam, y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice our use of `age_term` in the expressions above. We do this because
|
Notice our use of `age_term` in the expressions above. We do this because
|
||||||
earlier we set the value for `lam` in this term to achieve four degrees of freedom.
|
earlier we set the value for `lam` in this term to achieve four degrees of freedom.
|
||||||
|
|
||||||
@@ -775,7 +767,7 @@ We can make predictions from `gam` objects, just like from
|
|||||||
Yhat = gam_full.predict(Xgam)
|
Yhat = gam_full.predict(Xgam)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
In order to fit a logistic regression GAM, we use `LogisticGAM()`
|
In order to fit a logistic regression GAM, we use `LogisticGAM()`
|
||||||
from `pygam`.
|
from `pygam`.
|
||||||
|
|
||||||
@@ -786,7 +778,7 @@ gam_logit = LogisticGAM(age_term +
|
|||||||
gam_logit.fit(Xgam, high_earn)
|
gam_logit.fit(Xgam, high_earn)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
fig, ax = subplots(figsize=(8, 8))
|
fig, ax = subplots(figsize=(8, 8))
|
||||||
@@ -838,8 +830,8 @@ gam_logit_ = LogisticGAM(age_term +
|
|||||||
gam_logit_.fit(Xgam_, high_earn_)
|
gam_logit_.fit(Xgam_, high_earn_)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Let’s look at the effect of `education`, `year` and `age` on high earner status now that we’ve
|
Let’s look at the effect of `education`, `year` and `age` on high earner status now that we’ve
|
||||||
removed those observations.
|
removed those observations.
|
||||||
|
|
||||||
@@ -872,7 +864,7 @@ ax.set_ylabel('Effect on wage')
|
|||||||
ax.set_title('Partial dependence of high earner status on age', fontsize=20);
|
ax.set_title('Partial dependence of high earner status on age', fontsize=20);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Local Regression
|
## Local Regression
|
||||||
We illustrate the use of local regression using the `lowess()`
|
We illustrate the use of local regression using the `lowess()`
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,23 +1,15 @@
|
|||||||
---
|
|
||||||
jupyter:
|
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
|
||||||
formats: Rmd,ipynb
|
|
||||||
main_language: python
|
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 8
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Tree-Based Methods
|
||||||
|
|
||||||
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch08-baggboost-lab.ipynb">
|
||||||
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
|
</a>
|
||||||
|
|
||||||
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch08-baggboost-lab.ipynb)
|
||||||
|
|
||||||
|
|
||||||
# Lab: Tree-Based Methods
|
|
||||||
We import some of our usual libraries at this top
|
We import some of our usual libraries at this top
|
||||||
level.
|
level.
|
||||||
|
|
||||||
@@ -46,10 +38,10 @@ from sklearn.ensemble import \
|
|||||||
from ISLP.bart import BART
|
from ISLP.bart import BART
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Fitting Classification Trees
|
## Fitting Classification Trees
|
||||||
|
|
||||||
|
|
||||||
We first use classification trees to analyze the `Carseats` data set.
|
We first use classification trees to analyze the `Carseats` data set.
|
||||||
In these data, `Sales` is a continuous variable, and so we begin
|
In these data, `Sales` is a continuous variable, and so we begin
|
||||||
@@ -65,7 +57,7 @@ High = np.where(Carseats.Sales > 8,
|
|||||||
"No")
|
"No")
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now use `DecisionTreeClassifier()` to fit a classification tree in
|
We now use `DecisionTreeClassifier()` to fit a classification tree in
|
||||||
order to predict `High` using all variables but `Sales`.
|
order to predict `High` using all variables but `Sales`.
|
||||||
To do so, we must form a model matrix as we did when fitting regression
|
To do so, we must form a model matrix as we did when fitting regression
|
||||||
@@ -93,13 +85,13 @@ clf = DTC(criterion='entropy',
|
|||||||
clf.fit(X, High)
|
clf.fit(X, High)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
In our discussion of qualitative features in Section 3.3,
|
In our discussion of qualitative features in Section~\ref{ch3:sec3},
|
||||||
we noted that for a linear regression model such a feature could be
|
we noted that for a linear regression model such a feature could be
|
||||||
represented by including a matrix of dummy variables (one-hot-encoding) in the model
|
represented by including a matrix of dummy variables (one-hot-encoding) in the model
|
||||||
matrix, using the formula notation of `statsmodels`.
|
matrix, using the formula notation of `statsmodels`.
|
||||||
As mentioned in Section 8.1, there is a more
|
As mentioned in Section~\ref{Ch8:decison.tree.sec}, there is a more
|
||||||
natural way to handle qualitative features when building a decision
|
natural way to handle qualitative features when building a decision
|
||||||
tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups.
|
tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups.
|
||||||
However,
|
However,
|
||||||
@@ -110,8 +102,8 @@ advantage of this approach; instead it simply treats the one-hot-encoded levels
|
|||||||
accuracy_score(High, clf.predict(X))
|
accuracy_score(High, clf.predict(X))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
With only the default arguments, the training error rate is
|
With only the default arguments, the training error rate is
|
||||||
21%.
|
21%.
|
||||||
For classification trees, we can
|
For classification trees, we can
|
||||||
@@ -129,8 +121,8 @@ resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
|
|||||||
resid_dev
|
resid_dev
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This is closely related to the *entropy*, defined in (8.7).
|
This is closely related to the *entropy*, defined in (\ref{Ch8:eq:cross-entropy}).
|
||||||
A small deviance indicates a
|
A small deviance indicates a
|
||||||
tree that provides a good fit to the (training) data.
|
tree that provides a good fit to the (training) data.
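
To see why a small value indicates a good fit, the sketch below computes an average negative log-likelihood by hand for binary labels. The helper name, labels and probabilities are assumptions made for illustration; the lab itself uses `log_loss()` from `sklearn`, and this is not a reimplementation of the tree's deviance.

```python
from math import log

def mean_log_loss(y_true, p_hat):
    """Average negative log-likelihood for 0/1 labels.

    Illustrative helper; assumes every probability is strictly in (0, 1).
    """
    total = 0.0
    for y, p in zip(y_true, p_hat):
        # p is the predicted probability that the label equals 1.
        total -= log(p) if y == 1 else log(1 - p)
    return total / len(y_true)

# Made-up labels with confident versus hesitant predicted probabilities.
y_toy = [1, 0, 1, 1]
confident = mean_log_loss(y_toy, [0.9, 0.1, 0.8, 0.9])
vague = mean_log_loss(y_toy, [0.6, 0.4, 0.5, 0.6])
print(confident, vague)
```

Predictions that put high probability on the observed class give a smaller loss than hesitant ones, which is the sense in which a small deviance reflects a good fit to the training data.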
|
||||||
|
|
||||||
@@ -161,13 +153,13 @@ print(export_text(clf,
|
|||||||
show_weights=True))
|
show_weights=True))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
In order to properly evaluate the performance of a classification tree
|
In order to properly evaluate the performance of a classification tree
|
||||||
on these data, we must estimate the test error rather than simply
|
on these data, we must estimate the test error rather than simply
|
||||||
computing the training error. We split the observations into a
|
computing the training error. We split the observations into a
|
||||||
training set and a test set, build the tree using the training set,
|
training set and a test set, build the tree using the training set,
|
||||||
and evaluate its performance on the test data. This pattern is
|
and evaluate its performance on the test data. This pattern is
|
||||||
similar to that in Chapter 6, with the linear models
|
similar to that in Chapter~\ref{Ch6:varselect}, with the linear models
|
||||||
replaced here by decision trees --- the code for validation
|
replaced here by decision trees --- the code for validation
|
||||||
is almost identical. This approach leads to correct predictions
|
is almost identical. This approach leads to correct predictions
|
||||||
for 68.5% of the locations in the test data set.
|
for 68.5% of the locations in the test data set.
|
||||||
@@ -264,8 +256,8 @@ confusion = confusion_table(best_.predict(X_test),
|
|||||||
confusion
|
confusion
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Now 72.0% of the test observations are correctly classified, which is slightly worse than the accuracy of the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at the cost of slightly worse accuracy. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.
|
Now 72.0% of the test observations are correctly classified, which is slightly worse than the accuracy of the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at the cost of slightly worse accuracy. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.
|
||||||
|
|
||||||
|
|
||||||
@@ -283,7 +275,7 @@ feature_names = list(D.columns)
|
|||||||
X = np.asarray(D)
|
X = np.asarray(D)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
First, we split the data into training and test sets, and fit the tree
|
First, we split the data into training and test sets, and fit the tree
|
||||||
to the training data. Here we use 30% of the data for the test set.
|
to the training data. Here we use 30% of the data for the test set.
|
||||||
|
|
||||||
@@ -298,7 +290,7 @@ to the training data. Here we use 30% of the data for the test set.
|
|||||||
random_state=0)
|
random_state=0)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Having formed our training and test data sets, we fit the regression tree.
|
Having formed our training and test data sets, we fit the regression tree.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -310,7 +302,7 @@ plot_tree(reg,
|
|||||||
ax=ax);
|
ax=ax);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The variable `lstat` measures the percentage of individuals with
|
The variable `lstat` measures the percentage of individuals with
|
||||||
lower socioeconomic status. The tree indicates that lower
|
lower socioeconomic status. The tree indicates that lower
|
||||||
values of `lstat` correspond to more expensive houses.
|
values of `lstat` correspond to more expensive houses.
|
||||||
@@ -334,7 +326,7 @@ grid = skm.GridSearchCV(reg,
|
|||||||
G = grid.fit(X_train, y_train)
|
G = grid.fit(X_train, y_train)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
In keeping with the cross-validation results, we use the pruned tree
|
In keeping with the cross-validation results, we use the pruned tree
|
||||||
to make predictions on the test set.
|
to make predictions on the test set.
|
||||||
|
|
||||||
@@ -343,8 +335,8 @@ best_ = grid.best_estimator_
|
|||||||
np.mean((y_test - best_.predict(X_test))**2)
|
np.mean((y_test - best_.predict(X_test))**2)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
In other words, the test set MSE associated with the regression tree
|
In other words, the test set MSE associated with the regression tree
|
||||||
is 28.07. The square root of
|
is 28.07. The square root of
|
||||||
the MSE is therefore around
|
the MSE is therefore around
|
||||||
@@ -367,7 +359,7 @@ plot_tree(G.best_estimator_,
|
|||||||
|
|
||||||
|
|
||||||
## Bagging and Random Forests
|
## Bagging and Random Forests
|
||||||
|
|
||||||
|
|
||||||
Here we apply bagging and random forests to the `Boston` data, using
|
Here we apply bagging and random forests to the `Boston` data, using
|
||||||
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
|
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
|
||||||
@@ -380,8 +372,8 @@ bag_boston = RF(max_features=X_train.shape[1], random_state=0)
|
|||||||
bag_boston.fit(X_train, y_train)
|
bag_boston.fit(X_train, y_train)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The argument `max_features` indicates that all 12 predictors should
|
The argument `max_features` indicates that all 12 predictors should
|
||||||
be considered for each split of the tree --- in other words, that
|
be considered for each split of the tree --- in other words, that
|
||||||
bagging should be done. How well does this bagged model perform on
|
bagging should be done. How well does this bagged model perform on
|
||||||
@@ -394,7 +386,7 @@ ax.scatter(y_hat_bag, y_test)
|
|||||||
np.mean((y_test - y_hat_bag)**2)
|
np.mean((y_test - y_hat_bag)**2)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The test set MSE associated with the bagged regression tree is
|
The test set MSE associated with the bagged regression tree is
|
||||||
14.63, about half that obtained using an optimally-pruned single
|
14.63, about half that obtained using an optimally-pruned single
|
||||||
tree. We could change the number of trees grown from the default of
|
tree. We could change the number of trees grown from the default of
|
||||||
@@ -425,8 +417,8 @@ y_hat_RF = RF_boston.predict(X_test)
|
|||||||
np.mean((y_test - y_hat_RF)**2)
|
np.mean((y_test - y_hat_RF)**2)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The test set MSE is 20.04;
|
The test set MSE is 20.04;
|
||||||
this indicates that random forests did somewhat worse than bagging
|
this indicates that random forests did somewhat worse than bagging
|
||||||
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
|
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
|
||||||
@@ -440,7 +432,7 @@ feature_imp.sort_values(by='importance', ascending=False)
|
|||||||
```
|
```
|
||||||
This
|
This
|
||||||
is a relative measure of the total decrease in node impurity that results from
|
is a relative measure of the total decrease in node impurity that results from
|
||||||
splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data).
|
splits over that variable, averaged over all trees (this was plotted in Figure~\ref{Ch8:fig:varimp} for a model fit to the `Heart` data).
|
||||||
|
|
||||||
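As a minimal sketch of how these importance values behave, the block below fits a random forest to synthetic data (not the `Boston` data) in which only the first column drives the response. The names `X_demo`, `y_demo`, and `rf_demo` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): only column 0 matters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# One impurity-based importance per feature, normalized to sum to one.
importances = rf_demo.feature_importances_
total_importance = importances.sum()
most_important = int(np.argmax(importances))
print(importances, total_importance, most_important)
```

Because the values are normalized, they are best read as relative shares rather than as absolute reductions in impurity.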
The results indicate that across all of the trees considered in the
|
The results indicate that across all of the trees considered in the
|
||||||
random forest, the wealth level of the community (`lstat`) and the
|
random forest, the wealth level of the community (`lstat`) and the
|
||||||
@@ -450,7 +442,7 @@ house size (`rm`) are by far the two most important variables.
|
|||||||
|
|
||||||
|
|
||||||
## Boosting
|
## Boosting
|
||||||
|
|
||||||
|
|
||||||
Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
|
Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
|
||||||
to fit boosted regression trees to the `Boston` data
|
to fit boosted regression trees to the `Boston` data
|
||||||
@@ -469,7 +461,7 @@ boost_boston = GBR(n_estimators=5000,
|
|||||||
boost_boston.fit(X_train, y_train)
|
boost_boston.fit(X_train, y_train)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can see how the training error decreases with the `train_score_` attribute.
|
We can see how the training error decreases with the `train_score_` attribute.
|
||||||
To get an idea of how the test error decreases we can use the
|
To get an idea of how the test error decreases we can use the
|
||||||
`staged_predict()` method to get the predicted values along the path.
|
`staged_predict()` method to get the predicted values along the path.
|
||||||
@@ -492,7 +484,7 @@ ax.plot(plot_idx,
|
|||||||
ax.legend();
|
ax.legend();
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now use the boosted model to predict `medv` on the test set:
|
We now use the boosted model to predict `medv` on the test set:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -500,11 +492,11 @@ y_hat_boost = boost_boston.predict(X_test);
|
|||||||
np.mean((y_test - y_hat_boost)**2)
|
np.mean((y_test - y_hat_boost)**2)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The test MSE obtained is 14.48,
|
The test MSE obtained is 14.48,
|
||||||
similar to the test MSE for bagging. If we want to, we can
|
similar to the test MSE for bagging. If we want to, we can
|
||||||
perform boosting with a different value of the shrinkage parameter
|
perform boosting with a different value of the shrinkage parameter
|
||||||
$\lambda$ in (8.10). The fit above used 0.001 (the `sklearn` default for `learning_rate` is 0.1), but
|
$\lambda$ in (\ref{Ch8:alphaboost}). The fit above used 0.001 (the `sklearn` default for `learning_rate` is 0.1), but
|
||||||
this is easily modified. Here we take $\lambda=0.2$.
|
this is easily modified. Here we take $\lambda=0.2$.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -518,8 +510,8 @@ y_hat_boost = boost_boston.predict(X_test);
|
|||||||
np.mean((y_test - y_hat_boost)**2)
|
np.mean((y_test - y_hat_boost)**2)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
In this case, using $\lambda=0.2$ leads to almost the same test MSE
|
In this case, using $\lambda=0.2$ leads to almost the same test MSE
|
||||||
as when using $\lambda=0.001$.
|
as when using $\lambda=0.001$.
|
||||||
|
|
||||||
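The effect of the shrinkage parameter depends heavily on how many trees are grown. The sketch below uses synthetic data and only 200 trees (both assumptions made to keep it fast, not the settings used for `Boston`), and tracks the test MSE along the boosting path with `staged_predict()`. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=200)
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

test_mse = {}
for lam in (0.001, 0.2):
    gbr = GradientBoostingRegressor(n_estimators=200,
                                    learning_rate=lam,
                                    max_depth=3,
                                    random_state=0)
    gbr.fit(X_tr, y_tr)
    # staged_predict() yields the test predictions after each boosting step.
    test_mse[lam] = [np.mean((y_te - pred) ** 2)
                     for pred in gbr.staged_predict(X_te)]

# The last staged prediction agrees with an ordinary call to predict().
final_matches = np.isclose(test_mse[0.2][-1],
                           np.mean((y_te - gbr.predict(X_te)) ** 2))
print(test_mse[0.001][-1], test_mse[0.2][-1], final_matches)
```

With so few trees, the small value of $\lambda$ has barely begun to fit the data, which is why a small shrinkage value is usually paired with a large `n_estimators`.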
@@ -527,7 +519,7 @@ as when using $\lambda=0.001$.
|
|||||||
|
|
||||||
|
|
||||||
## Bayesian Additive Regression Trees
|
## Bayesian Additive Regression Trees
|
||||||
|
|
||||||
|
|
||||||
In this section we demonstrate a `Python` implementation of BART found in the
|
In this section we demonstrate a `Python` implementation of BART found in the
|
||||||
`ISLP.bart` package. We fit a model
|
`ISLP.bart` package. We fit a model
|
||||||
@@ -540,8 +532,8 @@ bart_boston = BART(random_state=0, burnin=5, ndraw=15)
|
|||||||
bart_boston.fit(X_train, y_train)
|
bart_boston.fit(X_train, y_train)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest.
|
On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -549,8 +541,8 @@ yhat_test = bart_boston.predict(X_test.astype(np.float32))
|
|||||||
np.mean((y_test - yhat_test)**2)
|
np.mean((y_test - yhat_test)**2)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
We can check how many times each variable appeared in the collection of trees.
|
We can check how many times each variable appeared in the collection of trees.
|
||||||
This gives a summary similar to the variable importance plot for boosting and random forests.
|
This gives a summary similar to the variable importance plot for boosting and random forests.
|
||||||
|
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,23 +1,15 @@
|
|||||||
---
|
|
||||||
jupyter:
|
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
# Support Vector Machines
|
||||||
formats: Rmd,ipynb
|
|
||||||
main_language: python
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch09-svm-lab.ipynb">
|
||||||
text_representation:
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
extension: .Rmd
|
</a>
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch09-svm-lab.ipynb)
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 9
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
# Lab: Support Vector Machines
|
|
||||||
In this lab, we use the `sklearn.svm` library to demonstrate the support
|
In this lab, we use the `sklearn.svm` library to demonstrate the support
|
||||||
vector classifier and the support vector machine.
|
vector classifier and the support vector machine.
|
||||||
|
|
||||||
@@ -39,7 +31,7 @@ from ISLP.svm import plot as plot_svm
|
|||||||
from sklearn.metrics import RocCurveDisplay
|
from sklearn.metrics import RocCurveDisplay
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We will use the function `RocCurveDisplay.from_estimator()` to
|
We will use the function `RocCurveDisplay.from_estimator()` to
|
||||||
produce several ROC plots, using a shorthand `roc_curve`.
|
produce several ROC plots, using a shorthand `roc_curve`.
|
||||||
|
|
||||||
@@ -84,8 +76,8 @@ svm_linear = SVC(C=10, kernel='linear')
|
|||||||
svm_linear.fit(X, y)
|
svm_linear.fit(X, y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The support vector classifier with two features can
|
The support vector classifier with two features can
|
||||||
be visualized by plotting values of its *decision function*.
|
be visualized by plotting values of its *decision function*.
|
||||||
We have included a function for this in the `ISLP` package (inspired by a similar
|
We have included a function for this in the `ISLP` package (inspired by a similar
|
||||||
@@ -99,7 +91,7 @@ plot_svm(X,
|
|||||||
ax=ax)
|
ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The decision
|
The decision
|
||||||
boundary between the two classes is linear (because we used the
|
boundary between the two classes is linear (because we used the
|
||||||
argument `kernel='linear'`). The support vectors are marked with `+`
|
argument `kernel='linear'`). The support vectors are marked with `+`
|
||||||
@@ -126,8 +118,8 @@ coefficients of the linear decision boundary as follows:
|
|||||||
svm_linear.coef_
|
svm_linear.coef_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Since the support vector machine is an estimator in `sklearn`, we
|
Since the support vector machine is an estimator in `sklearn`, we
|
||||||
can use the usual machinery to tune it.
|
can use the usual machinery to tune it.
|
||||||
|
|
||||||
@@ -144,8 +136,8 @@ grid.fit(X, y)
|
|||||||
grid.best_params_
|
grid.best_params_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
We can easily access the cross-validation errors for each of these models
|
We can easily access the cross-validation errors for each of these models
|
||||||
in `grid.cv_results_`. This prints out a lot of detail, so we
|
in `grid.cv_results_`. This prints out a lot of detail, so we
|
||||||
extract the accuracy results only.
|
extract the accuracy results only.
|
||||||
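As a rough, self-contained sketch of this tuning step, the block below runs a five-fold grid search over `C` on simulated two-class data and pulls out the mean accuracy for each candidate. The data, the grid of `C` values, and the names `X_demo`, `y_demo`, and `grid_demo` are assumptions for illustration and need not match the lab's own objects.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Simulated two-class data (an assumption for illustration).
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((50, 2))
y_demo = np.array([-1] * 25 + [1] * 25)
X_demo[y_demo == 1] += 1

C_grid = [0.001, 0.01, 0.1, 1, 5, 10, 100]
kfold = KFold(5, random_state=0, shuffle=True)
grid_demo = GridSearchCV(SVC(kernel='linear'),
                         {'C': C_grid},
                         refit=True,
                         cv=kfold,
                         scoring='accuracy')
grid_demo.fit(X_demo, y_demo)

# cv_results_ is a dictionary; 'mean_test_score' holds one
# cross-validated accuracy per value of C, in grid order.
cv_accuracy = grid_demo.cv_results_['mean_test_score']
print(dict(zip(C_grid, cv_accuracy)))
```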
@@ -166,7 +158,7 @@ y_test = np.array([-1]*10+[1]*10)
|
|||||||
X_test[y_test==1] += 1
|
X_test[y_test==1] += 1
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Now we predict the class labels of these test observations. Here we
|
Now we predict the class labels of these test observations. Here we
|
||||||
use the best model selected by cross-validation in order to make the
|
use the best model selected by cross-validation in order to make the
|
||||||
predictions.
|
predictions.
|
||||||
@@ -177,7 +169,7 @@ y_test_hat = best_.predict(X_test)
|
|||||||
confusion_table(y_test_hat, y_test)
|
confusion_table(y_test_hat, y_test)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Thus, with this value of `C`,
|
Thus, with this value of `C`,
|
||||||
70% of the test
|
70% of the test
|
||||||
observations are correctly classified. What if we had instead used
|
observations are correctly classified. What if we had instead used
|
||||||
@@ -190,7 +182,7 @@ y_test_hat = svm_.predict(X_test)
|
|||||||
confusion_table(y_test_hat, y_test)
|
confusion_table(y_test_hat, y_test)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
In this case 60% of test observations are correctly classified.
|
In this case 60% of test observations are correctly classified.
|
||||||
|
|
||||||
We now consider a situation in which the two classes are linearly
|
We now consider a situation in which the two classes are linearly
|
||||||
@@ -205,7 +197,7 @@ fig, ax = subplots(figsize=(8,8))
|
|||||||
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
|
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Now the observations are just barely linearly separable.
|
Now the observations are just barely linearly separable.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -214,7 +206,7 @@ y_hat = svm_.predict(X)
|
|||||||
confusion_table(y_hat, y)
|
confusion_table(y_hat, y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We fit the
|
We fit the
|
||||||
support vector classifier and plot the resulting hyperplane, using a
|
support vector classifier and plot the resulting hyperplane, using a
|
||||||
very large value of `C` so that no observations are
|
very large value of `C` so that no observations are
|
||||||
@@ -240,7 +232,7 @@ y_hat = svm_.predict(X)
|
|||||||
confusion_table(y_hat, y)
|
confusion_table(y_hat, y)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Using `C=0.1`, we again do not misclassify any training observations, but we
|
Using `C=0.1`, we again do not misclassify any training observations, but we
|
||||||
also obtain a much wider margin and make use of twelve support
|
also obtain a much wider margin and make use of twelve support
|
||||||
vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test
|
vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test
|
||||||
@@ -254,7 +246,7 @@ plot_svm(X,
|
|||||||
ax=ax)
|
ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Support Vector Machine
|
## Support Vector Machine
|
||||||
In order to fit an SVM using a non-linear kernel, we once again use
|
In order to fit an SVM using a non-linear kernel, we once again use
|
||||||
@@ -264,9 +256,9 @@ kernel we use `kernel="poly"`, and to fit an SVM with a
|
|||||||
radial kernel we use
|
radial kernel we use
|
||||||
`kernel="rbf"`. In the former case we also use the
|
`kernel="rbf"`. In the former case we also use the
|
||||||
`degree` argument to specify a degree for the polynomial kernel
|
`degree` argument to specify a degree for the polynomial kernel
|
||||||
(this is $d$ in (9.22)), and in the latter case we use
|
(this is $d$ in (\ref{Ch9:eq:polyd})), and in the latter case we use
|
||||||
`gamma` to specify a value of $\gamma$ for the radial basis
|
`gamma` to specify a value of $\gamma$ for the radial basis
|
||||||
kernel (9.24).
|
kernel (\ref{Ch9:eq:radial}).
|
||||||
|
|
||||||
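To make the roles of `degree` and `gamma` concrete, here is a small sketch that evaluates both kernels by hand for a pair of points. It assumes the textbook forms $K(x,z)=(1+\sum_j x_j z_j)^d$ for the polynomial kernel and $K(x,z)=\exp(-\gamma\sum_j (x_j-z_j)^2)$ for the radial kernel; the function names are hypothetical helpers, not part of `sklearn`.

```python
import math

def polynomial_kernel(x, z, d):
    """Polynomial kernel of degree d: (1 + <x, z>)^d."""
    inner = sum(xj * zj for xj, zj in zip(x, z))
    return (1 + inner) ** d

def radial_kernel(x, z, gamma):
    """Radial kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xj - zj) ** 2 for xj, zj in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [1.0, 2.0]
z = [3.0, 4.0]
poly_value = polynomial_kernel(x, z, d=2)    # (1 + 11)^2
rbf_value = radial_kernel(x, z, gamma=1.0)   # exp(-8)
rbf_same_point = radial_kernel(x, x, gamma=1.0)
print(poly_value, rbf_value, rbf_same_point)
```

Note that `SVC(kernel="poly")` parameterizes its kernel as $(\gamma \langle x, z\rangle + r)^d$ with `coef0` $=r$ defaulting to 0, so matching the textbook form exactly would require setting `coef0=1` and `gamma=1`.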
We first generate some data with a non-linear class boundary, as follows:
|
We first generate some data with a non-linear class boundary, as follows:
|
||||||
|
|
||||||
@@ -277,7 +269,7 @@ X[100:150] -= 2
|
|||||||
y = np.array([1]*150+[2]*50)
|
y = np.array([1]*150+[2]*50)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Plotting the data makes it clear that the class boundary is indeed non-linear.
|
Plotting the data makes it clear that the class boundary is indeed non-linear.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -285,11 +277,11 @@ fig, ax = subplots(figsize=(8,8))
|
|||||||
ax.scatter(X[:,0],
|
ax.scatter(X[:,0],
|
||||||
X[:,1],
|
X[:,1],
|
||||||
c=y,
|
c=y,
|
||||||
cmap=cm.coolwarm)
|
cmap=cm.coolwarm);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The data is randomly split into training and testing groups. We then
|
The data is randomly split into training and testing groups. We then
|
||||||
fit the training data using the `SVC()` estimator with a
|
fit the training data using the `SVC()` estimator with a
|
||||||
radial kernel and $\gamma=1$:
|
radial kernel and $\gamma=1$:
|
||||||
@@ -306,7 +298,7 @@ svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
|
|||||||
svm_rbf.fit(X_train, y_train)
|
svm_rbf.fit(X_train, y_train)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The plot shows that the resulting SVM has a decidedly non-linear
|
The plot shows that the resulting SVM has a decidedly non-linear
|
||||||
boundary.
|
boundary.
|
||||||
|
|
||||||
@@ -318,7 +310,7 @@ plot_svm(X_train,
|
|||||||
ax=ax)
|
ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can see from the figure that there are a fair number of training
|
We can see from the figure that there are a fair number of training
|
||||||
errors in this SVM fit. If we increase the value of `C`, we
|
errors in this SVM fit. If we increase the value of `C`, we
|
||||||
can reduce the number of training errors. However, this comes at the
|
can reduce the number of training errors. However, this comes at the
|
||||||
@@ -335,7 +327,7 @@ plot_svm(X_train,
|
|||||||
ax=ax)
|
ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can perform cross-validation using `skm.GridSearchCV()` to select the
|
We can perform cross-validation using `skm.GridSearchCV()` to select the
|
||||||
best choice of $\gamma$ and `C` for an SVM with a radial
|
best choice of $\gamma$ and `C` for an SVM with a radial
|
||||||
kernel:
|
kernel:
|
||||||
@@ -354,7 +346,7 @@ grid.fit(X_train, y_train)
|
|||||||
grid.best_params_
|
grid.best_params_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The best choice of parameters under five-fold CV is achieved at `C=1`
|
The best choice of parameters under five-fold CV is achieved at `C=1`
|
||||||
and `gamma=0.5`, though several other values also achieve the same
|
and `gamma=0.5`, though several other values also achieve the same
|
||||||
value.
|
value.
|
||||||
@@ -371,7 +363,7 @@ y_hat_test = best_svm.predict(X_test)
|
|||||||
confusion_table(y_hat_test, y_test)
|
confusion_table(y_hat_test, y_test)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
With these parameters, 12% of test
|
With these parameters, 12% of test
|
||||||
observations are misclassified by this SVM.
|
observations are misclassified by this SVM.
|
||||||
|
|
||||||
@@ -386,7 +378,7 @@ classifier, the fitted value for an observation $X= (X_1, X_2, \ldots,
|
|||||||
X_p)^T$ takes the form $\hat{\beta}_0 + \hat{\beta}_1 X_1 +
|
X_p)^T$ takes the form $\hat{\beta}_0 + \hat{\beta}_1 X_1 +
|
||||||
\hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$. For an SVM with a
|
\hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$. For an SVM with a
|
||||||
non-linear kernel, the equation that yields the fitted value is given
|
non-linear kernel, the equation that yields the fitted value is given
|
||||||
in (9.23). The sign of the fitted value
|
in (\ref{Ch9:eq:svmip}). The sign of the fitted value
|
||||||
determines on which side of the decision boundary the observation
|
determines on which side of the decision boundary the observation
|
||||||
lies. Therefore, the relationship between the fitted value and the
|
lies. Therefore, the relationship between the fitted value and the
|
||||||
class prediction for a given observation is simple: if the fitted
|
class prediction for a given observation is simple: if the fitted
|
||||||
@@ -431,7 +423,7 @@ roc_curve(svm_flex,
|
|||||||
ax=ax);
|
ax=ax);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
However, these ROC curves are all on the training data. We are really
|
However, these ROC curves are all on the training data. We are really
|
||||||
more interested in the level of prediction accuracy on the test
|
more interested in the level of prediction accuracy on the test
|
||||||
data. When we compute the ROC curves on the test data, the model with
|
data. When we compute the ROC curves on the test data, the model with
|
||||||
@@ -447,7 +439,7 @@ roc_curve(svm_flex,
|
|||||||
fig;
|
fig;
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Let's look at our tuned SVM.
|
Let's look at our tuned SVM.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -466,7 +458,7 @@ for (X_, y_, c, name) in zip(
|
|||||||
color=c)
|
color=c)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## SVM with Multiple Classes
|
## SVM with Multiple Classes
|
||||||
|
|
||||||
If the response is a factor containing more than two levels, then the
|
If the response is a factor containing more than two levels, then the
|
||||||
@@ -485,7 +477,7 @@ fig, ax = subplots(figsize=(8,8))
|
|||||||
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
|
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now fit an SVM to the data:
|
We now fit an SVM to the data:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -521,7 +513,7 @@ Khan = load_data('Khan')
|
|||||||
Khan['xtrain'].shape, Khan['xtest'].shape
|
Khan['xtrain'].shape, Khan['xtest'].shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This data set consists of expression measurements for 2,308
|
This data set consists of expression measurements for 2,308
|
||||||
genes. The training and test sets consist of 63 and 20
|
genes. The training and test sets consist of 63 and 20
|
||||||
observations, respectively.
|
observations, respectively.
|
||||||
@@ -540,7 +532,7 @@ confusion_table(khan_linear.predict(Khan['xtrain']),
|
|||||||
Khan['ytrain'])
|
Khan['ytrain'])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that there are *no* training
|
We see that there are *no* training
|
||||||
errors. In fact, this is not surprising, because the large number of
|
errors. In fact, this is not surprising, because the large number of
|
||||||
variables relative to the number of observations implies that it is
|
variables relative to the number of observations implies that it is
|
||||||
@@ -553,7 +545,7 @@ confusion_table(khan_linear.predict(Khan['xtest']),
|
|||||||
Khan['ytest'])
|
Khan['ytest'])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that using `C=10` yields two test set errors on these data.
|
We see that using `C=10` yields two test set errors on these data.
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
1286
Ch09-svm-lab.ipynb
File diff suppressed because one or more lines are too long
@@ -1,13 +1,18 @@
|
|||||||
|
# Deep Learning
|
||||||
|
|
||||||
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch10-deeplearning-lab.ipynb">
|
||||||
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
|
</a>
|
||||||
|
|
||||||
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch10-deeplearning-lab.ipynb)
|
||||||
|
|
||||||
# Chapter 10
|
|
||||||
|
|
||||||
# Lab: Deep Learning
|
|
||||||
In this section we demonstrate how to fit the examples discussed
|
In this section we demonstrate how to fit the examples discussed
|
||||||
in the text. We use the `Python` `torch` package, along with the
|
in the text. We use the `Python`{} `torch` package, along with the
|
||||||
`pytorch_lightning` package which provides utilities to simplify
|
`pytorch_lightning` package which provides utilities to simplify
|
||||||
fitting and evaluating models. This code can be impressively fast
|
fitting and evaluating models. This code can be impressively fast
|
||||||
with certain special processors, such as Apple's new M1 chip. The package is well-structured, flexible, and will feel comfortable
|
with certain special processors, such as Apple's new M1 chip. The package is well-structured, flexible, and will feel comfortable
|
||||||
to `Python` users. A good companion is the site
|
to `Python`{} users. A good companion is the site
|
||||||
[pytorch.org/tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html).
|
[pytorch.org/tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html).
|
||||||
Much of our code is adapted from there, as well as the `pytorch_lightning` documentation. {The precise URLs at the time of writing are <https://pytorch.org/tutorials/beginner/basics/intro.html> and <https://pytorch-lightning.readthedocs.io/en/latest/>.}
|
Much of our code is adapted from there, as well as the `pytorch_lightning` documentation. {The precise URLs at the time of writing are <https://pytorch.org/tutorials/beginner/basics/intro.html> and <https://pytorch-lightning.readthedocs.io/en/latest/>.}
|
||||||
|
|
||||||
@@ -50,7 +55,13 @@ the `torchmetrics` package has utilities to compute
|
|||||||
various metrics to evaluate performance when fitting
|
various metrics to evaluate performance when fitting
|
||||||
a model. The `torchinfo` package provides a useful
|
a model. The `torchinfo` package provides a useful
|
||||||
summary of the layers of a model. We use the `read_image()`
|
summary of the layers of a model. We use the `read_image()`
|
||||||
function when loading test images in Section 10.9.4.
|
function when loading test images in Section~\ref{Ch13-deeplearning-lab:using-pretrained-cnn-models}.
|
||||||
|
|
||||||
|
If you have not already installed the packages `torchvision`
|
||||||
|
and `torchinfo` you can install them by running
|
||||||
|
`pip install torchinfo torchvision`.
|
||||||
|
We can now import from `torchinfo`.
|
||||||
|
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
from torchmetrics import (MeanAbsoluteError,
|
from torchmetrics import (MeanAbsoluteError,
|
||||||
@@ -150,7 +161,7 @@ import json
|
|||||||
|
|
||||||
|
|
||||||
## Single Layer Network on Hitters Data
|
## Single Layer Network on Hitters Data
|
||||||
We start by fitting the models in Section 10.6 on the `Hitters` data.
|
We start by fitting the models in Section~\ref{Ch13:sec:when-use-deep} on the `Hitters` data.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
Hitters = load_data('Hitters').dropna()
|
Hitters = load_data('Hitters').dropna()
|
||||||
@@ -204,7 +215,7 @@ np.abs(Yhat_test - Y_test).mean()
|
|||||||
|
|
||||||
Next we fit the lasso using `sklearn`. We are using
|
Next we fit the lasso using `sklearn`. We are using
|
||||||
mean absolute error to select and evaluate a model, rather than mean squared error.
|
mean absolute error to select and evaluate a model, rather than mean squared error.
|
||||||
The specialized solver we used in Section 6.5.2 uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
|
The specialized solver we used in Section~\ref{Ch6-varselect-lab:lab-2-ridge-regression-and-the-lasso} uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
|
||||||
|
|
||||||
We encode a pipeline with two steps: we first normalize the features using a `StandardScaler()` transform,
|
We encode a pipeline with two steps: we first normalize the features using a `StandardScaler()` transform,
|
||||||
and then fit the lasso without further normalization.
|
and then fit the lasso without further normalization.
|
||||||
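The two-step pipeline and the direct cross-validation over mean absolute error can be sketched as follows. This is an illustration on synthetic data under assumed settings (the grid of `alpha` values, the fold count, and all `*_demo` names are hypothetical), not the lab's own code.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5)) * np.array([1, 10, 100, 1, 1])
y_demo = 2 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(size=120)

# Step 1 standardizes the features; step 2 fits the lasso.
pipe_demo = Pipeline([('scaler', StandardScaler()),
                      ('lasso', Lasso(max_iter=30000))])

# Parameters of a pipeline step are addressed as '<step>__<parameter>'.
param_grid = {'lasso__alpha': np.logspace(-3, 1, 20)}
cv = KFold(5, shuffle=True, random_state=0)
grid_demo = GridSearchCV(pipe_demo,
                         param_grid,
                         cv=cv,
                         scoring='neg_mean_absolute_error')
grid_demo.fit(X_demo, y_demo)

best_alpha = grid_demo.best_params_['lasso__alpha']
best_mae = -grid_demo.best_score_   # scores are negated errors
print(best_alpha, best_mae)
```

Placing the scaler inside the pipeline means it is re-fit on the training portion of each fold, so no information from the held-out fold leaks into the standardization.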
@@ -426,7 +437,7 @@ hit_module = SimpleModule.regression(hit_model,
|
|||||||
```
|
```
|
||||||
|
|
||||||
By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in
|
By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in
|
||||||
(10.23).
|
(\ref{Ch13:eq:4}).
|
||||||
We have also asked for mean absolute error to be tracked as well
|
We have also asked for mean absolute error to be tracked as well
|
||||||
in the metrics that are logged.
|
in the metrics that are logged.
|
||||||
|
|
||||||
@@ -463,7 +474,7 @@ hit_trainer = Trainer(deterministic=True,
|
|||||||
hit_trainer.fit(hit_module, datamodule=hit_dm)
|
hit_trainer.fit(hit_module, datamodule=hit_dm)
|
||||||
```
|
```
|
||||||
At each step of SGD, the algorithm randomly selects 32 training observations for
|
At each step of SGD, the algorithm randomly selects 32 training observations for
|
||||||
the computation of the gradient. Recall from Section 10.7
|
the computation of the gradient. Recall from Section~\ref{Ch13:sec:fitt-neur-netw}
|
||||||
that an epoch amounts to the number of SGD steps required to process $n$
|
that an epoch amounts to the number of SGD steps required to process $n$
|
||||||
observations. Since the training set has
|
observations. Since the training set has
|
||||||
$n=175$, and we specified a `batch_size` of 32 in the construction of `hit_dm`, an epoch is $175/32\approx 5.5$ SGD steps.
|
$n=175$, and we specified a `batch_size` of 32 in the construction of `hit_dm`, an epoch is $175/32\approx 5.5$ SGD steps.
|
||||||
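The arithmetic behind this count can be checked directly. The sketch below assumes that a final, partially filled batch is still processed (the usual data loader behavior when incomplete batches are not dropped), so the fractional figure corresponds to five full batches plus one smaller one.

```python
import math

n_train = 175      # training observations
batch_size = 32    # observations per SGD step

steps_exact = n_train / batch_size                 # 5.46875
full_batches, leftover = divmod(n_train, batch_size)
# If the last partial batch is kept, each epoch takes this many steps.
steps_per_epoch = math.ceil(n_train / batch_size)

print(steps_exact, full_batches, leftover, steps_per_epoch)
```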
@@ -701,12 +712,13 @@ mnist_logger = CSVLogger('logs', name='MNIST')
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Now we are ready to go. The final step is to supply training data, and fit the model.
|
Now we are ready to go. The final step is to supply training data, and fit the model. We disable the progress bar below to avoid lengthy output in the browser when running.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
mnist_trainer = Trainer(deterministic=True,
|
mnist_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=30,
|
max_epochs=30,
|
||||||
logger=mnist_logger,
|
logger=mnist_logger,
|
||||||
|
enable_progress_bar=False,
|
||||||
callbacks=[ErrorTracker()])
|
callbacks=[ErrorTracker()])
|
||||||
mnist_trainer.fit(mnist_module,
|
mnist_trainer.fit(mnist_module,
|
||||||
datamodule=mnist_dm)
|
datamodule=mnist_dm)
|
||||||
@@ -751,8 +763,8 @@ mnist_trainer.test(mnist_module,
|
|||||||
datamodule=mnist_dm)
|
datamodule=mnist_dm)
|
||||||
```
|
```
|
||||||
|
|
||||||
Table 10.1 also reports the error rates resulting from LDA (Chapter 4) and multiclass logistic
|
Table~\ref{Ch13:tab:mnist} also reports the error rates resulting from LDA (Chapter~\ref{Ch4:classification}) and multiclass logistic
|
||||||
regression. For LDA we refer the reader to Section 4.7.3.
|
regression. For LDA we refer the reader to Section~\ref{Ch4-classification-lab:linear-discriminant-analysis}.
|
||||||
Although we could use the `sklearn` function `LogisticRegression()` to fit
|
Although we could use the `sklearn` function `LogisticRegression()` to fit
|
||||||
multiclass logistic regression, we are set up here to fit such a model
|
multiclass logistic regression, we are set up here to fit such a model
|
||||||
with `torch`.
|
with `torch`.
|
||||||
@@ -776,6 +788,7 @@ mlr_logger = CSVLogger('logs', name='MNIST_MLR')
|
|||||||
```{python}
|
```{python}
|
||||||
mlr_trainer = Trainer(deterministic=True,
|
mlr_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=30,
|
max_epochs=30,
|
||||||
|
enable_progress_bar=False,
|
||||||
callbacks=[ErrorTracker()])
|
callbacks=[ErrorTracker()])
|
||||||
mlr_trainer.fit(mlr_module, datamodule=mnist_dm)
|
mlr_trainer.fit(mlr_module, datamodule=mnist_dm)
|
||||||
```
|
```
|
||||||
@@ -856,7 +869,7 @@ for idx, (X_ ,Y_) in enumerate(cifar_dm.train_dataloader()):
|
|||||||
|
|
||||||
|
|
||||||
Before we start, we look at some of the training images; similar code produced
|
Before we start, we look at some of the training images; similar code produced
|
||||||
Figure 10.5 on page 406. The example below also illustrates
|
Figure~\ref{Ch13:fig:cifar100} on page \pageref{Ch13:fig:cifar100}. The example below also illustrates
|
||||||
that `TensorDataset` objects can be indexed with integers --- we are choosing
|
that `TensorDataset` objects can be indexed with integers --- we are choosing
|
||||||
random images from the training data by indexing `cifar_train`. In order to display correctly,
|
random images from the training data by indexing `cifar_train`. In order to display correctly,
|
||||||
we must reorder the dimensions by a call to `np.transpose()`.
|
we must reorder the dimensions by a call to `np.transpose()`.
|
||||||
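A minimal sketch of the reordering step: image tensors in `torch` are stored channel-first, while plotting routines expect the channel axis last. The array `img_chw` below is a hypothetical stand-in for one 32 x 32 color image, not an actual training image.

```python
import numpy as np

# Hypothetical channel-first image: (channels, height, width).
img_chw = np.zeros((3, 32, 32))
img_chw[0] = 1.0   # fill the red channel only

# Move the channel axis to the end: (height, width, channels).
img_hwc = np.transpose(img_chw, [1, 2, 0])

print(img_chw.shape, img_hwc.shape)
print(img_hwc[0, 0])   # RGB values of the top-left pixel
```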
@@ -879,7 +892,7 @@ for i in range(5):
|
|||||||
Here the `imshow()` method recognizes from the shape of its argument that it is a 3-dimensional array, with the last dimension indexing the three RGB color channels.
|
Here the `imshow()` method recognizes from the shape of its argument that it is a 3-dimensional array, with the last dimension indexing the three RGB color channels.
|
||||||
|
|
||||||
We specify a moderately-sized CNN for
|
We specify a moderately-sized CNN for
|
||||||
demonstration purposes, similar in structure to Figure 10.8.
|
demonstration purposes, similar in structure to Figure~\ref{Ch13:fig:DeepCNN}.
|
||||||
We use several layers, each consisting of convolution, ReLU, and max-pooling steps.
|
We use several layers, each consisting of convolution, ReLU, and max-pooling steps.
|
||||||
We first define a module that defines one of these layers. As in our
|
We first define a module that defines one of these layers. As in our
|
||||||
previous examples, we overwrite the `__init__()` and `forward()` methods
|
previous examples, we overwrite the `__init__()` and `forward()` methods
|
||||||
@@ -995,6 +1008,7 @@ cifar_logger = CSVLogger('logs', name='CIFAR100')
|
|||||||
cifar_trainer = Trainer(deterministic=True,
|
cifar_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=30,
|
max_epochs=30,
|
||||||
logger=cifar_logger,
|
logger=cifar_logger,
|
||||||
|
enable_progress_bar=False,
|
||||||
callbacks=[ErrorTracker()])
|
callbacks=[ErrorTracker()])
|
||||||
cifar_trainer.fit(cifar_module,
|
cifar_trainer.fit(cifar_module,
|
||||||
datamodule=cifar_dm)
|
datamodule=cifar_dm)
|
||||||
@@ -1051,6 +1065,7 @@ try:
|
|||||||
cifar_module.metrics[name] = metric.to('mps')
|
cifar_module.metrics[name] = metric.to('mps')
|
||||||
cifar_trainer_mps = Trainer(accelerator='mps',
|
cifar_trainer_mps = Trainer(accelerator='mps',
|
||||||
deterministic=True,
|
deterministic=True,
|
||||||
|
enable_progress_bar=False,
|
||||||
max_epochs=30)
|
max_epochs=30)
|
||||||
cifar_trainer_mps.fit(cifar_module,
|
cifar_trainer_mps.fit(cifar_module,
|
||||||
datamodule=cifar_dm)
|
datamodule=cifar_dm)
|
||||||
@@ -1066,7 +1081,7 @@ clauses; if it works, we get the speedup, if it fails, nothing happens.
|
|||||||
|
|
||||||
## Using Pretrained CNN Models
|
## Using Pretrained CNN Models
|
||||||
We now show how to use a CNN pretrained on the `imagenet` database to classify natural
|
We now show how to use a CNN pretrained on the `imagenet` database to classify natural
|
||||||
images, and demonstrate how we produced Figure 10.10.
|
images, and demonstrate how we produced Figure~\ref{Ch13:fig:homeimages}.
|
||||||
We copied six JPEG images from a digital photo album into the
|
We copied six JPEG images from a digital photo album into the
|
||||||
directory `book_images`. These images are available
|
directory `book_images`. These images are available
|
||||||
from the data section of <www.statlearning.com>, the ISLP book website. Download `book_images.zip`; when
|
from the data section of <www.statlearning.com>, the ISLP book website. Download `book_images.zip`; when
|
||||||
@@ -1175,7 +1190,7 @@ del(cifar_test,
|
|||||||
|
|
||||||
|
|
||||||
## IMDB Document Classification
|
## IMDB Document Classification
|
||||||
We now implement models for sentiment classification (Section 10.4) on the `IMDB`
|
We now implement models for sentiment classification (Section~\ref{Ch13:sec:docum-class}) on the `IMDB`
|
||||||
dataset. As mentioned above code block~8, we are using
|
dataset. As mentioned above code block~8, we are using
|
||||||
a preprocessed version of the `IMDB` dataset found in the
|
a preprocessed version of the `IMDB` dataset found in the
|
||||||
`keras` package. As `keras` uses `tensorflow`, a different
|
`keras` package. As `keras` uses `tensorflow`, a different
|
||||||
@@ -1299,6 +1314,7 @@ imdb_logger = CSVLogger('logs', name='IMDB')
|
|||||||
imdb_trainer = Trainer(deterministic=True,
|
imdb_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=30,
|
max_epochs=30,
|
||||||
logger=imdb_logger,
|
logger=imdb_logger,
|
||||||
|
enable_progress_bar=False,
|
||||||
callbacks=[ErrorTracker()])
|
callbacks=[ErrorTracker()])
|
||||||
imdb_trainer.fit(imdb_module,
|
imdb_trainer.fit(imdb_module,
|
||||||
datamodule=imdb_dm)
|
datamodule=imdb_dm)
|
||||||
@@ -1328,7 +1344,7 @@ matrix that is recognized by `sklearn.`
|
|||||||
```
|
```
|
||||||
|
|
||||||
Similar to what we did in
|
Similar to what we did in
|
||||||
Section 10.9.1,
|
Section~\ref{Ch13-deeplearning-lab:single-layer-network-on-hitters-data},
|
||||||
we construct a series of 50 values for the lasso regularization parameter $\lambda$.
|
we construct a series of 50 values for the lasso regularization parameter $\lambda$.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -1436,16 +1452,16 @@ del(imdb_model,
|
|||||||
|
|
||||||
## Recurrent Neural Networks
|
## Recurrent Neural Networks
|
||||||
In this lab we fit the models illustrated in
|
In this lab we fit the models illustrated in
|
||||||
Section 10.5.
|
Section~\ref{Ch13:sec:recurr-neur-netw}.
|
||||||
|
|
||||||
|
|
||||||
### Sequential Models for Document Classification
|
### Sequential Models for Document Classification
|
||||||
Here we fit a simple LSTM RNN for sentiment prediction to
|
Here we fit a simple LSTM RNN for sentiment prediction to
|
||||||
the `IMDb` movie-review data, as discussed in Section 10.5.1.
|
the `IMDb` movie-review data, as discussed in Section~\ref{Ch13:sec:sequ-models-docum}.
|
||||||
For an RNN we use the sequence of words in a document, taking their
|
For an RNN we use the sequence of words in a document, taking their
|
||||||
order into account. We loaded the preprocessed
|
order into account. We loaded the preprocessed
|
||||||
data at the beginning of
|
data at the beginning of
|
||||||
Section 10.9.5.
|
Section~\ref{Ch13-deeplearning-lab:imdb-document-classification}.
|
||||||
A script that details the preprocessing can be found in the
|
A script that details the preprocessing can be found in the
|
||||||
`ISLP` library. Notably, since more than 90% of the documents
|
`ISLP` library. Notably, since more than 90% of the documents
|
||||||
had fewer than 500 words, we set the document length to 500. For
|
had fewer than 500 words, we set the document length to 500. For
|
||||||
@@ -1519,6 +1535,7 @@ lstm_logger = CSVLogger('logs', name='IMDB_LSTM')
|
|||||||
lstm_trainer = Trainer(deterministic=True,
|
lstm_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=20,
|
max_epochs=20,
|
||||||
logger=lstm_logger,
|
logger=lstm_logger,
|
||||||
|
enable_progress_bar=False,
|
||||||
callbacks=[ErrorTracker()])
|
callbacks=[ErrorTracker()])
|
||||||
lstm_trainer.fit(lstm_module,
|
lstm_trainer.fit(lstm_module,
|
||||||
datamodule=imdb_seq_dm)
|
datamodule=imdb_seq_dm)
|
||||||
@@ -1559,7 +1576,7 @@ del(lstm_model,
|
|||||||
|
|
||||||
|
|
||||||
### Time Series Prediction
|
### Time Series Prediction
|
||||||
We now show how to fit the models in Section 10.5.2
|
We now show how to fit the models in Section~\ref{Ch13:sec:time-seri-pred}
|
||||||
for time series prediction.
|
for time series prediction.
|
||||||
We first load and standardize the data.
|
We first load and standardize the data.
|
||||||
|
|
||||||
@@ -1750,6 +1767,7 @@ The results on the test data are very similar to the linear AR model.
|
|||||||
```{python}
|
```{python}
|
||||||
nyse_trainer = Trainer(deterministic=True,
|
nyse_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=200,
|
max_epochs=200,
|
||||||
|
enable_progress_bar=False,
|
||||||
callbacks=[ErrorTracker()])
|
callbacks=[ErrorTracker()])
|
||||||
nyse_trainer.fit(nyse_module,
|
nyse_trainer.fit(nyse_module,
|
||||||
datamodule=nyse_dm)
|
datamodule=nyse_dm)
|
||||||
@@ -1820,8 +1838,9 @@ and evaluate the test error. We see the test $R^2$ is a slight improvement over
|
|||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
nl_trainer = Trainer(deterministic=True,
|
nl_trainer = Trainer(deterministic=True,
|
||||||
max_epochs=20,
|
max_epochs=20,
|
||||||
callbacks=[ErrorTracker()])
|
enable_progress_bar=False,
|
||||||
|
callbacks=[ErrorTracker()])
|
||||||
nl_trainer.fit(nl_module, datamodule=day_dm)
|
nl_trainer.fit(nl_module, datamodule=day_dm)
|
||||||
nl_trainer.test(nl_module, datamodule=day_dm)
|
nl_trainer.test(nl_module, datamodule=day_dm)
|
||||||
```
|
```
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,27 +1,19 @@
|
|||||||
---
|
|
||||||
jupyter:
|
|
||||||
jupytext:
|
|
||||||
cell_metadata_filter: -all
|
|
||||||
formats: Rmd,ipynb
|
|
||||||
main_language: python
|
|
||||||
text_representation:
|
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 11
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Survival Analysis
|
||||||
|
|
||||||
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch11-surv-lab.ipynb">
|
||||||
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
|
</a>
|
||||||
|
|
||||||
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch11-surv-lab.ipynb)
|
||||||
|
|
||||||
|
|
||||||
# Lab: Survival Analysis
|
|
||||||
In this lab, we perform survival analyses on three separate data
|
In this lab, we perform survival analyses on three separate data
|
||||||
sets. In Section 11.8.1 we analyze the `BrainCancer`
|
sets. In Section~\ref{brain.cancer.sec} we analyze the `BrainCancer`
|
||||||
data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
|
data that was first described in Section~\ref{sec:KM}. In Section~\ref{time.to.pub.sec}, we examine the `Publication`
|
||||||
data from Section 11.5.4. Finally, Section 11.8.3 explores
|
data from Section~\ref{sec:pub}. Finally, Section~\ref{call.center.sec} explores
|
||||||
a simulated call-center data set.
|
a simulated call-center data set.
|
||||||
|
|
||||||
We begin by importing some of our libraries at this top
|
We begin by importing some of our libraries at this top
|
||||||
@@ -37,7 +29,7 @@ from ISLP.models import ModelSpec as MS
|
|||||||
from ISLP import load_data
|
from ISLP import load_data
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We also collect the new imports
|
We also collect the new imports
|
||||||
needed for this lab.
|
needed for this lab.
|
||||||
|
|
||||||
@@ -61,7 +53,7 @@ BrainCancer = load_data('BrainCancer')
|
|||||||
BrainCancer.columns
|
BrainCancer.columns
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
|
The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
|
||||||
We first briefly examine the data.
|
We first briefly examine the data.
|
||||||
|
|
||||||
@@ -69,20 +61,20 @@ We first briefly examine the data.
|
|||||||
BrainCancer['sex'].value_counts()
|
BrainCancer['sex'].value_counts()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
BrainCancer['diagnosis'].value_counts()
|
BrainCancer['diagnosis'].value_counts()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
BrainCancer['status'].value_counts()
|
BrainCancer['status'].value_counts()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Before beginning an analysis, it is important to know how the
|
Before beginning an analysis, it is important to know how the
|
||||||
`status` variable has been coded. Most software
|
`status` variable has been coded. Most software
|
||||||
uses the convention that a `status` of 1 indicates an
|
uses the convention that a `status` of 1 indicates an
|
||||||
@@ -91,7 +83,7 @@ observation. But some scientists might use the opposite coding. For
|
|||||||
the `BrainCancer` data set 35 patients died before the end of
|
the `BrainCancer` data set 35 patients died before the end of
|
||||||
the study, so we are using the conventional coding.
|
the study, so we are using the conventional coding.
|
||||||
|
|
||||||
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main
|
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure~\ref{fig:survbrain}. The main
|
||||||
package we will use for survival analysis
|
package we will use for survival analysis
|
||||||
is `lifelines`.
|
is `lifelines`.
|
||||||
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
|
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
|
||||||
@@ -109,9 +101,9 @@ km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
|
|||||||
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
|
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Next we create Kaplan-Meier survival curves that are stratified by
|
Next we create Kaplan-Meier survival curves that are stratified by
|
||||||
`sex`, in order to reproduce Figure 11.3.
|
`sex`, in order to reproduce Figure~\ref{fig:survbrain2}.
|
||||||
We do this using the `groupby()` method of a dataframe.
|
We do this using the `groupby()` method of a dataframe.
|
||||||
This method returns a grouped object that can
|
This method returns a grouped object that can
|
||||||
be iterated over in the `for` loop. In this case,
|
be iterated over in the `for` loop. In this case,
|
||||||
@@ -138,8 +130,8 @@ for sex, df in BrainCancer.groupby('sex'):
|
|||||||
km_sex.plot(label='Sex=%s' % sex, ax=ax)
|
km_sex.plot(label='Sex=%s' % sex, ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As discussed in Section 11.4, we can perform a
|
As discussed in Section~\ref{sec:logrank}, we can perform a
|
||||||
log-rank test to compare the survival of males to females. We use
|
log-rank test to compare the survival of males to females. We use
|
||||||
the `logrank_test()` function from the `lifelines.statistics` module.
|
the `logrank_test()` function from the `lifelines.statistics` module.
|
||||||
The first two arguments are the event times, with the second
|
The first two arguments are the event times, with the second
|
||||||
@@ -152,8 +144,8 @@ logrank_test(by_sex['Male']['time'],
|
|||||||
by_sex['Female']['status'])
|
by_sex['Female']['status'])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The resulting $p$-value is $0.23$, indicating no evidence of a
|
The resulting $p$-value is $0.23$, indicating no evidence of a
|
||||||
difference in survival between the two sexes.
|
difference in survival between the two sexes.
|
||||||
|
|
||||||
@@ -172,7 +164,7 @@ cox_fit = coxph().fit(model_df,
|
|||||||
cox_fit.summary[['coef', 'se(coef)', 'p']]
|
cox_fit.summary[['coef', 'se(coef)', 'p']]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The first argument to `fit` should be a data frame containing
|
The first argument to `fit` should be a data frame containing
|
||||||
at least the event time (the second argument `time` in this case),
|
at least the event time (the second argument `time` in this case),
|
||||||
as well as an optional censoring variable (the argument `status` in this case).
|
as well as an optional censoring variable (the argument `status` in this case).
|
||||||
@@ -186,7 +178,7 @@ with no features as follows:
|
|||||||
cox_fit.log_likelihood_ratio_test()
|
cox_fit.log_likelihood_ratio_test()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Regardless of which test we use, we see that there is no clear
|
Regardless of which test we use, we see that there is no clear
|
||||||
evidence for a difference in survival between males and females. As
|
evidence for a difference in survival between males and females. As
|
||||||
we learned in this chapter, the score test from the Cox model is
|
we learned in this chapter, the score test from the Cox model is
|
||||||
@@ -206,7 +198,7 @@ fit_all = coxph().fit(all_df,
|
|||||||
fit_all.summary[['coef', 'se(coef)', 'p']]
|
fit_all.summary[['coef', 'se(coef)', 'p']]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The `diagnosis` variable has been coded so that the baseline
|
The `diagnosis` variable has been coded so that the baseline
|
||||||
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
|
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
|
||||||
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
|
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
|
||||||
@@ -233,7 +225,7 @@ def representative(series):
|
|||||||
modal_data = cleaned.apply(representative, axis=0)
|
modal_data = cleaned.apply(representative, axis=0)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We make four
|
We make four
|
||||||
copies of the column means and assign the `diagnosis` column to be the four different
|
copies of the column means and assign the `diagnosis` column to be the four different
|
||||||
diagnoses.
|
diagnoses.
|
||||||
@@ -245,7 +237,7 @@ modal_df['diagnosis'] = levels
|
|||||||
modal_df
|
modal_df
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We then construct the model matrix based on the model specification `all_MS` used to fit
|
We then construct the model matrix based on the model specification `all_MS` used to fit
|
||||||
the model, and name the rows according to the levels of `diagnosis`.
|
the model, and name the rows according to the levels of `diagnosis`.
|
||||||
|
|
||||||
@@ -272,12 +264,12 @@ fig, ax = subplots(figsize=(8, 8))
|
|||||||
predicted_survival.plot(ax=ax);
|
predicted_survival.plot(ax=ax);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Publication Data
|
## Publication Data
|
||||||
The `Publication` data presented in Section 11.5.4 can be
|
The `Publication` data presented in Section~\ref{sec:pub} can be
|
||||||
found in the `ISLP` package.
|
found in the `ISLP` package.
|
||||||
We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves
|
We first reproduce Figure~\ref{fig:lauersurv} by plotting the Kaplan-Meier curves
|
||||||
stratified on the `posres` variable, which records whether the
|
stratified on the `posres` variable, which records whether the
|
||||||
study had a positive or negative result.
|
study had a positive or negative result.
|
||||||
|
|
||||||
@@ -291,7 +283,7 @@ for result, df in Publication.groupby('posres'):
|
|||||||
km_result.plot(label='Result=%d' % result, ax=ax)
|
km_result.plot(label='Result=%d' % result, ax=ax)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As discussed previously, the $p$-values from fitting Cox’s
|
As discussed previously, the $p$-values from fitting Cox’s
|
||||||
proportional hazards model to the `posres` variable are quite
|
proportional hazards model to the `posres` variable are quite
|
||||||
large, providing no evidence of a difference in time-to-publication
|
large, providing no evidence of a difference in time-to-publication
|
||||||
@@ -308,8 +300,8 @@ posres_fit = coxph().fit(posres_df,
|
|||||||
posres_fit.summary[['coef', 'se(coef)', 'p']]
|
posres_fit.summary[['coef', 'se(coef)', 'p']]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
However, the results change dramatically when we include other
|
However, the results change dramatically when we include other
|
||||||
predictors in the model. Here we exclude the funding mechanism
|
predictors in the model. Here we exclude the funding mechanism
|
||||||
variable.
|
variable.
|
||||||
@@ -322,7 +314,7 @@ coxph().fit(model.fit_transform(Publication),
|
|||||||
'status').summary[['coef', 'se(coef)', 'p']]
|
'status').summary[['coef', 'se(coef)', 'p']]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that there are a number of statistically significant variables,
|
We see that there are a number of statistically significant variables,
|
||||||
including whether the trial focused on a clinical endpoint, the impact
|
including whether the trial focused on a clinical endpoint, the impact
|
||||||
of the study, and whether the study had positive or negative results.
|
of the study, and whether the study had positive or negative results.
|
||||||
@@ -332,7 +324,7 @@ of the study, and whether the study had positive or negative results.
|
|||||||
|
|
||||||
In this section, we will simulate survival data using the relationship
|
In this section, we will simulate survival data using the relationship
|
||||||
between cumulative hazard and
|
between cumulative hazard and
|
||||||
the survival function explored in Exercise 8.
|
the survival function explored in Exercise \ref{ex:all3}.
|
||||||
Our simulated data will represent the observed
|
Our simulated data will represent the observed
|
||||||
wait times (in seconds) for 2,000 customers who have phoned a call
|
wait times (in seconds) for 2,000 customers who have phoned a call
|
||||||
center. In this context, censoring occurs if a customer hangs up
|
center. In this context, censoring occurs if a customer hangs up
|
||||||
@@ -372,7 +364,7 @@ model = MS(['Operators',
|
|||||||
intercept=False)
|
intercept=False)
|
||||||
X = model.fit_transform(D)
|
X = model.fit_transform(D)
|
||||||
```
|
```
|
||||||
|
|
||||||
It is worthwhile to take a peek at the model matrix `X`, so
|
It is worthwhile to take a peek at the model matrix `X`, so
|
||||||
that we can be sure that we understand how the variables have been coded. By default,
|
that we can be sure that we understand how the variables have been coded. By default,
|
||||||
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
|
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
|
||||||
@@ -382,7 +374,7 @@ of the variable is dropped.
|
|||||||
X[:5]
|
X[:5]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Next, we specify the coefficients and the hazard function.
|
Next, we specify the coefficients and the hazard function.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -403,7 +395,7 @@ that the risk of a call being answered at Center B is 0.74 times the
|
|||||||
risk that it will be answered at Center A; in other words, the wait
|
risk that it will be answered at Center A; in other words, the wait
|
||||||
times are a bit longer at Center B.
|
times are a bit longer at Center B.
|
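The 0.74 figure is a hazard ratio: in a Cox model, exponentiating a coefficient gives the multiplicative change in risk relative to the baseline level. A minimal sketch, assuming a Center B coefficient of -0.3 (an illustrative value, chosen because it reproduces the quoted ratio; the coefficient itself is not shown in this excerpt):

```python
import math

# Assumed coefficient for Center B relative to the baseline Center A.
# Illustrative value only; it is not taken from the lines shown here.
coef_center_b = -0.3

# In a Cox proportional hazards model the hazard ratio is exp(coefficient).
hazard_ratio = math.exp(coef_center_b)
print(round(hazard_ratio, 2))  # 0.74
```

A ratio below 1 means calls at Center B are answered at a lower instantaneous rate, which is why the wait times come out somewhat longer.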
||||||
|
|
||||||
Recall from Section 2.3.7 the use of `lambda`
|
Recall from Section~\ref{Ch2-statlearn-lab:loading-data} the use of `lambda`
|
||||||
for creating short functions on the fly.
|
for creating short functions on the fly.
|
||||||
We use the function
|
We use the function
|
||||||
`sim_time()` from the `ISLP.survival` package. This function
|
`sim_time()` from the `ISLP.survival` package. This function
|
||||||
@@ -431,7 +423,7 @@ W = np.array([sim_time(l, cum_hazard, rng)
|
|||||||
D['Wait time'] = np.clip(W, 0, 1000)
|
D['Wait time'] = np.clip(W, 0, 1000)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now simulate our censoring variable, for which we assume
|
We now simulate our censoring variable, for which we assume
|
||||||
90% of calls were answered (`Failed==1`) before the
|
90% of calls were answered (`Failed==1`) before the
|
||||||
customer hung up (`Failed==0`).
|
customer hung up (`Failed==0`).
|
||||||
@@ -443,13 +435,13 @@ D['Failed'] = rng.choice([1, 0],
|
|||||||
D[:5]
|
D[:5]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
D['Failed'].mean()
|
D['Failed'].mean()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.
|
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -462,7 +454,7 @@ for center, df in D.groupby('Center'):
|
|||||||
ax.set_title("Probability of Still Being on Hold")
|
ax.set_title("Probability of Still Being on Hold")
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Next, we stratify by `Time`.
|
Next, we stratify by `Time`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -475,7 +467,7 @@ for time, df in D.groupby('Time'):
|
|||||||
ax.set_title("Probability of Still Being on Hold")
|
ax.set_title("Probability of Still Being on Hold")
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
It seems that calls at Call Center B take longer to be answered than
|
It seems that calls at Call Center B take longer to be answered than
|
||||||
calls at Centers A and C. Similarly, it appears that wait times are
|
calls at Centers A and C. Similarly, it appears that wait times are
|
||||||
longest in the morning and shortest in the evening hours. We can use a
|
longest in the morning and shortest in the evening hours. We can use a
|
||||||
@@ -488,8 +480,8 @@ multivariate_logrank_test(D['Wait time'],
|
|||||||
D['Failed'])
|
D['Failed'])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Next, we consider the effect of `Time`.
|
Next, we consider the effect of `Time`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -498,8 +490,8 @@ multivariate_logrank_test(D['Wait time'],
|
|||||||
D['Failed'])
|
D['Failed'])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
As in the case of a categorical variable with 2 levels, these
|
As in the case of a categorical variable with 2 levels, these
|
||||||
results are similar to the likelihood ratio test
|
results are similar to the likelihood ratio test
|
||||||
from the Cox proportional hazards model. First, we
|
from the Cox proportional hazards model. First, we
|
||||||
@@ -514,8 +506,8 @@ F = coxph().fit(X, 'Wait time', 'Failed')
|
|||||||
F.log_likelihood_ratio_test()
|
F.log_likelihood_ratio_test()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Next, we look at the results for `Time`.
|
Next, we look at the results for `Time`.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -527,8 +519,8 @@ F = coxph().fit(X, 'Wait time', 'Failed')
|
|||||||
F.log_likelihood_ratio_test()
|
F.log_likelihood_ratio_test()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
We find that differences between centers are highly significant, as
|
We find that differences between centers are highly significant, as
|
||||||
are differences between times of day.
|
are differences between times of day.
|
||||||
|
|
||||||
@@ -544,8 +536,8 @@ fit_queuing = coxph().fit(
|
|||||||
fit_queuing.summary[['coef', 'se(coef)', 'p']]
|
fit_queuing.summary[['coef', 'se(coef)', 'p']]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The $p$-values for Center B and evening time
|
The $p$-values for Center B and evening time
|
||||||
are very small. It is also clear that the
|
are very small. It is also clear that the
|
||||||
hazard --- that is, the instantaneous risk that a call will be
|
hazard --- that is, the instantaneous risk that a call will be
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,20 +1,12 @@
|
|||||||
---
|
# Unsupervised Learning
|
||||||
jupyter:
|
|
||||||
jupytext:
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch12-unsup-lab.ipynb">
|
||||||
cell_metadata_filter: -all
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
formats: Rmd,ipynb
|
</a>
|
||||||
main_language: python
|
|
||||||
text_representation:
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch12-unsup-lab.ipynb)
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 12
|
|
||||||
|
|
||||||
# Lab: Unsupervised Learning
|
|
||||||
In this lab we demonstrate PCA and clustering on several datasets.
|
In this lab we demonstrate PCA and clustering on several datasets.
|
||||||
As in other labs, we import some of our libraries at this top
|
As in other labs, we import some of our libraries at this top
|
||||||
level. This makes the code more readable, as scanning the first few
|
level. This makes the code more readable, as scanning the first few
|
||||||
@@ -44,7 +36,7 @@ from scipy.cluster.hierarchy import \
|
|||||||
from ISLP.cluster import compute_linkage
|
from ISLP.cluster import compute_linkage
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Principal Components Analysis
|
## Principal Components Analysis
|
||||||
In this lab, we perform PCA on `USArrests`, a data set in the
|
In this lab, we perform PCA on `USArrests`, a data set in the
|
||||||
`R` computing environment.
|
`R` computing environment.
|
||||||
@@ -58,22 +50,22 @@ USArrests = get_rdataset('USArrests').data
|
|||||||
USArrests
|
USArrests
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The columns of the data set contain the four variables.
|
The columns of the data set contain the four variables.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
USArrests.columns
|
USArrests.columns
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We first briefly examine the data. We notice that the variables have vastly different means.
|
We first briefly examine the data. We notice that the variables have vastly different means.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
USArrests.mean()
|
USArrests.mean()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Dataframes have several useful methods for computing
|
Dataframes have several useful methods for computing
|
||||||
column-wise summaries. We can also examine the
|
column-wise summaries. We can also examine the
|
||||||
variance of the four variables using the `var()` method.
|
variance of the four variables using the `var()` method.
|
||||||
@@ -82,7 +74,7 @@ variance of the four variables using the `var()` method.
|
|||||||
USArrests.var()
|
USArrests.var()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Not surprisingly, the variables also have vastly different variances.
|
Not surprisingly, the variables also have vastly different variances.
|
||||||
The `UrbanPop` variable measures the percentage of the population
|
The `UrbanPop` variable measures the percentage of the population
|
||||||
in each state living in an urban area, which is not a comparable
|
in each state living in an urban area, which is not a comparable
|
||||||
@@ -132,7 +124,7 @@ of the variables. In this case, since we centered and scaled the data with
|
|||||||
pcaUS.mean_
|
pcaUS.mean_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The scores can be computed using the `transform()` method
|
The scores can be computed using the `transform()` method
|
||||||
of `pcaUS` after it has been fit.
|
of `pcaUS` after it has been fit.
|
||||||
|
|
||||||
@@ -150,7 +142,7 @@ principal component loading vector.
|
|||||||
pcaUS.components_
|
pcaUS.components_
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The `biplot` is a common visualization method used with
|
The `biplot` is a common visualization method used with
|
||||||
PCA. It is not built in as a standard
|
PCA. It is not built in as a standard
|
||||||
part of `sklearn`, though there are python
|
part of `sklearn`, though there are python
|
||||||
@@ -170,7 +162,7 @@ for k in range(pcaUS.components_.shape[1]):
|
|||||||
USArrests.columns[k])
|
USArrests.columns[k])
|
||||||
|
|
||||||
```
|
```
|
||||||
Notice that this figure is a reflection of Figure 12.1 through the $y$-axis. Recall that the
|
Notice that this figure is a reflection of Figure~\ref{Ch10:fig:USArrests:obs} through the $y$-axis. Recall that the
|
||||||
principal components are only unique up to a sign change, so we can
|
principal components are only unique up to a sign change, so we can
|
||||||
reproduce that figure by flipping the
|
reproduce that figure by flipping the
|
||||||
signs of the second set of scores and loadings.
|
signs of the second set of scores and loadings.
|
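The sign ambiguity can be checked directly: negating a loading vector together with its scores leaves that component's contribution (score times loading) unchanged. A small sketch in plain Python, using made-up numbers rather than the fitted `USArrests` values:

```python
# Hypothetical scores (one per observation) and loadings (one per variable)
# for a single principal component; the numbers are made up for illustration.
scores = [1.5, -0.5, 2.0]
loading = [0.6, -0.8]

def rank_one(s, v):
    """Outer product of scores and loadings: this component's contribution."""
    return [[si * vj for vj in v] for si in s]

flipped_scores = [-si for si in scores]
flipped_loading = [-vj for vj in loading]

# Flipping both signs gives exactly the same contribution to the data.
same = rank_one(scores, loading) == rank_one(flipped_scores, flipped_loading)
print(same)  # True
```

Because the fit to the data is identical either way, software is free to return either sign, and a plot can be mirrored by flipping one component's scores and loadings together.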
||||||
@@ -191,14 +183,14 @@ for k in range(pcaUS.components_.shape[1]):
|
|||||||
USArrests.columns[k])
|
USArrests.columns[k])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The standard deviations of the principal component scores are as follows:
|
The standard deviations of the principal component scores are as follows:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
scores.std(0, ddof=1)
|
scores.std(0, ddof=1)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The variance of each score can be extracted directly from the `pcaUS` object via
|
The variance of each score can be extracted directly from the `pcaUS` object via
|
||||||
the `explained_variance_` attribute.
|
the `explained_variance_` attribute.
|
||||||
|
|
||||||
@@ -220,7 +212,7 @@ We can plot the PVE explained by each component, as well as the cumulative PVE.
|
|||||||
plot the proportion of variance explained.
|
plot the proportion of variance explained.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
# %%capture
|
%%capture
|
||||||
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
|
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
|
||||||
ticks = np.arange(pcaUS.n_components_)+1
|
ticks = np.arange(pcaUS.n_components_)+1
|
||||||
ax = axes[0]
|
ax = axes[0]
|
||||||
@@ -247,7 +239,7 @@ ax.set_xticks(ticks)
|
|||||||
fig
|
fig
|
||||||
|
|
||||||
```
|
```
|
||||||
The result is similar to that shown in Figure 12.3. Note
|
The result is similar to that shown in Figure~\ref{Ch10:fig:USArrests:scree}. Note
|
||||||
that the method `cumsum()` computes the cumulative sum of
|
that the method `cumsum()` computes the cumulative sum of
|
||||||
the elements of a numeric vector. For instance:
|
the elements of a numeric vector. For instance:
|
||||||
|
|
||||||
@@ -259,15 +251,15 @@ np.cumsum(a)
|
|||||||
## Matrix Completion
|
## Matrix Completion
|
||||||
|
|
||||||
We now re-create the analysis carried out on the `USArrests` data in
|
We now re-create the analysis carried out on the `USArrests` data in
|
||||||
Section 12.3.
|
Section~\ref{Ch10:sec:princ-comp-with}.
|
||||||
|
|
||||||
We saw in Section 12.2.2 that solving the optimization
|
We saw in Section~\ref{ch10:sec2.2} that solving the optimization
|
||||||
problem (12.6) on a centered data matrix $\bf X$ is
|
problem~(\ref{Ch10:eq:mc2}) on a centered data matrix $\bf X$ is
|
||||||
equivalent to computing the first $M$ principal
|
equivalent to computing the first $M$ principal
|
||||||
components of the data. We use our scaled
|
components of the data. We use our scaled
|
||||||
and centered `USArrests` data as $\bf X$ below. The *singular value decomposition*
|
and centered `USArrests` data as $\bf X$ below. The *singular value decomposition*
|
||||||
(SVD) is a general algorithm for solving
|
(SVD) is a general algorithm for solving
|
||||||
(12.6).
|
(\ref{Ch10:eq:mc2}).
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
X = USArrests_scaled
|
X = USArrests_scaled
|
||||||
@@ -320,14 +312,14 @@ Xna = X.copy()
|
|||||||
Xna[r_idx, c_idx] = np.nan
|
Xna[r_idx, c_idx] = np.nan
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Here the array `r_idx`
|
Here the array `r_idx`
|
||||||
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
|
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
|
||||||
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
|
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
|
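The paired-index assignment described here can be sketched on a small toy matrix. This is an illustration under assumptions: the names `A`, `rows`, and `cols` are hypothetical, and the row indices are drawn without replacement so that each selected row receives exactly one missing value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.arange(20, dtype=float).reshape(5, 4)

# One missing entry in each of 3 distinct rows; columns may repeat
rows = rng.choice(5, size=3, replace=False)
cols = rng.choice(4, size=3, replace=True)

A_na = A.copy()                 # copy so that A itself is left untouched
A_na[rows, cols] = np.nan       # sets the entries (rows[i], cols[i]) pairwise
n_missing = int(np.isnan(A_na).sum())
print(rows, cols, n_missing)
```

Indexing with two integer arrays of equal length selects element pairs rather than a rectangular block, which is why exactly three entries become `nan`.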
||||||
|
|
||||||
We now write some code to implement Algorithm 12.1.
|
We now write some code to implement Algorithm~\ref{Ch10:alg:hardimpute}.
|
||||||
We first write a function that takes in a matrix, and returns an approximation to the matrix using the `svd()` function.
|
We first write a function that takes in a matrix, and returns an approximation to the matrix using the `svd()` function.
|
||||||
This will be needed in Step 2 of Algorithm 12.1.
|
This will be needed in Step 2 of Algorithm~\ref{Ch10:alg:hardimpute}.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
def low_rank(X, M=1):
|
def low_rank(X, M=1):
|
||||||
@@ -336,7 +328,7 @@ def low_rank(X, M=1):
|
|||||||
return L.dot(V[:M])
|
return L.dot(V[:M])
|
||||||
|
|
||||||
```
|
```
|
||||||
To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm 12.1 --- by replacing
|
To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm~\ref{Ch10:alg:hardimpute} --- by replacing
|
||||||
the missing values with the column means of the non-missing entries. These are stored in
|
the missing values with the column means of the non-missing entries. These are stored in
|
||||||
`Xbar` below after running `np.nanmean()` over the row axis.
|
`Xbar` below after running `np.nanmean()` over the row axis.
|
||||||
We make a copy so that when we assign values to `Xhat` below we do not also overwrite the
|
We make a copy so that when we assign values to `Xhat` below we do not also overwrite the
|
||||||
@@ -348,7 +340,7 @@ Xbar = np.nanmean(Xhat, axis=0)
|
|||||||
Xhat[r_idx, c_idx] = Xbar[c_idx]
|
Xhat[r_idx, c_idx] = Xbar[c_idx]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Before we begin Step 2, we set ourselves up to measure the progress of our
|
Before we begin Step 2, we set ourselves up to measure the progress of our
|
||||||
iterations:
|
iterations:
|
||||||
|
|
||||||
@@ -366,11 +358,11 @@ a given element is `True` if the corresponding matrix element is missing. The no
|
|||||||
because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in `mss0`.
|
because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in `mss0`.
|
||||||
We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold` (which currently
|
We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold` (which currently
|
||||||
agrees with `mss0`). We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then
|
agrees with `mss0`). We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then
|
||||||
iterate Step 2 of Algorithm 12.1 until the *relative error*, defined as
|
iterate Step 2 of Algorithm~\ref{Ch10:alg:hardimpute} until the *relative error*, defined as
|
||||||
`(mssold - mss) / mss0`, falls below `thresh = 1e-7`.
|
`(mssold - mss) / mss0`, falls below `thresh = 1e-7`.
|
||||||
{Algorithm 12.1 tells us to iterate Step 2 until (12.14) is no longer decreasing. Determining whether (12.14) is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm 12.1 to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.}
|
{Algorithm~\ref{Ch10:alg:hardimpute} tells us to iterate Step 2 until \eqref{Ch10:eq:mc6} is no longer decreasing. Determining whether \eqref{Ch10:eq:mc6} is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm~\ref{Ch10:alg:hardimpute} to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.}
|
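The scale-invariance claim in the note above can be checked with a short sketch. It is a simplified illustration, assuming a fully observed toy matrix rather than the lab's `Xna`; the helper `relative_improvement()` and the arrays `old` and `new` are hypothetical names.

```python
import numpy as np

def relative_improvement(X, Xapp_old, Xapp_new):
    """Return (mssold - mss) / mss0 for a fully observed X (toy version)."""
    mss0 = np.mean(X**2)
    mssold = np.mean((X - Xapp_old)**2)
    mss = np.mean((X - Xapp_new)**2)
    return (mssold - mss) / mss0

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((10, 4))
old = X_toy + 0.5 * rng.standard_normal(X_toy.shape)   # a worse approximation
new = X_toy + 0.1 * rng.standard_normal(X_toy.shape)   # a better approximation

r_raw = relative_improvement(X_toy, old, new)
r_scaled = relative_improvement(100 * X_toy, 100 * old, 100 * new)
print(r_raw, r_scaled)
```

Multiplying everything by 100 multiplies `mssold - mss` and `mss0` by the same factor of $100^2$, so the ratio, and hence the stopping rule, is unchanged.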
||||||
|
|
||||||
In Step 2(a) of Algorithm 12.1, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
|
In Step 2(a) of Algorithm~\ref{Ch10:alg:hardimpute}, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
while rel_err > thresh:
|
while rel_err > thresh:
|
||||||
@@ -387,7 +379,7 @@ while rel_err > thresh:
|
|||||||
.format(count, mss, rel_err))
|
.format(count, mss, rel_err))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
|
We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
|
||||||
|
|
||||||
Finally, we compute the correlation between the 20 imputed values
|
Finally, we compute the correlation between the 20 imputed values
|
||||||
@@ -397,9 +389,9 @@ and the actual values:
|
|||||||
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
|
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python` implementations.
|
In this lab, we implemented Algorithm~\ref{Ch10:alg:hardimpute} ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python`{} implementations.
|
||||||
|
|
||||||
|
|
||||||
## Clustering
|
## Clustering
|
||||||
@@ -444,7 +436,7 @@ ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
|
|||||||
ax.set_title("K-Means Clustering Results with K=2");
|
ax.set_title("K-Means Clustering Results with K=2");
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Here the observations can be easily plotted because they are
|
Here the observations can be easily plotted because they are
|
||||||
two-dimensional. If there were more than two variables then we could
|
two-dimensional. If there were more than two variables then we could
|
||||||
instead perform PCA and plot the first two principal component score
|
instead perform PCA and plot the first two principal component score
|
||||||
@@ -470,7 +462,7 @@ We have used the `n_init` argument to run the $K$-means with 20
|
|||||||
initial cluster assignments (the default is 10). If a
|
initial cluster assignments (the default is 10). If a
|
||||||
value of `n_init` greater than one is used, then $K$-means
|
value of `n_init` greater than one is used, then $K$-means
|
||||||
clustering will be performed using multiple random assignments in
|
clustering will be performed using multiple random assignments in
|
||||||
Step 1 of Algorithm 12.2, and the `KMeans()`
|
Step 1 of Algorithm~\ref{Ch10:alg:km}, and the `KMeans()`
|
||||||
function will report only the best results. Here we compare using
|
function will report only the best results. Here we compare using
|
||||||
`n_init=1` to `n_init=20`.
|
`n_init=1` to `n_init=20`.
|
||||||
|
|
||||||
@@ -486,7 +478,7 @@ kmeans1.inertia_, kmeans20.inertia_
|
|||||||
```
|
```
|
||||||
Note that `kmeans.inertia_` is the total within-cluster sum
|
Note that `kmeans.inertia_` is the total within-cluster sum
|
||||||
of squares, which we seek to minimize by performing $K$-means
|
of squares, which we seek to minimize by performing $K$-means
|
||||||
clustering (12.17).
|
clustering \eqref{Ch10:eq:kmeans}.
|
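To make this quantity concrete, the sketch below computes a total within-cluster sum of squares by hand for a toy assignment, using only NumPy. It assumes, following the scikit-learn documentation, that `inertia_` is the sum of squared distances from each observation to its assigned cluster center; the helper `within_ss()` and the four points are hypothetical.

```python
import numpy as np

def within_ss(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid)**2)
    return float(total)

pts = np.array([[0., 0.], [0., 2.], [10., 0.], [10., 4.]])
good = within_ss(pts, np.array([0, 0, 1, 1]))   # groups nearby points
bad = within_ss(pts, np.array([0, 1, 0, 1]))    # groups distant points
print(good, bad)
```

The assignment that groups nearby points has the much smaller value, which is the sense in which $K$-means prefers it.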
||||||
|
|
||||||
We *strongly* recommend always running $K$-means clustering with
|
We *strongly* recommend always running $K$-means clustering with
|
||||||
a large value of `n_init`, such as 20 or 50, since otherwise an
|
a large value of `n_init`, such as 20 or 50, since otherwise an
|
||||||
@@ -519,7 +511,7 @@ hc_comp = HClust(distance_threshold=0,
|
|||||||
hc_comp.fit(X)
|
hc_comp.fit(X)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This computes the entire dendrogram.
|
This computes the entire dendrogram.
|
||||||
We could just as easily perform hierarchical clustering with average or single linkage instead:
|
We could just as easily perform hierarchical clustering with average or single linkage instead:
|
||||||
|
|
||||||
@@ -534,7 +526,7 @@ hc_sing = HClust(distance_threshold=0,
|
|||||||
hc_sing.fit(X);
|
hc_sing.fit(X);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To use a precomputed distance matrix, we provide an additional
|
To use a precomputed distance matrix, we provide an additional
|
||||||
argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.
|
argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.
|
||||||
|
|
||||||
@@ -550,7 +542,7 @@ hc_sing_pre = HClust(distance_threshold=0,
|
|||||||
hc_sing_pre.fit(D)
|
hc_sing_pre.fit(D)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We use
|
We use
|
||||||
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
|
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
|
||||||
`dendrogram()` expects a so-called *linkage-matrix representation*
|
`dendrogram()` expects a so-called *linkage-matrix representation*
|
||||||
@@ -573,7 +565,7 @@ dendrogram(linkage_comp,
|
|||||||
**cargs);
|
**cargs);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We may want to color branches of the tree above
|
We may want to color branches of the tree above
|
||||||
and below a cut-threshold differently. This can be achieved
|
and below a cut-threshold differently. This can be achieved
|
||||||
by changing the `color_threshold`. Let’s cut the tree at a height of 4,
|
by changing the `color_threshold`. Let’s cut the tree at a height of 4,
|
||||||
@@ -587,7 +579,7 @@ dendrogram(linkage_comp,
|
|||||||
above_threshold_color='black');
|
above_threshold_color='black');
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To determine the cluster labels for each observation associated with a
|
To determine the cluster labels for each observation associated with a
|
||||||
given cut of the dendrogram, we can use the `cut_tree()`
|
given cut of the dendrogram, we can use the `cut_tree()`
|
||||||
function from `scipy.cluster.hierarchy`:
|
function from `scipy.cluster.hierarchy`:
|
||||||
@@ -607,7 +599,7 @@ or `height` to `cut_tree()`.
|
|||||||
cut_tree(linkage_comp, height=5)
|
cut_tree(linkage_comp, height=5)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To scale the variables before performing hierarchical clustering of
|
To scale the variables before performing hierarchical clustering of
|
||||||
the observations, we use `StandardScaler()` as in our PCA example:
|
the observations, we use `StandardScaler()` as in our PCA example:
|
||||||
|
|
||||||
@@ -651,7 +643,7 @@ dendrogram(linkage_cor, ax=ax, **cargs)
|
|||||||
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
|
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## NCI60 Data Example
|
## NCI60 Data Example
|
||||||
Unsupervised techniques are often used in the analysis of genomic
|
Unsupervised techniques are often used in the analysis of genomic
|
||||||
@@ -666,7 +658,7 @@ nci_labs = NCI60['labels']
|
|||||||
nci_data = NCI60['data']
|
nci_data = NCI60['data']
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Each cell line is labeled with a cancer type. We do not make use of
|
Each cell line is labeled with a cancer type. We do not make use of
|
||||||
the cancer types in performing PCA and clustering, as these are
|
the cancer types in performing PCA and clustering, as these are
|
||||||
unsupervised techniques. But after performing PCA and clustering, we
|
unsupervised techniques. But after performing PCA and clustering, we
|
||||||
@@ -679,8 +671,8 @@ The data has 64 rows and 6830 columns.
|
|||||||
nci_data.shape
|
nci_data.shape
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
We begin by examining the cancer types for the cell lines.
|
We begin by examining the cancer types for the cell lines.
|
||||||
|
|
||||||
|
|
||||||
@@ -688,7 +680,7 @@ We begin by examining the cancer types for the cell lines.
|
|||||||
nci_labs.value_counts()
|
nci_labs.value_counts()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### PCA on the NCI60 Data
|
### PCA on the NCI60 Data
|
||||||
|
|
||||||
@@ -703,7 +695,7 @@ nci_pca = PCA()
|
|||||||
nci_scores = nci_pca.fit_transform(nci_scaled)
|
nci_scores = nci_pca.fit_transform(nci_scaled)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We now plot the first few principal component score vectors, in order
|
We now plot the first few principal component score vectors, in order
|
||||||
to visualize the data. The observations (cell lines) corresponding to
|
to visualize the data. The observations (cell lines) corresponding to
|
||||||
a given cancer type will be plotted in the same color, so that we can
|
a given cancer type will be plotted in the same color, so that we can
|
||||||
@@ -739,7 +731,7 @@ to have pretty similar gene expression levels.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
We can also plot the percent variance
|
We can also plot the percent variance
|
||||||
explained by the principal components as well as the cumulative percent variance explained.
|
explained by the principal components as well as the cumulative percent variance explained.
|
||||||
This is similar to the plots we made earlier for the `USArrests` data.
|
This is similar to the plots we made earlier for the `USArrests` data.
|
||||||
@@ -798,7 +790,7 @@ def plot_nci(linkage, ax, cut=-np.inf):
|
|||||||
return hc
|
return hc
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Let’s plot our results.
|
Let’s plot our results.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -830,7 +822,7 @@ pd.crosstab(nci_labs['label'],
|
|||||||
pd.Series(comp_cut.reshape(-1), name='Complete'))
|
pd.Series(comp_cut.reshape(-1), name='Complete'))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
There are some clear patterns. All the leukemia cell lines fall in
|
There are some clear patterns. All the leukemia cell lines fall in
|
||||||
one cluster, while the breast cancer cell lines are spread out over
|
one cluster, while the breast cancer cell lines are spread out over
|
||||||
@@ -844,7 +836,7 @@ plot_nci('Complete', ax, cut=140)
|
|||||||
ax.axhline(140, c='r', linewidth=4);
|
ax.axhline(140, c='r', linewidth=4);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The `axhline()` function draws a horizontal line on top of any
|
The `axhline()` function draws a horizontal line on top of any
|
||||||
existing set of axes. The argument `140` plots a horizontal
|
existing set of axes. The argument `140` plots a horizontal
|
||||||
line at height 140 on the dendrogram; this is a height that
|
line at height 140 on the dendrogram; this is a height that
|
||||||
@@ -852,7 +844,7 @@ results in four distinct clusters. It is easy to verify that the
|
|||||||
resulting clusters are the same as the ones we obtained in
|
resulting clusters are the same as the ones we obtained in
|
||||||
`comp_cut`.
|
`comp_cut`.
|
||||||
|
|
||||||
We claimed earlier in Section 12.4.2 that
|
We claimed earlier in Section~\ref{Ch10:subsec:hc} that
|
||||||
$K$-means clustering and hierarchical clustering with the dendrogram
|
$K$-means clustering and hierarchical clustering with the dendrogram
|
||||||
cut to obtain the same number of clusters can yield very different
|
cut to obtain the same number of clusters can yield very different
|
||||||
results. How do these `NCI60` hierarchical clustering results compare
|
results. How do these `NCI60` hierarchical clustering results compare
|
||||||
@@ -866,7 +858,7 @@ pd.crosstab(pd.Series(comp_cut, name='HClust'),
|
|||||||
pd.Series(nci_kmeans.labels_, name='K-means'))
|
pd.Series(nci_kmeans.labels_, name='K-means'))
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We see that the four clusters obtained using hierarchical clustering
|
We see that the four clusters obtained using hierarchical clustering
|
||||||
and $K$-means clustering are somewhat different. First we note
|
and $K$-means clustering are somewhat different. First we note
|
||||||
that the labels in the two clusterings are arbitrary. That is, swapping
|
that the labels in the two clusterings are arbitrary. That is, swapping
|
||||||
|
|||||||
1978
Ch12-unsup-lab.ipynb
1978
Ch12-unsup-lab.ipynb
File diff suppressed because one or more lines are too long
@@ -1,22 +1,14 @@
|
|||||||
---
|
# Multiple Testing
|
||||||
jupyter:
|
|
||||||
jupytext:
|
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch13-multiple-lab.ipynb">
|
||||||
cell_metadata_filter: -all
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||||
formats: Rmd,ipynb
|
</a>
|
||||||
main_language: python
|
|
||||||
text_representation:
|
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch13-multiple-lab.ipynb)
|
||||||
extension: .Rmd
|
|
||||||
format_name: rmarkdown
|
|
||||||
format_version: '1.2'
|
|
||||||
jupytext_version: 1.14.7
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
||||||
# Chapter 13
|
|
||||||
|
|
||||||
# Lab: Multiple Testing
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
We include our usual imports seen in earlier labs.
|
We include our usual imports seen in earlier labs.
|
||||||
|
|
||||||
@@ -28,7 +20,7 @@ import statsmodels.api as sm
|
|||||||
from ISLP import load_data
|
from ISLP import load_data
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
We also collect the new imports
|
We also collect the new imports
|
||||||
needed for this lab.
|
needed for this lab.
|
||||||
|
|
||||||
@@ -60,7 +52,7 @@ true_mean = np.array([0.5]*50 + [0]*50)
|
|||||||
X += true_mean[None,:]
|
X += true_mean[None,:]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To begin, we use `ttest_1samp()` from the
|
To begin, we use `ttest_1samp()` from the
|
||||||
`scipy.stats` module to test $H_{0}: \mu_1=0$, the null
|
`scipy.stats` module to test $H_{0}: \mu_1=0$, the null
|
||||||
hypothesis that the first variable has mean zero.
|
hypothesis that the first variable has mean zero.
|
||||||
@@ -70,7 +62,7 @@ result = ttest_1samp(X[:,0], 0)
|
|||||||
result.pvalue
|
result.pvalue
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The $p$-value comes out to 0.931, which is not low enough to
|
The $p$-value comes out to 0.931, which is not low enough to
|
||||||
reject the null hypothesis at level $\alpha=0.05$. In this case,
|
reject the null hypothesis at level $\alpha=0.05$. In this case,
|
||||||
$\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made
|
$\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made
|
||||||
@@ -97,7 +89,7 @@ truth = pd.Categorical(true_mean == 0,
|
|||||||
|
|
||||||
```
|
```
|
||||||
Since this is a simulated data set, we can create a $2 \times 2$ table
|
Since this is a simulated data set, we can create a $2 \times 2$ table
|
||||||
similar to Table 13.2.
|
similar to Table~\ref{Ch12:tab-hypotheses}.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
pd.crosstab(decision,
|
pd.crosstab(decision,
|
||||||
@@ -108,7 +100,7 @@ pd.crosstab(decision,
|
|||||||
```
|
```
|
||||||
Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false
|
Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false
|
||||||
null hypotheses, and we incorrectly reject 5 of the true null
|
null hypotheses, and we incorrectly reject 5 of the true null
|
||||||
hypotheses. Using the notation from Section 13.3, we have
|
hypotheses. Using the notation from Section~\ref{sec:fwer}, we have
|
||||||
$V=5$, $S=15$, $U=45$ and $W=35$.
|
$V=5$, $S=15$, $U=45$ and $W=35$.
|
||||||
We have set $\alpha=0.05$, which means that we expect to reject around
|
We have set $\alpha=0.05$, which means that we expect to reject around
|
||||||
5% of the true null hypotheses. This is in line with the $2 \times 2$
|
5% of the true null hypotheses. This is in line with the $2 \times 2$
|
||||||
@@ -146,12 +138,12 @@ pd.crosstab(decision,
|
|||||||
|
|
||||||
|
|
||||||
## Family-Wise Error Rate
|
## Family-Wise Error Rate
|
||||||
Recall from (13.5) that if the null hypothesis is true
|
Recall from \eqref{eq:FWER.indep} that if the null hypothesis is true
|
||||||
for each of $m$ independent hypothesis tests, then the FWER is equal
|
for each of $m$ independent hypothesis tests, then the FWER is equal
|
||||||
to $1-(1-\alpha)^m$. We can use this expression to compute the FWER
|
to $1-(1-\alpha)^m$. We can use this expression to compute the FWER
|
||||||
for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.
|
for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.
|
||||||
We plot the FWER for these values of $\alpha$ in order to
|
We plot the FWER for these values of $\alpha$ in order to
|
||||||
reproduce Figure 13.2.
|
reproduce Figure~\ref{Ch12:fwer}.
|
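Before plotting, it can help to evaluate the expression at a single value of $m$. The following is a small sketch, assuming independent tests with every null hypothesis true; `fwer()` is a hypothetical helper name.

```python
import numpy as np

def fwer(alpha, m):
    """FWER for m independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha)**m

alphas = np.array([0.05, 0.01, 0.001])
fwer_at_50 = fwer(alphas, 50)
for a, f in zip(alphas, fwer_at_50):
    print(f"alpha={a}: FWER at m=50 is {f:.3f}")
```

At $m=50$ the values are roughly $0.92$, $0.39$, and $0.049$, so of these three choices only $\alpha=0.001$ keeps the FWER below $0.05$.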
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
m = np.linspace(1, 501)
|
m = np.linspace(1, 501)
|
||||||
@@ -167,7 +159,7 @@ ax.legend()
|
|||||||
ax.axhline(0.05, c='k', ls='--');
|
ax.axhline(0.05, c='k', ls='--');
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As discussed previously, even for moderate values of $m$ such as $50$,
|
As discussed previously, even for moderate values of $m$ such as $50$,
|
||||||
the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value,
|
the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value,
|
||||||
such as $0.001$. Of course, the problem with setting $\alpha$ to such
|
such as $0.001$. Of course, the problem with setting $\alpha$ to such
|
||||||
@@ -189,7 +181,7 @@ for i in range(5):
|
|||||||
fund_mini_pvals
|
fund_mini_pvals
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The $p$-values are low for Managers One and Three, and high for the
|
The $p$-values are low for Managers One and Three, and high for the
|
||||||
other three managers. However, we cannot simply reject $H_{0,1}$ and
|
other three managers. However, we cannot simply reject $H_{0,1}$ and
|
||||||
$H_{0,3}$, since this would fail to account for the multiple testing
|
$H_{0,3}$, since this would fail to account for the multiple testing
|
||||||
@@ -219,8 +211,8 @@ reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
|
|||||||
reject
|
reject
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The $p$-values `bonf` are simply the `fund_mini_pvals` multiplied by 5 and truncated to be less than
|
The $p$-values `bonf` are simply the `fund_mini_pvals` multiplied by 5 and truncated to be less than
|
||||||
or equal to 1.
|
or equal to 1.
|
||||||
|
|
||||||
@@ -228,7 +220,7 @@ or equal to 1.
|
|||||||
bonf, np.minimum(fund_mini_pvals * 5, 1)
|
bonf, np.minimum(fund_mini_pvals * 5, 1)
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Therefore, using Bonferroni’s method, we are able to reject the null hypothesis only for Manager
|
Therefore, using Bonferroni’s method, we are able to reject the null hypothesis only for Manager
|
||||||
One while controlling FWER at $0.05$.
|
One while controlling FWER at $0.05$.
|
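Bonferroni compares every $p$-value with the single cutoff $\alpha/m$. Holm's step-down method instead compares the $j$th smallest $p$-value with $\alpha/(m+1-j)$ and stops at the first failure, so it never rejects fewer hypotheses. A minimal NumPy sketch of both rules follows; the function names and the five $p$-values are made up for illustration and are not the lab's output.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

def holm_reject(pvals, alpha=0.05):
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    # Thresholds alpha/m, alpha/(m-1), ..., alpha for the sorted p-values
    thresholds = alpha / (m - np.arange(m))
    below = pvals[order] <= thresholds
    # Step-down: reject until the first sorted p-value exceeds its threshold
    n_reject = m if below.all() else int(np.argmin(below))
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

demo_pvals = [0.004, 0.9, 0.012, 0.6, 0.75]   # hypothetical p-values
bonf_demo = bonferroni_reject(demo_pvals)
holm_demo = holm_reject(demo_pvals)
print(bonf_demo, holm_demo)
```

With these made-up values, Bonferroni rejects only the first hypothesis, while Holm also rejects the third because $0.012 \leq 0.05/4$.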
||||||
|
|
||||||
@@ -240,8 +232,8 @@ hypotheses for Managers One and Three at a FWER of $0.05$.
|
|||||||
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
|
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
As discussed previously, Manager One seems to perform particularly
|
As discussed previously, Manager One seems to perform particularly
|
||||||
well, whereas Manager Two has poor performance.
|
well, whereas Manager Two has poor performance.
|
||||||
|
|
||||||
@@ -250,8 +242,8 @@ well, whereas Manager Two has poor performance.
|
|||||||
fund_mini.mean()
|
fund_mini.mean()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Is there evidence of a meaningful difference in performance between
|
Is there evidence of a meaningful difference in performance between
|
||||||
these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function
|
these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function
|
||||||
from `scipy.stats`:
|
from `scipy.stats`:
|
||||||
@@ -261,7 +253,7 @@ ttest_rel(fund_mini['Manager1'],
|
|||||||
fund_mini['Manager2']).pvalue
|
fund_mini['Manager2']).pvalue
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The test results in a $p$-value of 0.038,
|
The test results in a $p$-value of 0.038,
|
||||||
suggesting a statistically significant difference.
|
suggesting a statistically significant difference.
|
||||||
|
|
||||||
@@ -269,7 +261,7 @@ However, we decided to perform this test only after examining the data
|
|||||||
and noting that Managers One and Two had the highest and lowest mean
|
and noting that Managers One and Two had the highest and lowest mean
|
||||||
performances. In a sense, this means that we have implicitly
|
performances. In a sense, this means that we have implicitly
|
||||||
performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than
|
performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than
|
||||||
just one, as discussed in Section 13.3.2. Hence, we use the
|
just one, as discussed in Section~\ref{tukey.sec}. Hence, we use the
|
||||||
`pairwise_tukeyhsd()` function from
|
`pairwise_tukeyhsd()` function from
|
||||||
`statsmodels.stats.multicomp` to apply Tukey’s method
|
`statsmodels.stats.multicomp` to apply Tukey’s method
|
||||||
in order to adjust for multiple testing. This function takes
|
in order to adjust for multiple testing. This function takes
|
||||||
@@ -286,8 +278,8 @@ tukey = pairwise_tukeyhsd(returns, managers)
|
|||||||
print(tukey.summary())
|
print(tukey.summary())
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
The `pairwise_tukeyhsd()` function provides confidence intervals
|
The `pairwise_tukeyhsd()` function provides confidence intervals
|
||||||
for the difference between each pair of managers (`lower` and
|
for the difference between each pair of managers (`lower` and
|
||||||
`upper`), as well as a $p$-value. All of these quantities have
|
`upper`), as well as a $p$-value. All of these quantities have
|
||||||
@@ -317,7 +309,7 @@ for i, manager in enumerate(Fund.columns):
|
|||||||
fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue
|
fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
There are far too many managers to consider trying to control the FWER.
|
There are far too many managers to consider trying to control the FWER.
|
||||||
Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
|
Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
|
||||||
The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.
|
The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.
|
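For intuition about what the Benjamini--Hochberg adjustment computes, here is a NumPy-only sketch of the adjusted values. It is an illustration under assumptions, not the `statsmodels` implementation; `bh_qvalues()` and the four $p$-values are hypothetical.

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    # Scale the j-th smallest p-value by m / j
    scaled = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity by taking running minima from the largest down
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

q_demo = bh_qvalues([0.01, 0.04, 0.03, 0.5])
print(q_demo)
```

Rejecting every hypothesis whose adjusted value is at most $q$ corresponds to controlling the FDR at level $q$ under the procedure's assumptions.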
||||||
@@ -327,7 +319,7 @@ fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
|
|||||||
fund_qvalues[:10]
|
fund_qvalues[:10]
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The *q-values* output by the
|
The *q-values* output by the
|
||||||
Benjamini--Hochberg procedure can be interpreted as the smallest FDR
|
Benjamini--Hochberg procedure can be interpreted as the smallest FDR
|
||||||
threshold at which we would reject a particular null hypothesis. For
|
threshold at which we would reject a particular null hypothesis. For
|
||||||
@@ -354,9 +346,9 @@ null hypotheses!
|
|||||||
(fund_pvalues <= 0.1 / 2000).sum()
|
(fund_pvalues <= 0.1 / 2000).sum()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Figure 13.6 displays the ordered
|
Figure~\ref{Ch12:fig:BonferroniBenjamini} displays the ordered
|
||||||
$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
|
$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
|
||||||
the `Fund` dataset, as well as the threshold for rejection by the
|
the `Fund` dataset, as well as the threshold for rejection by the
|
||||||
Benjamini--Hochberg procedure. Recall that the Benjamini--Hochberg
|
Benjamini--Hochberg procedure. Recall that the Benjamini--Hochberg
|
||||||
@@ -384,8 +376,8 @@ else:
|
|||||||
sorted_set_ = []
|
sorted_set_ = []
|
||||||
|
|
||||||
```
|
```
|
||||||
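The step-up rule behind this cell can also be written without `numpy`. The following is a sketch under stated assumptions: `bh_reject` and its inputs are hypothetical names, and it uses the conventional `<=` comparison, whereas the lab code above uses a strict `<`.

```python
def bh_reject(pvalues, q):
    """Flag the hypotheses rejected by Benjamini-Hochberg at FDR level q."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    # Find the largest 1-based rank j with p_(j) <= q * j / m (0 if none).
    largest = 0
    for j, p in enumerate(ordered, start=1):
        if p <= q * j / m:
            largest = j
    if largest == 0:
        return [False] * m
    # Reject every hypothesis whose p-value is at most p_(largest).
    threshold = ordered[largest - 1]
    return [p <= threshold for p in pvalues]


# Hypothetical p-values, chosen only for illustration.
pvals = [0.6, 0.001, 0.041, 0.008, 0.039]
rejected = bh_reject(pvals, q=0.1)
print(rejected)
```

Note that the rule rejects everything up to the largest qualifying rank, even if some smaller ordered $p$-value falls above its own line $q j / m$.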
|
|
||||||
We now reproduce the middle panel of Figure 13.6.
|
We now reproduce the middle panel of Figure~\ref{Ch12:fig:BonferroniBenjamini}.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
fig, ax = plt.subplots()
|
fig, ax = plt.subplots()
|
||||||
@@ -399,12 +391,12 @@ ax.scatter(sorted_set_+1, sorted_[sorted_set_], c='r', s=20)
|
|||||||
ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
|
ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## A Re-Sampling Approach
|
## A Re-Sampling Approach
|
||||||
Here, we implement the re-sampling approach to hypothesis testing
|
Here, we implement the re-sampling approach to hypothesis testing
|
||||||
using the `Khan` dataset, which we investigated in
|
using the `Khan` dataset, which we investigated in
|
||||||
Section 13.5. First, we merge the training and
|
Section~\ref{sec:permutations}. First, we merge the training and
|
||||||
testing data, which results in observations on 83 patients for
|
testing data, which results in observations on 83 patients for
|
||||||
2,308 genes.
|
2,308 genes.
|
||||||
|
|
||||||
@@ -415,8 +407,8 @@ D['Y'] = pd.concat([Khan['ytrain'], Khan['ytest']])
|
|||||||
D['Y'].value_counts()
|
D['Y'].value_counts()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
There are four classes of cancer. For each gene, we compare the mean
|
There are four classes of cancer. For each gene, we compare the mean
|
||||||
expression in the second class (rhabdomyosarcoma) to the mean
|
expression in the second class (rhabdomyosarcoma) to the mean
|
||||||
expression in the fourth class (Burkitt’s lymphoma). Performing a
|
expression in the fourth class (Burkitt’s lymphoma). Performing a
|
||||||
@@ -436,8 +428,8 @@ observedT, pvalue = ttest_ind(D2[gene_11],
|
|||||||
observedT, pvalue
|
observedT, pvalue
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
However, this $p$-value relies on the assumption that under the null
|
However, this $p$-value relies on the assumption that under the null
|
||||||
hypothesis of no difference between the two groups, the test statistic
|
hypothesis of no difference between the two groups, the test statistic
|
||||||
follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
|
follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
|
||||||
@@ -462,15 +454,15 @@ for b in range(B):
|
|||||||
D_null[n_:],
|
D_null[n_:],
|
||||||
equal_var=True)
|
equal_var=True)
|
||||||
Tnull[b] = ttest_.statistic
|
Tnull[b] = ttest_.statistic
|
||||||
(np.abs(Tnull) > np.abs(observedT)).mean()
|
(np.abs(Tnull) < np.abs(observedT)).mean()
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
This fraction, 0.0398,
|
This fraction, 0.0398,
|
||||||
is our re-sampling-based $p$-value.
|
is our re-sampling-based $p$-value.
|
||||||
It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
|
It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
|
||||||
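As a self-contained illustration of the idea, the sketch below computes a two-sided re-sampling $p$-value as the fraction of permuted statistics that are at least as large in absolute value as the observed one. The helper names, the seed, and the toy data are assumptions made for this example; they are not taken from the lab, which uses `ttest_ind()` on the `Khan` data.

```python
import random
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t-statistic with a pooled (equal) variance estimate."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


def permutation_pvalue(x, y, B=2000, seed=0):
    """Fraction of B label permutations with |T*| >= |T_observed|."""
    rng = random.Random(seed)
    observed = abs(pooled_t(x, y))
    combined = list(x) + list(y)
    n_x = len(x)
    exceed = 0
    for _ in range(B):
        # Shuffling the pooled data is equivalent to permuting group labels.
        rng.shuffle(combined)
        if abs(pooled_t(combined[:n_x], combined[n_x:])) >= observed:
            exceed += 1
    return exceed / B


# Hypothetical data: two clearly separated groups, then two similar groups.
p_separated = permutation_pvalue(
    [5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 5.3, 5.7],
    [1.2, 0.9, 1.8, 1.1, 1.5, 0.7, 1.3, 1.6],
)
p_similar = permutation_pvalue(
    [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
)
print(p_separated, p_similar)
```

A small fraction means that few permutations produce a statistic as extreme as the observed one, which is evidence against the null hypothesis of no difference between the groups.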
We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure 13.7.
|
We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure~\ref{Ch12:fig-permp-1}.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
fig, ax = plt.subplots(figsize=(8,8))
|
fig, ax = plt.subplots(figsize=(8,8))
|
||||||
@@ -493,7 +485,7 @@ ax.set_xlabel("Null Distribution of Test Statistic");
|
|||||||
The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
|
The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
|
||||||
|
|
||||||
Finally, we implement the plug-in re-sampling FDR approach outlined in
|
Finally, we implement the plug-in re-sampling FDR approach outlined in
|
||||||
Algorithm 13.4. Depending on the speed of your
|
Algorithm~\ref{Ch12:alg-plugin-fdr}. Depending on the speed of your
|
||||||
computer, calculating the FDR for all 2,308 genes in the `Khan`
|
computer, calculating the FDR for all 2,308 genes in the `Khan`
|
||||||
dataset may take a while. Hence, we will illustrate the approach on a
|
dataset may take a while. Hence, we will illustrate the approach on a
|
||||||
random subset of 100 genes. For each gene, we first compute the
|
random subset of 100 genes. For each gene, we first compute the
|
||||||
@@ -522,11 +514,11 @@ for j in range(m):
|
|||||||
Tnull_vals[j,b] = ttest_.statistic
|
Tnull_vals[j,b] = ttest_.statistic
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Next, we compute the number of rejected null hypotheses $R$, the
|
Next, we compute the number of rejected null hypotheses $R$, the
|
||||||
estimated number of false positives $\widehat{V}$, and the estimated
|
estimated number of false positives $\widehat{V}$, and the estimated
|
||||||
FDR, for a range of threshold values $c$ in
|
FDR, for a range of threshold values $c$ in
|
||||||
Algorithm 13.4. The threshold values are chosen
|
Algorithm~\ref{Ch12:alg-plugin-fdr}. The threshold values are chosen
|
||||||
using the absolute values of the test statistics from the 100 genes.
|
using the absolute values of the test statistics from the 100 genes.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
@@ -540,7 +532,7 @@ for j in range(m):
|
|||||||
FDRs[j] = V / R
|
FDRs[j] = V / R
|
||||||
|
|
||||||
```
|
```
|
||||||
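The quantities computed in this loop can be checked on a small example. The sketch below is a plain-Python version for a single threshold $c$: the function name `plugin_fdr` and the toy statistics are hypothetical, and the `nan` guard for $R = 0$ is an addition that the lab code does not need, because its cutoffs are taken from the observed statistics.

```python
def plugin_fdr(t_obs, t_null, cutoff):
    """Return (R, V_hat, FDR_hat) for a threshold on |T|.

    t_obs:  list of m observed test statistics.
    t_null: list of m lists, each holding B re-sampled statistics.
    """
    B = len(t_null[0])
    # R: number of observed statistics at or beyond the threshold.
    R = sum(abs(t) >= cutoff for t in t_obs)
    # V_hat: average number of re-sampled statistics beyond the threshold
    # per permutation, which estimates the number of false positives.
    V_hat = sum(abs(t) >= cutoff for row in t_null for t in row) / B
    fdr_hat = V_hat / R if R > 0 else float("nan")
    return R, V_hat, fdr_hat


# Hypothetical statistics for m = 4 genes and B = 4 permutations.
t_obs = [3.2, -0.4, 2.5, 0.9]
t_null = [
    [0.5, -2.6, 1.1, 0.3],
    [-0.2, 0.8, -1.4, 2.7],
    [1.9, -0.7, 0.1, -0.6],
    [0.4, 1.2, -3.0, 0.2],
]
R, V_hat, fdr_hat = plugin_fdr(t_obs, t_null, cutoff=2.5)
print(R, V_hat, fdr_hat)
```

Here two observed statistics reach the threshold, and three of the sixteen re-sampled statistics do, so the estimated FDR is $(3/4)/2 = 0.375$.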
|
|
||||||
Now, for any given FDR, we can find the genes that will be
|
Now, for any given FDR, we can find the genes that will be
|
||||||
rejected. For example, with FDR controlled at 0.1, we reject 15 of the
|
rejected. For example, with FDR controlled at 0.1, we reject 15 of the
|
||||||
100 null hypotheses. On average, we would expect about one or two of
|
100 null hypotheses. On average, we would expect about one or two of
|
||||||
@@ -556,7 +548,7 @@ the genes whose estimated FDR is less than 0.1.
|
|||||||
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])
|
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
|
At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
|
||||||
proportion of false discoveries.
|
proportion of false discoveries.
|
||||||
|
|
||||||
@@ -564,9 +556,9 @@ proportion of false discoveries.
|
|||||||
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
|
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The next line generates Figure 13.11, which is similar
|
The next line generates Figure~\ref{fig:labfdr}, which is similar
|
||||||
to Figure 13.9,
|
to Figure~\ref{Ch12:fig-plugin-fdr},
|
||||||
except that it is based on only a subset of the genes.
|
except that it is based on only a subset of the genes.
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -1,16 +1,16 @@
|
|||||||
numpy==1.24.2
|
numpy==1.26.4
|
||||||
scipy==1.11.1
|
scipy==1.11.4
|
||||||
pandas==1.5.3
|
pandas==2.2.2
|
||||||
lxml==4.9.3
|
lxml==5.2.2
|
||||||
scikit-learn==1.3.0
|
scikit-learn==1.5.0
|
||||||
joblib==1.3.1
|
joblib==1.4.2
|
||||||
statsmodels==0.14.0
|
statsmodels==0.14.2
|
||||||
lifelines==0.27.7
|
lifelines==0.28.0
|
||||||
pygam==0.9.0
|
pygam==0.9.1
|
||||||
l0bnb==1.0.0
|
l0bnb==1.0.0
|
||||||
torch==2.0.1
|
torch==2.3.0
|
||||||
torchvision==0.15.2
|
torchvision==0.18.0
|
||||||
pytorch-lightning==2.0.6
|
pytorch-lightning==2.2.5
|
||||||
torchinfo==1.8.0
|
torchinfo==1.8.0
|
||||||
torchmetrics==1.0.1
|
torchmetrics==1.4.0.post0
|
||||||
ISLP==0.3.22
|
ISLP==0.4.0
|
||||||
|
|||||||