fixing whitespace in Rmd so diff of errata is cleaner (#46)

* fixing whitespace in Rmd so diff of errata is cleaner

* reapply kwargs fix
This commit is contained in:
Jonathan Taylor
2025-04-03 12:25:17 -07:00
committed by GitHub
parent f132c18a1c
commit 6d7e40588b
12 changed files with 392 additions and 410 deletions

View File

@@ -9,8 +9,8 @@
## Getting Started ## Getting Started
@@ -66,21 +66,21 @@ inputs. For example, the
print('fit a model with', 11, 'variables') print('fit a model with', 11, 'variables')
``` ```
The following command will provide information about the `print()` function. The following command will provide information about the `print()` function.
```{python} ```{python}
print? print?
``` ```
Adding two integers in `Python` is pretty intuitive. Adding two integers in `Python` is pretty intuitive.
```{python} ```{python}
3 + 5 3 + 5
``` ```
In `Python`, textual data is handled using In `Python`, textual data is handled using
*strings*. For instance, `"hello"` and *strings*. For instance, `"hello"` and
`'hello'` `'hello'`
@@ -91,7 +91,7 @@ We can concatenate them using the addition `+` symbol.
"hello" + " " + "world" "hello" + " " + "world"
``` ```
A string is actually a type of *sequence*: this is a generic term for an ordered list. A string is actually a type of *sequence*: this is a generic term for an ordered list.
The three most important types of sequences are lists, tuples, and strings. The three most important types of sequences are lists, tuples, and strings.
We introduce lists now. We introduce lists now.
@@ -107,7 +107,7 @@ x = [3, 4, 5]
x x
``` ```
Note that we used the brackets Note that we used the brackets
`[]` to construct this list. `[]` to construct this list.
@@ -119,14 +119,14 @@ y = [4, 9, 7]
x + y x + y
``` ```
The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists
element-by-element? element-by-element?
In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*. In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*.
In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`. In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`.
This example reflects the fact that This example reflects the fact that
`Python` is a general-purpose programming language. Much of `Python`'s data-specific `Python` is a general-purpose programming language. Much of `Python`'s data-specific
functionality comes from other packages, notably `numpy` functionality comes from other packages, notably `numpy`
@@ -141,8 +141,8 @@ See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/n
As mentioned earlier, this book makes use of functionality that is contained in the `numpy` As mentioned earlier, this book makes use of functionality that is contained in the `numpy`
*library*, or *package*. A package is a collection of modules that are not necessarily included in *library*, or *package*. A package is a collection of modules that are not necessarily included in
the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*. the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*.
To access `numpy`, we must first `import` it. To access `numpy`, we must first `import` it.
```{python} ```{python}
@@ -186,7 +186,7 @@ x
The object `x` has several The object `x` has several
*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute` *attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`
@@ -196,7 +196,7 @@ For instance, we can access the `ndim` attribute of `x` as follows.
```{python} ```{python}
x.ndim x.ndim
``` ```
The output indicates that `x` is a two-dimensional array. The output indicates that `x` is a two-dimensional array.
Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is
comprised of 64-bit integers: comprised of 64-bit integers:
@@ -238,7 +238,7 @@ at its `shape` attribute.
x.shape x.shape
``` ```
A *method* is a function that is associated with an A *method* is a function that is associated with an
object. object.
@@ -275,10 +275,10 @@ x_reshape = x.reshape((2, 3))
print('reshaped x:\n', x_reshape) print('reshaped x:\n', x_reshape)
``` ```
The previous output reveals that `numpy` arrays are specified as a sequence The previous output reveals that `numpy` arrays are specified as a sequence
of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*. of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*.
`Python` (and hence `numpy`) uses 0-based `Python` (and hence `numpy`) uses 0-based
indexing. This means that to access the top left element of `x_reshape`, indexing. This means that to access the top left element of `x_reshape`,
@@ -308,13 +308,13 @@ print('x_reshape after we modify its top left element:\n', x_reshape)
print('x after we modify top left element of x_reshape:\n', x) print('x after we modify top left element of x_reshape:\n', x)
``` ```
Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory. Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory.
We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces
an *exception*, or error. an *exception*, or error.
@@ -323,8 +323,8 @@ my_tuple = (3, 4, 5)
my_tuple[0] = 2 my_tuple[0] = 2
``` ```
We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple. We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple.
The `ndim` attribute yields the number of dimensions, and `T` provides its transpose. The `ndim` attribute yields the number of dimensions, and `T` provides its transpose.
@@ -332,7 +332,7 @@ The `ndim` attribute yields the number of dimensions, and `T` provides its tran
x_reshape.shape, x_reshape.ndim, x_reshape.T x_reshape.shape, x_reshape.ndim, x_reshape.T
``` ```
Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple. Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple.
We will often want to apply functions to arrays. We will often want to apply functions to arrays.
@@ -343,22 +343,22 @@ square root of the entries using the `np.sqrt()` function:
np.sqrt(x) np.sqrt(x)
``` ```
We can also square the elements: We can also square the elements:
```{python} ```{python}
x**2 x**2
``` ```
We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2. We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2.
```{python} ```{python}
x**0.5 x**0.5
``` ```
Throughout this book, we will often want to generate random data. Throughout this book, we will often want to generate random data.
The `np.random.normal()` function generates a vector of random The `np.random.normal()` function generates a vector of random
normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`. normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`.
@@ -375,7 +375,7 @@ x = np.random.normal(size=50)
x x
``` ```
We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`. We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`.
```{python} ```{python}
@@ -387,7 +387,7 @@ correlation between `x` and `y`.
```{python} ```{python}
np.corrcoef(x, y) np.corrcoef(x, y)
``` ```
If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few
commands. In particular, commands. In particular,
each each
@@ -400,7 +400,7 @@ print(np.random.normal(scale=5, size=2))
``` ```
In order to ensure that our code provides exactly the same results In order to ensure that our code provides exactly the same results
each time it is run, we can set a *random seed* each time it is run, we can set a *random seed*
using the using the
@@ -416,7 +416,7 @@ print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303) rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2)) print(rng2.normal(scale=5, size=2))
``` ```
Throughout the labs in this book, we use `np.random.default_rng()` whenever we Throughout the labs in this book, we use `np.random.default_rng()` whenever we
perform calculations involving random quantities within `numpy`. In principle, this perform calculations involving random quantities within `numpy`. In principle, this
should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible
@@ -439,7 +439,7 @@ np.mean(y), y.mean()
```{python} ```{python}
np.var(y), y.var(), np.mean((y - y.mean())**2) np.var(y), y.var(), np.mean((y - y.mean())**2)
``` ```
Notice that by default `np.var()` divides by the sample size $n$ rather Notice that by default `np.var()` divides by the sample size $n$ rather
than $n-1$; see the `ddof` argument in `np.var?`. than $n-1$; see the `ddof` argument in `np.var?`.
@@ -448,7 +448,7 @@ than $n-1$; see the `ddof` argument in `np.var?`.
```{python} ```{python}
np.sqrt(np.var(y)), np.std(y) np.sqrt(np.var(y)), np.std(y)
``` ```
The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix. The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix.
To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums. To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums.
@@ -462,14 +462,14 @@ Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its
```{python} ```{python}
X.mean(axis=0) X.mean(axis=0)
``` ```
The following yields the same result. The following yields the same result.
```{python} ```{python}
X.mean(0) X.mean(0)
``` ```
## Graphics ## Graphics
In `Python`, common practice is to use the library In `Python`, common practice is to use the library
@@ -535,7 +535,7 @@ As an alternative, we could use the `ax.scatter()` function to create a scatter
fig, ax = subplots(figsize=(8, 8)) fig, ax = subplots(figsize=(8, 8))
ax.scatter(x, y, marker='o'); ax.scatter(x, y, marker='o');
``` ```
Notice that in the code blocks above, we have ended Notice that in the code blocks above, we have ended
the last line with a semicolon. This prevents `ax.plot(x, y)` from printing the last line with a semicolon. This prevents `ax.plot(x, y)` from printing
text to the notebook. However, it does not prevent a plot from being produced. text to the notebook. However, it does not prevent a plot from being produced.
@@ -576,7 +576,7 @@ fig.set_size_inches(12,3)
fig fig
``` ```
Occasionally we will want to create several plots within a figure. This can be Occasionally we will want to create several plots within a figure. This can be
achieved by passing additional arguments to `subplots()`. achieved by passing additional arguments to `subplots()`.
@@ -605,8 +605,8 @@ Type `subplots?` to learn more about
To save the output of `fig`, we call its `savefig()` To save the output of `fig`, we call its `savefig()`
method. The argument `dpi` is the dots per inch, used method. The argument `dpi` is the dots per inch, used
to determine how large the figure will be in pixels. to determine how large the figure will be in pixels.
@@ -616,7 +616,7 @@ fig.savefig("Figure.png", dpi=400)
fig.savefig("Figure.pdf", dpi=200); fig.savefig("Figure.pdf", dpi=200);
``` ```
We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it. We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it.
@@ -668,7 +668,7 @@ fig, ax = subplots(figsize=(8, 8))
ax.imshow(f); ax.imshow(f);
``` ```
## Sequences and Slice Notation ## Sequences and Slice Notation
@@ -682,8 +682,8 @@ seq1 = np.linspace(0, 10, 11)
seq1 seq1
``` ```
The function `np.arange()` The function `np.arange()`
returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence
that starts at $0$ and ends at $10$. that starts at $0$ and ends at $10$.
@@ -693,7 +693,7 @@ seq2 = np.arange(0, 10)
seq2 seq2
``` ```
Why isn't $10$ output above? This has to do with *slice* notation in `Python`. Why isn't $10$ output above? This has to do with *slice* notation in `Python`.
Slice notation Slice notation
is used to index sequences such as lists, tuples and arrays. is used to index sequences such as lists, tuples and arrays.
@@ -735,7 +735,7 @@ See the documentation `slice?` for useful options in creating slices.
## Indexing Data ## Indexing Data
To begin, we create a two-dimensional `numpy` array. To begin, we create a two-dimensional `numpy` array.
@@ -745,7 +745,7 @@ A = np.array(np.arange(16)).reshape((4, 4))
A A
``` ```
Typing `A[1,2]` retrieves the element corresponding to the second row and third Typing `A[1,2]` retrieves the element corresponding to the second row and third
column. (As usual, `Python` indexes from $0.$) column. (As usual, `Python` indexes from $0.$)
@@ -753,7 +753,7 @@ column. (As usual, `Python` indexes from $0.$)
A[1,2] A[1,2]
``` ```
The first number after the open-bracket symbol `[` The first number after the open-bracket symbol `[`
refers to the row, and the second number refers to the column. refers to the row, and the second number refers to the column.
@@ -765,7 +765,7 @@ The first number after the open-bracket symbol `[`
A[[1,3]] A[[1,3]]
``` ```
To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets. To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets.
In this case we need to supply the first argument `:` In this case we need to supply the first argument `:`
which selects all rows. which selects all rows.
@@ -774,7 +774,7 @@ which selects all rows.
A[:,[0,2]] A[:,[0,2]]
``` ```
Now, suppose that we want to select the submatrix made up of the second and fourth Now, suppose that we want to select the submatrix made up of the second and fourth
rows as well as the first and third columns. This is where rows as well as the first and third columns. This is where
indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns: indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:
@@ -783,21 +783,21 @@ indexing gets slightly tricky. It is natural to try to use lists to retrieve th
A[[1,3],[0,2]] A[[1,3],[0,2]]
``` ```
Oops --- what happened? We got a one-dimensional array of length two identical to Oops --- what happened? We got a one-dimensional array of length two identical to
```{python} ```{python}
np.array([A[1,0],A[3,2]]) np.array([A[1,0],A[3,2]])
``` ```
Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns: Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns:
```{python} ```{python}
A[[1,3],[0,2,3]] A[[1,3],[0,2,3]]
``` ```
We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix. We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.
One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns. One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns.
@@ -808,7 +808,7 @@ A[[1,3]][:,[0,2]]
``` ```
There are more efficient ways of achieving the same result. There are more efficient ways of achieving the same result.
@@ -820,7 +820,7 @@ idx = np.ix_([1,3],[0,2,3])
A[idx] A[idx]
``` ```
Alternatively, we can subset matrices efficiently using slices. Alternatively, we can subset matrices efficiently using slices.
@@ -834,7 +834,7 @@ A[1:4:2,0:3:2]
``` ```
Why are we able to retrieve a submatrix directly using slices but not using lists? Why are we able to retrieve a submatrix directly using slices but not using lists?
Its because they are different `Python` types, and Its because they are different `Python` types, and
are treated differently by `numpy`. are treated differently by `numpy`.
@@ -850,7 +850,7 @@ Slices can be used to extract objects from arbitrary sequences, such as strings,
### Boolean Indexing ### Boolean Indexing
In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively). In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively).
@@ -867,7 +867,7 @@ keep_rows[[1,3]] = True
keep_rows keep_rows
``` ```
Note that the elements of `keep_rows`, when viewed as integers, are the same as the Note that the elements of `keep_rows`, when viewed as integers, are the same as the
values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When
applied to two arrays, the `==` operation is applied elementwise. applied to two arrays, the `==` operation is applied elementwise.
@@ -876,7 +876,7 @@ applied to two arrays, the `==` operation is applied elementwise.
np.all(keep_rows == np.array([0,1,0,1])) np.all(keep_rows == np.array([0,1,0,1]))
``` ```
(Here, the function `np.all()` has checked whether (Here, the function `np.all()` has checked whether
all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.) all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.)
@@ -888,14 +888,14 @@ The former retrieves the first, second, first, and second rows of `A`.
A[np.array([0,1,0,1])] A[np.array([0,1,0,1])]
``` ```
By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `TRUE`. By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `TRUE`.
```{python} ```{python}
A[keep_rows] A[keep_rows]
``` ```
This example shows that Booleans and integers are treated differently by `numpy`. This example shows that Booleans and integers are treated differently by `numpy`.
@@ -919,7 +919,7 @@ A[idx_mixed]
``` ```
For more details on indexing in `numpy`, readers are referred For more details on indexing in `numpy`, readers are referred
to the `numpy` tutorial mentioned earlier. to the `numpy` tutorial mentioned earlier.
@@ -972,7 +972,7 @@ files. Before loading data into `Python`, it is a good idea to view it using
a text editor or other software, such as Microsoft Excel. a text editor or other software, such as Microsoft Excel.
We now take a look at the column of `Auto` corresponding to the variable `horsepower`: We now take a look at the column of `Auto` corresponding to the variable `horsepower`:
@@ -993,7 +993,7 @@ We see the culprit is the value `?`, which is being used to encode missing value
To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`. To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`.
Now, each instance of `?` in the file is replaced with the Now, each instance of `?` in the file is replaced with the
value `np.nan`, which means *not a number*: value `np.nan`, which means *not a number*:
@@ -1005,8 +1005,8 @@ Auto = pd.read_csv('Auto.data',
Auto['horsepower'].sum() Auto['horsepower'].sum()
``` ```
The `Auto.shape` attribute tells us that the data has 397 The `Auto.shape` attribute tells us that the data has 397
observations, or rows, and nine variables, or columns. observations, or rows, and nine variables, or columns.
@@ -1014,7 +1014,7 @@ observations, or rows, and nine variables, or columns.
Auto.shape Auto.shape
``` ```
There are There are
various ways to deal with missing data. various ways to deal with missing data.
In this case, since only five of the rows contain missing In this case, since only five of the rows contain missing
@@ -1025,7 +1025,7 @@ Auto_new = Auto.dropna()
Auto_new.shape Auto_new.shape
``` ```
### Basics of Selecting Rows and Columns ### Basics of Selecting Rows and Columns
@@ -1036,7 +1036,7 @@ Auto = Auto_new # overwrite the previous value
Auto.columns Auto.columns
``` ```
Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array. Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array.
Recall that the first argument to the `[]` method Recall that the first argument to the `[]` method
@@ -1321,8 +1321,8 @@ Auto.plot.scatter('horsepower', 'mpg', ax=axes[1]);
``` ```
Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`. Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`.
We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable. We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable.
However, since there is only a small number of possible values for this variable, we may wish to treat it as However, since there is only a small number of possible values for this variable, we may wish to treat it as
qualitative. Below, we replace qualitative. Below, we replace
@@ -1341,7 +1341,7 @@ fig, ax = subplots(figsize=(8, 8))
Auto.boxplot('mpg', by='cylinders', ax=ax); Auto.boxplot('mpg', by='cylinders', ax=ax);
``` ```
The `hist()` method can be used to plot a *histogram*. The `hist()` method can be used to plot a *histogram*.
```{python} ```{python}

View File

@@ -1,4 +1,3 @@
# Linear Regression # Linear Regression
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch03-linreg-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch03-linreg-lab.ipynb">
@@ -19,7 +18,7 @@ import pandas as pd
from matplotlib.pyplot import subplots from matplotlib.pyplot import subplots
``` ```
### New imports ### New imports
Throughout this lab we will introduce new functions and libraries. However, Throughout this lab we will introduce new functions and libraries. However,
@@ -95,7 +94,7 @@ A.sum()
``` ```
## Simple Linear Regression ## Simple Linear Regression
In this section we will construct model In this section we will construct model
@@ -117,7 +116,7 @@ Boston = load_data("Boston")
Boston.columns Boston.columns
``` ```
Type `Boston?` to find out more about these data. Type `Boston?` to find out more about these data.
We start by using the `sm.OLS()` function to fit a We start by using the `sm.OLS()` function to fit a
@@ -132,7 +131,7 @@ X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
X[:4] X[:4]
``` ```
We extract the response, and fit the model. We extract the response, and fit the model.
```{python} ```{python}
@@ -154,7 +153,7 @@ method, and returns such a summary.
summarize(results) summarize(results)
``` ```
Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix~`X`. Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix~`X`.
### Using Transformations: Fit and Transform ### Using Transformations: Fit and Transform
@@ -225,8 +224,8 @@ The fitted coefficients can also be retrieved as the
results.params results.params
``` ```
The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
prediction intervals for the prediction of `medv` for given values of `lstat`. prediction intervals for the prediction of `medv` for given values of `lstat`.
@@ -396,7 +395,7 @@ terms = Boston.columns.drop('medv')
terms terms
``` ```
We can now fit the model with all the variables in `terms` using We can now fit the model with all the variables in `terms` using
the same model matrix builder. the same model matrix builder.
@@ -407,7 +406,7 @@ results = model.fit()
summarize(results) summarize(results)
``` ```
What if we would like to perform a regression using all of the variables but one? For What if we would like to perform a regression using all of the variables but one? For
example, in the above regression output, `age` has a high $p$-value. example, in the above regression output, `age` has a high $p$-value.
So we may wish to run a regression excluding this predictor. So we may wish to run a regression excluding this predictor.
@@ -482,7 +481,7 @@ model2 = sm.OLS(y, X)
summarize(model2.fit()) summarize(model2.fit())
``` ```
## Non-linear Transformations of the Predictors ## Non-linear Transformations of the Predictors
The model matrix builder can include terms beyond The model matrix builder can include terms beyond
@@ -557,7 +556,7 @@ there is little discernible pattern in the residuals.
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
to `poly()`. to `poly()`.
## Qualitative Predictors ## Qualitative Predictors
Here we use the `Carseats` data, which is included in the Here we use the `Carseats` data, which is included in the

View File

@@ -1,7 +1,3 @@
# Logistic Regression, LDA, QDA, and KNN # Logistic Regression, LDA, QDA, and KNN
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch04-classification-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch04-classification-lab.ipynb">

View File

@@ -1,6 +1,3 @@
# Cross-Validation and the Bootstrap # Cross-Validation and the Bootstrap
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb">

View File

@@ -1,4 +1,3 @@
# Linear Models and Regularization Methods # Linear Models and Regularization Methods
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch06-varselect-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch06-varselect-lab.ipynb">
@@ -65,7 +64,7 @@ Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum() np.isnan(Hitters['Salary']).sum()
``` ```
We see that `Salary` is missing for 59 players. The We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing `dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`). values in any variable (by default --- see `Hitters.dropna?`).
@@ -75,8 +74,8 @@ Hitters = Hitters.dropna();
Hitters.shape Hitters.shape
``` ```
We first choose the best model using forward selection based on $C_p$ (\ref{Ch6:eq:cp}). This score We first choose the best model using forward selection based on $C_p$ (\ref{Ch6:eq:cp}). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence it as a scorer. By default, `sklearn` tries to maximize a score, hence
@@ -110,7 +109,7 @@ neg_Cp = partial(nCp, sigma2)
``` ```
We can now use `neg_Cp()` as a scorer for model selection. We can now use `neg_Cp()` as a scorer for model selection.
Along with a score we need to specify the search strategy. This is done through the object Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()` `Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
@@ -124,7 +123,7 @@ strategy = Stepwise.first_peak(design,
max_terms=len(design.terms)) max_terms=len(design.terms))
``` ```
We now fit a linear regression model with `Salary` as outcome using forward We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its a model from `statsmodels` along with a search strategy and selects a model with its
@@ -138,7 +137,7 @@ hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_ hitters_MSE.selected_state_
``` ```
Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected. Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.
```{python} ```{python}
@@ -149,7 +148,7 @@ hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_ hitters_Cp.selected_state_
``` ```
### Choosing Among Models Using the Validation Set Approach and Cross-Validation ### Choosing Among Models Using the Validation Set Approach and Cross-Validation
As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
@@ -171,7 +170,7 @@ strategy = Stepwise.fixed_steps(design,
full_path = sklearn_selection_path(OLS, strategy) full_path = sklearn_selection_path(OLS, strategy)
``` ```
We now fit the full forward-selection path on the `Hitters` data and compute the fitted values. We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.
```{python} ```{python}
@@ -180,8 +179,8 @@ Yhat_in = full_path.predict(Hitters)
Yhat_in.shape Yhat_in.shape
``` ```
This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves each step we take, in-sample MSE. As expected, the in-sample MSE improves each step we take,
indicating we must use either the validation or cross-validation indicating we must use either the validation or cross-validation
@@ -270,7 +269,7 @@ ax.legend()
mse_fig mse_fig
``` ```
To repeat the above using the validation set approach, we simply change our To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and training. We choose a test size `cv` argument to a validation set: one random split of the data into a test and training. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation.`skm.ShuffleSplit()` of 20%, similar to the size of each test set in 5-fold cross-validation.`skm.ShuffleSplit()`
@@ -300,7 +299,7 @@ ax.legend()
mse_fig mse_fig
``` ```
### Best Subset Selection ### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters` Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
@@ -328,7 +327,7 @@ path = fit_path(X,
max_nonzeros=X.shape[1]) max_nonzeros=X.shape[1])
``` ```
The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book. The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.
```{python} ```{python}
@@ -396,7 +395,7 @@ soln_path.index.name = 'negative log(lambda)'
soln_path soln_path
``` ```
We plot the paths to get a sense of how the coefficients vary with $\lambda$. We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`. plot method, adding it afterward with the `legend()` method of `ax`.
@@ -420,14 +419,14 @@ beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat lambdas[39], beta_hat
``` ```
Lets compute the $\ell_2$ norm of the standardized coefficients. Lets compute the $\ell_2$ norm of the standardized coefficients.
```{python} ```{python}
np.linalg.norm(beta_hat) np.linalg.norm(beta_hat)
``` ```
In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01. In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$. coefficients associated with this smaller value of $\lambda$.
@@ -481,7 +480,7 @@ results = skm.cross_validate(ridge,
-results['test_score'] -results['test_score']
``` ```
The test MSE is 1.342e+05. Note The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the would have predicted each test observation using the mean of the
@@ -518,7 +517,7 @@ grid.best_params_['ridge__alpha']
grid.best_estimator_ grid.best_estimator_
``` ```
Alternatively, we can use 5-fold cross-validation. Alternatively, we can use 5-fold cross-validation.
```{python} ```{python}
@@ -544,7 +543,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20); ax.set_ylabel('Cross-validated MSE', fontsize=20);
``` ```
One can cross-validate different metrics to choose a parameter. The default One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$. metric for `skl.ElasticNet()` is test $R^2$.
Lets compare $R^2$ to MSE for cross-validation here. Lets compare $R^2$ to MSE for cross-validation here.
@@ -556,7 +555,7 @@ grid_r2 = skm.GridSearchCV(pipe,
grid_r2.fit(X, Y) grid_r2.fit(X, Y)
``` ```
Finally, lets plot the results for cross-validated $R^2$. Finally, lets plot the results for cross-validated $R^2$.
```{python} ```{python}
@@ -568,7 +567,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20); ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
``` ```
### Fast Cross-Validation for Solution Paths ### Fast Cross-Validation for Solution Paths
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
@@ -588,7 +587,7 @@ pipeCV = Pipeline(steps=[('scaler', scaler),
pipeCV.fit(X, Y) pipeCV.fit(X, Y)
``` ```
Lets produce a plot again of the cross-validation error to see that Lets produce a plot again of the cross-validation error to see that
it is similar to using `skm.GridSearchCV`. it is similar to using `skm.GridSearchCV`.
@@ -604,7 +603,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20); ax.set_ylabel('Cross-validated MSE', fontsize=20);
``` ```
We see that the value of $\lambda$ that results in the We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE as the value `tuned_ridge.alpha_`. What is the test MSE
@@ -614,7 +613,7 @@ associated with this value of $\lambda$?
np.min(tuned_ridge.mse_path_.mean(1)) np.min(tuned_ridge.mse_path_.mean(1))
``` ```
This represents a further improvement over the test MSE that we got This represents a further improvement over the test MSE that we got
using $\lambda=4$. Finally, `tuned_ridge.coef_` using $\lambda=4$. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set has the coefficients fit on the entire data set
@@ -670,7 +669,7 @@ results = skm.cross_validate(pipeCV,
``` ```
### The Lasso ### The Lasso
We saw that ridge regression with a wise choice of $\lambda$ can We saw that ridge regression with a wise choice of $\lambda$ can
@@ -725,7 +724,7 @@ regression (page~\pageref{page:MSECVRidge}) with $\lambda$ chosen by cross-valid
np.min(tuned_lasso.mse_path_.mean(1)) np.min(tuned_lasso.mse_path_.mean(1))
``` ```
Lets again produce a plot of the cross-validation error. Lets again produce a plot of the cross-validation error.
@@ -750,7 +749,7 @@ variables.
tuned_lasso.coef_ tuned_lasso.coef_
``` ```
As in ridge regression, we could evaluate the test error As in ridge regression, we could evaluate the test error
of cross-validated lasso by first splitting into of cross-validated lasso by first splitting into
test and training sets and internally running test and training sets and internally running
@@ -761,7 +760,7 @@ this as an exercise.
## PCR and PLS Regression ## PCR and PLS Regression
### Principal Components Regression ### Principal Components Regression
Principal components regression (PCR) can be performed using Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition` `PCA()` from the `sklearn.decomposition`
@@ -782,7 +781,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_ pipe.named_steps['linreg'].coef_
``` ```
When performing PCA, the results vary depending When performing PCA, the results vary depending
on whether the data has been *standardized* or not. on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished As in the earlier examples, this can be accomplished
@@ -796,7 +795,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_ pipe.named_steps['linreg'].coef_
``` ```
We can of course use CV to choose the number of components, by We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this using `skm.GridSearchCV`, in this
case fixing the parameters to vary the case fixing the parameters to vary the
@@ -811,7 +810,7 @@ grid = skm.GridSearchCV(pipe,
grid.fit(X, Y) grid.fit(X, Y)
``` ```
Lets plot the results as we have for other methods. Lets plot the results as we have for other methods.
```{python} ```{python}
@@ -826,7 +825,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]); ax.set_ylim([50000,250000]);
``` ```
We see that the smallest cross-validation error occurs when We see that the smallest cross-validation error occurs when
17 17
components are used. However, from the plot we also see that the components are used. However, from the plot we also see that the
@@ -850,8 +849,8 @@ cv_null = skm.cross_validate(linreg,
-cv_null['test_score'].mean() -cv_null['test_score'].mean()
``` ```
The `explained_variance_ratio_` The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using
different numbers of components. This concept is discussed in greater different numbers of components. This concept is discussed in greater
@@ -861,7 +860,7 @@ detail in Section~\ref{Ch10:sec:pca}.
pipe.named_steps['pca'].explained_variance_ratio_ pipe.named_steps['pca'].explained_variance_ratio_
``` ```
Briefly, we can think of Briefly, we can think of
this as the amount of information about the predictors this as the amount of information about the predictors
that is captured using $M$ principal components. For example, setting that is captured using $M$ principal components. For example, setting
@@ -884,7 +883,7 @@ pls = PLSRegression(n_components=2,
pls.fit(X, Y) pls.fit(X, Y)
``` ```
As was the case in PCR, we will want to As was the case in PCR, we will want to
use CV to choose the number of components. use CV to choose the number of components.
@@ -897,7 +896,7 @@ grid = skm.GridSearchCV(pls,
grid.fit(X, Y) grid.fit(X, Y)
``` ```
As for our other methods, we plot the MSE. As for our other methods, we plot the MSE.
```{python} ```{python}
@@ -912,7 +911,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]); ax.set_ylim([50000,250000]);
``` ```
CV error is minimized at 12, CV error is minimized at 12,
though there is little noticable difference between this point and a much lower number like 2 or 3 components. though there is little noticable difference between this point and a much lower number like 2 or 3 components.

View File

@@ -22,7 +22,7 @@ from ISLP.models import (summarize,
ModelSpec as MS) ModelSpec as MS)
from statsmodels.stats.anova import anova_lm from statsmodels.stats.anova import anova_lm
``` ```
We again collect the new imports We again collect the new imports
needed for this lab. Many of these are developed specifically for the needed for this lab. Many of these are developed specifically for the
`ISLP` package. `ISLP` package.
@@ -43,7 +43,7 @@ from ISLP.pygam import (approx_lam,
anova as anova_gam) anova as anova_gam)
``` ```
## Polynomial Regression and Step Functions ## Polynomial Regression and Step Functions
We start by demonstrating how Figure~\ref{Ch7:fig:poly} can be reproduced. We start by demonstrating how Figure~\ref{Ch7:fig:poly} can be reproduced.
Let's begin by loading the data. Let's begin by loading the data.
@@ -54,7 +54,7 @@ y = Wage['wage']
age = Wage['age'] age = Wage['age']
``` ```
Throughout most of this lab, our response is `Wage['wage']`, which Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above. we have stored as `y` above.
As in Section~\ref{Ch3-linreg-lab:non-linear-transformations-of-the-predictors}, we will use the `poly()` function to create a model matrix As in Section~\ref{Ch3-linreg-lab:non-linear-transformations-of-the-predictors}, we will use the `poly()` function to create a model matrix
@@ -66,8 +66,8 @@ M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M) summarize(M)
``` ```
This polynomial is constructed using the function `poly()`, This polynomial is constructed using the function `poly()`,
which creates which creates
a special *transformer* `Poly()` (using `sklearn` terminology a special *transformer* `Poly()` (using `sklearn` terminology
@@ -88,7 +88,7 @@ on the second line, as well as in the plotting function developed below.
We now create a grid of values for `age` at which we want We now create a grid of values for `age` at which we want
predictions. predictions.
@@ -156,7 +156,7 @@ plot_wage_fit(age_df,
With polynomial regression we must decide on the degree of With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use the polynomial to use. Sometimes we just wing it, and decide to use
second or third degree polynomials, simply to obtain a nonlinear fit. But we can second or third degree polynomials, simply to obtain a nonlinear fit. But we can
@@ -187,7 +187,7 @@ anova_lm(*[sm.OLS(y, X_).fit()
for X_ in Xs]) for X_ in Xs])
``` ```
Notice the `*` in the `anova_lm()` line above. This Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models. function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be When these models are provided as a list (as is done here), it must be
@@ -212,8 +212,8 @@ that `poly()` creates orthogonal polynomials.
summarize(M) summarize(M)
``` ```
Notice that the p-values are the same, and in fact the square of Notice that the p-values are the same, and in fact the square of
the t-statistics are equal to the F-statistics from the the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example: `anova_lm()` function; for example:
@@ -222,8 +222,8 @@ the t-statistics are equal to the F-statistics from the
(-11.983)**2 (-11.983)**2
``` ```
However, the ANOVA method works whether or not we used orthogonal However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three `anova_lm()` to compare the following three
@@ -238,8 +238,8 @@ XEs = [model.fit_transform(Wage)
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs]) anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
``` ```
As an alternative to using hypothesis tests and ANOVA, we could choose As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter~\ref{Ch5:resample}. the polynomial degree using cross-validation, as discussed in Chapter~\ref{Ch5:resample}.
@@ -259,8 +259,8 @@ B = glm.fit()
summarize(B) summarize(B)
``` ```
Once again, we make predictions using the `get_prediction()` method. Once again, we make predictions using the `get_prediction()` method.
```{python} ```{python}
@@ -269,7 +269,7 @@ preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05) bands = preds.conf_int(alpha=0.05)
``` ```
We now plot the estimated relationship. We now plot the estimated relationship.
```{python} ```{python}
@@ -311,8 +311,8 @@ cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit()) summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
``` ```
Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function. on quantiles we would use the `pd.cut()` function.
@@ -369,7 +369,7 @@ M = sm.OLS(y, Xbs).fit()
summarize(M) summarize(M)
``` ```
Notice that there are 6 spline coefficients rather than 7. This is because, by default, Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model. `bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept. So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
@@ -427,7 +427,7 @@ deciding bin membership.
In order to fit a natural spline, we use the `NaturalSpline()` In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results. degrees of freedom (excluding the intercept) and plot the results.
@@ -445,7 +445,7 @@ plot_wage_fit(age_df,
'Natural spline, df=5'); 'Natural spline, df=5');
``` ```
## Smoothing Splines and GAMs ## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the and a single feature. To fit GAMs in `Python` we will use the
@@ -464,7 +464,7 @@ gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y) gam.fit(X_age, y)
``` ```
The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the
size of that dimension based on the remaining entries of the shape tuple. size of that dimension based on the remaining entries of the shape tuple.
@@ -487,7 +487,7 @@ ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='$\lambda$'); ax.legend(title='$\lambda$');
``` ```
The `pygam` package can perform a search for an optimal smoothing parameter. The `pygam` package can perform a search for an optimal smoothing parameter.
```{python} ```{python}
@@ -500,7 +500,7 @@ ax.legend()
fig fig
``` ```
Alternatively, we can fix the degrees of freedom of the smoothing Alternatively, we can fix the degrees of freedom of the smoothing
spline using a function included in the `ISLP.pygam` package. Below we spline using a function included in the `ISLP.pygam` package. Below we
find a value of $\lambda$ that gives us roughly four degrees of find a value of $\lambda$ that gives us roughly four degrees of
@@ -515,8 +515,8 @@ age_term.lam = lam_4
degrees_of_freedom(X_age, age_term) degrees_of_freedom(X_age, age_term)
``` ```
Lets vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom Lets vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
as the desired degrees of freedom plus one to account for the fact that these smoothing as the desired degrees of freedom plus one to account for the fact that these smoothing
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit. splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
@@ -628,7 +628,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20); ax.set_title('Partial dependence of year on wage', fontsize=20);
``` ```
We now fit the model (\ref{Ch7:nsmod}) using smoothing splines rather We now fit the model (\ref{Ch7:nsmod}) using smoothing splines rather
than natural splines. All of the than natural splines. All of the
terms in (\ref{Ch7:nsmod}) are fit simultaneously, taking each other terms in (\ref{Ch7:nsmod}) are fit simultaneously, taking each other
@@ -720,7 +720,7 @@ gam_linear = LinearGAM(age_term +
gam_linear.fit(Xgam, y) gam_linear.fit(Xgam, y)
``` ```
Notice our use of `age_term` in the expressions above. We do this because Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom. earlier we set the value for `lam` in this term to achieve four degrees of freedom.
@@ -767,7 +767,7 @@ We can make predictions from `gam` objects, just like from
Yhat = gam_full.predict(Xgam) Yhat = gam_full.predict(Xgam)
``` ```
In order to fit a logistic regression GAM, we use `LogisticGAM()` In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`. from `pygam`.
@@ -778,7 +778,7 @@ gam_logit = LogisticGAM(age_term +
gam_logit.fit(Xgam, high_earn) gam_logit.fit(Xgam, high_earn)
``` ```
```{python} ```{python}
fig, ax = subplots(figsize=(8, 8)) fig, ax = subplots(figsize=(8, 8))
@@ -830,8 +830,8 @@ gam_logit_ = LogisticGAM(age_term +
gam_logit_.fit(Xgam_, high_earn_) gam_logit_.fit(Xgam_, high_earn_)
``` ```
Lets look at the effect of `education`, `year` and `age` on high earner status now that weve Lets look at the effect of `education`, `year` and `age` on high earner status now that weve
removed those observations. removed those observations.
@@ -864,7 +864,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20); ax.set_title('Partial dependence of high earner status on age', fontsize=20);
``` ```
## Local Regression ## Local Regression
We illustrate the use of local regression using the `lowess()` We illustrate the use of local regression using the `lowess()`

View File

@@ -1,6 +1,3 @@
# Tree-Based Methods # Tree-Based Methods
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch08-baggboost-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch08-baggboost-lab.ipynb">
@@ -38,10 +35,10 @@ from sklearn.ensemble import \
from ISLP.bart import BART from ISLP.bart import BART
``` ```
## Fitting Classification Trees ## Fitting Classification Trees
We first use classification trees to analyze the `Carseats` data set. We first use classification trees to analyze the `Carseats` data set.
In these data, `Sales` is a continuous variable, and so we begin In these data, `Sales` is a continuous variable, and so we begin
@@ -57,7 +54,7 @@ High = np.where(Carseats.Sales > 8,
"No") "No")
``` ```
We now use `DecisionTreeClassifier()` to fit a classification tree in We now use `DecisionTreeClassifier()` to fit a classification tree in
order to predict `High` using all variables but `Sales`. order to predict `High` using all variables but `Sales`.
To do so, we must form a model matrix as we did when fitting regression To do so, we must form a model matrix as we did when fitting regression
@@ -85,8 +82,8 @@ clf = DTC(criterion='entropy',
clf.fit(X, High) clf.fit(X, High)
``` ```
In our discussion of qualitative features in Section~\ref{ch3:sec3}, In our discussion of qualitative features in Section~\ref{ch3:sec3},
we noted that for a linear regression model such a feature could be we noted that for a linear regression model such a feature could be
represented by including a matrix of dummy variables (one-hot-encoding) in the model represented by including a matrix of dummy variables (one-hot-encoding) in the model
@@ -102,8 +99,8 @@ advantage of this approach; instead it simply treats the one-hot-encoded levels
accuracy_score(High, clf.predict(X)) accuracy_score(High, clf.predict(X))
``` ```
With only the default arguments, the training error rate is With only the default arguments, the training error rate is
21%. 21%.
For classification trees, we can For classification trees, we can
@@ -121,7 +118,7 @@ resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
resid_dev resid_dev
``` ```
This is closely related to the *entropy*, defined in (\ref{Ch8:eq:cross-entropy}). This is closely related to the *entropy*, defined in (\ref{Ch8:eq:cross-entropy}).
A small deviance indicates a A small deviance indicates a
tree that provides a good fit to the (training) data. tree that provides a good fit to the (training) data.
@@ -153,7 +150,7 @@ print(export_text(clf,
show_weights=True)) show_weights=True))
``` ```
In order to properly evaluate the performance of a classification tree In order to properly evaluate the performance of a classification tree
on these data, we must estimate the test error rather than simply on these data, we must estimate the test error rather than simply
computing the training error. We split the observations into a computing the training error. We split the observations into a
@@ -256,8 +253,8 @@ confusion = confusion_table(best_.predict(X_test),
confusion confusion
``` ```
Now 72.0% of the test observations are correctly classified, which is slightly worse than the error for the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at a cost of a slightly worse error. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance. Now 72.0% of the test observations are correctly classified, which is slightly worse than the error for the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at a cost of a slightly worse error. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.
@@ -275,7 +272,7 @@ feature_names = list(D.columns)
X = np.asarray(D) X = np.asarray(D)
``` ```
First, we split the data into training and test sets, and fit the tree First, we split the data into training and test sets, and fit the tree
to the training data. Here we use 30% of the data for the test set. to the training data. Here we use 30% of the data for the test set.
@@ -290,7 +287,7 @@ to the training data. Here we use 30% of the data for the test set.
random_state=0) random_state=0)
``` ```
Having formed our training and test data sets, we fit the regression tree. Having formed our training and test data sets, we fit the regression tree.
```{python} ```{python}
@@ -302,7 +299,7 @@ plot_tree(reg,
ax=ax); ax=ax);
``` ```
The variable `lstat` measures the percentage of individuals with The variable `lstat` measures the percentage of individuals with
lower socioeconomic status. The tree indicates that lower lower socioeconomic status. The tree indicates that lower
values of `lstat` correspond to more expensive houses. values of `lstat` correspond to more expensive houses.
@@ -326,7 +323,7 @@ grid = skm.GridSearchCV(reg,
G = grid.fit(X_train, y_train) G = grid.fit(X_train, y_train)
``` ```
In keeping with the cross-validation results, we use the pruned tree In keeping with the cross-validation results, we use the pruned tree
to make predictions on the test set. to make predictions on the test set.
@@ -335,8 +332,8 @@ best_ = grid.best_estimator_
np.mean((y_test - best_.predict(X_test))**2) np.mean((y_test - best_.predict(X_test))**2)
``` ```
In other words, the test set MSE associated with the regression tree In other words, the test set MSE associated with the regression tree
is 28.07. The square root of is 28.07. The square root of
the MSE is therefore around the MSE is therefore around
@@ -359,7 +356,7 @@ plot_tree(G.best_estimator_,
## Bagging and Random Forests ## Bagging and Random Forests
Here we apply bagging and random forests to the `Boston` data, using Here we apply bagging and random forests to the `Boston` data, using
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
@@ -372,8 +369,8 @@ bag_boston = RF(max_features=X_train.shape[1], random_state=0)
bag_boston.fit(X_train, y_train) bag_boston.fit(X_train, y_train)
``` ```
The argument `max_features` indicates that all 12 predictors should The argument `max_features` indicates that all 12 predictors should
be considered for each split of the tree --- in other words, that be considered for each split of the tree --- in other words, that
bagging should be done. How well does this bagged model perform on bagging should be done. How well does this bagged model perform on
@@ -386,7 +383,7 @@ ax.scatter(y_hat_bag, y_test)
np.mean((y_test - y_hat_bag)**2) np.mean((y_test - y_hat_bag)**2)
``` ```
The test set MSE associated with the bagged regression tree is The test set MSE associated with the bagged regression tree is
14.63, about half that obtained using an optimally-pruned single 14.63, about half that obtained using an optimally-pruned single
tree. We could change the number of trees grown from the default of tree. We could change the number of trees grown from the default of
@@ -417,8 +414,8 @@ y_hat_RF = RF_boston.predict(X_test)
np.mean((y_test - y_hat_RF)**2) np.mean((y_test - y_hat_RF)**2)
``` ```
The test set MSE is 20.04; The test set MSE is 20.04;
this indicates that random forests did somewhat worse than bagging this indicates that random forests did somewhat worse than bagging
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
@@ -442,7 +439,7 @@ house size (`rm`) are by far the two most important variables.
## Boosting ## Boosting
Here we use `GradientBoostingRegressor()` from `sklearn.ensemble` Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
to fit boosted regression trees to the `Boston` data to fit boosted regression trees to the `Boston` data
@@ -461,7 +458,7 @@ boost_boston = GBR(n_estimators=5000,
boost_boston.fit(X_train, y_train) boost_boston.fit(X_train, y_train)
``` ```
We can see how the training error decreases with the `train_score_` attribute. We can see how the training error decreases with the `train_score_` attribute.
To get an idea of how the test error decreases we can use the To get an idea of how the test error decreases we can use the
`staged_predict()` method to get the predicted values along the path. `staged_predict()` method to get the predicted values along the path.
@@ -484,7 +481,7 @@ ax.plot(plot_idx,
ax.legend(); ax.legend();
``` ```
We now use the boosted model to predict `medv` on the test set: We now use the boosted model to predict `medv` on the test set:
```{python} ```{python}
@@ -492,7 +489,7 @@ y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2) np.mean((y_test - y_hat_boost)**2)
``` ```
The test MSE obtained is 14.48, The test MSE obtained is 14.48,
similar to the test MSE for bagging. If we want to, we can similar to the test MSE for bagging. If we want to, we can
perform boosting with a different value of the shrinkage parameter perform boosting with a different value of the shrinkage parameter
@@ -510,8 +507,8 @@ y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2) np.mean((y_test - y_hat_boost)**2)
``` ```
In this case, using $\lambda=0.2$ leads to a almost the same test MSE In this case, using $\lambda=0.2$ leads to a almost the same test MSE
as when using $\lambda=0.001$. as when using $\lambda=0.001$.
@@ -519,7 +516,7 @@ as when using $\lambda=0.001$.
## Bayesian Additive Regression Trees ## Bayesian Additive Regression Trees
In this section we demonstrate a `Python` implementation of BART found in the In this section we demonstrate a `Python` implementation of BART found in the
`ISLP.bart` package. We fit a model `ISLP.bart` package. We fit a model
@@ -532,8 +529,8 @@ bart_boston = BART(random_state=0, burnin=5, ndraw=15)
bart_boston.fit(X_train, y_train) bart_boston.fit(X_train, y_train)
``` ```
On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest. On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest.
```{python} ```{python}
@@ -541,8 +538,8 @@ yhat_test = bart_boston.predict(X_test.astype(np.float32))
np.mean((y_test - yhat_test)**2) np.mean((y_test - yhat_test)**2)
``` ```
We can check how many times each variable appeared in the collection of trees. We can check how many times each variable appeared in the collection of trees.
This gives a summary similar to the variable importance plot for boosting and random forests. This gives a summary similar to the variable importance plot for boosting and random forests.

View File

@@ -1,6 +1,3 @@
# Support Vector Machines # Support Vector Machines
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch09-svm-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch09-svm-lab.ipynb">
@@ -31,7 +28,7 @@ from ISLP.svm import plot as plot_svm
from sklearn.metrics import RocCurveDisplay from sklearn.metrics import RocCurveDisplay
``` ```
We will use the function `RocCurveDisplay.from_estimator()` to We will use the function `RocCurveDisplay.from_estimator()` to
produce several ROC plots, using a shorthand `roc_curve`. produce several ROC plots, using a shorthand `roc_curve`.
@@ -76,8 +73,8 @@ svm_linear = SVC(C=10, kernel='linear')
svm_linear.fit(X, y) svm_linear.fit(X, y)
``` ```
The support vector classifier with two features can The support vector classifier with two features can
be visualized by plotting values of its *decision function*. be visualized by plotting values of its *decision function*.
We have included a function for this in the `ISLP` package (inspired by a similar We have included a function for this in the `ISLP` package (inspired by a similar
@@ -91,7 +88,7 @@ plot_svm(X,
ax=ax) ax=ax)
``` ```
The decision The decision
boundary between the two classes is linear (because we used the boundary between the two classes is linear (because we used the
argument `kernel='linear'`). The support vectors are marked with `+` argument `kernel='linear'`). The support vectors are marked with `+`
@@ -118,8 +115,8 @@ coefficients of the linear decision boundary as follows:
svm_linear.coef_ svm_linear.coef_
``` ```
Since the support vector machine is an estimator in `sklearn`, we Since the support vector machine is an estimator in `sklearn`, we
can use the usual machinery to tune it. can use the usual machinery to tune it.
@@ -136,8 +133,8 @@ grid.fit(X, y)
grid.best_params_ grid.best_params_
``` ```
We can easily access the cross-validation errors for each of these models We can easily access the cross-validation errors for each of these models
in `grid.cv_results_`. This prints out a lot of detail, so we in `grid.cv_results_`. This prints out a lot of detail, so we
extract the accuracy results only. extract the accuracy results only.
@@ -158,7 +155,7 @@ y_test = np.array([-1]*10+[1]*10)
X_test[y_test==1] += 1 X_test[y_test==1] += 1
``` ```
Now we predict the class labels of these test observations. Here we Now we predict the class labels of these test observations. Here we
use the best model selected by cross-validation in order to make the use the best model selected by cross-validation in order to make the
predictions. predictions.
@@ -169,7 +166,7 @@ y_test_hat = best_.predict(X_test)
confusion_table(y_test_hat, y_test) confusion_table(y_test_hat, y_test)
``` ```
Thus, with this value of `C`, Thus, with this value of `C`,
70% of the test 70% of the test
observations are correctly classified. What if we had instead used observations are correctly classified. What if we had instead used
@@ -182,7 +179,7 @@ y_test_hat = svm_.predict(X_test)
confusion_table(y_test_hat, y_test) confusion_table(y_test_hat, y_test)
``` ```
In this case 60% of test observations are correctly classified. In this case 60% of test observations are correctly classified.
We now consider a situation in which the two classes are linearly We now consider a situation in which the two classes are linearly
@@ -197,7 +194,7 @@ fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm); ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
``` ```
Now the observations are just barely linearly separable. Now the observations are just barely linearly separable.
```{python} ```{python}
@@ -206,7 +203,7 @@ y_hat = svm_.predict(X)
confusion_table(y_hat, y) confusion_table(y_hat, y)
``` ```
We fit the We fit the
support vector classifier and plot the resulting hyperplane, using a support vector classifier and plot the resulting hyperplane, using a
very large value of `C` so that no observations are very large value of `C` so that no observations are
@@ -232,7 +229,7 @@ y_hat = svm_.predict(X)
confusion_table(y_hat, y) confusion_table(y_hat, y)
``` ```
Using `C=0.1`, we again do not misclassify any training observations, but we Using `C=0.1`, we again do not misclassify any training observations, but we
also obtain a much wider margin and make use of twelve support also obtain a much wider margin and make use of twelve support
vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test
@@ -246,7 +243,7 @@ plot_svm(X,
ax=ax) ax=ax)
``` ```
## Support Vector Machine ## Support Vector Machine
In order to fit an SVM using a non-linear kernel, we once again use In order to fit an SVM using a non-linear kernel, we once again use
@@ -269,7 +266,7 @@ X[100:150] -= 2
y = np.array([1]*150+[2]*50) y = np.array([1]*150+[2]*50)
``` ```
Plotting the data makes it clear that the class boundary is indeed non-linear. Plotting the data makes it clear that the class boundary is indeed non-linear.
```{python} ```{python}
@@ -280,8 +277,8 @@ ax.scatter(X[:,0],
cmap=cm.coolwarm); cmap=cm.coolwarm);
``` ```
The data is randomly split into training and testing groups. We then The data is randomly split into training and testing groups. We then
fit the training data using the `SVC()` estimator with a fit the training data using the `SVC()` estimator with a
radial kernel and $\gamma=1$: radial kernel and $\gamma=1$:
@@ -298,7 +295,7 @@ svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
svm_rbf.fit(X_train, y_train) svm_rbf.fit(X_train, y_train)
``` ```
The plot shows that the resulting SVM has a decidedly non-linear The plot shows that the resulting SVM has a decidedly non-linear
boundary. boundary.
@@ -310,7 +307,7 @@ plot_svm(X_train,
ax=ax) ax=ax)
``` ```
We can see from the figure that there are a fair number of training We can see from the figure that there are a fair number of training
errors in this SVM fit. If we increase the value of `C`, we errors in this SVM fit. If we increase the value of `C`, we
can reduce the number of training errors. However, this comes at the can reduce the number of training errors. However, this comes at the
@@ -327,7 +324,7 @@ plot_svm(X_train,
ax=ax) ax=ax)
``` ```
We can perform cross-validation using `skm.GridSearchCV()` to select the We can perform cross-validation using `skm.GridSearchCV()` to select the
best choice of $\gamma$ and `C` for an SVM with a radial best choice of $\gamma$ and `C` for an SVM with a radial
kernel: kernel:
@@ -346,7 +343,7 @@ grid.fit(X_train, y_train)
grid.best_params_ grid.best_params_
``` ```
The best choice of parameters under five-fold CV is achieved at `C=1` The best choice of parameters under five-fold CV is achieved at `C=1`
and `gamma=0.5`, though several other values also achieve the same and `gamma=0.5`, though several other values also achieve the same
value. value.
@@ -363,7 +360,7 @@ y_hat_test = best_svm.predict(X_test)
confusion_table(y_hat_test, y_test) confusion_table(y_hat_test, y_test)
``` ```
With these parameters, 12% of test With these parameters, 12% of test
observations are misclassified by this SVM. observations are misclassified by this SVM.
@@ -423,7 +420,7 @@ roc_curve(svm_flex,
ax=ax); ax=ax);
``` ```
However, these ROC curves are all on the training data. We are really However, these ROC curves are all on the training data. We are really
more interested in the level of prediction accuracy on the test more interested in the level of prediction accuracy on the test
data. When we compute the ROC curves on the test data, the model with data. When we compute the ROC curves on the test data, the model with
@@ -439,7 +436,7 @@ roc_curve(svm_flex,
fig; fig;
``` ```
Lets look at our tuned SVM. Lets look at our tuned SVM.
```{python} ```{python}
@@ -458,7 +455,7 @@ for (X_, y_, c, name) in zip(
color=c) color=c)
``` ```
## SVM with Multiple Classes ## SVM with Multiple Classes
If the response is a factor containing more than two levels, then the If the response is a factor containing more than two levels, then the
@@ -477,7 +474,7 @@ fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm); ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
``` ```
We now fit an SVM to the data: We now fit an SVM to the data:
```{python} ```{python}
@@ -513,7 +510,7 @@ Khan = load_data('Khan')
Khan['xtrain'].shape, Khan['xtest'].shape Khan['xtrain'].shape, Khan['xtest'].shape
``` ```
This data set consists of expression measurements for 2,308 This data set consists of expression measurements for 2,308
genes. The training and test sets consist of 63 and 20 genes. The training and test sets consist of 63 and 20
observations, respectively. observations, respectively.
@@ -532,7 +529,7 @@ confusion_table(khan_linear.predict(Khan['xtrain']),
Khan['ytrain']) Khan['ytrain'])
``` ```
We see that there are *no* training We see that there are *no* training
errors. In fact, this is not surprising, because the large number of errors. In fact, this is not surprising, because the large number of
variables relative to the number of observations implies that it is variables relative to the number of observations implies that it is
@@ -545,7 +542,7 @@ confusion_table(khan_linear.predict(Khan['xtest']),
Khan['ytest']) Khan['ytest'])
``` ```
We see that using `C=10` yields two test set errors on these data. We see that using `C=10` yields two test set errors on these data.

View File

@@ -34,7 +34,7 @@ from sklearn.model_selection import \
(train_test_split, (train_test_split,
GridSearchCV) GridSearchCV)
``` ```
### Torch-Specific Imports ### Torch-Specific Imports
There are a number of imports for `torch`. (These are not There are a number of imports for `torch`. (These are not
@@ -49,7 +49,7 @@ from torch.optim import RMSprop
from torch.utils.data import TensorDataset from torch.utils.data import TensorDataset
``` ```
There are several other helper packages for `torch`. For instance, There are several other helper packages for `torch`. For instance,
the `torchmetrics` package has utilities to compute the `torchmetrics` package has utilities to compute
various metrics to evaluate performance when fitting various metrics to evaluate performance when fitting
@@ -69,7 +69,7 @@ from torchmetrics import (MeanAbsoluteError,
from torchinfo import summary from torchinfo import summary
``` ```
The package `pytorch_lightning` is a somewhat higher-level The package `pytorch_lightning` is a somewhat higher-level
interface to `torch` that simplifies the specification and interface to `torch` that simplifies the specification and
fitting of fitting of
@@ -81,7 +81,7 @@ from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger from pytorch_lightning.loggers import CSVLogger
``` ```
In order to reproduce results we use `seed_everything()`. We will also instruct `torch` to use deterministic algorithms In order to reproduce results we use `seed_everything()`. We will also instruct `torch` to use deterministic algorithms
where possible. where possible.
@@ -91,7 +91,7 @@ seed_everything(0, workers=True)
torch.use_deterministic_algorithms(True, warn_only=True) torch.use_deterministic_algorithms(True, warn_only=True)
``` ```
We will use several datasets shipped with `torchvision` for our We will use several datasets shipped with `torchvision` for our
examples: a pretrained network for image classification, examples: a pretrained network for image classification,
as well as some transforms used for preprocessing. as well as some transforms used for preprocessing.
@@ -124,7 +124,7 @@ from ISLP.torch import (SimpleDataModule,
rec_num_workers) rec_num_workers)
``` ```
In addition we have included some helper In addition we have included some helper
functions to load the functions to load the
`IMDb` database, as well as a lookup that maps integers `IMDb` database, as well as a lookup that maps integers
@@ -158,7 +158,7 @@ from glob import glob
import json import json
``` ```
## Single Layer Network on Hitters Data ## Single Layer Network on Hitters Data
We start by fitting the models in Section~\ref{Ch13:sec:when-use-deep} on the `Hitters` data. We start by fitting the models in Section~\ref{Ch13:sec:when-use-deep} on the `Hitters` data.
@@ -212,7 +212,7 @@ hit_lm = LinearRegression().fit(X_train, Y_train)
Yhat_test = hit_lm.predict(X_test) Yhat_test = hit_lm.predict(X_test)
np.abs(Yhat_test - Y_test).mean() np.abs(Yhat_test - Y_test).mean()
``` ```
Next we fit the lasso using `sklearn`. We are using Next we fit the lasso using `sklearn`. We are using
mean absolute error to select and evaluate a model, rather than mean squared error. mean absolute error to select and evaluate a model, rather than mean squared error.
The specialized solver we used in Section~\ref{Ch6-varselect-lab:lab-2-ridge-regression-and-the-lasso} uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly. The specialized solver we used in Section~\ref{Ch6-varselect-lab:lab-2-ridge-regression-and-the-lasso} uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
@@ -251,7 +251,7 @@ grid = GridSearchCV(standard_lasso,
scoring='neg_mean_absolute_error') scoring='neg_mean_absolute_error')
grid.fit(X_train, Y_train); grid.fit(X_train, Y_train);
``` ```
We extract the lasso model with best cross-validated mean absolute error, and evaluate its We extract the lasso model with best cross-validated mean absolute error, and evaluate its
performance on `X_test` and `Y_test`, which were not used in performance on `X_test` and `Y_test`, which were not used in
cross-validation. cross-validation.
@@ -290,7 +290,7 @@ class HittersModel(nn.Module):
return torch.flatten(self.sequential(x)) return torch.flatten(self.sequential(x))
``` ```
The `class` statement identifies the code chunk as a The `class` statement identifies the code chunk as a
declaration for a class `HittersModel` declaration for a class `HittersModel`
that inherits from the base class `nn.Module`. This base that inherits from the base class `nn.Module`. This base
@@ -324,7 +324,7 @@ to disable dropout for when we want to evaluate the model on test data.
hit_model = HittersModel(X.shape[1]) hit_model = HittersModel(X.shape[1])
``` ```
The object `self.sequential` is a composition of four maps. The The object `self.sequential` is a composition of four maps. The
first maps the 19 features of `Hitters` to 50 dimensions, introducing $50\times 19+50$ parameters first maps the 19 features of `Hitters` to 50 dimensions, introducing $50\times 19+50$ parameters
for the weights and *intercept* of the map (often called the *bias*). This layer for the weights and *intercept* of the map (often called the *bias*). This layer
@@ -334,7 +334,7 @@ trainable parameters is therefore $50\times 19+50+50+1=1051$.
The package `torchinfo` provides a `summary()` function that neatly summarizes The package `torchinfo` provides a `summary()` function that neatly summarizes
this information. We specify the size of the input and see the size this information. We specify the size of the input and see the size
of each tensor as it passes through layers of the network. of each tensor as it passes through layers of the network.
@@ -375,7 +375,7 @@ Y_test_t = torch.tensor(Y_test.astype(np.float32))
hit_test = TensorDataset(X_test_t, Y_test_t) hit_test = TensorDataset(X_test_t, Y_test_t)
``` ```
Finally, this dataset is passed to a `DataLoader()` which ultimately Finally, this dataset is passed to a `DataLoader()` which ultimately
passes data into our network. While this may seem passes data into our network. While this may seem
like a lot of overhead, this structure is helpful for more like a lot of overhead, this structure is helpful for more
@@ -396,7 +396,7 @@ workers might be reasonable (here the max was 16).
```{python} ```{python}
max_num_workers = rec_num_workers() max_num_workers = rec_num_workers()
``` ```
The general training setup in `pytorch_lightning` involves The general training setup in `pytorch_lightning` involves
training, validation and test data. These are each training, validation and test data. These are each
represented by different data loaders. During each epoch, represented by different data loaders. During each epoch,
@@ -421,7 +421,7 @@ hit_dm = SimpleDataModule(hit_train,
validation=hit_test) validation=hit_test)
``` ```
Next we must provide a `pytorch_lightning` module that controls Next we must provide a `pytorch_lightning` module that controls
the steps performed during the training process. We provide methods for our the steps performed during the training process. We provide methods for our
`SimpleModule()` that simply record the value `SimpleModule()` that simply record the value
@@ -435,7 +435,7 @@ hit_module = SimpleModule.regression(hit_model,
metrics={'mae':MeanAbsoluteError()}) metrics={'mae':MeanAbsoluteError()})
``` ```
By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in
(\ref{Ch13:eq:4}). (\ref{Ch13:eq:4}).
We have also asked for mean absolute error to be tracked as well We have also asked for mean absolute error to be tracked as well
@@ -449,7 +449,7 @@ we will not cover those here in detail.
```{python} ```{python}
hit_logger = CSVLogger('logs', name='hitters') hit_logger = CSVLogger('logs', name='hitters')
``` ```
Finally we are ready to train our model and log the results. We Finally we are ready to train our model and log the results. We
use the `Trainer()` object from `pytorch_lightning` use the `Trainer()` object from `pytorch_lightning`
to do this work. The argument `datamodule=hit_dm` tells the trainer to do this work. The argument `datamodule=hit_dm` tells the trainer
@@ -486,8 +486,8 @@ data using the `test()` method of our trainer.
hit_trainer.test(hit_module, datamodule=hit_dm) hit_trainer.test(hit_module, datamodule=hit_dm)
``` ```
The results of the fit have been logged into a CSV file. We can find the The results of the fit have been logged into a CSV file. We can find the
results specific to this run in the `experiment.metrics_file_path` results specific to this run in the `experiment.metrics_file_path`
attribute of our logger. Note that each time the model is fit, the logger will output attribute of our logger. Note that each time the model is fit, the logger will output
@@ -543,7 +543,7 @@ ax.set_ylim([0, 400])
ax.set_xticks(np.linspace(0, 50, 11).astype(int)); ax.set_xticks(np.linspace(0, 50, 11).astype(int));
``` ```
We can predict directly from the final model, and We can predict directly from the final model, and
evaluate its performance on the test data. evaluate its performance on the test data.
Before fitting, we call the `eval()` method Before fitting, we call the `eval()` method
@@ -561,7 +561,7 @@ preds = hit_module(X_test_t)
torch.abs(Y_test_t - preds).mean() torch.abs(Y_test_t - preds).mean()
``` ```
### Cleanup ### Cleanup
In setting up our data module, we had initiated In setting up our data module, we had initiated
@@ -582,7 +582,7 @@ del(Hitters,
hit_trainer, hit_module) hit_trainer, hit_module)
``` ```
## Multilayer Network on the MNIST Digit Data ## Multilayer Network on the MNIST Digit Data
The `torchvision` package comes with a number of example datasets, The `torchvision` package comes with a number of example datasets,
@@ -601,7 +601,7 @@ data will be downloaded the first time this function is executed, and stored in
mnist_train mnist_train
``` ```
There are 60,000 images in the training data and 10,000 in the test There are 60,000 images in the training data and 10,000 in the test
data. The images are $28\times 28$, and stored as a matrix of pixels. We data. The images are $28\times 28$, and stored as a matrix of pixels. We
need to transform each one into a vector. need to transform each one into a vector.
@@ -627,7 +627,7 @@ mnist_dm = SimpleDataModule(mnist_train,
batch_size=256) batch_size=256)
``` ```
Lets take a look at the data that will get fed into our network. We loop through the first few Lets take a look at the data that will get fed into our network. We loop through the first few
chunks of the test dataset, breaking after 2 batches: chunks of the test dataset, breaking after 2 batches:
@@ -639,8 +639,8 @@ for idx, (X_ ,Y_) in enumerate(mnist_dm.train_dataloader()):
break break
``` ```
We see that the $X$ for each batch consists of 256 images of size `1x28x28`. We see that the $X$ for each batch consists of 256 images of size `1x28x28`.
Here the `1` indicates a single channel (greyscale). For RGB images such as `CIFAR100` below, Here the `1` indicates a single channel (greyscale). For RGB images such as `CIFAR100` below,
we will see that the `1` in the size will be replaced by `3` for the three RGB channels. we will see that the `1` in the size will be replaced by `3` for the three RGB channels.
@@ -667,7 +667,7 @@ class MNISTModel(nn.Module):
def forward(self, x): def forward(self, x):
return self._forward(x) return self._forward(x)
``` ```
We see that in the first layer, each `1x28x28` image is flattened, then mapped to We see that in the first layer, each `1x28x28` image is flattened, then mapped to
256 dimensions where we apply a ReLU activation with 40% dropout. 256 dimensions where we apply a ReLU activation with 40% dropout.
A second layer maps the first layers output down to A second layer maps the first layers output down to
@@ -679,14 +679,14 @@ the 128 dimensions are mapped down to 10, the number of classes in the
mnist_model = MNISTModel() mnist_model = MNISTModel()
``` ```
We can check that the model produces output of expected size based We can check that the model produces output of expected size based
on our existing batch `X_` above. on our existing batch `X_` above.
```{python} ```{python}
mnist_model(X_).size() mnist_model(X_).size()
``` ```
Lets take a look at the summary of the model. Instead of an `input_size` we can pass Lets take a look at the summary of the model. Instead of an `input_size` we can pass
a tensor of correct shape. In this case, we pass through the final a tensor of correct shape. In this case, we pass through the final
batched `X_` from above. batched `X_` from above.
@@ -711,7 +711,7 @@ mnist_module = SimpleModule.classification(mnist_model,
mnist_logger = CSVLogger('logs', name='MNIST') mnist_logger = CSVLogger('logs', name='MNIST')
``` ```
Now we are ready to go. The final step is to supply training data, and fit the model. We disable the progress bar below to avoid lengthy output in the browser when running. Now we are ready to go. The final step is to supply training data, and fit the model. We disable the progress bar below to avoid lengthy output in the browser when running.
```{python} ```{python}
@@ -735,7 +735,7 @@ alternative to actually supplying validation data, like we did for the `Hitters`
SGD uses batches SGD uses batches
of 256 observations in computing the gradient, and doing the of 256 observations in computing the gradient, and doing the
arithmetic, we see that an epoch corresponds to 188 gradient steps. arithmetic, we see that an epoch corresponds to 188 gradient steps.
`SimpleModule.classification()` includes `SimpleModule.classification()` includes
an accuracy metric by default. Other an accuracy metric by default. Other
@@ -762,7 +762,7 @@ Once again we evaluate the accuracy using the `test()` method of our trainer. Th
mnist_trainer.test(mnist_module, mnist_trainer.test(mnist_module,
datamodule=mnist_dm) datamodule=mnist_dm)
``` ```
Table~\ref{Ch13:tab:mnist} also reports the error rates resulting from LDA (Chapter~\ref{Ch4:classification}) and multiclass logistic Table~\ref{Ch13:tab:mnist} also reports the error rates resulting from LDA (Chapter~\ref{Ch4:classification}) and multiclass logistic
regression. For LDA we refer the reader to Section~\ref{Ch4-classification-lab:linear-discriminant-analysis}. regression. For LDA we refer the reader to Section~\ref{Ch4-classification-lab:linear-discriminant-analysis}.
Although we could use the `sklearn` function `LogisticRegression()` to fit Although we could use the `sklearn` function `LogisticRegression()` to fit
@@ -840,7 +840,7 @@ cifar_train = TensorDataset(cifar_train_X,
cifar_test = TensorDataset(cifar_test_X, cifar_test = TensorDataset(cifar_test_X,
torch.tensor(cifar_test.targets)) torch.tensor(cifar_test.targets))
``` ```
The `CIFAR100` dataset consists of 50,000 training images, each represented by a three-dimensional tensor: The `CIFAR100` dataset consists of 50,000 training images, each represented by a three-dimensional tensor:
each three-color image is represented as a set of three channels, each of which consists of each three-color image is represented as a set of three channels, each of which consists of
$32\times 32$ eight-bit pixels. We standardize as we did for the $32\times 32$ eight-bit pixels. We standardize as we did for the
@@ -866,8 +866,8 @@ for idx, (X_ ,Y_) in enumerate(cifar_dm.train_dataloader()):
break break
``` ```
Before we start, we look at some of the training images; similar code produced Before we start, we look at some of the training images; similar code produced
Figure~\ref{Ch13:fig:cifar100} on page \pageref{Ch13:fig:cifar100}. The example below also illustrates Figure~\ref{Ch13:fig:cifar100} on page \pageref{Ch13:fig:cifar100}. The example below also illustrates
that `TensorDataset` objects can be indexed with integers --- we are choosing that `TensorDataset` objects can be indexed with integers --- we are choosing
@@ -918,7 +918,7 @@ class BuildingBlock(nn.Module):
return self.pool(self.activation(self.conv(x))) return self.pool(self.activation(self.conv(x)))
``` ```
Notice that we used the `padding = "same"` argument to Notice that we used the `padding = "same"` argument to
`nn.Conv2d()`, which ensures that the output channels have the `nn.Conv2d()`, which ensures that the output channels have the
same dimension as the input channels. There are 32 channels in the first same dimension as the input channels. There are 32 channels in the first
@@ -955,7 +955,7 @@ class CIFARModel(nn.Module):
return self.output(val) return self.output(val)
``` ```
We build the model and look at the summary. (We had created examples of `X_` earlier.) We build the model and look at the summary. (We had created examples of `X_` earlier.)
```{python} ```{python}
@@ -1014,7 +1014,7 @@ cifar_trainer.fit(cifar_module,
datamodule=cifar_dm) datamodule=cifar_dm)
``` ```
This model can take 10 minutes or more to run and achieves about 42% accuracy on the test This model can take 10 minutes or more to run and achieves about 42% accuracy on the test
data. Although this is not terrible for 100-class data (a random data. Although this is not terrible for 100-class data (a random
classifier gets 1% accuracy), searching the web we see results around classifier gets 1% accuracy), searching the web we see results around
@@ -1043,7 +1043,7 @@ cifar_trainer.test(cifar_module,
datamodule=cifar_dm) datamodule=cifar_dm)
``` ```
### Hardware Acceleration ### Hardware Acceleration
As deep learning has become ubiquitous in machine learning, hardware As deep learning has become ubiquitous in machine learning, hardware
@@ -1105,8 +1105,8 @@ imgs = torch.stack([torch.div(crop(resize(read_image(f))), 255)
imgs = normalize(imgs) imgs = normalize(imgs)
imgs.size() imgs.size()
``` ```
We now set up the trained network with the weights we read in code block~6. The model has 50 layers, with a fair bit of complexity. We now set up the trained network with the weights we read in code block~6. The model has 50 layers, with a fair bit of complexity.
```{python} ```{python}
@@ -1133,7 +1133,7 @@ We now feed our six images through the fitted network.
img_preds = resnet_model(imgs) img_preds = resnet_model(imgs)
``` ```
Lets look at the predicted probabilities for each of the top 3 choices. First we compute Lets look at the predicted probabilities for each of the top 3 choices. First we compute
the probabilities by applying the softmax to the logits in `img_preds`. Note that the probabilities by applying the softmax to the logits in `img_preds`. Note that
we have had to call the `detach()` method on the tensor `img_preds` in order to convert we have had to call the `detach()` method on the tensor `img_preds` in order to convert
@@ -1144,7 +1144,7 @@ img_probs = np.exp(np.asarray(img_preds.detach()))
img_probs /= img_probs.sum(1)[:,None] img_probs /= img_probs.sum(1)[:,None]
``` ```
In order to see the class labels, we must download the index file associated with `imagenet`. {This is avalable from the book website and [s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json](https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json).} In order to see the class labels, we must download the index file associated with `imagenet`. {This is avalable from the book website and [s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json](https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json).}
```{python} ```{python}
@@ -1156,7 +1156,7 @@ class_labels = class_labels.set_index('idx')
class_labels = class_labels.sort_index() class_labels = class_labels.sort_index()
``` ```
Well now construct a data frame for each image file Well now construct a data frame for each image file
with the labels with the three highest probabilities as with the labels with the three highest probabilities as
estimated by the model above. estimated by the model above.
@@ -1170,8 +1170,8 @@ for i, imgfile in enumerate(imgfiles):
print(img_df.reset_index().drop(columns=['idx'])) print(img_df.reset_index().drop(columns=['idx']))
``` ```
We see that the model We see that the model
is quite confident about `Flamingo.jpg`, but a little less so for the is quite confident about `Flamingo.jpg`, but a little less so for the
other images. other images.
@@ -1187,7 +1187,7 @@ del(cifar_test,
cifar_optimizer, cifar_optimizer,
cifar_trainer) cifar_trainer)
``` ```
## IMDB Document Classification ## IMDB Document Classification
We now implement models for sentiment classification (Section~\ref{Ch13:sec:docum-class}) on the `IMDB` We now implement models for sentiment classification (Section~\ref{Ch13:sec:docum-class}) on the `IMDB`
@@ -1285,7 +1285,7 @@ summary(imdb_model,
'num_params']) 'num_params'])
``` ```
Well again use Well again use
a smaller learning rate for these data, a smaller learning rate for these data,
hence we pass an `optimizer` to the hence we pass an `optimizer` to the
@@ -1304,7 +1304,7 @@ imdb_module = SimpleModule.binary_classification(
optimizer=imdb_optimizer) optimizer=imdb_optimizer)
``` ```
Having loaded the datasets into a data module Having loaded the datasets into a data module
and created a `SimpleModule`, the remaining steps and created a `SimpleModule`, the remaining steps
are familiar. are familiar.
@@ -1319,14 +1319,14 @@ imdb_trainer = Trainer(deterministic=True,
imdb_trainer.fit(imdb_module, imdb_trainer.fit(imdb_module,
datamodule=imdb_dm) datamodule=imdb_dm)
``` ```
Evaluating the test error yields roughly 86% accuracy. Evaluating the test error yields roughly 86% accuracy.
```{python} ```{python}
test_results = imdb_trainer.test(imdb_module, datamodule=imdb_dm) test_results = imdb_trainer.test(imdb_module, datamodule=imdb_dm)
test_results test_results
``` ```
### Comparison to Lasso ### Comparison to Lasso
We now fit a lasso logistic regression model We now fit a lasso logistic regression model
@@ -1379,7 +1379,7 @@ for l in lam_val:
intercepts.append(logit.intercept_) intercepts.append(logit.intercept_)
``` ```
The coefficient and intercepts have an extraneous dimension which can be removed The coefficient and intercepts have an extraneous dimension which can be removed
by the `np.squeeze()` function. by the `np.squeeze()` function.
@@ -1392,7 +1392,7 @@ Well now make a plot to compare our neural network results with the
lasso. lasso.
```{python} ```{python}
%%capture # %%capture
fig, axes = subplots(1, 2, figsize=(16, 8), sharey=True) fig, axes = subplots(1, 2, figsize=(16, 8), sharey=True)
for ((X_, Y_), for ((X_, Y_),
data_, data_,
@@ -1448,7 +1448,7 @@ del(imdb_model,
imdb_train, imdb_train,
imdb_test) imdb_test)
``` ```
## Recurrent Neural Networks ## Recurrent Neural Networks
In this lab we fit the models illustrated in In this lab we fit the models illustrated in
@@ -1478,7 +1478,7 @@ imdb_seq_dm = SimpleDataModule(imdb_seq_train,
) )
``` ```
The first layer of the RNN is an embedding layer of size 32, which will be The first layer of the RNN is an embedding layer of size 32, which will be
learned during training. This layer one-hot encodes each document learned during training. This layer one-hot encodes each document
as a matrix of dimension $500 \times 10,003$, and then maps these as a matrix of dimension $500 \times 10,003$, and then maps these
@@ -1490,7 +1490,7 @@ matrix of size $10,003\times 32$; each of the 500 integers in the
document are then mapped to the appropriate 32 real numbers by document are then mapped to the appropriate 32 real numbers by
indexing the appropriate rows of this matrix. indexing the appropriate rows of this matrix.
The second layer is an LSTM with 32 units, and the output The second layer is an LSTM with 32 units, and the output
layer is a single logit for the binary classification task. layer is a single logit for the binary classification task.
In the last line of the `forward()` method below, In the last line of the `forward()` method below,
@@ -1521,7 +1521,7 @@ summary(lstm_model,
'num_params']) 'num_params'])
``` ```
The 10,003 is suppressed in the summary, but we see it in the The 10,003 is suppressed in the summary, but we see it in the
parameter count, since $10,003\times 32=320,096$. parameter count, since $10,003\times 32=320,096$.
@@ -1547,7 +1547,7 @@ track the test performance as the network is fit, and see that it attains 85% ac
```{python} ```{python}
lstm_trainer.test(lstm_module, datamodule=imdb_seq_dm) lstm_trainer.test(lstm_module, datamodule=imdb_seq_dm)
``` ```
We once again show the learning progress, followed by cleanup. We once again show the learning progress, followed by cleanup.
```{python} ```{python}
@@ -1573,7 +1573,7 @@ del(lstm_model,
imdb_seq_test) imdb_seq_test)
``` ```
### Time Series Prediction ### Time Series Prediction
We now show how to fit the models in Section~\ref{Ch13:sec:time-seri-pred} We now show how to fit the models in Section~\ref{Ch13:sec:time-seri-pred}
@@ -1590,7 +1590,7 @@ X = pd.DataFrame(StandardScaler(
index=NYSE.index) index=NYSE.index)
``` ```
Next we set up the lagged versions of the data, dropping Next we set up the lagged versions of the data, dropping
any rows with missing values using the `dropna()` method. any rows with missing values using the `dropna()` method.
@@ -1604,7 +1604,7 @@ X.insert(len(X.columns), 'train', NYSE['train'])
X = X.dropna() X = X.dropna()
``` ```
Finally, we extract the response, training indicator, and drop the current days `DJ_return` and Finally, we extract the response, training indicator, and drop the current days `DJ_return` and
`log_volatility` to predict only from previous days data. `log_volatility` to predict only from previous days data.
@@ -1614,8 +1614,8 @@ X = X.drop(columns=['train'] + cols)
X.columns X.columns
``` ```
We first fit a simple linear model and compute the $R^2$ on the test data using We first fit a simple linear model and compute the $R^2$ on the test data using
the `score()` method. the `score()` method.
@@ -1624,7 +1624,7 @@ M = LinearRegression()
M.fit(X[train], Y[train]) M.fit(X[train], Y[train])
M.score(X[~train], Y[~train]) M.score(X[~train], Y[~train])
``` ```
We refit this model, including the factor variable `day_of_week`. We refit this model, including the factor variable `day_of_week`.
For a categorical series in `pandas`, we can form the indicators For a categorical series in `pandas`, we can form the indicators
using the `get_dummies()` method. using the `get_dummies()` method.
@@ -1697,7 +1697,7 @@ class NYSEModel(nn.Module):
return torch.flatten(val) return torch.flatten(val)
nyse_model = NYSEModel() nyse_model = NYSEModel()
``` ```
We fit the model in a similar fashion to previous networks. We We fit the model in a similar fashion to previous networks. We
supply the `fit` function with test data as validation data, so that when supply the `fit` function with test data as validation data, so that when
we monitor its progress and plot the history function we can see the we monitor its progress and plot the history function we can see the
@@ -1716,7 +1716,7 @@ for mask in [train, ~train]:
nyse_train, nyse_test = datasets nyse_train, nyse_test = datasets
``` ```
Following our usual pattern, we inspect the summary. Following our usual pattern, we inspect the summary.
```{python} ```{python}
@@ -1747,7 +1747,7 @@ for idx, (x, y) in enumerate(nyse_dm.train_dataloader()):
break break
``` ```
We follow our previous example for setting up a trainer for a We follow our previous example for setting up a trainer for a
regression problem, requesting the $R^2$ metric regression problem, requesting the $R^2$ metric
to be be computed at each epoch. to be be computed at each epoch.
@@ -1760,7 +1760,7 @@ nyse_module = SimpleModule.regression(nyse_model,
metrics={'r2':R2Score()}) metrics={'r2':R2Score()})
``` ```
Fitting the model should by now be familiar. Fitting the model should by now be familiar.
The results on the test data are very similar to the linear AR model. The results on the test data are very similar to the linear AR model.
@@ -1775,7 +1775,7 @@ nyse_trainer.test(nyse_module,
datamodule=nyse_dm) datamodule=nyse_dm)
``` ```
We could also fit a model without the `nn.RNN()` layer by just We could also fit a model without the `nn.RNN()` layer by just
using a `nn.Flatten()` layer instead. This would be a nonlinear AR model. If in addition we excluded the using a `nn.Flatten()` layer instead. This would be a nonlinear AR model. If in addition we excluded the
hidden layer, this would be equivalent to our earlier linear AR model. hidden layer, this would be equivalent to our earlier linear AR model.
@@ -1795,7 +1795,7 @@ for mask in [train, ~train]:
datasets.append(TensorDataset(X_day_t, Y_t)) datasets.append(TensorDataset(X_day_t, Y_t))
day_train, day_test = datasets day_train, day_test = datasets
``` ```
Creating a data module follows a familiar pattern. Creating a data module follows a familiar pattern.
```{python} ```{python}
@@ -1806,7 +1806,7 @@ day_dm = SimpleDataModule(day_train,
batch_size=64) batch_size=64)
``` ```
We build a `NonLinearARModel()` that takes as input the 20 features and a hidden layer with 32 units. The remaining steps are familiar. We build a `NonLinearARModel()` that takes as input the 20 features and a hidden layer with 32 units. The remaining steps are familiar.
```{python} ```{python}
@@ -1832,7 +1832,7 @@ nl_module = SimpleModule.regression(nl_model,
metrics={'r2':R2Score()}) metrics={'r2':R2Score()})
``` ```
We continue with the usual training steps, fit the model, We continue with the usual training steps, fit the model,
and evaluate the test error. We see the test $R^2$ is a slight improvement over the linear AR model that also includes `day_of_week`. and evaluate the test error. We see the test $R^2$ is a slight improvement over the linear AR model that also includes `day_of_week`.

View File

@@ -1,6 +1,3 @@
# Survival Analysis # Survival Analysis
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch11-surv-lab.ipynb"> <a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch11-surv-lab.ipynb">
@@ -29,7 +26,7 @@ from ISLP.models import ModelSpec as MS
from ISLP import load_data from ISLP import load_data
``` ```
We also collect the new imports We also collect the new imports
needed for this lab. needed for this lab.
@@ -53,7 +50,7 @@ BrainCancer = load_data('BrainCancer')
BrainCancer.columns BrainCancer.columns
``` ```
The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables. The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data. We first briefly examine the data.
@@ -61,20 +58,20 @@ We first briefly examine the data.
BrainCancer['sex'].value_counts() BrainCancer['sex'].value_counts()
``` ```
```{python} ```{python}
BrainCancer['diagnosis'].value_counts() BrainCancer['diagnosis'].value_counts()
``` ```
```{python} ```{python}
BrainCancer['status'].value_counts() BrainCancer['status'].value_counts()
``` ```
Before beginning an analysis, it is important to know how the Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software `status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an uses the convention that a `status` of 1 indicates an
@@ -101,7 +98,7 @@ km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax) km_brain.plot(label='Kaplan Meier estimate', ax=ax)
``` ```
Next we create Kaplan-Meier survival curves that are stratified by Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure~\ref{fig:survbrain2}. `sex`, in order to reproduce Figure~\ref{fig:survbrain2}.
We do this using the `groupby()` method of a dataframe. We do this using the `groupby()` method of a dataframe.
@@ -130,7 +127,7 @@ for sex, df in BrainCancer.groupby('sex'):
km_sex.plot(label='Sex=%s' % sex, ax=ax) km_sex.plot(label='Sex=%s' % sex, ax=ax)
``` ```
As discussed in Section~\ref{sec:logrank}, we can perform a As discussed in Section~\ref{sec:logrank}, we can perform a
log-rank test to compare the survival of males to females. We use log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module. the `logrank_test()` function from the `lifelines.statistics` module.
@@ -144,8 +141,8 @@ logrank_test(by_sex['Male']['time'],
by_sex['Female']['status']) by_sex['Female']['status'])
``` ```
The resulting $p$-value is $0.23$, indicating no evidence of a The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes. difference in survival between the two sexes.
@@ -164,7 +161,7 @@ cox_fit = coxph().fit(model_df,
cox_fit.summary[['coef', 'se(coef)', 'p']] cox_fit.summary[['coef', 'se(coef)', 'p']]
``` ```
The first argument to `fit` should be a data frame containing The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case), at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case). as well as an optional censoring variable (the argument `status` in this case).
@@ -178,7 +175,7 @@ with no features as follows:
cox_fit.log_likelihood_ratio_test() cox_fit.log_likelihood_ratio_test()
``` ```
Regardless of which test we use, we see that there is no clear Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is we learned in this chapter, the score test from the Cox model is
@@ -198,7 +195,7 @@ fit_all = coxph().fit(all_df,
fit_all.summary[['coef', 'se(coef)', 'p']] fit_all.summary[['coef', 'se(coef)', 'p']]
``` ```
The `diagnosis` variable has been coded so that the baseline The `diagnosis` variable has been coded so that the baseline
corresponds to HG glioma. The results indicate that the risk associated with HG glioma corresponds to HG glioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
@@ -225,7 +222,7 @@ def representative(series):
modal_data = cleaned.apply(representative, axis=0) modal_data = cleaned.apply(representative, axis=0)
``` ```
We make four We make four
copies of the column means and assign the `diagnosis` column to be the four different copies of the column means and assign the `diagnosis` column to be the four different
diagnoses. diagnoses.
@@ -237,7 +234,7 @@ modal_df['diagnosis'] = levels
modal_df modal_df
``` ```
We then construct the model matrix based on the model specification `all_MS` used to fit We then construct the model matrix based on the model specification `all_MS` used to fit
the model, and name the rows according to the levels of `diagnosis`. the model, and name the rows according to the levels of `diagnosis`.
@@ -264,7 +261,7 @@ fig, ax = subplots(figsize=(8, 8))
predicted_survival.plot(ax=ax); predicted_survival.plot(ax=ax);
``` ```
## Publication Data ## Publication Data
The `Publication` data presented in Section~\ref{sec:pub} can be The `Publication` data presented in Section~\ref{sec:pub} can be
@@ -283,7 +280,7 @@ for result, df in Publication.groupby('posres'):
km_result.plot(label='Result=%d' % result, ax=ax) km_result.plot(label='Result=%d' % result, ax=ax)
``` ```
As discussed previously, the $p$-values from fitting Coxs As discussed previously, the $p$-values from fitting Coxs
proportional hazards model to the `posres` variable are quite proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication large, providing no evidence of a difference in time-to-publication
@@ -300,8 +297,8 @@ posres_fit = coxph().fit(posres_df,
posres_fit.summary[['coef', 'se(coef)', 'p']] posres_fit.summary[['coef', 'se(coef)', 'p']]
``` ```
However, the results change dramatically when we include other However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism predictors in the model. Here we exclude the funding mechanism
variable. variable.
@@ -314,7 +311,7 @@ coxph().fit(model.fit_transform(Publication),
'status').summary[['coef', 'se(coef)', 'p']] 'status').summary[['coef', 'se(coef)', 'p']]
``` ```
We see that there are a number of statistically significant variables, We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results. of the study, and whether the study had positive or negative results.
@@ -364,7 +361,7 @@ model = MS(['Operators',
intercept=False) intercept=False)
X = model.fit_transform(D) X = model.fit_transform(D)
``` ```
It is worthwhile to take a peek at the model matrix `X`, so It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default, that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
@@ -374,7 +371,7 @@ of the variable is dropped.
X[:5] X[:5]
``` ```
Next, we specify the coefficients and the hazard function. Next, we specify the coefficients and the hazard function.
```{python} ```{python}
@@ -423,7 +420,7 @@ W = np.array([sim_time(l, cum_hazard, rng)
D['Wait time'] = np.clip(W, 0, 1000) D['Wait time'] = np.clip(W, 0, 1000)
``` ```
We now simulate our censoring variable, for which we assume We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the 90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`). customer hung up (`Failed==0`).
@@ -435,13 +432,13 @@ D['Failed'] = rng.choice([1, 0],
D[:5] D[:5]
``` ```
```{python} ```{python}
D['Failed'].mean() D['Failed'].mean()
``` ```
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`. We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.
```{python} ```{python}
@@ -454,7 +451,7 @@ for center, df in D.groupby('Center'):
ax.set_title("Probability of Still Being on Hold") ax.set_title("Probability of Still Being on Hold")
``` ```
Next, we stratify by `Time`. Next, we stratify by `Time`.
```{python} ```{python}
@@ -467,7 +464,7 @@ for time, df in D.groupby('Time'):
ax.set_title("Probability of Still Being on Hold") ax.set_title("Probability of Still Being on Hold")
``` ```
It seems that calls at Call Center B take longer to be answered than It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a longest in the morning and shortest in the evening hours. We can use a
@@ -480,8 +477,8 @@ multivariate_logrank_test(D['Wait time'],
D['Failed']) D['Failed'])
``` ```
Next, we consider the effect of `Time`. Next, we consider the effect of `Time`.
```{python} ```{python}
@@ -490,8 +487,8 @@ multivariate_logrank_test(D['Wait time'],
D['Failed']) D['Failed'])
``` ```
As in the case of a categorical variable with 2 levels, these As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we from the Cox proportional hazards model. First, we
@@ -506,8 +503,8 @@ F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test() F.log_likelihood_ratio_test()
``` ```
Next, we look at the results for `Time`. Next, we look at the results for `Time`.
```{python} ```{python}
@@ -519,8 +516,8 @@ F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test() F.log_likelihood_ratio_test()
``` ```
We find that differences between centers are highly significant, as We find that differences between centers are highly significant, as
are differences between times of day. are differences between times of day.
@@ -536,8 +533,8 @@ fit_queuing = coxph().fit(
fit_queuing.summary[['coef', 'se(coef)', 'p']] fit_queuing.summary[['coef', 'se(coef)', 'p']]
``` ```
The $p$-values for Center B and evening time The $p$-values for Center B and evening time
are very small. It is also clear that the are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be hazard --- that is, the instantaneous risk that a call will be

View File

@@ -36,7 +36,7 @@ from scipy.cluster.hierarchy import \
from ISLP.cluster import compute_linkage from ISLP.cluster import compute_linkage
``` ```
## Principal Components Analysis ## Principal Components Analysis
In this lab, we perform PCA on `USArrests`, a data set in the In this lab, we perform PCA on `USArrests`, a data set in the
`R` computing environment. `R` computing environment.
@@ -50,22 +50,22 @@ USArrests = get_rdataset('USArrests').data
USArrests USArrests
``` ```
The columns of the data set contain the four variables. The columns of the data set contain the four variables.
```{python} ```{python}
USArrests.columns USArrests.columns
``` ```
We first briefly examine the data. We notice that the variables have vastly different means. We first briefly examine the data. We notice that the variables have vastly different means.
```{python} ```{python}
USArrests.mean() USArrests.mean()
``` ```
Dataframes have several useful methods for computing Dataframes have several useful methods for computing
column-wise summaries. We can also examine the column-wise summaries. We can also examine the
variance of the four variables using the `var()` method. variance of the four variables using the `var()` method.
@@ -74,7 +74,7 @@ variance of the four variables using the `var()` method.
USArrests.var() USArrests.var()
``` ```
Not surprisingly, the variables also have vastly different variances. Not surprisingly, the variables also have vastly different variances.
The `UrbanPop` variable measures the percentage of the population The `UrbanPop` variable measures the percentage of the population
in each state living in an urban area, which is not a comparable in each state living in an urban area, which is not a comparable
@@ -124,7 +124,7 @@ of the variables. In this case, since we centered and scaled the data with
pcaUS.mean_ pcaUS.mean_
``` ```
The scores can be computed using the `transform()` method The scores can be computed using the `transform()` method
of `pcaUS` after it has been fit. of `pcaUS` after it has been fit.
@@ -142,7 +142,7 @@ principal component loading vector.
pcaUS.components_ pcaUS.components_
``` ```
The `biplot` is a common visualization method used with The `biplot` is a common visualization method used with
PCA. It is not built in as a standard PCA. It is not built in as a standard
part of `sklearn`, though there are python part of `sklearn`, though there are python
@@ -183,14 +183,14 @@ for k in range(pcaUS.components_.shape[1]):
USArrests.columns[k]) USArrests.columns[k])
``` ```
The standard deviations of the principal component scores are as follows: The standard deviations of the principal component scores are as follows:
```{python} ```{python}
scores.std(0, ddof=1) scores.std(0, ddof=1)
``` ```
The variance of each score can be extracted directly from the `pcaUS` object via The variance of each score can be extracted directly from the `pcaUS` object via
the `explained_variance_` attribute. the `explained_variance_` attribute.
@@ -212,7 +212,7 @@ We can plot the PVE explained by each component, as well as the cumulative PVE.
plot the proportion of variance explained. plot the proportion of variance explained.
```{python} ```{python}
%%capture # %%capture
fig, axes = plt.subplots(1, 2, figsize=(15, 6)) fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcaUS.n_components_)+1 ticks = np.arange(pcaUS.n_components_)+1
ax = axes[0] ax = axes[0]
@@ -312,7 +312,7 @@ Xna = X.copy()
Xna[r_idx, c_idx] = np.nan Xna[r_idx, c_idx] = np.nan
``` ```
Here the array `r_idx` Here the array `r_idx`
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states. 20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
@@ -340,7 +340,7 @@ Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx] Xhat[r_idx, c_idx] = Xbar[c_idx]
``` ```
Before we begin Step 2, we set ourselves up to measure the progress of our Before we begin Step 2, we set ourselves up to measure the progress of our
iterations: iterations:
@@ -379,7 +379,7 @@ while rel_err > thresh:
.format(count, mss, rel_err)) .format(count, mss, rel_err))
``` ```
We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381. We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
Finally, we compute the correlation between the 20 imputed values Finally, we compute the correlation between the 20 imputed values
@@ -389,8 +389,8 @@ and the actual values:
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1] np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
``` ```
In this lab, we implemented Algorithm~\ref{Ch10:alg:hardimpute} ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python`{} implementations. In this lab, we implemented Algorithm~\ref{Ch10:alg:hardimpute} ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python`{} implementations.
@@ -436,7 +436,7 @@ ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2"); ax.set_title("K-Means Clustering Results with K=2");
``` ```
Here the observations can be easily plotted because they are Here the observations can be easily plotted because they are
two-dimensional. If there were more than two variables then we could two-dimensional. If there were more than two variables then we could
instead perform PCA and plot the first two principal component score instead perform PCA and plot the first two principal component score
@@ -511,7 +511,7 @@ hc_comp = HClust(distance_threshold=0,
hc_comp.fit(X) hc_comp.fit(X)
``` ```
This computes the entire dendrogram. This computes the entire dendrogram.
We could just as easily perform hierarchical clustering with average or single linkage instead: We could just as easily perform hierarchical clustering with average or single linkage instead:
@@ -526,7 +526,7 @@ hc_sing = HClust(distance_threshold=0,
hc_sing.fit(X); hc_sing.fit(X);
``` ```
To use a precomputed distance matrix, we provide an additional To use a precomputed distance matrix, we provide an additional
argument `metric="precomputed"`. In the code below, the first four lines computes the $50\times 50$ pairwise-distance matrix. argument `metric="precomputed"`. In the code below, the first four lines computes the $50\times 50$ pairwise-distance matrix.
@@ -542,7 +542,7 @@ hc_sing_pre = HClust(distance_threshold=0,
hc_sing_pre.fit(D) hc_sing_pre.fit(D)
``` ```
We use We use
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However, `dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
`dendrogram()` expects a so-called *linkage-matrix representation* `dendrogram()` expects a so-called *linkage-matrix representation*
@@ -565,7 +565,7 @@ dendrogram(linkage_comp,
**cargs); **cargs);
``` ```
We may want to color branches of the tree above We may want to color branches of the tree above
and below a cut-threshold differently. This can be achieved and below a cut-threshold differently. This can be achieved
by changing the `color_threshold`. Lets cut the tree at a height of 4, by changing the `color_threshold`. Lets cut the tree at a height of 4,
@@ -579,7 +579,7 @@ dendrogram(linkage_comp,
above_threshold_color='black'); above_threshold_color='black');
``` ```
To determine the cluster labels for each observation associated with a To determine the cluster labels for each observation associated with a
given cut of the dendrogram, we can use the `cut_tree()` given cut of the dendrogram, we can use the `cut_tree()`
function from `scipy.cluster.hierarchy`: function from `scipy.cluster.hierarchy`:
@@ -599,7 +599,7 @@ or `height` to `cut_tree()`.
cut_tree(linkage_comp, height=5) cut_tree(linkage_comp, height=5)
``` ```
To scale the variables before performing hierarchical clustering of To scale the variables before performing hierarchical clustering of
the observations, we use `StandardScaler()` as in our PCA example: the observations, we use `StandardScaler()` as in our PCA example:
@@ -643,7 +643,7 @@ dendrogram(linkage_cor, ax=ax, **cargs)
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity"); ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
``` ```
## NCI60 Data Example ## NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic Unsupervised techniques are often used in the analysis of genomic
@@ -658,7 +658,7 @@ nci_labs = NCI60['labels']
nci_data = NCI60['data'] nci_data = NCI60['data']
``` ```
Each cell line is labeled with a cancer type. We do not make use of Each cell line is labeled with a cancer type. We do not make use of
the cancer types in performing PCA and clustering, as these are the cancer types in performing PCA and clustering, as these are
unsupervised techniques. But after performing PCA and clustering, we unsupervised techniques. But after performing PCA and clustering, we
@@ -671,8 +671,8 @@ The data has 64 rows and 6830 columns.
nci_data.shape nci_data.shape
``` ```
We begin by examining the cancer types for the cell lines. We begin by examining the cancer types for the cell lines.
@@ -680,7 +680,7 @@ We begin by examining the cancer types for the cell lines.
nci_labs.value_counts() nci_labs.value_counts()
``` ```
### PCA on the NCI60 Data ### PCA on the NCI60 Data
@@ -695,7 +695,7 @@ nci_pca = PCA()
nci_scores = nci_pca.fit_transform(nci_scaled) nci_scores = nci_pca.fit_transform(nci_scaled)
``` ```
We now plot the first few principal component score vectors, in order We now plot the first few principal component score vectors, in order
to visualize the data. The observations (cell lines) corresponding to to visualize the data. The observations (cell lines) corresponding to
a given cancer type will be plotted in the same color, so that we can a given cancer type will be plotted in the same color, so that we can
@@ -731,7 +731,7 @@ to have pretty similar gene expression levels.
We can also plot the percent variance We can also plot the percent variance
explained by the principal components as well as the cumulative percent variance explained. explained by the principal components as well as the cumulative percent variance explained.
This is similar to the plots we made earlier for the `USArrests` data. This is similar to the plots we made earlier for the `USArrests` data.
@@ -790,7 +790,7 @@ def plot_nci(linkage, ax, cut=-np.inf):
return hc return hc
``` ```
Lets plot our results. Lets plot our results.
```{python} ```{python}
@@ -822,7 +822,7 @@ pd.crosstab(nci_labs['label'],
pd.Series(comp_cut.reshape(-1), name='Complete')) pd.Series(comp_cut.reshape(-1), name='Complete'))
``` ```
There are some clear patterns. All the leukemia cell lines fall in There are some clear patterns. All the leukemia cell lines fall in
one cluster, while the breast cancer cell lines are spread out over one cluster, while the breast cancer cell lines are spread out over
@@ -836,7 +836,7 @@ plot_nci('Complete', ax, cut=140)
ax.axhline(140, c='r', linewidth=4); ax.axhline(140, c='r', linewidth=4);
``` ```
The `axhline()` function draws a horizontal line line on top of any The `axhline()` function draws a horizontal line line on top of any
existing set of axes. The argument `140` plots a horizontal existing set of axes. The argument `140` plots a horizontal
line at height 140 on the dendrogram; this is a height that line at height 140 on the dendrogram; this is a height that
@@ -858,7 +858,7 @@ pd.crosstab(pd.Series(comp_cut, name='HClust'),
pd.Series(nci_kmeans.labels_, name='K-means')) pd.Series(nci_kmeans.labels_, name='K-means'))
``` ```
We see that the four clusters obtained using hierarchical clustering We see that the four clusters obtained using hierarchical clustering
and $K$-means clustering are somewhat different. First we note and $K$-means clustering are somewhat different. First we note
that the labels in the two clusterings are arbitrary. That is, swapping that the labels in the two clusterings are arbitrary. That is, swapping

View File

@@ -8,7 +8,7 @@
We include our usual imports seen in earlier labs. We include our usual imports seen in earlier labs.
@@ -20,7 +20,7 @@ import statsmodels.api as sm
from ISLP import load_data from ISLP import load_data
``` ```
We also collect the new imports We also collect the new imports
needed for this lab. needed for this lab.
@@ -52,7 +52,7 @@ true_mean = np.array([0.5]*50 + [0]*50)
X += true_mean[None,:] X += true_mean[None,:]
``` ```
To begin, we use `ttest_1samp()` from the To begin, we use `ttest_1samp()` from the
`scipy.stats` module to test $H_{0}: \mu_1=0$, the null `scipy.stats` module to test $H_{0}: \mu_1=0$, the null
hypothesis that the first variable has mean zero. hypothesis that the first variable has mean zero.
@@ -62,7 +62,7 @@ result = ttest_1samp(X[:,0], 0)
result.pvalue result.pvalue
``` ```
The $p$-value comes out to 0.931, which is not low enough to The $p$-value comes out to 0.931, which is not low enough to
reject the null hypothesis at level $\alpha=0.05$. In this case, reject the null hypothesis at level $\alpha=0.05$. In this case,
$\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made $\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made
@@ -159,7 +159,7 @@ ax.legend()
ax.axhline(0.05, c='k', ls='--'); ax.axhline(0.05, c='k', ls='--');
``` ```
As discussed previously, even for moderate values of $m$ such as $50$, As discussed previously, even for moderate values of $m$ such as $50$,
the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value, the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value,
such as $0.001$. Of course, the problem with setting $\alpha$ to such such as $0.001$. Of course, the problem with setting $\alpha$ to such
@@ -181,7 +181,7 @@ for i in range(5):
fund_mini_pvals fund_mini_pvals
``` ```
The $p$-values are low for Managers One and Three, and high for the The $p$-values are low for Managers One and Three, and high for the
other three managers. However, we cannot simply reject $H_{0,1}$ and other three managers. However, we cannot simply reject $H_{0,1}$ and
$H_{0,3}$, since this would fail to account for the multiple testing $H_{0,3}$, since this would fail to account for the multiple testing
@@ -211,8 +211,8 @@ reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
reject reject
``` ```
The $p$-values `bonf` are simply the `fund_mini_pvalues` multiplied by 5 and truncated to be less than The $p$-values `bonf` are simply the `fund_mini_pvalues` multiplied by 5 and truncated to be less than
or equal to 1. or equal to 1.
@@ -220,7 +220,7 @@ or equal to 1.
bonf, np.minimum(fund_mini_pvals * 5, 1) bonf, np.minimum(fund_mini_pvals * 5, 1)
``` ```
Therefore, using Bonferronis method, we are able to reject the null hypothesis only for Manager Therefore, using Bonferronis method, we are able to reject the null hypothesis only for Manager
One while controlling FWER at $0.05$. One while controlling FWER at $0.05$.
@@ -232,8 +232,8 @@ hypotheses for Managers One and Three at a FWER of $0.05$.
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2] mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
``` ```
As discussed previously, Manager One seems to perform particularly As discussed previously, Manager One seems to perform particularly
well, whereas Manager Two has poor performance. well, whereas Manager Two has poor performance.
@@ -242,8 +242,8 @@ well, whereas Manager Two has poor performance.
fund_mini.mean() fund_mini.mean()
``` ```
Is there evidence of a meaningful difference in performance between Is there evidence of a meaningful difference in performance between
these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function
from `scipy.stats`: from `scipy.stats`:
@@ -253,7 +253,7 @@ ttest_rel(fund_mini['Manager1'],
fund_mini['Manager2']).pvalue fund_mini['Manager2']).pvalue
``` ```
The test results in a $p$-value of 0.038, The test results in a $p$-value of 0.038,
suggesting a statistically significant difference. suggesting a statistically significant difference.
@@ -278,8 +278,8 @@ tukey = pairwise_tukeyhsd(returns, managers)
print(tukey.summary()) print(tukey.summary())
``` ```
The `pairwise_tukeyhsd()` function provides confidence intervals The `pairwise_tukeyhsd()` function provides confidence intervals
for the difference between each pair of managers (`lower` and for the difference between each pair of managers (`lower` and
`upper`), as well as a $p$-value. All of these quantities have `upper`), as well as a $p$-value. All of these quantities have
@@ -309,7 +309,7 @@ for i, manager in enumerate(Fund.columns):
fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue
``` ```
There are far too many managers to consider trying to control the FWER. There are far too many managers to consider trying to control the FWER.
Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives. Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure. The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.
@@ -319,7 +319,7 @@ fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
fund_qvalues[:10] fund_qvalues[:10]
``` ```
The *q-values* output by the The *q-values* output by the
Benjamini--Hochberg procedure can be interpreted as the smallest FDR Benjamini--Hochberg procedure can be interpreted as the smallest FDR
threshold at which we would reject a particular null hypothesis. For threshold at which we would reject a particular null hypothesis. For
@@ -346,8 +346,8 @@ null hypotheses!
(fund_pvalues <= 0.1 / 2000).sum() (fund_pvalues <= 0.1 / 2000).sum()
``` ```
Figure~\ref{Ch12:fig:BonferroniBenjamini} displays the ordered Figure~\ref{Ch12:fig:BonferroniBenjamini} displays the ordered
$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for $p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
the `Fund` dataset, as well as the threshold for rejection by the the `Fund` dataset, as well as the threshold for rejection by the
@@ -376,7 +376,7 @@ else:
sorted_set_ = [] sorted_set_ = []
``` ```
We now reproduce the middle panel of Figure~\ref{Ch12:fig:BonferroniBenjamini}. We now reproduce the middle panel of Figure~\ref{Ch12:fig:BonferroniBenjamini}.
```{python} ```{python}
@@ -391,7 +391,7 @@ ax.scatter(sorted_set_+1, sorted_[sorted_set_], c='r', s=20)
ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3); ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
``` ```
## A Re-Sampling Approach ## A Re-Sampling Approach
Here, we implement the re-sampling approach to hypothesis testing Here, we implement the re-sampling approach to hypothesis testing
@@ -407,8 +407,8 @@ D['Y'] = pd.concat([Khan['ytrain'], Khan['ytest']])
D['Y'].value_counts() D['Y'].value_counts()
``` ```
There are four classes of cancer. For each gene, we compare the mean There are four classes of cancer. For each gene, we compare the mean
expression in the second class (rhabdomyosarcoma) to the mean expression in the second class (rhabdomyosarcoma) to the mean
expression in the fourth class (Burkitts lymphoma). Performing a expression in the fourth class (Burkitts lymphoma). Performing a
@@ -428,8 +428,8 @@ observedT, pvalue = ttest_ind(D2[gene_11],
observedT, pvalue observedT, pvalue
``` ```
However, this $p$-value relies on the assumption that under the null However, this $p$-value relies on the assumption that under the null
hypothesis of no difference between the two groups, the test statistic hypothesis of no difference between the two groups, the test statistic
follows a $t$-distribution with $29+25-2=52$ degrees of freedom. follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
@@ -457,8 +457,8 @@ for b in range(B):
(np.abs(Tnull) < np.abs(observedT)).mean() (np.abs(Tnull) < np.abs(observedT)).mean()
``` ```
This fraction, 0.0398, This fraction, 0.0398,
is our re-sampling-based $p$-value. is our re-sampling-based $p$-value.
It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution. It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
@@ -514,7 +514,7 @@ for j in range(m):
Tnull_vals[j,b] = ttest_.statistic Tnull_vals[j,b] = ttest_.statistic
``` ```
Next, we compute the number of rejected null hypotheses $R$, the Next, we compute the number of rejected null hypotheses $R$, the
estimated number of false positives $\widehat{V}$, and the estimated estimated number of false positives $\widehat{V}$, and the estimated
FDR, for a range of threshold values $c$ in FDR, for a range of threshold values $c$ in
@@ -532,7 +532,7 @@ for j in range(m):
FDRs[j] = V / R FDRs[j] = V / R
``` ```
Now, for any given FDR, we can find the genes that will be Now, for any given FDR, we can find the genes that will be
rejected. For example, with FDR controlled at 0.1, we reject 15 of the rejected. For example, with FDR controlled at 0.1, we reject 15 of the
100 null hypotheses. On average, we would expect about one or two of 100 null hypotheses. On average, we would expect about one or two of
@@ -548,7 +548,7 @@ the genes whose estimated FDR is less than 0.1.
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()]) sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])
``` ```
At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
proportion of false discoveries. proportion of false discoveries.
@@ -556,7 +556,7 @@ proportion of false discoveries.
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()]) sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
``` ```
The next line generates Figure~\ref{fig:labfdr}, which is similar The next line generates Figure~\ref{fig:labfdr}, which is similar
to Figure~\ref{Ch12:fig-plugin-fdr}, to Figure~\ref{Ch12:fig-plugin-fdr},
except that it is based on only a subset of the genes. except that it is based on only a subset of the genes.