Remove rmd (#77)
* Ch2->Ch02
* removed Rmd; made a new Rmd target for make; described in README.md
@@ -8767,6 +8767,11 @@
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
@@ -1,610 +0,0 @@
# Linear Regression

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch03-linreg-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch03-linreg-lab.ipynb)

## Importing packages
We import our standard libraries at this top
level.

```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
```

### New imports
Throughout this lab we will introduce new functions and libraries. However,
we will import them here to emphasize these are the new
code objects in this lab. Keeping imports near the top
of a notebook makes the code more readable, since scanning the first few
lines tells us what libraries are used.

```{python}
import statsmodels.api as sm
```
We will provide relevant details about the
functions below as they are needed.

Besides importing whole modules, it is also possible
to import only a few items from a given module. This
will help keep the *namespace* clean.
We will use a few specific objects from the `statsmodels` package
which we import here.

```{python}
from statsmodels.stats.outliers_influence \
    import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
```

As one of the import statements above is quite a long line, we inserted a line break `\` to
ease readability.

We will also use some functions written for the labs in this book in the `ISLP`
package.

```{python}
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
```

### Inspecting Objects and Namespaces
The
function `dir()`
provides a list of
objects in a namespace.

```{python}
dir()
```
This shows you everything that `Python` can find at the top level.
There are certain objects like `__builtins__` that contain references to built-in
functions like `print()`.

Every `Python` object has its own notion of
namespace, also accessible with `dir()`. This will include
both the attributes of the object
as well as any methods associated with it. For instance, we see `'sum'` in the listing for an
array.

```{python}
A = np.array([3,5,11])
dir(A)
```

This indicates that the object `A.sum` exists. In this case it is a method
that can be used to compute the sum of the array `A` as can be seen by typing `A.sum?`.

```{python}
A.sum()
```

## Simple Linear Regression
In this section we will construct model
matrices (also called design matrices) using the `ModelSpec()` transform from `ISLP.models`.

We will use the `Boston` housing data set, which is contained in the `ISLP` package. The `Boston` dataset records `medv` (median house value) for $506$ neighborhoods
around Boston. We will build a regression model to predict `medv` using $12$
predictors such as `rm` (average number of rooms per house),
`age` (proportion of owner-occupied units built prior to 1940), and `lstat` (percent of
households with low socioeconomic status). We will use `statsmodels` for this
task, a `Python` package that implements several commonly used
regression methods.

We have included a simple loading function `load_data()` in the
`ISLP` package:

```{python}
Boston = load_data("Boston")
Boston.columns
```

Type `Boston?` to find out more about these data.

We start by using the `sm.OLS()` function to fit a
simple linear regression model. Our response will be
`medv` and `lstat` will be the single predictor.
For this model, we can create the model matrix by hand.

```{python}
X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
                  'lstat': Boston['lstat']})
X[:4]
```

We extract the response, and fit the model.

```{python}
y = Boston['medv']
model = sm.OLS(y, X)
results = model.fit()
```
Note that `sm.OLS()` does
not fit the model; it specifies the model, and then `model.fit()` does the actual fitting.

Our `ISLP` function `summarize()` produces a simple table of the parameter estimates,
their standard errors, t-statistics and p-values.
The function takes a single argument, such as the object `results`
returned here by the `fit`
method, and returns such a summary.

```{python}
summarize(results)
```

Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix `X`.
### Using Transformations: Fit and Transform
Our model above has a single predictor, and constructing `X` was straightforward.
In practice we often fit models with more than one predictor, typically selected from an array or data frame.
We may wish to introduce transformations to the variables before fitting the model, specify interactions between variables, and expand some particular variables into sets of variables (e.g. polynomials).
The `sklearn` package has a particular notion
for this type of task: a *transform*. A transform is an object
that is created with some parameters as arguments. The
object has two main methods: `fit()` and `transform()`.

We provide a general approach for specifying models and constructing
the model matrix through the transform `ModelSpec()` in the `ISLP` library.
`ModelSpec()`
(renamed `MS()` in the preamble) creates a
transform object, and then a pair of methods
`transform()` and `fit()` are used to construct a
corresponding model matrix.

We first describe this process for our simple regression model using a single predictor `lstat` in
the `Boston` data frame, but will use it repeatedly in more
complex tasks in this and other labs in this book.
In our case the transform is created by the expression
`design = MS(['lstat'])`.

The `fit()` method takes the original array and may do some
initial computations on it, as specified in the transform object.
For example, it may compute means and standard deviations for centering and scaling.
The `transform()`
method applies the fitted transformation to the array of data, and produces the model matrix.

```{python}
design = MS(['lstat'])
design = design.fit(Boston)
X = design.transform(Boston)
X[:4]
```
In this simple case, the `fit()` method does very little; it simply checks that the variable `'lstat'` specified in `design` exists in `Boston`. Then `transform()` constructs the model matrix with two columns: an `intercept` and the variable `lstat`.

These two operations can be combined with the
`fit_transform()` method.

```{python}
design = MS(['lstat'])
X = design.fit_transform(Boston)
X[:4]
```
Note that, as in the previous code chunk when the two steps were done separately, the `design` object is changed as a result of the `fit()` operation. The power of this pipeline will become clearer when we fit more complex models that involve interactions and transformations.
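
The `fit()`/`transform()` protocol can be sketched with a toy transform of our own --- a hypothetical `Center` class (not part of `ISLP` or `sklearn`) that learns column means in `fit()` and subtracts them in `transform()`:

```{python}
import numpy as np

class Center:
    "Toy transform: fit() learns column means, transform() subtracts them"
    def fit(self, X):
        self.mean_ = np.asarray(X).mean(axis=0)  # state learned from the data
        return self
    def transform(self, X):
        return np.asarray(X) - self.mean_        # apply the learned centering
    def fit_transform(self, X):
        return self.fit(X).transform(X)

X_toy = np.array([[1., 2.],
                  [3., 4.],
                  [5., 6.]])
Center().fit_transform(X_toy)
```
Like `ModelSpec()`, the object retains the state learned in `fit()` (here `mean_`), which is what allows a fitted transform to be applied later to new data.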

Let's return to our fitted regression model.
The object
`results` has several methods that can be used for inference.
We already presented a function `summarize()` for showing the essentials of the fit.
For a full and somewhat exhaustive summary of the fit, we can use the `summary()`
method.

```{python}
results.summary()
```

The fitted coefficients can also be retrieved as the
`params` attribute of `results`.

```{python}
results.params
```

The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
prediction intervals for the prediction of `medv` for given values of `lstat`.

We first create a new data frame, in this case containing only the variable `lstat`, with the values for this variable at which we wish to make predictions.
We then use the `transform()` method of `design` to create the corresponding model matrix.

```{python}
new_df = pd.DataFrame({'lstat':[5, 10, 15]})
newX = design.transform(new_df)
newX
```

Next we compute the predictions at `newX`, and view them by extracting the `predicted_mean` attribute.

```{python}
new_predictions = results.get_prediction(newX);
new_predictions.predicted_mean
```
We can produce confidence intervals for the predicted values.

```{python}
new_predictions.conf_int(alpha=0.05)
```
Prediction intervals are computed by setting `obs=True`:

```{python}
new_predictions.conf_int(obs=True, alpha=0.05)
```
For instance, the 95% confidence interval associated with an
`lstat` value of 10 is (24.47, 25.63), and the 95% prediction
interval is (12.82, 37.28). As expected, the confidence and
prediction intervals are centered around the same point (a predicted
value of 25.05 for `medv` when `lstat` equals
10), but the latter are substantially wider.

Next we will plot `medv` and `lstat`
using `DataFrame.plot.scatter()`,
and wish to
add the regression line to the resulting plot.

### Defining Functions
While there is a function
within the `ISLP` package that adds a line to an existing plot, we take this opportunity
to define our first function to do so.

```{python}
def abline(ax, b, m):
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim)
```
A few things are illustrated above. First we see the syntax for defining a function:
`def funcname(...)`. The function has arguments `ax, b, m`
where `ax` is an axis object for an existing plot, `b` is the intercept and
`m` is the slope of the desired line. Other plotting options can be passed on to
`ax.plot` by including additional optional arguments as follows:

```{python}
def abline(ax, b, m, *args, **kwargs):
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)
```
The addition of `*args` allows any number of
non-named arguments to `abline`, while `**kwargs` allows any
number of named arguments (such as `linewidth=3`) to `abline`.
In our function, we pass
these arguments verbatim to `ax.plot` above. Readers
interested in learning more about
functions are referred to the section on
defining functions in [docs.python.org/tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions).
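
To see the forwarding concretely, here is a tiny standalone example (the `show()` helper is hypothetical, not from any package): it simply returns the positional and keyword arguments it collects.

```{python}
# show() collects its arguments the same way abline() does before
# forwarding them to ax.plot
def show(*args, **kwargs):
    return args, kwargs

show(1, 'r--', linewidth=3)
```
Here `args` is the tuple `(1, 'r--')` and `kwargs` is the dictionary `{'linewidth': 3}`.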

Let’s use our new function to add this regression line to a plot of
`medv` vs. `lstat`.

```{python}
ax = Boston.plot.scatter('lstat', 'medv')
abline(ax,
       results.params['intercept'],
       results.params['lstat'],
       'r--',
       linewidth=3)
```
Thus, the final call to `ax.plot()` is `ax.plot(xlim, ylim, 'r--', linewidth=3)`.
We have used the argument `'r--'` to produce a red dashed line, and added
an argument to make it of width 3.
There is some evidence for non-linearity in the relationship between `lstat` and `medv`. We will explore this issue later in this lab.

As mentioned above, there is an existing function to add a line to a plot --- `ax.axline()` --- but knowing how to write such functions empowers us to create more expressive displays.

Next we examine some diagnostic plots, several of which were discussed
in Section 3.3.3.
We can find the fitted values and residuals
of the fit as attributes of the `results` object.
Various influence measures describing the regression model
are computed with the `get_influence()` method.
As we will not use the `fig` component returned
as the first value from `subplots()`, we simply
capture the second returned value in `ax` below.

```{python}
ax = subplots(figsize=(8,8))[1]
ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
```
We add a horizontal line at 0 for reference using the
`ax.axhline()` method, indicating
it should be black (`c='k'`) and have a dashed linestyle (`ls='--'`).

On the basis of the residual plot, there is some evidence of non-linearity.
Leverage statistics can be computed for any number of predictors using the
`hat_matrix_diag` attribute of the value returned by the
`get_influence()` method.

```{python}
infl = results.get_influence()
ax = subplots(figsize=(8,8))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag)
```
The `np.argmax()` function identifies the index of the largest element of an array, optionally computed over an axis of the array.
In this case, we maximized over the entire array
to determine which observation has the largest leverage statistic.
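
As a standalone illustration of the `axis` argument (separate from the lab's data):

```{python}
import numpy as np

B = np.array([[2, 9, 1],
              [7, 3, 8]])
print(np.argmax(B))          # index into the flattened array: the 9 at position 1
print(np.argmax(B, axis=0))  # row index of the maximum in each column
print(np.argmax(B, axis=1))  # column index of the maximum in each row
```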

## Multiple Linear Regression
In order to fit a multiple linear regression model using least squares, we again use
the `ModelSpec()` transform to construct the required
model matrix and response. The arguments
to `ModelSpec()` can be quite general, but in this case
a list of column names suffices. We consider a fit here with
the two variables `lstat` and `age`.

```{python}
X = MS(['lstat', 'age']).fit_transform(Boston)
model1 = sm.OLS(y, X)
results1 = model1.fit()
summarize(results1)
```
Notice how we have compacted the first line into a succinct expression describing the construction of `X`.

The `Boston` data set contains 12 variables, and so it would be cumbersome
to have to type all of these in order to perform a regression using all of the predictors.
Instead, we can use the following short-hand:

```{python}
terms = Boston.columns.drop('medv')
terms
```

We can now fit the model with all the variables in `terms` using
the same model matrix builder.

```{python}
X = MS(terms).fit_transform(Boston)
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
```

What if we would like to perform a regression using all of the variables but one? For
example, in the above regression output, `age` has a high $p$-value.
So we may wish to run a regression excluding this predictor.
The following syntax results in a regression using all predictors except `age`.

```{python}
minus_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())
```

## Multivariate Goodness of Fit
We can access the individual components of `results` by name
(`dir(results)` shows us what is available). Hence
`results.rsquared` gives us the $R^2$,
and
`np.sqrt(results.scale)` gives us the RSE.
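
Both quantities have simple closed forms: $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$ and $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-p)}$. A sketch on synthetic data (the `_sim` names are illustrative, separate from the lab's objects):

```{python}
import numpy as np

rng = np.random.default_rng(0)
x_sim = rng.normal(size=100)
y_sim = 3 * x_sim + rng.normal(size=100)

A_sim = np.column_stack([np.ones_like(x_sim), x_sim])  # intercept plus one predictor
beta, *_ = np.linalg.lstsq(A_sim, y_sim, rcond=None)
resid = y_sim - A_sim @ beta

rss = (resid**2).sum()
r2 = 1 - rss / ((y_sim - y_sim.mean())**2).sum()       # R^2
rse = np.sqrt(rss / (len(y_sim) - A_sim.shape[1]))     # RSE: RSS over n - p
r2, rse
```
In `statsmodels`, `results.scale` is exactly this `rse` squared, i.e. RSS divided by the residual degrees of freedom.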

Variance inflation factors (section 3.3.3) are sometimes useful
to assess the effect of collinearity in the model matrix of a regression model.
We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.

### List Comprehension
Often we encounter a sequence of objects which we would like to transform
for some other task. Below, we compute the VIF for each
feature in our `X` matrix and produce a data frame
whose index agrees with the columns of `X`.
The notion of list comprehension can often make such
a task easier.

List comprehensions are simple and powerful ways to form
lists of `Python` objects. The language also supports
dictionary and *generator* comprehension, though these are
beyond our scope here. Let's look at an example. We compute the VIF for each of the variables
in the model matrix `X`, using the function `variance_inflation_factor()`.

```{python}
vals = [VIF(X, i)
        for i in range(1, X.shape[1])]
vif = pd.DataFrame({'vif':vals},
                   index=X.columns[1:])
vif
```
The function `VIF()` takes two arguments: a dataframe or array,
and a variable column index. In the code above we call `VIF()` on the fly for all columns in `X`.
We have excluded column 0 above (the intercept), which is not of interest. In this case the VIFs are not that exciting.
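
Each VIF is $1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing column $j$ on the other columns of the model matrix. A sketch on synthetic collinear data (`vif_from_scratch` is an illustrative helper, not the `statsmodels` function):

```{python}
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
# the third column is built from the second, so the two predictors are collinear
Xc = np.column_stack([np.ones(200), Z[:, 0], Z[:, 0] + 0.5 * Z[:, 1]])

def vif_from_scratch(X, j):
    others = np.delete(X, j, axis=1)                 # regress column j on the rest
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - (resid**2).sum() / ((X[:, j] - X[:, j].mean())**2).sum()
    return 1 / (1 - r2)

[vif_from_scratch(Xc, j) for j in range(1, Xc.shape[1])]
```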

The object `vals` above could have been constructed with the following for loop:

```{python}
vals = []
for i in range(1, X.values.shape[1]):
    vals.append(VIF(X.values, i))
```
List comprehension allows us to perform such repetitive operations in a more straightforward way.
## Interaction Terms
It is easy to include interaction terms in a linear model using `ModelSpec()`.
Including a tuple `("lstat","age")` tells the model
matrix builder to include an interaction term between
`lstat` and `age`.

```{python}
X = MS(['lstat',
        'age',
        ('lstat', 'age')]).fit_transform(Boston)
model2 = sm.OLS(y, X)
summarize(model2.fit())
```
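
Concretely, an interaction column is just the elementwise product of the two predictors. A sketch with a toy data frame (independent of `Boston`; `ModelSpec()` builds this column for us):

```{python}
import pandas as pd

df = pd.DataFrame({'lstat': [1., 2., 3.], 'age': [10., 20., 30.]})
df['lstat:age'] = df['lstat'] * df['age']   # elementwise product of the two columns
df
```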

## Non-linear Transformations of the Predictors
The model matrix builder can include terms beyond
just column names and interactions. For instance,
the `poly()` function supplied in `ISLP` specifies that
columns representing polynomial functions
of its first argument are added to the model matrix.

```{python}
X = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
model3 = sm.OLS(y, X)
results3 = model3.fit()
summarize(results3)
```
The effectively zero *p*-value associated with the quadratic term
(i.e. the third row above) suggests that it leads to an improved model.

By default, `poly()` creates a basis matrix for inclusion in the
model matrix whose
columns are *orthogonal polynomials*, which are designed for stable
least squares computations. {Actually, `poly()` is a wrapper for the workhorse and standalone function `Poly()` that does the work in building the model matrix.}
Alternatively, had we included an argument
`raw=True` in the above call to `poly()`, the basis matrix would consist simply of
`lstat` and `lstat**2`. Since either of these bases
represents quadratic polynomials, the fitted values would not
change in this case, just the polynomial coefficients. Also by default, the columns
created by `poly()` do not include an intercept column as
that is automatically added by `MS()`.
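
The claim that the fitted values agree for any basis spanning the same quadratic space can be checked directly; a sketch on synthetic data, with a QR factorization standing in for orthogonal polynomials (`_sim` names are illustrative):

```{python}
import numpy as np

rng = np.random.default_rng(0)
x_sim = rng.normal(size=30)
y_sim = 1 + x_sim + x_sim**2 + rng.normal(size=30)

raw = np.column_stack([np.ones_like(x_sim), x_sim, x_sim**2])  # raw quadratic basis
Q, _ = np.linalg.qr(raw)            # orthogonal basis spanning the same column space

def fitted(A):
    beta, *_ = np.linalg.lstsq(A, y_sim, rcond=None)
    return A @ beta

np.allclose(fitted(raw), fitted(Q))
```
Only the coefficients differ between the two bases; the fitted values are identical.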

We use the `anova_lm()` function to further quantify the extent to which the quadratic fit is
superior to the linear fit.

```{python}
anova_lm(results1, results3)
```

Here `results1` represents the linear submodel containing
predictors `lstat` and `age`,
while `results3` corresponds to the larger model above with a quadratic
term in `lstat`.
The `anova_lm()` function performs a hypothesis test
comparing the two models. The null hypothesis is that the quadratic
term in the bigger model is not needed, and the alternative hypothesis is that the
bigger model is superior. Here the *F*-statistic is 177.28 and
the associated *p*-value is zero.
In this case the *F*-statistic is the square of the
*t*-statistic for the quadratic term in the linear model summary
for `results3` --- a consequence of the fact that these nested
models differ by one degree of freedom.
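
The *F*-statistic being computed has a simple closed form in terms of residual sums of squares, $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n-p)}$; a numpy sketch on synthetic data (`_sim` names are illustrative):

```{python}
import numpy as np

rng = np.random.default_rng(0)
x_sim = rng.normal(size=100)
y_sim = x_sim + 0.5 * x_sim**2 + rng.normal(size=100)

def rss(A):
    beta, *_ = np.linalg.lstsq(A, y_sim, rcond=None)
    return ((y_sim - A @ beta) ** 2).sum()

small = np.column_stack([np.ones_like(x_sim), x_sim])   # linear submodel
big = np.column_stack([small, x_sim**2])                # adds the quadratic term
q = big.shape[1] - small.shape[1]                       # models differ by one term
F = ((rss(small) - rss(big)) / q) / (rss(big) / (len(y_sim) - big.shape[1]))
F
```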

This provides very clear evidence that the quadratic polynomial in
`lstat` improves the linear model.
This is not surprising, since earlier we saw evidence for non-linearity in the relationship between `medv`
and `lstat`.

The function `anova_lm()` can take more than two nested models
as input, in which case it compares every successive pair of models.
That also explains why there are `NaN`s in the first row above, since
there is no previous model with which to compare the first.

```{python}
ax = subplots(figsize=(8,8))[1]
ax.scatter(results3.fittedvalues, results3.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
```
We see that when the quadratic term is included in the model,
there is little discernible pattern in the residuals.
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
to `poly()`.

## Qualitative Predictors
Here we use the `Carseats` data, which is included in the
`ISLP` package. We will attempt to predict `Sales`
(child car seat sales) in 400 locations based on a number of
predictors.

```{python}
Carseats = load_data('Carseats')
Carseats.columns
```
The `Carseats`
data includes qualitative predictors such as
`ShelveLoc`, an indicator of the quality of the shelving
location --- that is,
the space within a store in which the car seat is displayed. The predictor
`ShelveLoc` takes on three possible values, `Bad`, `Medium`, and `Good`.
Given a qualitative variable such as `ShelveLoc`, `ModelSpec()` generates dummy
variables automatically.
These variables are often referred to as a *one-hot encoding* of the categorical
feature. Their columns sum to one, so to avoid collinearity with an intercept, the first column is dropped. Below we see
the column `ShelveLoc[Bad]` has been dropped, since `Bad` is the first level of `ShelveLoc`.
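
Plain `pandas` can produce the same drop-first dummy coding; a sketch with a toy series (illustrative of the encoding, not of `ModelSpec()`'s actual implementation):

```{python}
import pandas as pd

s = pd.Series(['Bad', 'Medium', 'Good', 'Bad'])
# drop_first=True drops the first level (Bad), leaving Good and Medium columns
pd.get_dummies(s, prefix='ShelveLoc', drop_first=True)
```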
Below we fit a multiple regression model that includes some interaction terms.

```{python}
allvars = list(Carseats.columns.drop('Sales'))
y = Carseats['Sales']
final = allvars + [('Income', 'Advertising'),
                   ('Price', 'Age')]
X = MS(final).fit_transform(Carseats)
model = sm.OLS(y, X)
summarize(model.fit())
```
In the first line above, we made `allvars` a list, so that we
could add the interaction terms two lines down.
Our model-matrix builder has created a `ShelveLoc[Good]`
dummy variable that takes on a value of 1 if the
shelving location is good, and 0 otherwise. It has also created a `ShelveLoc[Medium]`
dummy variable that equals 1 if the shelving location is medium, and 0 otherwise.
A bad shelving location corresponds to a zero for each of the two dummy variables.
The fact that the coefficient for `ShelveLoc[Good]` in the regression output is
positive indicates that a good shelving location is associated with high sales (relative to a bad location).
And `ShelveLoc[Medium]` has a smaller positive coefficient,
indicating that a medium shelving location is associated with higher sales than a bad
shelving location, but lower sales than a good shelving location.

@@ -1,556 +0,0 @@

# Cross-Validation and the Bootstrap

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch05-resample-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch05-resample-lab.ipynb)

In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.

We again begin by placing most of our imports at this top level.

```{python}
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
```

There are several new imports needed for this lab.

```{python}
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm
```

## The Validation Set Approach
We explore the use of the validation set approach in order to estimate
the test error rates that result from fitting various linear models on
the `Auto` data set.

We use the function `train_test_split()` to split
the data into training and validation sets. As there are 392 observations,
we split into two equal sets of size 196 using the
argument `test_size=196`. It is generally a good idea to set a random seed
when performing operations like this that contain an
element of randomness, so that the results obtained can be reproduced
precisely at a later time. We set the random seed of the splitter
with the argument `random_state=0`.

```{python}
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=0)
```

Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.

```{python}
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()
```

We now use the `predict()` method of `results` evaluated on the model matrix for this model
created using the validation data set. We also calculate the validation MSE of our model.

```{python}
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)
```

Hence our estimate for the validation MSE of the linear regression
fit is $23.62$.

We can also estimate the validation error for
higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a model string as well
as training and test sets and returns the MSE on the test set.

```{python}
def evalMSE(terms,
            response,
            train,
            test):

    mm = MS(terms)
    X_train = mm.fit_transform(train)
    y_train = train[response]

    X_test = mm.transform(test)
    y_test = test[response]

    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)

    return np.mean((y_test - test_pred)**2)
```

Let’s use this function to estimate the validation MSE
using linear, quadratic and cubic fits. We use the `enumerate()` function
here, which gives both the values and indices of objects as one iterates
over a for loop.

```{python}
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

These error rates are $23.62, 18.76$, and $18.80$, respectively. If we
choose a different training/validation split instead, then we
can expect somewhat different errors on the validation set.

```{python}
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $20.76$, $16.95$, and $16.97$, respectively.

These results are consistent with our previous findings: a model that
predicts `mpg` using a quadratic function of `horsepower`
performs better than a model that involves only a linear function of
`horsepower`, and there is no evidence of an improvement in using a cubic function of `horsepower`.

## Cross-Validation
|
||||
In theory, the cross-validation estimate can be computed for any generalized
|
||||
linear model. {}
|
||||
In practice, however, the simplest way to cross-validate in
|
||||
Python is to use `sklearn`, which has a different interface or API
|
||||
than `statsmodels`, the code we have been using to fit GLMs.
|
||||
|
||||
This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don’t naturally speak to each other, this
|
||||
requires the use of a *wrapper*.
|
||||
In the `ISLP` package,
|
||||
we provide
|
||||
a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with
|
||||
models fit by `statsmodels`.

The class `sklearn_sm()` has as its first argument
a model from `statsmodels`. It can take two additional
optional arguments: `model_str`, which can be
used to specify a formula, and `model_args`, which should
be a dictionary of additional arguments used when fitting
the model. For example, to fit a logistic regression model
we have to specify a `family` argument. This
is passed as `model_args={'family': sm.families.Binomial()}`.

Here is our wrapper in action:

```{python}
hp_model = sklearn_sm(sm.OLS,
                      MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
                            X,
                            Y,
                            cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err
```

The arguments to `cross_validate()` are as follows: an
object with the appropriate `fit()`, `predict()`,
and `score()` methods, an
array of features `X`, and a response `Y`.
We also included an additional argument `cv` to `cross_validate()`; specifying an integer
$k$ results in $k$-fold cross-validation. We have provided a value
corresponding to the total number of observations, which results in
leave-one-out cross-validation (LOOCV). The `cross_validate()` function produces a dictionary with several components;
we simply want the cross-validated test score here (MSE), which is estimated to be $24.23$.

We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we again
use a for loop which iteratively fits polynomial
regressions of degree 1 to 5, computes the
associated cross-validation error, and stores it in the $i$th element
of the vector `cv_error`. The variable `d` in the for loop
corresponds to the degree of the polynomial. We begin by initializing the
vector. This command may take a couple of seconds to run.

```{python}
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```

As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.

Above we introduced the `outer()` method of the `np.power()`
function. The `outer()` method is applied to an operation
that has two arguments, such as `add()`, `min()`, or
`power()`.
It takes two arrays as
arguments, and then forms a larger
array where the operation is applied to each pair of elements of the
two arrays.

```{python}
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)
```

In the CV example above, we used $k=n$, but of course we can also use $k<n$. The code is very similar
to the above (and is significantly faster). Here we use `KFold()` to partition the data into $k=10$ random groups. We use `random_state` to set a random seed and initialize a vector `cv_error` in which we will store the CV errors corresponding to the
polynomial fits of degrees one to five.

```{python}
cv_error = np.zeros(5)
cv = KFold(n_splits=10,
           shuffle=True,
           random_state=0) # use same splits for each degree
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=cv)
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```

Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $k$-fold CV, due to the
availability of the formula (5.2) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
using a quadratic fit.
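The shortcut formula (5.2) mentioned above is easy to verify numerically. The sketch below is not part of the lab; it computes LOOCV for least squares both naively and via the leverage values, assuming the design matrix already contains an intercept column:

```python
import numpy as np

def loocv_shortcut(X, y):
    """LOOCV MSE for least squares via formula (5.2):
    mean(((y - yhat) / (1 - h))**2), with h the hat-matrix diagonal."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # leverages h_i = diag(X (X'X)^{-1} X'), computed row by row
    h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)
    return np.mean((resid / (1 - h))**2)

def loocv_explicit(X, y):
    """Naive LOOCV: refit n times, leaving one observation out each time."""
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = (y[i] - X[i] @ beta)**2
    return errs.mean()
```

On any least squares fit the two agree up to floating-point error, while the shortcut requires only a single fit.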

The `cross_validate()` function is flexible and can take
different splitting mechanisms as an argument. For instance, one can use the `ShuffleSplit()`
function to implement
the validation set approach just as easily as $k$-fold cross-validation.

```{python}
validation = ShuffleSplit(n_splits=1,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=validation)
results['test_score']
```

One can estimate the variability in the test error by running the following:

```{python}
validation = ShuffleSplit(n_splits=10,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=validation)
results['test_score'].mean(), results['test_score'].std()
```

Note that this standard deviation is not a valid estimate of the
sampling variability of the mean test score or the individual scores, since the randomly-selected training
samples overlap and hence introduce correlations. But it does give an
idea of the Monte Carlo variation
incurred by picking different random folds.

## The Bootstrap
We illustrate the use of the bootstrap in the simple example
of Section 5.2, as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.

### Estimating the Accuracy of a Statistic of Interest
One of the great advantages of the bootstrap approach is that it can
be applied in almost all situations. No complicated mathematical
calculations are required. While there are several implementations
of the bootstrap in Python, its use for estimating
standard error is simple enough that we write our own function
below for the case when our data is stored
in a dataframe.

To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
in Section 5.2. The goal is to estimate the
sampling variance of the parameter $\alpha$ given in formula (5.7). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
vector `idx` indicating which observations should be used to
estimate
$\alpha$. The function then outputs the estimate for $\alpha$ based on
the selected observations.

```{python}
Portfolio = load_data('Portfolio')
def alpha_func(D, idx):
    cov_ = np.cov(D[['X','Y']].loc[idx], rowvar=False)
    return ((cov_[1,1] - cov_[0,1]) /
            (cov_[0,0] + cov_[1,1] - 2 * cov_[0,1]))
```

This function returns an estimate for $\alpha$
based on applying the minimum
variance formula (5.7) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.

```{python}
alpha_func(Portfolio, range(100))
```

Next we randomly select
100 observations from `range(100)`, with replacement. This is equivalent
to constructing a new bootstrap data set and recomputing $\hat{\alpha}$
based on the new data set.

```{python}
rng = np.random.default_rng(0)
alpha_func(Portfolio,
           rng.choice(100,
                      100,
                      replace=True))
```

This process can be generalized to create a simple function `boot_SE()` for
computing the bootstrap standard error for arbitrary
functions that take only a data frame as an argument.

```{python}
def boot_SE(func,
            D,
            n=None,
            B=1000,
            seed=0):
    rng = np.random.default_rng(seed)
    first_, second_ = 0, 0
    n = n or D.shape[0]
    for _ in range(B):
        idx = rng.choice(D.index,
                         n,
                         replace=True)
        value = func(D, idx)
        first_ += value
        second_ += value**2
    return np.sqrt(second_ / B - (first_ / B)**2)
```

Notice the use of `_` as a loop variable in `for _ in range(B)`. This is often used if the value of the counter is
unimportant, and simply ensures the loop is executed `B` times.

Let's use our function to evaluate the accuracy of our
estimate of $\alpha$ using $B=1{,}000$ bootstrap replications.

```{python}
alpha_SE = boot_SE(alpha_func,
                   Portfolio,
                   B=1000,
                   seed=0)
alpha_SE
```

The final output shows that the bootstrap estimate for ${\rm SE}(\hat{\alpha})$ is $0.0912$.

### Estimating the Accuracy of a Linear Regression Model
The bootstrap approach can be used to assess the variability of the
coefficient estimates and predictions from a statistical learning
method. Here we use the bootstrap approach in order to assess the
variability of the estimates for $\beta_0$ and $\beta_1$, the
intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.

To use our `boot_SE()` function, we must write a function (its
first argument)
that takes a data frame `D` and indices `idx`
as its only arguments. But here we want to bootstrap a specific
regression model, specified by a model formula and data. We show how
to do this in a few simple steps.

We start by writing a generic
function `boot_OLS()` for bootstrapping a regression model that takes a formula to
define the corresponding regression. We use the `clone()` function to
make a copy of the formula that can be refit to the new dataframe. This means
that any derived features, such as those defined by `poly()`
(which we will see shortly),
will be re-fit on the resampled data frame.

```{python}
def boot_OLS(model_matrix, response, D, idx):
    D_ = D.loc[idx]
    Y_ = D_[response]
    X_ = clone(model_matrix).fit_transform(D_)
    return sm.OLS(Y_, X_).fit().params
```

This is not quite what is needed as the first argument to
`boot_SE()`. The first two arguments, which specify the model, will not change in the
bootstrap process, and we would like to *freeze* them. The
function `partial()` from the `functools` module does precisely this: it takes a function
as an argument, and freezes some of its arguments, starting from the
left. We use it to freeze the first two model-formula arguments of `boot_OLS()`.

```{python}
hp_func = partial(boot_OLS, MS(['horsepower']), 'mpg')
```

Typing `hp_func?` will show that it has two arguments `D`
and `idx` --- it is a version of `boot_OLS()` with the first
two arguments frozen --- and hence is ideal as the first argument for `boot_SE()`.

The `hp_func()` function can now be used in order to create
bootstrap estimates for the intercept and slope terms by randomly
sampling from among the observations with replacement. We first
demonstrate its utility on 10 bootstrap samples.

```{python}
rng = np.random.default_rng(0)
np.array([hp_func(Auto,
                  rng.choice(Auto.index,
                             392,
                             replace=True)) for _ in range(10)])
```

Next, we use the `boot_SE()` function to compute the standard
errors of 1,000 bootstrap estimates for the intercept and slope terms.

```{python}
hp_se = boot_SE(hp_func,
                Auto,
                B=1000,
                seed=10)
hp_se
```

This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
Section 3.1.2, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.sm`.

```{python}
hp_model.fit(Auto, Auto['mpg'])
model_se = summarize(hp_model.results_)['std err']
model_se
```

The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
obtained using the formulas from Section 3.1.2 are
0.717 for the
intercept and
0.006 for the
slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
Equation 3.8 on page 75
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formulas for the standard errors do not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
in Figure 3.8 on page 99 that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
are fixed, and all the variability comes from the variation in the
errors $\epsilon_i$. The bootstrap approach does not rely on any of
these assumptions, and so it is likely giving a more accurate estimate
of the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ than
the results from `sm.OLS`.

Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
the data (Figure 3.8), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.

```{python}
quad_model = MS([poly('horsepower', 2, raw=True)])
quad_func = partial(boot_OLS,
                    quad_model,
                    'mpg')
boot_SE(quad_func, Auto, B=1000)
```

We compare the results to the standard errors computed using `sm.OLS()`.

```{python}
M = sm.OLS(Auto['mpg'],
           quad_model.fit_transform(Auto))
summarize(M.fit())['std err']
```

# Linear Models and Regularization Methods

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch06-varselect-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch06-varselect-lab.ipynb)

In this lab we implement many of the techniques discussed in this chapter.
We import some of our libraries at this top
level.

```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from statsmodels.api import OLS
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from ISLP import load_data
from ISLP.models import ModelSpec as MS
from functools import partial
```

We again collect the new imports
needed for this lab. Readers will have installed `l0bnb` when installing the requirements.

```{python}
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from ISLP.models import (Stepwise,
                         sklearn_selected,
                         sklearn_selection_path)
from l0bnb import fit_path
```

Using `skl.ElasticNet` to fit ridge regression
throws up many warnings. We will suppress them below by a call to `warnings.simplefilter()`.

```{python}
import warnings
warnings.simplefilter("ignore")
```

## Subset Selection Methods
Here we implement methods that reduce the number of parameters in a
model by restricting the model to a subset of the input variables.

### Forward Selection

We will apply the forward-selection approach to the `Hitters`
data. We wish to predict a baseball player's `Salary` on the
basis of various statistics associated with performance in the
previous year.

First of all, we note that the `Salary` variable is missing for
some of the players. The `np.isnan()` function can be used to
identify the missing observations. It returns an array
of the same shape as the input vector, with a `True` for any elements that
are missing, and a `False` for non-missing elements. The
`sum()` method can then be used to count all of the
missing elements.

```{python}
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
```

We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`).

```{python}
Hitters = Hitters.dropna()
Hitters.shape
```

We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence
our scoring function computes the negative $C_p$ statistic.

```{python}
def nCp(sigma2, estimator, X, Y):
    "Negative Cp statistic"
    n, p = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat)**2)
    return -(RSS + 2 * p * sigma2) / n
```

We need to estimate the residual variance $\sigma^2$, which is the first argument in our scoring function above.
We will fit the biggest model, using all the variables, and estimate $\sigma^2$ based on its MSE.

```{python}
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
sigma2 = OLS(Y, X).fit().scale
```

The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$.

```{python}
neg_Cp = partial(nCp, sigma2)
```

We can now use `neg_Cp()` as a scorer for model selection.

Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
runs forward stepwise until any further additions to the model do not result
in an improvement in the evaluation score. Similarly, the method `Stepwise.fixed_steps()`
runs a fixed number of steps of stepwise search.

```{python}
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))
```

We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its
`fit` method. Without specifying a `scoring` argument, the score defaults to MSE, and so all 19 variables will be
selected.

```{python}
hitters_MSE = sklearn_selected(OLS,
                               strategy)
hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_
```

Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.

```{python}
hitters_Cp = sklearn_selected(OLS,
                              strategy,
                              scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_
```

### Choosing Among Models Using the Validation Set Approach and Cross-Validation

As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
method that stores the full path of models found in forward selection, and allows predictions for each of these. This can be done with the `sklearn_selection_path()`
estimator from `ISLP.models`. The function `cross_val_predict()` from `sklearn.model_selection`
computes the cross-validated predictions for each of the models
along the path, which we can use to evaluate the cross-validated MSE
along the path.

Here we define a strategy that fits the full forward selection path.
While there are various parameter choices for `sklearn_selection_path()`,
we use the defaults here, which selects the model at each step based on the biggest reduction in RSS.

```{python}
strategy = Stepwise.fixed_steps(design,
                                len(design.terms),
                                direction='forward')
full_path = sklearn_selection_path(OLS, strategy)
```

We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.

```{python}
full_path.fit(Hitters, Y)
Yhat_in = full_path.predict(Hitters)
Yhat_in.shape
```

This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves with each step we take,
indicating that we must use either the validation or cross-validation
approach to select the number of steps. We fix the y-axis to range from
50,000 to 250,000 to compare to the cross-validation and validation
set MSE below, as well as to other methods such as ridge regression, the lasso and
principal components regression.

```{python}
mse_fig, ax = subplots(figsize=(8,8))
insample_mse = ((Yhat_in - Y[:,None])**2).mean(0)
n_steps = insample_mse.shape[0]
ax.plot(np.arange(n_steps),
        insample_mse,
        'k', # color black
        label='In-sample')
ax.set_ylabel('MSE',
              fontsize=20)
ax.set_xlabel('# steps of forward stepwise',
              fontsize=20)
ax.set_xticks(np.arange(n_steps)[::2])
ax.legend()
ax.set_ylim([50000,250000]);
```

Notice the expression `None` in `Y[:,None]` above.
This adds an axis (dimension) to the one-dimensional array `Y`,
which allows it to be recycled when subtracted from the two-dimensional `Yhat_in`.
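The broadcasting at work here can be seen in a tiny standalone example (the array names and values below are made up for illustration):

```python
import numpy as np

resp = np.array([1.0, 2.0, 3.0])    # shape (3,): one response per sample
preds = np.ones((3, 2))             # shape (3, 2): 3 samples, 2 models
# resp[:, None] has shape (3, 1); broadcasting stretches it across the
# two columns, so every model's predictions are compared against resp.
resid = preds - resp[:, None]
resid.shape                          # (3, 2)
```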

We are now ready to use cross-validation to estimate test error along
the model path. We must use *only the training observations* to perform all aspects of model-fitting --- including
variable selection. Therefore, the determination of which model of a
given size is best must be made using *only the training
observations* in each training fold. This point is subtle but important. If the full data
set is used to select the best subset at each step, then the validation
set errors and cross-validation errors that we obtain will not be
accurate estimates of the test error.

We now compute the cross-validated predicted values using 5-fold cross-validation.

```{python}
K = 5
kfold = skm.KFold(K,
                  random_state=0,
                  shuffle=True)
Yhat_cv = skm.cross_val_predict(full_path,
                                Hitters,
                                Y,
                                cv=kfold)
Yhat_cv.shape
```

The prediction matrix `Yhat_cv` is the same shape as `Yhat_in`; the difference is that the predictions in each row, corresponding to a particular sample index, were made from models fit on a training fold that did not include that row.

At each model along the path, we compute the MSE in each of the cross-validation folds.
We will average these to get the mean MSE, and can also use the individual values to compute a crude estimate of the standard error of the mean. (The estimate is crude because the five error estimates are based on overlapping training sets, and hence are not independent.)
Hence we must know the test indices for each cross-validation
split. This can be found by using the `split()` method of `kfold`. Because
we fixed the random state above, whenever we split any array with the same
number of rows as $Y$ we recover the same training and test indices, though we simply
ignore the training indices below.
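This reproducibility is easy to check in isolation; the sketch below (with made-up arrays) confirms that two arrays with the same number of rows yield identical test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
n = 20  # any two arrays with n rows produce the same splits
a, b = np.zeros(n), np.ones((n, 3))
splits_a = [test for _, test in kfold.split(a)]
splits_b = [test for _, test in kfold.split(b)]
assert all(np.array_equal(s, t) for s, t in zip(splits_a, splits_b))
```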

```{python}
cv_mse = []
for train_idx, test_idx in kfold.split(Y):
    errors = (Yhat_cv[test_idx] - Y[test_idx,None])**2
    cv_mse.append(errors.mean(0)) # column means
cv_mse = np.array(cv_mse).T
cv_mse.shape
```

We now add the cross-validation error estimates to our MSE plot.
We include the mean error across the five folds, and the estimate of the standard error of the mean.

```{python}
ax.errorbar(np.arange(n_steps),
            cv_mse.mean(1),
            cv_mse.std(1) / np.sqrt(K),
            label='Cross-validated',
            c='r') # color red
ax.set_ylim([50000,250000])
ax.legend()
mse_fig
```

To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and a training set. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.2,
                              random_state=0)
for train_idx, test_idx in validation.split(Y):
    full_path.fit(Hitters.iloc[train_idx],
                  Y[train_idx])
    Yhat_val = full_path.predict(Hitters.iloc[test_idx])
    errors = (Yhat_val - Y[test_idx,None])**2
    validation_mse = errors.mean(0)
```

As for the in-sample MSE case, the validation set approach does not provide standard errors.

```{python}
ax.plot(np.arange(n_steps),
        validation_mse,
        'b--', # color blue, broken line
        label='Validation')
ax.set_xticks(np.arange(n_steps)[::2])
ax.set_ylim([50000,250000])
ax.legend()
mse_fig
```

### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
data, which, for every subset size, searches for the best set of predictors.

We will use a package called `l0bnb` to perform
best subset selection.
Instead of constraining the subset to be a given size,
this package produces a path of solutions using the subset size as a
penalty rather than a constraint. Although the distinction is subtle, the difference comes when we cross-validate.

```{python}
D = design.fit_transform(Hitters)
D = D.drop('intercept', axis=1)
X = np.asarray(D)
```

Here we excluded the first column corresponding to the intercept, as
`l0bnb` will fit the intercept separately. We can find a path using the `fit_path()` function.

```{python}
path = fit_path(X,
                Y,
                max_nonzeros=X.shape[1])
```

The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.

```{python}
path[3]
```

In the example above, we see that at the fourth step in the path, we have two nonzero coefficients in `'B'`, corresponding to the value $0.0114$ for the penalty parameter `lambda_0`.
We could make predictions using this sequence of fits on a validation set as a function of `lambda_0`, or, with more work, using cross-validation.

## Ridge Regression and the Lasso
We will use the `sklearn.linear_model` package (for which
we use `skl` as shorthand below)
to fit ridge and lasso regularized linear models on the `Hitters` data.
We start with the model matrix `X` (without an intercept) that we computed in the previous section on best subset regression.

### Ridge Regression
We will use the function `skl.ElasticNet()` to fit both ridge and the lasso.
To fit a *path* of ridge regression models, we use
`skl.ElasticNet.path()`, which can fit both ridge and lasso, as well as a hybrid mixture; ridge regression
corresponds to `l1_ratio=0`.
It is good practice to standardize the columns of `X` in these applications, if the variables are measured in different units. Since `skl.ElasticNet()` does no normalization, we have to take care of that ourselves.
Since we
standardize first, in order to find coefficient
estimates on the original scale, we must *unstandardize*
the coefficient estimates. The parameter
$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
be consistent with the rest of this chapter, we use `lambdas`
rather than `alphas` in what follows. (At the time of publication, ridge fits like the one in code chunk [23] issue unwarranted convergence warning messages; we suppressed these when we filtered the warnings above.)
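The unstandardization step can be sketched in isolation. The example below (with made-up data) uses unpenalized least squares so that the identity is exact; the same coefficient mapping --- divide by the column scales, then adjust the intercept --- is what one would use to report coefficients fit on standardized columns back on the original scale:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])  # mixed units
y = X @ np.array([2.0, -0.5, 0.03]) + 5 + rng.normal(size=50)

mean, scale = X.mean(0), X.std(0)
Xs = (X - mean) / scale

# least squares on the standardized matrix (with an intercept column)
A = np.column_stack([np.ones(len(y)), Xs])
coef_s = np.linalg.lstsq(A, y, rcond=None)[0]

# map back to the original scale
beta = coef_s[1:] / scale
intercept = coef_s[0] - mean @ beta
```

Since $X = X_s \cdot \text{scale} + \text{mean}$ columnwise, the fitted values `intercept + X @ beta` agree exactly with `A @ coef_s`.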
|
||||
|
||||
```{python}
|
||||
Xs = X - X.mean(0)[None,:]
|
||||
X_scale = X.std(0)
|
||||
Xs = Xs / X_scale[None,:]
|
||||
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
|
||||
soln_array = skl.ElasticNet.path(Xs,
|
||||
Y,
|
||||
l1_ratio=0.,
|
||||
alphas=lambdas)[1]
|
||||
soln_array.shape
|
||||
|
||||
```
Here we extract the array of coefficients corresponding to the solutions along the regularization path.
By default the `skl.ElasticNet.path` method fits a path along an automatically selected range of $\lambda$ values, except for the case when `l1_ratio=0`, which results in ridge regression (as is the case here). (The reason is rather technical; for all models except ridge, we can find the smallest value of $\lambda$ for which all coefficients are zero. For ridge this value is $\infty$.) So here we have chosen to implement the function over a grid of values ranging from $\lambda=10^{8}$ to $\lambda=10^{-2}$, scaled by the standard deviation of $y$, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit.

Associated with each value of $\lambda$ is a vector of ridge regression coefficients that can be accessed by a column of `soln_array`. In this case, `soln_array` is a $19 \times 100$ matrix, with 19 rows (one for each predictor) and 100 columns (one for each value of $\lambda$).

We transpose this matrix and turn it into a data frame to facilitate viewing and plotting.

```{python}
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
soln_path.index.name = 'negative log(lambda)'
soln_path
```

We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the plot method, adding it afterward with the `legend()` method of `ax`.

```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20)
ax.legend(loc='upper left');
```

(We have used `latex` formatting, in a raw string, for the horizontal label, in order to format the Greek $\lambda$ appropriately.)
We expect the coefficient estimates to be much smaller, in terms of $\ell_2$ norm, when a large value of $\lambda$ is used, as compared to when a small value of $\lambda$ is used. (Recall that the $\ell_2$ norm is the square root of the sum of squared coefficient values.) We display the coefficients at the $40$th step, where $\lambda$ is 25.535.

```{python}
beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat
```

Let's compute the $\ell_2$ norm of the standardized coefficients.

```{python}
np.linalg.norm(beta_hat)
```

In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01. Note the much larger $\ell_2$ norm of the coefficients associated with this smaller value of $\lambda$.

```{python}
beta_hat = soln_path.loc[soln_path.index[59]]
lambdas[59], np.linalg.norm(beta_hat)
```

Above we normalized `X` upfront, and fit the ridge model using `Xs`.
The `Pipeline()` object in `sklearn` provides a clear way to separate feature normalization from the fitting of the ridge model itself.

```{python}
ridge = skl.ElasticNet(alpha=lambdas[59], l1_ratio=0)
scaler = StandardScaler(with_mean=True, with_std=True)
pipe = Pipeline(steps=[('scaler', scaler), ('ridge', ridge)])
pipe.fit(X, Y)
```

We show that it gives the same $\ell_2$ norm as in our previous fit on the standardized data.

```{python}
np.linalg.norm(ridge.coef_)
```

Notice that the operation `pipe.fit(X, Y)` above has changed the `ridge` object, and in particular has added attributes such as `coef_` that were not there before.
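This statefulness can be verified directly with `hasattr()`; a small self-contained illustration on synthetic data (the `_demo` names are placeholders, not objects from the lab):

```{python}
import numpy as np
import sklearn.linear_model as skl

rng = np.random.default_rng(0)
X_fit_demo = rng.normal(size=(30, 3))
y_fit_demo = X_fit_demo @ np.array([1., 0., -1.]) + rng.normal(size=30)

ridge_demo = skl.ElasticNet(alpha=1.0, l1_ratio=0)
before = hasattr(ridge_demo, 'coef_')   # False: the estimator is unfitted
ridge_demo.fit(X_fit_demo, y_fit_demo)
after = hasattr(ridge_demo, 'coef_')    # True: fit() attached the attribute
```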

### Estimating Test Error of Ridge Regression

Choosing an *a priori* value of $\lambda$ for ridge regression is difficult if not impossible. We will want to use the validation method or cross-validation to select the tuning parameter. The reader may not be surprised that the `Pipeline()` approach can be used in `skm.cross_validate()` with either a validation method (i.e. `validation`) or $k$-fold cross-validation.

We fix the random state of the splitter so that the results obtained will be reproducible.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.5,
                              random_state=0)
ridge.alpha = 0.01
results = skm.cross_validate(ridge,
                             X,
                             Y,
                             scoring='neg_mean_squared_error',
                             cv=validation)
-results['test_score']
```

The test MSE is 1.342e+05. Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. We can get the same result by fitting a ridge regression model with a *very* large value of $\lambda$. Note that `1e10` means $10^{10}$.

```{python}
ridge.alpha = 1e10
results = skm.cross_validate(ridge,
                             X,
                             Y,
                             scoring='neg_mean_squared_error',
                             cv=validation)
-results['test_score']
```

Obviously choosing $\lambda=0.01$ is arbitrary, so we will use cross-validation or the validation-set approach to choose the tuning parameter $\lambda$.
The object `GridSearchCV()` allows exhaustive grid search to choose such a parameter.

We first use the validation set method to choose $\lambda$.

```{python}
param_grid = {'ridge__alpha': lambdas}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=validation,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_
```

Alternatively, we can use 5-fold cross-validation.

```{python}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_
```

Recall we set up the `kfold` object for 5-fold cross-validation on page 271. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left to right.

```{python}
ridge_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            -grid.cv_results_['mean_test_score'],
            yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylim([50000,250000])
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

One can cross-validate different metrics to choose a parameter. The default metric for `skl.ElasticNet()` is test $R^2$.
Let's compare $R^2$ to MSE for cross-validation here.

```{python}
grid_r2 = skm.GridSearchCV(pipe,
                           param_grid,
                           cv=kfold)
grid_r2.fit(X, Y)
```

Finally, let's plot the results for cross-validated $R^2$.

```{python}
r2_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            grid_r2.cv_results_['mean_test_score'],
            yerr=grid_r2.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
```

### Fast Cross-Validation for Solution Paths

The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit such paths, and to choose a suitable value of $\lambda$ using cross-validation. Even with identical splits the results will not agree *exactly* with our `grid` above, because the standardization of each feature in `grid` is carried out on each fold, while in `pipeCV` below it is carried out only once. Nevertheless, the results are similar, as the normalization is relatively stable across folds.

```{python}
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=0,
                           cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('ridge', ridgeCV)])
pipeCV.fit(X, Y)
```

Let's produce a plot again of the cross-validation error to see that it is similar to using `skm.GridSearchCV`.

```{python}
tuned_ridge = pipeCV.named_steps['ridge']
ridgeCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            tuned_ridge.mse_path_.mean(1),
            yerr=tuned_ridge.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_ridge.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

We see that the value of $\lambda$ that results in the smallest cross-validation error is 1.19e-02, available as the value `tuned_ridge.alpha_`. What is the test MSE associated with this value of $\lambda$?

```{python}
np.min(tuned_ridge.mse_path_.mean(1))
```

This represents a further improvement over the test MSE that we got using $\lambda=4$. Finally, `tuned_ridge.coef_` has the coefficients fit on the entire data set at this value of $\lambda$.

```{python}
tuned_ridge.coef_
```

As expected, none of the coefficients are zero; ridge regression does not perform variable selection!

### Evaluating Test Error of Cross-Validated Ridge

Choosing $\lambda$ using cross-validation provides a single regression estimator, similar to fitting a linear regression model as we saw in Chapter 3. It is therefore reasonable to estimate what its test error is. We run into a problem here in that cross-validation will have *touched* all of its data in choosing $\lambda$, hence we have no further data to estimate test error. A compromise is to do an initial split of the data into two disjoint sets: a training set and a test set. We then fit a cross-validation-tuned ridge regression on the training set, and evaluate its performance on the test set. We might call this cross-validation nested within the validation set approach. A priori there is no reason to use half of the data for each of the two sets in validation. Below, we use 75% for training and 25% for test, with the estimator being ridge regression tuned using 5-fold cross-validation. This can be achieved in code as follows:

```{python}
outer_valid = skm.ShuffleSplit(n_splits=1,
                               test_size=0.25,
                               random_state=1)
inner_cv = skm.KFold(n_splits=5,
                     shuffle=True,
                     random_state=2)
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=0,
                           cv=inner_cv)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('ridge', ridgeCV)]);
```

```{python}
results = skm.cross_validate(pipeCV,
                             X,
                             Y,
                             cv=outer_valid,
                             scoring='neg_mean_squared_error')
-results['test_score']
```

### The Lasso

We saw that ridge regression with a wise choice of $\lambda$ can outperform least squares as well as the null model on the `Hitters` data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the `ElasticNetCV()` function; however, this time we use the argument `l1_ratio=1`. Other than that change, we proceed just as we did in fitting a ridge model.

```{python}
lassoCV = skl.ElasticNetCV(n_alphas=100,
                           l1_ratio=1,
                           cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('lasso', lassoCV)])
pipeCV.fit(X, Y)
tuned_lasso = pipeCV.named_steps['lasso']
tuned_lasso.alpha_
```

```{python}
lambdas, soln_array = skl.Lasso.path(Xs,
                                     Y,
                                     l1_ratio=1,
                                     n_alphas=100)[:2]
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
```

We can see from the coefficient plot of the standardized coefficients that, depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero.

```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.legend(loc='upper left')
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20);
```

The smallest cross-validated error is lower than the test set MSE of the null model and of least squares, and very similar to the test MSE of 115526.71 of ridge regression (page 278) with $\lambda$ chosen by cross-validation.

```{python}
np.min(tuned_lasso.mse_path_.mean(1))
```

Let's again produce a plot of the cross-validation error.

```{python}
lassoCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(tuned_lasso.alphas_),
            tuned_lasso.mse_path_.mean(1),
            yerr=tuned_lasso.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_lasso.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 6 of the 19 coefficient estimates are exactly zero. So the lasso model with $\lambda$ chosen by cross-validation contains only 13 variables.

```{python}
tuned_lasso.coef_
```

As in ridge regression, we could evaluate the test error of cross-validated lasso by first splitting into test and training sets and internally running cross-validation on the training set. We leave this as an exercise.
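For readers who want a starting point, here is one possible sketch of that exercise, mirroring the nested ridge validation above; the synthetic `X_demo` and `Y_demo` below stand in for the `Hitters` model matrix and response, and the random states are arbitrary choices:

```{python}
import numpy as np
import sklearn.linear_model as skl
import sklearn.model_selection as skm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the Hitters model matrix and response.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 19))
Y_demo = X_demo[:, 0] * 2 + rng.normal(size=100)

outer_valid = skm.ShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
inner_cv = skm.KFold(n_splits=5, shuffle=True, random_state=2)
lassoCV_demo = skl.ElasticNetCV(n_alphas=100, l1_ratio=1, cv=inner_cv)
pipe_demo = Pipeline(steps=[('scaler', StandardScaler()),
                            ('lasso', lassoCV_demo)])
results_demo = skm.cross_validate(pipe_demo, X_demo, Y_demo,
                                  cv=outer_valid,
                                  scoring='neg_mean_squared_error')
test_mse = -results_demo['test_score'][0]   # held-out MSE of the tuned lasso
```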

## PCR and PLS Regression

### Principal Components Regression

Principal components regression (PCR) can be performed using `PCA()` from the `sklearn.decomposition` module. We now apply PCR to the `Hitters` data, in order to predict `Salary`. Again, ensure that the missing values have been removed from the data, as described in Section 6.5.1.

We use `LinearRegression()` to fit the regression model here. Note that it fits an intercept by default, unlike the `OLS()` function seen earlier in Section 6.5.1.

```{python}
pca = PCA(n_components=2)
linreg = skl.LinearRegression()
pipe = Pipeline([('pca', pca),
                 ('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```

When performing PCA, the results vary depending on whether the data has been *standardized* or not. As in the earlier examples, this can be accomplished by including an additional step in the pipeline.

```{python}
pipe = Pipeline([('scaler', scaler),
                 ('pca', pca),
                 ('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```

We can of course use CV to choose the number of components, by using `skm.GridSearchCV`, in this case varying only the `n_components` parameter.

```{python}
param_grid = {'pca__n_components': range(1, 20)}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
```

Let's plot the results as we have for other methods.

```{python}
pcr_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['pca__n_components']
ax.errorbar(n_comp,
            -grid.cv_results_['mean_test_score'],
            grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# principal components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```

We see that the smallest cross-validation error occurs when 17 components are used. However, from the plot we also see that the cross-validation error is roughly the same when only one component is included in the model. This suggests that a model that uses just a small number of components might suffice.

The CV score is provided for each possible number of components from 1 to 19 inclusive. The `PCA()` method complains if we try to fit an intercept only with `n_components=0`, so we also compute the MSE for just the null model with these splits.

```{python}
Xn = np.zeros((X.shape[0], 1))
cv_null = skm.cross_validate(linreg,
                             Xn,
                             Y,
                             cv=kfold,
                             scoring='neg_mean_squared_error')
-cv_null['test_score'].mean()
```

The `explained_variance_ratio_` attribute of our `PCA` object provides the *percentage of variance explained* in the predictors using different numbers of components. This concept is discussed in greater detail in Section 12.2.

```{python}
pipe.named_steps['pca'].explained_variance_ratio_
```

Briefly, we can think of this as the amount of information about the predictors that is captured using $M$ principal components. For example, setting $M=1$ only captures 38.31% of the variance, while $M=2$ captures an additional 21.84%, for a total of 60.15% of the variance. By $M=6$ it increases to 88.63%. Beyond this the increments continue to diminish, until we use all $M=p=19$ components, which captures all 100% of the variance.
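The running totals quoted above are cumulative sums of `explained_variance_ratio_`, which `np.cumsum()` computes directly; a self-contained illustration (synthetic predictors stand in for the 19-column `Hitters` matrix):

```{python}
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 19-column predictor matrix.
rng = np.random.default_rng(0)
X_pca_demo = rng.normal(size=(200, 19))

pca_full = PCA().fit(X_pca_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
# cum_var[M-1] is the fraction of predictor variance captured by M components.
```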

### Partial Least Squares

Partial least squares (PLS) is implemented in the `PLSRegression()` function.

```{python}
pls = PLSRegression(n_components=2,
                    scale=True)
pls.fit(X, Y)
```

As was the case in PCR, we will want to use CV to choose the number of components.

```{python}
param_grid = {'n_components':range(1, 20)}
grid = skm.GridSearchCV(pls,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
```

As for our other methods, we plot the MSE.

```{python}
pls_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['n_components']
ax.errorbar(n_comp,
            -grid.cv_results_['mean_test_score'],
            grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# PLS components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```

CV error is minimized at 12, though there is little noticeable difference between this point and a much lower number like 2 or 3 components.

# Non-Linear Modeling

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch07-nonlin-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch07-nonlin-lab.ipynb)

In this lab, we demonstrate some of the nonlinear models discussed in this chapter. We use the `Wage` data as a running example, and show that many of the complex non-linear fitting procedures discussed can easily be implemented in `Python`.

As usual, we start with some of our standard imports.

```{python}
import numpy as np, pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (summarize,
                         poly,
                         ModelSpec as MS)
from statsmodels.stats.anova import anova_lm
```

We again collect the new imports needed for this lab. Many of these are developed specifically for the `ISLP` package.

```{python}
from pygam import (s as s_gam,
                   l as l_gam,
                   f as f_gam,
                   LinearGAM,
                   LogisticGAM)

from ISLP.transforms import (BSpline,
                             NaturalSpline)
from ISLP.models import bs, ns
from ISLP.pygam import (approx_lam,
                        degrees_of_freedom,
                        plot as plot_gam,
                        anova as anova_gam)
```

## Polynomial Regression and Step Functions

We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.

```{python}
Wage = load_data('Wage')
y = Wage['wage']
age = Wage['age']
```

Throughout most of this lab, our response is `Wage['wage']`, which we have stored as `y` above.
As in Section 3.6.6, we will use the `poly()` function to create a model matrix that will fit a $4$th degree polynomial in `age`.

```{python}
poly_age = MS([poly('age', degree=4)]).fit(Wage)
M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)
```

This polynomial is constructed using the function `poly()`, which creates a special *transformer* `Poly()` (using `sklearn` terminology for feature transformations such as `PCA()` seen in Section 6.5.3) that allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also the discussion of transformations on page 118.

In the code above, the first line executes the `fit()` method using the dataframe `Wage`. This recomputes and stores as attributes any parameters needed by `Poly()` on the training data, and these will be used on all subsequent evaluations of the `transform()` method. For example, it is used on the second line, as well as in the plotting function developed below.

We now create a grid of values for `age` at which we want predictions.

```{python}
age_grid = np.linspace(age.min(),
                       age.max(),
                       100)
age_df = pd.DataFrame({'age': age_grid})
```

Finally, we wish to plot the data and add the fit from the fourth-degree polynomial. As we will make several similar plots below, we first write a function to create all the ingredients and produce the plot. Our function takes in a model specification (here a basis specified by a transform), as well as a grid of `age` values. The function produces a fitted curve as well as 95% confidence bands. By using an argument for `basis` we can produce and plot the results with several different transforms, such as the splines we will see shortly.

```{python}
def plot_wage_fit(age_df,
                  basis,
                  title):

    X = basis.transform(Wage)
    Xnew = basis.transform(age_df)
    M = sm.OLS(y, X).fit()
    preds = M.get_prediction(Xnew)
    bands = preds.conf_int(alpha=0.05)
    fig, ax = subplots(figsize=(8,8))
    ax.scatter(age,
               y,
               facecolor='gray',
               alpha=0.5)
    for val, ls in zip([preds.predicted_mean,
                        bands[:,0],
                        bands[:,1]],
                       ['b','r--','r--']):
        ax.plot(age_df.values, val, ls, linewidth=3)
    ax.set_title(title, fontsize=20)
    ax.set_xlabel('Age', fontsize=20)
    ax.set_ylabel('Wage', fontsize=20)
    return ax
```
We include an argument `alpha` to `ax.scatter()` to add some transparency to the points. This provides a visual indication of density. Notice the use of the `zip()` function in the `for` loop above (see Section 2.3.8). We have three lines to plot, each with different colors and line types. Here `zip()` conveniently bundles these together as iterators in the loop. (In `Python` speak, an "iterator" is an object with a finite number of values that can be iterated on, as in a loop.)

We now plot the fit of the fourth-degree polynomial using this function.

```{python}
plot_wage_fit(age_df,
              poly_age,
              'Degree-4 Polynomial');
```

With polynomial regression we must decide on the degree of the polynomial to use. Sometimes we just wing it, and decide to use second or third degree polynomials, simply to obtain a nonlinear fit. But we can make such a decision in a more systematic way. One way to do this is through hypothesis tests, which we demonstrate here. We now fit a series of models ranging from linear (degree-one) to degree-five polynomials, and look to determine the simplest model that is sufficient to explain the relationship between `wage` and `age`. We use the `anova_lm()` function, which performs a series of ANOVA tests.
An *analysis of variance* or ANOVA tests the null hypothesis that a model $\mathcal{M}_1$ is sufficient to explain the data against the alternative hypothesis that a more complex model $\mathcal{M}_2$ is required. The determination is based on an F-test.
To perform the test, the models $\mathcal{M}_1$ and $\mathcal{M}_2$ must be *nested*: the space spanned by the predictors in $\mathcal{M}_1$ must be a subspace of the space spanned by the predictors in $\mathcal{M}_2$. In this case, we fit five different polynomial models and sequentially compare the simpler model to the more complex model.

```{python}
models = [MS([poly('age', degree=d)])
          for d in range(1, 6)]
Xs = [model.fit_transform(Wage) for model in models]
anova_lm(*[sm.OLS(y, X_).fit()
           for X_ in Xs])
```

Notice the `*` in the `anova_lm()` line above. This function takes a variable number of non-keyword arguments, in this case fitted models. When these models are provided as a list (as is done here), it must be prefixed by `*`.

The p-value comparing the linear `models[0]` to the quadratic `models[1]` is essentially zero, indicating that a linear fit is not sufficient. (Indexing starting at zero is confusing for the polynomial degree example, since `models[1]` is quadratic rather than linear!) Similarly the p-value comparing the quadratic `models[1]` to the cubic `models[2]` is very low (0.0017), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-four polynomials, `models[2]` and `models[3]`, is approximately 5%, while the degree-five polynomial `models[4]` seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.

In this case, instead of using the `anova_lm()` function, we could have obtained these p-values more succinctly by exploiting the fact that `poly()` creates orthogonal polynomials.

```{python}
summarize(M)
```

Notice that the p-values are the same, and in fact the squares of the t-statistics are equal to the F-statistics from the `anova_lm()` function; for example:

```{python}
(-11.983)**2
```

However, the ANOVA method works whether or not we used orthogonal polynomials, provided the models are nested. For example, we can use `anova_lm()` to compare the following three models, which all have a linear term in `education` and a polynomial in `age` of different degrees:

```{python}
models = [MS(['education', poly('age', degree=d)])
          for d in range(1, 4)]
XEs = [model.fit_transform(Wage)
       for model in models]
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
```

As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation, as discussed in Chapter 5.
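As a sketch of that alternative, each candidate degree can be scored with `cross_val_score`; this self-contained example uses `sklearn`'s `PolynomialFeatures` on synthetic data rather than the orthogonal `poly()` basis used above:

```{python}
import numpy as np
import sklearn.model_selection as skm
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic (age, wage)-like data with a genuinely quadratic trend.
rng = np.random.default_rng(0)
x_demo = rng.uniform(18, 80, size=300)
y_sim = 50 + 2 * x_demo - 0.05 * x_demo**2 + rng.normal(scale=5, size=300)

cv_mse = {}
for d in range(1, 6):
    pipe_d = Pipeline([('poly', PolynomialFeatures(degree=d)),
                       ('ols', LinearRegression())])
    scores = skm.cross_val_score(pipe_d, x_demo[:, None], y_sim,
                                 cv=5, scoring='neg_mean_squared_error')
    cv_mse[d] = -scores.mean()    # cross-validated MSE for each degree
best_degree = min(cv_mse, key=cv_mse.get)
```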

Next we consider the task of predicting whether an individual earns more than $250,000 per year. We proceed much as before, except that first we create the appropriate response vector, and then apply the `sm.GLM()` function using the binomial family in order to fit a polynomial logistic regression model.

```{python}
X = poly_age.transform(Wage)
high_earn = Wage['high_earn'] = y > 250 # shorthand
glm = sm.GLM(y > 250,
             X,
             family=sm.families.Binomial())
B = glm.fit()
summarize(B)
```

Once again, we make predictions using the `get_prediction()` method.

```{python}
newX = poly_age.transform(age_df)
preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)
```

We now plot the estimated relationship.

```{python}
fig, ax = subplots(figsize=(8,8))
rng = np.random.default_rng(0)
ax.scatter(age +
           0.2 * rng.uniform(size=y.shape[0]),
           np.where(high_earn, 0.198, 0.002),
           fc='gray',
           marker='|')
for val, ls in zip([preds.predicted_mean,
                    bands[:,0],
                    bands[:,1]],
                   ['b','r--','r--']):
    ax.plot(age_df.values, val, ls, linewidth=3)
ax.set_title('Degree-4 Polynomial', fontsize=20)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylim([0,0.2])
ax.set_ylabel('P(Wage > 250)', fontsize=20);
```
We have drawn the `age` values corresponding to the observations with `wage` values above 250 as gray marks on the top of the plot, and those with `wage` values below 250 are shown as gray marks on the bottom of the plot. We added a small amount of noise to jitter the `age` values a bit so that observations with the same `age` value do not cover each other up. This type of plot is often called a *rug plot*.

In order to fit a step function, as discussed in Section 7.2, we first use the `pd.qcut()` function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the columns of the model matrix for this categorical variable. Note that this function will include *all* columns for a given categorical, rather than the usual approach which drops one of the levels.

```{python}
cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
```

Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own quantiles directly instead of the argument `4`. For cuts not based on quantiles we would use the `pd.cut()` function. The function `pd.qcut()` (and `pd.cut()`) returns an ordered categorical variable. The regression model then creates a set of dummy variables for use in the regression. Since `age` is the only variable in the model, the value $94,158.40 is the average salary for those under 33.75 years of age, and the other coefficients are the average salary for those in the other age groups. We can produce predictions and plots just as we did in the case of the polynomial fit.
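The difference between `pd.qcut()` and `pd.cut()` can be seen on a small synthetic example (the data here are made up purely for illustration): `pd.qcut()` places its cutpoints at quantiles, so each bin receives the same number of observations.

```{python}
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 101))  # toy stand-in for `age`
q4 = pd.qcut(x, 4)                # four equal-count bins, cut at the 25%, 50%, 75% quantiles
c4 = pd.cut(x, 4)                 # four equal-width bins instead
print(q4.cat.categories)          # cutpoints at the quantiles
print(pd.get_dummies(q4).sum())   # 25 observations fall in each quantile bin
```

`pd.get_dummies()` applied to the resulting categorical gives one indicator column per bin, just as in the regression above.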
## Splines
In order to fit regression splines, we use transforms from the `ISLP` package. The actual spline evaluation functions are in the `scipy.interpolate` package; we have simply wrapped them as transforms similar to `Poly()` and `PCA()`.

In Section 7.4, we saw that regression splines can be fit by constructing an appropriate matrix of basis functions. The `BSpline()` function generates the entire matrix of basis functions for splines with the specified set of knots. By default, the B-splines produced are cubic. To change the degree, use the argument `degree`.

```{python}
bs_ = BSpline(internal_knots=[25,40,60], intercept=True).fit(age)
bs_age = bs_.transform(age)
bs_age.shape
```

This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots. We can form this same matrix using the `bs()` object, which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.
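The column count follows from the identity: number of basis functions = number of interior knots + degree + 1 (when an intercept is kept). This can be checked against `scipy.interpolate` directly; the sketch below assumes SciPy ≥ 1.8 (which added `BSpline.design_matrix()`), and the boundary knots 18 and 80 are chosen arbitrarily for illustration.

```{python}
import numpy as np
from scipy.interpolate import BSpline as ScipyBSpline  # aliased to avoid clashing with ISLP's BSpline

degree = 3
internal_knots = [25, 40, 60]
# clamped knot sequence: boundary knots repeated degree+1 times
t = np.r_[[18.0] * (degree + 1), internal_knots, [80.0] * (degree + 1)]
x = np.linspace(20, 75, 50)
basis = ScipyBSpline.design_matrix(x, t, degree).toarray()
print(basis.shape)   # 3 interior knots + degree 3 + 1 = 7 columns
```

A side effect of the clamped knot sequence is that the basis functions sum to one at every evaluation point (a "partition of unity"), which is why dropping one column suffices to accommodate an overall intercept.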
We now fit a cubic spline model to the `Wage` data.

```{python}
bs_age = MS([bs('age', internal_knots=[25,40,60])])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```

The column names are a little cumbersome, and have caused us to truncate the printed summary. They can be set on construction using the `name` argument as follows.

```{python}
bs_age = MS([bs('age',
                internal_knots=[25,40,60],
                name='bs(age)')])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```

Notice that there are 6 spline coefficients rather than 7. This is because, by default, `bs()` assumes `intercept=False`, since we typically have an overall intercept in the model. So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.

We could also use the `df` (degrees of freedom) option to specify the complexity of the spline. We see above that with 3 knots, the spline basis has 6 columns or degrees of freedom. When we specify `df=6` rather than the actual knots, `bs()` will produce a spline with 3 knots chosen at uniform quantiles of the training data. We can see these chosen knots most easily using `BSpline()` directly:

```{python}
BSpline(df=6).fit(age).internal_knots_
```

When asking for six degrees of freedom, the transform chooses knots at ages 33.75, 42.0, and 51.0, which correspond to the 25th, 50th, and 75th percentiles of `age`.

When using B-splines we need not limit ourselves to cubic polynomials (i.e. `degree=3`). For instance, using `degree=0` results in piecewise constant functions, as in our example with `pd.qcut()` above.

```{python}
bs_age0 = MS([bs('age',
                 df=3,
                 degree=0)]).fit(Wage)
Xbs0 = bs_age0.transform(Wage)
summarize(sm.OLS(y, Xbs0).fit())
```

This fit should be compared with cell [15] where we use `pd.qcut()` to create four bins by cutting at the 25%, 50% and 75% quantiles of `age`. Since we specified `df=3` for degree-zero splines here, there will also be knots at the same three quantiles. Although the coefficients appear different, we see that this is a result of the different coding. For example, the first coefficient is identical in both cases, and is the mean response in the first bin. For the second coefficient, we have $94.158 + 22.349 = 116.507 \approx 116.611$, the latter being the mean in the second bin in cell [15]. Here the intercept is coded by a column of ones, so the second, third and fourth coefficients are increments for those bins. Why is the sum not exactly the same? It turns out that `pd.qcut()` uses $\leq$, while `bs()` uses $<$ when deciding bin membership.
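The equivalence of the two codings can be verified on made-up data: one dummy column per bin (no intercept) recovers the bin means directly, while an intercept plus increment columns recovers the first bin's mean and the differences from it.

```{python}
import numpy as np

# toy response: three bins with means 10, 14 and 20
y_toy = np.concatenate([np.full(5, 10.0), np.full(5, 14.0), np.full(5, 20.0)])
bins = np.repeat([0, 1, 2], 5)

X_onehot = np.eye(3)[bins]                  # one column per bin, no intercept
b_onehot = np.linalg.lstsq(X_onehot, y_toy, rcond=None)[0]

X_incr = np.column_stack([np.ones(15),      # intercept plus increment columns
                          bins == 1,
                          bins == 2]).astype(float)
b_incr = np.linalg.lstsq(X_incr, y_toy, rcond=None)[0]

print(b_onehot)                  # the three bin means
print(b_incr[0] + b_incr[1])     # intercept + first increment = second bin mean
```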
In order to fit a natural spline, we use the `NaturalSpline()` transform with the corresponding helper `ns()`. Here we fit a natural spline with five degrees of freedom (excluding the intercept) and plot the results.

```{python}
ns_age = MS([ns('age', df=5)]).fit(Wage)
M_ns = sm.OLS(y, ns_age.transform(Wage)).fit()
summarize(M_ns)
```

We now plot the natural spline using our plotting function.

```{python}
plot_wage_fit(age_df,
              ns_age,
              'Natural spline, df=5');
```

## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss and a single feature. To fit GAMs in `Python` we will use the `pygam` package which can be installed via `pip install pygam`. The estimator `LinearGAM()` uses squared-error loss. The GAM is specified by associating each column of a model matrix with a particular smoothing operation: `s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables. The argument `0` passed to `s` below indicates that this smoother will apply to the first column of a feature matrix. Below, we pass it a matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.

```{python}
X_age = np.asarray(age).reshape((-1,1))
gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)
```

The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the size of that dimension based on the remaining entries of the shape tuple.
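As a quick illustration of this reshaping:

```{python}
import numpy as np

v = np.arange(6)           # shape (6,): a one-dimensional array
m = v.reshape((-1, 1))     # -1 is inferred as 6, giving shape (6, 1)
print(v.shape, m.shape)
```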
Let's investigate how the fit changes with the smoothing parameter `lam`. The function `np.logspace()` is similar to `np.linspace()` but spaces points evenly on the log-scale. Below we vary `lam` from $10^{-2}$ to $10^6$.
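Concretely, the endpoints of `np.logspace()` are base-10 exponents:

```{python}
import numpy as np

lams = np.logspace(-2, 6, 5)   # 5 points, exponents -2, 0, 2, 4, 6
print(lams)
```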
```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for lam in np.logspace(-2, 6, 5):
    gam = LinearGAM(s_gam(0, lam=lam)).fit(X_age, y)
    ax.plot(age_grid,
            gam.predict(age_grid),
            label='{:.1e}'.format(lam),
            linewidth=3)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title=r'$\lambda$');
```

The `pygam` package can perform a search for an optimal smoothing parameter.

```{python}
gam_opt = gam.gridsearch(X_age, y)
ax.plot(age_grid,
        gam_opt.predict(age_grid),
        label='Grid search',
        linewidth=4)
ax.legend()
fig
```

Alternatively, we can fix the degrees of freedom of the smoothing spline using a function included in the `ISLP.pygam` package. Below we find a value of $\lambda$ that gives us roughly four degrees of freedom. We note here that these degrees of freedom include the unpenalized intercept and linear term of the smoothing spline, hence there are at least two degrees of freedom.
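The notion of effective degrees of freedom for a penalized fit is the trace of the corresponding hat (smoother) matrix. Here is a minimal sketch for the analogous ridge-type smoother on random data; this is the same idea, not `pygam`'s exact computation.

```{python}
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))

def eff_df(X, lam):
    # trace of X (X'X + lam I)^{-1} X', computed via the singular values of X
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

print(eff_df(X, 0.0))    # no penalty: the full 5 degrees of freedom
print(eff_df(X, 1e6))    # heavy penalty: shrinks toward 0
```

`approx_lam()` inverts this relationship numerically, searching for the `lam` that yields a requested effective df.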
```{python}
age_term = gam.terms[0]
lam_4 = approx_lam(X_age, age_term, 4)
age_term.lam = lam_4
degrees_of_freedom(X_age, age_term)
```

Let's vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom as the desired degrees of freedom plus one to account for the fact that these smoothing splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.

```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(X_age,
           y,
           facecolor='gray',
           alpha=0.3)
for df in [1,3,4,8,15]:
    lam = approx_lam(X_age, age_term, df+1)
    age_term.lam = lam
    gam.fit(X_age, y)
    ax.plot(age_grid,
            gam.predict(age_grid),
            label='{:d}'.format(df),
            linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='Degrees of freedom');
```

### Additive Models with Several Terms
The strength of generalized additive models lies in their ability to fit multivariate regression models with more flexibility than linear models. We demonstrate two approaches: the first in a more manual fashion using natural splines and piecewise constant functions, and the second using the `pygam` package and smoothing splines.

We now fit a GAM by hand to predict `wage` using natural spline functions of `year` and `age`, treating `education` as a qualitative predictor, as in (7.16). Since this is just a big linear regression model using an appropriate choice of basis functions, we can simply do this using the `sm.OLS()` function.

We will build the model matrix in a more manual fashion here, since we wish to access the pieces separately when constructing partial dependence plots.

```{python}
ns_age = NaturalSpline(df=4).fit(age)
ns_year = NaturalSpline(df=5).fit(Wage['year'])
Xs = [ns_age.transform(age),
      ns_year.transform(Wage['year']),
      pd.get_dummies(Wage['education']).values]
X_bh = np.hstack(Xs)
gam_bh = sm.OLS(y, X_bh).fit()
```

Here the function `NaturalSpline()` is the workhorse supporting the `ns()` helper function. We chose to use all columns of the indicator matrix for the categorical variable `education`, making an intercept redundant. Finally, we stacked the three component matrices horizontally to form the model matrix `X_bh`.

We now show how to construct partial dependence plots for each of the terms in our rudimentary GAM. We can do this by hand, given grids for `age` and `year`. We simply predict with new $X$ matrices, fixing all but one of the features at a time.
```{python}
age_grid = np.linspace(age.min(),
                       age.max(),
                       100)
X_age_bh = X_bh.copy()[:100]
X_age_bh[:] = X_bh[:].mean(0)[None,:]
X_age_bh[:,:4] = ns_age.transform(age_grid)
preds = gam_bh.get_prediction(X_age_bh)
bounds_age = preds.conf_int(alpha=0.05)
partial_age = preds.predicted_mean
center = partial_age.mean()
partial_age -= center
bounds_age -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(age_grid, partial_age, 'b', linewidth=3)
ax.plot(age_grid, bounds_age[:,0], 'r--', linewidth=3)
ax.plot(age_grid, bounds_age[:,1], 'r--', linewidth=3)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage', fontsize=20);
```

Let's explain in some detail what we did above. The idea is to create a new prediction matrix, where all but the columns belonging to `age` are constant (and set to their training-data means). The four columns for `age` are filled in with the natural spline basis evaluated at the 100 values in `age_grid`.

* We made a grid of length 100 in `age`, and created a matrix `X_age_bh` with 100 rows and the same number of columns as `X_bh`.
* We replaced every row of this matrix with the column means of the original.
* We then replaced just the first four columns representing `age` with the natural spline basis computed at the values in `age_grid`.

The remaining steps should by now be familiar.
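The same recipe can be sketched in isolation on a toy linear model (all data here are made up): fix every column at its mean, sweep one column over a grid, predict, and center.

```{python}
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n),
                     rng.uniform(20, 60, n),   # an "age"-like feature
                     rng.uniform(0, 1, n)])    # some other feature
y_toy = X @ np.array([5.0, 2.0, -3.0]) + rng.standard_normal(n)
bhat = np.linalg.lstsq(X, y_toy, rcond=None)[0]

grid = np.linspace(20, 60, 100)
X_pd = np.tile(X.mean(0), (100, 1))  # every row = the column means
X_pd[:, 1] = grid                    # sweep the feature of interest
partial = X_pd @ bhat
partial -= partial.mean()            # center, as in the lab
```

For a purely linear term the resulting partial-dependence curve is a straight line whose slope is the fitted coefficient.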
We also look at the effect of `year` on `wage`; the process is the same.

```{python}
year_grid = np.linspace(Wage['year'].min(),
                        Wage['year'].max(),
                        100)
X_year_bh = X_bh.copy()[:100]
X_year_bh[:] = X_bh[:].mean(0)[None,:]
X_year_bh[:,4:9] = ns_year.transform(year_grid)
preds = gam_bh.get_prediction(X_year_bh)
bounds_year = preds.conf_int(alpha=0.05)
partial_year = preds.predicted_mean
center = partial_year.mean()
partial_year -= center
bounds_year -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(year_grid, partial_year, 'b', linewidth=3)
ax.plot(year_grid, bounds_year[:,0], 'r--', linewidth=3)
ax.plot(year_grid, bounds_year[:,1], 'r--', linewidth=3)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```

We now fit the model (7.16) using smoothing splines rather than natural splines. All of the terms in (7.16) are fit simultaneously, taking each other into account to explain the response. The `pygam` package only works with matrices, so we must convert the categorical series `education` to its array representation, which can be found with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we use only seven basis functions for it.
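The `cat.codes` attribute simply maps each category to an integer; on a small made-up series:

```{python}
import pandas as pd

s = pd.Series(['HS', 'College', 'HS', 'Advanced'], dtype='category')
print(list(s.cat.categories))   # categories sorted alphabetically by default
print(list(s.cat.codes))        # one integer code per observation
```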
```{python}
gam_full = LinearGAM(s_gam(0) +
                     s_gam(1, n_splines=7) +
                     f_gam(2, lam=0))
Xgam = np.column_stack([age,
                        Wage['year'],
                        Wage['education'].cat.codes])
gam_full = gam_full.fit(Xgam, y)
```

The two `s_gam()` terms result in smoothing spline fits, and use a default value for $\lambda$ (`lam=0.6`), which is somewhat arbitrary. For the categorical term `education`, specified using an `f_gam()` term, we specify `lam=0` to avoid any shrinkage. We produce the partial dependence plot in `age` to see the effect of these choices.

The values for the plot are generated by the `pygam` package. We provide a `plot_gam()` function for partial-dependence plots in `ISLP.pygam`, which makes this job easier than in our last example with natural splines.

```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage - default lam=0.6', fontsize=20);
```

We see that the function is somewhat wiggly. It is more natural to specify the `df` than a value for `lam`. We refit a GAM using four degrees of freedom each for `age` and `year`. Recall that the addition of one below takes into account the intercept of the smoothing spline.
```{python}
age_term = gam_full.terms[0]
age_term.lam = approx_lam(Xgam, age_term, df=4+1)
year_term = gam_full.terms[1]
year_term.lam = approx_lam(Xgam, year_term, df=4+1)
gam_full = gam_full.fit(Xgam, y)
```

Note that updating `age_term.lam` above updates it in `gam_full.terms[0]` as well! Likewise for `year_term.lam`.

Repeating the plot for `age`, we see that it is much smoother. We also produce the plot for `year`.

```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full,
         1,
         ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```

Finally we plot `education`, which is categorical. The partial dependence plot is different, and more suitable for the set of fitted constants for each level of this variable.

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_full, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
             fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```

### ANOVA Tests for Additive Models
In all of our models, the function of `year` looks rather linear. We can perform a series of ANOVA tests in order to determine which of these three models is best: a GAM that excludes `year` ($\mathcal{M}_1$), a GAM that uses a linear function of `year` ($\mathcal{M}_2$), or a GAM that uses a spline function of `year` ($\mathcal{M}_3$).

```{python}
gam_0 = LinearGAM(age_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear = LinearGAM(age_term +
                       l_gam(1, lam=0) +
                       f_gam(2, lam=0))
gam_linear.fit(Xgam, y)
```

Notice our use of `age_term` in the expressions above. We do this because earlier we set the value for `lam` in this term to achieve four degrees of freedom.

To directly assess the effect of `year` we run an ANOVA on the three models fit above.

```{python}
anova_gam(gam_0, gam_linear, gam_full)
```

We find that there is compelling evidence that a GAM with a linear function in `year` is better than a GAM that does not include `year` at all ($p$-value = 0.002). However, there is no evidence that a non-linear function of `year` is needed ($p$-value = 0.435). In other words, based on the results of this ANOVA, $\mathcal{M}_2$ is preferred.

We can repeat the same process for `age` as well. We see there is very clear evidence that a non-linear term is required for `age`.

```{python}
gam_0 = LinearGAM(year_term +
                  f_gam(2, lam=0))
gam_linear = LinearGAM(l_gam(0, lam=0) +
                       year_term +
                       f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```

There is a (verbose) `summary()` method for the GAM fit.

```{python}
gam_full.summary()
```

We can make predictions from `gam` objects, just like from `lm` objects, using the `predict()` method for the class `gam`. Here we make predictions on the training set.

```{python}
Yhat = gam_full.predict(Xgam)
```

In order to fit a logistic regression GAM, we use `LogisticGAM()` from `pygam`.

```{python}
gam_logit = LogisticGAM(age_term +
                        l_gam(1, lam=0) +
                        f_gam(2, lam=0))
gam_logit.fit(Xgam, high_earn)
```

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
             fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```

The model seems to be very flat, with especially high error bars for the first category. Let's look at the data a bit more closely.

```{python}
pd.crosstab(Wage['high_earn'], Wage['education'])
```

We see that there are no high earners in the first category of education, meaning that the model will have a hard time fitting. We will fit a logistic regression GAM excluding all observations falling into this category. This provides more sensible results.

To do so, we could subset the model matrix, though this will not remove the column from `Xgam`. While we can deduce which column corresponds to this feature, for reproducibility's sake we reform the model matrix on this smaller subset.

```{python}
only_hs = Wage['education'] == '1. < HS Grad'
Wage_ = Wage.loc[~only_hs]
Xgam_ = np.column_stack([Wage_['age'],
                         Wage_['year'],
                         Wage_['education'].cat.codes-1])
high_earn_ = Wage_['high_earn']
```

In the second-to-last line above, we subtract one from the codes of the category, due to a bug in `pygam`. It just relabels the education values and hence has no effect on the fit.

We now fit the model.
```{python}
gam_logit_ = LogisticGAM(age_term +
                         year_term +
                         f_gam(2, lam=0))
gam_logit_.fit(Xgam_, high_earn_)
```

Let's look at the effect of `education`, `year` and `age` on high earner status now that we've removed those observations.
```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on education', fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories[1:],
                   fontsize=8);
```

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 1, ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on year',
             fontsize=20);
```

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);
```

## Local Regression
We illustrate the use of local regression using the `lowess()` function from `sm.nonparametric`. Some implementations of GAMs allow terms to be local regression operators; this is not the case in `pygam`.

Here we fit local linear regression models using spans of 0.2 and 0.5; that is, each neighborhood consists of 20% or 50% of the observations. As expected, the fit using a span of 0.5 is smoother than the one using a span of 0.2.
```{python}
lowess = sm.nonparametric.lowess
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for span in [0.2, 0.5]:
    fitted = lowess(y,
                    age,
                    frac=span,
                    xvals=age_grid)
    ax.plot(age_grid,
            fitted,
            label='{:.1f}'.format(span),
            linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='span', fontsize=15);
```

# Tree-Based Methods

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch08-baggboost-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch08-baggboost-lab.ipynb)

We import some of our usual libraries at this top level.

```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS
```

We also collect the new imports needed for this lab.

```{python}
from sklearn.tree import (DecisionTreeClassifier as DTC,
                          DecisionTreeRegressor as DTR,
                          plot_tree,
                          export_text)
from sklearn.metrics import (accuracy_score,
                             log_loss)
from sklearn.ensemble import \
     (RandomForestRegressor as RF,
      GradientBoostingRegressor as GBR)
from ISLP.bart import BART
```

## Fitting Classification Trees

We first use classification trees to analyze the `Carseats` data set. In these data, `Sales` is a continuous variable, and so we begin by recoding it as a binary variable. We use the `np.where()` function to create a variable, called `High`, which takes on a value of `Yes` if the `Sales` variable exceeds 8, and takes on a value of `No` otherwise.
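`np.where()` acts elementwise, picking from its second or third argument according to the condition; on some made-up sales figures:

```{python}
import numpy as np

sales = np.array([9.5, 7.2, 11.4, 8.0])   # toy data for illustration
high = np.where(sales > 8, "Yes", "No")
print(high)                               # note 8.0 itself is not > 8
```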
```{python}
Carseats = load_data('Carseats')
High = np.where(Carseats.Sales > 8,
                "Yes",
                "No")
```

We now use `DecisionTreeClassifier()` to fit a classification tree in order to predict `High` using all variables but `Sales`. To do so, we must form a model matrix as we did when fitting regression models.

```{python}
model = MS(Carseats.columns.drop('Sales'), intercept=False)
D = model.fit_transform(Carseats)
feature_names = list(D.columns)
X = np.asarray(D)
```

We have converted `D` from a data frame to an array `X`, which is needed in some of the analysis below. We also need the `feature_names` for annotating our plots later.

There are several options needed to specify the classifier, such as `max_depth` (how deep to grow the tree), `min_samples_split` (minimum number of observations in a node to be eligible for splitting) and `criterion` (whether to use Gini or cross-entropy as the split criterion). We also set `random_state` for reproducibility; ties in the split criterion are broken at random.
```{python}
clf = DTC(criterion='entropy',
          max_depth=3,
          random_state=0)
clf.fit(X, High)
```

In our discussion of qualitative features in Section 3.3, we noted that for a linear regression model such a feature could be represented by including a matrix of dummy variables (one-hot-encoding) in the model matrix, using the formula notation of `statsmodels`. As mentioned in Section 8.1, there is a more natural way to handle qualitative features when building a decision tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups. However, the `sklearn` implementation of decision trees does not take advantage of this approach; instead it simply treats the one-hot-encoded levels as separate variables.
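One-hot encoding of a qualitative feature can be produced directly with `pd.get_dummies()`; a toy example with a made-up `ShelveLoc`-style column:

```{python}
import pandas as pd

df = pd.DataFrame({'ShelveLoc': ['Bad', 'Good', 'Medium', 'Good']})
dummies = pd.get_dummies(df['ShelveLoc'])   # one 0/1 column per level
print(dummies)
```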
```{python}
accuracy_score(High, clf.predict(X))
```

The training error rate of this tree is 21%. For classification trees, we can access the value of the deviance using `log_loss()`,
\begin{equation*}
\begin{split}
-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk},
\end{split}
\end{equation*}
where $n_{mk}$ is the number of observations in the $m$th terminal node that belong to the $k$th class.
```{python}
resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
resid_dev
```

This is closely related to the *entropy*, defined in (8.7). A small deviance indicates a tree that provides a good fit to the (training) data.
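As a sanity check on what `log_loss()` computes — the average negative log-likelihood over observations — here is a hand calculation on made-up probabilities:

```{python}
import numpy as np
from sklearn.metrics import log_loss

y_toy = ["No", "No", "Yes"]
p_toy = np.array([[0.8, 0.2],    # columns: P(No), P(Yes), labels sorted
                  [0.6, 0.4],
                  [0.3, 0.7]])
by_hand = -np.mean([np.log(0.8), np.log(0.6), np.log(0.7)])
print(np.isclose(log_loss(y_toy, p_toy), by_hand))
```

Multiplying this average by twice the number of observations gives the deviance displayed above.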
One of the most attractive properties of trees is that they can be graphically displayed. Here we use the `plot_tree()` function to display the tree structure.

```{python}
ax = subplots(figsize=(12,12))[1]
plot_tree(clf,
          feature_names=feature_names,
          ax=ax);
```

The most important indicator of `Sales` appears to be `ShelveLoc`.

We can see a text representation of the tree using `export_text()`, which displays the split criterion (e.g. `Price <= 92.5`) for each branch. For leaf nodes it shows the overall prediction (`Yes` or `No`). We can also see the number of observations in that leaf that take on values of `Yes` and `No` by specifying `show_weights=True`.

```{python}
print(export_text(clf,
                  feature_names=feature_names,
                  show_weights=True))
```

In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. This pattern is similar to that in Chapter 6, with the linear models replaced here by decision trees --- the code for validation is almost identical. This approach leads to correct predictions for 68.5% of the locations in the test data set.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=200,
                              random_state=0)
results = skm.cross_validate(clf,
                             D,
                             High,
                             cv=validation)
results['test_score']
```

Next, we consider whether pruning the tree might lead to improved classification performance. We first split the data into a training and test set. We will use cross-validation to prune the tree on the training set, and then evaluate the performance of the pruned tree on the test set.

```{python}
(X_train,
 X_test,
 High_train,
 High_test) = skm.train_test_split(X,
                                   High,
                                   test_size=0.5,
                                   random_state=0)
```

|
||||
We first refit the full tree on the training set; here we do not set a `max_depth` parameter, since we will learn that through cross-validation.
|
||||
|
||||
|
||||
```{python}
|
||||
clf = DTC(criterion='entropy', random_state=0)
|
||||
clf.fit(X_train, High_train)
|
||||
accuracy_score(High_test, clf.predict(X_test))
|
||||
|
||||
```
|
||||
Next we use the `cost_complexity_pruning_path()` method of
|
||||
`clf` to extract cost-complexity values.
|
||||
|
||||
```{python}
ccp_path = clf.cost_complexity_pruning_path(X_train, High_train)
kfold = skm.KFold(10,
                  random_state=1,
                  shuffle=True)

```

This yields a set of impurities and $\alpha$ values
from which we can extract an optimal one by cross-validation.

```{python}
grid = skm.GridSearchCV(clf,
                        {'ccp_alpha': ccp_path.ccp_alphas},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X_train, High_train)
grid.best_score_

```

Let’s take a look at the pruned tree.

```{python}
ax = subplots(figsize=(12, 12))[1]
best_ = grid.best_estimator_
plot_tree(best_,
          feature_names=feature_names,
          ax=ax);

```

This is quite a bushy tree. We could count the leaves, or query
`best_` instead.

```{python}
best_.tree_.n_leaves

```

The tree with 30 terminal
nodes results in the lowest cross-validation error rate, with an accuracy of
68.5%. How well does this pruned tree perform on the test data set? Once
again, we apply the `predict()` function.

```{python}
print(accuracy_score(High_test,
                     best_.predict(X_test)))
confusion = confusion_table(best_.predict(X_test),
                            High_test)
confusion

```

Now 72.0% of the test observations are correctly classified, which is slightly worse than the accuracy of the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at the cost of slightly lower accuracy. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.

## Fitting Regression Trees
Here we fit a regression tree to the `Boston` data set. The
steps are similar to those for classification trees.

```{python}
Boston = load_data("Boston")
model = MS(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)

```

First, we split the data into training and test sets, and fit the tree
to the training data. Here we use 30% of the data for the test set.

```{python}
(X_train,
 X_test,
 y_train,
 y_test) = skm.train_test_split(X,
                                Boston['medv'],
                                test_size=0.3,
                                random_state=0)

```

Having formed our training and test data sets, we fit the regression tree.

```{python}
reg = DTR(max_depth=3)
reg.fit(X_train, y_train)
ax = subplots(figsize=(12,12))[1]
plot_tree(reg,
          feature_names=feature_names,
          ax=ax);

```

The variable `lstat` measures the percentage of individuals with
lower socioeconomic status. The tree indicates that lower
values of `lstat` correspond to more expensive houses.
The tree predicts a median house price of $12,042 for small-sized homes (`rm < 6.8`), in
suburbs in which residents have low socioeconomic status (`lstat > 14.4`) and the crime rate is moderate (`crim > 5.8`).

Now we use the cross-validation function to see whether pruning
the tree will improve performance.

```{python}
ccp_path = reg.cost_complexity_pruning_path(X_train, y_train)
kfold = skm.KFold(5,
                  shuffle=True,
                  random_state=10)
grid = skm.GridSearchCV(reg,
                        {'ccp_alpha': ccp_path.ccp_alphas},
                        refit=True,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
G = grid.fit(X_train, y_train)

```

In keeping with the cross-validation results, we use the pruned tree
to make predictions on the test set.

```{python}
best_ = grid.best_estimator_
np.mean((y_test - best_.predict(X_test))**2)

```

In other words, the test set MSE associated with the regression tree
is 28.07. The square root of the MSE is therefore around 5.30,
indicating that this model leads to test predictions that are within around
$5300 of the true median home value for the suburb.
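As a quick arithmetic check (using the MSE value reported above), the conversion from a squared error to an error in dollars is just a square root:

```python
import numpy as np

mse = 28.07            # test MSE reported above, in squared $1000s
rmse = np.sqrt(mse)    # root mean squared error, in $1000s
print(round(rmse, 2))  # about 5.3, i.e. roughly $5,300
```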

Let’s plot the best tree to see how interpretable it is.

```{python}
ax = subplots(figsize=(12,12))[1]
plot_tree(G.best_estimator_,
          feature_names=feature_names,
          ax=ax);

```

## Bagging and Random Forests

Here we apply bagging and random forests to the `Boston` data, using
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
that bagging is simply a special case of a random forest with
$m=p$. Therefore, the `RandomForestRegressor()` function can be used to
perform both bagging and random forests. We start with bagging.

```{python}
bag_boston = RF(max_features=X_train.shape[1], random_state=0)
bag_boston.fit(X_train, y_train)

```

The argument `max_features` indicates that all 12 predictors should
be considered for each split of the tree --- in other words, that
bagging should be done. How well does this bagged model perform on
the test set?

```{python}
ax = subplots(figsize=(8,8))[1]
y_hat_bag = bag_boston.predict(X_test)
ax.scatter(y_hat_bag, y_test)
np.mean((y_test - y_hat_bag)**2)

```

The test set MSE associated with the bagged regression tree is
14.63, about half that obtained using an optimally-pruned single
tree. We could change the number of trees grown from the default of
100 by using the `n_estimators` argument:

```{python}
bag_boston = RF(max_features=X_train.shape[1],
                n_estimators=500,
                random_state=0).fit(X_train, y_train)
y_hat_bag = bag_boston.predict(X_test)
np.mean((y_test - y_hat_bag)**2)
```

There is not much change. Bagging and random forests cannot overfit by
increasing the number of trees, but can underfit if the number is too small.
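This stability claim is easy to illustrate directly. A minimal sketch on synthetic data (not the `Boston` set), comparing test MSE for a few values of `n_estimators`; the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic regression data, used only to illustrate the claim
X, y = make_regression(n_samples=400, n_features=12, noise=10,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mses = {}
for n in [10, 100, 500]:
    rf = RandomForestRegressor(n_estimators=n,
                               max_features=X_tr.shape[1],  # i.e. bagging
                               random_state=0).fit(X_tr, y_tr)
    mses[n] = np.mean((y_te - rf.predict(X_te))**2)
    print(n, round(mses[n], 1))
```

The test MSE typically drops as the first trees are added and then flattens out; adding more trees past that point does not cause overfitting.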

Growing a random forest proceeds in exactly the same way, except that
we use a smaller value of the `max_features` argument. By default,
`RandomForestRegressor()` uses $p$ variables when building a random
forest of regression trees (i.e. it defaults to bagging), and `RandomForestClassifier()` uses
$\sqrt{p}$ variables when building a
random forest of classification trees. Here we use `max_features=6`.

```{python}
RF_boston = RF(max_features=6,
               random_state=0).fit(X_train, y_train)
y_hat_RF = RF_boston.predict(X_test)
np.mean((y_test - y_hat_RF)**2)

```

The test set MSE is 20.04;
this indicates that random forests did somewhat worse than bagging
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
importance of each variable.

```{python}
feature_imp = pd.DataFrame(
    {'importance': RF_boston.feature_importances_},
    index=feature_names)
feature_imp.sort_values(by='importance', ascending=False)
```

This is a relative measure of the total decrease in node impurity that results from
splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data).

The results indicate that across all of the trees considered in the
random forest, the wealth level of the community (`lstat`) and the
house size (`rm`) are by far the two most important variables.

## Boosting

Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
to fit boosted regression trees to the `Boston` data
set. For classification we would use `GradientBoostingClassifier()`.
The argument `n_estimators=5000`
indicates that we want 5000 trees, and the option
`max_depth=3` limits the depth of each tree. The
argument `learning_rate` is the $\lambda$
mentioned earlier in the description of boosting.

```{python}
boost_boston = GBR(n_estimators=5000,
                   learning_rate=0.001,
                   max_depth=3,
                   random_state=0)
boost_boston.fit(X_train, y_train)

```

We can see how the training error decreases with the `train_score_` attribute.
To get an idea of how the test error decreases we can use the
`staged_predict()` method to get the predicted values along the path.

```{python}
test_error = np.zeros_like(boost_boston.train_score_)
for idx, y_ in enumerate(boost_boston.staged_predict(X_test)):
    test_error[idx] = np.mean((y_test - y_)**2)

plot_idx = np.arange(boost_boston.train_score_.shape[0])
ax = subplots(figsize=(8,8))[1]
ax.plot(plot_idx,
        boost_boston.train_score_,
        'b',
        label='Training')
ax.plot(plot_idx,
        test_error,
        'r',
        label='Test')
ax.legend();

```

We now use the boosted model to predict `medv` on the test set:

```{python}
y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)

```

The test MSE obtained is 14.48,
similar to the test MSE for bagging. If we want to, we can
perform boosting with a different value of the shrinkage parameter
$\lambda$ in (8.10). The default value is 0.001, but
this is easily modified. Here we take $\lambda=0.2$.

```{python}
boost_boston = GBR(n_estimators=5000,
                   learning_rate=0.2,
                   max_depth=3,
                   random_state=0)
boost_boston.fit(X_train,
                 y_train)
y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)

```

In this case, using $\lambda=0.2$ leads to almost the same test MSE
as when using $\lambda=0.001$.

## Bayesian Additive Regression Trees

In this section we demonstrate a `Python` implementation of BART found in the
`ISLP.bart` package. We fit a model
to the `Boston` housing data set. This `BART()` estimator is
designed for quantitative outcome variables, though other implementations are available for
fitting logistic and probit models to categorical outcomes.

```{python}
bart_boston = BART(random_state=0, burnin=5, ndraw=15)
bart_boston.fit(X_train, y_train)

```

On this data set, with this split into test and training, we see that the test error of BART is similar to that of the random forest.

```{python}
yhat_test = bart_boston.predict(X_test.astype(np.float32))
np.mean((y_test - yhat_test)**2)

```

We can check how many times each variable appeared in the collection of trees.
This gives a summary similar to the variable importance plot for boosting and random forests.

```{python}
var_inclusion = pd.Series(bart_boston.variable_inclusion_.mean(0),
                          index=D.columns)
var_inclusion

```

Ch09-svm-lab.Rmd
@@ -1,550 +0,0 @@

# Support Vector Machines

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch09-svm-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch09-svm-lab.ipynb)

In this lab, we use the `sklearn.svm` library to demonstrate the support
vector classifier and the support vector machine.

We import some of our usual libraries.

```{python}
import numpy as np
from matplotlib.pyplot import subplots, cm
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table

```

We also collect the new imports
needed for this lab.

```{python}
from sklearn.svm import SVC
from ISLP.svm import plot as plot_svm
from sklearn.metrics import RocCurveDisplay

```

We will use the function `RocCurveDisplay.from_estimator()` to
produce several ROC plots, using a shorthand `roc_curve`.

```{python}
roc_curve = RocCurveDisplay.from_estimator  # shorthand

```

## Support Vector Classifier

We now use the `SupportVectorClassifier()` function (abbreviated `SVC()`) from `sklearn` to fit the support vector
classifier for a given value of the parameter `C`. The
`C` argument allows us to specify the cost of a violation to
the margin. When the `C` argument is small, then the margins
will be wide and many support vectors will be on the margin or will
violate the margin. When the `C` argument is large, then the
margins will be narrow and there will be few support vectors on the
margin or violating the margin.

Here we demonstrate
the use of `SVC()` on a two-dimensional example, so that we can
plot the resulting decision boundary. We begin by generating the
observations, which belong to two classes, and checking whether the
classes are linearly separable.

```{python}
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.array([-1]*25+[1]*25)
X[y==1] += 1
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
           X[:,1],
           c=y,
           cmap=cm.coolwarm);

```

They are not. We now fit the classifier.

```{python}
svm_linear = SVC(C=10, kernel='linear')
svm_linear.fit(X, y)

```

The support vector classifier with two features can
be visualized by plotting values of its *decision function*.
We have included a function for this in the `ISLP` package (inspired by a similar
example in the `sklearn` docs).

```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_linear,
         ax=ax)

```

The decision
boundary between the two classes is linear (because we used the
argument `kernel='linear'`). The support vectors are marked with `+`
and the remaining observations are plotted as circles.

What if we instead used a smaller value of the cost parameter?

```{python}
svm_linear_small = SVC(C=0.1, kernel='linear')
svm_linear_small.fit(X, y)
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_linear_small,
         ax=ax)

```

With a smaller value of the cost parameter, we
obtain a larger number of support vectors, because the margin is now
wider. For linear kernels, we can extract the
coefficients of the linear decision boundary as follows:

```{python}
svm_linear.coef_

```

Since the support vector machine is an estimator in `sklearn`, we
can use the usual machinery to tune it.

```{python}
kfold = skm.KFold(5,
                  random_state=0,
                  shuffle=True)
grid = skm.GridSearchCV(svm_linear,
                        {'C': [0.001, 0.01, 0.1, 1, 5, 10, 100]},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X, y)
grid.best_params_

```

We can easily access the cross-validation errors for each of these models
in `grid.cv_results_`. This prints out a lot of detail, so we
extract the accuracy results only.

```{python}
grid.cv_results_[('mean_test_score')]

```

We see that `C=1` results in the highest cross-validation
accuracy of 0.74, though
the accuracy is the same for several values of `C`.
The classifier `grid.best_estimator_` can be used to predict the class
label on a set of test observations. Let’s generate a test data set.

```{python}
X_test = rng.standard_normal((20, 2))
y_test = np.array([-1]*10+[1]*10)
X_test[y_test==1] += 1

```

Now we predict the class labels of these test observations. Here we
use the best model selected by cross-validation in order to make the
predictions.

```{python}
best_ = grid.best_estimator_
y_test_hat = best_.predict(X_test)
confusion_table(y_test_hat, y_test)

```

Thus, with this value of `C`,
70% of the test
observations are correctly classified. What if we had instead used
`C=0.001`?

```{python}
svm_ = SVC(C=0.001,
           kernel='linear').fit(X, y)
y_test_hat = svm_.predict(X_test)
confusion_table(y_test_hat, y_test)

```

In this case 60% of test observations are correctly classified.

We now consider a situation in which the two classes are linearly
separable. Then we can find an optimal separating hyperplane using the
`SVC()` estimator. We first
further separate the two classes in our simulated data so that they
are linearly separable:

```{python}
X[y==1] += 1.9;
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);

```

Now the observations are just barely linearly separable.

```{python}
svm_ = SVC(C=1e5, kernel='linear').fit(X, y)
y_hat = svm_.predict(X)
confusion_table(y_hat, y)

```

We fit the
support vector classifier and plot the resulting hyperplane, using a
very large value of `C` so that no observations are
misclassified.

```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_,
         ax=ax)

```

Indeed no training errors were made and only three support vectors were used.
In fact, the large value of `C` also means that these three support points are *on the margin*, and define it.
One may wonder how good the classifier could be on test data that depends on only three data points!
We now try a smaller
value of `C`.

```{python}
svm_ = SVC(C=0.1, kernel='linear').fit(X, y)
y_hat = svm_.predict(X)
confusion_table(y_hat, y)

```

Using `C=0.1`, we again do not misclassify any training observations, but we
also obtain a much wider margin and make use of twelve support
vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test
data than the model with `C=1e5` (and indeed, a simple experiment with a large test set would bear this out).

```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_,
         ax=ax)

```

## Support Vector Machine
In order to fit an SVM using a non-linear kernel, we once again use
the `SVC()` estimator. However, now we use a different value
of the parameter `kernel`. To fit an SVM with a polynomial
kernel we use `kernel="poly"`, and to fit an SVM with a
radial kernel we use
`kernel="rbf"`. In the former case we also use the
`degree` argument to specify a degree for the polynomial kernel
(this is $d$ in (9.22)), and in the latter case we use
`gamma` to specify a value of $\gamma$ for the radial basis
kernel (9.24).

We first generate some data with a non-linear class boundary, as follows:

```{python}
X = rng.standard_normal((200, 2))
X[:100] += 2
X[100:150] -= 2
y = np.array([1]*150+[2]*50)

```

Plotting the data makes it clear that the class boundary is indeed non-linear.

```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
           X[:,1],
           c=y,
           cmap=cm.coolwarm);

```

The data is randomly split into training and testing groups. We then
fit the training data using the `SVC()` estimator with a
radial kernel and $\gamma=1$:

```{python}
(X_train,
 X_test,
 y_train,
 y_test) = skm.train_test_split(X,
                                y,
                                test_size=0.5,
                                random_state=0)
svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
svm_rbf.fit(X_train, y_train)

```

The plot shows that the resulting SVM has a decidedly non-linear
boundary.

```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         svm_rbf,
         ax=ax)

```

We can see from the figure that there are a fair number of training
errors in this SVM fit. If we increase the value of `C`, we
can reduce the number of training errors. However, this comes at the
price of a more irregular decision boundary that seems to be at risk
of overfitting the data.

```{python}
svm_rbf = SVC(kernel="rbf", gamma=1, C=1e5)
svm_rbf.fit(X_train, y_train)
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         svm_rbf,
         ax=ax)

```

We can perform cross-validation using `skm.GridSearchCV()` to select the
best choice of $\gamma$ and `C` for an SVM with a radial
kernel:

```{python}
kfold = skm.KFold(5,
                  random_state=0,
                  shuffle=True)
grid = skm.GridSearchCV(svm_rbf,
                        {'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [0.5, 1, 2, 3, 4]},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy');
grid.fit(X_train, y_train)
grid.best_params_

```

The best choice of parameters under five-fold CV is achieved at `C=1`
and `gamma=0.5`, though several other values also achieve the same
value.

```{python}
best_svm = grid.best_estimator_
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         best_svm,
         ax=ax)

y_hat_test = best_svm.predict(X_test)
confusion_table(y_hat_test, y_test)

```

With these parameters, 12% of test
observations are misclassified by this SVM.

## ROC Curves

SVMs and support vector classifiers output class labels for each
observation. However, it is also possible to obtain *fitted values*
for each observation, which are the numerical scores used to
obtain the class labels. For instance, in the case of a support vector
classifier, the fitted value for an observation $X= (X_1, X_2, \ldots, X_p)^T$
takes the form $\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$.
For an SVM with a
non-linear kernel, the equation that yields the fitted value is given
in (9.23). The sign of the fitted value
determines on which side of the decision boundary the observation
lies. Therefore, the relationship between the fitted value and the
class prediction for a given observation is simple: if the fitted
value exceeds zero then the observation is assigned to one class, and
if it is less than zero then it is assigned to the other.
By changing this threshold from zero to some positive value,
we skew the classifications in favor of one class versus the other.
By considering a range of these thresholds, positive and negative, we produce the ingredients for a ROC plot.
We can access these values by calling the `decision_function()`
method of a fitted SVM estimator.
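As a small illustration of this relationship (on synthetic data, not the data used above), thresholding the decision function at zero reproduces the labels returned by `predict()`:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

svm = SVC(kernel='linear', C=1).fit(X, y)
scores = svm.decision_function(X)   # the fitted values

# the sign of the fitted value gives the predicted class
manual = np.where(scores > 0, 1, -1)
print(np.all(manual == svm.predict(X)))  # True
```

Moving the threshold away from zero (e.g. `scores > 0.5`) trades one kind of error for the other, which is exactly what an ROC curve traces out.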

The function `RocCurveDisplay.from_estimator()` (which we have abbreviated to `roc_curve()`) will produce a plot of a ROC curve. It takes a fitted estimator as its first argument, followed
by a model matrix $X$ and labels $y$. The argument `name` is used in the legend,
while `color` is used for the color of the line. Results are plotted
on our axis object `ax`.

```{python}
fig, ax = subplots(figsize=(8,8))
roc_curve(best_svm,
          X_train,
          y_train,
          name='Training',
          color='r',
          ax=ax);

```

In this example, the SVM appears to provide accurate predictions. By increasing
$\gamma$ we can produce a more flexible fit and generate further
improvements in accuracy.

```{python}
svm_flex = SVC(kernel="rbf",
               gamma=50,
               C=1)
svm_flex.fit(X_train, y_train)
fig, ax = subplots(figsize=(8,8))
roc_curve(svm_flex,
          X_train,
          y_train,
          name=r'Training $\gamma=50$',
          color='r',
          ax=ax);

```

However, these ROC curves are all on the training data. We are really
more interested in the level of prediction accuracy on the test
data. When we compute the ROC curves on the test data, the model with
$\gamma=0.5$ appears to provide the most accurate results.

```{python}
roc_curve(svm_flex,
          X_test,
          y_test,
          name=r'Test $\gamma=50$',
          color='b',
          ax=ax)
fig;

```

Let’s look at our tuned SVM.

```{python}
fig, ax = subplots(figsize=(8,8))
for (X_, y_, c, name) in zip(
     (X_train, X_test),
     (y_train, y_test),
     ('r', 'b'),
     ('CV tuned on training',
      'CV tuned on test')):
    roc_curve(best_svm,
              X_,
              y_,
              name=name,
              ax=ax,
              color=c)

```

## SVM with Multiple Classes

If the response is a factor containing more than two levels, then the
`SVC()` function will perform multi-class classification using
either the one-versus-one approach (when `decision_function_shape=='ovo'`)
or one-versus-rest {One-versus-rest is also known as one-versus-all.} (when `decision_function_shape=='ovr'`).
We explore that setting briefly here by
generating a third class of observations.

```{python}
rng = np.random.default_rng(123)
X = np.vstack([X, rng.standard_normal((50, 2))])
y = np.hstack([y, [0]*50])
X[y==0,1] += 2
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);

```

We now fit an SVM to the data:

```{python}
svm_rbf_3 = SVC(kernel="rbf",
                C=10,
                gamma=1,
                decision_function_shape='ovo');
svm_rbf_3.fit(X, y)
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_rbf_3,
         scatter_cmap=cm.tab10,
         ax=ax)

```

The `sklearn.svm` library can also be used to perform support vector
regression with a numerical response using the `SVR()` estimator.

## Application to Gene Expression Data

We now examine the `Khan` data set, which consists of a number of
tissue samples corresponding to four distinct types of small round
blue cell tumors. For each tissue sample, gene expression measurements
are available. The data set consists of training data, `xtrain`
and `ytrain`, and testing data, `xtest` and `ytest`.

We examine the dimension of the data:

```{python}
Khan = load_data('Khan')
Khan['xtrain'].shape, Khan['xtest'].shape

```

This data set consists of expression measurements for 2,308
genes. The training and test sets consist of 63 and 20
observations, respectively.

We will use a support vector approach to predict cancer subtype using
gene expression measurements. In this data set, there is a very
large number of features relative to the number of observations. This
suggests that we should use a linear kernel, because the additional
flexibility that will result from using a polynomial or radial kernel
is unnecessary.

```{python}
khan_linear = SVC(kernel='linear', C=10)
khan_linear.fit(Khan['xtrain'], Khan['ytrain'])
confusion_table(khan_linear.predict(Khan['xtrain']),
                Khan['ytrain'])

```

We see that there are *no* training
errors. In fact, this is not surprising, because the large number of
variables relative to the number of observations implies that it is
easy to find hyperplanes that fully separate the classes. We are more
interested in the support vector classifier’s performance on the
test observations.

```{python}
confusion_table(khan_linear.predict(Khan['xtest']),
                Khan['ytest'])

```

We see that using `C=10` yields two test set errors on these data.
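The separability claim above is easy to verify: with far more features than observations, even pure-noise features admit a separating hyperplane. A minimal sketch on random data (unrelated to `Khan`):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 500))   # n = 40 observations, p = 500 features
y = np.repeat([1, 2], 20)            # arbitrary labels, independent of X

# with p >> n the classes are (almost surely) linearly separable,
# so a linear SVM can fit the training labels perfectly
clf = SVC(kernel='linear', C=100).fit(X, y)
acc = (clf.predict(X) == y).mean()
print(acc)  # perfect training accuracy on pure noise
```

This is why zero training errors on `Khan` carry little information, and why test-set performance is the quantity of interest.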
File diff suppressed because it is too large
@@ -1,550 +0,0 @@

# Survival Analysis
|
||||
|
||||
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch11-surv-lab.ipynb">
|
||||
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||
</a>
|
||||
|
||||
[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch11-surv-lab.ipynb)
|
||||
|
||||
|
||||
In this lab, we perform survival analyses on three separate data
|
||||
sets. In Section 11.8.1 we analyze the `BrainCancer`
|
||||
data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
|
||||
data from Section 11.5.4. Finally, Section 11.8.3 explores
|
||||
a simulated call-center data set.
|
||||
|
||||
We begin by importing some of our libraries at this top
|
||||
level. This makes the code more readable, as scanning the first few
|
||||
lines of the notebook tell us what libraries are used in this
|
||||
notebook.
|
||||
|

```{python}
from matplotlib.pyplot import subplots
import numpy as np
import pandas as pd
from ISLP.models import ModelSpec as MS
from ISLP import load_data
```

We also collect the new imports
needed for this lab.

```{python}
from lifelines import \
     (KaplanMeierFitter,
      CoxPHFitter)
from lifelines.statistics import \
     (logrank_test,
      multivariate_logrank_test)
from ISLP.survival import sim_time
```

## Brain Cancer Data

We begin with the `BrainCancer` data set, contained in the `ISLP` package.

```{python}
BrainCancer = load_data('BrainCancer')
BrainCancer.columns
```

The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data.

```{python}
BrainCancer['sex'].value_counts()
```

```{python}
BrainCancer['diagnosis'].value_counts()
```

```{python}
BrainCancer['status'].value_counts()
```

Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an
uncensored observation (often death), and a `status` of 0 indicates a censored
observation. But some scientists might use the opposite coding. For
the `BrainCancer` data set 35 patients died before the end of
the study, so we are using the conventional coding.

To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main
package we will use for survival analysis
is `lifelines`.
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
death). The first argument to `km.fit` is the event time, and the
second argument is the censoring variable, with a 1 indicating an observed
failure time. The `plot()` method produces a survival curve with pointwise confidence
intervals. By default, these are 90% confidence intervals, but this can be changed
by setting the `alpha` argument to one minus the desired
confidence level.

```{python}
fig, ax = subplots(figsize=(8,8))
km = KaplanMeierFitter()
km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
```
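
To demystify what the fitted curve represents, here is a minimal pure-Python sketch of the product-limit (Kaplan-Meier) estimator. This is an illustration of the idea, not the `lifelines` implementation:

```python
def km_estimate(times, status):
    # Product-limit estimator: at each distinct event time t, multiply the
    # running survival estimate by (1 - d_t / n_t), where d_t is the number
    # of events at t and n_t is the number at risk just before t.
    data = sorted(zip(times, status))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]   # status 1 = observed event, 0 = censored
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed         # censored subjects leave the risk set too
    return curve

curve = km_estimate([1, 2, 3, 4], [1, 1, 0, 1])
```

Note how the censored observation at time 3 contributes to the risk set before it is removed, but produces no drop in the curve.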

Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure 11.3.
We do this using the `groupby()` method of a dataframe.
This method returns a generator that can
be iterated over in the `for` loop. In this case,
the items in the `for` loop are 2-tuples representing
the groups: the first entry is the value
of the grouping column `sex` while the second value
is the dataframe consisting of all rows in the
dataframe matching that value of `sex`.
We will want to use this data below
in the log-rank test, hence we store this
information in the dictionary `by_sex`. Finally,
we have also used the notion of
*string interpolation* to automatically
label the different lines in the plot. String
interpolation is a powerful technique to format strings ---
`Python` has many ways to facilitate such operations.
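
For instance, the `'Sex=%s' % sex` label used below can be written in several equivalent ways:

```python
sex = 'Male'
a = 'Sex=%s' % sex          # printf-style interpolation, as used in this lab
b = 'Sex={}'.format(sex)    # str.format()
c = f'Sex={sex}'            # f-string
```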

```{python}
fig, ax = subplots(figsize=(8,8))
by_sex = {}
for sex, df in BrainCancer.groupby('sex'):
    by_sex[sex] = df
    km_sex = km.fit(df['time'], df['status'])
    km_sex.plot(label='Sex=%s' % sex, ax=ax)
```

As discussed in Section 11.4, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
The first two arguments are the event times for the two groups, and the
next two are the corresponding (optional) censoring indicators.

```{python}
logrank_test(by_sex['Male']['time'],
             by_sex['Female']['time'],
             by_sex['Male']['status'],
             by_sex['Female']['status'])
```

The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes.

Next, we use the `CoxPHFitter()` estimator
from `lifelines` to fit Cox proportional hazards models.
To begin, we consider a model that uses `sex` as the only predictor.

```{python}
coxph = CoxPHFitter # shorthand
sex_df = BrainCancer[['time', 'status', 'sex']]
model_df = MS(['time', 'status', 'sex'],
              intercept=False).fit_transform(sex_df)
cox_fit = coxph().fit(model_df,
                      'time',
                      'status')
cox_fit.summary[['coef', 'se(coef)', 'p']]
```

The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case).
Note also that the Cox model does not include an intercept, which is why
we used the `intercept=False` argument to `ModelSpec` above.
The `summary` attribute delivers many columns; we chose to abbreviate its output here.
It is possible to obtain the likelihood ratio test comparing this model to the one
with no features as follows:

```{python}
cox_fit.log_likelihood_ratio_test()
```

Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is
exactly equal to the log rank test statistic!

Now we fit a model that makes use of additional predictors. We first note
that one of our `diagnosis` values is missing, hence
we drop that observation before continuing.

```{python}
cleaned = BrainCancer.dropna()
all_MS = MS(cleaned.columns, intercept=False)
all_df = all_MS.fit_transform(cleaned)
fit_all = coxph().fit(all_df,
                      'time',
                      'status')
fit_all.summary[['coef', 'se(coef)', 'p']]
```

The `diagnosis` variable has been coded so that the baseline
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
with meningioma. In other words, after adjusting for the other
predictors, patients with HG glioma have much worse survival compared
to those with meningioma. In addition, larger values of the Karnofsky
index, `ki`, are associated with lower risk, i.e. longer survival.

Finally, we plot estimated survival curves for each diagnosis category,
adjusting for the other predictors. To make these plots, we set the
values of the other predictors equal to the mean for quantitative variables
and equal to the mode for categorical. To do this, we use the
`apply()` method to each column (i.e. `axis=0`) with a function
`representative` that checks if a column is categorical
or not.

```{python}
levels = cleaned['diagnosis'].unique()
def representative(series):
    if hasattr(series.dtype, 'categories'):
        return pd.Series.mode(series)
    else:
        return series.mean()
modal_data = cleaned.apply(representative, axis=0)
```

We make four
copies of the column means and assign the `diagnosis` column to be the four different
diagnoses.

```{python}
modal_df = pd.DataFrame(
              [modal_data.iloc[0] for _ in range(len(levels))])
modal_df['diagnosis'] = levels
modal_df
```

We then construct the model matrix based on the model specification `all_MS` used to fit
the model, and name the rows according to the levels of `diagnosis`.

```{python}
modal_X = all_MS.transform(modal_df)
modal_X.index = levels
modal_X
```

We can use the `predict_survival_function()` method to obtain the estimated survival function.

```{python}
predicted_survival = fit_all.predict_survival_function(modal_X)
predicted_survival
```

This returns a data frame,
whose plot method yields the different survival curves. To avoid clutter in
the plots, we do not display confidence intervals.

```{python}
fig, ax = subplots(figsize=(8, 8))
predicted_survival.plot(ax=ax);
```

## Publication Data

The `Publication` data presented in Section 11.5.4 can be
found in the `ISLP` package.
We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves
stratified on the `posres` variable, which records whether the
study had a positive or negative result.

```{python}
fig, ax = subplots(figsize=(8,8))
Publication = load_data('Publication')
by_result = {}
for result, df in Publication.groupby('posres'):
    by_result[result] = df
    km_result = km.fit(df['time'], df['status'])
    km_result.plot(label='Result=%d' % result, ax=ax)
```

As discussed previously, the $p$-values from fitting Cox’s
proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication
between studies with positive versus negative results.

```{python}
posres_df = MS(['posres',
                'time',
                'status'],
               intercept=False).fit_transform(Publication)
posres_fit = coxph().fit(posres_df,
                         'time',
                         'status')
posres_fit.summary[['coef', 'se(coef)', 'p']]
```

However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism
variable.

```{python}
model = MS(Publication.columns.drop('mech'),
           intercept=False)
coxph().fit(model.fit_transform(Publication),
            'time',
            'status').summary[['coef', 'se(coef)', 'p']]
```

We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results.

## Call Center Data

In this section, we will simulate survival data using the relationship
between cumulative hazard and
the survival function explored in Exercise 8.
Our simulated data will represent the observed
wait times (in seconds) for 2,000 customers who have phoned a call
center. In this context, censoring occurs if a customer hangs up
before his or her call is answered.

There are three covariates: `Operators` (the number of call
center operators available at the time of the call, which can range
from $5$ to $15$), `Center` (either A, B, or C), and
`Time` of day (Morning, Afternoon, or Evening). We generate data
for these covariates so that all possibilities are equally likely: for
instance, morning, afternoon and evening calls are equally likely, and
any number of operators from $5$ to $15$ is equally likely.

```{python}
rng = np.random.default_rng(10)
N = 2000
Operators = rng.choice(np.arange(5, 16),
                       N,
                       replace=True)
Center = rng.choice(['A', 'B', 'C'],
                    N,
                    replace=True)
Time = rng.choice(['Morn.', 'After.', 'Even.'],
                  N,
                  replace=True)
D = pd.DataFrame({'Operators': Operators,
                  'Center': pd.Categorical(Center),
                  'Time': pd.Categorical(Time)})
```

We then build a model matrix (omitting the intercept).

```{python}
model = MS(['Operators',
            'Center',
            'Time'],
           intercept=False)
X = model.fit_transform(D)
```

It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
of the variable is dropped.

```{python}
X[:5]
```
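
The sorted-levels, drop-first convention can be mimicked by hand. Here is a pure-Python sketch for a single categorical column (illustrative only; `ModelSpec` handles this for us):

```python
def one_hot_drop_first(values):
    levels = sorted(set(values))   # levels are sorted, as in the model matrix above
    kept = levels[1:]              # the first level becomes the baseline
    return [[int(v == level) for level in kept] for v in values]

rows = one_hot_drop_first(['B', 'A', 'C'])
```

Here `'A'` maps to all zeros (the baseline), while `'B'` and `'C'` each get their own indicator column.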

Next, we specify the coefficients and the hazard function.

```{python}
true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2])
true_linpred = X.dot(true_beta)
hazard = lambda t: 1e-5 * t
```

Here, we have set the coefficient associated with `Operators` to
equal $0.04$; in other words, each additional operator leads to a
$e^{0.04}=1.041$-fold increase in the “risk” that the call will be
answered, given the `Center` and `Time` covariates. This
makes sense: the greater the number of operators at hand, the shorter
the wait time! The coefficient associated with `Center == B` is
$-0.3$, and `Center == A` is treated as the baseline. This means
that the risk of a call being answered at Center B is 0.74 times the
risk that it will be answered at Center A; in other words, the wait
times are a bit longer at Center B.
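
These hazard ratios are just the exponentiated coefficients:

```python
import math

hr_operators = math.exp(0.04)   # ~1.041: each extra operator raises the "answer" hazard by ~4%
hr_center_b = math.exp(-0.3)    # ~0.741: Center B calls are answered at ~0.74 times the Center A rate
```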

Recall from Section 2.3.7 the use of `lambda`
for creating short functions on the fly.
We use the function
`sim_time()` from the `ISLP.survival` package. This function
uses the relationship between the survival function
and cumulative hazard $S(t) = \exp(-H(t))$ and the specific
form of the cumulative hazard function in the Cox model
to simulate data based on values of the linear predictor
`true_linpred` and the cumulative hazard.
We need to provide the cumulative hazard function, which we do here.

```{python}
cum_hazard = lambda t: 1e-5 * t**2 / 2
```
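
To see the idea behind such a simulator, here is a sketch of the standard inverse-transform construction (an illustration of the principle, not necessarily how `sim_time()` is implemented). Under the Cox model, the cumulative hazard for an individual with linear predictor $\eta$ is $e^{\eta} H(t)$, so if $U \sim \mathrm{Uniform}(0,1)$, then $T = H^{-1}(-\log U / e^{\eta})$ has the desired survival function. For $h(t) = 10^{-5}\, t$ we have $H(t) = 10^{-5}\, t^2/2$, whose inverse is available in closed form:

```python
import math
import random

def sim_time_sketch(linpred, cum_hazard_inv, rng):
    # Inverse-transform sampling: S(T) = exp(-exp(linpred) * H(T)) is Uniform(0,1),
    # so solving for T with a uniform draw gives a sample from the model.
    u = rng.random()
    return cum_hazard_inv(-math.log(u) / math.exp(linpred))

# Inverse of H(t) = 1e-5 * t**2 / 2
inv_cum_hazard = lambda y: math.sqrt(2 * y / 1e-5)

rng = random.Random(0)
waits = [sim_time_sketch(0.0, inv_cum_hazard, rng) for _ in range(1000)]
```

Larger linear predictors shrink the simulated times, matching the interpretation of the coefficients above.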

We are now ready to generate data under the Cox proportional hazards
model. We truncate the maximum time to 1000 seconds to keep
simulated wait times reasonable. The function
`sim_time()` takes a linear predictor,
a cumulative hazard function and a
random number generator.

```{python}
W = np.array([sim_time(l, cum_hazard, rng)
              for l in true_linpred])
D['Wait time'] = np.clip(W, 0, 1000)
```

We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`).

```{python}
D['Failed'] = rng.choice([1, 0],
                         N,
                         p=[0.9, 0.1])
D[:5]
```

```{python}
D['Failed'].mean()
```

We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.

```{python}
fig, ax = subplots(figsize=(8,8))
by_center = {}
for center, df in D.groupby('Center'):
    by_center[center] = df
    km_center = km.fit(df['Wait time'], df['Failed'])
    km_center.plot(label='Center=%s' % center, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```

Next, we stratify by `Time`.

```{python}
fig, ax = subplots(figsize=(8,8))
by_time = {}
for time, df in D.groupby('Time'):
    by_time[time] = df
    km_time = km.fit(df['Wait time'], df['Failed'])
    km_time.plot(label='Time=%s' % time, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```

It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a
log-rank test to determine whether these differences are statistically
significant using the function `multivariate_logrank_test()`.

```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Center'],
                          D['Failed'])
```

Next, we consider the effect of `Time`.

```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Time'],
                          D['Failed'])
```

As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we
look at the results for `Center`.

```{python}
X = MS(['Wait time',
        'Failed',
        'Center'],
       intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```

Next, we look at the results for `Time`.

```{python}
X = MS(['Wait time',
        'Failed',
        'Time'],
       intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```

We find that differences between centers are highly significant, as
are differences between times of day.

Finally, we fit Cox's proportional hazards model to the data.

```{python}
X = MS(D.columns,
       intercept=False).fit_transform(D)
fit_queuing = coxph().fit(
                  X,
                  'Wait time',
                  'Failed')
fit_queuing.summary[['coef', 'se(coef)', 'p']]
```

The $p$-values for Center B and evening time
are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be
answered --- increases with the number of operators. Since we
generated the data ourselves, we know that the true coefficients for
`Operators`, `Center = B`, `Center = C`,
`Time = Even.` and `Time = Morn.` are $0.04$, $-0.3$,
$0$, $0.2$, and $-0.2$, respectively. The coefficient estimates
from the fitted Cox model are fairly accurate.

# Unsupervised Learning

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch12-unsup-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch12-unsup-lab.ipynb)

In this lab we demonstrate PCA and clustering on several datasets.
As in other labs, we import some of our libraries at this top
level. This makes the code more readable, as scanning the first few
lines of the notebook tells us what libraries are used in this
notebook.

```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from ISLP import load_data
```

We also collect the new imports
needed for this lab.

```{python}
from sklearn.cluster import \
     (KMeans,
      AgglomerativeClustering)
from scipy.cluster.hierarchy import \
     (dendrogram,
      cut_tree)
from ISLP.cluster import compute_linkage
```

## Principal Components Analysis

In this lab, we perform PCA on `USArrests`, a data set in the
`R` computing environment.
We retrieve the data using `get_rdataset()`, which can fetch data from
many standard `R` packages.

The rows of the data set contain the 50 states, in alphabetical order.

```{python}
USArrests = get_rdataset('USArrests').data
USArrests
```

The columns of the data set contain the four variables.

```{python}
USArrests.columns
```

We first briefly examine the data. We notice that the variables have vastly different means.

```{python}
USArrests.mean()
```

Dataframes have several useful methods for computing
column-wise summaries. We can also examine the
variance of the four variables using the `var()` method.

```{python}
USArrests.var()
```

Not surprisingly, the variables also have vastly different variances.
The `UrbanPop` variable measures the percentage of the population
in each state living in an urban area, which is not comparable
to the number of rapes in each state per 100,000 individuals.
PCA looks for derived variables that account for most of the variance in the data set.
If we do not scale the variables before performing PCA, then the principal components
would mostly be driven by the
`Assault` variable, since it has by far the largest
variance. So if the variables are measured in different units or vary widely in scale,
it is recommended to standardize the variables to have standard deviation one before performing PCA.
Typically we set the means to zero as well.

This scaling can be done via the `StandardScaler()` transform imported above. We first `fit` the
scaler, which computes the necessary means and standard
deviations, and then apply it to our data using the
`transform` method. As before, we combine these steps using the `fit_transform()` method.

```{python}
scaler = StandardScaler(with_std=True,
                        with_mean=True)
USArrests_scaled = scaler.fit_transform(USArrests)
```
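
Under the hood this is just a column-wise z-score; a quick pure-Python sketch for one column (illustrative only):

```python
import math

def standardize(column):
    n = len(column)
    mean = sum(column) / n
    # population (ddof=0) standard deviation, as StandardScaler uses
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]

z = standardize([2.0, 4.0, 6.0])
```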

Having scaled the data, we can then
perform principal components analysis using the `PCA()` transform
from the `sklearn.decomposition` package.

```{python}
pcaUS = PCA()
```

(By default, the `PCA()` transform centers the variables to have
mean zero though it does not scale them.) The transform `pcaUS`
can be used to find the PCA
`scores` returned by `fit()`. Once the `fit` method has been called, the `pcaUS` object also contains a number of useful quantities.

```{python}
pcaUS.fit(USArrests_scaled)
```

After fitting, the `mean_` attribute corresponds to the means
of the variables. In this case, since we centered and scaled the data with
the scaler above, the means will all be 0.

```{python}
pcaUS.mean_
```

The scores can be computed using the `transform()` method
of `pcaUS` after it has been fit.

```{python}
scores = pcaUS.transform(USArrests_scaled)
```

We will plot these scores a bit further down.
The `components_` attribute provides the principal component loadings:
each row of `pcaUS.components_` contains the corresponding
principal component loading vector.

```{python}
pcaUS.components_
```

The `biplot` is a common visualization method used with
PCA. It is not built in as a standard
part of `sklearn`, though there are python
packages that do produce such plots. Here we
make a simple biplot manually.

```{python}
i, j = 0, 1 # which components
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(scores[:,0], scores[:,1])
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
for k in range(pcaUS.components_.shape[1]):
    ax.arrow(0, 0, pcaUS.components_[i,k], pcaUS.components_[j,k])
    ax.text(pcaUS.components_[i,k],
            pcaUS.components_[j,k],
            USArrests.columns[k])
```

Notice that this figure is a reflection of Figure 12.1 through the $y$-axis. Recall that the
principal components are only unique up to a sign change, so we can
reproduce that figure by flipping the
signs of the second set of scores and loadings.
We also increase the length of the arrows to emphasize the loadings.

```{python}
scale_arrow = s_ = 2
scores[:,1] *= -1
pcaUS.components_[1] *= -1 # flip the y-axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(scores[:,0], scores[:,1])
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
for k in range(pcaUS.components_.shape[1]):
    ax.arrow(0, 0, s_*pcaUS.components_[i,k], s_*pcaUS.components_[j,k])
    ax.text(s_*pcaUS.components_[i,k],
            s_*pcaUS.components_[j,k],
            USArrests.columns[k])
```

The standard deviations of the principal component scores are as follows:

```{python}
scores.std(0, ddof=1)
```

The variance of each score can be extracted directly from the `pcaUS` object via
the `explained_variance_` attribute.

```{python}
pcaUS.explained_variance_
```

The proportion of variance explained by each principal
component (PVE) is stored as `explained_variance_ratio_`:

```{python}
pcaUS.explained_variance_ratio_
```
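
The ratio is simply each component's variance divided by the total variance. A quick sketch with hypothetical variances (the values below are made up for illustration):

```python
variances = [2.48, 0.99, 0.36, 0.17]   # hypothetical component variances
total = sum(variances)
pve = [v / total for v in variances]                   # proportion of variance explained
cum_pve = [sum(pve[:k + 1]) for k in range(len(pve))]  # cumulative PVE
```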

We see that the first principal component explains 62.0% of the
variance in the data, the next principal component explains 24.7%
of the variance, and so forth.
We can plot the PVE explained by each component, as well as the cumulative PVE. We first
plot the proportion of variance explained.

```{python}
# %%capture
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcaUS.n_components_)+1
ax = axes[0]
ax.plot(ticks,
        pcaUS.explained_variance_ratio_,
        marker='o')
ax.set_xlabel('Principal Component');
ax.set_ylabel('Proportion of Variance Explained')
ax.set_ylim([0,1])
ax.set_xticks(ticks)
```

Notice the use of `%%capture`, which suppresses the display of the partially completed figure.

```{python}
ax = axes[1]
ax.plot(ticks,
        pcaUS.explained_variance_ratio_.cumsum(),
        marker='o')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Cumulative Proportion of Variance Explained')
ax.set_ylim([0, 1])
ax.set_xticks(ticks)
fig
```

The result is similar to that shown in Figure 12.3. Note
that the method `cumsum()` computes the cumulative sum of
the elements of a numeric vector. For instance:

```{python}
a = np.array([1,2,8,-3])
np.cumsum(a)
```

## Matrix Completion

We now re-create the analysis carried out on the `USArrests` data in
Section 12.3.

We saw in Section 12.2.2 that solving the optimization
problem (12.6) on a centered data matrix $\bf X$ is
equivalent to computing the first $M$ principal
components of the data. We use our scaled
and centered `USArrests` data as $\bf X$ below. The *singular value decomposition*
(SVD) is a general algorithm for solving
(12.6).

```{python}
X = USArrests_scaled
U, D, V = np.linalg.svd(X, full_matrices=False)
U.shape, D.shape, V.shape
```

The `np.linalg.svd()` function returns three components, `U`, `D` and `V`. The matrix `V` is equivalent to the
loading matrix from principal components (up to an unimportant sign flip). Using the `full_matrices=False` option ensures that
for a tall matrix the shape of `U` is the same as the shape of `X`.

```{python}
V
```

```{python}
pcaUS.components_
```

The matrix `U` corresponds to a *standardized* version of the PCA score matrix (each column standardized to have sum-of-squares one). If we multiply each column of `U` by the corresponding element of `D`, we recover the PCA scores exactly (up to a meaningless sign flip).

```{python}
(U * D[None,:])[:3]
```

```{python}
scores[:3]
```

While it would be possible to carry out this lab using the `PCA()` estimator,
here we use the `np.linalg.svd()` function in order to illustrate its use.

We now omit 20 entries in the $50\times 4$ data matrix at random. We do so
by first selecting 20 rows (states) at random, and then selecting one
of the four entries in each row at random. This ensures that every row has
at least three observed values.

```{python}
n_omit = 20
np.random.seed(15)
r_idx = np.random.choice(np.arange(X.shape[0]),
                         n_omit,
                         replace=False)
c_idx = np.random.choice(np.arange(X.shape[1]),
                         n_omit,
                         replace=True)
Xna = X.copy()
Xna[r_idx, c_idx] = np.nan
```

Here the array `r_idx`
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.

We now write some code to implement Algorithm 12.1.
We first write a function that takes in a matrix, and returns an approximation to the matrix using the `svd()` function.
This will be needed in Step 2 of Algorithm 12.1.

```{python}
def low_rank(X, M=1):
    U, D, V = np.linalg.svd(X)
    L = U[:,:M] * D[None,:M]
    return L.dot(V[:M])
```

To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm 12.1 --- by replacing
the missing values with the column means of the non-missing entries. These are stored in
`Xbar` below after running `np.nanmean()` over the row axis.
We make a copy so that when we assign values to `Xhat` below we do not also overwrite the
values in `Xna`.

```{python}
Xhat = Xna.copy()
Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx]
```
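
`np.nanmean(..., axis=0)` is just the mean of each column's non-missing entries; a pure-Python sketch of that step (illustrative only):

```python
def col_means_ignoring_nan(rows):
    means = []
    for j in range(len(rows[0])):
        # NaN != NaN, so this comparison filters out missing entries
        vals = [row[j] for row in rows if row[j] == row[j]]
        means.append(sum(vals) / len(vals))
    return means

m = col_means_ignoring_nan([[1.0, float('nan')],
                            [3.0, 4.0]])
```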

Before we begin Step 2, we set ourselves up to measure the progress of our
iterations:

```{python}
thresh = 1e-7
rel_err = 1
count = 0
ismiss = np.isnan(Xna)
mssold = np.mean(Xhat[~ismiss]**2)
mss0 = np.mean(Xna[~ismiss]**2)
```

Here `ismiss` is a logical matrix with the same dimensions as `Xna`;
a given element is `True` if the corresponding matrix element is missing. The notation `~ismiss` negates this boolean matrix. This is useful
because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in `mss0`.
We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold` (which currently
agrees with `mss0`). We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then
iterate Step 2 of Algorithm 12.1 until the *relative error*, defined as
`(mssold - mss) / mss0`, falls below `thresh = 1e-7`.
(Algorithm 12.1 tells us to iterate Step 2 until (12.14) is no longer decreasing. Determining whether (12.14) is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm 12.1 to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.)
|
||||
|
||||
In Step 2(a) of Algorithm 12.1, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
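Step 2(a) relies on the `low_rank()` helper defined earlier in the lab. As a reminder of what it computes --- the best rank-`M` approximation of a matrix in the least-squares sense --- a minimal sketch via the singular value decomposition might look like this (the name and signature follow the lab's usage):

```{python}
import numpy as np

def low_rank(X, M=1):
    # Truncated SVD: keep the leading M singular values/vectors,
    # giving the best rank-M approximation to X in Frobenius norm.
    U, D, V = np.linalg.svd(X, full_matrices=False)
    return (U[:, :M] * D[:M]) @ V[:M]
```

For `M` equal to the rank of `X`, this reproduces `X` itself.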

```{python}
while rel_err > thresh:
    count += 1
    # Step 2(a)
    Xapp = low_rank(Xhat, M=1)
    # Step 2(b)
    Xhat[ismiss] = Xapp[ismiss]
    # Step 2(c)
    mss = np.mean(((Xna - Xapp)[~ismiss])**2)
    rel_err = (mssold - mss) / mss0
    mssold = mss
    print("Iteration: {0}, MSS:{1:.3f}, Rel.Err {2:.2e}"
          .format(count, mss, rel_err))

```

We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.

Finally, we compute the correlation between the 20 imputed values and the actual values:

```{python}
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]

```

In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data should look to more specialized `Python` implementations.

## Clustering

### $K$-Means Clustering

The estimator `sklearn.cluster.KMeans()` performs $K$-means clustering in `Python`. We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

```{python}
np.random.seed(0);
X = np.random.standard_normal((50,2));
X[:25,0] += 3;
X[:25,1] -= 4;

```

We now perform $K$-means clustering with $K=2$.

```{python}
kmeans = KMeans(n_clusters=2,
                random_state=2,
                n_init=20).fit(X)

```

We specify `random_state` to make the results reproducible. The cluster assignments of the 50 observations are contained in `kmeans.labels_`.

```{python}
kmeans.labels_

```

The $K$-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to `KMeans()`. We can plot the data, with each observation colored according to its cluster assignment.

```{python}
fig, ax = plt.subplots(1, 1, figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2");

```

Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables then we could instead perform PCA and plot the first two principal component score vectors to represent the clusters.

In this example, we knew that there really were two clusters because we generated the data. However, for real data, we do not know the true number of clusters, nor whether they exist in any precise way. We could instead have performed $K$-means clustering on this example with $K=3$.

```{python}
kmeans = KMeans(n_clusters=3,
                random_state=3,
                n_init=20).fit(X)
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=3");

```

When $K=3$, $K$-means clustering splits up the two clusters. We have used the `n_init` argument to run the $K$-means with 20 initial cluster assignments (the default is 10). If a value of `n_init` greater than one is used, then $K$-means clustering will be performed using multiple random assignments in Step 1 of Algorithm 12.2, and the `KMeans()` function will report only the best results. Here we compare using `n_init=1` to `n_init=20`.

```{python}
kmeans1 = KMeans(n_clusters=3,
                 random_state=3,
                 n_init=1).fit(X)
kmeans20 = KMeans(n_clusters=3,
                  random_state=3,
                  n_init=20).fit(X);
kmeans1.inertia_, kmeans20.inertia_

```

Note that `kmeans.inertia_` is the total within-cluster sum of squares, which we seek to minimize by performing $K$-means clustering, as in (12.17).
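This interpretation of `inertia_` can be checked directly: it equals the sum of squared distances from each observation to its assigned cluster center. A small sketch on a toy data set (separate from the lab's `X`):

```{python}
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
Y = rng.standard_normal((40, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Y)
# Recompute the total within-cluster sum of squares by hand
wcss = sum(((Y[km.labels_ == k] - km.cluster_centers_[k])**2).sum()
           for k in range(3))
assert np.isclose(wcss, km.inertia_)
```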

We *strongly* recommend always running $K$-means clustering with a large value of `n_init`, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.

When performing $K$-means clustering, in addition to using multiple initial cluster assignments, it is also important to set a random seed using the `random_state` argument to `KMeans()`. This way, the initial cluster assignments in Step 1 can be replicated, and the $K$-means output will be fully reproducible.

### Hierarchical Clustering

The `AgglomerativeClustering()` class from the `sklearn.cluster` package implements hierarchical clustering. Since its name is long, we use the shorthand `HClust` for *hierarchical clustering*. Note that this alias does not change the return type: instances will still be of class `AgglomerativeClustering`. In the following example we use the data from the previous lab to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering with Euclidean distance as the dissimilarity measure. We begin by clustering observations using complete linkage.

```{python}
HClust = AgglomerativeClustering
hc_comp = HClust(distance_threshold=0,
                 n_clusters=None,
                 linkage='complete')
hc_comp.fit(X)

```

This computes the entire dendrogram. We could just as easily perform hierarchical clustering with average or single linkage instead:

```{python}
hc_avg = HClust(distance_threshold=0,
                n_clusters=None,
                linkage='average');
hc_avg.fit(X)
hc_sing = HClust(distance_threshold=0,
                 n_clusters=None,
                 linkage='single');
hc_sing.fit(X);

```

To use a precomputed distance matrix, we provide an additional argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.

```{python}
D = np.zeros((X.shape[0], X.shape[0]));
for i in range(X.shape[0]):
    x_ = np.multiply.outer(np.ones(X.shape[0]), X[i])
    D[i] = np.sqrt(np.sum((X - x_)**2, 1));
hc_sing_pre = HClust(distance_threshold=0,
                     n_clusters=None,
                     metric='precomputed',
                     linkage='single')
hc_sing_pre.fit(D)

```
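The explicit loop above is shown for transparency; the same matrix can be obtained in one step from `scipy.spatial.distance`, which may be preferable for larger data sets. A sketch on simulated data:

```{python}
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 2))
D_scipy = squareform(pdist(Y))   # 50x50 Euclidean distance matrix
# Same computation via broadcasting, mirroring the loop above
D_manual = np.sqrt(((Y[:, None, :] - Y[None, :, :])**2).sum(-1))
assert np.allclose(D_scipy, D_manual)
```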

We use `dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However, `dendrogram()` expects a so-called *linkage-matrix representation* of the clustering, which is not provided by `AgglomerativeClustering()`, but can be computed. The function `compute_linkage()` in the `ISLP.cluster` package is provided for this purpose.

We can now plot the dendrograms. The numbers at the bottom of the plot identify each observation. The `dendrogram()` function colors branches of the tree by default, suggesting a pre-defined cut of the tree at a particular depth. We prefer to overwrite this default by setting this threshold to be infinite. Since we want this behavior for many dendrograms, we store these values in a dictionary `cargs` and pass them as keyword arguments using the notation `**cargs`.

```{python}
cargs = {'color_threshold':-np.inf,
         'above_threshold_color':'black'}
linkage_comp = compute_linkage(hc_comp)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp,
           ax=ax,
           **cargs);

```

We may want to color branches of the tree above and below a cut-threshold differently. This can be achieved by changing the `color_threshold`. Let’s cut the tree at a height of 4, coloring links that merge above 4 in black.

```{python}
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp,
           ax=ax,
           color_threshold=4,
           above_threshold_color='black');

```

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the `cut_tree()` function from `scipy.cluster.hierarchy`:

```{python}
cut_tree(linkage_comp, n_clusters=4).T

```

This can also be achieved by providing an argument `n_clusters` to `HClust()`; however, each cut would require recomputing the clustering. Similarly, trees may be cut by distance threshold, by passing `distance_threshold` to `HClust()` or `height` to `cut_tree()`.

```{python}
cut_tree(linkage_comp, height=5)

```

To scale the variables before performing hierarchical clustering of the observations, we use `StandardScaler()` as in our PCA example:

```{python}
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
hc_comp_scale = HClust(distance_threshold=0,
                       n_clusters=None,
                       linkage='complete').fit(X_scale)
linkage_comp_scale = compute_linkage(hc_comp_scale)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp_scale, ax=ax, **cargs)
ax.set_title("Hierarchical Clustering with Scaled Features");

```

Correlation-based distances between observations can also be used for clustering. The correlation between two observations measures the similarity of their feature values. {Suppose each observation has $p$ features, each a single numerical value. We measure the similarity of two such observations by computing the correlation of these $p$ pairs of numbers.} With $n$ observations, the $n\times n$ correlation matrix can then be used as a similarity (or affinity) matrix, so that one minus the correlation matrix is the dissimilarity matrix used for clustering.

Note that using correlation only makes sense for data with at least three features, since the absolute correlation between any two observations with measurements on two features is always one. Hence, we will cluster a three-dimensional data set.
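The two-feature degeneracy is easy to verify numerically: with only two measurements per observation, every pairwise correlation between observations is $\pm 1$ (barring constant rows), so the resulting dissimilarities carry no information.

```{python}
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))   # five observations on two features
C = np.corrcoef(A)                # correlations between observations
# Every entry has absolute value one: the dissimilarity is degenerate
assert np.allclose(np.abs(C), 1)
```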

```{python}
X = np.random.standard_normal((30, 3))
corD = 1 - np.corrcoef(X)
hc_cor = HClust(linkage='complete',
                distance_threshold=0,
                n_clusters=None,
                metric='precomputed')
hc_cor.fit(corD)
linkage_cor = compute_linkage(hc_cor)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_cor, ax=ax, **cargs)
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");

```

## NCI60 Data Example

Unsupervised techniques are often used in the analysis of genomic data. In particular, PCA and hierarchical clustering are popular tools. We illustrate these techniques on the `NCI60` cancer cell line microarray data, which consists of 6830 gene expression measurements on 64 cancer cell lines.

```{python}
NCI60 = load_data('NCI60')
nci_labs = NCI60['labels']
nci_data = NCI60['data']

```

Each cell line is labeled with a cancer type. We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques. But after performing PCA and clustering, we will check to see the extent to which these cancer types agree with the results of these unsupervised techniques.

The data has 64 rows and 6830 columns.

```{python}
nci_data.shape

```

We begin by examining the cancer types for the cell lines.

```{python}
nci_labs.value_counts()

```

### PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although here one could reasonably argue that it is better not to scale the genes, as they are measured in the same units.

```{python}
scaler = StandardScaler()
nci_scaled = scaler.fit_transform(nci_data)
nci_pca = PCA()
nci_scores = nci_pca.fit_transform(nci_scaled)

```

We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other.

```{python}
cancer_types = list(np.unique(nci_labs))
nci_groups = np.array([cancer_types.index(lab)
                       for lab in nci_labs.values])
fig, axes = plt.subplots(1, 2, figsize=(15,6))
ax = axes[0]
ax.scatter(nci_scores[:,0],
           nci_scores[:,1],
           c=nci_groups,
           marker='o',
           s=50)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
ax = axes[1]
ax.scatter(nci_scores[:,0],
           nci_scores[:,2],
           c=nci_groups,
           marker='o',
           s=50)
ax.set_xlabel('PC1'); ax.set_ylabel('PC3');

```

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have quite similar gene expression levels.

We can also plot the percent variance explained by the principal components, as well as the cumulative percent variance explained. This is similar to the plots we made earlier for the `USArrests` data.

```{python}
fig, axes = plt.subplots(1, 2, figsize=(15,6))
ax = axes[0]
ticks = np.arange(nci_pca.n_components_)+1
ax.plot(ticks,
        nci_pca.explained_variance_ratio_,
        marker='o')
ax.set_xlabel('Principal Component');
ax.set_ylabel('PVE')
ax = axes[1]
ax.plot(ticks,
        nci_pca.explained_variance_ratio_.cumsum(),
        marker='o');
ax.set_xlabel('Principal Component')
ax.set_ylabel('Cumulative PVE');

```

We see that together, the first seven principal components explain around 40% of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by further principal components. That is, there is an *elbow* in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (though even examining seven principal components may be difficult).

### Clustering the Observations of the NCI60 Data

We now perform hierarchical clustering of the cell lines in the `NCI60` data using complete, single, and average linkage. Once again, the goal is to find out whether or not the observations cluster into distinct types of cancer. Euclidean distance is used as the dissimilarity measure. We first write a short function to produce the three dendrograms.

```{python}
def plot_nci(linkage, ax, cut=-np.inf):
    cargs = {'above_threshold_color':'black',
             'color_threshold':cut}
    hc = HClust(n_clusters=None,
                distance_threshold=0,
                linkage=linkage.lower()).fit(nci_scaled)
    linkage_ = compute_linkage(hc)
    dendrogram(linkage_,
               ax=ax,
               labels=np.asarray(nci_labs),
               leaf_font_size=10,
               **cargs)
    ax.set_title('%s Linkage' % linkage)
    return hc

```

Let’s plot our results.

```{python}
fig, axes = plt.subplots(3, 1, figsize=(15,30))
ax = axes[0]; hc_comp = plot_nci('Complete', ax)
ax = axes[1]; hc_avg = plot_nci('Average', ax)
ax = axes[2]; hc_sing = plot_nci('Single', ax)

```

We see that the choice of linkage certainly does affect the results obtained. Typically, single linkage will tend to yield *trailing* clusters: very large clusters onto which individual observations attach one-by-one. On the other hand, complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage. Clearly cell lines within a single cancer type do tend to cluster together, although the clustering is not perfect. We will use complete linkage hierarchical clustering for the analysis that follows.

We can cut the dendrogram at the height that will yield a particular number of clusters, say four:

```{python}
linkage_comp = compute_linkage(hc_comp)
comp_cut = cut_tree(linkage_comp, n_clusters=4).reshape(-1)
pd.crosstab(nci_labs['label'],
            pd.Series(comp_cut, name='Complete'))

```

There are some clear patterns. All the leukemia cell lines fall in one cluster, while the breast cancer cell lines are spread out over three different clusters.

We can plot a cut on the dendrogram that produces these four clusters:

```{python}
fig, ax = plt.subplots(figsize=(10,10))
plot_nci('Complete', ax, cut=140)
ax.axhline(140, c='r', linewidth=4);

```

The `axhline()` function draws a horizontal line on top of any existing set of axes. The argument `140` plots a horizontal line at height 140 on the dendrogram; this is a height that results in four distinct clusters. It is easy to verify that the resulting clusters are the same as the ones we obtained in `comp_cut`.

We claimed earlier in Section 12.4.2 that $K$-means clustering and hierarchical clustering with the dendrogram cut to obtain the same number of clusters can yield very different results. How do these `NCI60` hierarchical clustering results compare to what we get if we perform $K$-means clustering with $K=4$?

```{python}
nci_kmeans = KMeans(n_clusters=4,
                    random_state=0,
                    n_init=20).fit(nci_scaled)
pd.crosstab(pd.Series(comp_cut, name='HClust'),
            pd.Series(nci_kmeans.labels_, name='K-means'))

```

We see that the four clusters obtained using hierarchical clustering and $K$-means clustering are somewhat different. First we note that the labels in the two clusterings are arbitrary: swapping cluster identifiers does not change the clustering. We see here that cluster 3 in $K$-means clustering is identical to cluster 2 in hierarchical clustering. However, the other clusters differ: for instance, cluster 0 in $K$-means clustering contains a portion of the observations assigned to cluster 0 by hierarchical clustering, as well as all of the observations assigned to cluster 1 by hierarchical clustering.

Rather than performing hierarchical clustering on the entire data matrix, we can also perform hierarchical clustering on the first few principal component score vectors, regarding these first few components as a less noisy version of the data.

```{python}
hc_pca = HClust(n_clusters=None,
                distance_threshold=0,
                linkage='complete'
                ).fit(nci_scores[:,:5])
linkage_pca = compute_linkage(hc_pca)
fig, ax = plt.subplots(figsize=(8,8))
dendrogram(linkage_pca,
           labels=np.asarray(nci_labs),
           leaf_font_size=10,
           ax=ax,
           **cargs)
ax.set_title("Hier. Clust. on First Five Score Vectors")
pca_labels = pd.Series(cut_tree(linkage_pca,
                                n_clusters=4).reshape(-1),
                       name='Complete-PCA')
pd.crosstab(nci_labs['label'], pca_labels)

```
# Multiple Testing

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch13-multiple-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch13-multiple-lab.ipynb)

We include our usual imports seen in earlier labs.

```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data

```

We also collect the new imports needed for this lab.

```{python}
from scipy.stats import \
    (ttest_1samp,
     ttest_rel,
     ttest_ind,
     t as t_dbn)
from statsmodels.stats.multicomp import \
    pairwise_tukeyhsd
from statsmodels.stats.multitest import \
    multipletests as mult_test

```

## Review of Hypothesis Tests

We begin by performing some one-sample $t$-tests.

First we create 100 variables, each consisting of 10 observations. The first 50 variables have mean $0.5$ and variance $1$, while the others have mean $0$ and variance $1$.

```{python}
rng = np.random.default_rng(12)
X = rng.standard_normal((10, 100))
true_mean = np.array([0.5]*50 + [0]*50)
X += true_mean[None,:]

```

To begin, we use `ttest_1samp()` from the `scipy.stats` module to test $H_{0}: \mu_1=0$, the null hypothesis that the first variable has mean zero.

```{python}
result = ttest_1samp(X[:,0], 0)
result.pvalue

```

The $p$-value comes out to 0.931, which is not low enough to reject the null hypothesis at level $\alpha=0.05$. In this case, $\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made a Type II error by failing to reject the null hypothesis when the null hypothesis is false.

We now test $H_{0,j}: \mu_j=0$ for $j=1,\ldots,100$. We compute the 100 $p$-values, and then construct a vector recording whether the $j$th $p$-value is less than or equal to 0.05, in which case we reject $H_{0,j}$, or greater than 0.05, in which case we do not reject $H_{0,j}$, for $j=1,\ldots,100$.

```{python}
p_values = np.empty(100)
for i in range(100):
    p_values[i] = ttest_1samp(X[:,i], 0).pvalue
decision = pd.cut(p_values,
                  [0, 0.05, 1],
                  labels=['Reject H0',
                          'Do not reject H0'])
truth = pd.Categorical(true_mean == 0,
                       categories=[True, False],
                       ordered=True)

```

Since this is a simulated data set, we can create a $2 \times 2$ table similar to Table 13.2.

```{python}
pd.crosstab(decision,
            truth,
            rownames=['Decision'],
            colnames=['H0'])

```

Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false null hypotheses, and we incorrectly reject 5 of the true null hypotheses. Using the notation from Section 13.3, we have $V=5$, $S=15$, $U=45$ and $W=35$. We have set $\alpha=0.05$, which means that we expect to reject around 5% of the true null hypotheses. This is in line with the $2 \times 2$ table above, which indicates that we rejected $V=5$ of the $50$ true null hypotheses.

In the simulation above, for the false null hypotheses, the ratio of the mean to the standard deviation was only $0.5/1 = 0.5$. This amounts to quite a weak signal, and it resulted in a high number of Type II errors. Let’s instead simulate data with a stronger signal, so that the ratio of the mean to the standard deviation for the false null hypotheses equals $1$. We make only 10 Type II errors.

```{python}
true_mean = np.array([1]*50 + [0]*50)
X = rng.standard_normal((10, 100))
X += true_mean[None,:]
for i in range(100):
    p_values[i] = ttest_1samp(X[:,i], 0).pvalue
decision = pd.cut(p_values,
                  [0, 0.05, 1],
                  labels=['Reject H0',
                          'Do not reject H0'])
truth = pd.Categorical(true_mean == 0,
                       categories=[True, False],
                       ordered=True)
pd.crosstab(decision,
            truth,
            rownames=['Decision'],
            colnames=['H0'])

```

## Family-Wise Error Rate

Recall from (13.5) that if the null hypothesis is true for each of $m$ independent hypothesis tests, then the FWER is equal to $1-(1-\alpha)^m$. We can use this expression to compute the FWER for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$. We plot the FWER for these values of $\alpha$ in order to reproduce Figure 13.2.

```{python}
m = np.linspace(1, 501)
fig, ax = plt.subplots()
[ax.plot(m,
         1 - (1 - alpha)**m,
         label=r'$\alpha=%s$' % str(alpha))
    for alpha in [0.05, 0.01, 0.001]]
ax.set_xscale('log')
ax.set_xlabel('Number of Hypotheses')
ax.set_ylabel('Family-Wise Error Rate')
ax.legend()
ax.axhline(0.05, c='k', ls='--');

```

As discussed previously, even for moderate values of $m$ such as $50$, the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value, such as $0.001$. Of course, the problem with setting $\alpha$ to such a low value is that we are likely to make a number of Type II errors: in other words, our power is very low.
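As a concrete check of these numbers, we can evaluate the formula at $m=50$:

```{python}
# FWER for m independent tests at level alpha when all nulls are true
fwer = lambda alpha, m: 1 - (1 - alpha)**m
print(round(fwer(0.05, 50), 3))    # about 0.92: a false rejection is very likely
print(round(fwer(0.001, 50), 3))   # about 0.049: controlled, but power suffers
```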

We now conduct a one-sample $t$-test for each of the first five managers in the `Fund` dataset, in order to test the null hypothesis that the $j$th fund manager’s mean return equals zero, $H_{0,j}: \mu_j=0$.

```{python}
Fund = load_data('Fund')
fund_mini = Fund.iloc[:,:5]
fund_mini_pvals = np.empty(5)
for i in range(5):
    fund_mini_pvals[i] = ttest_1samp(fund_mini.iloc[:,i], 0).pvalue
fund_mini_pvals

```

The $p$-values are low for Managers One and Three, and high for the other three managers. However, we cannot simply reject $H_{0,1}$ and $H_{0,3}$, since this would fail to account for the multiple testing that we have performed. Instead, we will conduct Bonferroni’s method and Holm’s method to control the FWER.

To do this, we use the `multipletests()` function from the `statsmodels` module (abbreviated to `mult_test()`). Given the $p$-values, for methods like Holm and Bonferroni the function outputs adjusted $p$-values, which can be thought of as a new set of $p$-values that have been corrected for multiple testing. If the adjusted $p$-value for a given hypothesis is less than or equal to $\alpha$, then that hypothesis can be rejected while maintaining a FWER of no more than $\alpha$. In other words, for such methods, the adjusted $p$-values resulting from the `multipletests()` function can simply be compared to the desired FWER in order to determine whether or not to reject each hypothesis. We will later see that we can use the same function to control the FDR as well.

The `mult_test()` function takes $p$-values and a `method` argument, as well as an optional `alpha` argument. It returns the decisions (`reject` below) as well as the adjusted $p$-values (`bonf`).

```{python}
reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
reject

```

The $p$-values `bonf` are simply the entries of `fund_mini_pvals` multiplied by 5 and truncated to be less than or equal to 1.

```{python}
bonf, np.minimum(fund_mini_pvals * 5, 1)

```

Therefore, using Bonferroni’s method, we are able to reject the null hypothesis only for Manager One while controlling the FWER at $0.05$.

By contrast, using Holm’s method, the adjusted $p$-values indicate that we can reject the null hypotheses for Managers One and Three at a FWER of $0.05$.

```{python}
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]

```
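Holm's adjusted $p$-values can also be reproduced by hand, which clarifies what `mult_test()` is doing here: sort the $m$ $p$-values, multiply the $j$th smallest by $m-j+1$, enforce monotonicity with a running maximum, and cap at 1. A plain-`numpy` sketch (not part of the lab's code):

```{python}
import numpy as np

def holm_adjust(pvals):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = (m - np.arange(m)) * p[order]   # (m - j + 1) * p_(j)
    adj = np.minimum(np.maximum.accumulate(scaled), 1)
    out = np.empty(m)
    out[order] = adj                         # restore the original order
    return out

print(holm_adjust([0.01, 0.30, 0.02]))       # adjusted values: 0.03, 0.30, 0.04
```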

As discussed previously, Manager One seems to perform particularly well, whereas Manager Two has poor performance.

```{python}
fund_mini.mean()

```

Is there evidence of a meaningful difference in performance between these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function from `scipy.stats`:

```{python}
ttest_rel(fund_mini['Manager1'],
          fund_mini['Manager2']).pvalue

```

The test results in a $p$-value of 0.038, suggesting a statistically significant difference.

However, we decided to perform this test only after examining the data and noting that Managers One and Two had the highest and lowest mean performances. In a sense, this means that we have implicitly performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than just one, as discussed in Section 13.3.2. Hence, we use the `pairwise_tukeyhsd()` function from `statsmodels.stats.multicomp` to apply Tukey’s method in order to adjust for multiple testing. This function takes as input a fitted *ANOVA* regression model, which is essentially just a linear regression in which all of the predictors are qualitative. In this case, the response consists of the monthly excess returns achieved by each manager, and the predictor indicates the manager to which each return corresponds.

```{python}
returns = np.hstack([fund_mini.iloc[:,i] for i in range(5)])
managers = np.hstack([[i+1]*50 for i in range(5)])
tukey = pairwise_tukeyhsd(returns, managers)
print(tukey.summary())

```

The `pairwise_tukeyhsd()` function provides confidence intervals for the difference between each pair of managers (`lower` and `upper`), as well as a $p$-value. All of these quantities have been adjusted for multiple testing. Notice that the $p$-value for the difference between Managers One and Two has increased from $0.038$ to $0.186$, so there is no longer clear evidence of a difference between the managers’ performances. We can plot the confidence intervals for the pairwise comparisons using the `plot_simultaneous()` method of `tukey`. Any pair of intervals that don’t overlap indicates a significant difference at the nominal level of 0.05. In this case, no differences are considered significant, as reported in the table above.

```{python}
fig, ax = plt.subplots(figsize=(8,8))
tukey.plot_simultaneous(ax=ax);

```
## False Discovery Rate
Now we perform hypothesis tests for all 2,000 fund managers in the
`Fund` dataset. We perform a one-sample $t$-test
of $H_{0,j}: \mu_j=0$, which states that the
$j$th fund manager’s mean return is zero.

```{python}
fund_pvalues = np.empty(2000)
for i, manager in enumerate(Fund.columns):
    fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue

```
There are far too many managers to consider trying to control the FWER.
Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.

```{python}
fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
fund_qvalues[:10]

```
The *q-values* output by the
Benjamini--Hochberg procedure can be interpreted as the smallest FDR
threshold at which we would reject a particular null hypothesis. For
instance, a $q$-value of $0.1$ indicates that we can reject the
corresponding null hypothesis at an FDR of 10% or greater, but that
we cannot reject the null hypothesis at an FDR below 10%.
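This interpretation can be checked on a toy vector of $p$-values (the values below are illustrative, not from the `Fund` data). With $m=5$ sorted $p$-values, the $j$th $q$-value is $\min_{j' \geq j} \, m \, p_{(j')}/j'$:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.01, 0.02, 0.04, 0.5])
reject, qvals = multipletests(pvals, alpha=0.1, method='fdr_bh')[:2]
# The first q-value is min(5*0.001/1, 5*0.01/2, 5*0.02/3, 5*0.04/4, 5*0.5/5) = 0.005.
print(qvals)
# Hypotheses with q-value <= alpha are rejected at FDR level alpha = 0.1.
print(reject)
```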
If we control the FDR at 10%, then for how many of the fund managers can we reject $H_{0,j}: \mu_j=0$?

```{python}
(fund_qvalues <= 0.1).sum()

```
We find that 146 of the 2,000 fund managers have a $q$-value below
0.1; therefore, we are able to conclude that 146 of the fund managers
beat the market at an FDR of 10%. Only about 15 (10% of 146) of
these fund managers are likely to be false discoveries.

By contrast, if we had instead used Bonferroni’s method to control the
FWER at level $\alpha=0.1$, then we would have failed to reject any
null hypotheses!

```{python}
(fund_pvalues <= 0.1 / 2000).sum()

```
Figure 13.6 displays the ordered
$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
the `Fund` dataset, as well as the threshold for rejection by the
Benjamini--Hochberg procedure. Recall that the Benjamini--Hochberg
procedure identifies the largest $p$-value such that $p_{(j)}<qj/m$,
and rejects all hypotheses for which the $p$-value is less than or
equal to $p_{(j)}$. In the code below, we implement the
Benjamini--Hochberg procedure ourselves, in order to illustrate how it
works. We first order the $p$-values. We then identify all $p$-values
that satisfy $p_{(j)}<qj/m$ (`sorted_set_`). Finally, `selected_`
is a boolean array indicating which $p$-values
are less than or equal to the largest
$p$-value in `sorted_[sorted_set_]`. Therefore, `selected_` indexes the
$p$-values rejected by the Benjamini--Hochberg procedure.

```{python}
sorted_ = np.sort(fund_pvalues)
m = fund_pvalues.shape[0]
q = 0.1
sorted_set_ = np.where(sorted_ < q * np.linspace(1, m, m) / m)[0]
if sorted_set_.shape[0] > 0:
    # reject all p-values up to and including the largest one in the set
    selected_ = fund_pvalues <= sorted_[sorted_set_].max()
    sorted_set_ = np.arange(sorted_set_.max() + 1)
else:
    selected_ = []
    sorted_set_ = []

```
We now reproduce the middle panel of Figure 13.6.

```{python}
fig, ax = plt.subplots()
ax.scatter(np.arange(0, sorted_.shape[0]) + 1,
           sorted_, s=10)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_ylabel('P-Value')
ax.set_xlabel('Index')
ax.scatter(sorted_set_+1, sorted_[sorted_set_], c='r', s=20)
ax.axline((0, 0), (1, q/m), c='k', ls='--', linewidth=3);

```
## A Re-Sampling Approach
Here, we implement the re-sampling approach to hypothesis testing
using the `Khan` dataset, which we investigated in
Section 13.5. First, we merge the training and
testing data, which results in observations on 83 patients for
2,308 genes.

```{python}
Khan = load_data('Khan')
D = pd.concat([Khan['xtrain'], Khan['xtest']])
D['Y'] = pd.concat([Khan['ytrain'], Khan['ytest']])
D['Y'].value_counts()

```
There are four classes of cancer. For each gene, we compare the mean
expression in the second class (rhabdomyosarcoma) to the mean
expression in the fourth class (Burkitt’s lymphoma). Performing a
standard two-sample $t$-test
using `ttest_ind()` from `scipy.stats` on the $11$th
gene produces a test-statistic of -2.09 and an associated $p$-value
of 0.0412, suggesting modest evidence of a difference in mean
expression levels between the two cancer types.

```{python}
D2 = D[lambda df: df['Y'] == 2]
D4 = D[lambda df: df['Y'] == 4]
gene_11 = 'G0011'
observedT, pvalue = ttest_ind(D2[gene_11],
                              D4[gene_11],
                              equal_var=True)
observedT, pvalue

```
However, this $p$-value relies on the assumption that under the null
hypothesis of no difference between the two groups, the test statistic
follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
Instead of using this theoretical null distribution, we can randomly
split the 54 patients into two groups of 29 and 25, and compute a new
test statistic. Under the null hypothesis of no difference between
the groups, this new test statistic should have the same distribution
as our original one. Repeating this process 10,000 times allows us to
approximate the null distribution of the test statistic. We then
compute the fraction of the re-sampled test statistics that are at
least as extreme as our observed test statistic.

```{python}
B = 10000
Tnull = np.empty(B)
D_ = np.hstack([D2[gene_11], D4[gene_11]])
n_ = D2[gene_11].shape[0]
D_null = D_.copy()
for b in range(B):
    rng.shuffle(D_null)
    ttest_ = ttest_ind(D_null[:n_],
                       D_null[n_:],
                       equal_var=True)
    Tnull[b] = ttest_.statistic
# fraction of permuted statistics at least as extreme as the observed one
(np.abs(Tnull) >= np.abs(observedT)).mean()

```
This fraction, 0.0398,
is our re-sampling-based $p$-value.
It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure 13.7.

```{python}
fig, ax = plt.subplots(figsize=(8,8))
ax.hist(Tnull,
        bins=100,
        density=True,
        facecolor='y',
        label='Null')
xval = np.linspace(-4.2, 4.2, 1001)
ax.plot(xval,
        t_dbn.pdf(xval, D_.shape[0]-2),
        c='r')
ax.axvline(observedT,
           c='b',
           label='Observed')
ax.legend()
ax.set_xlabel("Null Distribution of Test Statistic");

```
The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
Finally, we implement the plug-in re-sampling FDR approach outlined in
Algorithm 13.4. Depending on the speed of your
computer, calculating the FDR for all 2,308 genes in the `Khan`
dataset may take a while. Hence, we will illustrate the approach on a
random subset of 100 genes. For each gene, we first compute the
observed test statistic, and then produce 10,000 re-sampled test
statistics. This may take a few minutes to run. If you are in a rush,
then you could set `B` equal to a smaller value (e.g. `B=500`).

```{python}
m, B = 100, 10000
idx = rng.choice(Khan['xtest'].columns, m, replace=False)
T_vals = np.empty(m)
Tnull_vals = np.empty((m, B))

for j in range(m):
    col = idx[j]
    T_vals[j] = ttest_ind(D2[col],
                          D4[col],
                          equal_var=True).statistic
    D_ = np.hstack([D2[col], D4[col]])
    D_null = D_.copy()
    for b in range(B):
        rng.shuffle(D_null)
        ttest_ = ttest_ind(D_null[:n_],
                           D_null[n_:],
                           equal_var=True)
        Tnull_vals[j,b] = ttest_.statistic

```
Next, we compute the number of rejected null hypotheses $R$, the
estimated number of false positives $\widehat{V}$, and the estimated
FDR, for a range of threshold values $c$ in
Algorithm 13.4. The threshold values are chosen
using the absolute values of the test statistics from the 100 genes.

```{python}
cutoffs = np.sort(np.abs(T_vals))
FDRs, Rs, Vs = np.empty((3, m))
for j in range(m):
    R = np.sum(np.abs(T_vals) >= cutoffs[j])
    V = np.sum(np.abs(Tnull_vals) >= cutoffs[j]) / B
    Rs[j] = R
    Vs[j] = V
    FDRs[j] = V / R

```
Now, for any given FDR, we can find the genes that will be
rejected. For example, with FDR controlled at 0.1, we reject 15 of the
100 null hypotheses. On average, we would expect about one or two of
these genes (i.e. 10% of 15) to be false discoveries. At an FDR of
0.2, we can reject the null hypothesis for 28 genes, of which we
expect around six to be false discoveries.
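The expected counts quoted above are just the FDR threshold times the number of rejections $R$:

```python
# Expected false discoveries at a given FDR level is (FDR threshold) x R.
for fdr, R in [(0.1, 15), (0.2, 28)]:
    print(f"FDR {fdr:.1f}: R = {R}, expected false discoveries = {fdr * R:.1f}")
# prints 1.5 ("about one or two") and 5.6 ("around six")
```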
The variable `idx` stores which
genes were included in our 100 randomly-selected genes. Let’s look at
the genes whose estimated FDR is less than 0.1.

```{python}
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])

```
At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
proportion of false discoveries.

```{python}
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])

```
The next line generates Figure 13.11, which is similar
to Figure 13.9,
except that it is based on only a subset of the genes.

```{python}
fig, ax = plt.subplots()
ax.plot(Rs, FDRs, 'b', linewidth=3)
ax.set_xlabel("Number of Rejections")
ax.set_ylabel("False Discovery Rate");

```
README.md

````diff
@@ -95,7 +95,7 @@ Open your terminal and run the following command to set up the environment for v
 * `--python-version 3.12`: This will use Python 3.12 for the environment.
 * `Ch02-statlearn-lab.ipynb`: This is an optional argument to run a specific notebook after the setup is complete. It is meant for testing to be sure given notebooks run but is not required. You can list more than one notebook.
 
-### 2. Activate the environment
+### 3. Activate the environment
 
 Once the script is finished, you can activate the virtual environment to run other notebooks or work with the lab materials.
 
@@ -109,7 +109,7 @@ Once the script is finished, you can activate the virtual environment to run oth
 ISLP_v2.2.2\.venv\Scripts\activate
 ```
 
-### 3. Run other notebooks
+### 4. Run other notebooks
 
 After activating the environment, you can start Jupyter Lab to run other notebooks.
 
@@ -117,6 +117,14 @@ After activating the environment, you can start Jupyter Lab to run other noteboo
 jupyter lab
 ```
 
+# R Markdown files
+
+Earlier versions of this repo maintained R Markdown versions of the lab next to the Jupyter notebooks. These have been removed from the repo but can be built from the
+notebooks at any time by
+
+```bash
+make Rmd
+```
+
 ## Contributors ✨
````