Remove rmd (#77)

* Ch2->Ch02

* removed Rmd; made a new Rmd target for make; described in README.md
Jonathan Taylor
2026-02-04 17:48:37 -08:00
committed by GitHub
parent 6bf6160a3d
commit 1d08236560
14 changed files with 15 additions and 10551 deletions


@@ -8767,6 +8767,11 @@
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",


@@ -1,610 +0,0 @@
# Linear Regression
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch03-linreg-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch03-linreg-lab.ipynb)
## Importing packages
We import our standard libraries at this top
level.
```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
```
### New imports
Throughout this lab we will introduce new functions and libraries. However,
we will import them here to emphasize these are the new
code objects in this lab. Keeping imports near the top
of a notebook makes the code more readable, since scanning the first few
lines tells us what libraries are used.
```{python}
import statsmodels.api as sm
```
We will provide relevant details about the
functions below as they are needed.
Besides importing whole modules, it is also possible
to import only a few items from a given module. This
will help keep the *namespace* clean.
We will use a few specific objects from the `statsmodels` package
which we import here.
```{python}
from statsmodels.stats.outliers_influence \
import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
```
As one of the import statements above is quite a long line, we inserted a line break `\` to
ease readability.
We will also use some functions written for the labs in this book in the `ISLP`
package.
```{python}
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
summarize,
poly)
```
### Inspecting Objects and Namespaces
The
function `dir()`
provides a list of
objects in a namespace.
```{python}
dir()
```
This shows you everything that `Python` can find at the top level.
There are certain objects like `__builtins__` that contain references to built-in
functions like `print()`.
Every `Python` object has its own namespace, also accessible with `dir()`. This will include
both the attributes of the object
as well as any methods associated with it. For instance, we see `'sum'` in the listing for an
array.
```{python}
A = np.array([3,5,11])
dir(A)
```
This indicates that the object `A.sum` exists. In this case it is a method
that can be used to compute the sum of the array `A` as can be seen by typing `A.sum?`.
```{python}
A.sum()
```
## Simple Linear Regression
In this section we will construct model
matrices (also called design matrices) using the `ModelSpec()` transform from `ISLP.models`.
We will use the `Boston` housing data set, which is contained in the `ISLP` package. The `Boston` dataset records `medv` (median house value) for $506$ neighborhoods
around Boston. We will build a regression model to predict `medv` using $12$
predictors such as `rm` (average number of rooms per house),
`age` (proportion of owner-occupied units built prior to 1940), and `lstat` (percent of
households with low socioeconomic status). We will use `statsmodels` for this
task, a `Python` package that implements several commonly used
regression methods.
We have included a simple loading function `load_data()` in the
`ISLP` package:
```{python}
Boston = load_data("Boston")
Boston.columns
```
Type `Boston?` to find out more about these data.
We start by using the `sm.OLS()` function to fit a
simple linear regression model. Our response will be
`medv` and `lstat` will be the single predictor.
For this model, we can create the model matrix by hand.
```{python}
X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
'lstat': Boston['lstat']})
X[:4]
```
We extract the response, and fit the model.
```{python}
y = Boston['medv']
model = sm.OLS(y, X)
results = model.fit()
```
Note that `sm.OLS()` does
not fit the model; it specifies the model, and then `model.fit()` does the actual fitting.
Our `ISLP` function `summarize()` produces a simple table of the parameter estimates,
their standard errors, t-statistics and p-values.
The function takes a single argument, such as the object `results`
returned here by the `fit`
method, and returns such a summary.
```{python}
summarize(results)
```
Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix `X`.
### Using Transformations: Fit and Transform
Our model above has a single predictor, and constructing `X` was straightforward.
In practice we often fit models with more than one predictor, typically selected from an array or data frame.
We may wish to introduce transformations to the variables before fitting the model, specify interactions between variables, and expand some particular variables into sets of variables (e.g. polynomials).
The `sklearn` package has a particular notion
for this type of task: a *transform*. A transform is an object
that is created with some parameters as arguments. The
object has two main methods: `fit()` and `transform()`.
We provide a general approach for specifying models and constructing
the model matrix through the transform `ModelSpec()` in the `ISLP` library.
`ModelSpec()`
(renamed `MS()` in the preamble) creates a
transform object, and then a pair of methods
`transform()` and `fit()` are used to construct a
corresponding model matrix.
We first describe this process for our simple regression model using a single predictor `lstat` in
the `Boston` data frame, but will use it repeatedly in more
complex tasks in this and other labs in this book.
In our case the transform is created by the expression
`design = MS(['lstat'])`.
The `fit()` method takes the original array and may do some
initial computations on it, as specified in the transform object.
For example, it may compute means and standard deviations for centering and scaling.
The `transform()`
method applies the fitted transformation to the array of data, and produces the model matrix.
```{python}
design = MS(['lstat'])
design = design.fit(Boston)
X = design.transform(Boston)
X[:4]
```
In this simple case, the `fit()` method does very little; it simply checks that the variable `'lstat'` specified in `design` exists in `Boston`. Then `transform()` constructs the model matrix with two columns: an `intercept` and the variable `lstat`.
These two operations can be combined with the
`fit_transform()` method.
```{python}
design = MS(['lstat'])
X = design.fit_transform(Boston)
X[:4]
```
Note that, as in the previous code chunk when the two steps were done separately, the `design` object is changed as a result of the `fit()` operation. The power of this pipeline will become clearer when we fit more complex models that involve interactions and transformations.
Let's return to our fitted regression model.
The object
`results` has several methods that can be used for inference.
We already presented a function `summarize()` for showing the essentials of the fit.
For a full and somewhat exhaustive summary of the fit, we can use the `summary()`
method.
```{python}
results.summary()
```
The fitted coefficients can also be retrieved as the
`params` attribute of `results`.
```{python}
results.params
```
The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
prediction intervals for the prediction of `medv` for given values of `lstat`.
We first create a new data frame, in this case containing only the variable `lstat`, with the values for this variable at which we wish to make predictions.
We then use the `transform()` method of `design` to create the corresponding model matrix.
```{python}
new_df = pd.DataFrame({'lstat':[5, 10, 15]})
newX = design.transform(new_df)
newX
```
Next we compute the predictions at `newX`, and view them by extracting the `predicted_mean` attribute.
```{python}
new_predictions = results.get_prediction(newX);
new_predictions.predicted_mean
```
We can produce confidence intervals for the predicted values.
```{python}
new_predictions.conf_int(alpha=0.05)
```
Prediction intervals are computed by setting `obs=True`:
```{python}
new_predictions.conf_int(obs=True, alpha=0.05)
```
For instance, the 95% confidence interval associated with an
`lstat` value of 10 is (24.47, 25.63), and the 95% prediction
interval is (12.82, 37.28). As expected, the confidence and
prediction intervals are centered around the same point (a predicted
value of 25.05 for `medv` when `lstat` equals
10), but the latter are substantially wider.
Next we will plot `medv` and `lstat`
using `DataFrame.plot.scatter()`, and wish to
add the regression line to the resulting plot.
### Defining Functions
While there is a function
within the `ISLP` package that adds a line to an existing plot, we take this opportunity
to define our first function to do so.
```{python}
def abline(ax, b, m):
"Add a line with slope m and intercept b to ax"
xlim = ax.get_xlim()
ylim = [m * xlim[0] + b, m * xlim[1] + b]
ax.plot(xlim, ylim)
```
A few things are illustrated above. First we see the syntax for defining a function:
`def funcname(...)`. The function has arguments `ax, b, m`
where `ax` is an axis object for an existing plot, `b` is the intercept and
`m` is the slope of the desired line. Other plotting options can be passed on to
`ax.plot` by including additional optional arguments as follows:
```{python}
def abline(ax, b, m, *args, **kwargs):
"Add a line with slope m and intercept b to ax"
xlim = ax.get_xlim()
ylim = [m * xlim[0] + b, m * xlim[1] + b]
ax.plot(xlim, ylim, *args, **kwargs)
```
The addition of `*args` allows any number of
non-named arguments to `abline`, while `**kwargs` allows any
number of named arguments (such as `linewidth=3`) to `abline`.
In our function, we pass
these arguments verbatim to `ax.plot` above. Readers
interested in learning more about
functions are referred to the section on
defining functions in [docs.python.org/tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions).
Let's use our new function to add this regression line to a plot of
`medv` vs. `lstat`.
```{python}
ax = Boston.plot.scatter('lstat', 'medv')
abline(ax,
results.params['intercept'],
results.params['lstat'],
'r--',
linewidth=3)
```
Thus, the final call to `ax.plot()` is `ax.plot(xlim, ylim, 'r--', linewidth=3)`.
We have used the argument `'r--'` to produce a red dashed line, and added
an argument to make it of width 3.
There is some evidence for non-linearity in the relationship between `lstat` and `medv`. We will explore this issue later in this lab.
As mentioned above, there is an existing function to add a line to a plot --- `ax.axline()` --- but knowing how to write such functions empowers us to create more expressive displays.
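For reference, here is a minimal sketch of the same plot drawn with the built-in `ax.axline()`, which takes a point on the line and a slope; we reuse the fitted intercept and slope from `results`.
```{python}
# Sketch: ax.axline() draws an infinite line through a given point with a
# given slope; the point (0, intercept) lies on the fitted regression line.
ax = Boston.plot.scatter('lstat', 'medv')
ax.axline((0, results.params['intercept']),
          slope=results.params['lstat'],
          c='r', ls='--', linewidth=3);
```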
Next we examine some diagnostic plots, several of which were discussed
in Section 3.3.3.
We can find the fitted values and residuals
of the fit as attributes of the `results` object.
Various influence measures describing the regression model
are computed with the `get_influence()` method.
As we will not use the `fig` component returned
as the first value from `subplots()`, we simply
capture the second returned value in `ax` below.
```{python}
ax = subplots(figsize=(8,8))[1]
ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
```
We add a horizontal line at 0 for reference using the
`ax.axhline()` method, indicating
it should be black (`c='k'`) and have a dashed linestyle (`ls='--'`).
On the basis of the residual plot, there is some evidence of non-linearity.
Leverage statistics can be computed for any number of predictors using the
`hat_matrix_diag` attribute of the value returned by the
`get_influence()` method.
```{python}
infl = results.get_influence()
ax = subplots(figsize=(8,8))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag)
```
The `np.argmax()` function identifies the index of the largest element of an array, optionally computed over an axis of the array.
In this case, we maximized over the entire array
to determine which observation has the largest leverage statistic.
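As a small illustration of the `axis` argument, consider the following sketch on a toy array.
```{python}
# Sketch: argmax of the flattened array, then along columns and along rows.
arr = np.array([[1, 7, 2],
                [9, 3, 5]])
arr.argmax(), arr.argmax(axis=0), arr.argmax(axis=1)
```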
## Multiple Linear Regression
In order to fit a multiple linear regression model using least squares, we again use
the `ModelSpec()` transform to construct the required
model matrix and response. The arguments
to `ModelSpec()` can be quite general, but in this case
a list of column names suffices. We consider a fit here with
the two variables `lstat` and `age`.
```{python}
X = MS(['lstat', 'age']).fit_transform(Boston)
model1 = sm.OLS(y, X)
results1 = model1.fit()
summarize(results1)
```
Notice how we have compacted the first line into a succinct expression describing the construction of `X`.
The `Boston` data set contains 12 variables, and so it would be cumbersome
to have to type all of these in order to perform a regression using all of the predictors.
Instead, we can use the following short-hand:
```{python}
terms = Boston.columns.drop('medv')
terms
```
We can now fit the model with all the variables in `terms` using
the same model matrix builder.
```{python}
X = MS(terms).fit_transform(Boston)
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
```
What if we would like to perform a regression using all of the variables but one? For
example, in the above regression output, `age` has a high $p$-value.
So we may wish to run a regression excluding this predictor.
The following syntax results in a regression using all predictors except `age`.
```{python}
minus_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())
```
## Multivariate Goodness of Fit
We can access the individual components of `results` by name
(`dir(results)` shows us what is available). Hence
`results.rsquared` gives us the $R^2$,
and
`np.sqrt(results.scale)` gives us the RSE.
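For example, we can display both quantities for the fit above.
```{python}
# R^2 and residual standard error (RSE) for the fit with all predictors.
results.rsquared, np.sqrt(results.scale)
```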
Variance inflation factors (section 3.3.3) are sometimes useful
to assess the effect of collinearity in the model matrix of a regression model.
We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.
### List Comprehension
Often we encounter a sequence of objects which we would like to transform
for some other task. Below, we compute the VIF for each
feature in our `X` matrix and produce a data frame
whose index agrees with the columns of `X`.
The notion of list comprehension can often make such
a task easier.
List comprehensions are simple and powerful ways to form
lists of `Python` objects. The language also supports
dictionary and *generator* comprehension, though these are
beyond our scope here. Let's look at an example. We compute the VIF for each of the variables
in the model matrix `X`, using the function `variance_inflation_factor()`.
```{python}
vals = [VIF(X, i)
for i in range(1, X.shape[1])]
vif = pd.DataFrame({'vif':vals},
index=X.columns[1:])
vif
```
The function `VIF()` takes two arguments: a dataframe or array,
and a variable column index. In the code above we call `VIF()` on the fly for all columns in `X`.
We have excluded column 0 above (the intercept), which is not of interest. In this case the VIFs are not that exciting.
The object `vals` above could have been constructed with the following for loop:
```{python}
vals = []
for i in range(1, X.values.shape[1]):
vals.append(VIF(X.values, i))
```
List comprehension allows us to perform such repetitive operations in a more straightforward way.
## Interaction Terms
It is easy to include interaction terms in a linear model using `ModelSpec()`.
Including a tuple `("lstat","age")` tells the model
matrix builder to include an interaction term between
`lstat` and `age`.
```{python}
X = MS(['lstat',
'age',
('lstat', 'age')]).fit_transform(Boston)
model2 = sm.OLS(y, X)
summarize(model2.fit())
```
## Non-linear Transformations of the Predictors
The model matrix builder can include terms beyond
just column names and interactions. For instance,
the `poly()` function supplied in `ISLP` specifies that
columns representing polynomial functions
of its first argument are added to the model matrix.
```{python}
X = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
model3 = sm.OLS(y, X)
results3 = model3.fit()
summarize(results3)
```
The effectively zero *p*-value associated with the quadratic term
(i.e. the third row above) suggests that it leads to an improved model.
By default, `poly()` creates a basis matrix for inclusion in the
model matrix whose
columns are *orthogonal polynomials*, which are designed for stable
least squares computations. (Actually, `poly()` is a wrapper for the workhorse and standalone function `Poly()` that does the work in building the model matrix.)
Alternatively, had we included an argument
`raw=True` in the above call to `poly()`, the basis matrix would consist simply of
`lstat` and `lstat**2`. Since either of these bases
represent quadratic polynomials, the fitted values would not
change in this case, just the polynomial coefficients. Also by default, the columns
created by `poly()` do not include an intercept column as
that is automatically added by `MS()`.
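As a quick sketch of this point, we can refit with `raw=True` (the names `X_raw` and `results_raw` are just illustrative) and check that the fitted values agree.
```{python}
# Sketch: the raw quadratic basis spans the same column space, so the
# fitted values match those from the orthogonal-polynomial fit.
X_raw = MS([poly('lstat', degree=2, raw=True), 'age']).fit_transform(Boston)
results_raw = sm.OLS(y, X_raw).fit()
np.allclose(results_raw.fittedvalues, results3.fittedvalues)
```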
We use the `anova_lm()` function to further quantify the extent to which the quadratic fit is
superior to the linear fit.
```{python}
anova_lm(results1, results3)
```
Here `results1` represents the linear submodel containing
predictors `lstat` and `age`,
while `results3` corresponds to the larger model above with a quadratic
term in `lstat`.
The `anova_lm()` function performs a hypothesis test
comparing the two models. The null hypothesis is that the quadratic
term in the bigger model is not needed, and the alternative hypothesis is that the
bigger model is superior. Here the *F*-statistic is 177.28 and
the associated *p*-value is zero.
In this case the *F*-statistic is the square of the
*t*-statistic for the quadratic term in the linear model summary
for `results3` --- a consequence of the fact that these nested
models differ by one degree of freedom.
This provides very clear evidence that the quadratic polynomial in
`lstat` improves the linear model.
This is not surprising, since earlier we saw evidence for non-linearity in the relationship between `medv`
and `lstat`.
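As a rough check of the relationship between the two statistics, we can square the *t*-statistic of the quadratic term (the third row of the summary above) and compare it with the *F*-statistic.
```{python}
# Sketch: for nested models differing by one parameter, F equals t^2.
results3.tvalues.iloc[2]**2
```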
The function `anova_lm()` can take more than two nested models
as input, in which case it compares every successive pair of models.
That also explains why there are `NaN`s in the first row above, since
there is no previous model with which to compare the first.
```{python}
ax = subplots(figsize=(8,8))[1]
ax.scatter(results3.fittedvalues, results3.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
```
We see that when the quadratic term is included in the model,
there is little discernible pattern in the residuals.
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
to `poly()`.
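For instance, a degree-three fit might be specified as in the following sketch (the name `X_cubic` is just illustrative).
```{python}
# Sketch: a cubic polynomial in lstat, with age as before.
X_cubic = MS([poly('lstat', degree=3), 'age']).fit_transform(Boston)
summarize(sm.OLS(y, X_cubic).fit())
```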
## Qualitative Predictors
Here we use the `Carseats` data, which is included in the
`ISLP` package. We will attempt to predict `Sales`
(child car seat sales) in 400 locations based on a number of
predictors.
```{python}
Carseats = load_data('Carseats')
Carseats.columns
```
The `Carseats`
data includes qualitative predictors such as
`ShelveLoc`, an indicator of the quality of the shelving
location --- that is,
the space within a store in which the car seat is displayed. The predictor
`ShelveLoc` takes on three possible values, `Bad`, `Medium`, and `Good`.
Given a qualitative variable such as `ShelveLoc`, `ModelSpec()` generates dummy
variables automatically.
These variables are often referred to as a *one-hot encoding* of the categorical
feature. Their columns sum to one, so to avoid collinearity with an intercept, the first column is dropped. Below we see
the column `ShelveLoc[Bad]` has been dropped, since `Bad` is the first level of `ShelveLoc`.
Below we fit a multiple regression model that includes some interaction terms.
```{python}
allvars = list(Carseats.columns.drop('Sales'))
y = Carseats['Sales']
final = allvars + [('Income', 'Advertising'),
('Price', 'Age')]
X = MS(final).fit_transform(Carseats)
model = sm.OLS(y, X)
summarize(model.fit())
```
In the first line above, we made `allvars` a list, so that we
could add the interaction terms two lines down.
Our model-matrix builder has created a `ShelveLoc[Good]`
dummy variable that takes on a value of 1 if the
shelving location is good, and 0 otherwise. It has also created a `ShelveLoc[Medium]`
dummy variable that equals 1 if the shelving location is medium, and 0 otherwise.
A bad shelving location corresponds to a zero for each of the two dummy variables.
The fact that the coefficient for `ShelveLoc[Good]` in the regression output is
positive indicates that a good shelving location is associated with high sales (relative to a bad location).
And `ShelveLoc[Medium]` has a smaller positive coefficient,
indicating that a medium shelving location leads to higher sales than a bad
shelving location, but lower sales than a good shelving location.
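To see this encoding directly, the following sketch lines up the original `ShelveLoc` column with the two dummy columns named above.
```{python}
# Sketch: the one-hot columns for ShelveLoc alongside the original variable.
pd.concat([Carseats['ShelveLoc'],
           X[['ShelveLoc[Good]', 'ShelveLoc[Medium]']]],
          axis=1)[:6]
```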


@@ -1,556 +0,0 @@
# Cross-Validation and the Bootstrap
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch05-resample-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch05-resample-lab.ipynb)
In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.
We again begin by placing most of our imports at this top level.
```{python}
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
summarize,
poly)
from sklearn.model_selection import train_test_split
```
There are several new imports needed for this lab.
```{python}
from functools import partial
from sklearn.model_selection import \
(cross_validate,
KFold,
ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm
```
## The Validation Set Approach
We explore the use of the validation set approach in order to estimate
the test error rates that result from fitting various linear models on
the `Auto` data set.
We use the function `train_test_split()` to split
the data into training and validation sets. As there are 392 observations,
we split into two equal sets of size 196 using the
argument `test_size=196`. It is generally a good idea to set a random seed
when performing operations like this that contain an
element of randomness, so that the results obtained can be reproduced
precisely at a later time. We set the random seed of the splitter
with the argument `random_state=0`.
```{python}
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
test_size=196,
random_state=0)
```
Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.
```{python}
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()
```
We now use the `predict()` method of `results` evaluated on the model matrix for this model
created using the validation data set. We also calculate the validation MSE of our model.
```{python}
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)
```
Hence our estimate for the validation MSE of the linear regression
fit is $23.62$.
We can also estimate the validation error for
higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a list of model terms and a response, as well
as training and test sets, and returns the MSE on the test set.
```{python}
def evalMSE(terms,
response,
train,
test):
mm = MS(terms)
X_train = mm.fit_transform(train)
y_train = train[response]
X_test = mm.transform(test)
y_test = test[response]
results = sm.OLS(y_train, X_train).fit()
test_pred = results.predict(X_test)
return np.mean((y_test - test_pred)**2)
```
Let's use this function to estimate the validation MSE
using linear, quadratic and cubic fits. We use the `enumerate()` function
here, which gives both the values and indices of objects as one iterates
over a for loop.
```{python}
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
MSE[idx] = evalMSE([poly('horsepower', degree)],
'mpg',
Auto_train,
Auto_valid)
MSE
```
These error rates are $23.62, 18.76$, and $18.80$, respectively. If we
choose a different training/validation split instead, then we
can expect somewhat different errors on the validation set.
```{python}
Auto_train, Auto_valid = train_test_split(Auto,
test_size=196,
random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
MSE[idx] = evalMSE([poly('horsepower', degree)],
'mpg',
Auto_train,
Auto_valid)
MSE
```
Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $20.76$, $16.95$, and $16.97$, respectively.
These results are consistent with our previous findings: a model that
predicts `mpg` using a quadratic function of `horsepower`
performs better than a model that involves only a linear function of
`horsepower`, and there is no evidence of an improvement in using a cubic function of `horsepower`.
## Cross-Validation
In theory, the cross-validation estimate can be computed for any generalized
linear model.
In practice, however, the simplest way to cross-validate in
Python is to use `sklearn`, which has a different interface or API
than `statsmodels`, the code we have been using to fit GLMs.
This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don't naturally speak to each other, this
requires the use of a *wrapper*.
In the `ISLP` package,
we provide
a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with
models fit by `statsmodels`.
The class `sklearn_sm()`
has as its first argument
a model from `statsmodels`. It can take two additional
optional arguments: `model_str` which can be
used to specify a formula, and `model_args` which should
be a dictionary of additional arguments used when fitting
the model. For example, to fit a logistic regression model
we have to specify a `family` argument. This
is passed as `model_args={'family':sm.families.Binomial()}`.
Here is our wrapper in action:
```{python}
hp_model = sklearn_sm(sm.OLS,
MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
X,
Y,
cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err
```
The arguments to `cross_validate()` are as follows: an
object with the appropriate `fit()`, `predict()`,
and `score()` methods, an
array of features `X` and a response `Y`.
We also included an additional argument `cv` to `cross_validate()`; specifying an integer
$k$ results in $k$-fold cross-validation. We have provided a value
corresponding to the total number of observations, which results in
leave-one-out cross-validation (LOOCV). The `cross_validate()` function produces a dictionary with several components;
we simply want the cross-validated test score here (MSE), which is estimated to be 24.23.
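The remaining components can be inspected by listing the keys of the returned dictionary.
```{python}
# cross_validate() also records fit and score times for each fold.
cv_results.keys()
```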
We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we again
use a for loop which iteratively fits polynomial
regressions of degree 1 to 5, computes the
associated cross-validation error, and stores it in the $i$th element
of the vector `cv_error`. The variable `d` in the for loop
corresponds to the degree of the polynomial. We begin by initializing the
vector. This command may take a couple of seconds to run.
```{python}
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1,6)):
X = np.power.outer(H, np.arange(d+1))
M_CV = cross_validate(M,
X,
Y,
cv=Auto.shape[0])
cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```
As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.
Above we introduced the `outer()` method of the `np.power()`
function. The `outer()` method is applied to an operation
that has two arguments, such as `add()`, `min()`, or
`power()`.
It has two arrays as
arguments, and then forms a larger
array where the operation is applied to each pair of elements of the
two arrays.
```{python}
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)
```
In the CV example above, we used $k=n$, but of course we can also use $k<n$. The code is very similar
to the above (and is significantly faster). Here we use `KFold()` to partition the data into $k=10$ random groups. We use `random_state` to set a random seed and initialize a vector `cv_error` in which we will store the CV errors corresponding to the
polynomial fits of degrees one to five.
```{python}
cv_error = np.zeros(5)
cv = KFold(n_splits=10,
shuffle=True,
random_state=0) # use same splits for each degree
for i, d in enumerate(range(1,6)):
X = np.power.outer(H, np.arange(d+1))
M_CV = cross_validate(M,
X,
Y,
cv=cv)
cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```
Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $k$-fold CV, due to the
availability of the formula (5.2) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
using a quadratic fit.
The `cross_validate()` function is flexible and can take
different splitting mechanisms as an argument. For instance, one can use the `ShuffleSplit()`
function to implement
the validation set approach just as easily as $k$-fold cross-validation.
```{python}
validation = ShuffleSplit(n_splits=1,
test_size=196,
random_state=0)
results = cross_validate(hp_model,
Auto.drop(['mpg'], axis=1),
Auto['mpg'],
cv=validation);
results['test_score']
```
One can estimate the variability in the test error by running the following:
```{python}
validation = ShuffleSplit(n_splits=10,
test_size=196,
random_state=0)
results = cross_validate(hp_model,
Auto.drop(['mpg'], axis=1),
Auto['mpg'],
cv=validation)
results['test_score'].mean(), results['test_score'].std()
```
Note that this standard deviation is not a valid estimate of the
sampling variability of the mean test score or the individual scores, since the randomly-selected training
samples overlap and hence introduce correlations. But it does give an
idea of the Monte Carlo variation
incurred by picking different random folds.
## The Bootstrap
We illustrate the use of the bootstrap in the simple example
of Section 5.2, as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.
### Estimating the Accuracy of a Statistic of Interest
One of the great advantages of the bootstrap approach is that it can
be applied in almost all situations. No complicated mathematical
calculations are required. While there are several implementations
of the bootstrap in Python, its use for estimating
standard error is simple enough that we write our own function
below for the case when our data is stored
in a dataframe.
To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
in Section 5.2. The goal is to estimate the
sampling variance of the parameter $\alpha$ given in formula (5.7). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
vector `idx` indicating which observations should be used to
estimate
$\alpha$. The function then outputs the estimate for $\alpha$ based on
the selected observations.
```{python}
Portfolio = load_data('Portfolio')
def alpha_func(D, idx):
cov_ = np.cov(D[['X','Y']].loc[idx], rowvar=False)
return ((cov_[1,1] - cov_[0,1]) /
(cov_[0,0]+cov_[1,1]-2*cov_[0,1]))
```
This function returns an estimate for $\alpha$
based on applying the minimum
variance formula (5.7) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.
```{python}
alpha_func(Portfolio, range(100))
```
Next we randomly select
100 observations from `range(100)`, with replacement. This is equivalent
to constructing a new bootstrap data set and recomputing $\hat{\alpha}$
based on the new data set.
```{python}
rng = np.random.default_rng(0)
alpha_func(Portfolio,
rng.choice(100,
100,
replace=True))
```
This process can be generalized to create a simple function `boot_SE()` for
computing the bootstrap standard error for arbitrary
functions that take only a data frame as an argument.
```{python}
def boot_SE(func,
D,
n=None,
B=1000,
seed=0):
rng = np.random.default_rng(seed)
first_, second_ = 0, 0
n = n or D.shape[0]
for _ in range(B):
idx = rng.choice(D.index,
n,
replace=True)
value = func(D, idx)
first_ += value
second_ += value**2
return np.sqrt(second_ / B - (first_ / B)**2)
```
Notice the use of `_` as a loop variable in `for _ in range(B)`. This is often used if the value of the counter is
unimportant and simply makes sure the loop is executed `B` times.
Let's use our function to evaluate the accuracy of our
estimate of $\alpha$ using $B=1{,}000$ bootstrap replications.
```{python}
alpha_SE = boot_SE(alpha_func,
Portfolio,
B=1000,
seed=0)
alpha_SE
```
The final output shows that the bootstrap estimate for ${\rm SE}(\hat{\alpha})$ is $0.0912$.
### Estimating the Accuracy of a Linear Regression Model
The bootstrap approach can be used to assess the variability of the
coefficient estimates and predictions from a statistical learning
method. Here we use the bootstrap approach in order to assess the
variability of the estimates for $\beta_0$ and $\beta_1$, the
intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
To use our `boot_SE()` function, we must write a function (its
first argument)
that takes a data frame `D` and indices `idx`
as its only arguments. But here we want to bootstrap a specific
regression model, specified by a model formula and data. We show how
to do this in a few simple steps.
We start by writing a generic
function `boot_OLS()` for bootstrapping a regression model that takes a formula to
define the corresponding regression. We use the `clone()` function to
make a copy of the formula that can be refit to the new dataframe. This means
that any derived features such as those defined by `poly()`
(which we will see shortly),
will be re-fit on the resampled data frame.
```{python}
def boot_OLS(model_matrix, response, D, idx):
D_ = D.loc[idx]
Y_ = D_[response]
X_ = clone(model_matrix).fit_transform(D_)
return sm.OLS(Y_, X_).fit().params
```
This is not quite what is needed as the first argument to
`boot_SE()`. The first two arguments which specify the model will not change in the
bootstrap process, and we would like to *freeze* them. The
function `partial()` from the `functools` module does precisely this: it takes a function
as an argument, and freezes some of its arguments, starting from the
left. We use it to freeze the first two model-formula arguments of `boot_OLS()`.
```{python}
hp_func = partial(boot_OLS, MS(['horsepower']), 'mpg')
```
Typing `hp_func?` will show that it has two arguments `D`
and `idx` --- it is a version of `boot_OLS()` with the first
two arguments frozen --- and hence is ideal as the first argument for `boot_SE()`.
The `hp_func()` function can now be used in order to create
bootstrap estimates for the intercept and slope terms by randomly
sampling from among the observations with replacement. We first
demonstrate its utility on 10 bootstrap samples.
```{python}
rng = np.random.default_rng(0)
np.array([hp_func(Auto,
rng.choice(Auto.index,
392,
replace=True)) for _ in range(10)])
```
Next, we use the `boot_SE()` function to compute the standard
errors of 1,000 bootstrap estimates for the intercept and slope terms.
```{python}
hp_se = boot_SE(hp_func,
Auto,
B=1000,
seed=10)
hp_se
```
This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
Section 3.1.2, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.sm`.
```{python}
hp_model.fit(Auto, Auto['mpg'])
model_se = summarize(hp_model.results_)['std err']
model_se
```
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
obtained using the formulas from Section 3.1.2 are
0.717 for the
intercept and
0.006 for the
slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
Equation 3.8 on page 75
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formulas for the standard errors do not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
in Figure 3.8 on page 99 that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
are fixed, and all the variability comes from the variation in the
errors $\epsilon_i$. The bootstrap approach does not rely on any of
these assumptions, and so it is likely giving a more accurate estimate
of the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ than
the results from `sm.OLS`.
Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
the data (Figure 3.8), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.
```{python}
quad_model = MS([poly('horsepower', 2, raw=True)])
quad_func = partial(boot_OLS,
quad_model,
'mpg')
boot_SE(quad_func, Auto, B=1000)
```
We compare the results to the standard errors computed using `sm.OLS()`.
```{python}
M = sm.OLS(Auto['mpg'],
quad_model.fit_transform(Auto))
summarize(M.fit())['std err']
```


@@ -1,928 +0,0 @@
# Linear Models and Regularization Methods
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch06-varselect-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch06-varselect-lab.ipynb)
In this lab we implement many of the techniques discussed in this chapter.
We import some of our libraries at this top
level.
```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from statsmodels.api import OLS
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from ISLP import load_data
from ISLP.models import ModelSpec as MS
from functools import partial
```
We again collect the new imports
needed for this lab. Readers will have installed `l0bnb` when installing the requirements.
```{python}
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from ISLP.models import \
(Stepwise,
sklearn_selected,
sklearn_selection_path)
from l0bnb import fit_path
```
Using `skl.ElasticNet` to fit ridge regression
throws up many warnings. We will suppress them below by a call to `warnings.simplefilter()`.
```{python}
import warnings
warnings.simplefilter("ignore")
```
## Subset Selection Methods
Here we implement methods that reduce the number of parameters in a
model by restricting the model to a subset of the input variables.
### Forward Selection
We will apply the forward-selection approach to the `Hitters`
data. We wish to predict a baseball player's `Salary` on the
basis of various statistics associated with performance in the
previous year.
First of all, we note that the `Salary` variable is missing for
some of the players. The `np.isnan()` function can be used to
identify the missing observations. It returns an array
of the same shape as the input vector, with a `True` for any elements that
are missing, and a `False` for non-missing elements. The
`sum()` method can then be used to count all of the
missing elements.
```{python}
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
```
We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`).
```{python}
Hitters = Hitters.dropna();
Hitters.shape
```
We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence
our scoring function computes the negative $C_p$ statistic.
```{python}
def nCp(sigma2, estimator, X, Y):
"Negative Cp statistic"
n, p = X.shape
Yhat = estimator.predict(X)
RSS = np.sum((Y - Yhat)**2)
return -(RSS + 2 * p * sigma2) / n
```
We need to estimate the residual variance $\sigma^2$, which is the first argument in our scoring function above.
We will fit the biggest model, using all the variables, and estimate $\sigma^2$ based on its MSE.
```{python}
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
sigma2 = OLS(Y,X).fit().scale
```
The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$.
```{python}
neg_Cp = partial(nCp, sigma2)
```
We can now use `neg_Cp()` as a scorer for model selection.
Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
runs forward stepwise until any further additions to the model do not result
in an improvement in the evaluation score. Similarly, the method `Stepwise.fixed_steps()`
runs a fixed number of steps of stepwise search.
```{python}
strategy = Stepwise.first_peak(design,
direction='forward',
max_terms=len(design.terms))
```
We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its
`fit` method. Without specifying a `scoring` argument, the score defaults to MSE, and so all 19 variables will be
selected.
```{python}
hitters_MSE = sklearn_selected(OLS,
strategy)
hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_
```
Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.
```{python}
hitters_Cp = sklearn_selected(OLS,
strategy,
scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_
```
### Choosing Among Models Using the Validation Set Approach and Cross-Validation
As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
method that stores the full path of models found in forward selection, and allows predictions for each of these. This can be done with the `sklearn_selection_path()`
estimator from `ISLP.models`. The function `cross_val_predict()` from `ISLP.models`
computes the cross-validated predictions for each of the models
along the path, which we can use to evaluate the cross-validated MSE
along the path.
Here we define a strategy that fits the full forward selection path.
While there are various parameter choices for `sklearn_selection_path()`,
we use the defaults here, which selects the model at each step based on the biggest reduction in RSS.
```{python}
strategy = Stepwise.fixed_steps(design,
len(design.terms),
direction='forward')
full_path = sklearn_selection_path(OLS, strategy)
```
We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.
```{python}
full_path.fit(Hitters, Y)
Yhat_in = full_path.predict(Hitters)
Yhat_in.shape
```
This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves each step we take,
indicating we must use either the validation or cross-validation
approach to select the number of steps. We fix the y-axis to range from
50,000 to 250,000 to compare to the cross-validation and validation
set MSE below, as well as other methods such as ridge regression, lasso and
principal components regression.
```{python}
mse_fig, ax = subplots(figsize=(8,8))
insample_mse = ((Yhat_in - Y[:,None])**2).mean(0)
n_steps = insample_mse.shape[0]
ax.plot(np.arange(n_steps),
insample_mse,
'k', # color black
label='In-sample')
ax.set_ylabel('MSE',
fontsize=20)
ax.set_xlabel('# steps of forward stepwise',
fontsize=20)
ax.set_xticks(np.arange(n_steps)[::2])
ax.legend()
ax.set_ylim([50000,250000]);
```
Notice the expression `None` in `Y[:,None]` above.
This adds an axis (dimension) to the one-dimensional array `Y`,
which allows it to be broadcast when subtracted from the two-dimensional `Yhat_in`.
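A small sketch of the shapes involved:
```{python}
# Y has shape (n,); Y[:,None] has shape (n, 1), which broadcasts against
# the (n, n_steps) array of fitted values Yhat_in.
Y.shape, Y[:, None].shape, Yhat_in.shape
```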
We are now ready to use cross-validation to estimate test error along
the model path. We must use *only the training observations* to perform all aspects of model-fitting --- including
variable selection. Therefore, the determination of which model of a
given size is best must be made using *only the training
observations* in each training fold. This point is subtle but important. If the full data
set is used to select the best subset at each step, then the validation
set errors and cross-validation errors that we obtain will not be
accurate estimates of the test error.
We now compute the cross-validated predicted values using 5-fold cross-validation.
```{python}
K = 5
kfold = skm.KFold(K,
random_state=0,
shuffle=True)
Yhat_cv = skm.cross_val_predict(full_path,
Hitters,
Y,
cv=kfold)
Yhat_cv.shape
```
The prediction matrix `Yhat_cv` is the same shape as `Yhat_in`; the difference is that the predictions in each row, corresponding to a particular sample index, were made from models fit on a training fold that did not include that row.
At each model along the path, we compute the MSE in each of the cross-validation folds.
These we will average to get the mean MSE, and can also use the individual values to compute a crude estimate of the standard error of the mean. (The estimate is crude because the five error estimates are based on overlapping training sets, and hence are not independent.)
Hence we must know the test indices for each cross-validation
split. This can be found by using the `split()` method of `kfold`. Because
we fixed the random state above, whenever we split any array with the same
number of rows as $Y$ we recover the same training and test indices, though we simply
ignore the training indices below.
```{python}
cv_mse = []
for train_idx, test_idx in kfold.split(Y):
errors = (Yhat_cv[test_idx] - Y[test_idx,None])**2
cv_mse.append(errors.mean(0)) # column means
cv_mse = np.array(cv_mse).T
cv_mse.shape
```
We now add the cross-validation error estimates to our MSE plot.
We include the mean error across the five folds, and the estimate of the standard error of the mean.
```{python}
ax.errorbar(np.arange(n_steps),
cv_mse.mean(1),
cv_mse.std(1) / np.sqrt(K),
label='Cross-validated',
c='r') # color red
ax.set_ylim([50000,250000])
ax.legend()
mse_fig
```
To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and training. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation.
```{python}
validation = skm.ShuffleSplit(n_splits=1,
test_size=0.2,
random_state=0)
for train_idx, test_idx in validation.split(Y):
full_path.fit(Hitters.iloc[train_idx],
Y[train_idx])
Yhat_val = full_path.predict(Hitters.iloc[test_idx])
errors = (Yhat_val - Y[test_idx,None])**2
validation_mse = errors.mean(0)
```
As for the in-sample MSE case, the validation set approach does not provide standard errors.
```{python}
ax.plot(np.arange(n_steps),
validation_mse,
'b--', # color blue, broken line
label='Validation')
ax.set_xticks(np.arange(n_steps)[::2])
ax.set_ylim([50000,250000])
ax.legend()
mse_fig
```
### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
data, which for every subset size, searches for the best set of predictors.
We will use a package called `l0bnb` to perform
best subset selection.
Instead of constraining the subset to be a given size,
this package produces a path of solutions using the subset size as a
penalty rather than a constraint. Although the distinction is subtle, the difference comes when we cross-validate.
```{python}
D = design.fit_transform(Hitters)
D = D.drop('intercept', axis=1)
X = np.asarray(D)
```
Here we excluded the first column corresponding to the intercept, as
`l0bnb` will fit the intercept separately. We can find a path using the `fit_path()` function.
```{python}
path = fit_path(X,
Y,
max_nonzeros=X.shape[1])
```
The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.
```{python}
path[3]
```
In the example above, we see that at the fourth step in the path, we have two nonzero coefficients in `'B'`, corresponding to the value $0.0114$ for the penalty parameter `lambda_0`.
We could make predictions using this sequence of fits on a validation set as a function of `lambda_0`, or with more work using cross-validation.
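For instance, assuming (as the output above suggests) that each entry of `path` is a dictionary with coefficients `'B'` and intercept `'B0'`, in-sample predictions at the fourth step might be sketched as follows.
```{python}
# Sketch: predictions and in-sample MSE from the fourth solution on the path.
soln = path[3]
Yhat_path = X @ soln['B'] + soln['B0']
np.mean((Y - Yhat_path)**2)
```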
## Ridge Regression and the Lasso
We will use the `sklearn.linear_model` package (for which
we use `skl` as shorthand below)
to fit ridge and lasso regularized linear models on the `Hitters` data.
We start with the model matrix `X` (without an intercept) that we computed in the previous section on best subset regression.
### Ridge Regression
We will use the function `skl.ElasticNet()` to fit both ridge and the lasso.
To fit a *path* of ridge regressions models, we use
`skl.ElasticNet.path()`, which can fit both ridge and lasso, as well as a hybrid mixture; ridge regression
corresponds to `l1_ratio=0`.
It is good practice to standardize the columns of `X` in these applications, if the variables are measured in different units. Since `skl.ElasticNet()` does no normalization, we have to take care of that ourselves.
Since we
standardize first, in order to find coefficient
estimates on the original scale, we must *unstandardize*
the coefficient estimates. The parameter
$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
be consistent with the rest of this chapter, we use `lambdas`
rather than `alphas` in what follows. (At the time of publication, ridge fits like the one in code chunk [23] issue unwarranted convergence warning messages; we suppressed these when we filtered the warnings above.)
```{python}
Xs = X - X.mean(0)[None,:]
X_scale = X.std(0)
Xs = Xs / X_scale[None,:]
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
soln_array = skl.ElasticNet.path(Xs,
Y,
l1_ratio=0.,
alphas=lambdas)[1]
soln_array.shape
```
Here we extract the array of coefficients corresponding to the solutions along the regularization path.
By default the `skl.ElasticNet.path` method fits a path along
an automatically selected range of $\lambda$ values, except for the case when
`l1_ratio=0`, which results in ridge regression (as is the case here). (The reason is rather technical; for all models except ridge, we can find the smallest value of $\lambda$ for which all coefficients are zero. For ridge this value is $\infty$.) So here
we have chosen to implement the function over a grid of values ranging
from $\lambda=10^{8}$ to $\lambda=10^{-2}$ scaled by the standard
deviation of $y$, essentially covering the full range of scenarios
from the null model containing only the intercept, to the least
squares fit.
Associated with each value of $\lambda$ is a vector of ridge
regression coefficients, that can be accessed by
a column of `soln_array`. In this case, `soln_array` is a $19 \times 100$ matrix, with
19 rows (one for each predictor) and 100
columns (one for each value of $\lambda$).
We transpose this matrix and turn it into a data frame to facilitate viewing and plotting.
```{python}
soln_path = pd.DataFrame(soln_array.T,
columns=D.columns,
index=-np.log(lambdas))
soln_path.index.name = 'negative log(lambda)'
soln_path
```
We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`.
```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20)
ax.legend(loc='upper left');
```
(We have used LaTeX formatting in the horizontal label, in order to format the Greek $\lambda$ appropriately.)
We expect the coefficient estimates to be much smaller, in terms of
$\ell_2$ norm, when a large value of $\lambda$ is used, as compared to
when a small value of $\lambda$ is used. (Recall that the $\ell_2$ norm is the square root of the sum of squared coefficient values.) We display the coefficients at the $40$th step,
where $\lambda$ is 25.535.
```{python}
beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat
```
Let's compute the $\ell_2$ norm of the standardized coefficients.
```{python}
np.linalg.norm(beta_hat)
```
In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$.
```{python}
beta_hat = soln_path.loc[soln_path.index[59]]
lambdas[59], np.linalg.norm(beta_hat)
```
Above we normalized `X` upfront, and fit the ridge model using `Xs`.
The `Pipeline()` object
in `sklearn` provides a clear way to separate feature
normalization from the fitting of the ridge model itself.
```{python}
ridge = skl.ElasticNet(alpha=lambdas[59], l1_ratio=0)
scaler = StandardScaler(with_mean=True, with_std=True)
pipe = Pipeline(steps=[('scaler', scaler), ('ridge', ridge)])
pipe.fit(X, Y)
```
We show that it gives the same $\ell_2$ norm as in our previous fit on the standardized data.
```{python}
np.linalg.norm(ridge.coef_)
```
Notice that the operation `pipe.fit(X, Y)` above has changed the `ridge` object, and in particular has added attributes such as `coef_` that were not there before.
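A small check of this side effect, using the objects already defined above: a freshly constructed estimator has no `coef_` attribute, while the `ridge` instance that was fit inside the pipeline does.
```{python}
# A new, unfitted estimator has no coef_; the ridge fit via the pipeline does
fresh_ridge = skl.ElasticNet(alpha=lambdas[59], l1_ratio=0)
hasattr(fresh_ridge, 'coef_'), hasattr(ridge, 'coef_')
```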
### Estimating Test Error of Ridge Regression
Choosing an *a priori* value of $\lambda$ for ridge regression is
difficult if not impossible. We will want to use the validation method
or cross-validation to select the tuning parameter. The reader may not
be surprised that the `Pipeline()` approach can be used in
`skm.cross_validate()` with either a validation method
(i.e. `validation`) or $k$-fold cross-validation.
We fix the random state of the splitter
so that the results obtained will be reproducible.
```{python}
validation = skm.ShuffleSplit(n_splits=1,
test_size=0.5,
random_state=0)
ridge.alpha = 0.01
results = skm.cross_validate(ridge,
X,
Y,
scoring='neg_mean_squared_error',
cv=validation)
-results['test_score']
```
The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the
training observations. We can get the same result by fitting a ridge regression model
with a *very* large value of $\lambda$. Note that `1e10`
means $10^{10}$.
```{python}
ridge.alpha = 1e10
results = skm.cross_validate(ridge,
X,
Y,
scoring='neg_mean_squared_error',
cv=validation)
-results['test_score']
```
Obviously choosing $\lambda=0.01$ is arbitrary, so we will use cross-validation or the validation-set
approach to choose the tuning parameter $\lambda$.
The object `GridSearchCV()` allows exhaustive
grid search to choose such a parameter.
We first use the validation set method
to choose $\lambda$.
```{python}
param_grid = {'ridge__alpha': lambdas}
grid = skm.GridSearchCV(pipe,
param_grid,
cv=validation,
scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_
```
Alternatively, we can use 5-fold cross-validation.
```{python}
grid = skm.GridSearchCV(pipe,
param_grid,
cv=kfold,
scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_
```
Recall we set up the `kfold` object for 5-fold cross-validation on page 271. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
to right.
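If `kfold` (and the fold count `K` used below for the standard-error bars) is not already defined in your session, a setup along the following lines reproduces that earlier cell; the exact random seed here is an assumption.
```{python}
# Assumed earlier setup: 5-fold CV splitter (seed is illustrative)
K = 5
kfold = skm.KFold(K,
                  random_state=0,
                  shuffle=True)
```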
```{python}
ridge_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
-grid.cv_results_['mean_test_score'],
yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylim([50000,250000])
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```
One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$.
Let's compare $R^2$ to MSE for cross-validation here.
```{python}
grid_r2 = skm.GridSearchCV(pipe,
param_grid,
cv=kfold)
grid_r2.fit(X, Y)
```
Finally, lets plot the results for cross-validated $R^2$.
```{python}
r2_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
grid_r2.cv_results_['mean_test_score'],
yerr=grid_r2.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
```
### Fast Cross-Validation for Solution Paths
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
such paths, and to choose a suitable value of $\lambda$ using cross-validation. Even with
identical splits the results will not agree *exactly* with our `grid`
above because the standardization of each feature in `grid` is carried out on each fold,
while in `pipeCV` below it is carried out only once.
Nevertheless, the results are similar as the normalization
is relatively stable across folds.
```{python}
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
l1_ratio=0,
cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
('ridge', ridgeCV)])
pipeCV.fit(X, Y)
```
Let's again produce a plot of the cross-validation error to see that
it is similar to using `skm.GridSearchCV`.
```{python}
tuned_ridge = pipeCV.named_steps['ridge']
ridgeCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
tuned_ridge.mse_path_.mean(1),
yerr=tuned_ridge.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_ridge.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```
We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE
associated with this value of $\lambda$?
```{python}
np.min(tuned_ridge.mse_path_.mean(1))
```
This represents a further improvement over the test MSE that we got
using $\lambda=4$. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set
at this value of $\lambda$.
```{python}
tuned_ridge.coef_
```
As expected, none of the coefficients are zero—ridge regression does
not perform variable selection!
### Evaluating Test Error of Cross-Validated Ridge
Choosing $\lambda$ using cross-validation provides a single regression
estimator, similar to fitting a linear regression model as we saw in
Chapter 3. It is therefore reasonable to estimate what its test error
is. We run into a problem here in that cross-validation will have
*touched* all of its data in choosing $\lambda$, hence we have no
further data to estimate test error. A compromise is to do an initial
split of the data into two disjoint sets: a training set and a test set.
We then fit a cross-validation
tuned ridge regression on the training set, and evaluate its performance on the test set.
We might call this cross-validation nested
within the validation set approach. A priori there is no reason to use
half of the data for each of the two sets in validation. Below, we use
75% for training and 25% for test, with the estimator being ridge
regression tuned using 5-fold cross-validation. This can be achieved
in code as follows:
```{python}
outer_valid = skm.ShuffleSplit(n_splits=1,
test_size=0.25,
random_state=1)
inner_cv = skm.KFold(n_splits=5,
shuffle=True,
random_state=2)
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
l1_ratio=0,
cv=inner_cv)
pipeCV = Pipeline(steps=[('scaler', scaler),
('ridge', ridgeCV)]);
```
```{python}
results = skm.cross_validate(pipeCV,
X,
Y,
cv=outer_valid,
scoring='neg_mean_squared_error')
-results['test_score']
```
### The Lasso
We saw that ridge regression with a wise choice of $\lambda$ can
outperform least squares as well as the null model on the
`Hitters` data set. We now ask whether the lasso can yield
either a more accurate or a more interpretable model than ridge
regression. In order to fit a lasso model, we once again use the
`ElasticNetCV()` function; however, this time we use the argument
`l1_ratio=1`. Other than that change, we proceed just as we did in
fitting a ridge model.
```{python}
lassoCV = skl.ElasticNetCV(n_alphas=100,
l1_ratio=1,
cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
('lasso', lassoCV)])
pipeCV.fit(X, Y)
tuned_lasso = pipeCV.named_steps['lasso']
tuned_lasso.alpha_
```
```{python}
lambdas, soln_array = skl.Lasso.path(Xs,
Y,
l1_ratio=1,
n_alphas=100)[:2]
soln_path = pd.DataFrame(soln_array.T,
columns=D.columns,
index=-np.log(lambdas))
```
We can see from the coefficient plot of the standardized coefficients that depending on the choice of
tuning parameter, some of the coefficients will be exactly equal to
zero.
```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.legend(loc='upper left')
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20);
```
The smallest cross-validated error is lower than the test set MSE of the null model
and of least squares, and very similar to the test MSE of 115526.71 of ridge
regression (page 278) with $\lambda$ chosen by cross-validation.
```{python}
np.min(tuned_lasso.mse_path_.mean(1))
```
Let's again produce a plot of the cross-validation error.
```{python}
lassoCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(tuned_lasso.alphas_),
tuned_lasso.mse_path_.mean(1),
yerr=tuned_lasso.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_lasso.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```
However, the lasso has a substantial advantage over ridge regression
in that the resulting coefficient estimates are sparse. Here we see
that 6 of the 19 coefficient estimates are exactly zero. So the lasso
model with $\lambda$ chosen by cross-validation contains only 13
variables.
```{python}
tuned_lasso.coef_
```
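A quick way to confirm the count of zero coefficients in the output above:
```{python}
# Count how many lasso coefficients are exactly zero
np.sum(tuned_lasso.coef_ == 0)
```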
As in ridge regression, we could evaluate the test error
of cross-validated lasso by first splitting into
test and training sets and internally running
cross-validation on the training set. We leave
this as an exercise.
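A sketch of how that exercise might look, mirroring the nested validation used for ridge above; the 25% test split, the random seed, and the names `lassoCV_inner` and `pipe_lasso` are arbitrary choices made only for illustration.
```{python}
outer_valid = skm.ShuffleSplit(n_splits=1,
                               test_size=0.25,
                               random_state=1)
# Inner loop: lasso tuned by cross-validation on the training portion
lassoCV_inner = skl.ElasticNetCV(n_alphas=100,
                                 l1_ratio=1,
                                 cv=kfold)
pipe_lasso = Pipeline(steps=[('scaler', scaler),
                             ('lasso', lassoCV_inner)])
# Outer loop: evaluate the tuned pipeline on the held-out test set
results = skm.cross_validate(pipe_lasso,
                             X,
                             Y,
                             cv=outer_valid,
                             scoring='neg_mean_squared_error')
-results['test_score']
```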
## PCR and PLS Regression
### Principal Components Regression
Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition`
module. We now apply PCR to the `Hitters` data, in order to
predict `Salary`. Again, ensure that the missing values have
been removed from the data, as described in Section 6.5.1.
We use `LinearRegression()` to fit the regression model
here. Note that it fits an intercept by default, unlike
the `OLS()` function seen earlier in Section 6.5.1.
```{python}
pca = PCA(n_components=2)
linreg = skl.LinearRegression()
pipe = Pipeline([('pca', pca),
('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```
When performing PCA, the results vary depending
on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished
by including an additional step in the pipeline.
```{python}
pipe = Pipeline([('scaler', scaler),
('pca', pca),
('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```
We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this
case specifying a parameter grid that varies
`n_components`.
```{python}
param_grid = {'pca__n_components': range(1, 20)}
grid = skm.GridSearchCV(pipe,
param_grid,
cv=kfold,
scoring='neg_mean_squared_error')
grid.fit(X, Y)
```
Let's plot the results as we have for other methods.
```{python}
pcr_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['pca__n_components']
ax.errorbar(n_comp,
-grid.cv_results_['mean_test_score'],
grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# principal components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```
We see that the smallest cross-validation error occurs when
17
components are used. However, from the plot we also see that the
cross-validation error is roughly the same when only one component is
included in the model. This suggests that a model that uses just a
small number of components might suffice.
The CV score is provided for each possible number of components from
1 to 19 inclusive. The `PCA()` method complains
if we try to fit an intercept only with `n_components=0`
so we also compute the MSE for just the null model with
these splits.
```{python}
Xn = np.zeros((X.shape[0], 1))
cv_null = skm.cross_validate(linreg,
Xn,
Y,
cv=kfold,
scoring='neg_mean_squared_error')
-cv_null['test_score'].mean()
```
The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using
different numbers of components. This concept is discussed in greater
detail in Section 12.2.
```{python}
pipe.named_steps['pca'].explained_variance_ratio_
```
Briefly, we can think of
this as the amount of information about the predictors
that is captured using $M$ principal components. For example, setting
$M=1$ only captures 38.31% of the variance, while $M=2$ captures an additional 21.84%, for a total of 60.15% of the variance.
By $M=6$ it increases to
88.63%. Beyond this the increments continue to diminish, until we use all $M=p=19$ components, which captures all 100% of the variance.
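The cumulative percentages quoted above can be recovered with `np.cumsum()`; here is a small sketch that refits a PCA with all components on the standardized matrix `Xs` from earlier.
```{python}
# Cumulative proportion of variance explained by the first M components
pca_all = PCA().fit(Xs)
np.cumsum(pca_all.explained_variance_ratio_)
```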
### Partial Least Squares
Partial least squares (PLS) is implemented in the
`PLSRegression()` function.
```{python}
pls = PLSRegression(n_components=2,
scale=True)
pls.fit(X, Y)
```
As was the case in PCR, we will want to
use CV to choose the number of components.
```{python}
param_grid = {'n_components':range(1, 20)}
grid = skm.GridSearchCV(pls,
param_grid,
cv=kfold,
scoring='neg_mean_squared_error')
grid.fit(X, Y)
```
As for our other methods, we plot the MSE.
```{python}
pls_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['n_components']
ax.errorbar(n_comp,
-grid.cv_results_['mean_test_score'],
grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# PLS components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```
CV error is minimized at 12,
though there is little noticeable difference between this point and a much lower number like 2 or 3 components.
@@ -1,900 +0,0 @@
# Non-Linear Modeling
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch07-nonlin-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch07-nonlin-lab.ipynb)
In this lab, we demonstrate some of the nonlinear models discussed in
this chapter. We use the `Wage` data as a running example, and show that many of the complex non-linear fitting procedures discussed can easily be implemented in `Python`.
As usual, we start with some of our standard imports.
```{python}
import numpy as np, pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (summarize,
poly,
ModelSpec as MS)
from statsmodels.stats.anova import anova_lm
```
We again collect the new imports
needed for this lab. Many of these are developed specifically for the
`ISLP` package.
```{python}
from pygam import (s as s_gam,
l as l_gam,
f as f_gam,
LinearGAM,
LogisticGAM)
from ISLP.transforms import (BSpline,
NaturalSpline)
from ISLP.models import bs, ns
from ISLP.pygam import (approx_lam,
degrees_of_freedom,
plot as plot_gam,
anova as anova_gam)
```
## Polynomial Regression and Step Functions
We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.
```{python}
Wage = load_data('Wage')
y = Wage['wage']
age = Wage['age']
```
Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above.
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
that will fit a $4$th degree polynomial in `age`.
```{python}
poly_age = MS([poly('age', degree=4)]).fit(Wage)
M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)
```
This polynomial is constructed using the function `poly()`,
which creates
a special *transformer* `Poly()` (using `sklearn` terminology
for feature transformations such as `PCA()` seen in Section 6.5.3) which
allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
the
discussion of transformations on
page 118.
In the code above, the first line executes the `fit()` method
using the dataframe
`Wage`. This recomputes and stores as attributes any parameters needed by `Poly()`
on the training data, and these will be used on all subsequent
evaluations of the `transform()` method. For example, it is used
on the second line, as well as in the plotting function developed below.
We now create a grid of values for `age` at which we want
predictions.
```{python}
age_grid = np.linspace(age.min(),
age.max(),
100)
age_df = pd.DataFrame({'age': age_grid})
```
Finally, we wish to plot the data and add the fit from the fourth-degree polynomial. As we will make
several similar plots below, we first write a function
to create all the ingredients and produce the plot. Our function
takes in a model specification (here a basis specified by a
transform), as well as a grid of `age` values. The function
produces a fitted curve as well as 95% confidence bands. By using
an argument for `basis` we can produce and plot the results with several different
transforms, such as the splines we will see shortly.
```{python}
def plot_wage_fit(age_df,
basis,
title):
X = basis.transform(Wage)
Xnew = basis.transform(age_df)
M = sm.OLS(y, X).fit()
preds = M.get_prediction(Xnew)
bands = preds.conf_int(alpha=0.05)
fig, ax = subplots(figsize=(8,8))
ax.scatter(age,
y,
facecolor='gray',
alpha=0.5)
for val, ls in zip([preds.predicted_mean,
bands[:,0],
bands[:,1]],
['b','r--','r--']):
ax.plot(age_df.values, val, ls, linewidth=3)
ax.set_title(title, fontsize=20)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
return ax
```
We include an argument `alpha` to `ax.scatter()`
to add some transparency to the points. This provides a visual indication
of density. Notice the use of the `zip()` function in the
`for` loop above (see Section 2.3.8).
We have three lines to plot, each with different colors and line
types. Here `zip()` conveniently bundles these together as
iterators in the loop. {In `Python` speak, an "iterator" is an object with a finite number of values that can be iterated over, as in a loop.}
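As a tiny standalone illustration of `zip()` (unrelated to the `Wage` data), each pass through the loop receives one value from each sequence:
```{python}
# zip() pairs corresponding elements of the two lists
for val, ls in zip([10, 20, 30], ['b', 'r--', 'r--']):
    print(val, ls)
```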
We now plot the fit of the fourth-degree polynomial using this
function.
```{python}
plot_wage_fit(age_df,
poly_age,
'Degree-4 Polynomial');
```
With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
make such a decision in a more systematic way. One way to do this is through hypothesis
tests, which we demonstrate here. We now fit a series of models ranging from
linear (degree-one) to degree-five polynomials,
and look to determine the simplest model that is sufficient to
explain the relationship between `wage` and `age`. We use the
`anova_lm()` function, which performs a series of ANOVA
tests.
An \emph{analysis of
variance} or ANOVA tests the null
hypothesis that a model $\mathcal{M}_1$ is sufficient to explain the
data against the alternative hypothesis that a more complex model
$\mathcal{M}_2$ is required. The determination is based on an F-test.
To perform the test, the models $\mathcal{M}_1$ and $\mathcal{M}_2$ must be *nested*:
the space spanned by the predictors in $\mathcal{M}_1$ must be a subspace of the
space spanned by the predictors in $\mathcal{M}_2$. In this case, we
fit five different polynomial
models and sequentially compare the simpler model to the more complex
model.
```{python}
models = [MS([poly('age', degree=d)])
for d in range(1, 6)]
Xs = [model.fit_transform(Wage) for model in models]
anova_lm(*[sm.OLS(y, X_).fit()
for X_ in Xs])
```
Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be
prefixed by `*`.
The p-value comparing the linear `models[0]` to the quadratic
`models[1]` is essentially zero, indicating that a linear
fit is not sufficient. {Indexing starting at zero is confusing for the polynomial degree example, since `models[1]` is quadratic rather than linear!} Similarly the p-value comparing the quadratic
`models[1]` to the cubic `models[2]` is very low (0.0017), so the
quadratic fit is also insufficient. The p-value comparing the cubic
and degree-four polynomials, `models[2]` and `models[3]`, is
approximately 5%, while the degree-five polynomial `models[4]` seems
unnecessary because its p-value is 0.37. Hence, either a cubic or a
quartic polynomial appear to provide a reasonable fit to the data, but
lower- or higher-order models are not justified.
In this case, instead of using the `anova_lm()` function, we could
have obtained these p-values more succinctly by exploiting the fact
that `poly()` creates orthogonal polynomials.
```{python}
summarize(M)
```
Notice that the p-values are the same, and in fact the square of
the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example:
```{python}
(-11.983)**2
```
However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three
models, which all have a linear term in `education` and a
polynomial in `age` of different degrees:
```{python}
models = [MS(['education', poly('age', degree=d)])
for d in range(1, 4)]
XEs = [model.fit_transform(Wage)
for model in models]
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
```
As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter 5.
Next we consider the task of predicting whether an individual earns
more than $250,000 per year. We proceed much as before, except
that first we create the appropriate response vector, and then apply
the `sm.GLM()` function using the binomial family in order
to fit a polynomial logistic regression model.
```{python}
X = poly_age.transform(Wage)
high_earn = Wage['high_earn'] = y > 250 # shorthand
glm = sm.GLM(y > 250,
X,
family=sm.families.Binomial())
B = glm.fit()
summarize(B)
```
Once again, we make predictions using the `get_prediction()` method.
```{python}
newX = poly_age.transform(age_df)
preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)
```
We now plot the estimated relationship.
```{python}
fig, ax = subplots(figsize=(8,8))
rng = np.random.default_rng(0)
ax.scatter(age +
0.2 * rng.uniform(size=y.shape[0]),
np.where(high_earn, 0.198, 0.002),
fc='gray',
marker='|')
for val, ls in zip([preds.predicted_mean,
bands[:,0],
bands[:,1]],
['b','r--','r--']):
ax.plot(age_df.values, val, ls, linewidth=3)
ax.set_title('Degree-4 Polynomial', fontsize=20)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylim([0,0.2])
ax.set_ylabel('P(Wage > 250)', fontsize=20);
```
We have drawn the `age` values corresponding to the observations with
`wage` values above 250 as gray marks on the top of the plot, and
those with `wage` values below 250 are shown as gray marks on the
bottom of the plot. We added a small amount of noise to jitter
the `age` values a bit so that observations with the same `age`
value do not cover each other up. This type of plot is often called a
*rug plot*.
In order to fit a step function, as discussed in
Section 7.2, we first use the `pd.qcut()`
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
columns of the model matrix for this categorical variable. Note that this function will
include *all* columns for a given categorical, rather than the usual approach which drops one
of the levels.
```{python}
cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
```
Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
The function `pd.qcut()` (and `pd.cut()`) returns an ordered categorical variable.
The regression model then creates a set of
dummy variables for use in the regression. Since `age` is the only variable in the model, the value $94,158.40 is the average salary for those under 33.75 years of
age, and the other coefficients are the average
salary for those in the other age groups. We can produce
predictions and plots just as we did in the case of the polynomial
fit.
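For completeness, here is a sketch of the two variants just mentioned: passing the quantiles to `pd.qcut()` explicitly, and using `pd.cut()` with hand-picked cutpoints (the cutpoints below are arbitrary, chosen only for illustration).
```{python}
# Equivalent to pd.qcut(age, 4): supply the quantiles explicitly
cut_age_q = pd.qcut(age, [0, 0.25, 0.5, 0.75, 1])
# Cuts at fixed ages rather than quantiles (illustrative cutpoints)
cut_age_fixed = pd.cut(age, [age.min() - 1, 35, 50, 65, age.max()])
cut_age_q.value_counts(), cut_age_fixed.value_counts()
```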
## Splines
In order to fit regression splines, we use transforms
from the `ISLP` package. The actual spline
evaluation functions are in the `scipy.interpolate` package;
we have simply wrapped them as transforms
similar to `Poly()` and `PCA()`.
In Section 7.4, we saw
that regression splines can be fit by constructing an appropriate
matrix of basis functions. The `BSpline()` function generates the
entire matrix of basis functions for splines with the specified set of
knots. By default, the B-splines produced are cubic. To change the degree, use
the argument `degree`.
```{python}
bs_ = BSpline(internal_knots=[25,40,60], intercept=True).fit(age)
bs_age = bs_.transform(age)
bs_age.shape
```
This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
We can form this same matrix using the `bs()` object,
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.
We now fit a cubic spline model to the `Wage` data.
```{python}
bs_age = MS([bs('age', internal_knots=[25,40,60])])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```
The column names are a little cumbersome, and have caused us to truncate the printed summary. They can be set on construction using the `name` argument as follows.
```{python}
bs_age = MS([bs('age',
internal_knots=[25,40,60],
name='bs(age)')])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```
Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
We could also use the `df` (degrees of freedom) option to
specify the complexity of the spline. We see above that with 3 knots,
the spline basis has 6 columns or degrees of freedom. When we specify
`df=6` rather than the actual knots, `bs()` will produce a
spline with 3 knots chosen at uniform quantiles of the training data.
We can see these chosen knots most easily using `BSpline()` directly:
```{python}
BSpline(df=6).fit(age).internal_knots_
```
When asking for six degrees of freedom,
the transform chooses knots at ages 33.75, 42.0, and 51.0,
which correspond to the 25th, 50th, and 75th percentiles of
`age`.
When using B-splines we need not limit ourselves to cubic polynomials
(i.e. `degree=3`). For instance, using `degree=0` results
in piecewise constant functions, as in our example with
`pd.qcut()` above.
```{python}
bs_age0 = MS([bs('age',
df=3,
degree=0)]).fit(Wage)
Xbs0 = bs_age0.transform(Wage)
summarize(sm.OLS(y, Xbs0).fit())
```
This fit should be compared with cell [15] where we use `qcut()`
to create four bins by cutting at the 25%, 50% and 75% quantiles of
`age`. Since we specified `df=3` for degree-zero splines here, there will also be
knots at the same three quantiles. Although the coefficients appear different, we
see that this is a result of the different coding. For example, the
first coefficient is identical in both cases, and is the mean response
in the first bin. For the second coefficient, we have
$94.158 + 22.349 = 116.507 \approx 116.611$, the latter being the mean
in the second bin in cell [15]. Here the intercept is coded by a column
of ones, so the second, third and fourth coefficients are increments
for those bins. Why is the sum not exactly the same? It turns out that
the `qcut()` uses $\leq$, while `bs()` uses $<$ when
deciding bin membership.
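As a quick check of this comparison, the per-bin mean wages can be computed directly from the `pd.qcut()` bins created earlier; they should match the intercept-plus-increment arithmetic above, up to the boundary issue just described.
```{python}
# Mean wage within each qcut bin; compare with the coefficient sums above
y.groupby(cut_age).mean()
```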
In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results.
```{python}
ns_age = MS([ns('age', df=5)]).fit(Wage)
M_ns = sm.OLS(y, ns_age.transform(Wage)).fit()
summarize(M_ns)
```
We now plot the natural spline using our plotting function.
```{python}
plot_wage_fit(age_df,
ns_age,
'Natural spline, df=5');
```
## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the
`pygam` package which can be installed via `pip install pygam`. The
estimator `LinearGAM()` uses squared-error loss.
The GAM is specified by associating each column
of a model matrix with a particular smoothing operation:
`s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables.
The argument `0` passed to `s` below indicates that this smoother will
apply to the first column of a feature matrix. Below, we pass it a
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.
```{python}
X_age = np.asarray(age).reshape((-1,1))
gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)
```
The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the
size of that dimension based on the remaining entries of the shape tuple.
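As a tiny illustration of this reshaping (independent of the `Wage` data), a length-five vector becomes a $5 \times 1$ matrix:
```{python}
# reshape((-1, 1)) turns a one-dimensional array into a single-column matrix
np.arange(5).reshape((-1, 1)).shape
```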
Let's investigate how the fit changes with the smoothing parameter `lam`.
The function `np.logspace()` is similar to `np.linspace()` but spaces points
evenly on the log-scale. Below we vary `lam` from $10^{-2}$ to $10^6$.
```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for lam in np.logspace(-2, 6, 5):
gam = LinearGAM(s_gam(0, lam=lam)).fit(X_age, y)
ax.plot(age_grid,
gam.predict(age_grid),
label='{:.1e}'.format(lam),
linewidth=3)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='$\lambda$');
```
The `pygam` package can perform a search for an optimal smoothing parameter.
```{python}
gam_opt = gam.gridsearch(X_age, y)
ax.plot(age_grid,
gam_opt.predict(age_grid),
label='Grid search',
linewidth=4)
ax.legend()
fig
```
Alternatively, we can fix the degrees of freedom of the smoothing
spline using a function included in the `ISLP.pygam` package. Below we
find a value of $\lambda$ that gives us roughly four degrees of
freedom. We note here that these degrees of freedom include the
unpenalized intercept and linear term of the smoothing spline, hence there are at least two
degrees of freedom.
```{python}
age_term = gam.terms[0]
lam_4 = approx_lam(X_age, age_term, 4)
age_term.lam = lam_4
degrees_of_freedom(X_age, age_term)
```
Let's vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
as the desired degrees of freedom plus one to account for the fact that these smoothing
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(X_age,
y,
facecolor='gray',
alpha=0.3)
for df in [1,3,4,8,15]:
lam = approx_lam(X_age, age_term, df+1)
age_term.lam = lam
gam.fit(X_age, y)
ax.plot(age_grid,
gam.predict(age_grid),
label='{:d}'.format(df),
linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='Degrees of freedom');
```
### Additive Models with Several Terms
The strength of generalized additive models lies in their ability to fit multivariate regression models with more flexibility than linear models. We demonstrate two approaches: the first in a more manual fashion using natural splines and piecewise constant functions, and the second using the `pygam` package and smoothing splines.
We now fit a GAM by hand to predict
`wage` using natural spline functions of `year` and `age`,
treating `education` as a qualitative predictor, as in (7.16).
Since this is just a big linear regression model
using an appropriate choice of basis functions, we can simply do this
using the `sm.OLS()` function.
We will build the model matrix in a more manual fashion here, since we wish
to access the pieces separately when constructing partial dependence plots.
```{python}
ns_age = NaturalSpline(df=4).fit(age)
ns_year = NaturalSpline(df=5).fit(Wage['year'])
Xs = [ns_age.transform(age),
ns_year.transform(Wage['year']),
pd.get_dummies(Wage['education']).values]
X_bh = np.hstack(Xs)
gam_bh = sm.OLS(y, X_bh).fit()
```
Here the function `NaturalSpline()` is the workhorse supporting
the `ns()` helper function. We chose to use all columns of the
indicator matrix for the categorical variable `education`, making an intercept redundant.
Finally, we stacked the three component matrices horizontally to form the model matrix `X_bh`.
We now show how to construct partial dependence plots for each of the terms in our rudimentary GAM. We can do this by hand,
given grids for `age` and `year`.
We simply predict with new $X$ matrices, fixing all but one of the features at a time.
```{python}
age_grid = np.linspace(age.min(),
age.max(),
100)
X_age_bh = X_bh.copy()[:100]
X_age_bh[:] = X_bh[:].mean(0)[None,:]
X_age_bh[:,:4] = ns_age.transform(age_grid)
preds = gam_bh.get_prediction(X_age_bh)
bounds_age = preds.conf_int(alpha=0.05)
partial_age = preds.predicted_mean
center = partial_age.mean()
partial_age -= center
bounds_age -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(age_grid, partial_age, 'b', linewidth=3)
ax.plot(age_grid, bounds_age[:,0], 'r--', linewidth=3)
ax.plot(age_grid, bounds_age[:,1], 'r--', linewidth=3)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage', fontsize=20);
```
Let's explain in some detail what we did above. The idea is to create a new prediction matrix, where all but the columns belonging to `age` are constant (and set to their training-data means). The four columns for `age` are filled in with the natural spline basis evaluated at the 100 values in `age_grid`.
* We made a grid of length 100 in `age`, and created a matrix `X_age_bh` with 100 rows and the same number of columns as `X_bh`.
* We replaced every row of this matrix with the column means of the original.
* We then replace just the first four columns representing `age` with the natural spline basis computed at the values in `age_grid`.
The remaining steps should by now be familiar.
We also look at the effect of `year` on `wage`; the process is the same.
```{python}
year_grid = np.linspace(Wage['year'].min(),
Wage['year'].max(),
100)
X_year_bh = X_bh.copy()[:100]
X_year_bh[:] = X_bh[:].mean(0)[None,:]
X_year_bh[:,4:9] = ns_year.transform(year_grid)
preds = gam_bh.get_prediction(X_year_bh)
bounds_year = preds.conf_int(alpha=0.05)
partial_year = preds.predicted_mean
center = partial_year.mean()
partial_year -= center
bounds_year -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(year_grid, partial_year, 'b', linewidth=3)
ax.plot(year_grid, bounds_year[:,0], 'r--', linewidth=3)
ax.plot(year_grid, bounds_year[:,1], 'r--', linewidth=3)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```
We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (7.16) are fit simultaneously, taking each other
into account to explain the response. The `pygam` package only works with matrices, so we must convert
the categorical series `education` to its array representation, which can be found
with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we
use only seven basis functions for it.
```{python}
gam_full = LinearGAM(s_gam(0) +
s_gam(1, n_splines=7) +
f_gam(2, lam=0))
Xgam = np.column_stack([age,
Wage['year'],
Wage['education'].cat.codes])
gam_full = gam_full.fit(Xgam, y)
```
The two `s_gam()` terms result in smoothing spline fits, and use a default value for $\lambda$ (`lam=0.6`), which is somewhat arbitrary. For the categorical term `education`, specified using a `f_gam()` term, we specify `lam=0` to avoid any shrinkage.
We produce the partial dependence plot in `age` to see the effect of these choices.
The values for the plot
are generated by the `pygam` package. We provide a `plot_gam()`
function for partial-dependence plots in `ISLP.pygam`, which makes this job easier than in our last example with natural splines.
```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage - default lam=0.6', fontsize=20);
```
We see that the function is somewhat wiggly. It is more natural to specify the `df` than a value for `lam`.
We refit a GAM using four degrees of freedom each for
`age` and `year`. Recall that the addition of one below takes into account the intercept
of the smoothing spline.
```{python}
age_term = gam_full.terms[0]
age_term.lam = approx_lam(Xgam, age_term, df=4+1)
year_term = gam_full.terms[1]
year_term.lam = approx_lam(Xgam, year_term, df=4+1)
gam_full = gam_full.fit(Xgam, y)
```
Note that updating `age_term.lam` above updates it in `gam_full.terms[0]` as well! Likewise for `year_term.lam`.
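This is because `age_term` was taken from `gam_full.terms`, so both names refer to the same term object; a quick check:
```{python}
# Check that the two names refer to the same underlying term object
age_term is gam_full.terms[0]
```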
Repeating the plot for `age`, we see that it is much smoother.
We also produce the plot for `year`.
```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full,
1,
ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20)
```
Finally we plot `education`, which is categorical. The partial dependence plot is different, and more suitable for the set of fitted constants for each level of this variable.
```{python}
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_full, 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```
### ANOVA Tests for Additive Models
In all of our models, the function of `year` looks rather linear. We can
perform a series of ANOVA tests in order to determine which of these
three models is best: a GAM that excludes `year` ($\mathcal{M}_1$),
a GAM that uses a linear function of `year` ($\mathcal{M}_2$), or a
GAM that uses a spline function of `year` ($\mathcal{M}_3$).
```{python}
gam_0 = LinearGAM(age_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear = LinearGAM(age_term +
l_gam(1, lam=0) +
f_gam(2, lam=0))
gam_linear.fit(Xgam, y)
```
Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom.
To directly assess the effect of `year` we run an ANOVA on the
three models fit above.
```{python}
anova_gam(gam_0, gam_linear, gam_full)
```
We find that there is compelling evidence that a GAM with a linear
function in `year` is better than a GAM that does not include
`year` at all ($p$-value= 0.002). However, there is no
evidence that a non-linear function of `year` is needed
($p$-value=0.435). In other words, based on the results of this
ANOVA, $\mathcal{M}_2$ is preferred.
We can repeat the same process for `age` as well. We see there is very clear evidence that
a non-linear term is required for `age`.
```{python}
gam_0 = LinearGAM(year_term +
f_gam(2, lam=0))
gam_linear = LinearGAM(l_gam(0, lam=0) +
year_term +
f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```
There is a (verbose) `summary()` method for the GAM fit.
```{python}
gam_full.summary()
```
We can make predictions from fitted GAM objects, just as with other
fitted models, using the `predict()` method. Here we make predictions
on the training set.
```{python}
Yhat = gam_full.predict(Xgam)
```
In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`.
```{python}
gam_logit = LogisticGAM(age_term +
l_gam(1, lam=0) +
f_gam(2, lam=0))
gam_logit.fit(Xgam, high_earn)
```
```{python}
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit, 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```
The model seems to be very flat, with especially high error bars for the first category.
Let's look at the data a bit more closely.
```{python}
pd.crosstab(Wage['high_earn'], Wage['education'])
```
We see that there are no high earners in the first category of
education, meaning that the model will have a hard time fitting. We
will fit a logistic regression GAM excluding all observations falling into this
category. This provides more sensible results.
To do so,
we could subset the model matrix, though this will not remove the
column from `Xgam`. While we can deduce which column corresponds
to this feature, for reproducibility's sake we reform the model matrix
on this smaller subset.
```{python}
only_hs = Wage['education'] == '1. < HS Grad'
Wage_ = Wage.loc[~only_hs]
Xgam_ = np.column_stack([Wage_['age'],
Wage_['year'],
Wage_['education'].cat.codes-1])
high_earn_ = Wage_['high_earn']
```
In the second-to-last line above, we subtract one from the codes of the category, due to a bug in `pygam`. It just relabels
the education values and hence has no effect on the fit.
We now fit the model.
```{python}
gam_logit_ = LogisticGAM(age_term +
year_term +
f_gam(2, lam=0))
gam_logit_.fit(Xgam_, high_earn_)
```
Let's look at the effect of `education`, `year` and `age` on high earner status now that we've
removed those observations.
```{python}
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on education', fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories[1:],
fontsize=8);
```
```{python}
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 1)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on year',
fontsize=20);
```
```{python}
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 0)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);
```
## Local Regression
We illustrate the use of local regression using the `lowess()`
function from `sm.nonparametric`. Some implementations of
GAMs allow terms to be local regression operators; this is not the case in `pygam`.
Here we fit local linear regression models using spans of 0.2
and 0.5; that is, each neighborhood consists of 20% or 50% of
the observations. As expected, using a span of 0.5 is smoother than 0.2.
```{python}
lowess = sm.nonparametric.lowess
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for span in [0.2, 0.5]:
fitted = lowess(y,
age,
frac=span,
xvals=age_grid)
ax.plot(age_grid,
fitted,
label='{:.1f}'.format(span),
linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='span', fontsize=15);
```
@@ -1,558 +0,0 @@
# Tree-Based Methods
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch08-baggboost-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch08-baggboost-lab.ipynb)
We import some of our usual libraries at this top
level.
```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS
```
We also collect the new imports
needed for this lab.
```{python}
from sklearn.tree import (DecisionTreeClassifier as DTC,
DecisionTreeRegressor as DTR,
plot_tree,
export_text)
from sklearn.metrics import (accuracy_score,
log_loss)
from sklearn.ensemble import \
(RandomForestRegressor as RF,
GradientBoostingRegressor as GBR)
from ISLP.bart import BART
```
## Fitting Classification Trees
We first use classification trees to analyze the `Carseats` data set.
In these data, `Sales` is a continuous variable, and so we begin
by recoding it as a binary variable. We use the `where()`
function to create a variable, called `High`, which takes on a
value of `Yes` if the `Sales` variable exceeds 8, and takes
on a value of `No` otherwise.
```{python}
Carseats = load_data('Carseats')
High = np.where(Carseats.Sales > 8,
"Yes",
"No")
```
We now use `DecisionTreeClassifier()` to fit a classification tree in
order to predict `High` using all variables but `Sales`.
To do so, we must form a model matrix as we did when fitting regression
models.
```{python}
model = MS(Carseats.columns.drop('Sales'), intercept=False)
D = model.fit_transform(Carseats)
feature_names = list(D.columns)
X = np.asarray(D)
```
We have converted `D` from a data frame to an array `X`, which is needed in some of the analysis below. We also need the `feature_names` for annotating our plots later.
There are several options needed to specify the classifier,
such as `max_depth` (how deep to grow the tree), `min_samples_split`
(minimum number of observations in a node to be eligible for splitting)
and `criterion` (whether to use Gini or cross-entropy as the split criterion).
We also set `random_state` for reproducibility; ties in the split criterion are broken at random.
```{python}
clf = DTC(criterion='entropy',
max_depth=3,
random_state=0)
clf.fit(X, High)
```
In our discussion of qualitative features in Section 3.3,
we noted that for a linear regression model such a feature could be
represented by including a matrix of dummy variables (one-hot-encoding) in the model
matrix, using the formula notation of `statsmodels`.
As mentioned in Section 8.1, there is a more
natural way to handle qualitative features when building a decision
tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups.
However,
the `sklearn` implementation of decision trees does not take
advantage of this approach; instead it simply treats the one-hot-encoded levels as separate variables.
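We can see this one-hot encoding in the column names; for example, the following lists the dummy columns created for the qualitative predictor `ShelveLoc`.
```{python}
# Dummy columns that one-hot-encode the qualitative ShelveLoc predictor
[name for name in feature_names if 'ShelveLoc' in name]
```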
```{python}
accuracy_score(High, clf.predict(X))
```
With only the default arguments, the training error rate is
21%.
For classification trees, we can
access the value of the deviance using `log_loss()`,
\begin{equation*}
\begin{split}
-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk},
\end{split}
\end{equation*}
where $n_{mk}$ is the number of observations in the $m$th terminal
node that belong to the $k$th class.
```{python}
resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
resid_dev
```
This is closely related to the *entropy*, defined in (8.7).
A small deviance indicates a
tree that provides a good fit to the (training) data.
One of the most attractive properties of trees is that they can
be graphically displayed. Here we use the `plot()` function
to display the tree structure.
```{python}
ax = subplots(figsize=(12,12))[1]
plot_tree(clf,
feature_names=feature_names,
ax=ax);
```
The most important indicator of `Sales` appears to be `ShelveLoc`.
We can see a text representation of the tree using
`export_text()`, which displays the split
criterion (e.g. `Price <= 92.5`) for each branch.
For leaf nodes it shows the overall prediction
(`Yes` or `No`).
We can also see the number of observations in that
leaf that take on values of `Yes` and `No` by specifying `show_weights=True`.
```{python}
print(export_text(clf,
feature_names=feature_names,
show_weights=True))
```
In order to properly evaluate the performance of a classification tree
on these data, we must estimate the test error rather than simply
computing the training error. We split the observations into a
training set and a test set, build the tree using the training set,
and evaluate its performance on the test data. This pattern is
similar to that in Chapter 6, with the linear models
replaced here by decision trees --- the code for validation
is almost identical. This approach leads to correct predictions
for 68.5% of the locations in the test data set.
```{python}
validation = skm.ShuffleSplit(n_splits=1,
test_size=200,
random_state=0)
results = skm.cross_validate(clf,
D,
High,
cv=validation)
results['test_score']
```
Next, we consider whether pruning the tree might lead to improved
classification performance. We first split the data into a training and
test set. We will use cross-validation to prune the tree on the training
set, and then evaluate the performance of the pruned tree on the test
set.
```{python}
(X_train,
X_test,
High_train,
High_test) = skm.train_test_split(X,
High,
test_size=0.5,
random_state=0)
```
We first refit the full tree on the training set; here we do not set a `max_depth` parameter, since we will learn that through cross-validation.
```{python}
clf = DTC(criterion='entropy', random_state=0)
clf.fit(X_train, High_train)
accuracy_score(High_test, clf.predict(X_test))
```
Next we use the `cost_complexity_pruning_path()` method of
`clf` to extract cost-complexity values.
```{python}
ccp_path = clf.cost_complexity_pruning_path(X_train, High_train)
kfold = skm.KFold(10,
random_state=1,
shuffle=True)
```
This yields a set of impurities and $\alpha$ values
from which we can extract an optimal one by cross-validation.
```{python}
grid = skm.GridSearchCV(clf,
{'ccp_alpha': ccp_path.ccp_alphas},
refit=True,
cv=kfold,
scoring='accuracy')
grid.fit(X_train, High_train)
grid.best_score_
```
Let's take a look at the pruned tree.
```{python}
ax = subplots(figsize=(12, 12))[1]
best_ = grid.best_estimator_
plot_tree(best_,
feature_names=feature_names,
ax=ax);
```
This is quite a bushy tree. We could count the leaves, or query
`best_` instead.
```{python}
best_.tree_.n_leaves
```
The tree with 30 terminal
nodes results in the lowest cross-validation error rate, with an accuracy of
68.5%. How well does this pruned tree perform on the test data set? Once
again, we apply the `predict()` function.
```{python}
print(accuracy_score(High_test,
best_.predict(X_test)))
confusion = confusion_table(best_.predict(X_test),
High_test)
confusion
```
Now 72.0% of the test observations are correctly classified, which is slightly worse than the error for the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at a cost of a slightly worse error. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.
## Fitting Regression Trees
Here we fit a regression tree to the `Boston` data set. The
steps are similar to those for classification trees.
```{python}
Boston = load_data("Boston")
model = MS(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)
```
First, we split the data into training and test sets, and fit the tree
to the training data. Here we use 30% of the data for the test set.
```{python}
(X_train,
X_test,
y_train,
y_test) = skm.train_test_split(X,
Boston['medv'],
test_size=0.3,
random_state=0)
```
Having formed our training and test data sets, we fit the regression tree.
```{python}
reg = DTR(max_depth=3)
reg.fit(X_train, y_train)
ax = subplots(figsize=(12,12))[1]
plot_tree(reg,
feature_names=feature_names,
ax=ax);
```
The variable `lstat` measures the percentage of individuals with
lower socioeconomic status. The tree indicates that lower
values of `lstat` correspond to more expensive houses.
The tree predicts a median house price of $12,042 for small-sized homes (`rm < 6.8`), in
suburbs in which residents have low socioeconomic status (`lstat > 14.4`) and the crime-rate is moderate (`crim > 5.8`).
Now we use the cross-validation function to see whether pruning
the tree will improve performance.
```{python}
ccp_path = reg.cost_complexity_pruning_path(X_train, y_train)
kfold = skm.KFold(5,
shuffle=True,
random_state=10)
grid = skm.GridSearchCV(reg,
{'ccp_alpha': ccp_path.ccp_alphas},
refit=True,
cv=kfold,
scoring='neg_mean_squared_error')
G = grid.fit(X_train, y_train)
```
In keeping with the cross-validation results, we use the pruned tree
to make predictions on the test set.
```{python}
best_ = grid.best_estimator_
np.mean((y_test - best_.predict(X_test))**2)
```
In other words, the test set MSE associated with the regression tree
is 28.07. The square root of
the MSE is therefore around
5.30,
indicating that this model leads to test predictions that are within around
$5300
of the true median home value for the suburb.
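The square root quoted above can be checked directly from the quantities already in hand:
```{python}
# Root mean squared error of the pruned tree on the test set
np.sqrt(np.mean((y_test - best_.predict(X_test))**2))
```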
Let's plot the best tree to see how interpretable it is.
```{python}
ax = subplots(figsize=(12,12))[1]
plot_tree(G.best_estimator_,
feature_names=feature_names,
ax=ax);
```
## Bagging and Random Forests
Here we apply bagging and random forests to the `Boston` data, using
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
that bagging is simply a special case of a random forest with
$m=p$. Therefore, the `RandomForestRegressor()` function can be used to
perform both bagging and random forests. We start with bagging.
```{python}
bag_boston = RF(max_features=X_train.shape[1], random_state=0)
bag_boston.fit(X_train, y_train)
```
The argument `max_features` indicates that all 12 predictors should
be considered for each split of the tree --- in other words, that
bagging should be done. How well does this bagged model perform on
the test set?
```{python}
ax = subplots(figsize=(8,8))[1]
y_hat_bag = bag_boston.predict(X_test)
ax.scatter(y_hat_bag, y_test)
np.mean((y_test - y_hat_bag)**2)
```
The test set MSE associated with the bagged regression tree is
14.63, about half that obtained using an optimally-pruned single
tree. We could change the number of trees grown from the default of
100 by
using the `n_estimators` argument:
```{python}
bag_boston = RF(max_features=X_train.shape[1],
n_estimators=500,
random_state=0).fit(X_train, y_train)
y_hat_bag = bag_boston.predict(X_test)
np.mean((y_test - y_hat_bag)**2)
```
There is not much change. Bagging and random forests cannot overfit by
increasing the number of trees, but can underfit if the number is too small.
Growing a random forest proceeds in exactly the same way, except that
we use a smaller value of the `max_features` argument. By default,
`RandomForestRegressor()` uses $p$ variables when building a random
forest of regression trees (i.e. it defaults to bagging), and `RandomForestClassifier()` uses
$\sqrt{p}$ variables when building a
random forest of classification trees. Here we use `max_features=6`.
```{python}
RF_boston = RF(max_features=6,
random_state=0).fit(X_train, y_train)
y_hat_RF = RF_boston.predict(X_test)
np.mean((y_test - y_hat_RF)**2)
```
The test set MSE is 20.04;
this indicates that random forests did somewhat worse than bagging
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
importance of each variable.
```{python}
feature_imp = pd.DataFrame(
{'importance':RF_boston.feature_importances_},
index=feature_names)
feature_imp.sort_values(by='importance', ascending=False)
```
This
is a relative measure of the total decrease in node impurity that results from
splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data).
The results indicate that across all of the trees considered in the
random forest, the wealth level of the community (`lstat`) and the
house size (`rm`) are by far the two most important variables.
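These importances can also be drawn as a bar chart, in the spirit of the plot in Figure 8.9; a small sketch using the `feature_imp` data frame just created:
```{python}
# Horizontal bar plot of the relative variable importances
ax = subplots(figsize=(8,8))[1]
feature_imp.sort_values(by='importance').plot.barh(ax=ax, legend=False)
ax.set_xlabel('Relative importance');
```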
## Boosting
Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
to fit boosted regression trees to the `Boston` data
set. For classification we would use `GradientBoostingClassifier()`.
The argument `n_estimators=5000`
indicates that we want 5000 trees, and the option
`max_depth=3` limits the depth of each tree. The
argument `learning_rate` is the $\lambda$
mentioned earlier in the description of boosting.
```{python}
boost_boston = GBR(n_estimators=5000,
learning_rate=0.001,
max_depth=3,
random_state=0)
boost_boston.fit(X_train, y_train)
```
We can see how the training error decreases with the `train_score_` attribute.
To get an idea of how the test error decreases we can use the
`staged_predict()` method to get the predicted values along the path.
```{python}
test_error = np.zeros_like(boost_boston.train_score_)
for idx, y_ in enumerate(boost_boston.staged_predict(X_test)):
test_error[idx] = np.mean((y_test - y_)**2)
plot_idx = np.arange(boost_boston.train_score_.shape[0])
ax = subplots(figsize=(8,8))[1]
ax.plot(plot_idx,
boost_boston.train_score_,
'b',
label='Training')
ax.plot(plot_idx,
test_error,
'r',
label='Test')
ax.legend();
```
We now use the boosted model to predict `medv` on the test set:
```{python}
y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)
```
The test MSE obtained is 14.48,
similar to the test MSE for bagging. If we want to, we can
perform boosting with a different value of the shrinkage parameter
$\lambda$ in (8.10). The default value is 0.001, but
this is easily modified. Here we take $\lambda=0.2$.
```{python}
boost_boston = GBR(n_estimators=5000,
learning_rate=0.2,
max_depth=3,
random_state=0)
boost_boston.fit(X_train,
y_train)
y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)
```
In this case, using $\lambda=0.2$ leads to almost the same test MSE
as when using $\lambda=0.001$.
## Bayesian Additive Regression Trees
In this section we demonstrate a `Python` implementation of BART found in the
`ISLP.bart` package. We fit a model
to the `Boston` housing data set. This `BART()` estimator is
designed for quantitative outcome variables, though other implementations are available for
fitting logistic and probit models to categorical outcomes.
```{python}
bart_boston = BART(random_state=0, burnin=5, ndraw=15)
bart_boston.fit(X_train, y_train)
```
On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest.
```{python}
yhat_test = bart_boston.predict(X_test.astype(np.float32))
np.mean((y_test - yhat_test)**2)
```
We can check how many times each variable appeared in the collection of trees.
This gives a summary similar to the variable importance plot for boosting and random forests.
```{python}
var_inclusion = pd.Series(bart_boston.variable_inclusion_.mean(0),
index=D.columns)
var_inclusion
```
# Support Vector Machines
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch09-svm-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch09-svm-lab.ipynb)
In this lab, we use the `sklearn.svm` library to demonstrate the support
vector classifier and the support vector machine.
We import some of our usual libraries.
```{python}
import numpy as np
from matplotlib.pyplot import subplots, cm
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
```
We also collect the new imports
needed for this lab.
```{python}
from sklearn.svm import SVC
from ISLP.svm import plot as plot_svm
from sklearn.metrics import RocCurveDisplay
```
We will use the function `RocCurveDisplay.from_estimator()` to
produce several ROC plots, using a shorthand `roc_curve`.
```{python}
roc_curve = RocCurveDisplay.from_estimator # shorthand
```
## Support Vector Classifier
We now use the `SupportVectorClassifier()` function (abbreviated `SVC()`) from `sklearn` to fit the support vector
classifier for a given value of the parameter `C`. The
`C` argument allows us to specify the cost of a violation to
the margin. When the `C` argument is small, then the margins
will be wide and many support vectors will be on the margin or will
violate the margin. When the `C` argument is large, then the
margins will be narrow and there will be few support vectors on the
margin or violating the margin.
Here we demonstrate
the use of `SVC()` on a two-dimensional example, so that we can
plot the resulting decision boundary. We begin by generating the
observations, which belong to two classes, and checking whether the
classes are linearly separable.
```{python}
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.array([-1]*25+[1]*25)
X[y==1] += 1
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
X[:,1],
c=y,
cmap=cm.coolwarm);
```
They are not. We now fit the classifier.
```{python}
svm_linear = SVC(C=10, kernel='linear')
svm_linear.fit(X, y)
```
The support vector classifier with two features can
be visualized by plotting values of its *decision function*.
We have included a function for this in the `ISLP` package (inspired by a similar
example in the `sklearn` docs).
```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
y,
svm_linear,
ax=ax)
```
The decision
boundary between the two classes is linear (because we used the
argument `kernel='linear'`). The support vectors are marked with `+`
and the remaining observations are plotted as circles.
What if we instead used a smaller value of the cost parameter?
```{python}
svm_linear_small = SVC(C=0.1, kernel='linear')
svm_linear_small.fit(X, y)
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
y,
svm_linear_small,
ax=ax)
```
With a smaller value of the cost parameter, we
obtain a larger number of support vectors, because the margin is now
wider. For linear kernels, we can extract the
coefficients of the linear decision boundary as follows:
```{python}
svm_linear.coef_
```
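The fitted intercept is stored in the `intercept_` attribute; together with `coef_`, it defines the linear decision boundary $\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 = 0$.
```{python}
svm_linear.intercept_
```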
Since the support vector machine is an estimator in `sklearn`, we
can use the usual machinery to tune it.
```{python}
kfold = skm.KFold(5,
random_state=0,
shuffle=True)
grid = skm.GridSearchCV(svm_linear,
{'C':[0.001,0.01,0.1,1,5,10,100]},
refit=True,
cv=kfold,
scoring='accuracy')
grid.fit(X, y)
grid.best_params_
```
We can easily access the cross-validation errors for each of these models
in `grid.cv_results_`. This prints out a lot of detail, so we
extract the accuracy results only.
```{python}
grid.cv_results_['mean_test_score']
```
We see that `C=1` results in the highest cross-validation
accuracy of 0.74, though
the accuracy is the same for several values of `C`.
The classifier `grid.best_estimator_` can be used to predict the class
label on a set of test observations. Let's generate a test data set.
```{python}
X_test = rng.standard_normal((20, 2))
y_test = np.array([-1]*10+[1]*10)
X_test[y_test==1] += 1
```
Now we predict the class labels of these test observations. Here we
use the best model selected by cross-validation in order to make the
predictions.
```{python}
best_ = grid.best_estimator_
y_test_hat = best_.predict(X_test)
confusion_table(y_test_hat, y_test)
```
Thus, with this value of `C`,
70% of the test
observations are correctly classified. What if we had instead used
`C=0.001`?
```{python}
svm_ = SVC(C=0.001,
kernel='linear').fit(X, y)
y_test_hat = svm_.predict(X_test)
confusion_table(y_test_hat, y_test)
```
In this case 60% of test observations are correctly classified.
We now consider a situation in which the two classes are linearly
separable. Then we can find an optimal separating hyperplane using the
`SVC()` estimator. We first
further separate the two classes in our simulated data so that they
are linearly separable:
```{python}
X[y==1] += 1.9;
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
```
Now the observations are just barely linearly separable.
```{python}
svm_ = SVC(C=1e5, kernel='linear').fit(X, y)
y_hat = svm_.predict(X)
confusion_table(y_hat, y)
```
We fit the
support vector classifier and plot the resulting hyperplane, using a
very large value of `C` so that no observations are
misclassified.
```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
y,
svm_,
ax=ax)
```
Indeed no training errors were made and only three support vectors were used.
In fact, the large value of `C` also means that these three support points are *on the margin*, and define it.
One may wonder how good the classifier could be on test data that depends on only three data points!
We now try a smaller
value of `C`.
```{python}
svm_ = SVC(C=0.1, kernel='linear').fit(X, y)
y_hat = svm_.predict(X)
confusion_table(y_hat, y)
```
Using `C=0.1`, we again do not misclassify any training observations, but we
also obtain a much wider margin and make use of twelve support
vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test
data than the model with `C=1e5` (and indeed, a simple experiment with a large test set would bear this out).
```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
y,
svm_,
ax=ax)
```
## Support Vector Machine
In order to fit an SVM using a non-linear kernel, we once again use
the `SVC()` estimator. However, now we use a different value
of the parameter `kernel`. To fit an SVM with a polynomial
kernel we use `kernel="poly"`, and to fit an SVM with a
radial kernel we use
`kernel="rbf"`. In the former case we also use the
`degree` argument to specify a degree for the polynomial kernel
(this is $d$ in (9.22)), and in the latter case we use
`gamma` to specify a value of $\gamma$ for the radial basis
kernel (9.24).
We first generate some data with a non-linear class boundary, as follows:
```{python}
X = rng.standard_normal((200, 2))
X[:100] += 2
X[100:150] -= 2
y = np.array([1]*150+[2]*50)
```
Plotting the data makes it clear that the class boundary is indeed non-linear.
```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
X[:,1],
c=y,
cmap=cm.coolwarm);
```
The data is randomly split into training and testing groups. We then
fit the training data using the `SVC()` estimator with a
radial kernel and $\gamma=1$:
```{python}
(X_train,
X_test,
y_train,
y_test) = skm.train_test_split(X,
y,
test_size=0.5,
random_state=0)
svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
svm_rbf.fit(X_train, y_train)
```
The plot shows that the resulting SVM has a decidedly non-linear
boundary.
```{python}
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
y_train,
svm_rbf,
ax=ax)
```
We can see from the figure that there are a fair number of training
errors in this SVM fit. If we increase the value of `C`, we
can reduce the number of training errors. However, this comes at the
price of a more irregular decision boundary that seems to be at risk
of overfitting the data.
```{python}
svm_rbf = SVC(kernel="rbf", gamma=1, C=1e5)
svm_rbf.fit(X_train, y_train)
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
y_train,
svm_rbf,
ax=ax)
```
We can perform cross-validation using `skm.GridSearchCV()` to select the
best choice of $\gamma$ and `C` for an SVM with a radial
kernel:
```{python}
kfold = skm.KFold(5,
random_state=0,
shuffle=True)
grid = skm.GridSearchCV(svm_rbf,
{'C':[0.1,1,10,100,1000],
'gamma':[0.5,1,2,3,4]},
refit=True,
cv=kfold,
scoring='accuracy');
grid.fit(X_train, y_train)
grid.best_params_
```
The best choice of parameters under five-fold CV is achieved at `C=1`
and `gamma=0.5`, though several other combinations achieve the same cross-validation accuracy.
```{python}
best_svm = grid.best_estimator_
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
y_train,
best_svm,
ax=ax)
y_hat_test = best_svm.predict(X_test)
confusion_table(y_hat_test, y_test)
```
With these parameters, 12% of test
observations are misclassified by this SVM.
## ROC Curves
SVMs and support vector classifiers output class labels for each
observation. However, it is also possible to obtain *fitted values*
for each observation, which are the numerical scores used to
obtain the class labels. For instance, in the case of a support vector
classifier, the fitted value for an observation $X= (X_1, X_2, \ldots,
X_p)^T$ takes the form $\hat{\beta}_0 + \hat{\beta}_1 X_1 +
\hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$. For an SVM with a
non-linear kernel, the equation that yields the fitted value is given
in (9.23). The sign of the fitted value
determines on which side of the decision boundary the observation
lies. Therefore, the relationship between the fitted value and the
class prediction for a given observation is simple: if the fitted
value exceeds zero then the observation is assigned to one class, and
if it is less than zero then it is assigned to the other.
By changing this threshold from zero to some positive value,
we skew the classifications in favor of one class versus the other.
By considering a range of these thresholds, positive and negative, we produce the ingredients for a ROC plot.
We can access these values by calling the `decision_function()`
method of a fitted SVM estimator.
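For instance, here is a minimal sketch (the shifted threshold of 0.5 is arbitrary) using the fitted values of our tuned SVM `best_svm` on the training data; positive fitted values correspond to class 2 in this example.
```{python}
# Sketch: fitted values from decision_function(), and the number of training
# observations whose label changes when the threshold moves from 0 to 0.5.
fitted = best_svm.decision_function(X_train)
labels_zero = np.where(fitted > 0, 2, 1)
labels_half = np.where(fitted > 0.5, 2, 1)
(labels_zero != labels_half).sum()
```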
The function `RocCurveDisplay.from_estimator()` (which we have abbreviated to `roc_curve()`) will produce a plot of a ROC curve. It takes a fitted estimator as its first argument, followed
by a model matrix $X$ and labels $y$. The argument `name` is used in the legend,
while `color` is used for the color of the line. Results are plotted
on our axis object `ax`.
```{python}
fig, ax = subplots(figsize=(8,8))
roc_curve(best_svm,
X_train,
y_train,
name='Training',
color='r',
ax=ax);
```
In this example, the SVM appears to provide accurate predictions. By increasing
$\gamma$ we can produce a more flexible fit and generate further
improvements in accuracy.
```{python}
svm_flex = SVC(kernel="rbf",
gamma=50,
C=1)
svm_flex.fit(X_train, y_train)
fig, ax = subplots(figsize=(8,8))
roc_curve(svm_flex,
X_train,
y_train,
          name=r'Training $\gamma=50$',
color='r',
ax=ax);
```
However, these ROC curves are all on the training data. We are really
more interested in the level of prediction accuracy on the test
data. When we compute the ROC curves on the test data, the model with
$\gamma=0.5$ appears to provide the most accurate results.
```{python}
roc_curve(svm_flex,
X_test,
y_test,
          name=r'Test $\gamma=50$',
color='b',
ax=ax)
fig;
```
Let's look at our tuned SVM.
```{python}
fig, ax = subplots(figsize=(8,8))
for (X_, y_, c, name) in zip(
(X_train, X_test),
(y_train, y_test),
('r', 'b'),
('CV tuned on training',
'CV tuned on test')):
roc_curve(best_svm,
X_,
y_,
name=name,
ax=ax,
color=c)
```
## SVM with Multiple Classes
If the response is a factor containing more than two levels, then the
`SVC()` function will perform multi-class classification using
either the one-versus-one approach (when `decision_function_shape='ovo'`)
or one-versus-rest {One-versus-rest is also known as one-versus-all.} (when `decision_function_shape='ovr'`).
We explore that setting briefly here by
generating a third class of observations.
```{python}
rng = np.random.default_rng(123)
X = np.vstack([X, rng.standard_normal((50, 2))])
y = np.hstack([y, [0]*50])
X[y==0,1] += 2
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm);
```
We now fit an SVM to the data:
```{python}
svm_rbf_3 = SVC(kernel="rbf",
C=10,
gamma=1,
decision_function_shape='ovo');
svm_rbf_3.fit(X, y)
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
y,
svm_rbf_3,
scatter_cmap=cm.tab10,
ax=ax)
```
The `sklearn.svm` library can also be used to perform support vector
regression with a numerical response using the estimator `SupportVectorRegression()` (abbreviated `SVR()`).
## Application to Gene Expression Data
We now examine the `Khan` data set, which consists of a number of
tissue samples corresponding to four distinct types of small round
blue cell tumors. For each tissue sample, gene expression measurements
are available. The data set consists of training data, `xtrain`
and `ytrain`, and testing data, `xtest` and `ytest`.
We examine the dimension of the data:
```{python}
Khan = load_data('Khan')
Khan['xtrain'].shape, Khan['xtest'].shape
```
This data set consists of expression measurements for 2,308
genes. The training and test sets consist of 63 and 20
observations, respectively.
We will use a support vector approach to predict cancer subtype using
gene expression measurements. In this data set, there is a very
large number of features relative to the number of observations. This
suggests that we should use a linear kernel, because the additional
flexibility that will result from using a polynomial or radial kernel
is unnecessary.
```{python}
khan_linear = SVC(kernel='linear', C=10)
khan_linear.fit(Khan['xtrain'], Khan['ytrain'])
confusion_table(khan_linear.predict(Khan['xtrain']),
Khan['ytrain'])
```
We see that there are *no* training
errors. In fact, this is not surprising, because the large number of
variables relative to the number of observations implies that it is
easy to find hyperplanes that fully separate the classes. We are more
interested in the support vector classifier's performance on the
test observations.
```{python}
confusion_table(khan_linear.predict(Khan['xtest']),
Khan['ytest'])
```
We see that using `C=10` yields two test set errors on these data.
# Survival Analysis
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch11-surv-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch11-surv-lab.ipynb)
In this lab, we perform survival analyses on three separate data
sets. In Section 11.8.1 we analyze the `BrainCancer`
data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
data from Section 11.5.4. Finally, Section 11.8.3 explores
a simulated call-center data set.
We begin by importing some of our libraries at this top
level. This makes the code more readable, as scanning the first few
lines of the notebook tells us what libraries are used in this
notebook.
```{python}
from matplotlib.pyplot import subplots
import numpy as np
import pandas as pd
from ISLP.models import ModelSpec as MS
from ISLP import load_data
```
We also collect the new imports
needed for this lab.
```{python}
from lifelines import \
(KaplanMeierFitter,
CoxPHFitter)
from lifelines.statistics import \
(logrank_test,
multivariate_logrank_test)
from ISLP.survival import sim_time
```
## Brain Cancer Data
We begin with the `BrainCancer` data set, contained in the `ISLP` package.
```{python}
BrainCancer = load_data('BrainCancer')
BrainCancer.columns
```
The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data.
```{python}
BrainCancer['sex'].value_counts()
```
```{python}
BrainCancer['diagnosis'].value_counts()
```
```{python}
BrainCancer['status'].value_counts()
```
Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an
uncensored observation (often death), and a `status` of 0 indicates a censored
observation. But some scientists might use the opposite coding. For
the `BrainCancer` data set 35 patients died before the end of
the study, so we are using the conventional coding.
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main
package we will use for survival analysis
is `lifelines`.
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
death). The first argument to `km.fit` is the event time, and the
second argument is the censoring variable, with a 1 indicating an observed
failure time. The `plot()` method produces a survival curve with pointwise confidence
intervals. By default, these are 90% confidence intervals, but this can be changed
by setting the `alpha` argument to one minus the desired
confidence level.
```{python}
fig, ax = subplots(figsize=(8,8))
km = KaplanMeierFitter()
km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
```
Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure 11.3.
We do this using the `groupby()` method of a dataframe.
This method returns a generator that can
be iterated over in the `for` loop. In this case,
the items in the `for` loop are 2-tuples representing
the groups: the first entry is the value
of the grouping column `sex` while the second value
is the dataframe consisting of all rows in the
dataframe matching that value of `sex`.
We will want to use this data below
in the log-rank test, hence we store this
information in the dictionary `by_sex`. Finally,
we have also used the notion of
*string interpolation* to automatically
label the different lines in the plot. String
interpolation is a powerful technique to format strings ---
`Python` has many ways to facilitate such operations.
```{python}
fig, ax = subplots(figsize=(8,8))
by_sex = {}
for sex, df in BrainCancer.groupby('sex'):
by_sex[sex] = df
km_sex = km.fit(df['time'], df['status'])
km_sex.plot(label='Sex=%s' % sex, ax=ax)
```
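For reference, the `'Sex=%s' % sex` labels above use `%`-style interpolation; an f-string or `str.format()` (shown here with an illustrative value) produces the same text.
```{python}
sex = 'Female'  # an illustrative value
('Sex=%s' % sex, 'Sex={}'.format(sex), f'Sex={sex}')
```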
As discussed in Section 11.4, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
The first two arguments are the event times, with the second
denoting the corresponding (optional) censoring indicators.
```{python}
logrank_test(by_sex['Male']['time'],
by_sex['Female']['time'],
by_sex['Male']['status'],
by_sex['Female']['status'])
```
The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes.
Next, we use the `CoxPHFitter()` estimator
from `lifelines` to fit Cox proportional hazards models.
To begin, we consider a model that uses `sex` as the only predictor.
```{python}
coxph = CoxPHFitter # shorthand
sex_df = BrainCancer[['time', 'status', 'sex']]
model_df = MS(['time', 'status', 'sex'],
intercept=False).fit_transform(sex_df)
cox_fit = coxph().fit(model_df,
'time',
'status')
cox_fit.summary[['coef', 'se(coef)', 'p']]
```
The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case).
Note also that the Cox model does not include an intercept, which is why
we used the `intercept=False` argument to `ModelSpec` above.
The `summary()` method delivers many columns; we chose to abbreviate its output here.
It is possible to obtain the likelihood ratio test comparing this model to the one
with no features as follows:
```{python}
cox_fit.log_likelihood_ratio_test()
```
Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is
exactly equal to the log rank test statistic!
Now we fit a model that makes use of additional predictors. We first note
that one of our `diagnosis` values is missing, hence
we drop that observation before continuing.
```{python}
cleaned = BrainCancer.dropna()
all_MS = MS(cleaned.columns, intercept=False)
all_df = all_MS.fit_transform(cleaned)
fit_all = coxph().fit(all_df,
'time',
'status')
fit_all.summary[['coef', 'se(coef)', 'p']]
```
The `diagnosis` variable has been coded so that the baseline
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
with meningioma. In other words, after adjusting for the other
predictors, patients with HG glioma have much worse survival compared
to those with meningioma. In addition, larger values of the Karnofsky
index, `ki`, are associated with lower risk, i.e. longer survival.
Finally, we plot estimated survival curves for each diagnosis category,
adjusting for the other predictors. To make these plots, we set the
values of the other predictors equal to the mean for quantitative variables
and equal to the mode for categorical. To do this, we use the
`apply()` method across rows (i.e. `axis=0`) with a function
`representative` that checks if a column is categorical
or not.
```{python}
levels = cleaned['diagnosis'].unique()
def representative(series):
if hasattr(series.dtype, 'categories'):
return pd.Series.mode(series)
else:
return series.mean()
modal_data = cleaned.apply(representative, axis=0)
```
We make four
copies of the column means and assign the `diagnosis` column to be the four different
diagnoses.
```{python}
modal_df = pd.DataFrame(
[modal_data.iloc[0] for _ in range(len(levels))])
modal_df['diagnosis'] = levels
modal_df
```
We then construct the model matrix based on the model specification `all_MS` used to fit
the model, and name the rows according to the levels of `diagnosis`.
```{python}
modal_X = all_MS.transform(modal_df)
modal_X.index = levels
modal_X
```
We can use the `predict_survival_function()` method to obtain the estimated survival function.
```{python}
predicted_survival = fit_all.predict_survival_function(modal_X)
predicted_survival
```
This returns a data frame,
whose `plot()` method yields the different survival curves. To avoid clutter in
the plots, we do not display confidence intervals.
```{python}
fig, ax = subplots(figsize=(8, 8))
predicted_survival.plot(ax=ax);
```
## Publication Data
The `Publication` data presented in Section 11.5.4 can be
found in the `ISLP` package.
We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves
stratified on the `posres` variable, which records whether the
study had a positive or negative result.
```{python}
fig, ax = subplots(figsize=(8,8))
Publication = load_data('Publication')
by_result = {}
for result, df in Publication.groupby('posres'):
by_result[result] = df
km_result = km.fit(df['time'], df['status'])
km_result.plot(label='Result=%d' % result, ax=ax)
```
As discussed previously, the $p$-values from fitting Cox's
proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication
between studies with positive versus negative results.
```{python}
posres_df = MS(['posres',
'time',
'status'],
intercept=False).fit_transform(Publication)
posres_fit = coxph().fit(posres_df,
'time',
'status')
posres_fit.summary[['coef', 'se(coef)', 'p']]
```
However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism
variable.
```{python}
model = MS(Publication.columns.drop('mech'),
intercept=False)
coxph().fit(model.fit_transform(Publication),
'time',
'status').summary[['coef', 'se(coef)', 'p']]
```
We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results.
## Call Center Data
In this section, we will simulate survival data using the relationship
between cumulative hazard and
the survival function explored in Exercise 8.
Our simulated data will represent the observed
wait times (in seconds) for 2,000 customers who have phoned a call
center. In this context, censoring occurs if a customer hangs up
before his or her call is answered.
There are three covariates: `Operators` (the number of call
center operators available at the time of the call, which can range
from $5$ to $15$), `Center` (either A, B, or C), and
`Time` of day (Morning, Afternoon, or Evening). We generate data
for these covariates so that all possibilities are equally likely: for
instance, morning, afternoon and evening calls are equally likely, and
any number of operators from $5$ to $15$ is equally likely.
```{python}
rng = np.random.default_rng(10)
N = 2000
Operators = rng.choice(np.arange(5, 16),
N,
replace=True)
Center = rng.choice(['A', 'B', 'C'],
N,
replace=True)
Time = rng.choice(['Morn.', 'After.', 'Even.'],
N,
replace=True)
D = pd.DataFrame({'Operators': Operators,
'Center': pd.Categorical(Center),
'Time': pd.Categorical(Time)})
```
We then build a model matrix (omitting the intercept):
```{python}
model = MS(['Operators',
'Center',
'Time'],
intercept=False)
X = model.fit_transform(D)
```
It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
of the variable is dropped.
```{python}
X[:5]
```
Next, we specify the coefficients and the hazard function.
```{python}
true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2])
true_linpred = X.dot(true_beta)
hazard = lambda t: 1e-5 * t
```
Here, we have set the coefficient associated with `Operators` to
equal $0.04$; in other words, each additional operator leads to a
$e^{0.04}=1.041$-fold increase in the “risk” that the call will be
answered, given the `Center` and `Time` covariates. This
makes sense: the greater the number of operators at hand, the shorter
the wait time! The coefficient associated with `Center == B` is
$-0.3$, and `Center == A` is treated as the baseline. This means
that the risk of a call being answered at Center B is 0.74 times the
risk that it will be answered at Center A; in other words, the wait
times are a bit longer at Center B.
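These risk ratios can be computed directly:
```{python}
np.exp(0.04), np.exp(-0.3)
```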
Recall from Section 2.3.7 the use of `lambda`
for creating short functions on the fly.
We use the function
`sim_time()` from the `ISLP.survival` package. This function
uses the relationship between the survival function
and cumulative hazard $S(t) = \exp(-H(t))$ and the specific
form of the cumulative hazard function in the Cox model
to simulate data based on values of the linear predictor
`true_linpred` and the cumulative hazard.
We need to provide the cumulative hazard function, which we do here.
```{python}
cum_hazard = lambda t: 1e-5 * t**2 / 2
```
We are now ready to generate data under the Cox proportional hazards
model. We truncate the maximum time to 1000 seconds to keep
simulated wait times reasonable. The function
`sim_time()` takes a linear predictor,
a cumulative hazard function and a
random number generator.
```{python}
W = np.array([sim_time(l, cum_hazard, rng)
for l in true_linpred])
D['Wait time'] = np.clip(W, 0, 1000)
```
We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`).
```{python}
D['Failed'] = rng.choice([1, 0],
N,
p=[0.9, 0.1])
D[:5]
```
```{python}
D['Failed'].mean()
```
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.
```{python}
fig, ax = subplots(figsize=(8,8))
by_center = {}
for center, df in D.groupby('Center'):
by_center[center] = df
km_center = km.fit(df['Wait time'], df['Failed'])
km_center.plot(label='Center=%s' % center, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
Next, we stratify by `Time`.
```{python}
fig, ax = subplots(figsize=(8,8))
by_time = {}
for time, df in D.groupby('Time'):
by_time[time] = df
km_time = km.fit(df['Wait time'], df['Failed'])
km_time.plot(label='Time=%s' % time, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a
log-rank test to determine whether these differences are statistically
significant using the function `multivariate_logrank_test()`.
```{python}
multivariate_logrank_test(D['Wait time'],
D['Center'],
D['Failed'])
```
Next, we consider the effect of `Time`.
```{python}
multivariate_logrank_test(D['Wait time'],
D['Time'],
D['Failed'])
```
As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we
look at the results for `Center`.
```{python}
X = MS(['Wait time',
'Failed',
'Center'],
intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
Next, we look at the results for `Time`.
```{python}
X = MS(['Wait time',
'Failed',
'Time'],
intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
We find that differences between centers are highly significant, as
are differences between times of day.
Finally, we fit Cox's proportional hazards model to the data.
```{python}
X = MS(D.columns,
intercept=False).fit_transform(D)
fit_queuing = coxph().fit(
X,
'Wait time',
'Failed')
fit_queuing.summary[['coef', 'se(coef)', 'p']]
```
The $p$-values for Center B and evening time
are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be
answered --- increases with the number of operators. Since we
generated the data ourselves, we know that the true coefficients for
`Operators`, `Center = B`, `Center = C`,
`Time = Even.` and `Time = Morn.` are $0.04$, $-0.3$,
$0$, $0.2$, and $-0.2$, respectively. The coefficient estimates
from the fitted Cox model are fairly accurate.
# Unsupervised Learning
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch12-unsup-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch12-unsup-lab.ipynb)
In this lab we demonstrate PCA and clustering on several datasets.
As in other labs, we import some of our libraries at this top
level. This makes the code more readable, as scanning the first few
lines of the notebook tells us what libraries are used in this
notebook.
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from ISLP import load_data
```
We also collect the new imports
needed for this lab.
```{python}
from sklearn.cluster import \
(KMeans,
AgglomerativeClustering)
from scipy.cluster.hierarchy import \
(dendrogram,
cut_tree)
from ISLP.cluster import compute_linkage
```
## Principal Components Analysis
In this lab, we perform PCA on `USArrests`, a data set in the
`R` computing environment.
We retrieve the data using `get_rdataset()`, which can fetch data from
many standard `R` packages.
The rows of the data set contain the 50 states, in alphabetical order.
```{python}
USArrests = get_rdataset('USArrests').data
USArrests
```
The columns of the data set contain the four variables.
```{python}
USArrests.columns
```
We first briefly examine the data. We notice that the variables have vastly different means.
```{python}
USArrests.mean()
```
Dataframes have several useful methods for computing
column-wise summaries. We can also examine the
variance of the four variables using the `var()` method.
```{python}
USArrests.var()
```
Not surprisingly, the variables also have vastly different variances.
The `UrbanPop` variable measures the percentage of the population
in each state living in an urban area, which is not a comparable
number to the number of rapes in each state per 100,000 individuals.
PCA looks for derived variables that account for most of the variance in the data set.
If we do not scale the variables before performing PCA, then the principal components
would mostly be driven by the
`Assault` variable, since it has by far the largest
variance. So if the variables are measured in different units or vary widely in scale, it is recommended to standardize the variables to have standard deviation one before performing PCA.
Typically we set the means to zero as well.
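To see why scaling matters here, a quick sketch: the first principal component loading vector of the raw (unscaled) data is dominated by `Assault`, the variable with the largest variance.
```{python}
# Sketch: first principal component loadings of the unscaled data.
pd.Series(PCA().fit(USArrests).components_[0],
          index=USArrests.columns)
```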
This scaling can be done via the `StandardScaler()` transform imported above. We first `fit` the
scaler, which computes the necessary means and standard
deviations and then apply it to our data using the
`transform` method. As before, we combine these steps using the `fit_transform()` method.
```{python}
scaler = StandardScaler(with_std=True,
with_mean=True)
USArrests_scaled = scaler.fit_transform(USArrests)
```
Having scaled the data, we can then
perform principal components analysis using the `PCA()` transform
from the `sklearn.decomposition` package.
```{python}
pcaUS = PCA()
```
(By default, the `PCA()` transform centers the variables to have
mean zero though it does not scale them.) The transform `pcaUS`
can be used to compute the PCA scores via its `transform()` method once it has been fit. After the `fit()` method has been called, the `pcaUS` object also contains a number of useful quantities.
```{python}
pcaUS.fit(USArrests_scaled)
```
After fitting, the `mean_` attribute corresponds to the means
of the variables. In this case, since we centered and scaled the data with
`scaler()` the means will all be 0.
```{python}
pcaUS.mean_
```
The scores can be computed using the `transform()` method
of `pcaUS` after it has been fit.
```{python}
scores = pcaUS.transform(USArrests_scaled)
```
We will plot these scores a bit further down.
The `components_` attribute provides the principal component loadings:
each row of `pcaUS.components_` contains the corresponding
principal component loading vector.
```{python}
pcaUS.components_
```
The `biplot` is a common visualization method used with
PCA. It is not built in as a standard
part of `sklearn`, though there are python
packages that do produce such plots. Here we
make a simple biplot manually.
```{python}
i, j = 0, 1 # which components
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(scores[:,0], scores[:,1])
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
for k in range(pcaUS.components_.shape[1]):
ax.arrow(0, 0, pcaUS.components_[i,k], pcaUS.components_[j,k])
ax.text(pcaUS.components_[i,k],
pcaUS.components_[j,k],
USArrests.columns[k])
```
Notice that this figure is a reflection of Figure 12.1 through the $y$-axis. Recall that the
principal components are only unique up to a sign change, so we can
reproduce that figure by flipping the
signs of the second set of scores and loadings.
We also increase the length of the arrows to emphasize the loadings.
```{python}
scale_arrow = s_ = 2
scores[:,1] *= -1
pcaUS.components_[1] *= -1 # flip the y-axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(scores[:,0], scores[:,1])
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
for k in range(pcaUS.components_.shape[1]):
ax.arrow(0, 0, s_*pcaUS.components_[i,k], s_*pcaUS.components_[j,k])
ax.text(s_*pcaUS.components_[i,k],
s_*pcaUS.components_[j,k],
USArrests.columns[k])
```
The standard deviations of the principal component scores are as follows:
```{python}
scores.std(0, ddof=1)
```
The variance of each score can be extracted directly from the `pcaUS` object via
the `explained_variance_` attribute.
```{python}
pcaUS.explained_variance_
```
The proportion of variance explained by each principal
component (PVE) is stored as `explained_variance_ratio_`:
```{python}
pcaUS.explained_variance_ratio_
```
We see that the first principal component explains 62.0% of the
variance in the data, the next principal component explains 24.7%
of the variance, and so forth.
We can plot the PVE explained by each component, as well as the cumulative PVE. We first
plot the proportion of variance explained.
```{python}
# %%capture
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcaUS.n_components_)+1
ax = axes[0]
ax.plot(ticks,
pcaUS.explained_variance_ratio_,
marker='o')
ax.set_xlabel('Principal Component');
ax.set_ylabel('Proportion of Variance Explained')
ax.set_ylim([0,1])
ax.set_xticks(ticks)
```
Notice the use of `%%capture`, which suppresses the displaying of the partially completed figure.
```{python}
ax = axes[1]
ax.plot(ticks,
pcaUS.explained_variance_ratio_.cumsum(),
marker='o')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Cumulative Proportion of Variance Explained')
ax.set_ylim([0, 1])
ax.set_xticks(ticks)
fig
```
The result is similar to that shown in Figure 12.3. Note
that the method `cumsum()` computes the cumulative sum of
the elements of a numeric vector. For instance:
```{python}
a = np.array([1,2,8,-3])
np.cumsum(a)
```
## Matrix Completion
We now re-create the analysis carried out on the `USArrests` data in
Section 12.3.
We saw in Section 12.2.2 that solving the optimization
problem (12.6) on a centered data matrix $\bf X$ is
equivalent to computing the first $M$ principal
components of the data. We use our scaled
and centered `USArrests` data as $\bf X$ below. The *singular value decomposition*
(SVD) is a general algorithm for solving
(12.6).
```{python}
X = USArrests_scaled
U, D, V = np.linalg.svd(X, full_matrices=False)
U.shape, D.shape, V.shape
```
The `np.linalg.svd()` function returns three components, `U`, `D` and `V`. The matrix `V` is equivalent to the
loading matrix from principal components (up to an unimportant sign flip). Using the `full_matrices=False` option ensures that
for a tall matrix the shape of `U` is the same as the shape of `X`.
```{python}
V
```
```{python}
pcaUS.components_
```
The matrix `U` corresponds to a *standardized* version of the PCA score matrix (each column standardized to have sum-of-squares one). If we multiply each column of `U` by the corresponding element of `D`, we recover the PCA scores exactly (up to a meaningless sign flip).
```{python}
(U * D[None,:])[:3]
```
```{python}
scores[:3]
```
While it would be possible to carry out this lab using the `PCA()` estimator,
here we use the `np.linalg.svd()` function in order to illustrate its use.
We now omit 20 entries in the $50\times 4$ data matrix at random. We do so
by first selecting 20 rows (states) at random, and then selecting one
of the four entries in each row at random. This ensures that every row has
at least three observed values.
```{python}
n_omit = 20
np.random.seed(15)
r_idx = np.random.choice(np.arange(X.shape[0]),
n_omit,
replace=False)
c_idx = np.random.choice(np.arange(X.shape[1]),
n_omit,
replace=True)
Xna = X.copy()
Xna[r_idx, c_idx] = np.nan
```
Here the array `r_idx`
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
We now write some code to implement Algorithm 12.1.
We first write a function that takes in a matrix, and returns an approximation to the matrix using the `svd()` function.
This will be needed in Step 2 of Algorithm 12.1.
```{python}
def low_rank(X, M=1):
U, D, V = np.linalg.svd(X)
L = U[:,:M] * D[None,:M]
return L.dot(V[:M])
```
To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm 12.1 --- by replacing
the missing values with the column means of the non-missing entries. These are stored in
`Xbar` below after running `np.nanmean()` over the row axis.
We make a copy so that when we assign values to `Xhat` below we do not also overwrite the
values in `Xna`.
```{python}
Xhat = Xna.copy()
Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx]
```
Before we begin Step 2, we set ourselves up to measure the progress of our
iterations:
```{python}
thresh = 1e-7
rel_err = 1
count = 0
ismiss = np.isnan(Xna)
mssold = np.mean(Xhat[~ismiss]**2)
mss0 = np.mean(Xna[~ismiss]**2)
```
Here `ismiss` is a logical matrix with the same dimensions as `Xna`;
a given element is `True` if the corresponding matrix element is missing. The notation `~ismiss` negates this boolean matrix. This is useful
because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in `mss0`.
We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold` (which currently
agrees with `mss0`). We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then
iterate Step 2 of Algorithm 12.1 until the *relative error*, defined as
`(mssold - mss) / mss0`, falls below `thresh = 1e-7`.
{Algorithm 12.1 tells us to iterate Step 2 until (12.14) is no longer decreasing. Determining whether (12.14) is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm 12.1 to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.}
In Step 2(a) of Algorithm 12.1, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
```{python}
while rel_err > thresh:
count += 1
# Step 2(a)
Xapp = low_rank(Xhat, M=1)
# Step 2(b)
Xhat[ismiss] = Xapp[ismiss]
# Step 2(c)
mss = np.mean(((Xna - Xapp)[~ismiss])**2)
rel_err = (mssold - mss) / mss0
mssold = mss
print("Iteration: {0}, MSS:{1:.3f}, Rel.Err {2:.2e}"
.format(count, mss, rel_err))
```
We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
Finally, we compute the correlation between the 20 imputed values
and the actual values:
```{python}
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
```
In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python` implementations.
## Clustering
### $K$-Means Clustering
The estimator `sklearn.cluster.KMeans()` performs $K$-means clustering in
`Python`. We begin with a simple simulated example in which there
truly are two clusters in the data: the first 25 observations have a
mean shift relative to the next 25 observations.
```{python}
np.random.seed(0);
X = np.random.standard_normal((50,2));
X[:25,0] += 3;
X[:25,1] -= 4;
```
We now perform $K$-means clustering with $K=2$.
```{python}
kmeans = KMeans(n_clusters=2,
random_state=2,
n_init=20).fit(X)
```
We specify `random_state` to make the results reproducible. The cluster assignments of the 50 observations are contained in `kmeans.labels_`.
```{python}
kmeans.labels_
```
The $K$-means clustering perfectly separated the observations into two
clusters even though we did not supply any group information to
`KMeans()`. We can plot the data, with each observation
colored according to its cluster assignment.
```{python}
fig, ax = plt.subplots(1, 1, figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2");
```
Here the observations can be easily plotted because they are
two-dimensional. If there were more than two variables then we could
instead perform PCA and plot the first two principal component score
vectors to represent the clusters.
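As a minimal sketch of that idea (the five-dimensional data `X_many` below is simulated purely for illustration), we could cluster higher-dimensional data and then display the cluster labels on the first two principal component scores:
```{python}
# Sketch: K-means on simulated five-dimensional data, visualized in the
# first two principal components. A separate generator is used so that the
# global numpy random state used elsewhere in this lab is untouched.
rng_sketch = np.random.default_rng(0)
X_many = rng_sketch.standard_normal((50, 5))
labels_many = KMeans(n_clusters=2,
                     random_state=0,
                     n_init=20).fit(X_many).labels_
scores_many = PCA(n_components=2).fit_transform(X_many)
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(scores_many[:,0], scores_many[:,1], c=labels_many);
```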
In this example, we knew that there really
were two clusters because we generated the data. However, for real
data, we do not know the true number of clusters, nor whether they exist in any precise way. We could
instead have performed $K$-means clustering on this example with
$K=3$.
```{python}
kmeans = KMeans(n_clusters=3,
random_state=3,
n_init=20).fit(X)
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=3");
```
When $K=3$, $K$-means clustering splits up the two clusters.
We have used the `n_init` argument to run the $K$-means with 20
initial cluster assignments (the default is 10). If a
value of `n_init` greater than one is used, then $K$-means
clustering will be performed using multiple random assignments in
Step 1 of Algorithm 12.2, and the `KMeans()`
function will report only the best results. Here we compare using
`n_init=1` to `n_init=20`.
```{python}
kmeans1 = KMeans(n_clusters=3,
random_state=3,
n_init=1).fit(X)
kmeans20 = KMeans(n_clusters=3,
random_state=3,
n_init=20).fit(X);
kmeans1.inertia_, kmeans20.inertia_
```
Note that `kmeans.inertia_` is the total within-cluster sum
of squares, which we seek to minimize by performing $K$-means
clustering (12.17).
We *strongly* recommend always running $K$-means clustering with
a large value of `n_init`, such as 20 or 50, since otherwise an
undesirable local optimum may be obtained.
When performing $K$-means clustering, in addition to using multiple
initial cluster assignments, it is also important to set a random seed
using the `random_state` argument to `KMeans()`. This way, the initial
cluster assignments in Step 1 can be replicated, and the $K$-means
output will be fully reproducible.
### Hierarchical Clustering
The `AgglomerativeClustering()` class from
the `sklearn.cluster` package implements hierarchical clustering.
As its
name is long, we use the shorthand `HClust` for *hierarchical clustering*. Note that this will not change the return type
when using this method, so instances will still be of class `AgglomerativeClustering`.
In the following example we use the data from the previous lab to plot the hierarchical clustering
dendrogram using complete, single, and average linkage clustering
with Euclidean distance as the dissimilarity measure. We begin by
clustering observations using complete linkage.
```{python}
HClust = AgglomerativeClustering
hc_comp = HClust(distance_threshold=0,
n_clusters=None,
linkage='complete')
hc_comp.fit(X)
```
This computes the entire dendrogram.
We could just as easily perform hierarchical clustering with average or single linkage instead:
```{python}
hc_avg = HClust(distance_threshold=0,
n_clusters=None,
linkage='average');
hc_avg.fit(X)
hc_sing = HClust(distance_threshold=0,
n_clusters=None,
linkage='single');
hc_sing.fit(X);
```
To use a precomputed distance matrix, we provide an additional
argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.
```{python}
D = np.zeros((X.shape[0], X.shape[0]));
for i in range(X.shape[0]):
x_ = np.multiply.outer(np.ones(X.shape[0]), X[i])
D[i] = np.sqrt(np.sum((X - x_)**2, 1));
hc_sing_pre = HClust(distance_threshold=0,
n_clusters=None,
metric='precomputed',
linkage='single')
hc_sing_pre.fit(D)
```
We use
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
`dendrogram()` expects a so-called *linkage-matrix representation*
of the clustering, which is not provided by `AgglomerativeClustering()`,
but can be computed. The function `compute_linkage()` in the
`ISLP.cluster` package is provided for this purpose.
We can now plot the dendrograms. The numbers at the bottom of the plot
identify each observation. The `dendrogram()` function has a default method to
color different branches of the tree that suggests a pre-defined cut of the tree at a particular depth.
We prefer to overwrite this default by setting this threshold to be infinite. Since we want this behavior for many dendrograms, we store these values in a dictionary `cargs` and pass this as keyword arguments using the notation `**cargs`.
```{python}
cargs = {'color_threshold':-np.inf,
'above_threshold_color':'black'}
linkage_comp = compute_linkage(hc_comp)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp,
ax=ax,
**cargs);
```
We may want to color branches of the tree above
and below a cut-threshold differently. This can be achieved
by changing the `color_threshold`. Let's cut the tree at a height of 4,
coloring links that merge above 4 in black.
```{python}
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp,
ax=ax,
color_threshold=4,
above_threshold_color='black');
```
To determine the cluster labels for each observation associated with a
given cut of the dendrogram, we can use the `cut_tree()`
function from `scipy.cluster.hierarchy`:
```{python}
cut_tree(linkage_comp, n_clusters=4).T
```
This can also be achieved by providing an argument `n_clusters`
to `HClust()`; however, each cut would require recomputing
the clustering. Similarly, trees may be cut by distance threshold
with an argument of `distance_threshold` to `HClust()`
or `height` to `cut_tree()`.
```{python}
cut_tree(linkage_comp, height=5)
```
To scale the variables before performing hierarchical clustering of
the observations, we use `StandardScaler()` as in our PCA example:
```{python}
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
hc_comp_scale = HClust(distance_threshold=0,
n_clusters=None,
linkage='complete').fit(X_scale)
linkage_comp_scale = compute_linkage(hc_comp_scale)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp_scale, ax=ax, **cargs)
ax.set_title("Hierarchical Clustering with Scaled Features");
```
Correlation-based distances between observations can be used for
clustering. The correlation between two observations measures the
similarity of their feature values. {Suppose each observation has
$p$ features, each a single numerical value. We measure the
similarity of two such observations by computing the
correlation of these $p$ pairs of numbers.}
With $n$ observations, the $n\times n$ correlation matrix can then be used as a similarity (or affinity) matrix, i.e. so that one minus the correlation matrix is the dissimilarity matrix used for clustering.
Note that using correlation only makes sense for
data with at least three features since the absolute correlation
between any two observations with measurements on two features is
always one. Hence, we will cluster a three-dimensional data set.
```{python}
X = np.random.standard_normal((30, 3))
corD = 1 - np.corrcoef(X)
hc_cor = HClust(linkage='complete',
distance_threshold=0,
n_clusters=None,
metric='precomputed')
hc_cor.fit(corD)
linkage_cor = compute_linkage(hc_cor)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_cor, ax=ax, **cargs)
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
```
## NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic
data. In particular, PCA and hierarchical clustering are popular
tools. We illustrate these techniques on the `NCI60` cancer cell line
microarray data, which consists of 6830 gene expression
measurements on 64 cancer cell lines.
```{python}
NCI60 = load_data('NCI60')
nci_labs = NCI60['labels']
nci_data = NCI60['data']
```
Each cell line is labeled with a cancer type. We do not make use of
the cancer types in performing PCA and clustering, as these are
unsupervised techniques. But after performing PCA and clustering, we
will check to see the extent to which these cancer types agree with
the results of these unsupervised techniques.
The data has 64 rows and 6830 columns.
```{python}
nci_data.shape
```
We begin by examining the cancer types for the cell lines.
```{python}
nci_labs.value_counts()
```
### PCA on the NCI60 Data
We first perform PCA on the data after scaling the variables (genes)
to have standard deviation one, although here one could reasonably argue
that it is better not to scale the genes as they are measured in the same units.
```{python}
scaler = StandardScaler()
nci_scaled = scaler.fit_transform(nci_data)
nci_pca = PCA()
nci_scores = nci_pca.fit_transform(nci_scaled)
```
We now plot the first few principal component score vectors, in order
to visualize the data. The observations (cell lines) corresponding to
a given cancer type will be plotted in the same color, so that we can
see to what extent the observations within a cancer type are similar
to each other.
```{python}
cancer_types = list(np.unique(nci_labs))
nci_groups = np.array([cancer_types.index(lab)
for lab in nci_labs.values])
fig, axes = plt.subplots(1, 2, figsize=(15,6))
ax = axes[0]
ax.scatter(nci_scores[:,0],
nci_scores[:,1],
c=nci_groups,
marker='o',
s=50)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
ax = axes[1]
ax.scatter(nci_scores[:,0],
nci_scores[:,2],
c=nci_groups,
marker='o',
s=50)
ax.set_xlabel('PC1'); ax.set_ylabel('PC3');
```
On the whole, cell lines corresponding to a single cancer type do tend to
have similar values on the first few principal component score
vectors. This indicates that cell lines from the same cancer type tend
to have pretty similar gene expression levels.
We can also plot the percent variance
explained by the principal components as well as the cumulative percent variance explained.
This is similar to the plots we made earlier for the `USArrests` data.
```{python}
fig, axes = plt.subplots(1, 2, figsize=(15,6))
ax = axes[0]
ticks = np.arange(nci_pca.n_components_)+1
ax.plot(ticks,
nci_pca.explained_variance_ratio_,
marker='o')
ax.set_xlabel('Principal Component');
ax.set_ylabel('PVE')
ax = axes[1]
ax.plot(ticks,
nci_pca.explained_variance_ratio_.cumsum(),
marker='o');
ax.set_xlabel('Principal Component')
ax.set_ylabel('Cumulative PVE');
```
We see that together, the first seven principal components explain
around 40% of the variance in the data. This is not a huge amount
of the variance. However, looking at the scree plot, we see that while
each of the first seven principal components explain a substantial
amount of variance, there is a marked decrease in the variance
explained by further principal components. That is, there is an
*elbow* in the plot after approximately the seventh
principal component. This suggests that there may be little benefit
to examining more than seven or so principal components (though even
examining seven principal components may be difficult).
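As a numerical check on the scree plot (not part of the original lab text), we can print the cumulative PVE of the first seven components directly.
```{python}
# Cumulative proportion of variance explained by the first seven components
nci_pca.explained_variance_ratio_[:7].cumsum()
```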
### Clustering the Observations of the NCI60 Data
We now perform hierarchical clustering of the cell lines in the `NCI60` data using
complete, single, and average linkage. Once again, the goal is to find out whether or not the observations cluster into distinct types of cancer. Euclidean
distance is used as the dissimilarity measure. We first write a short
function to produce
the three dendrograms.
```{python}
def plot_nci(linkage, ax, cut=-np.inf):
cargs = {'above_threshold_color':'black',
'color_threshold':cut}
hc = HClust(n_clusters=None,
distance_threshold=0,
linkage=linkage.lower()).fit(nci_scaled)
linkage_ = compute_linkage(hc)
dendrogram(linkage_,
ax=ax,
labels=np.asarray(nci_labs),
leaf_font_size=10,
**cargs)
ax.set_title('%s Linkage' % linkage)
return hc
```
Let's plot our results.
```{python}
fig, axes = plt.subplots(3, 1, figsize=(15,30))
ax = axes[0]; hc_comp = plot_nci('Complete', ax)
ax = axes[1]; hc_avg = plot_nci('Average', ax)
ax = axes[2]; hc_sing = plot_nci('Single', ax)
```
We see that the
choice of linkage certainly does affect the results
obtained. Typically, single linkage will tend to yield *trailing*
clusters: very large clusters onto which individual observations
attach one-by-one. On the other hand, complete and average linkage
tend to yield more balanced, attractive clusters. For this reason,
complete and average linkage are generally preferred to single
linkage. Clearly cell lines within a single cancer type do tend to
cluster together, although the clustering is not perfect. We will use
complete linkage hierarchical clustering for the analysis that
follows.
We can cut the dendrogram at the height that will yield a particular
number of clusters, say four:
```{python}
linkage_comp = compute_linkage(hc_comp)
comp_cut = cut_tree(linkage_comp, n_clusters=4).reshape(-1)
pd.crosstab(nci_labs['label'],
pd.Series(comp_cut.reshape(-1), name='Complete'))
```
There are some clear patterns. All the leukemia cell lines fall in
one cluster, while the breast cancer cell lines are spread out over
three different clusters.
We can plot a cut on the dendrogram that produces these four clusters:
```{python}
fig, ax = plt.subplots(figsize=(10,10))
plot_nci('Complete', ax, cut=140)
ax.axhline(140, c='r', linewidth=4);
```
The `axhline()` function draws a horizontal line on top of any
existing set of axes. The argument `140` plots a horizontal
line at height 140 on the dendrogram; this is a height that
results in four distinct clusters. It is easy to verify that the
resulting clusters are the same as the ones we obtained in
`comp_cut`.
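One way to verify this (a small sketch beyond the lab text) is to cut the linkage at height 140 and cross-tabulate the resulting labels against `comp_cut`; up to an arbitrary relabeling of the clusters, the two partitions should agree.
```{python}
# Cutting at height 140 should reproduce the four clusters in comp_cut,
# up to a relabeling of the cluster identifiers
height_cut = cut_tree(linkage_comp, height=140).reshape(-1)
pd.crosstab(pd.Series(comp_cut, name='n_clusters=4'),
            pd.Series(height_cut, name='height=140'))
```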
We claimed earlier in Section 12.4.2 that
$K$-means clustering and hierarchical clustering with the dendrogram
cut to obtain the same number of clusters can yield very different
results. How do these `NCI60` hierarchical clustering results compare
to what we get if we perform $K$-means clustering with $K=4$?
```{python}
nci_kmeans = KMeans(n_clusters=4,
random_state=0,
n_init=20).fit(nci_scaled)
pd.crosstab(pd.Series(comp_cut, name='HClust'),
pd.Series(nci_kmeans.labels_, name='K-means'))
```
We see that the four clusters obtained using hierarchical clustering
and $K$-means clustering are somewhat different. First we note
that the labels in the two clusterings are arbitrary. That is, swapping
the identifier of the cluster does not
change the clustering. We see here that cluster 3 in
$K$-means clustering is identical to cluster 2 in hierarchical
clustering. However, the other clusters differ: for instance,
cluster 0 in $K$-means clustering contains a portion of the
observations assigned to cluster 0 by hierarchical clustering, as well
as all of the observations assigned to cluster 1 by hierarchical
clustering.
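To summarize the overall agreement between the two partitions in a single number, we could (as a sketch beyond the lab itself) compute the adjusted Rand index between the two label vectors; it equals one for identical partitions and is near zero for chance-level agreement.
```{python}
# Adjusted Rand index between the hierarchical and K-means partitions
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(comp_cut, nci_kmeans.labels_)
```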
Rather than performing hierarchical clustering on the entire data
matrix, we can also perform hierarchical clustering on the first few
principal component score vectors, regarding these first few components
as a less noisy version of the data.
```{python}
hc_pca = HClust(n_clusters=None,
distance_threshold=0,
linkage='complete'
).fit(nci_scores[:,:5])
linkage_pca = compute_linkage(hc_pca)
fig, ax = plt.subplots(figsize=(8,8))
dendrogram(linkage_pca,
labels=np.asarray(nci_labs),
leaf_font_size=10,
ax=ax,
**cargs)
ax.set_title("Hier. Clust. on First Five Score Vectors")
pca_labels = pd.Series(cut_tree(linkage_pca,
n_clusters=4).reshape(-1),
name='Complete-PCA')
pd.crosstab(nci_labs['label'], pca_labels)
```
@@ -1,574 +0,0 @@
# Multiple Testing
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch13-multiple-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch13-multiple-lab.ipynb)
We include our usual imports seen in earlier labs.
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data
```
We also collect the new imports
needed for this lab.
```{python}
from scipy.stats import \
(ttest_1samp,
ttest_rel,
ttest_ind,
t as t_dbn)
from statsmodels.stats.multicomp import \
pairwise_tukeyhsd
from statsmodels.stats.multitest import \
multipletests as mult_test
```
## Review of Hypothesis Tests
We begin by performing some one-sample $t$-tests.
First we create 100 variables, each consisting of 10 observations. The
first 50 variables have mean $0.5$ and variance $1$, while the others
have mean $0$ and variance $1$.
```{python}
rng = np.random.default_rng(12)
X = rng.standard_normal((10, 100))
true_mean = np.array([0.5]*50 + [0]*50)
X += true_mean[None,:]
```
To begin, we use `ttest_1samp()` from the
`scipy.stats` module to test $H_{0}: \mu_1=0$, the null
hypothesis that the first variable has mean zero.
```{python}
result = ttest_1samp(X[:,0], 0)
result.pvalue
```
The $p$-value comes out to 0.931, which is not low enough to
reject the null hypothesis at level $\alpha=0.05$. In this case,
$\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made
a Type II error by failing to reject the null hypothesis when the null
hypothesis is false.
We now test $H_{0,j}: \mu_j=0$ for $j=1,\ldots,100$. We compute the
100 $p$-values, and then construct a vector recording whether the
$j$th $p$-value is less than or equal to 0.05, in which case we reject
$H_{0j}$, or greater than 0.05, in which case we do not reject
$H_{0j}$, for $j=1,\ldots,100$.
```{python}
p_values = np.empty(100)
for i in range(100):
p_values[i] = ttest_1samp(X[:,i], 0).pvalue
decision = pd.cut(p_values,
[0, 0.05, 1],
labels=['Reject H0',
'Do not reject H0'])
truth = pd.Categorical(true_mean == 0,
categories=[True, False],
ordered=True)
```
Since this is a simulated data set, we can create a $2 \times 2$ table
similar to Table 13.2.
```{python}
pd.crosstab(decision,
truth,
rownames=['Decision'],
colnames=['H0'])
```
Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false
null hypotheses, and we incorrectly reject 5 of the true null
hypotheses. Using the notation from Section 13.3, we have
$V=5$, $S=15$, $U=45$ and $W=35$.
We have set $\alpha=0.05$, which means that we expect to reject around
5% of the true null hypotheses. This is in line with the $2 \times 2$
table above, which indicates that we rejected $V=5$ of the $50$ true
null hypotheses.
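As a sketch (using only quantities already in the workspace), we can recover $V$, $S$, $U$, and $W$ directly from the $p$-values and the vector of true means.
```{python}
# Tally the four cells of the table directly
rejected = p_values <= 0.05
null_is_true = (true_mean == 0)
V = int((rejected & null_is_true).sum())    # true nulls rejected (Type I errors)
S = int((rejected & ~null_is_true).sum())   # false nulls correctly rejected
U = int((~rejected & null_is_true).sum())   # true nulls not rejected
W = int((~rejected & ~null_is_true).sum())  # false nulls missed (Type II errors)
V, S, U, W
```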
In the simulation above, for the false null hypotheses, the ratio of
the mean to the standard deviation was only $0.5/1 = 0.5$. This
amounts to quite a weak signal, and it resulted in a high number of
Type II errors. Let's instead simulate data with a stronger signal,
so that the ratio of the mean to the standard deviation for the false
null hypotheses equals $1$. We make only 10 Type II errors.
```{python}
true_mean = np.array([1]*50 + [0]*50)
X = rng.standard_normal((10, 100))
X += true_mean[None,:]
for i in range(100):
p_values[i] = ttest_1samp(X[:,i], 0).pvalue
decision = pd.cut(p_values,
[0, 0.05, 1],
labels=['Reject H0',
'Do not reject H0'])
truth = pd.Categorical(true_mean == 0,
categories=[True, False],
ordered=True)
pd.crosstab(decision,
truth,
rownames=['Decision'],
colnames=['H0'])
```
## Family-Wise Error Rate
Recall from 13.5 that if the null hypothesis is true
for each of $m$ independent hypothesis tests, then the FWER is equal
to $1-(1-\alpha)^m$. We can use this expression to compute the FWER
for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.
We plot the FWER for these values of $\alpha$ in order to
reproduce Figure 13.2.
```{python}
m = np.linspace(1, 501)
fig, ax = plt.subplots()
[ax.plot(m,
1 - (1 - alpha)**m,
label=r'$\alpha=%s$' % str(alpha))
for alpha in [0.05, 0.01, 0.001]]
ax.set_xscale('log')
ax.set_xlabel('Number of Hypotheses')
ax.set_ylabel('Family-Wise Error Rate')
ax.legend()
ax.axhline(0.05, c='k', ls='--');
```
As discussed previously, even for moderate values of $m$ such as $50$,
the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value,
such as $0.001$. Of course, the problem with setting $\alpha$ to such
a low value is that we are likely to make a number of Type II errors:
in other words, our power is very low.
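To put numbers on this claim, here is the FWER at $m=50$ for each value of $\alpha$ plotted above (a quick check using the same formula).
```{python}
# FWER = 1 - (1 - alpha)^m at m = 50
{alpha: 1 - (1 - alpha)**50 for alpha in [0.05, 0.01, 0.001]}
```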
We now conduct a one-sample $t$-test for each of the first five
managers in the
`Fund` dataset, in order to test the null
hypothesis that the $j$th fund manager's mean return equals zero,
$H_{0,j}: \mu_j=0$.
```{python}
Fund = load_data('Fund')
fund_mini = Fund.iloc[:,:5]
fund_mini_pvals = np.empty(5)
for i in range(5):
fund_mini_pvals[i] = ttest_1samp(fund_mini.iloc[:,i], 0).pvalue
fund_mini_pvals
```
The $p$-values are low for Managers One and Three, and high for the
other three managers. However, we cannot simply reject $H_{0,1}$ and
$H_{0,3}$, since this would fail to account for the multiple testing
that we have performed. Instead, we will conduct Bonferroni's method
and Holm's method to control the FWER.
To do this, we use the `multipletests()` function from the
`statsmodels` module (abbreviated to `mult_test()`). Given the $p$-values,
for methods like Holm and Bonferroni the function outputs
adjusted $p$-values, which
can be thought of as a new set of $p$-values that have been corrected
for multiple testing. If the adjusted $p$-value for a given hypothesis
is less than or equal to $\alpha$, then that hypothesis can be
rejected while maintaining a FWER of no more than $\alpha$. In other
words, for such methods, the adjusted $p$-values resulting from the `multipletests()`
function can simply be compared to the desired FWER in order to
determine whether or not to reject each hypothesis. We will later
see that we can use the same function to control FDR as well.
The `mult_test()` function takes $p$-values and a `method` argument, as well as an optional
`alpha` argument. It returns the decisions (`reject` below)
as well as the adjusted $p$-values (`bonf`).
```{python}
reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
reject
```
The adjusted $p$-values `bonf` are simply the entries of `fund_mini_pvals` multiplied by 5 and truncated to be less than
or equal to 1.
```{python}
bonf, np.minimum(fund_mini_pvals * 5, 1)
```
Therefore, using Bonferroni's method, we are able to reject the null hypothesis only for Manager
One while controlling FWER at $0.05$.
By contrast, using Holm's method, the adjusted $p$-values indicate
that we can reject the null
hypotheses for Managers One and Three at a FWER of $0.05$.
```{python}
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
```
As discussed previously, Manager One seems to perform particularly
well, whereas Manager Two has poor performance.
```{python}
fund_mini.mean()
```
Is there evidence of a meaningful difference in performance between
these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function
from `scipy.stats`:
```{python}
ttest_rel(fund_mini['Manager1'],
fund_mini['Manager2']).pvalue
```
The test results in a $p$-value of 0.038,
suggesting a statistically significant difference.
However, we decided to perform this test only after examining the data
and noting that Managers One and Two had the highest and lowest mean
performances. In a sense, this means that we have implicitly
performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than
just one, as discussed in Section 13.3.2. Hence, we use the
`pairwise_tukeyhsd()` function from
`statsmodels.stats.multicomp` to apply Tukey's method
in order to adjust for multiple testing. Rather than a fitted model, this function takes
as input the response values and the corresponding group labels, the ingredients of an
*ANOVA* regression, which is essentially just a linear regression in which all of the
predictors are qualitative. In this case, the response consists of the monthly
excess returns achieved by each manager, and the predictor indicates
the manager to which each return corresponds.
```{python}
returns = np.hstack([fund_mini.iloc[:,i] for i in range(5)])
managers = np.hstack([[i+1]*50 for i in range(5)])
tukey = pairwise_tukeyhsd(returns, managers)
print(tukey.summary())
```
The `pairwise_tukeyhsd()` function provides confidence intervals
for the difference between each pair of managers (`lower` and
`upper`), as well as a $p$-value. All of these quantities have
been adjusted for multiple testing. Notice that the $p$-value for the
difference between Managers One and Two has increased from $0.038$ to
$0.186$, so there is no longer clear evidence of a difference between
the managers' performances. We can plot the confidence intervals for
the pairwise comparisons using the `plot_simultaneous()` method
of `tukey`. Any pair of intervals that don't overlap indicates a significant difference at the nominal level of 0.05. In this case,
no differences are considered significant as reported in the table above.
```{python}
fig, ax = plt.subplots(figsize=(8,8))
tukey.plot_simultaneous(ax=ax);
```
## False Discovery Rate
Now we perform hypothesis tests for all 2,000 fund managers in the
`Fund` dataset. We perform a one-sample $t$-test
of $H_{0,j}: \mu_j=0$, which states that the
$j$th fund manager's mean return is zero.
```{python}
fund_pvalues = np.empty(2000)
for i, manager in enumerate(Fund.columns):
fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue
```
There are far too many managers to consider trying to control the FWER.
Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.
```{python}
fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
fund_qvalues[:10]
```
The *q-values* output by the
Benjamini--Hochberg procedure can be interpreted as the smallest FDR
threshold at which we would reject a particular null hypothesis. For
instance, a $q$-value of $0.1$ indicates that we can reject the
corresponding null hypothesis at an FDR of 10% or greater, but that
we cannot reject the null hypothesis at an FDR below 10%.
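For instance (a small sketch beyond the lab text), we can list the managers with the smallest $q$-values, i.e. those with the strongest evidence of a non-zero mean return.
```{python}
# Managers with the five smallest Benjamini--Hochberg q-values
pd.Series(fund_qvalues, index=Fund.columns).nsmallest(5)
```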
If we control the FDR at 10%, then for how many of the fund managers can we reject $H_{0,j}: \mu_j=0$?
```{python}
(fund_qvalues <= 0.1).sum()
```
We find that 146 of the 2,000 fund managers have a $q$-value below
0.1; therefore, we are able to conclude that 146 of the fund managers
beat the market at an FDR of 10%. Only about 15 (10% of 146) of
these fund managers are likely to be false discoveries.
By contrast, if we had instead used Bonferroni's method to control the
FWER at level $\alpha=0.1$, then we would have failed to reject any
null hypotheses!
```{python}
(fund_pvalues <= 0.1 / 2000).sum()
```
Figure 13.6 displays the ordered
$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
the `Fund` dataset, as well as the threshold for rejection by the
Benjamini--Hochberg procedure. Recall that the Benjamini--Hochberg
procedure identifies the largest $p$-value such that $p_{(j)}<qj/m$,
and rejects all hypotheses for which the $p$-value is less than or
equal to $p_{(j)}$. In the code below, we implement the
Benjamini--Hochberg procedure ourselves, in order to illustrate how it
works. We first order the $p$-values. We then identify all $p$-values
that satisfy $p_{(j)}<qj/m$ (`sorted_set_`). Finally, `selected_`
is a boolean array indicating which $p$-values
are less than or equal to the largest
$p$-value in `sorted_[sorted_set_]`. Therefore, `selected_` indexes the
$p$-values rejected by the Benjamini--Hochberg procedure.
```{python}
sorted_ = np.sort(fund_pvalues)
m = fund_pvalues.shape[0]
q = 0.1
sorted_set_ = np.where(sorted_ < q * np.linspace(1, m, m) / m)[0]
if sorted_set_.shape[0] > 0:
    # reject every hypothesis whose p-value is no larger than the largest
    # ordered p-value falling below the Benjamini--Hochberg line
    selected_ = fund_pvalues <= sorted_[sorted_set_].max()
    sorted_set_ = np.arange(sorted_set_.max() + 1)
else:
selected_ = []
sorted_set_ = []
```
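As a quick sanity check (a sketch using the objects computed above), the number of hypotheses flagged by this manual implementation should match the count of $q$-values at or below 0.1 that we obtained earlier from `mult_test()`.
```{python}
# Manual Benjamini--Hochberg rejection count versus the q-value count
int(np.sum(selected_)), int((fund_qvalues <= 0.1).sum())
```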
We now reproduce the middle panel of Figure 13.6.
```{python}
fig, ax = plt.subplots()
ax.scatter(np.arange(0, sorted_.shape[0]) + 1,
sorted_, s=10)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_ylabel('P-Value')
ax.set_xlabel('Index')
ax.scatter(sorted_set_+1, sorted_[sorted_set_], c='r', s=20)
ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
```
## A Re-Sampling Approach
Here, we implement the re-sampling approach to hypothesis testing
using the `Khan` dataset, which we investigated in
Section 13.5. First, we merge the training and
testing data, which results in observations on 83 patients for
2,308 genes.
```{python}
Khan = load_data('Khan')
D = pd.concat([Khan['xtrain'], Khan['xtest']])
D['Y'] = pd.concat([Khan['ytrain'], Khan['ytest']])
D['Y'].value_counts()
```
There are four classes of cancer. For each gene, we compare the mean
expression in the second class (rhabdomyosarcoma) to the mean
expression in the fourth class (Burkitt's lymphoma). Performing a
standard two-sample $t$-test
using `ttest_ind()` from `scipy.stats` on the $11$th
gene produces a test-statistic of -2.09 and an associated $p$-value
of 0.0412, suggesting modest evidence of a difference in mean
expression levels between the two cancer types.
```{python}
D2 = D[lambda df:df['Y'] == 2]
D4 = D[lambda df:df['Y'] == 4]
gene_11 = 'G0011'
observedT, pvalue = ttest_ind(D2[gene_11],
D4[gene_11],
equal_var=True)
observedT, pvalue
```
However, this $p$-value relies on the assumption that under the null
hypothesis of no difference between the two groups, the test statistic
follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
Instead of using this theoretical null distribution, we can randomly
split the 54 patients into two groups of 29 and 25, and compute a new
test statistic. Under the null hypothesis of no difference between
the groups, this new test statistic should have the same distribution
as our original one. Repeating this process 10,000 times allows us to
approximate the null distribution of the test statistic. We then compute
the fraction of the re-sampled test statistics whose absolute value is at least
as large as that of our observed test statistic.
```{python}
B = 10000
Tnull = np.empty(B)
D_ = np.hstack([D2[gene_11], D4[gene_11]])
n_ = D2[gene_11].shape[0]
D_null = D_.copy()
for b in range(B):
rng.shuffle(D_null)
ttest_ = ttest_ind(D_null[:n_],
D_null[n_:],
equal_var=True)
Tnull[b] = ttest_.statistic
(np.abs(Tnull) >= np.abs(observedT)).mean()
```
This fraction, 0.0398,
is our re-sampling-based $p$-value.
It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure 13.7.
```{python}
fig, ax = plt.subplots(figsize=(8,8))
ax.hist(Tnull,
bins=100,
density=True,
facecolor='y',
label='Null')
xval = np.linspace(-4.2, 4.2, 1001)
ax.plot(xval,
t_dbn.pdf(xval, D_.shape[0]-2),
c='r')
ax.axvline(observedT,
c='b',
label='Observed')
ax.legend()
ax.set_xlabel("Null Distribution of Test Statistic");
```
The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
Finally, we implement the plug-in re-sampling FDR approach outlined in
Algorithm 13.4. Depending on the speed of your
computer, calculating the FDR for all 2,308 genes in the `Khan`
dataset may take a while. Hence, we will illustrate the approach on a
random subset of 100 genes. For each gene, we first compute the
observed test statistic, and then produce 10,000 re-sampled test
statistics. This may take a few minutes to run. If you are in a rush,
then you could set `B` equal to a smaller value (e.g. `B=500`).
```{python}
m, B = 100, 10000
idx = rng.choice(Khan['xtest'].columns, m, replace=False)
T_vals = np.empty(m)
Tnull_vals = np.empty((m, B))
for j in range(m):
col = idx[j]
T_vals[j] = ttest_ind(D2[col],
D4[col],
equal_var=True).statistic
D_ = np.hstack([D2[col], D4[col]])
D_null = D_.copy()
for b in range(B):
rng.shuffle(D_null)
ttest_ = ttest_ind(D_null[:n_],
D_null[n_:],
equal_var=True)
Tnull_vals[j,b] = ttest_.statistic
```
Next, we compute the number of rejected null hypotheses $R$, the
estimated number of false positives $\widehat{V}$, and the estimated
FDR, for a range of threshold values $c$ in
Algorithm 13.4. The threshold values are chosen
using the absolute values of the test statistics from the 100 genes.
```{python}
cutoffs = np.sort(np.abs(T_vals))
FDRs, Rs, Vs = np.empty((3, m))
for j in range(m):
R = np.sum(np.abs(T_vals) >= cutoffs[j])
V = np.sum(np.abs(Tnull_vals) >= cutoffs[j]) / B
Rs[j] = R
Vs[j] = V
FDRs[j] = V / R
```
Now, for any given FDR, we can find the genes that will be
rejected. For example, with FDR controlled at 0.1, we reject 15 of the
100 null hypotheses. On average, we would expect about one or two of
these genes (i.e. 10% of 15) to be false discoveries. At an FDR of
0.2, we can reject the null hypothesis for 28 genes, of which we
expect around six to be false discoveries.
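Before listing the genes themselves, we can confirm these counts with a short sketch that uses the quantities computed above.
```{python}
# Number of rejected null hypotheses at FDR thresholds of 0.1 and 0.2
[int((np.abs(T_vals) >= cutoffs[FDRs < q].min()).sum()) for q in [0.1, 0.2]]
```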
The variable `idx` stores which
genes were included in our 100 randomly selected genes. Let's look at
the genes whose estimated FDR is less than 0.1.
```{python}
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])
```
At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
proportion of false discoveries.
```{python}
sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
```
The following code generates Figure 13.11, which is similar
to Figure 13.9,
except that it is based on only a subset of the genes.
```{python}
fig, ax = plt.subplots()
ax.plot(Rs, FDRs, 'b', linewidth=3)
ax.set_xlabel("Number of Rejections")
ax.set_ylabel("False Discovery Rate");
```
@@ -95,7 +95,7 @@ Open your terminal and run the following command to set up the environment for v
* `--python-version 3.12`: This will use Python 3.12 for the environment.
* `Ch02-statlearn-lab.ipynb`: This is an optional argument to run a specific notebook after the setup is complete. It is meant for testing to be sure given notebooks run but is not required. You can list more than one notebook.
### 2. Activate the environment
### 3. Activate the environment
Once the script is finished, you can activate the virtual environment to run other notebooks or work with the lab materials.
@@ -109,7 +109,7 @@ Once the script is finished, you can activate the virtual environment to run oth
ISLP_v2.2.2\.venv\Scripts\activate
```
### 3. Run other notebooks
### 4. Run other notebooks
After activating the environment, you can start Jupyter Lab to run other notebooks.
@@ -117,6 +117,14 @@ After activating the environment, you can start Jupyter Lab to run other noteboo
jupyter lab
```
# R Markdown files
Earlier versions of this repo maintained R Markdown versions of the labs next to the Jupyter notebooks. These have been removed from the repo but can be built from the
notebooks at any time by running
```bash
make Rmd
```
## Contributors ✨