---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.19.1
---
# Cross-Validation and the Bootstrap
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch05-resample-lab.ipynb)
In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.
We again begin by placing most of our imports at this top level.
```{python}
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
```
There are several new imports needed for this lab.
```{python}
from functools import partial
from sklearn.model_selection import (cross_validate,
                                     KFold,
                                     ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm
```
## The Validation Set Approach
We explore the use of the validation set approach in order to estimate
the test error rates that result from fitting various linear models on
the `Auto` data set.
We use the function `train_test_split()` to split
the data into training and validation sets. As there are 392 observations,
we split into two equal sets of size 196 using the
argument `test_size=196`. It is generally a good idea to set a random seed
when performing operations like this that contain an
element of randomness, so that the results obtained can be reproduced
precisely at a later time. We set the random seed of the splitter
with the argument `random_state=0`.
```{python}
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=0)
```
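Before fitting, we can confirm that the split produced two sets of 196 observations each:
```{python}
# Sanity check: both halves of the split should have 196 rows.
Auto_train.shape, Auto_valid.shape
```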
Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.
```{python}
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()
```
We now use the `predict()` method of `results` evaluated on the model matrix for this model
created using the validation data set. We also calculate the validation MSE of our model.
```{python}
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)
```
Hence our estimate for the validation MSE of the linear regression
fit is $23.62$.
We can also estimate the validation error for
higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a model string as well
as training and test sets and returns the MSE on the test set.
```{python}
def evalMSE(terms,
            response,
            train,
            test):
    mm = MS(terms)
    X_train = mm.fit_transform(train)
    y_train = train[response]
    X_test = mm.transform(test)
    y_test = test[response]
    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)
    return np.mean((y_test - test_pred)**2)
```
Let’s use this function to estimate the validation MSE
using linear, quadratic and cubic fits. We use the `enumerate()` function
here, which gives both the values and indices of objects as one iterates
over a for loop.
```{python}
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```
These error rates are $23.62, 18.76$, and $18.80$, respectively. If we
choose a different training/validation split instead, then we
can expect somewhat different errors on the validation set.
```{python}
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```
Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $20.76$, $16.95$, and $16.97$, respectively.
These results are consistent with our previous findings: a model that
predicts `mpg` using a quadratic function of `horsepower`
performs better than a model that involves only a linear function of
`horsepower`, and there is no evidence of an improvement in using a cubic function of `horsepower`.
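To see this variability directly, we can repeat the linear fit over several random splits. Here is a quick sketch using the `evalMSE()` function defined above; the seed values are arbitrary illustrative choices.
```{python}
# Sketch: the validation MSE of the linear fit varies from split to
# split; the seeds below are arbitrary illustrative choices.
for seed in range(5):
    train, valid = train_test_split(Auto,
                                    test_size=196,
                                    random_state=seed)
    mse = evalMSE([poly('horsepower', 1)],
                  'mpg',
                  train,
                  valid)
    print(f'seed {seed}: validation MSE {mse:.2f}')
```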
## Cross-Validation
In theory, the cross-validation estimate can be computed for any generalized
linear model.
In practice, however, the simplest way to cross-validate in
Python is to use `sklearn`, which has a different interface, or API,
than `statsmodels`, the package we have been using to fit GLMs.
This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don’t naturally speak to each other, this
requires the use of a *wrapper*.
In the `ISLP` package,
we provide
a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with
models fit by `statsmodels`.
The class `sklearn_sm()`
has as its first argument
a model from `statsmodels`. It can take two additional
optional arguments: `model_str` which can be
used to specify a formula, and `model_args` which should
be a dictionary of additional arguments used when fitting
the model. For example, to fit a logistic regression model
we have to specify a `family` argument. This
is passed as `model_args={'family':sm.families.Binomial()}`.
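As a sketch of that usage (the `Smarket` data and the `Lag1`/`Lag2` predictors are assumed here purely for illustration):
```{python}
# Hypothetical example: wrapping a statsmodels GLM with a binomial
# family, i.e. a logistic regression, for use with sklearn's tools.
Smarket = load_data('Smarket')
logit_model = sklearn_sm(sm.GLM,
                         MS(['Lag1', 'Lag2']),
                         model_args={'family': sm.families.Binomial()})
```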
Here is our wrapper in action:
```{python}
hp_model = sklearn_sm(sm.OLS,
                      MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
                            X,
                            Y,
                            cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err
```
The arguments to `cross_validate()` are as follows: an
object with the appropriate `fit()`, `predict()`,
and `score()` methods, an
array of features `X` and a response `Y`.
We also included an additional argument `cv` to `cross_validate()`; specifying an integer
$k$ results in $k$-fold cross-validation. We have provided a value
corresponding to the total number of observations, which results in
leave-one-out cross-validation (LOOCV). The `cross_validate()` function produces a dictionary with several components;
we simply want the cross-validated test score here (MSE), which is estimated to be 24.23.
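To see what else is returned, we can list the dictionary's keys; in addition to `test_score`, `cross_validate()` records per-fold fit and scoring times.
```{python}
# The other entries record timing information for each fold.
cv_results.keys()
```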
We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we again
use a for loop which iteratively fits polynomial
regressions of degree 1 to 5, computes the
associated cross-validation error, and stores it in the $i$th element
of the vector `cv_error`. The variable `d` in the for loop
corresponds to the degree of the polynomial. We begin by initializing the
vector. This command may take a couple of seconds to run.
```{python}
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```
As in the corresponding figure in the chapter, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.
Above we introduced the `outer()` method of the `np.power()`
function. The `outer()` method is available for any NumPy operation
that takes two arguments, such as `add()`, `minimum()`, or
`power()`. Given two arrays as arguments, it forms a larger
array in which the operation is applied to each pair of elements of the
two arrays.
```{python}
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)
```
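This is exactly the construction used in the CV loop above, where the columns of the model matrix are `H` raised to the powers $0, 1, \ldots, d$. A small sketch with a hypothetical three-element array:
```{python}
# Each column is H_small raised to a power 0, 1, 2 -- the same
# construction as the polynomial model matrix in the loop above.
H_small = np.array([1., 2., 3.])
np.power.outer(H_small, np.arange(3))
```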
In the CV example above, we used $k=n$, but of course we can also use $k<n$.
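For instance, here is a sketch of 10-fold CV for the linear fit using the `KFold` splitter imported earlier; `shuffle=True` and `random_state=0` are illustrative choices.
```{python}
# Sketch: 10-fold cross-validation for the linear horsepower fit.
cv10 = KFold(n_splits=10,
             shuffle=True,
             random_state=0)
results10 = cross_validate(hp_model,
                           Auto.drop(columns=['mpg']),
                           Auto['mpg'],
                           cv=cv10)
np.mean(results10['test_score'])
```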