---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.19.1
---

# Cross-Validation and the Bootstrap

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch05-resample-lab.ipynb)

In this lab, we explore the resampling techniques covered in this chapter. Some of the commands in this lab may take a while to run on your computer.

We again begin by placing most of our imports at this top level.

```{python}
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
```

There are several new imports needed for this lab.

```{python}
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm
```

## The Validation Set Approach

We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the `Auto` data set.

We use the function `train_test_split()` to split the data into training and validation sets. As there are 392 observations, we split into two equal sets of size 196 using the argument `test_size=196`. It is generally a good idea to set a random seed when performing operations like this that contain an element of randomness, so that the results obtained can be reproduced precisely at a later time. We set the random seed of the splitter with the argument `random_state=0`.

```{python}
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=0)
```

Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.
```{python}
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()
```

We now use the `predict()` method of `results` evaluated on the model matrix for this model created using the validation data set. We also calculate the validation MSE of our model.

```{python}
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)
```

Hence our estimate for the validation MSE of the linear regression fit is $23.62$.

We can also estimate the validation error for higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a list of model terms, a response name, and training and test sets, and returns the MSE on the test set.

```{python}
def evalMSE(terms,
            response,
            train,
            test):

    mm = MS(terms)
    X_train = mm.fit_transform(train)
    y_train = train[response]

    X_test = mm.transform(test)
    y_test = test[response]

    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)

    return np.mean((y_test - test_pred)**2)
```

Let’s use this function to estimate the validation MSE using linear, quadratic and cubic fits. We use the `enumerate()` function here, which gives both the values and indices of objects as one iterates over a for loop.

```{python}
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

These error rates are $23.62$, $18.76$, and $18.80$, respectively. If we choose a different training/validation split instead, then we can expect somewhat different errors on the validation set.
```{python}
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

Using this split of the observations into a training set and a validation set, we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $20.76$, $16.95$, and $16.97$, respectively.

These results are consistent with our previous findings: a model that predicts `mpg` using a quadratic function of `horsepower` performs better than a model that involves only a linear function of `horsepower`, and there is no evidence of an improvement in using a cubic function of `horsepower`.

## Cross-Validation

In theory, the cross-validation estimate can be computed for any generalized linear model. In practice, however, the simplest way to cross-validate in Python is to use `sklearn`, which has a different interface, or API, than `statsmodels`, the code we have been using to fit GLMs.

This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don’t naturally speak to each other, this requires the use of a *wrapper*. In the `ISLP` package, we provide a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with models fit by `statsmodels`.

The class `sklearn_sm()` has as its first argument a model from `statsmodels`. It can take two additional optional arguments: `model_str`, which can be used to specify a formula, and `model_args`, which should be a dictionary of additional arguments used when fitting the model. For example, to fit a logistic regression model we have to specify a `family` argument. This is passed as `model_args={'family': sm.families.Binomial()}`.
Here is our wrapper in action:

```{python}
hp_model = sklearn_sm(sm.OLS,
                      MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
                            X,
                            Y,
                            cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err
```

The arguments to `cross_validate()` are as follows: an object with the appropriate `fit()`, `predict()`, and `score()` methods, an array of features `X`, and a response `Y`. We also included an additional argument `cv` to `cross_validate()`; specifying an integer $k$ results in $k$-fold cross-validation. We have provided a value corresponding to the total number of observations, which results in leave-one-out cross-validation (LOOCV). The `cross_validate()` function produces a dictionary with several components; we simply want the cross-validated test score here (MSE), which is estimated to be $24.23$.

We can repeat this procedure for increasingly complex polynomial fits. To automate the process, we again use a for loop which iteratively fits polynomial regressions of degree 1 to 5, computes the associated cross-validation error, and stores it in the $i$th element of the vector `cv_error`. The variable `d` in the for loop corresponds to the degree of the polynomial. We begin by initializing the vector. This command may take a couple of seconds to run.

```{python}
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```

As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and quadratic fits, but then no clear improvement from using higher-degree polynomials.

Above we introduced the `outer()` method of the `np.power()` ufunc. The `outer()` method is available for any ufunc that takes two arguments, such as `np.add()`, `np.minimum()`, or `np.power()`.
It has two arrays as arguments, and then forms a larger array where the operation is applied to each pair of elements of the two arrays.

```{python}
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)
```

In the CV example above, we used $k=n$, but of course we can also use $k