pairing notebooks

This commit is contained in:
Jonathan Taylor
2023-08-20 19:41:01 -07:00
parent c82e9d5067
commit 058e89ef1c
22 changed files with 489 additions and 346 deletions


@@ -1,3 +1,16 @@
---
jupyter:
jupytext:
cell_metadata_filter: -all
formats: ipynb,Rmd
main_language: python
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.14.7
---
# Chapter 6
@@ -32,7 +45,7 @@ from ISLP.models import \
(Stepwise,
sklearn_selected,
sklearn_selection_path)
!pip install l0bnb
# !pip install l0bnb
from l0bnb import fit_path
```
@@ -61,7 +74,7 @@ Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
```
We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`).
@@ -71,8 +84,8 @@ Hitters = Hitters.dropna();
Hitters.shape
```
We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built into `sklearn` as a metric. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence our scoring function computes the negative $C_p$ statistic.
@@ -106,7 +119,7 @@ neg_Cp = partial(nCp, sigma2)
```
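Such a scorer can be sketched as follows. This is a hedged sketch of a negative-$C_p$ scorer, assuming `sigma2` is an estimate of the error variance from the full model; it follows the scorer signature `(estimator, X, Y)` that `sklearn` expects, but is not necessarily the lab's exact definition.

```{python}
import numpy as np

def nCp(sigma2, estimator, X, Y):
    "Negative Cp statistic: sklearn maximizes scores, so we negate Cp."
    n, p = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat)**2)
    return -(RSS + 2 * p * sigma2) / n
```

Currying the first argument with `functools.partial(nCp, sigma2)` then yields a scorer of the expected two-data-argument form.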
We can now use `neg_Cp()` as a scorer for model selection.
Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
@@ -120,7 +133,7 @@ strategy = Stepwise.first_peak(design,
max_terms=len(design.terms))
```
We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its `fit()` method.
@@ -134,7 +147,7 @@ hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_
```
Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.
```{python}
@@ -145,7 +158,7 @@ hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_
```
### Choosing Among Models Using the Validation Set Approach and Cross-Validation
As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
@@ -167,7 +180,7 @@ strategy = Stepwise.fixed_steps(design,
full_path = sklearn_selection_path(OLS, strategy)
```
We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.
```{python}
@@ -176,8 +189,8 @@ Yhat_in = full_path.predict(Hitters)
Yhat_in.shape
```
This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves with each step we take,
indicating that we must use either the validation or cross-validation
approach to select the number of steps.
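Computing the in-sample MSE at each step is a one-line column-wise operation on this array; a minimal sketch, assuming `Yhat_in` has one column per step and `Y` holds the observed responses:

```{python}
import numpy as np

def insample_mse(Yhat_in, Y):
    # Column-wise in-sample MSE: one value per step along the path.
    return ((Yhat_in - Y[:, None])**2).mean(0)
```

Plotting these values against the step number would show the monotone in-sample improvement described above.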
@@ -266,7 +279,7 @@ ax.legend()
mse_fig
```
To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and a training set. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation; `skm.ShuffleSplit()` provides such a split.
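Constructing this splitter can be sketched as follows; the `random_state` value here is our own choice for reproducibility, not necessarily the lab's:

```{python}
import numpy as np
import sklearn.model_selection as skm

# A single random 80/20 split, usable as the `cv` argument
# in place of K-fold cross-validation.
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.2,
                              random_state=0)
```

Passing `validation` as `cv=` to `skm.cross_validate()` then evaluates the model on just this one split.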
@@ -296,7 +309,7 @@ ax.legend()
mse_fig
```
### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
@@ -324,7 +337,7 @@ path = fit_path(X,
max_nonzeros=X.shape[1])
```
The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.
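Working with that returned list amounts to indexing into its entries; a sketch with a hypothetical stand-in list (so it runs without `l0bnb` installed), assuming each entry is a dict with keys `'B'` and `'B0'` as described above:

```{python}
import numpy as np

# Stand-in for the list returned by fit_path(): per the text, each entry
# holds fitted coefficients 'B' and an intercept 'B0' (illustrative values).
path = [{'B': np.array([0.0, 0.0, 0.0]), 'B0': 1.5},
        {'B': np.array([2.1, 0.0, -0.4]), 'B0': 0.7}]

# Number of nonzero coefficients at each point on the path.
nonzeros = [int((step['B'] != 0).sum()) for step in path]
```

The same pattern, applied to the real `path`, recovers the subset size selected at each point along the best-subset path.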
```{python}
@@ -392,7 +405,7 @@ soln_path.index.name = 'negative log(lambda)'
soln_path
```
We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`.
@@ -416,14 +429,14 @@ beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat
```
Let's compute the $\ell_2$ norm of the standardized coefficients.
```{python}
np.linalg.norm(beta_hat)
```
In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$.
@@ -477,7 +490,7 @@ results = skm.cross_validate(ridge,
-results['test_score']
```
The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the training observations.
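This intercept-only baseline is easy to compute by hand; a minimal sketch of such a helper (our own, for illustration):

```{python}
import numpy as np

def null_mse(Y_train, Y_test):
    # Intercept-only baseline: predict every test observation
    # with the training mean.
    return ((Y_test - Y_train.mean())**2).mean()
```

Comparing the ridge test MSE against this baseline shows how much the predictors actually help.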
@@ -514,7 +527,7 @@ grid.best_params_['ridge__alpha']
grid.best_estimator_
```
Alternatively, we can use 5-fold cross-validation.
```{python}
@@ -540,7 +553,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```
One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$.
Let's compare $R^2$ to MSE for cross-validation here.
@@ -552,7 +565,7 @@ grid_r2 = skm.GridSearchCV(pipe,
grid_r2.fit(X, Y)
```
Finally, let's plot the results for cross-validated $R^2$.
```{python}
@@ -564,7 +577,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
```
### Fast Cross-Validation for Solution Paths
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
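`sklearn` exposes such path algorithms directly. As an illustration (our own, on synthetic data rather than `Hitters`, and using `lasso_path()` rather than the lab's pipeline):

```{python}
import numpy as np
from sklearn.linear_model import lasso_path

# lasso_path() fits the lasso over an entire grid of regularization
# values in a single call, returning the whole coefficient path.
rng = np.random.default_rng(0)
Xs = rng.standard_normal((50, 4))
Ys = Xs @ np.array([1.0, 0.0, -2.0, 0.0]) + rng.standard_normal(50)
lambdas, coefs, _ = lasso_path(Xs, Ys, n_alphas=100)
```

Because the whole path is fit at once, cross-validating over $\lambda$ is far cheaper than refitting from scratch at each grid point.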
@@ -584,7 +597,7 @@ pipeCV = Pipeline(steps=[('scaler', scaler),
pipeCV.fit(X, Y)
```
Let's again produce a plot of the cross-validation error to see that
it is similar to using `skm.GridSearchCV`.
@@ -600,7 +613,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```
We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE
@@ -610,7 +623,7 @@ associated with this value of $\lambda$?
np.min(tuned_ridge.mse_path_.mean(1))
```
This represents a further improvement over the test MSE that we got
using $\lambda=4$. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set
@@ -666,7 +679,7 @@ results = skm.cross_validate(pipeCV,
```
### The Lasso
We saw that ridge regression with a wise choice of $\lambda$ can
@@ -721,7 +734,7 @@ regression (page 305) with $\lambda$ chosen by cross-validation.
np.min(tuned_lasso.mse_path_.mean(1))
```
Let's again produce a plot of the cross-validation error.
@@ -746,7 +759,7 @@ variables.
tuned_lasso.coef_
```
As in ridge regression, we could evaluate the test error
of cross-validated lasso by first splitting into
test and training sets and internally running
@@ -757,7 +770,7 @@ this as an exercise.
## PCR and PLS Regression
### Principal Components Regression
Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition` module.
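The two stages combine naturally in a pipeline; a minimal sketch on synthetic data (our own illustration, not the `Hitters` fit shown below):

```{python}
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# PCR in two steps: project X onto its leading principal components,
# then regress the response on the component scores.
pcr = Pipeline(steps=[('pca', PCA(n_components=2)),
                      ('linreg', LinearRegression())])
rng = np.random.default_rng(1)
Xs = rng.standard_normal((40, 5))
Ys = rng.standard_normal(40)
pcr.fit(Xs, Ys)
```

Note that the regression coefficients live in component space: with two components there are two slope coefficients, regardless of how many original predictors there are.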
@@ -778,7 +791,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```
When performing PCA, the results vary depending
on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished
@@ -792,7 +805,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```
We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this
case fixing the parameters to vary only the number of components.
@@ -807,7 +820,7 @@ grid = skm.GridSearchCV(pipe,
grid.fit(X, Y)
```
Let's plot the results as we have for other methods.
```{python}
@@ -822,7 +835,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```
We see that the smallest cross-validation error occurs when
17 components are used. However, from the plot we also see that the
@@ -846,8 +859,8 @@ cv_null = skm.cross_validate(linreg,
-cv_null['test_score'].mean()
```
The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors using
different numbers of components. This concept is discussed in greater
@@ -857,7 +870,7 @@ detail in Section 12.2.
pipe.named_steps['pca'].explained_variance_ratio_
```
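The variance explained by the first $M$ components together is just the cumulative sum of these per-component ratios; a sketch with illustrative values (not the actual `Hitters` ratios):

```{python}
import numpy as np

# Per-component variance ratios (illustrative values); the cumulative
# sum gives the variance explained by the first M components.
ratios = np.array([0.38, 0.22, 0.10, 0.07])
cumulative = np.cumsum(ratios)
```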
Briefly, we can think of
this as the amount of information about the predictors
that is captured using $M$ principal components. For example, setting
@@ -880,7 +893,7 @@ pls = PLSRegression(n_components=2,
pls.fit(X, Y)
```
As was the case in PCR, we will want to
use CV to choose the number of components.
@@ -893,7 +906,7 @@ grid = skm.GridSearchCV(pls,
grid.fit(X, Y)
```
As for our other methods, we plot the MSE.
```{python}
@@ -908,7 +921,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```
CV error is minimized at 12 components,
though there is little noticeable difference between this point and a much smaller number, like 2 or 3 components.