pairing notebooks
This commit is contained in:
@@ -1,3 +1,16 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 6

@@ -32,7 +45,7 @@ from ISLP.models import \
(Stepwise,
sklearn_selected,
sklearn_selection_path)
!pip install l0bnb
# !pip install l0bnb
from l0bnb import fit_path

```

@@ -61,7 +74,7 @@ Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()

```

We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`).
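
On a small toy frame (hypothetical data, not the `Hitters` set), the default behavior looks like this:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative only, not Hitters)
df = pd.DataFrame({'Salary': [475.0, np.nan, 500.0],
                   'Hits': [81, 130, 141]})

n_missing = np.isnan(df['Salary']).sum()  # count missing salaries
clean = df.dropna()                       # drop rows with any NaN
print(n_missing, clean.shape)
```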
@@ -71,8 +84,8 @@ Hitters = Hitters.dropna();
Hitters.shape

```

We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence
@@ -106,7 +119,7 @@ neg_Cp = partial(nCp, sigma2)

```
We can now use `neg_Cp()` as a scorer for model selection.
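
A minimal sketch of such a scorer (the names `nCp` and `sigma2` follow the text above, but the exact signature in the lab's code may differ):

```python
import numpy as np
from functools import partial

def nCp(sigma2, estimator, X, Y):
    """Negative Cp: -(RSS + 2*d*sigma2)/n, negated so that
    sklearn's score *maximization* selects the smallest Cp."""
    n, d = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat) ** 2)
    return -(RSS + 2 * d * sigma2) / n

# Freeze the noise estimate so the scorer has the (estimator, X, Y) signature
sigma2 = 1.0  # placeholder; the lab estimates this from the full OLS fit
neg_Cp = partial(nCp, sigma2)
```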

Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
@@ -120,7 +133,7 @@ strategy = Stepwise.first_peak(design,
max_terms=len(design.terms))

```

We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its
@@ -134,7 +147,7 @@ hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_

```

Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.

```{python}
@@ -145,7 +158,7 @@ hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_

```

### Choosing Among Models Using the Validation Set Approach and Cross-Validation

As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
@@ -167,7 +180,7 @@ strategy = Stepwise.fixed_steps(design,
full_path = sklearn_selection_path(OLS, strategy)

```

We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.

```{python}
@@ -176,8 +189,8 @@ Yhat_in = full_path.predict(Hitters)
Yhat_in.shape

```

This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves each step we take,
indicating we must use either the validation or cross-validation
@@ -266,7 +279,7 @@ ax.legend()
mse_fig

```
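
The errors above come from comparing an `(n, steps)` array of fitted values against `Y`; a minimal sketch of that broadcasting computation, on synthetic arrays rather than the `Hitters` fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 50, 20
Yhat = rng.normal(size=(n, steps))  # fitted values, one column per model size
Y = rng.normal(size=n)

# Broadcast Y against each column, then average squared errors over rows
mse_per_step = ((Yhat - Y[:, None]) ** 2).mean(0)
print(mse_per_step.shape)
```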

To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and a training set. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation. `skm.ShuffleSplit()`
@@ -296,7 +309,7 @@ ax.legend()
mse_fig

```
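
A standalone sketch of the single-split object (the 20% test size mirrors the text; `random_state` is an arbitrary choice):

```python
import numpy as np
import sklearn.model_selection as skm

# One random 80/20 split, playing the role of the `cv` argument
validation = skm.ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# Inspect the split it produces on 100 hypothetical observations
train_idx, test_idx = next(validation.split(np.arange(100)))
print(len(train_idx), len(test_idx))
```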

### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
@@ -324,7 +337,7 @@ path = fit_path(X,
max_nonzeros=X.shape[1])

```

The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.
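
Illustratively, each element of the path can be unpacked like this (a mock list stands in for `fit_path()` output; only the `B`/`B0` keys described above are assumed):

```python
import numpy as np

# Mock of the structure described above: one dict per point on the path
path = [{'B': np.zeros(3), 'B0': 535.9},
        {'B': np.array([0.0, 3.3, 0.0]), 'B0': 274.6}]

# Number of nonzero coefficients at each step of the path
nonzeros = [int((step['B'] != 0).sum()) for step in path]
print(nonzeros)
```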

```{python}
@@ -392,7 +405,7 @@ soln_path.index.name = 'negative log(lambda)'
soln_path

```

We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`.
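
A minimal sketch of that pattern, with a small synthetic frame standing in for `soln_path`:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

soln_demo = pd.DataFrame(np.random.default_rng(1).normal(size=(10, 3)),
                         columns=['a', 'b', 'c'])

fig, ax = plt.subplots(figsize=(8, 8))
soln_demo.plot(ax=ax, legend=False)  # suppress the automatic legend
ax.legend(loc='upper left')          # then place it explicitly
```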
@@ -416,14 +429,14 @@ beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat

```

Let's compute the $\ell_2$ norm of the standardized coefficients.

```{python}
np.linalg.norm(beta_hat)

```

In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$.
@@ -477,7 +490,7 @@ results = skm.cross_validate(ridge,
-results['test_score']

```

The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the
@@ -514,7 +527,7 @@ grid.best_params_['ridge__alpha']
grid.best_estimator_

```
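
The intercept-only baseline mentioned above can be sketched as follows (hypothetical split arrays, not the lab's variables):

```python
import numpy as np

rng = np.random.default_rng(0)
Y_train = rng.normal(500, 100, size=200)  # stand-ins for the salary split
Y_test = rng.normal(500, 100, size=50)

# Null model: predict every test observation with the training mean
null_mse = np.mean((Y_test - Y_train.mean()) ** 2)
print(null_mse)
```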

Alternatively, we can use 5-fold cross-validation.

```{python}
@@ -540,7 +553,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);

```

One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$.
Let's compare $R^2$ to MSE for cross-validation here.
@@ -552,7 +565,7 @@ grid_r2 = skm.GridSearchCV(pipe,
grid_r2.fit(X, Y)

```

Finally, let's plot the results for cross-validated $R^2$.

```{python}
@@ -564,7 +577,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);

```

### Fast Cross-Validation for Solution Paths
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
@@ -584,7 +597,7 @@ pipeCV = Pipeline(steps=[('scaler', scaler),
pipeCV.fit(X, Y)

```
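
A standalone sketch of this path-based cross-validation, using `sklearn.linear_model.ElasticNetCV` on synthetic data (the lab's exact $\lambda$ grid and `l1_ratio` setting are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Cross-validate over the whole solution path in a single fit
pipeCV = Pipeline(steps=[('scaler', StandardScaler()),
                         ('enet', ElasticNetCV(n_alphas=50, cv=5))])
pipeCV.fit(X, Y)

enet = pipeCV.named_steps['enet']
print(enet.alpha_)  # the lambda minimizing cross-validated MSE
```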

Let's produce a plot again of the cross-validation error to see that
it is similar to using `skm.GridSearchCV`.

@@ -600,7 +613,7 @@ ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);

```

We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE
@@ -610,7 +623,7 @@ associated with this value of $\lambda$?
np.min(tuned_ridge.mse_path_.mean(1))

```

This represents a further improvement over the test MSE that we got
using $\lambda=4$. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set
@@ -666,7 +679,7 @@ results = skm.cross_validate(pipeCV,

```

### The Lasso
We saw that ridge regression with a wise choice of $\lambda$ can
@@ -721,7 +734,7 @@ regression (page 305) with $\lambda$ chosen by cross-validation.
np.min(tuned_lasso.mse_path_.mean(1))

```

Let's again produce a plot of the cross-validation error.

@@ -746,7 +759,7 @@ variables.
tuned_lasso.coef_

```

As in ridge regression, we could evaluate the test error
of cross-validated lasso by first splitting into
test and training sets and internally running
@@ -757,7 +770,7 @@ this as an exercise.
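
One possible outline for that exercise, on synthetic data (`LassoCV` stands in for the lab's cross-validated lasso; the split parameters are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Split first, then run cross-validation internally on the training set only
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

lasso = LassoCV(cv=5).fit(X_train, Y_train)
test_mse = np.mean((Y_test - lasso.predict(X_test)) ** 2)
print(test_mse)
```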
## PCR and PLS Regression

### Principal Components Regression

Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition`
@@ -778,7 +791,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_

```
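
A standalone sketch of the PCR pipeline on synthetic data (two components chosen arbitrarily):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=100)

# Regress Y on the first two principal components of X
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('linreg', LinearRegression())])
pipe.fit(X, Y)
print(pipe.named_steps['linreg'].coef_)  # one coefficient per component
```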

When performing PCA, the results vary depending
on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished
@@ -792,7 +805,7 @@ pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_

```

We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this
case fixing the parameters to vary the
@@ -807,7 +820,7 @@ grid = skm.GridSearchCV(pipe,
grid.fit(X, Y)

```

Let's plot the results as we have for other methods.

```{python}
@@ -822,7 +835,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);

```

We see that the smallest cross-validation error occurs when
17 components are used. However, from the plot we also see that the
@@ -846,8 +859,8 @@ cv_null = skm.cross_validate(linreg,
-cv_null['test_score'].mean()

```

The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using
different numbers of components. This concept is discussed in greater
@@ -857,7 +870,7 @@ detail in Section 12.2.
pipe.named_steps['pca'].explained_variance_ratio_

```

Briefly, we can think of
this as the amount of information about the predictors
that is captured using $M$ principal components. For example, setting
@@ -880,7 +893,7 @@ pls = PLSRegression(n_components=2,
pls.fit(X, Y)

```

As was the case in PCR, we will want to
use CV to choose the number of components.

@@ -893,7 +906,7 @@ grid = skm.GridSearchCV(pls,
grid.fit(X, Y)

```

As for our other methods, we plot the MSE.

```{python}
@@ -908,7 +921,7 @@ ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);

```

CV error is minimized at 12 components,
though there is little noticeable difference between this point and a much lower number like 2 or 3 components.