trying with Rmd

Rmarkdown/Ch6-varselect-lab.Rmd (new file, 928 lines)
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: notebooks///ipynb,Rmarkdown///Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 6

# Lab: Linear Models and Regularization Methods

In this lab we implement many of the techniques discussed in this chapter.
We import some of our libraries at the top level.

```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from statsmodels.api import OLS
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from ISLP import load_data
from ISLP.models import ModelSpec as MS
from functools import partial
```

We again collect the new imports needed for this lab.

```{python}
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from ISLP.models import \
     (Stepwise,
      sklearn_selected,
      sklearn_selection_path)
# !pip install l0bnb
from l0bnb import fit_path
```

We have installed the package `l0bnb` on the fly. Note the escaped `!pip install` --- this is run as a separate system command.

## Subset Selection Methods

Here we implement methods that reduce the number of parameters in a
model by restricting the model to a subset of the input variables.

### Forward Selection

We will apply the forward-selection approach to the `Hitters`
data. We wish to predict a baseball player’s `Salary` on the
basis of various statistics associated with performance in the
previous year.

First of all, we note that the `Salary` variable is missing for
some of the players. The `np.isnan()` function can be used to
identify the missing observations. It returns an array
of the same shape as the input vector, with a `True` for any elements that
are missing, and a `False` for non-missing elements. The
`sum()` method can then be used to count all of the
missing elements.

```{python}
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
```

We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (by default --- see `Hitters.dropna?`).

```{python}
Hitters = Hitters.dropna()
Hitters.shape
```

We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence
our scoring function computes the negative $C_p$ statistic.

```{python}
def nCp(sigma2, estimator, X, Y):
    "Negative Cp statistic"
    n, p = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat)**2)
    return -(RSS + 2 * p * sigma2) / n
```

We need to estimate the residual variance $\sigma^2$, which is the first argument in our scoring function above.
We will fit the biggest model, using all the variables, and estimate $\sigma^2$ based on its MSE.

```{python}
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
sigma2 = OLS(Y, X).fit().scale
```

The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$.

```{python}
neg_Cp = partial(nCp, sigma2)
```

We can now use `neg_Cp()` as a scorer for model selection.

Along with a score we need to specify the search strategy. This is done through the object
`Stepwise()` in the `ISLP.models` package. The method `Stepwise.first_peak()`
runs forward stepwise until any further additions to the model do not result
in an improvement in the evaluation score. Similarly, the method `Stepwise.fixed_steps()`
runs a fixed number of steps of stepwise search.

```{python}
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))
```

We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its
`fit` method. Without specifying a `scoring` argument, the score defaults to MSE, and so all 19 variables will be
selected.

```{python}
hitters_MSE = sklearn_selected(OLS,
                               strategy)
hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_
```

Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.

```{python}
hitters_Cp = sklearn_selected(OLS,
                              strategy,
                              scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_
```

### Choosing Among Models Using the Validation Set Approach and Cross-Validation

As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
method that stores the full path of models found in forward selection, and allows predictions for each of these. This can be done with the `sklearn_selection_path()`
estimator from `ISLP.models`. The function `cross_val_predict()` from `ISLP.models`
computes the cross-validated predictions for each of the models
along the path, which we can use to evaluate the cross-validated MSE
along the path.

Here we define a strategy that fits the full forward selection path.
While there are various parameter choices for `sklearn_selection_path()`,
we use the defaults here, which selects the model at each step based on the biggest reduction in RSS.

```{python}
strategy = Stepwise.fixed_steps(design,
                                len(design.terms),
                                direction='forward')
full_path = sklearn_selection_path(OLS, strategy)
```

We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.

```{python}
full_path.fit(Hitters, Y)
Yhat_in = full_path.predict(Hitters)
Yhat_in.shape
```

This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves with each step we take,
indicating we must use either the validation or cross-validation
approach to select the number of steps. We fix the y-axis to range from
50,000 to 250,000 to compare to the cross-validation and validation
set MSE below, as well as other methods such as ridge regression, lasso and
principal components regression.

```{python}
mse_fig, ax = subplots(figsize=(8,8))
insample_mse = ((Yhat_in - Y[:,None])**2).mean(0)
n_steps = insample_mse.shape[0]
ax.plot(np.arange(n_steps),
        insample_mse,
        'k', # color black
        label='In-sample')
ax.set_ylabel('MSE',
              fontsize=20)
ax.set_xlabel('# steps of forward stepwise',
              fontsize=20)
ax.set_xticks(np.arange(n_steps)[::2])
ax.legend()
ax.set_ylim([50000,250000]);
```

Notice the expression `None` in `Y[:,None]` above.
This adds an axis (dimension) to the one-dimensional array `Y`,
which allows it to be broadcast when subtracted from the two-dimensional `Yhat_in`.
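
This broadcasting step can be seen on a small example (toy arrays, separate from the lab data):

```{python}
import numpy as np

# A toy analogue of Y (n=3 observations) and Yhat_in (n=3 rows, 2 models).
Y_toy = np.array([1.0, 2.0, 3.0])
Yhat_toy = np.array([[1.0, 1.5],
                     [2.0, 2.5],
                     [3.0, 3.5]])

# Y_toy has shape (3,); Y_toy[:, None] has shape (3, 1), so it is
# broadcast across the two columns of Yhat_toy.
resid = Yhat_toy - Y_toy[:, None]
print(resid.shape)         # (3, 2)
print((resid**2).mean(0))  # per-model in-sample MSE: [0.  0.25]
```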

We are now ready to use cross-validation to estimate test error along
the model path. We must use *only the training observations* to perform all aspects of model-fitting --- including
variable selection. Therefore, the determination of which model of a
given size is best must be made using *only the training
observations* in each training fold. This point is subtle but important. If the full data
set is used to select the best subset at each step, then the validation
set errors and cross-validation errors that we obtain will not be
accurate estimates of the test error.

We now compute the cross-validated predicted values using 5-fold cross-validation.

```{python}
K = 5
kfold = skm.KFold(K,
                  random_state=0,
                  shuffle=True)
Yhat_cv = skm.cross_val_predict(full_path,
                                Hitters,
                                Y,
                                cv=kfold)
Yhat_cv.shape
```

The prediction matrix `Yhat_cv` is the same shape as `Yhat_in`; the difference is that the predictions in each row, corresponding to a particular sample index, were made from models fit on a training fold that did not include that row.

At each model along the path, we compute the MSE in each of the cross-validation folds.
These we will average to get the mean MSE, and can also use the individual values to compute a crude estimate of the standard error of the mean. (The estimate is crude because the five error estimates are based on overlapping training sets, and hence are not independent.)
Hence we must know the test indices for each cross-validation
split. This can be found by using the `split()` method of `kfold`. Because
we fixed the random state above, whenever we split any array with the same
number of rows as $Y$ we recover the same training and test indices, though we simply
ignore the training indices below.

```{python}
cv_mse = []
for train_idx, test_idx in kfold.split(Y):
    errors = (Yhat_cv[test_idx] - Y[test_idx,None])**2
    cv_mse.append(errors.mean(0)) # column means
cv_mse = np.array(cv_mse).T
cv_mse.shape
```
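
The claim that a seeded `KFold` produces the same splits for any array with the same number of rows can be checked on toy arrays (a small sketch, separate from the lab data):

```{python}
import numpy as np
import sklearn.model_selection as skm

kfold_toy = skm.KFold(5, random_state=0, shuffle=True)

# Two different arrays with the same number of rows.
a = np.arange(20)
b = np.arange(20) * 100.0

splits_a = [test for _, test in kfold_toy.split(a)]
splits_b = [test for _, test in kfold_toy.split(b)]

# The test indices agree fold by fold, since split() uses only the length.
same = all(np.array_equal(s, t) for s, t in zip(splits_a, splits_b))
print(same)  # True
```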

We now add the cross-validation error estimates to our MSE plot.
We include the mean error across the five folds, and the estimate of the standard error of the mean.

```{python}
ax.errorbar(np.arange(n_steps),
            cv_mse.mean(1),
            cv_mse.std(1) / np.sqrt(K),
            label='Cross-validated',
            c='r') # color red
ax.set_ylim([50000,250000])
ax.legend()
mse_fig
```

To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a test and a training set. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.2,
                              random_state=0)
for train_idx, test_idx in validation.split(Y):
    full_path.fit(Hitters.iloc[train_idx],
                  Y[train_idx])
    Yhat_val = full_path.predict(Hitters.iloc[test_idx])
    errors = (Yhat_val - Y[test_idx,None])**2
    validation_mse = errors.mean(0)
```

As for the in-sample MSE case, the validation set approach does not provide standard errors.

```{python}
ax.plot(np.arange(n_steps),
        validation_mse,
        'b--', # color blue, broken line
        label='Validation')
ax.set_xticks(np.arange(n_steps)[::2])
ax.set_ylim([50000,250000])
ax.legend()
mse_fig
```

### Best Subset Selection

Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable. We now apply best subset selection to the `Hitters`
data, which for every subset size, searches for the best set of predictors.

We will use a package called `l0bnb` to perform
best subset selection.
Instead of constraining the subset to be a given size,
this package produces a path of solutions using the subset size as a
penalty rather than a constraint. Although the distinction is subtle, the difference comes when we cross-validate.

```{python}
D = design.fit_transform(Hitters)
D = D.drop('intercept', axis=1)
X = np.asarray(D)
```

Here we excluded the first column corresponding to the intercept, as
`l0bnb` will fit the intercept separately. We can find a path using the `fit_path()` function.

```{python}
path = fit_path(X,
                Y,
                max_nonzeros=X.shape[1])
```

The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.

```{python}
path[3]
```

In the example above, we see that at the fourth step in the path, we have two nonzero coefficients in `'B'`, corresponding to the value $0.114$ for the penalty parameter `lambda_0`.
We could make predictions using this sequence of fits on a validation set as a function of `lambda_0`, or with more work using cross-validation.

## Ridge Regression and the Lasso

We will use the `sklearn.linear_model` package (for which
we use `skl` as shorthand below)
to fit ridge and lasso regularized linear models on the `Hitters` data.
We start with the model matrix `X` (without an intercept) that we computed in the previous section on best subset regression.

### Ridge Regression

We will use the function `skl.ElasticNet()` to fit both ridge and the lasso.
To fit a *path* of ridge regression models, we use
`skl.ElasticNet.path()`, which can fit both ridge and lasso, as well as a hybrid mixture; ridge regression
corresponds to `l1_ratio=0`.
It is good practice to standardize the columns of `X` in these applications, if the variables are measured in different units. Since `skl.ElasticNet()` does no normalization, we have to take care of that ourselves.
Since we
standardize first, in order to find coefficient
estimates on the original scale, we must *unstandardize*
the coefficient estimates. The parameter
$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
be consistent with the rest of this chapter, we use `lambdas`
rather than `alphas` in what follows. (At the time of publication, ridge fits like the one in code chunk [22] issue unwarranted convergence warning messages; we expect these to disappear as this package matures.)

```{python}
Xs = X - X.mean(0)[None,:]
X_scale = X.std(0)
Xs = Xs / X_scale[None,:]
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
soln_array = skl.ElasticNet.path(Xs,
                                 Y,
                                 l1_ratio=0.,
                                 alphas=lambdas)[1]
soln_array.shape
```
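
The *unstandardize* step mentioned above can be sketched on synthetic data (the names and data below are illustrative, not part of the lab): dividing each standardized-scale coefficient by the column standard deviation recovers the original-scale slope.

```{python}
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])  # mixed units
beta = np.array([2.0, -0.5, 7.0])
y_toy = X_toy @ beta + 0.1 * rng.normal(size=100)

# Standardize, fit, then map coefficients back to the original scale.
scale = X_toy.std(0)
Xs_toy = (X_toy - X_toy.mean(0)) / scale
fit = LinearRegression().fit(Xs_toy, y_toy)
beta_original = fit.coef_ / scale

print(np.round(beta_original, 2))  # close to [2.0, -0.5, 7.0]
```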

Here we extract the array of coefficients corresponding to the solutions along the regularization path.
By default the `skl.ElasticNet.path` method fits a path along
an automatically selected range of $\lambda$ values, except for the case when
`l1_ratio=0`, which results in ridge regression (as is the case here). (The reason is rather technical; for all models except ridge, we can find the smallest value of $\lambda$ for which all coefficients are zero. For ridge this value is $\infty$.) So here
we have chosen to implement the function over a grid of values ranging
from $\lambda=10^{8}$ to $\lambda=10^{-2}$ scaled by the standard
deviation of $y$, essentially covering the full range of scenarios
from the null model containing only the intercept, to the least
squares fit.

Associated with each value of $\lambda$ is a vector of ridge
regression coefficients that can be accessed as
a column of `soln_array`. In this case, `soln_array` is a $19 \times 100$ matrix, with
19 rows (one for each predictor) and 100
columns (one for each value of $\lambda$).

We transpose this matrix and turn it into a data frame to facilitate viewing and plotting.

```{python}
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
soln_path.index.name = 'negative log(lambda)'
soln_path
```

We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`.

```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20)
ax.legend(loc='upper left');
```

(We have used `latex` formatting in the horizontal label, in order to format the Greek $\lambda$ appropriately.)
We expect the coefficient estimates to be much smaller, in terms of
$\ell_2$ norm, when a large value of $\lambda$ is used, as compared to
when a small value of $\lambda$ is used. (Recall that the $\ell_2$ norm is the square root of the sum of squared coefficient values.) We display the coefficients at the $40$th step,
where $\lambda$ is 25.535.

```{python}
beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat
```

Let’s compute the $\ell_2$ norm of the standardized coefficients.

```{python}
np.linalg.norm(beta_hat)
```
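
As a quick check of the definition just given, `np.linalg.norm()` agrees with computing the square root of the sum of squares directly (toy vector, not from the lab):

```{python}
import numpy as np

beta_toy = np.array([3.0, 4.0])

# The ell_2 norm is sqrt(3^2 + 4^2) = 5.
print(np.linalg.norm(beta_toy))     # 5.0
print(np.sqrt(np.sum(beta_toy**2))) # 5.0
```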

In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$.

```{python}
beta_hat = soln_path.loc[soln_path.index[59]]
lambdas[59], np.linalg.norm(beta_hat)
```

Above we normalized `X` upfront, and fit the ridge model using `Xs`.
The `Pipeline()` object
in `sklearn` provides a clear way to separate feature
normalization from the fitting of the ridge model itself.

```{python}
ridge = skl.ElasticNet(alpha=lambdas[59], l1_ratio=0)
scaler = StandardScaler(with_mean=True, with_std=True)
pipe = Pipeline(steps=[('scaler', scaler), ('ridge', ridge)])
pipe.fit(X, Y)
```

We show that it gives the same $\ell_2$ norm as in our previous fit on the standardized data.

```{python}
np.linalg.norm(ridge.coef_)
```

Notice that the operation `pipe.fit(X, Y)` above has changed the `ridge` object, and in particular has added attributes such as `coef_` that were not there before.
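
This in-place behavior is a general `sklearn` convention --- attributes with a trailing underscore appear only after `fit()` --- and can be seen on synthetic data (an illustrative sketch, separate from the lab fit):

```{python}
import numpy as np
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=1.0, l1_ratio=0)
print(hasattr(enet, 'coef_'))  # False: the estimator has not been fit

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = rng.normal(size=50)

enet.fit(X_toy, y_toy)
print(hasattr(enet, 'coef_'))  # True: fit() added the attribute in place
```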

### Estimating Test Error of Ridge Regression

Choosing an *a priori* value of $\lambda$ for ridge regression is
difficult if not impossible. We will want to use the validation method
or cross-validation to select the tuning parameter. The reader may not
be surprised that the `Pipeline()` approach can be used in
`skm.cross_validate()` with either a validation method
(i.e. `validation`) or $k$-fold cross-validation.

We fix the random state of the splitter
so that the results obtained will be reproducible.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.5,
                              random_state=0)
ridge.alpha = 0.01
results = skm.cross_validate(ridge,
                             X,
                             Y,
                             scoring='neg_mean_squared_error',
                             cv=validation)
-results['test_score']
```

The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the
training observations. We can get the same result by fitting a ridge regression model
with a *very* large value of $\lambda$. Note that `1e10`
means $10^{10}$.

```{python}
ridge.alpha = 1e10
results = skm.cross_validate(ridge,
                             X,
                             Y,
                             scoring='neg_mean_squared_error',
                             cv=validation)
-results['test_score']
```

Obviously choosing $\lambda=0.01$ is arbitrary, so we will use cross-validation or the validation-set
approach to choose the tuning parameter $\lambda$.
The object `GridSearchCV()` allows exhaustive
grid search to choose such a parameter.

We first use the validation set method
to choose $\lambda$.

```{python}
param_grid = {'ridge__alpha': lambdas}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=validation,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_
```

Alternatively, we can use 5-fold cross-validation.

```{python}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_
```

Recall we set up the `kfold` object for 5-fold cross-validation on page 296. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
to right.

```{python}
ridge_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            -grid.cv_results_['mean_test_score'],
            yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylim([50000,250000])
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$.
Let’s compare $R^2$ to MSE for cross-validation here.

```{python}
grid_r2 = skm.GridSearchCV(pipe,
                           param_grid,
                           cv=kfold)
grid_r2.fit(X, Y)
```

Finally, let’s plot the results for cross-validated $R^2$.

```{python}
r2_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            grid_r2.cv_results_['mean_test_score'],
            yerr=grid_r2.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);
```

### Fast Cross-Validation for Solution Paths

The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
such paths, and to choose a suitable value of $\lambda$ using cross-validation. Even with
identical splits the results will not agree *exactly* with our `grid`
above because the standardization of each feature in `grid` is carried out on each fold,
while in `pipeCV` below it is carried out only once.
Nevertheless, the results are similar as the normalization
is relatively stable across folds.

```{python}
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=0,
                           cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('ridge', ridgeCV)])
pipeCV.fit(X, Y)
```

Let’s produce a plot again of the cross-validation error to see that
it is similar to using `skm.GridSearchCV`.

```{python}
tuned_ridge = pipeCV.named_steps['ridge']
ridgeCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            tuned_ridge.mse_path_.mean(1),
            yerr=tuned_ridge.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_ridge.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE
associated with this value of $\lambda$?

```{python}
np.min(tuned_ridge.mse_path_.mean(1))
```

This represents a further improvement over the test MSE that we got
using $\lambda=4$. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set
at this value of $\lambda$.

```{python}
tuned_ridge.coef_
```

As expected, none of the coefficients are zero --- ridge regression does
not perform variable selection!

### Evaluating Test Error of Cross-Validated Ridge

Choosing $\lambda$ using cross-validation provides a single regression
estimator, similar to fitting a linear regression model as we saw in
Chapter 3. It is therefore reasonable to estimate what its test error
is. We run into a problem here in that cross-validation will have
*touched* all of its data in choosing $\lambda$, hence we have no
further data to estimate test error. A compromise is to do an initial
split of the data into two disjoint sets: a training set and a test set.
We then fit a cross-validation
tuned ridge regression on the training set, and evaluate its performance on the test set.
We might call this cross-validation nested
within the validation set approach. A priori there is no reason to use
half of the data for each of the two sets in validation. Below, we use
75% for training and 25% for test, with the estimator being ridge
regression tuned using 5-fold cross-validation. This can be achieved
in code as follows:

```{python}
outer_valid = skm.ShuffleSplit(n_splits=1,
                               test_size=0.25,
                               random_state=1)
inner_cv = skm.KFold(n_splits=5,
                     shuffle=True,
                     random_state=2)
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=0,
                           cv=inner_cv)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('ridge', ridgeCV)]);
```

```{python}
results = skm.cross_validate(pipeCV,
                             X,
                             Y,
                             cv=outer_valid,
                             scoring='neg_mean_squared_error')
-results['test_score']
```

### The Lasso

We saw that ridge regression with a wise choice of $\lambda$ can
outperform least squares as well as the null model on the
`Hitters` data set. We now ask whether the lasso can yield
either a more accurate or a more interpretable model than ridge
regression. In order to fit a lasso model, we once again use the
`ElasticNetCV()` function; however, this time we use the argument
`l1_ratio=1`. Other than that change, we proceed just as we did in
fitting a ridge model.

```{python}
lassoCV = skl.ElasticNetCV(n_alphas=100,
                           l1_ratio=1,
                           cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('lasso', lassoCV)])
pipeCV.fit(X, Y)
tuned_lasso = pipeCV.named_steps['lasso']
tuned_lasso.alpha_
```

```{python}
lambdas, soln_array = skl.Lasso.path(Xs,
                                     Y,
                                     l1_ratio=1,
                                     n_alphas=100)[:2]
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
```

We can see from the coefficient plot of the standardized coefficients that depending on the choice of
tuning parameter, some of the coefficients will be exactly equal to
zero.

```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.legend(loc='upper left')
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20);
```

The smallest cross-validated error is lower than the test set MSE of the null model
and of least squares, and very similar to the test MSE of 115526.71 of ridge
regression (page 303) with $\lambda$ chosen by cross-validation.

```{python}
np.min(tuned_lasso.mse_path_.mean(1))
```

Let’s again produce a plot of the cross-validation error.

```{python}
lassoCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(tuned_lasso.alphas_),
            tuned_lasso.mse_path_.mean(1),
            yerr=tuned_lasso.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_lasso.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

However, the lasso has a substantial advantage over ridge regression
in that the resulting coefficient estimates are sparse. Here we see
that 6 of the 19 coefficient estimates are exactly zero. So the lasso
model with $\lambda$ chosen by cross-validation contains only 13
variables.

```{python}
tuned_lasso.coef_
```
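
The sparsity pattern is easy to reproduce on synthetic data (an illustrative sketch, separate from the `Hitters` fit): with a sufficiently large penalty, `Lasso` zeroes out the uninformative coefficients exactly.

```{python}
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are noise.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
n_zero = np.sum(lasso.coef_ == 0.0)
print(n_zero)  # most of the eight noise coefficients are exactly zero
```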
|
||||
|
||||
As in ridge regression, we could evaluate the test error
|
||||
of cross-validated lasso by first splitting into
|
||||
test and training sets and internally running
|
||||
cross-validation on the training set. We leave
|
||||
this as an exercise.
|
||||
|
||||
|
||||
## PCR and PLS Regression
|
||||
|
||||
### Principal Components Regression
|
||||
|
||||
|
||||
Principal components regression (PCR) can be performed using
|
||||
`PCA()` from the `sklearn.decomposition`
|
||||
module. We now apply PCR to the `Hitters` data, in order to
|
||||
predict `Salary`. Again, ensure that the missing values have
|
||||
been removed from the data, as described in Section 6.5.1.
|
||||
|
||||
We use `LinearRegression()` to fit the regression model
|
||||
here. Note that it fits an intercept by default, unlike
|
||||
the `OLS()` function seen earlier in Section 6.5.1.
|
||||
|
||||
```{python}
pca = PCA(n_components=2)
linreg = skl.LinearRegression()
pipe = Pipeline([('pca', pca),
                 ('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```
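The two coefficients printed here are on the *component* scale, not the scale of the original predictors. If coefficients for the original predictors are wanted, they can be recovered from the PCA loadings; a self-contained sketch on synthetic data (variable names hypothetical):

```{python}
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(4)
X_syn = rng.normal(size=(100, 5))
y_syn = X_syn @ np.array([1., 2., 0., 0., -1.]) + rng.normal(size=100)

pipe_syn = Pipeline([('pca', PCA(n_components=2)),
                     ('linreg', LinearRegression())]).fit(X_syn, y_syn)
pca_step = pipe_syn.named_steps['pca']
lr_step = pipe_syn.named_steps['linreg']

# PCA.transform(X) = (X - mean_) @ components_.T, so the implied
# coefficients on the original predictors are components_.T @ coef_
beta_orig = pca_step.components_.T @ lr_step.coef_
intercept_orig = lr_step.intercept_ - pca_step.mean_ @ beta_orig
pred_manual = X_syn @ beta_orig + intercept_orig
```

Mapping back this way reproduces the pipeline's predictions exactly.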
When performing PCA, the results vary depending
on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished
by including an additional step in the pipeline.
```{python}
pipe = Pipeline([('scaler', scaler),
                 ('pca', pca),
                 ('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_
```
We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this case varying only the
`pca__n_components` parameter of the pipeline.
```{python}
param_grid = {'pca__n_components': range(1, 20)}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
```
Let’s plot the results as we have for other methods.
```{python}
pcr_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['pca__n_components']
ax.errorbar(n_comp,
            -grid.cv_results_['mean_test_score'],
            grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# principal components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```
We see that the smallest cross-validation error occurs when 17
components are used. However, from the plot we also see that the
cross-validation error is roughly the same when only one component is
included in the model. This suggests that a model that uses just a
small number of components might suffice.
The CV score is provided for each possible number of components from
1 to 19 inclusive. The `PCA()` method complains
if we try to fit an intercept-only model with `n_components=0`,
so we also compute the MSE for just the null model with
these splits.
```{python}
Xn = np.zeros((X.shape[0], 1))
cv_null = skm.cross_validate(linreg,
                             Xn,
                             Y,
                             cv=kfold,
                             scoring='neg_mean_squared_error')
-cv_null['test_score'].mean()
```
The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors using
different numbers of components. This concept is discussed in greater
detail in Section 12.2.
```{python}
pipe.named_steps['pca'].explained_variance_ratio_
```
Briefly, we can think of
this as the amount of information about the predictors
that is captured using $M$ principal components. For example, setting
$M=1$ only captures 38.31% of the variance, while $M=2$ captures an additional 21.84%, for a total of 60.15% of the variance.
By $M=6$ it increases to
88.63%. Beyond this the increments continue to diminish, until we use all $M=p=19$ components, which capture 100% of the variance.
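The cumulative percentages quoted above come from summing the per-component ratios, which is conveniently done with `np.cumsum`. A self-contained sketch on synthetic predictors (variable names hypothetical; the same line applies to the fitted `pipe` above):

```{python}
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_syn = rng.normal(size=(100, 19))

pca_full = PCA(n_components=19)
pca_full.fit(StandardScaler().fit_transform(X_syn))

# Cumulative fraction of predictor variance captured by the first M components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
```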
### Partial Least Squares

Partial least squares (PLS) is implemented in the
`PLSRegression()` function.
```{python}
pls = PLSRegression(n_components=2,
                    scale=True)
pls.fit(X, Y)
```
As was the case in PCR, we will want to
use CV to choose the number of components.
```{python}
param_grid = {'n_components': range(1, 20)}
grid = skm.GridSearchCV(pls,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
```
As for our other methods, we plot the MSE.
```{python}
pls_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['n_components']
ax.errorbar(n_comp,
            -grid.cv_results_['mean_test_score'],
            grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# PLS components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);
```
CV error is minimized at 12 components,
though there is little noticeable difference between this point and a much lower number, such as 2 or 3 components.