standard_lasso

Jonathan corrected a spurious use of standard_lasso
This commit is contained in:
Trevor Hastie
2024-01-25 17:53:08 -08:00
parent a55798e4d3
commit e5bbb1a5bc
2 changed files with 527 additions and 176744 deletions


@@ -1,15 +1,3 @@
---
jupyter:
jupytext:
cell_metadata_filter: -all
formats: Rmd,ipynb
main_language: python
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.14.7
---
# Chapter 10
@@ -41,7 +29,7 @@ from sklearn.model_selection import \
(train_test_split,
GridSearchCV)
```
### Torch-Specific Imports
There are a number of imports for `torch`. (These are not
@@ -56,7 +44,7 @@ from torch.optim import RMSprop
from torch.utils.data import TensorDataset
```
There are several other helper packages for `torch`. For instance,
the `torchmetrics` package has utilities to compute
various metrics to evaluate performance when fitting
@@ -70,7 +58,7 @@ from torchmetrics import (MeanAbsoluteError,
from torchinfo import summary
```
The package `pytorch_lightning` is a somewhat higher-level
interface to `torch` that simplifies the specification and
fitting of
@@ -82,7 +70,7 @@ from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger
```
In order to reproduce results we use `seed_everything()`. We will also instruct `torch` to use deterministic algorithms
where possible.
@@ -92,7 +80,7 @@ seed_everything(0, workers=True)
torch.use_deterministic_algorithms(True, warn_only=True)
```
We will use several datasets shipped with `torchvision` for our
examples, as well as a pretrained network for image classification
and some transforms used for preprocessing.
@@ -125,7 +113,7 @@ from ISLP.torch import (SimpleDataModule,
rec_num_workers)
```
In addition we have included some helper
functions to load the
`IMDb` database, as well as a lookup that maps integers
@@ -159,7 +147,7 @@ from glob import glob
import json
```
## Single Layer Network on Hitters Data
We start by fitting the models in Section 10.6 on the `Hitters` data.
@@ -213,7 +201,7 @@ hit_lm = LinearRegression().fit(X_train, Y_train)
Yhat_test = hit_lm.predict(X_test)
np.abs(Yhat_test - Y_test).mean()
```
Next we fit the lasso using `sklearn`. We are using
mean absolute error to select and evaluate a model, rather than mean squared error.
The specialized solver we used in Section 6.5.2 uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
@@ -236,7 +224,7 @@ $\lambda$ with an all-zero solution. This value equals the largest absolute inn
X_s = scaler.fit_transform(X_train)
n = X_s.shape[0]
lam_max = np.fabs(X_s.T.dot(Y_train - Y_train.mean())).max() / n
param_grid = {'alpha': np.exp(np.linspace(0, np.log(0.01), 100))
param_grid = {'lasso__alpha': np.exp(np.linspace(0, np.log(0.01), 100))
* lam_max}
```
Note that we had to transform the data first, since the scale of the variables impacts the choice of $\lambda$.
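The grid key `lasso__alpha` addresses a step named `lasso` inside a pipeline. The construction of `standard_lasso` is not shown in this hunk, so the sketch below is an assumption about its shape: a `StandardScaler` followed by a `Lasso`, wrapped in a `Pipeline` so that scaling is refit within each cross-validation fold.

```{python}
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A scaler-plus-lasso pipeline; grid keys such as 'lasso__alpha'
# address the parameters of the step named 'lasso'.
standard_lasso = Pipeline(steps=[('scaler', StandardScaler()),
                                 ('lasso', Lasso())])
print(sorted(k for k in standard_lasso.get_params() if 'alpha' in k))
```

With this layout, `GridSearchCV` tunes the lasso penalty while the scaler is learned only on the training portion of each fold.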
@@ -246,13 +234,13 @@ We now perform cross-validation using this sequence of $\lambda$ values.
cv = KFold(10,
shuffle=True,
random_state=1)
grid = GridSearchCV(lasso,
grid = GridSearchCV(standard_lasso,
param_grid,
cv=cv,
scoring='neg_mean_absolute_error')
grid.fit(X_train, Y_train);
```
We extract the lasso model with best cross-validated mean absolute error, and evaluate its
performance on `X_test` and `Y_test`, which were not used in
cross-validation.
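A miniature version of this extraction step can be sketched on synthetic data (the data and variable names here are hypothetical, not the lab's `Hitters` objects): after `fit()`, `grid.best_estimator_` holds the model refit at the best $\lambda$, ready for prediction on held-out data.

```{python}
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(100)

cv = KFold(5, shuffle=True, random_state=1)
grid = GridSearchCV(Lasso(),
                    {'alpha': [0.01, 0.1, 1.0]},
                    cv=cv,
                    scoring='neg_mean_absolute_error')
grid.fit(X, y)

# The refit model at the best cross-validated alpha:
best_lasso = grid.best_estimator_
print(grid.best_params_, np.abs(best_lasso.predict(X) - y).mean())
```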
@@ -291,7 +279,7 @@ class HittersModel(nn.Module):
return torch.flatten(self.sequential(x))
```
The `class` statement identifies the code chunk as a
declaration for a class `HittersModel`
that inherits from the base class `nn.Module`. This base
@@ -325,7 +313,7 @@ to disable dropout for when we want to evaluate the model on test data.
hit_model = HittersModel(X.shape[1])
```
The object `self.sequential` is a composition of four maps. The
first maps the 19 features of `Hitters` to 50 dimensions, introducing $50\times 19+50$ parameters
for the weights and *intercept* of the map (often called the *bias*). This layer
@@ -335,7 +323,7 @@ trainable parameters is therefore $50\times 19+50+50+1=1051$.
The package `torchinfo` provides a `summary()` function that neatly summarizes
this information. We specify the size of the input and see the size
of each tensor as it passes through layers of the network.
@@ -376,7 +364,7 @@ Y_test_t = torch.tensor(Y_test.astype(np.float32))
hit_test = TensorDataset(X_test_t, Y_test_t)
```
Finally, this dataset is passed to a `DataLoader()` which ultimately
passes data into our network. While this may seem
like a lot of overhead, this structure is helpful for more
@@ -397,7 +385,7 @@ workers might be reasonable (here the max was 16).
```{python}
max_num_workers = rec_num_workers()
```
The general training setup in `pytorch_lightning` involves
training, validation and test data. These are each
represented by different data loaders. During each epoch,
@@ -422,7 +410,7 @@ hit_dm = SimpleDataModule(hit_train,
validation=hit_test)
```
Next we must provide a `pytorch_lightning` module that controls
the steps performed during the training process. We provide methods for our
`SimpleModule()` that simply record the value
@@ -436,7 +424,7 @@ hit_module = SimpleModule.regression(hit_model,
metrics={'mae':MeanAbsoluteError()})
```
By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in
(10.23).
We have also asked for mean absolute error to be tracked as well
@@ -450,7 +438,7 @@ we will not cover those here in detail.
```{python}
hit_logger = CSVLogger('logs', name='hitters')
```
Finally we are ready to train our model and log the results. We
use the `Trainer()` object from `pytorch_lightning`
to do this work. The argument `datamodule=hit_dm` tells the trainer
@@ -487,8 +475,8 @@ data using the `test()` method of our trainer.
hit_trainer.test(hit_module, datamodule=hit_dm)
```
The results of the fit have been logged into a CSV file. We can find the
results specific to this run in the `experiment.metrics_file_path`
attribute of our logger. Note that each time the model is fit, the logger will output
@@ -544,7 +532,7 @@ ax.set_ylim([0, 400])
ax.set_xticks(np.linspace(0, 50, 11).astype(int));
```
We can predict directly from the final model, and
evaluate its performance on the test data.
Before fitting, we call the `eval()` method
@@ -562,7 +550,7 @@ preds = hit_module(X_test_t)
torch.abs(Y_test_t - preds).mean()
```
### Cleanup
In setting up our data module, we had initiated
@@ -583,7 +571,7 @@ del(Hitters,
hit_trainer, hit_module)
```
## Multilayer Network on the MNIST Digit Data
The `torchvision` package comes with a number of example datasets,
@@ -602,7 +590,7 @@ data will be downloaded the first time this function is executed, and stored in
mnist_train
```
There are 60,000 images in the training data and 10,000 in the test
data. The images are $28\times 28$, and stored as a matrix of pixels. We
need to transform each one into a vector.
@@ -628,7 +616,7 @@ mnist_dm = SimpleDataModule(mnist_train,
batch_size=256)
```
Let's take a look at the data that will be fed into our network. We loop through the first few
chunks of the test dataset, breaking after 2 batches:
@@ -640,8 +628,8 @@ for idx, (X_ ,Y_) in enumerate(mnist_dm.train_dataloader()):
break
```
We see that the $X$ for each batch consists of 256 images of size `1x28x28`.
Here the `1` indicates a single channel (greyscale). For RGB images such as `CIFAR100` below,
we will see that the `1` in the size will be replaced by `3` for the three RGB channels.
@@ -668,7 +656,7 @@ class MNISTModel(nn.Module):
def forward(self, x):
return self._forward(x)
```
We see that in the first layer, each `1x28x28` image is flattened, then mapped to
256 dimensions where we apply a ReLU activation with 40% dropout.
A second layer maps the first layer's output down to
@@ -680,14 +668,14 @@ the 128 dimensions are mapped down to 10, the number of classes in the
mnist_model = MNISTModel()
```
We can check that the model produces output of expected size based
on our existing batch `X_` above.
```{python}
mnist_model(X_).size()
```
Let's take a look at the summary of the model. Instead of an `input_size`, we can pass
a tensor of the correct shape. In this case, we pass through the final
batched `X_` from above.
@@ -712,7 +700,7 @@ mnist_module = SimpleModule.classification(mnist_model,
mnist_logger = CSVLogger('logs', name='MNIST')
```
Now we are ready to go. The final step is to supply training data, and fit the model.
```{python}
@@ -735,7 +723,7 @@ alternative to actually supplying validation data, like we did for the `Hitters`
SGD uses batches
of 256 observations in computing the gradient, and doing the
arithmetic, we see that an epoch corresponds to 188 gradient steps.
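The arithmetic can be checked directly. The 20% validation split assumed below is based on the surrounding discussion of validation data and is not shown in this hunk:

```{python}
from math import ceil

# Assuming a 20% validation split, 48,000 of the 60,000 MNIST images
# are used for training; SGD with batches of 256 then takes
# ceil(48000 / 256) gradient steps per epoch.
n_train = int(60_000 * 0.8)
steps_per_epoch = ceil(n_train / 256)
print(steps_per_epoch)  # 188
```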
`SimpleModule.classification()` includes
an accuracy metric by default. Other
@@ -762,7 +750,7 @@ Once again we evaluate the accuracy using the `test()` method of our trainer. Th
mnist_trainer.test(mnist_module,
datamodule=mnist_dm)
```
Table 10.1 also reports the error rates resulting from LDA (Chapter 4) and multiclass logistic
regression. For LDA we refer the reader to Section 4.7.3.
Although we could use the `sklearn` function `LogisticRegression()` to fit
@@ -839,7 +827,7 @@ cifar_train = TensorDataset(cifar_train_X,
cifar_test = TensorDataset(cifar_test_X,
torch.tensor(cifar_test.targets))
```
The `CIFAR100` dataset consists of 50,000 training images, each represented by a three-dimensional tensor:
each three-color image is represented as a set of three channels, each of which consists of
$32\times 32$ eight-bit pixels. We standardize as we did for the
@@ -865,10 +853,10 @@ for idx, (X_ ,Y_) in enumerate(cifar_dm.train_dataloader()):
break
```
Before we start, we look at some of the training images; similar code produced
Figure 10.5 on page 447. The example below also illustrates
Figure 10.5 on page 406. The example below also illustrates
that `TensorDataset` objects can be indexed with integers --- we are choosing
random images from the training data by indexing `cifar_train`. In order to display correctly,
we must reorder the dimensions by a call to `np.transpose()`.
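A minimal sketch of the reordering, using a zero tensor in place of an actual CIFAR image: the data are stored channels-first, while `imshow()` expects channels-last.

```{python}
import numpy as np

# A CIFAR image is stored as (channels, height, width) = (3, 32, 32);
# matplotlib's imshow() expects (height, width, channels).
img = np.zeros((3, 32, 32))
img_for_plot = np.transpose(img, [1, 2, 0])
print(img_for_plot.shape)  # (32, 32, 3)
```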
@@ -917,7 +905,7 @@ class BuildingBlock(nn.Module):
return self.pool(self.activation(self.conv(x)))
```
Notice that we used the `padding = "same"` argument to
`nn.Conv2d()`, which ensures that the output channels have the
same dimension as the input channels. There are 32 channels in the first
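This shape-preserving behavior can be sketched with a toy convolution (not the lab's `BuildingBlock` class): with `padding="same"` and the default stride of 1, only the channel dimension changes.

```{python}
import torch
import torch.nn as nn

# With padding="same", a 3x3 convolution leaves the spatial dimensions
# unchanged; only the number of channels changes (here 3 -> 32).
conv = nn.Conv2d(in_channels=3, out_channels=32,
                 kernel_size=(3, 3), padding='same')
x = torch.randn(2, 3, 32, 32)   # a batch of two 3x32x32 images
print(conv(x).shape)  # torch.Size([2, 32, 32, 32])
```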
@@ -954,7 +942,7 @@ class CIFARModel(nn.Module):
return self.output(val)
```
We build the model and look at the summary. (We had created examples of `X_` earlier.)
```{python}
@@ -1012,7 +1000,7 @@ cifar_trainer.fit(cifar_module,
datamodule=cifar_dm)
```
This model can take 10 minutes or more to run and achieves about 42% accuracy on the test
data. Although this is not terrible for 100-class data (a random
classifier gets 1% accuracy), searching the web we see results around
@@ -1041,7 +1029,7 @@ cifar_trainer.test(cifar_module,
datamodule=cifar_dm)
```
### Hardware Acceleration
As deep learning has become ubiquitous in machine learning, hardware
@@ -1102,8 +1090,8 @@ imgs = torch.stack([torch.div(crop(resize(read_image(f))), 255)
imgs = normalize(imgs)
imgs.size()
```
We now set up the trained network with the weights we read in code block 6. The model has 50 layers, with a fair bit of complexity.
```{python}
@@ -1130,7 +1118,7 @@ We now feed our six images through the fitted network.
img_preds = resnet_model(imgs)
```
Let's look at the predicted probabilities for each of the top 3 choices. First we compute
the probabilities by applying the softmax to the logits in `img_preds`. Note that
we have had to call the `detach()` method on the tensor `img_preds` in order to convert
@@ -1141,7 +1129,7 @@ img_probs = np.exp(np.asarray(img_preds.detach()))
img_probs /= img_probs.sum(1)[:,None]
```
In order to see the class labels, we must download the index file associated with `imagenet`. (This is available from the book website and from [s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json](https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json).)
```{python}
@@ -1153,7 +1141,7 @@ class_labels = class_labels.set_index('idx')
class_labels = class_labels.sort_index()
```
We'll now construct a data frame for each image file
with the labels having the three highest probabilities as
estimated by the model above.
@@ -1167,8 +1155,8 @@ for i, imgfile in enumerate(imgfiles):
print(img_df.reset_index().drop(columns=['idx']))
```
We see that the model
is quite confident about `Flamingo.jpg`, but a little less so for the
other images.
@@ -1184,7 +1172,7 @@ del(cifar_test,
cifar_optimizer,
cifar_trainer)
```
## IMDB Document Classification
We now implement models for sentiment classification (Section 10.4) on the `IMDB`
@@ -1282,7 +1270,7 @@ summary(imdb_model,
'num_params'])
```
We'll again use
a smaller learning rate for these data,
hence we pass an `optimizer` to the
@@ -1301,7 +1289,7 @@ imdb_module = SimpleModule.binary_classification(
optimizer=imdb_optimizer)
```
Having loaded the datasets into a data module
and created a `SimpleModule`, the remaining steps
are familiar.
@@ -1315,14 +1303,14 @@ imdb_trainer = Trainer(deterministic=True,
imdb_trainer.fit(imdb_module,
datamodule=imdb_dm)
```
Evaluating the test error yields roughly 86% accuracy.
```{python}
test_results = imdb_trainer.test(imdb_module, datamodule=imdb_dm)
test_results
```
### Comparison to Lasso
We now fit a lasso logistic regression model
@@ -1375,7 +1363,7 @@ for l in lam_val:
intercepts.append(logit.intercept_)
```
The coefficients and intercepts have an extraneous dimension, which can be removed
by the `np.squeeze()` function.
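A minimal illustration (the `(1, p)` shape assumed here matches what `sklearn`'s binary `LogisticRegression` stores for its coefficients):

```{python}
import numpy as np

# For binary problems, LogisticRegression stores coef_ with shape (1, p);
# np.squeeze() drops the length-one leading axis.
coef = np.zeros((1, 5))
print(np.squeeze(coef).shape)  # (5,)
```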
@@ -1388,7 +1376,7 @@ We'll now make a plot to compare our neural network results with the
lasso.
```{python}
# %%capture
%%capture
fig, axes = subplots(1, 2, figsize=(16, 8), sharey=True)
for ((X_, Y_),
data_,
@@ -1444,7 +1432,7 @@ del(imdb_model,
imdb_train,
imdb_test)
```
## Recurrent Neural Networks
In this lab we fit the models illustrated in
@@ -1474,7 +1462,7 @@ imdb_seq_dm = SimpleDataModule(imdb_seq_train,
)
```
The first layer of the RNN is an embedding layer of size 32, which will be
learned during training. This layer one-hot encodes each document
as a matrix of dimension $500 \times 10,003$, and then maps these
@@ -1486,7 +1474,7 @@ matrix of size $10,003\times 32$; each of the 500 integers in the
document are then mapped to the appropriate 32 real numbers by
indexing the appropriate rows of this matrix.
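The equivalence between one-hot encoding followed by a matrix multiply and a simple row lookup can be sketched with toy sizes standing in for the $10,003\times 32$ embedding matrix:

```{python}
import numpy as np

# Toy embedding matrix: 10 tokens, 4-dimensional embeddings.
vocab_size, embed_dim = 10, 4
E = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

doc = np.array([3, 1, 3])            # a "document" of three token indices
one_hot = np.eye(vocab_size)[doc]    # one-hot matrix, shape (3, 10)

# One-hot matmul and direct row indexing give the same result.
assert np.array_equal(one_hot @ E, E[doc])
print(E[doc].shape)  # (3, 4)
```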
The second layer is an LSTM with 32 units, and the output
layer is a single logit for the binary classification task.
In the last line of the `forward()` method below,
@@ -1517,7 +1505,7 @@ summary(lstm_model,
'num_params'])
```
The 10,003 is suppressed in the summary, but we see it in the
parameter count, since $10,003\times 32=320,096$.
@@ -1542,7 +1530,7 @@ track the test performance as the network is fit, and see that it attains 85% ac
```{python}
lstm_trainer.test(lstm_module, datamodule=imdb_seq_dm)
```
We once again show the learning progress, followed by cleanup.
```{python}
@@ -1568,7 +1556,7 @@ del(lstm_model,
imdb_seq_test)
```
### Time Series Prediction
We now show how to fit the models in Section 10.5.2
@@ -1585,7 +1573,7 @@ X = pd.DataFrame(StandardScaler(
index=NYSE.index)
```
Next we set up the lagged versions of the data, dropping
any rows with missing values using the `dropna()` method.
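A minimal sketch of lag construction with `shift()` (toy data, not the `NYSE` frame; the column naming is an assumption): each lagged column moves the series down, so row $t$ holds the value from day $t-\ell$, and the early rows with missing lags are dropped.

```{python}
import pandas as pd

df = pd.DataFrame({'DJ_return': [0.1, -0.2, 0.3, 0.0, 0.5]})
for lag in range(1, 3):
    # Row t of the lagged column holds the value from row t - lag.
    df[f'DJ_return_{lag}'] = df['DJ_return'].shift(lag)
df = df.dropna()   # drop the first rows, whose lags are missing
print(df.shape)    # (3, 3)
```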
@@ -1599,7 +1587,7 @@ X.insert(len(X.columns), 'train', NYSE['train'])
X = X.dropna()
```
Finally, we extract the response and the training indicator, and drop the current day's `DJ_return` and
`log_volatility` so as to predict only from previous days' data.
@@ -1609,8 +1597,8 @@ X = X.drop(columns=['train'] + cols)
X.columns
```
We first fit a simple linear model and compute the $R^2$ on the test data using
the `score()` method.
@@ -1619,7 +1607,7 @@ M = LinearRegression()
M.fit(X[train], Y[train])
M.score(X[~train], Y[~train])
```
We refit this model, including the factor variable `day_of_week`.
For a categorical series in `pandas`, we can form the indicators
using the `get_dummies()` method.
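A toy illustration of `get_dummies()` (the level names here are hypothetical abbreviations, not necessarily those in `NYSE`): each level of the categorical series becomes its own indicator column.

```{python}
import pandas as pd

days = pd.Series(['mon', 'tues', 'wed', 'mon'], dtype='category')
indicators = pd.get_dummies(days)   # one 0/1 column per level
print(indicators.columns.tolist())  # ['mon', 'tues', 'wed']
```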
@@ -1692,7 +1680,7 @@ class NYSEModel(nn.Module):
return torch.flatten(val)
nyse_model = NYSEModel()
```
We fit the model in a similar fashion to previous networks. We
supply the `fit` function with test data as validation data, so that when
we monitor its progress and plot the history function we can see the
@@ -1711,7 +1699,7 @@ for mask in [train, ~train]:
nyse_train, nyse_test = datasets
```
Following our usual pattern, we inspect the summary.
```{python}
@@ -1742,7 +1730,7 @@ for idx, (x, y) in enumerate(nyse_dm.train_dataloader()):
break
```
We follow our previous example for setting up a trainer for a
regression problem, requesting the $R^2$ metric
to be computed at each epoch.
@@ -1755,7 +1743,7 @@ nyse_module = SimpleModule.regression(nyse_model,
metrics={'r2':R2Score()})
```
Fitting the model should by now be familiar.
The results on the test data are very similar to the linear AR model.
@@ -1769,7 +1757,7 @@ nyse_trainer.test(nyse_module,
datamodule=nyse_dm)
```
We could also fit a model without the `nn.RNN()` layer by just
using a `nn.Flatten()` layer instead. This would be a nonlinear AR model. If in addition we excluded the
hidden layer, this would be equivalent to our earlier linear AR model.
@@ -1789,7 +1777,7 @@ for mask in [train, ~train]:
datasets.append(TensorDataset(X_day_t, Y_t))
day_train, day_test = datasets
```
Creating a data module follows a familiar pattern.
```{python}
@@ -1800,7 +1788,7 @@ day_dm = SimpleDataModule(day_train,
batch_size=64)
```
We build a `NonLinearARModel()` that takes as input the 20 features and a hidden layer with 32 units. The remaining steps are familiar.
```{python}
@@ -1826,7 +1814,7 @@ nl_module = SimpleModule.regression(nl_model,
metrics={'r2':R2Score()})
```
We continue with the usual training steps, fit the model,
and evaluate the test error. We see the test $R^2$ is a slight improvement over the linear AR model that also includes `day_of_week`.

File diff suppressed because one or more lines are too long