standard_lasso

Jonathan corrected a spurious use of standard_lasso
This commit is contained in:
Trevor Hastie
2024-01-25 17:53:08 -08:00
parent a55798e4d3
commit e5bbb1a5bc
2 changed files with 527 additions and 176744 deletions


@@ -1,15 +1,3 @@
---
jupyter:
jupytext:
cell_metadata_filter: -all
formats: Rmd,ipynb
main_language: python
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.14.7
---
# Chapter 10
@@ -41,7 +29,7 @@ from sklearn.model_selection import \
(train_test_split,
GridSearchCV)
```
### Torch-Specific Imports
There are a number of imports for `torch`. (These are not
@@ -56,7 +44,7 @@ from torch.optim import RMSprop
from torch.utils.data import TensorDataset
```
There are several other helper packages for `torch`. For instance,
the `torchmetrics` package has utilities to compute
various metrics to evaluate performance when fitting
@@ -70,7 +58,7 @@ from torchmetrics import (MeanAbsoluteError,
from torchinfo import summary
```
The package `pytorch_lightning` is a somewhat higher-level
interface to `torch` that simplifies the specification and
fitting of
@@ -82,7 +70,7 @@ from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger
```
In order to reproduce results we use `seed_everything()`. We will also instruct `torch` to use deterministic algorithms
where possible.
@@ -92,7 +80,7 @@ seed_everything(0, workers=True)
torch.use_deterministic_algorithms(True, warn_only=True)
```
We will use several datasets shipped with `torchvision` for our
examples, as well as a pretrained network for image classification
and some transforms used for preprocessing.
@@ -125,7 +113,7 @@ from ISLP.torch import (SimpleDataModule,
rec_num_workers)
```
In addition we have included some helper
functions to load the
`IMDb` database, as well as a lookup that maps integers
@@ -159,7 +147,7 @@ from glob import glob
import json
```
## Single Layer Network on Hitters Data
We start by fitting the models in Section 10.6 on the `Hitters` data.
@@ -213,7 +201,7 @@ hit_lm = LinearRegression().fit(X_train, Y_train)
Yhat_test = hit_lm.predict(X_test)
np.abs(Yhat_test - Y_test).mean()
```
Next we fit the lasso using `sklearn`. We are using
mean absolute error to select and evaluate a model, rather than mean squared error.
The specialized solver we used in Section 6.5.2 uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
@@ -236,7 +224,7 @@ $\lambda$ with an all-zero solution. This value equals the largest absolute inn
X_s = scaler.fit_transform(X_train)
n = X_s.shape[0]
lam_max = np.fabs(X_s.T.dot(Y_train - Y_train.mean())).max() / n
param_grid = {'alpha': np.exp(np.linspace(0, np.log(0.01), 100))
param_grid = {'lasso__alpha': np.exp(np.linspace(0, np.log(0.01), 100))
* lam_max}
```
Note that we had to transform the data first, since the scale of the variables impacts the choice of $\lambda$.
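The grid key `lasso__alpha` addresses a step named `lasso` inside a pipeline. The construction of `standard_lasso` is not shown in this hunk, so the sketch below is an assumption about its shape: a `StandardScaler` followed by a `Lasso`, wrapped in a `Pipeline` so that scaling is refit within each cross-validation fold.

```{python}
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A scaler-plus-lasso pipeline; grid keys such as 'lasso__alpha'
# address the parameters of the step named 'lasso'.
standard_lasso = Pipeline(steps=[('scaler', StandardScaler()),
                                 ('lasso', Lasso())])
print(sorted(k for k in standard_lasso.get_params() if 'alpha' in k))
```

With this layout, `GridSearchCV` tunes the lasso penalty while the scaler is learned only on the training portion of each fold.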
@@ -246,13 +234,13 @@ We now perform cross-validation using this sequence of $\lambda$ values.
cv = KFold(10,
shuffle=True,
random_state=1)
grid = GridSearchCV(lasso,
grid = GridSearchCV(standard_lasso,
param_grid,
cv=cv,
scoring='neg_mean_absolute_error')
grid.fit(X_train, Y_train);
```
We extract the lasso model with best cross-validated mean absolute error, and evaluate its
performance on `X_test` and `Y_test`, which were not used in
cross-validation.
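A miniature version of this extraction step can be sketched on synthetic data (the data and variable names here are hypothetical, not the lab's `Hitters` objects): after `fit()`, `grid.best_estimator_` holds the model refit at the best $\lambda$, ready for prediction on held-out data.

```{python}
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(100)

cv = KFold(5, shuffle=True, random_state=1)
grid = GridSearchCV(Lasso(),
                    {'alpha': [0.01, 0.1, 1.0]},
                    cv=cv,
                    scoring='neg_mean_absolute_error')
grid.fit(X, y)

# The refit model at the best cross-validated alpha:
best_lasso = grid.best_estimator_
print(grid.best_params_, np.abs(best_lasso.predict(X) - y).mean())
```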
@@ -291,7 +279,7 @@ class HittersModel(nn.Module):
return torch.flatten(self.sequential(x))
```
The `class` statement identifies the code chunk as a
declaration for a class `HittersModel`
that inherits from the base class `nn.Module`. This base
@@ -325,7 +313,7 @@ to disable dropout for when we want to evaluate the model on test data.
hit_model = HittersModel(X.shape[1])
```
The object `self.sequential` is a composition of four maps. The
first maps the 19 features of `Hitters` to 50 dimensions, introducing $50\times 19+50$ parameters
for the weights and *intercept* of the map (often called the *bias*). This layer
@@ -335,7 +323,7 @@ trainable parameters is therefore $50\times 19+50+50+1=1051$.
The package `torchinfo` provides a `summary()` function that neatly summarizes
this information. We specify the size of the input and see the size
of each tensor as it passes through layers of the network.
@@ -376,7 +364,7 @@ Y_test_t = torch.tensor(Y_test.astype(np.float32))
hit_test = TensorDataset(X_test_t, Y_test_t)
```
Finally, this dataset is passed to a `DataLoader()` which ultimately
passes data into our network. While this may seem
like a lot of overhead, this structure is helpful for more
@@ -397,7 +385,7 @@ workers might be reasonable (here the max was 16).
```{python}
max_num_workers = rec_num_workers()
```
The general training setup in `pytorch_lightning` involves
training, validation and test data. These are each
represented by different data loaders. During each epoch,
@@ -422,7 +410,7 @@ hit_dm = SimpleDataModule(hit_train,
validation=hit_test)
```
Next we must provide a `pytorch_lightning` module that controls
the steps performed during the training process. We provide methods for our
`SimpleModule()` that simply record the value
@@ -436,7 +424,7 @@ hit_module = SimpleModule.regression(hit_model,
metrics={'mae':MeanAbsoluteError()})
```
By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in
(10.23).
We have also asked for mean absolute error to be tracked as well
@@ -450,7 +438,7 @@ we will not cover those here in detail.
```{python}
hit_logger = CSVLogger('logs', name='hitters')
```
Finally we are ready to train our model and log the results. We
use the `Trainer()` object from `pytorch_lightning`
to do this work. The argument `datamodule=hit_dm` tells the trainer
@@ -487,8 +475,8 @@ data using the `test()` method of our trainer.
hit_trainer.test(hit_module, datamodule=hit_dm)
```
The results of the fit have been logged into a CSV file. We can find the
results specific to this run in the `experiment.metrics_file_path`
attribute of our logger. Note that each time the model is fit, the logger will output
@@ -544,7 +532,7 @@ ax.set_ylim([0, 400])
ax.set_xticks(np.linspace(0, 50, 11).astype(int));
```
We can predict directly from the final model, and
evaluate its performance on the test data.
Before fitting, we call the `eval()` method
@@ -562,7 +550,7 @@ preds = hit_module(X_test_t)
torch.abs(Y_test_t - preds).mean()
```
### Cleanup
In setting up our data module, we had initiated
@@ -583,7 +571,7 @@ del(Hitters,
hit_trainer, hit_module)
```
## Multilayer Network on the MNIST Digit Data
The `torchvision` package comes with a number of example datasets,
@@ -602,7 +590,7 @@ data will be downloaded the first time this function is executed, and stored in
mnist_train
```
There are 60,000 images in the training data and 10,000 in the test
data. The images are $28\times 28$, and stored as a matrix of pixels. We
need to transform each one into a vector.
@@ -628,7 +616,7 @@ mnist_dm = SimpleDataModule(mnist_train,
batch_size=256)
```
Let's take a look at the data that will be fed into our network. We loop through the first few
chunks of the test dataset, breaking after 2 batches:
@@ -640,8 +628,8 @@ for idx, (X_ ,Y_) in enumerate(mnist_dm.train_dataloader()):
break
```
We see that the $X$ for each batch consists of 256 images of size `1x28x28`.
Here the `1` indicates a single channel (greyscale). For RGB images such as `CIFAR100` below,
we will see that the `1` in the size will be replaced by `3` for the three RGB channels.
@@ -668,7 +656,7 @@ class MNISTModel(nn.Module):
def forward(self, x):
return self._forward(x)
```
We see that in the first layer, each `1x28x28` image is flattened, then mapped to
256 dimensions where we apply a ReLU activation with 40% dropout.
A second layer maps the first layer's output down to
@@ -680,14 +668,14 @@ the 128 dimensions are mapped down to 10, the number of classes in the
mnist_model = MNISTModel()
```
We can check that the model produces output of expected size based
on our existing batch `X_` above.
```{python}
mnist_model(X_).size()
```
Let's take a look at the summary of the model. Instead of an `input_size`, we can pass
a tensor of the correct shape. In this case, we pass through the final
batched `X_` from above.
@@ -712,7 +700,7 @@ mnist_module = SimpleModule.classification(mnist_model,
mnist_logger = CSVLogger('logs', name='MNIST')
```
Now we are ready to go. The final step is to supply training data, and fit the model.
```{python}
@@ -735,7 +723,7 @@ alternative to actually supplying validation data, like we did for the `Hitters`
SGD uses batches
of 256 observations in computing the gradient, and doing the
arithmetic, we see that an epoch corresponds to 188 gradient steps.
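The arithmetic can be checked directly. The 20% validation split assumed below is based on the surrounding discussion of validation data and is not shown in this hunk:

```{python}
from math import ceil

# Assuming a 20% validation split, 48,000 of the 60,000 MNIST images
# are used for training; SGD with batches of 256 then takes
# ceil(48000 / 256) gradient steps per epoch.
n_train = int(60_000 * 0.8)
steps_per_epoch = ceil(n_train / 256)
print(steps_per_epoch)  # 188
```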
`SimpleModule.classification()` includes
an accuracy metric by default. Other
@@ -762,7 +750,7 @@ Once again we evaluate the accuracy using the `test()` method of our trainer. Th
mnist_trainer.test(mnist_module,
datamodule=mnist_dm)
```
Table 10.1 also reports the error rates resulting from LDA (Chapter 4) and multiclass logistic
regression. For LDA we refer the reader to Section 4.7.3.
Although we could use the `sklearn` function `LogisticRegression()` to fit
@@ -839,7 +827,7 @@ cifar_train = TensorDataset(cifar_train_X,
cifar_test = TensorDataset(cifar_test_X,
torch.tensor(cifar_test.targets))
```
The `CIFAR100` dataset consists of 50,000 training images, each represented by a three-dimensional tensor:
each three-color image is represented as a set of three channels, each of which consists of
$32\times 32$ eight-bit pixels. We standardize as we did for the
@@ -865,10 +853,10 @@ for idx, (X_ ,Y_) in enumerate(cifar_dm.train_dataloader()):
break
```
Before we start, we look at some of the training images; similar code produced
Figure 10.5 on page 447. The example below also illustrates
Figure 10.5 on page 406. The example below also illustrates
that `TensorDataset` objects can be indexed with integers --- we are choosing
random images from the training data by indexing `cifar_train`. In order to display correctly,
we must reorder the dimensions by a call to `np.transpose()`.
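A minimal sketch of the reordering, using a zero tensor in place of an actual CIFAR image: the data are stored channels-first, while `imshow()` expects channels-last.

```{python}
import numpy as np

# A CIFAR image is stored as (channels, height, width) = (3, 32, 32);
# matplotlib's imshow() expects (height, width, channels).
img = np.zeros((3, 32, 32))
img_for_plot = np.transpose(img, [1, 2, 0])
print(img_for_plot.shape)  # (32, 32, 3)
```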
@@ -917,7 +905,7 @@ class BuildingBlock(nn.Module):
return self.pool(self.activation(self.conv(x)))
```
Notice that we used the `padding = "same"` argument to
`nn.Conv2d()`, which ensures that the output channels have the
same dimension as the input channels. There are 32 channels in the first
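This shape-preserving behavior can be sketched with a toy convolution (not the lab's `BuildingBlock` class): with `padding="same"` and the default stride of 1, only the channel dimension changes.

```{python}
import torch
import torch.nn as nn

# With padding="same", a 3x3 convolution leaves the spatial dimensions
# unchanged; only the number of channels changes (here 3 -> 32).
conv = nn.Conv2d(in_channels=3, out_channels=32,
                 kernel_size=(3, 3), padding='same')
x = torch.randn(2, 3, 32, 32)   # a batch of two 3x32x32 images
print(conv(x).shape)  # torch.Size([2, 32, 32, 32])
```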
@@ -954,7 +942,7 @@ class CIFARModel(nn.Module):
return self.output(val)
```
We build the model and look at the summary. (We had created examples of `X_` earlier.)
```{python}
@@ -1012,7 +1000,7 @@ cifar_trainer.fit(cifar_module,
datamodule=cifar_dm)
```
This model can take 10 minutes or more to run and achieves about 42% accuracy on the test
data. Although this is not terrible for 100-class data (a random
classifier gets 1% accuracy), searching the web we see results around
@@ -1041,7 +1029,7 @@ cifar_trainer.test(cifar_module,
datamodule=cifar_dm)
```
### Hardware Acceleration
As deep learning has become ubiquitous in machine learning, hardware
@@ -1102,8 +1090,8 @@ imgs = torch.stack([torch.div(crop(resize(read_image(f))), 255)
imgs = normalize(imgs)
imgs.size()
```
We now set up the trained network with the weights we read in code block 6. The model has 50 layers, with a fair bit of complexity.
```{python}
@@ -1130,7 +1118,7 @@ We now feed our six images through the fitted network.
img_preds = resnet_model(imgs)
```
Let's look at the predicted probabilities for each of the top 3 choices. First we compute
the probabilities by applying the softmax to the logits in `img_preds`. Note that
we have had to call the `detach()` method on the tensor `img_preds` in order to convert
@@ -1141,7 +1129,7 @@ img_probs = np.exp(np.asarray(img_preds.detach()))
img_probs /= img_probs.sum(1)[:,None]
```
In order to see the class labels, we must download the index file associated with `imagenet`. (This is available from the book website and from [s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json](https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json).)
```{python}
@@ -1153,7 +1141,7 @@ class_labels = class_labels.set_index('idx')
class_labels = class_labels.sort_index()
```
We'll now construct a data frame for each image file
with the labels having the three highest probabilities as
estimated by the model above.
@@ -1167,8 +1155,8 @@ for i, imgfile in enumerate(imgfiles):
print(img_df.reset_index().drop(columns=['idx']))
```
We see that the model
is quite confident about `Flamingo.jpg`, but a little less so for the
other images.
@@ -1184,7 +1172,7 @@ del(cifar_test,
cifar_optimizer,
cifar_trainer)
```
## IMDB Document Classification
We now implement models for sentiment classification (Section 10.4) on the `IMDB`
@@ -1282,7 +1270,7 @@ summary(imdb_model,
'num_params'])
```
We'll again use
a smaller learning rate for these data,
hence we pass an `optimizer` to the
@@ -1301,7 +1289,7 @@ imdb_module = SimpleModule.binary_classification(
optimizer=imdb_optimizer)
```
Having loaded the datasets into a data module
and created a `SimpleModule`, the remaining steps
are familiar.
@@ -1315,14 +1303,14 @@ imdb_trainer = Trainer(deterministic=True,
imdb_trainer.fit(imdb_module,
datamodule=imdb_dm)
```
Evaluating the test error yields roughly 86% accuracy.
```{python}
test_results = imdb_trainer.test(imdb_module, datamodule=imdb_dm)
test_results
```
### Comparison to Lasso
We now fit a lasso logistic regression model
@@ -1375,7 +1363,7 @@ for l in lam_val:
intercepts.append(logit.intercept_)
```
The coefficients and intercepts have an extraneous dimension, which can be removed
by the `np.squeeze()` function.
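A minimal illustration (the `(1, p)` shape assumed here matches what `sklearn`'s binary `LogisticRegression` stores for its coefficients):

```{python}
import numpy as np

# For binary problems, LogisticRegression stores coef_ with shape (1, p);
# np.squeeze() drops the length-one leading axis.
coef = np.zeros((1, 5))
print(np.squeeze(coef).shape)  # (5,)
```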
@@ -1388,7 +1376,7 @@ We'll now make a plot to compare our neural network results with the
lasso.
```{python}
# %%capture
%%capture
fig, axes = subplots(1, 2, figsize=(16, 8), sharey=True)
for ((X_, Y_),
data_,
@@ -1444,7 +1432,7 @@ del(imdb_model,
imdb_train,
imdb_test)
```
## Recurrent Neural Networks
In this lab we fit the models illustrated in
@@ -1474,7 +1462,7 @@ imdb_seq_dm = SimpleDataModule(imdb_seq_train,
)
```
The first layer of the RNN is an embedding layer of size 32, which will be
learned during training. This layer one-hot encodes each document
as a matrix of dimension $500 \times 10,003$, and then maps these
@@ -1486,7 +1474,7 @@ matrix of size $10,003\times 32$; each of the 500 integers in the
document are then mapped to the appropriate 32 real numbers by
indexing the appropriate rows of this matrix.
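The equivalence between one-hot encoding followed by a matrix multiply and a simple row lookup can be sketched with toy sizes standing in for the $10,003\times 32$ embedding matrix:

```{python}
import numpy as np

# Toy embedding matrix: 10 tokens, 4-dimensional embeddings.
vocab_size, embed_dim = 10, 4
E = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

doc = np.array([3, 1, 3])            # a "document" of three token indices
one_hot = np.eye(vocab_size)[doc]    # one-hot matrix, shape (3, 10)

# One-hot matmul and direct row indexing give the same result.
assert np.array_equal(one_hot @ E, E[doc])
print(E[doc].shape)  # (3, 4)
```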
The second layer is an LSTM with 32 units, and the output
layer is a single logit for the binary classification task.
In the last line of the `forward()` method below,
@@ -1517,7 +1505,7 @@ summary(lstm_model,
'num_params'])
```
The 10,003 is suppressed in the summary, but we see it in the
parameter count, since $10,003\times 32=320,096$.
@@ -1542,7 +1530,7 @@ track the test performance as the network is fit, and see that it attains 85% ac
```{python}
lstm_trainer.test(lstm_module, datamodule=imdb_seq_dm)
```
We once again show the learning progress, followed by cleanup.
```{python}
@@ -1568,7 +1556,7 @@ del(lstm_model,
imdb_seq_test)
```
### Time Series Prediction
We now show how to fit the models in Section 10.5.2
@@ -1585,7 +1573,7 @@ X = pd.DataFrame(StandardScaler(
index=NYSE.index)
```
Next we set up the lagged versions of the data, dropping
any rows with missing values using the `dropna()` method.
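A minimal sketch of lag construction with `shift()` (toy data, not the `NYSE` frame; the column naming is an assumption): each lagged column moves the series down, so row $t$ holds the value from day $t-\ell$, and the early rows with missing lags are dropped.

```{python}
import pandas as pd

df = pd.DataFrame({'DJ_return': [0.1, -0.2, 0.3, 0.0, 0.5]})
for lag in range(1, 3):
    # Row t of the lagged column holds the value from row t - lag.
    df[f'DJ_return_{lag}'] = df['DJ_return'].shift(lag)
df = df.dropna()   # drop the first rows, whose lags are missing
print(df.shape)    # (3, 3)
```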
@@ -1599,7 +1587,7 @@ X.insert(len(X.columns), 'train', NYSE['train'])
X = X.dropna()
```
Finally, we extract the response and the training indicator, and drop the current day's `DJ_return` and
`log_volatility` so as to predict only from previous days' data.
@@ -1609,8 +1597,8 @@ X = X.drop(columns=['train'] + cols)
X.columns
```
We first fit a simple linear model and compute the $R^2$ on the test data using
the `score()` method.
@@ -1619,7 +1607,7 @@ M = LinearRegression()
M.fit(X[train], Y[train])
M.score(X[~train], Y[~train])
```
We refit this model, including the factor variable `day_of_week`.
For a categorical series in `pandas`, we can form the indicators
using the `get_dummies()` method.
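A toy illustration of `get_dummies()` (the level names here are hypothetical abbreviations, not necessarily those in `NYSE`): each level of the categorical series becomes its own indicator column.

```{python}
import pandas as pd

days = pd.Series(['mon', 'tues', 'wed', 'mon'], dtype='category')
indicators = pd.get_dummies(days)   # one 0/1 column per level
print(indicators.columns.tolist())  # ['mon', 'tues', 'wed']
```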
@@ -1692,7 +1680,7 @@ class NYSEModel(nn.Module):
return torch.flatten(val)
nyse_model = NYSEModel()
```
We fit the model in a similar fashion to previous networks. We
supply the `fit` function with test data as validation data, so that when
we monitor its progress and plot the history function we can see the
@@ -1711,7 +1699,7 @@ for mask in [train, ~train]:
nyse_train, nyse_test = datasets
```
Following our usual pattern, we inspect the summary.
```{python}
@@ -1742,7 +1730,7 @@ for idx, (x, y) in enumerate(nyse_dm.train_dataloader()):
break
```
We follow our previous example for setting up a trainer for a
regression problem, requesting the $R^2$ metric
to be computed at each epoch.
@@ -1755,7 +1743,7 @@ nyse_module = SimpleModule.regression(nyse_model,
metrics={'r2':R2Score()})
```
Fitting the model should by now be familiar.
The results on the test data are very similar to the linear AR model.
@@ -1769,7 +1757,7 @@ nyse_trainer.test(nyse_module,
datamodule=nyse_dm)
```
We could also fit a model without the `nn.RNN()` layer by just
using a `nn.Flatten()` layer instead. This would be a nonlinear AR model. If in addition we excluded the
hidden layer, this would be equivalent to our earlier linear AR model.
@@ -1789,7 +1777,7 @@ for mask in [train, ~train]:
datasets.append(TensorDataset(X_day_t, Y_t))
day_train, day_test = datasets
```
Creating a data module follows a familiar pattern.
```{python}
@@ -1800,7 +1788,7 @@ day_dm = SimpleDataModule(day_train,
batch_size=64)
```
We build a `NonLinearARModel()` that takes as input the 20 features and a hidden layer with 32 units. The remaining steps are familiar.
```{python}
@@ -1826,7 +1814,7 @@ nl_module = SimpleModule.regression(nl_model,
metrics={'r2':R2Score()})
```
We continue with the usual training steps, fit the model,
and evaluate the test error. We see the test $R^2$ is a slight improvement over the linear AR model that also includes `day_of_week`.

File diff suppressed because one or more lines are too long