trying with Rmd

Rmarkdown/Ch5-resample-lab.Rmd (new file, 564 lines)

---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: notebooks///ipynb,Rmarkdown///Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 5

# Lab: Cross-Validation and the Bootstrap
In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.

We again begin by placing most of our imports at this top level.

```{python}
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
```

There are several new imports needed for this lab.

```{python}
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm
```

## The Validation Set Approach
We explore the use of the validation set approach in order to estimate
the test error rates that result from fitting various linear models on
the `Auto` data set.

We use the function `train_test_split()` to split
the data into training and validation sets. As there are 392 observations,
we split into two equal sets of size 196 using the
argument `test_size=196`. It is generally a good idea to set a random seed
when performing operations like this that contain an
element of randomness, so that the results obtained can be reproduced
precisely at a later time. We set the random seed of the splitter
with the argument `random_state=0`.

```{python}
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=0)
```

Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.

```{python}
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()
```

We now use the `predict()` method of `results` evaluated on the model matrix for this model
created using the validation data set. We also calculate the validation MSE of our model.

```{python}
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)
```

Hence our estimate for the validation MSE of the linear regression
fit is $23.62$.

We can also estimate the validation error for
higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a list of model terms as well
as a training and test set, and returns the MSE on the test set.

```{python}
def evalMSE(terms,
            response,
            train,
            test):

    mm = MS(terms)
    X_train = mm.fit_transform(train)
    y_train = train[response]

    X_test = mm.transform(test)
    y_test = test[response]

    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)

    return np.mean((y_test - test_pred)**2)
```

Let’s use this function to estimate the validation MSE
using linear, quadratic and cubic fits. We use the `enumerate()` function
here, which gives both the values and indices of objects as one iterates
over a for loop.
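For instance, `enumerate(range(1, 4))` pairs each degree with a zero-based counter, which we use below to index `MSE`:

```{python}
for idx, degree in enumerate(range(1, 4)):
    print(idx, degree)  # 0 1, then 1 2, then 2 3
```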

```{python}
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

These error rates are $23.62$, $18.76$, and $18.80$, respectively. If we
choose a different training/validation split instead, then we
can expect somewhat different errors on the validation set.

```{python}
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $20.76$, $16.95$, and $16.97$, respectively.

These results are consistent with our previous findings: a model that
predicts `mpg` using a quadratic function of `horsepower`
performs better than a model that involves only a linear function of
`horsepower`, and there is no evidence of an improvement in using a cubic function of `horsepower`.

## Cross-Validation
In theory, the cross-validation estimate can be computed for any generalized
linear model.
In practice, however, the simplest way to cross-validate in
Python is to use `sklearn`, which has a different interface or API
than `statsmodels`, the library we have been using to fit GLMs.

This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don’t naturally speak to each other, this
requires the use of a *wrapper*.
In the `ISLP` package,
we provide
a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with
models fit by `statsmodels`.

The class `sklearn_sm()`
has as its first argument
a model from `statsmodels`. It can take two additional
optional arguments: `model_str` which can be
used to specify a formula, and `model_args` which should
be a dictionary of additional arguments used when fitting
the model. For example, to fit a logistic regression model
we have to specify a `family` argument. This
is passed as `model_args={'family':sm.families.Binomial()}`.
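As a sketch of that logistic case (the feature `balance` is an assumed placeholder, e.g. from the `Default` data; the wrapper is only constructed here, not fit):

```{python}
# Assumed illustration: a logistic regression wrapper. model_args passes
# the Binomial family through to sm.GLM whenever the model is fit.
logit_model = sklearn_sm(sm.GLM,
                         MS(['balance']),
                         model_args={'family': sm.families.Binomial()})
```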

Here is our wrapper in action:

```{python}
hp_model = sklearn_sm(sm.OLS,
                      MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
                            X,
                            Y,
                            cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err
```
The arguments to `cross_validate()` are as follows: an
object with the appropriate `fit()`, `predict()`,
and `score()` methods, an
array of features `X`, and a response `Y`.
We also included an additional argument `cv` to `cross_validate()`; specifying an integer
$K$ results in $K$-fold cross-validation. We have provided a value
corresponding to the total number of observations, which results in
leave-one-out cross-validation (LOOCV). The `cross_validate()` function produces a dictionary with several components;
we simply want the cross-validated test score here (MSE), which is estimated to be 24.23.

We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we again
use a for loop which iteratively fits polynomial
regressions of degree 1 to 5, computes the
associated cross-validation error, and stores it in the $i$th element
of the vector `cv_error`. The variable `d` in the for loop
corresponds to the degree of the polynomial. We begin by initializing the
vector. This command may take a couple of seconds to run.

```{python}
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```

As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.

Above we introduced the `outer()` method of the `np.power()`
function. The `outer()` method is applied to an operation
that has two arguments, such as `add()`, `min()`, or
`power()`.
It takes two arrays as
arguments, and then forms a larger
array where the operation is applied to each pair of elements of the
two arrays.

```{python}
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)
```

In the CV example above, we used $K=n$, but of course we can also use $K<n$. The code is very similar
to the above (and is significantly faster). Here we use `KFold()` to partition the data into $K=10$ random groups. We use `random_state` to set a random seed and initialize a vector `cv_error` in which we will store the CV errors corresponding to the
polynomial fits of degrees one to five.

```{python}
cv_error = np.zeros(5)
cv = KFold(n_splits=10,
           shuffle=True,
           random_state=0)  # use same splits for each degree
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=cv)
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
```

Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $K$-fold CV, due to the
availability of the formula (5.2) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
using a quadratic fit.
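
To make the parenthetical concrete: for a least squares fit, the LOOCV error can be obtained from a single fit to the full data. A minimal sketch, assuming (5.2) is the usual leverage form ${\rm CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, where $h_i$ is the leverage of the $i$th observation:

```{python}
# Sketch (not part of the lab's pipeline): LOOCV for the degree-one model
# from a single least squares fit, using the leverages h_i.
fit = sm.OLS(Y, MS(['horsepower']).fit_transform(Auto)).fit()
h = fit.get_influence().hat_matrix_diag
np.mean((fit.resid / (1 - h))**2)  # matches the LOOCV estimate of roughly 24.23 above
```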

The `cross_validate()` function is flexible and can take
different splitting mechanisms as an argument. For instance, one can use the `ShuffleSplit()` function to implement
the validation set approach just as easily as $K$-fold cross-validation.

```{python}
validation = ShuffleSplit(n_splits=1,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=validation)
results['test_score']
```

One can estimate the variability in the test error by running the following:

```{python}
validation = ShuffleSplit(n_splits=10,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=validation)
results['test_score'].mean(), results['test_score'].std()
```

Note that this standard deviation is not a valid estimate of the
sampling variability of the mean test score or the individual scores, since the randomly selected training
samples overlap and hence introduce correlations. But it does give an
idea of the Monte Carlo variation
incurred by picking different random folds.

## The Bootstrap
We illustrate the use of the bootstrap in the simple example
of Section 5.2, as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.

### Estimating the Accuracy of a Statistic of Interest
One of the great advantages of the bootstrap approach is that it can
be applied in almost all situations. No complicated mathematical
calculations are required. While there are several implementations
of the bootstrap in Python, its use for estimating
standard error is simple enough that we write our own function
below for the case when our data is stored
in a dataframe.

To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
in Section 5.2. The goal is to estimate the
sampling variance of the parameter $\alpha$ given in formula (5.7). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
vector `idx` indicating which observations should be used to
estimate
$\alpha$. The function then outputs the estimate for $\alpha$ based on
the selected observations.
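For reference, the minimum-variance formula gives
$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},$$
and `alpha_func()` below computes the plug-in estimate $\hat{\alpha}$ by replacing each variance and covariance with its sample version.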

```{python}
Portfolio = load_data('Portfolio')

def alpha_func(D, idx):
    cov_ = np.cov(D[['X','Y']].loc[idx], rowvar=False)
    return ((cov_[1,1] - cov_[0,1]) /
            (cov_[0,0] + cov_[1,1] - 2*cov_[0,1]))
```
This function returns an estimate for $\alpha$
based on applying the minimum
variance formula (5.7) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.

```{python}
alpha_func(Portfolio, range(100))
```

Next we randomly select
100 observations from `range(100)`, with replacement. This is equivalent
to constructing a new bootstrap data set and recomputing $\hat{\alpha}$
based on the new data set.

```{python}
rng = np.random.default_rng(0)
alpha_func(Portfolio,
           rng.choice(100,
                      100,
                      replace=True))
```

This process can be generalized to create a simple function `boot_SE()` for
computing the bootstrap standard error for arbitrary
functions that take only a data frame as an argument.

```{python}
def boot_SE(func,
            D,
            n=None,
            B=1000,
            seed=0):
    rng = np.random.default_rng(seed)
    first_, second_ = 0, 0
    n = n or D.shape[0]
    for _ in range(B):
        idx = rng.choice(D.index,
                         n,
                         replace=True)
        value = func(D, idx)
        first_ += value
        second_ += value**2
    return np.sqrt(second_ / B - (first_ / B)**2)
```
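The return value assembles, from the two running sums, the plug-in bootstrap standard error over the $B$ replicates,
$${\rm SE}_B(\hat{\theta}) = \sqrt{\frac{1}{B}\sum_{b=1}^{B} \hat{\theta}_b^{*2} - \Bigl(\frac{1}{B}\sum_{b=1}^{B} \hat{\theta}_b^{*}\Bigr)^{2}},$$
so the `B` bootstrap values never need to be stored all at once.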

Notice the use of `_` as a loop variable in `for _ in range(B)`. This is often used if the value of the counter is
unimportant, and simply makes sure the loop is executed `B` times.

Let’s use our function to evaluate the accuracy of our
estimate of $\alpha$ using $B=1{,}000$ bootstrap replications.

```{python}
alpha_SE = boot_SE(alpha_func,
                   Portfolio,
                   B=1000,
                   seed=0)
alpha_SE
```

The final output shows that the bootstrap estimate for ${\rm SE}(\hat{\alpha})$ is $0.0912$.

### Estimating the Accuracy of a Linear Regression Model
The bootstrap approach can be used to assess the variability of the
coefficient estimates and predictions from a statistical learning
method. Here we use the bootstrap approach in order to assess the
variability of the estimates for $\beta_0$ and $\beta_1$, the
intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.

To use our `boot_SE()` function, we must write a function (its
first argument)
that takes a data frame `D` and indices `idx`
as its only arguments. But here we want to bootstrap a specific
regression model, specified by a model formula and data. We show how
to do this in a few simple steps.

We start by writing a generic
function `boot_OLS()` for bootstrapping a regression model that takes a formula to
define the corresponding regression. We use the `clone()` function to
make a copy of the formula that can be refit to the new dataframe. This means
that any derived features, such as those defined by `poly()`
(which we will see shortly),
will be re-fit on the resampled data frame.

```{python}
def boot_OLS(model_matrix, response, D, idx):
    D_ = D.loc[idx]
    Y_ = D_[response]
    X_ = clone(model_matrix).fit_transform(D_)
    return sm.OLS(Y_, X_).fit().params
```
This is not quite what is needed as the first argument to
`boot_SE()`. The first two arguments which specify the model will not change in the
bootstrap process, and we would like to *freeze* them. The
function `partial()` from the `functools` module does precisely this: it takes a function
as an argument, and freezes some of its arguments, starting from the
left. We use it to freeze the first two model-formula arguments of `boot_OLS()`.
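For instance, freezing the first argument of `np.power()` leaves a one-argument function:

```{python}
power_of_two = partial(np.power, 2)  # freeze the base at 2
power_of_two(3)                      # np.power(2, 3) = 8
```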

```{python}
hp_func = partial(boot_OLS, MS(['horsepower']), 'mpg')
```

Typing `hp_func?` will show that it has two arguments `D`
and `idx` --- it is a version of `boot_OLS()` with the first
two arguments frozen --- and hence is ideal as the first argument for `boot_SE()`.

The `hp_func()` function can now be used in order to create
bootstrap estimates for the intercept and slope terms by randomly
sampling from among the observations with replacement. We first
demonstrate its utility on 10 bootstrap samples.

```{python}
rng = np.random.default_rng(0)
np.array([hp_func(Auto,
                  rng.choice(392,
                             392,
                             replace=True)) for _ in range(10)])
```
Next, we use the `boot_SE()` function to compute the standard
errors of 1,000 bootstrap estimates for the intercept and slope terms.

```{python}
hp_se = boot_SE(hp_func,
                Auto,
                B=1000,
                seed=10)
hp_se
```

This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
Section 3.1.2, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.models`.

```{python}
hp_model.fit(Auto, Auto['mpg'])
model_se = summarize(hp_model.results_)['std err']
model_se
```

The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
obtained using the formulas from Section 3.1.2 are
0.717 for the
intercept and
0.006 for the
slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
Equation 3.8 on page 80
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formulas for the standard errors do not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
in Figure 3.8 on page 106 that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
are fixed, and all the variability comes from the variation in the
errors $\epsilon_i$. The bootstrap approach does not rely on any of
these assumptions, and so it is likely giving a more accurate estimate
of the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ than
the results from `sm.OLS`.

Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
the data (Figure 3.8), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.

```{python}
quad_model = MS([poly('horsepower', 2, raw=True)])
quad_func = partial(boot_OLS,
                    quad_model,
                    'mpg')
boot_SE(quad_func, Auto, B=1000)
```

We compare the results to the standard errors computed using `sm.OLS()`.

```{python}
M = sm.OLS(Auto['mpg'],
           quad_model.fit_transform(Auto))
summarize(M.fit())['std err']
```