v2.2 versions of labs except Ch10

This commit is contained in:
Jonathan Taylor
2024-06-04 18:07:35 -07:00
parent e5bbb1a5bc
commit 29526fb7bc
25 changed files with 19373 additions and 10042 deletions


@@ -1,23 +1,15 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd,ipynb
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---
# Chapter 5
# Cross-Validation and the Bootstrap
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch05-resample-lab.ipynb)
# Lab: Cross-Validation and the Bootstrap
In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.
@@ -235,7 +227,7 @@ for i, d in enumerate(range(1,6)):
cv_error
```
-As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
+As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.
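The drop described above can be reproduced with a minimal sketch. This is not the lab's `cross_validate()` pipeline; it is a hand-rolled LOOCV over polynomial degrees on synthetic data (the quadratic trend and noise level are assumptions standing in for the `Auto` data, which is not loaded here):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for mpg ~ horsepower: a quadratic trend plus noise
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=100)

def loocv_mse(x, y, degree):
    """Leave-one-out CV estimate of test MSE for a degree-`degree` polynomial fit."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i          # drop observation i
        coefs = np.polyfit(x[mask], y[mask], degree)
        errors[i] = (y[i] - np.polyval(coefs, x[i]))**2
    return errors.mean()

cv_error = np.array([loocv_mse(x, y, d) for d in range(1, 6)])
```

As in the lab, `cv_error[0]` (linear) is far larger than `cv_error[1]` (quadratic), with little change thereafter.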
Above we introduced the `outer()` method of the `np.power()`
@@ -276,7 +268,7 @@ cv_error
Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $K$-fold CV, due to the
-availability of the formula (5.2) for LOOCV;
+availability of the formula~(\ref{Ch5:eq:LOOCVform}) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
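The LOOCV shortcut formula mentioned in the parenthetical — $CV_{(n)} = \frac{1}{n}\sum_i \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, with $h_i$ the leverage — can be checked directly. A sketch on synthetic data (the data-generating line is an assumption), verifying the shortcut against brute-force refitting:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# leverages h_i: diagonal of the hat matrix X (X^T X)^{-1} X^T
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# shortcut: one least-squares fit, then rescale residuals by (1 - h_i)
cv_shortcut = np.mean((resid / (1 - h))**2)

# brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta)**2)
cv_brute = np.mean(errs)
```

The two numbers agree exactly (up to floating point), which is why the shortcut makes LOOCV for least squares as cheap as a single fit.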
@@ -322,7 +314,7 @@ incurred by picking different random folds.
## The Bootstrap
We illustrate the use of the bootstrap in the simple example
-{of Section 5.2,} as well as on an example involving
+{of Section~\ref{Ch5:sec:bootstrap},} as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.
### Estimating the Accuracy of a Statistic of Interest
@@ -337,8 +329,8 @@ in a dataframe.
To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
-in Section 5.2. The goal is to estimate the
-sampling variance of the parameter $\alpha$ given in formula (5.7). We will
+in Section~\ref{Ch5:sec:bootstrap}. The goal is to estimate the
+sampling variance of the parameter $\alpha$ given in formula~(\ref{Ch5:min.var}). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
@@ -357,7 +349,7 @@ def alpha_func(D, idx):
```
This function returns an estimate for $\alpha$
based on applying the minimum
-variance formula (5.7) to the observations indexed by
+variance formula (\ref{Ch5:min.var}) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.
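A self-contained sketch of the same idea, using synthetic correlated returns in place of the `Portfolio` data (the mean vector, covariance matrix, and array-based `alpha_func` signature are assumptions; the lab's version takes a dataframe):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for Portfolio: 100 correlated returns X and Y
X, Y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=100).T

def alpha_func(X, Y, idx):
    """Minimum-variance allocation:
    alpha = (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2 cov(X,Y))."""
    cov = np.cov(X[idx], Y[idx])
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

# point estimate on all 100 observations
alpha_hat = alpha_func(X, Y, np.arange(100))

# bootstrap: resample indices with replacement, recompute alpha each time
boot = np.array([alpha_func(X, Y, rng.choice(100, 100, replace=True))
                 for _ in range(1000)])
se_alpha = boot.std(ddof=1)
```

The standard deviation of the bootstrap replicates, `se_alpha`, is the bootstrap estimate of the sampling variability of $\hat{\alpha}$.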
@@ -427,7 +419,7 @@ intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
-${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
+${\rm SE}(\hat{\beta}_1)$ described in Section~\ref{Ch3:secoefsec}.
To use our `boot_SE()` function, we must write a function (its
first argument)
@@ -474,7 +466,7 @@ demonstrate its utility on 10 bootstrap samples.
```{python}
rng = np.random.default_rng(0)
np.array([hp_func(Auto,
-                     rng.choice(392,
+                     rng.choice(Auto.index,
392,
replace=True)) for _ in range(10)])
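The bootstrap-SE computation for the two regression coefficients can be sketched without the `ISLP` package. Synthetic `hp`/`mpg` arrays stand in for the `Auto` columns, and `boot_coefs` is a hypothetical stand-in for the lab's `hp_func` (the trend and noise scale are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 392  # the Auto data set has 392 observations
# synthetic stand-in for horsepower and mpg with a roughly linear trend
hp = rng.uniform(50, 200, n)
mpg = 40 - 0.15 * hp + rng.normal(scale=4, size=n)

def boot_coefs(idx):
    """OLS intercept and slope fit to the bootstrap sample indexed by idx."""
    X = np.column_stack([np.ones(len(idx)), hp[idx]])
    return np.linalg.lstsq(X, mpg[idx], rcond=None)[0]

# 1000 bootstrap samples; column-wise SDs are the bootstrap SEs
boot = np.array([boot_coefs(rng.choice(n, n, replace=True))
                 for _ in range(1000)])
se = boot.std(axis=0, ddof=1)  # (SE(beta_0), SE(beta_1))
```

As in the lab, the intercept's SE is much larger than the slope's, since the intercept is an extrapolation far from the bulk of the `hp` values.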
@@ -496,7 +488,7 @@ This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
-Section 3.1.2, standard formulas can be used to compute
+Section~\ref{Ch3:secoefsec}, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.sm`.
@@ -510,7 +502,7 @@ model_se
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
-obtained using the formulas from Section 3.1.2 are
+obtained using the formulas from Section~\ref{Ch3:secoefsec} are
0.717 for the
intercept and
0.006 for the
@@ -518,13 +510,13 @@ slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
-{Equation 3.8 on page 82}
+{Equation~\ref{Ch3:se.eqn} on page~\pageref{Ch3:se.eqn}}
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formula for the standard errors does not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
-{in Figure 3.8 on page 108} that there is
+{in Figure~\ref{Ch3:polyplot} on page~\pageref{Ch3:polyplot}} that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
@@ -537,7 +529,7 @@ the results from `sm.OLS`.
Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
-the data (Figure 3.8), there is now a better
+the data (Figure~\ref{Ch3:polyplot}), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.
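This correspondence under a correctly specified model can be illustrated with a sketch: fit a quadratic model to synthetic data that truly is quadratic (the coefficients and noise level are assumptions), and compare the formula-based SEs $\sqrt{\hat{\sigma}^2 \, \mathrm{diag}\big((X^TX)^{-1}\big)}$ with the bootstrap SEs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)  # truly quadratic
X = np.column_stack([np.ones(n), x, x**2])

# standard OLS formula SEs: sqrt(diag(sigma_hat^2 (X^T X)^{-1}))
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (n - 3)                       # RSS / (n - p)
se_formula = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# bootstrap SEs for the same three coefficients
def fit(idx):
    return np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

boot = np.array([fit(rng.choice(n, n, replace=True)) for _ in range(1000)])
se_boot = boot.std(axis=0, ddof=1)
```

Because the quadratic model fits well, `se_boot` and `se_formula` line up closely, mirroring the lab's conclusion for the `Auto` data.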