v2.2 versions of labs except Ch10

This commit is contained in:
Jonathan Taylor
2024-06-04 18:07:35 -07:00
parent e5bbb1a5bc
commit 29526fb7bc
25 changed files with 19373 additions and 10042 deletions


@@ -1,23 +1,15 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd,ipynb
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.14.7
---
# Chapter 5
# Cross-Validation and the Bootstrap
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch05-resample-lab.ipynb)
# Lab: Cross-Validation and the Bootstrap
In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.
@@ -235,7 +227,7 @@ for i, d in enumerate(range(1,6)):
cv_error
```
-As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
+As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.
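The drop described above can be reproduced with a minimal sketch. This is not the lab's `cross_validate()` pipeline; it is a hand-rolled LOOCV over polynomial degrees on synthetic data (the quadratic trend and noise level are assumptions standing in for the `Auto` data, which is not loaded here):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for mpg ~ horsepower: a quadratic trend plus noise
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=100)

def loocv_mse(x, y, degree):
    """Leave-one-out CV estimate of test MSE for a degree-`degree` polynomial fit."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i          # drop observation i
        coefs = np.polyfit(x[mask], y[mask], degree)
        errors[i] = (y[i] - np.polyval(coefs, x[i]))**2
    return errors.mean()

cv_error = np.array([loocv_mse(x, y, d) for d in range(1, 6)])
```

As in the lab, `cv_error[0]` (linear) is far larger than `cv_error[1]` (quadratic), with little change thereafter.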
Above we introduced the `outer()` method of the `np.power()`
@@ -276,7 +268,7 @@ cv_error
Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $K$-fold CV, due to the
-availability of the formula (5.2) for LOOCV;
+availability of the formula~(\ref{Ch5:eq:LOOCVform}) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
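The LOOCV shortcut formula mentioned in the parenthetical — $CV_{(n)} = \frac{1}{n}\sum_i \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, with $h_i$ the leverage — can be checked directly. A sketch on synthetic data (the data-generating line is an assumption), verifying the shortcut against brute-force refitting:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# leverages h_i: diagonal of the hat matrix X (X^T X)^{-1} X^T
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# shortcut: one least-squares fit, then rescale residuals by (1 - h_i)
cv_shortcut = np.mean((resid / (1 - h))**2)

# brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta)**2)
cv_brute = np.mean(errs)
```

The two numbers agree exactly (up to floating point), which is why the shortcut makes LOOCV for least squares as cheap as a single fit.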
@@ -322,7 +314,7 @@ incurred by picking different random folds.
## The Bootstrap
We illustrate the use of the bootstrap in the simple example
-{of Section 5.2,} as well as on an example involving
+{of Section~\ref{Ch5:sec:bootstrap},} as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.
### Estimating the Accuracy of a Statistic of Interest
@@ -337,8 +329,8 @@ in a dataframe.
To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
-in Section 5.2. The goal is to estimate the
-sampling variance of the parameter $\alpha$ given in formula (5.7). We will
+in Section~\ref{Ch5:sec:bootstrap}. The goal is to estimate the
+sampling variance of the parameter $\alpha$ given in formula~(\ref{Ch5:min.var}). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
@@ -357,7 +349,7 @@ def alpha_func(D, idx):
```
This function returns an estimate for $\alpha$
based on applying the minimum
-variance formula (5.7) to the observations indexed by
+variance formula (\ref{Ch5:min.var}) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.
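A self-contained sketch of the same idea, using synthetic correlated returns in place of the `Portfolio` data (the mean vector, covariance matrix, and array-based `alpha_func` signature are assumptions; the lab's version takes a dataframe):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for Portfolio: 100 correlated returns X and Y
X, Y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=100).T

def alpha_func(X, Y, idx):
    """Minimum-variance allocation:
    alpha = (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2 cov(X,Y))."""
    cov = np.cov(X[idx], Y[idx])
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

# point estimate on all 100 observations
alpha_hat = alpha_func(X, Y, np.arange(100))

# bootstrap: resample indices with replacement, recompute alpha each time
boot = np.array([alpha_func(X, Y, rng.choice(100, 100, replace=True))
                 for _ in range(1000)])
se_alpha = boot.std(ddof=1)
```

The standard deviation of the bootstrap replicates, `se_alpha`, is the bootstrap estimate of the sampling variability of $\hat{\alpha}$.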
@@ -427,7 +419,7 @@ intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
-${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
+${\rm SE}(\hat{\beta}_1)$ described in Section~\ref{Ch3:secoefsec}.
To use our `boot_SE()` function, we must write a function (its
first argument)
@@ -474,7 +466,7 @@ demonstrate its utility on 10 bootstrap samples.
```{python}
rng = np.random.default_rng(0)
np.array([hp_func(Auto,
-                     rng.choice(392,
+                     rng.choice(Auto.index,
392,
replace=True)) for _ in range(10)])
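The bootstrap-SE computation for the two regression coefficients can be sketched without the `ISLP` package. Synthetic `hp`/`mpg` arrays stand in for the `Auto` columns, and `boot_coefs` is a hypothetical stand-in for the lab's `hp_func` (the trend and noise scale are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 392  # the Auto data set has 392 observations
# synthetic stand-in for horsepower and mpg with a roughly linear trend
hp = rng.uniform(50, 200, n)
mpg = 40 - 0.15 * hp + rng.normal(scale=4, size=n)

def boot_coefs(idx):
    """OLS intercept and slope fit to the bootstrap sample indexed by idx."""
    X = np.column_stack([np.ones(len(idx)), hp[idx]])
    return np.linalg.lstsq(X, mpg[idx], rcond=None)[0]

# 1000 bootstrap samples; column-wise SDs are the bootstrap SEs
boot = np.array([boot_coefs(rng.choice(n, n, replace=True))
                 for _ in range(1000)])
se = boot.std(axis=0, ddof=1)  # (SE(beta_0), SE(beta_1))
```

As in the lab, the intercept's SE is much larger than the slope's, since the intercept is an extrapolation far from the bulk of the `hp` values.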
@@ -496,7 +488,7 @@ This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
-Section 3.1.2, standard formulas can be used to compute
+Section~\ref{Ch3:secoefsec}, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.sm`.
@@ -510,7 +502,7 @@ model_se
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
-obtained using the formulas from Section 3.1.2 are
+obtained using the formulas from Section~\ref{Ch3:secoefsec} are
0.717 for the
intercept and
0.006 for the
@@ -518,13 +510,13 @@ slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
-{Equation 3.8 on page 82}
+{Equation~\ref{Ch3:se.eqn} on page~\pageref{Ch3:se.eqn}}
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formula for the standard errors does not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
-{in Figure 3.8 on page 108} that there is
+{in Figure~\ref{Ch3:polyplot} on page~\pageref{Ch3:polyplot}} that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
@@ -537,7 +529,7 @@ the results from `sm.OLS`.
Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
-the data (Figure 3.8), there is now a better
+the data (Figure~\ref{Ch3:polyplot}), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.
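This correspondence under a correctly specified model can be illustrated with a sketch: fit a quadratic model to synthetic data that truly is quadratic (the coefficients and noise level are assumptions), and compare the formula-based SEs $\sqrt{\hat{\sigma}^2 \, \mathrm{diag}\big((X^TX)^{-1}\big)}$ with the bootstrap SEs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)  # truly quadratic
X = np.column_stack([np.ones(n), x, x**2])

# standard OLS formula SEs: sqrt(diag(sigma_hat^2 (X^T X)^{-1}))
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (n - 3)                       # RSS / (n - p)
se_formula = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# bootstrap SEs for the same three coefficients
def fit(idx):
    return np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

boot = np.array([fit(rng.choice(n, n, replace=True)) for _ in range(1000)])
se_boot = boot.std(axis=0, ddof=1)
```

Because the quadratic model fits well, `se_boot` and `se_formula` line up closely, mirroring the lab's conclusion for the `Auto` data.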