v2.2 versions of labs except Ch10
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd,ipynb
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---
# Chapter 5

# Cross-Validation and the Bootstrap

<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch05-resample-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch05-resample-lab.ipynb)

# Lab: Cross-Validation and the Bootstrap
In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.
```{python}
cv_error
```

As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.

Above we introduced the `outer()` method of the `np.power()` function.
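As a reminder of that pattern: `np.power` is a NumPy ufunc, and every binary ufunc has an `outer()` method that applies the operation to all pairs of elements from two arrays. A small standalone sketch (the array names here are our own, not from the lab):

```{python}
import numpy as np

H = np.array([3., 5., 9.])
# Raise every entry of H to each power 1..3: the result is 3x3,
# where row i holds H[i]**1, H[i]**2, H[i]**3.
P = np.power.outer(H, np.arange(1, 4))
P
```

The lab uses the same idiom to build a polynomial feature matrix in one call rather than in a loop.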
||||
Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $K$-fold CV, due to the
availability of the formula (5.2) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
using a quadratic fit.
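The shortcut in formula (5.2) computes LOOCV from a single least squares fit, using the leverage values $h_i$ (the diagonal of the hat matrix): $CV_{(n)} = \frac{1}{n}\sum_i \big(\frac{y_i - \hat y_i}{1 - h_i}\big)^2$. A minimal NumPy sketch on synthetic data (all names our own) confirming it agrees with brute-force LOOCV:

```{python}
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2 - 3 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# One fit; leverages are the diagonal of H = X (X'X)^{-1} X'
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))

# Formula (5.2): mean of the scaled squared residuals
cv_shortcut = np.mean((resid / (1 - h))**2)

# Brute-force LOOCV: refit n times, leaving one observation out each time
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b)**2)
cv_loo = np.mean(errs)  # agrees with cv_shortcut (the identity is exact)
```

This is why LOOCV for least squares can be as cheap as one fit, whereas the generic `cross_validate()` refits the model once per fold.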
## The Bootstrap
We illustrate the use of the bootstrap in the simple example
of Section 5.2, as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.

### Estimating the Accuracy of a Statistic of Interest
To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
in Section 5.2. The goal is to estimate the
sampling variance of the parameter $\alpha$ given in formula (5.7). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
vector `idx` indicating which observations should be used to estimate $\alpha$.
```{python}
def alpha_func(D, idx):
   cov_ = np.cov(D[['X','Y']].loc[idx], rowvar=False)
   return ((cov_[1,1] - cov_[0,1]) /
           (cov_[0,0] + cov_[1,1] - 2*cov_[0,1]))
```

This function returns an estimate for $\alpha$
based on applying the minimum
variance formula (5.7) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.

```{python}
alpha_func(Portfolio, range(100))
```
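The lab's `boot_SE()` helper is not shown in this excerpt. The idea it implements can be sketched without the `ISLP` package: repeatedly resample observation indices with replacement, recompute $\alpha$ on each bootstrap sample, and take the standard deviation of the replicates. A self-contained version on synthetic returns (the data and names here are our own, standing in for `Portfolio`):

```{python}
import numpy as np

def alpha_from_xy(x, y):
    # Minimum-variance weight (5.7):
    # (sigma_Y^2 - sigma_XY) / (sigma_X^2 + sigma_Y^2 - 2 sigma_XY)
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)  # correlated synthetic returns

# Bootstrap: resample row indices with replacement, recompute alpha each time
boot = []
for _ in range(1000):
    idx = rng.choice(100, 100, replace=True)
    boot.append(alpha_from_xy(x[idx], y[idx]))
se_alpha = np.std(boot, ddof=1)  # bootstrap estimate of SE(alpha-hat)
```

The bootstrap standard error is simply the sample standard deviation of the replicated statistic, which is exactly what `boot_SE()` returns in the lab.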
intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.

To use our `boot_SE()` function, we must write a function (its
first argument)
demonstrate its utility on 10 bootstrap samples.

```{python}
rng = np.random.default_rng(0)
np.array([hp_func(Auto,
                  rng.choice(Auto.index,
                             392,
                             replace=True)) for _ in range(10)])
```
This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
Section 3.1.2, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.sm`.
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
obtained using the formulas from Section 3.1.2 are
0.717 for the
intercept and
0.006 for the
slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
Equation 3.8 on page 82
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formulas for the standard errors do not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
in Figure 3.8 on page 108 that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
the data (Figure 3.8), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.
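The contrast between the two kinds of standard errors can be reproduced without the `ISLP` helpers. In this sketch (synthetic data and names our own) the linear model is correct by construction, so the analytic formulas and the bootstrap should roughly agree:

```{python}
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # linear model holds
X = np.column_stack([np.ones(n), x])

# Analytic SEs: sqrt of the diagonal of sigma^2 (X'X)^{-1},
# with sigma^2 estimated from the RSS
beta = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta)**2)
sigma2 = rss / (n - 2)
se_formula = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# Bootstrap SEs: refit on resampled rows, take SD of the coefficients
B = 1000
boot = np.empty((B, 2))
for b in range(B):
    idx = rng.choice(n, n, replace=True)
    boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
se_boot = boot.std(axis=0, ddof=1)
```

When the model is misspecified, as with the linear fit to `Auto`, the bootstrap side of this comparison remains valid while the analytic side inherits the inflated $\hat{\sigma}^2$, which is the point made above.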