Fix refs again (#76)

* Ch2->Ch02

* fixed latex refs again, somehow crept back in

* fixed the page refs, formats synced

* unsynced

* executed all notebooks except Ch10

* warnings for lasso

* allow saving of output in notebooks

* Ch10 executed
This commit is contained in:
Jonathan Taylor
2026-02-04 17:40:52 -08:00
committed by GitHub
parent 3d9af7c4b0
commit 6bf6160a3d
25 changed files with 21872 additions and 3191 deletions


@@ -36,8 +36,8 @@ Please see the `Python` resources page on the book website [statlearning.com](ht
You will need to install the `ISLP` package, which provides access to the datasets and custom-built functions that we provide.
Inside a macOS or Linux terminal type `pip install ISLP`; this also installs most other packages needed in the labs. The `Python` resources page has a link to the `ISLP` documentation website.
To run this lab, download the file `Ch2-statlearn-lab.ipynb` from the `Python` resources page.
Now run the following code at the command line: `jupyter lab Ch2-statlearn-lab.ipynb`.
To run this lab, download the file `Ch02-statlearn-lab.ipynb` from the `Python` resources page.
Now run the following code at the command line: `jupyter lab Ch02-statlearn-lab.ipynb`.
If you're using Windows, you can use the `start menu` to access `anaconda`, and follow the links. For example, to install `ISLP` and run this lab, you can run the same code above in an `anaconda` shell.
@@ -72,7 +72,7 @@ print('fit a model with', 11, 'variables')
The following command will provide information about the `print()` function.
```{python}
print?
# print?
```
@@ -222,7 +222,7 @@ documentation associated with the function `fun`, if it exists.
We can try this for `np.array()`.
```{python}
np.array?
# np.array?
```
This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`.
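As a quick illustration of the `dtype` argument (a minimal sketch, not part of the lab itself):

```{python}
import numpy as np

# dtype=float converts the integer inputs into a floating point array
x = np.array([3, 1, 4], dtype=float)
print(x.dtype)  # float64
```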

File diff suppressed because one or more lines are too long


@@ -332,7 +332,7 @@ As mentioned above, there is an existing function to add a line to a plot --- `a
Next we examine some diagnostic plots, several of which were discussed
in Section~\ref{Ch3:problems.sec}.
in Section 3.3.3.
We can find the fitted values and residuals
of the fit as attributes of the `results` object.
Various influence measures describing the regression model
@@ -429,7 +429,7 @@ We can access the individual components of `results` by name
and
`np.sqrt(results.scale)` gives us the RSE.
Variance inflation factors (section~\ref{Ch3:problems.sec}) are sometimes useful
Variance inflation factors (section 3.3.3) are sometimes useful
to assess the effect of collinearity in the model matrix of a regression model.
We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.
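As a rough sketch of what such a computation looks like (numpy-only, on simulated data rather than the lab's design matrix; the helper `vif` here is hypothetical, not the `ISLP` or `statsmodels` function):

```{python}
import numpy as np

def vif(X, j):
    # VIF of column j: regress X[:, j] on the remaining columns
    # and return 1 / (1 - R^2)
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

# Simulated design matrix: intercept plus two correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.5 * rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x2])

# List comprehension: one VIF per non-intercept column
vifs = [vif(X, j) for j in range(1, X.shape[1])]
print(vifs)
```

Because `x2` is built from `x1`, both VIFs come out well above 1, signalling collinearity.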

File diff suppressed because one or more lines are too long


@@ -394,7 +394,7 @@ lda.fit(X_train, L_train)
```
Here we have used the list comprehensions introduced
in Section~\ref{Ch3-linreg-lab:multivariate-goodness-of-fit}. Looking at our first line above, we see that the right-hand side is a list
in Section 3.6.4. Looking at our first line above, we see that the right-hand side is a list
of length two. This is because the code `for M in [X_train, X_test]` iterates over a list
of length two. While here we loop over a list,
the list comprehension method works when looping over any iterable object.
@@ -443,7 +443,7 @@ lda.scalings_
```
These values provide the linear combination of `Lag1` and `Lag2` that is used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (\ref{Ch4:bayes.multi}).
These values provide the linear combination of `Lag1` and `Lag2` that is used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (4.24).
If $-0.64\times `Lag1` - 0.51 \times `Lag2` $ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
```{python}
@@ -452,7 +452,7 @@ lda_pred = lda.predict(X_test)
```
As we observed in our comparison of classification methods
(Section~\ref{Ch4:comparison.sec}), the LDA and logistic
(Section 4.5), the LDA and logistic
regression predictions are almost identical.
```{python}
@@ -511,7 +511,7 @@ The LDA classifier above is the first classifier from the
`sklearn` library. We will use several other objects
from this library. The objects
follow a common structure that simplifies tasks such as cross-validation,
which we will see in Chapter~\ref{Ch5:resample}. Specifically,
which we will see in Chapter 5. Specifically,
the methods first create a generic classifier without
referring to any data. This classifier is then fit
to data with the `fit()` method and predictions are
@@ -797,7 +797,7 @@ feature_std.std()
```
Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page~\pageref{Ch4-varformula}.
Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 183.
In this case it does not matter, as long as the variables are all on the same scale.
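The two conventions are easy to see directly in `numpy` (an aside, not part of the lab):

```{python}
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# np.std defaults to ddof=0, the 1/n convention used by StandardScaler;
# ddof=1 gives the 1/(n-1) convention used by the std() method in pandas
print(x.std(ddof=0), x.std(ddof=1))
```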
Using the function `train_test_split()` we now split the observations into a test set,
@@ -864,7 +864,7 @@ This is double the rate that one would obtain from random guessing.
The number of neighbors in KNN is referred to as a *tuning parameter*, also known as a *hyperparameter*.
We do not know *a priori* what value to use. It is therefore of interest
to see how the classifier performs on test data as we vary these
parameters. This can be achieved with a `for` loop, described in Section~\ref{Ch2-statlearn-lab:for-loops}.
parameters. This can be achieved with a `for` loop, described in Section 2.3.8.
Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase
insurance as we vary the number of neighbors from 1 to 5:
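The shape of that tuning loop can be sketched with a self-contained numpy KNN on simulated data (not the lab's `sklearn` classifier or the `Caravan` data):

```{python}
import numpy as np

def knn_predict(X_train, y_train, X_test, K):
    # Majority vote among the K nearest training points (Euclidean distance)
    preds = []
    for x in X_test:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = y_train[np.argsort(dist)[:K]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Two simulated classes centered at (0, 0) and (2, 2)
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [2, 2]], 100, axis=0)
y_train = np.repeat([0, 1], 100)
X_test = rng.normal(size=(50, 2)) + np.repeat([[0, 0], [2, 2]], 25, axis=0)
y_test = np.repeat([0, 1], 25)

# The tuning loop: refit for each K and record test accuracy
accs = {}
for K in range(1, 6):
    accs[K] = (knn_predict(X_train, y_train, X_test, K) == y_test).mean()
print(accs)
```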
@@ -891,7 +891,7 @@ As a comparison, we can also fit a logistic regression model to the
data. This can also be done
with `sklearn`, though by default it fits
something like the *ridge regression* version
of logistic regression, which we introduce in Chapter~\ref{Ch6:varselect}. This can
of logistic regression, which we introduce in Chapter 6. This can
be modified by appropriately setting the argument `C` below. Its default
value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)
logistic regression estimator discussed above.
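A minimal sketch of the effect of `C`, on simulated rather than real data (variable names are illustrative):

```{python}
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# C=1 (the default) applies a ridge-style penalty; a very large C
# effectively removes it, approaching the unregularized fit
ridge_like = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
unpenalized = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
print(ridge_like.coef_, unpenalized.coef_)
```

With the penalty effectively removed, the first coefficient is no smaller than its shrunken counterpart.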
@@ -935,7 +935,7 @@ confusion_table(logit_labels, y_test)
```
## Linear and Poisson Regression on the Bikeshare Data
Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section~\ref{Ch4:sec:pois}.
Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section 4.6.
The response `bikers` measures the number of bike rentals per hour
in Washington, DC in the period 2010--2012.
@@ -976,7 +976,7 @@ variables constant, there are on average about 7 more riders in
February than in January. Similarly there are about 16.5 more riders
in March than in January.
The results seen in Section~\ref{sec:bikeshare.linear}
The results seen in Section 4.6.1
used a slightly different coding of the variables `hr` and `mnth`, as follows:
```{python}
@@ -1030,7 +1030,7 @@ np.allclose(M_lm.fittedvalues, M2_lm.fittedvalues)
```
To reproduce the left-hand side of Figure~\ref{Ch4:bikeshare}
To reproduce the left-hand side of Figure 4.13
we must first obtain the coefficient estimates associated with
`mnth`. The coefficients for January through November can be obtained
directly from the `M2_lm` object. The coefficient for December
@@ -1070,7 +1070,7 @@ ax_month.set_ylabel('Coefficient', fontsize=20);
```
Reproducing the right-hand plot in Figure~\ref{Ch4:bikeshare} follows a similar process.
Reproducing the right-hand plot in Figure 4.13 follows a similar process.
```{python}
coef_hr = S2[S2.index.str.contains('hr')]['coef']
@@ -1105,7 +1105,7 @@ M_pois = sm.GLM(Y, X2, family=sm.families.Poisson()).fit()
```
We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure~\ref{Ch4:bikeshare.pois}. We first complete these coefficients as before.
We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure 4.15. We first complete these coefficients as before.
```{python}
S_pois = summarize(M_pois)

File diff suppressed because one or more lines are too long


@@ -226,7 +226,7 @@ for i, d in enumerate(range(1,6)):
cv_error
```
As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and
As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.
Above we introduced the `outer()` method of the `np.power()`
@@ -267,7 +267,7 @@ cv_error
Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $k$-fold CV, due to the
availability of the formula~(\ref{Ch5:eq:LOOCVform}) for LOOCV;
availability of the formula~(5.2) for LOOCV;
however, the generic `cross_validate()` function does not make
use of this formula.) We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
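The shortcut formula itself is easy to verify numerically: for least squares, the LOOCV error equals the single-fit residuals inflated by leverage, so one fit suffices (a self-contained check on simulated data):

```{python}
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Shortcut: one fit, residuals divided by (1 - leverage)
H = X @ np.linalg.inv(X.T @ X) @ X.T
resid = y - H @ y
loocv_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Explicit LOOCV: n separate fits, each leaving one observation out
errs = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ beta) ** 2)
loocv_explicit = np.mean(errs)

print(loocv_shortcut, loocv_explicit)  # identical up to rounding
```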
@@ -314,7 +314,7 @@ incurred by picking different random folds.
## The Bootstrap
We illustrate the use of the bootstrap in the simple example
{of Section~\ref{Ch5:sec:bootstrap},} as well as on an example involving
{of Section 5.2,} as well as on an example involving
estimating the accuracy of the linear regression model on the `Auto`
data set.
### Estimating the Accuracy of a Statistic of Interest
@@ -329,8 +329,8 @@ in a dataframe.
To illustrate the bootstrap, we
start with a simple example.
The `Portfolio` data set in the `ISLP` package is described
in Section~\ref{Ch5:sec:bootstrap}. The goal is to estimate the
sampling variance of the parameter $\alpha$ given in formula~(\ref{Ch5:min.var}). We will
in Section 5.2. The goal is to estimate the
sampling variance of the parameter $\alpha$ given in formula~(5.7). We will
create a function
`alpha_func()`, which takes as input a dataframe `D` assumed
to have columns `X` and `Y`, as well as a
@@ -349,7 +349,7 @@ def alpha_func(D, idx):
```
This function returns an estimate for $\alpha$
based on applying the minimum
variance formula (\ref{Ch5:min.var}) to the observations indexed by
variance formula (5.7) to the observations indexed by
the argument `idx`. For instance, the following command
estimates $\alpha$ using all 100 observations.
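A numpy-only sketch of the same estimator (a dict of arrays stands in for the lab's dataframe, and the data are simulated):

```{python}
import numpy as np

def alpha_func(D, idx):
    # Minimum-variance allocation:
    # (var_Y - cov_XY) / (var_X + var_Y - 2 cov_XY)
    cov = np.cov(D['X'][idx], D['Y'][idx])
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

rng = np.random.default_rng(0)
D = {'X': rng.normal(size=100), 'Y': rng.normal(size=100)}
alpha_hat = alpha_func(D, np.arange(100))  # use all 100 observations
print(alpha_hat)
```

With independent, equal-variance returns the estimate lands near one half, as the formula suggests.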
@@ -419,7 +419,7 @@ intercept and slope terms for the linear regression model that uses
`horsepower` to predict `mpg` in the `Auto` data set. We
will compare the estimates obtained using the bootstrap to those
obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
${\rm SE}(\hat{\beta}_1)$ described in Section~\ref{Ch3:secoefsec}.
${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
To use our `boot_SE()` function, we must write a function (its
first argument)
@@ -488,7 +488,7 @@ This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
0.85, and that the bootstrap
estimate for ${\rm SE}(\hat{\beta}_1)$ is
0.0074. As discussed in
Section~\ref{Ch3:secoefsec}, standard formulas can be used to compute
Section 3.1.2, standard formulas can be used to compute
the standard errors for the regression coefficients in a linear
model. These can be obtained using the `summarize()` function
from `ISLP.sm`.
@@ -502,7 +502,7 @@ model_se
The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
obtained using the formulas from Section~\ref{Ch3:secoefsec} are
obtained using the formulas from Section 3.1.2 are
0.717 for the
intercept and
0.006 for the
@@ -510,13 +510,13 @@ slope. Interestingly, these are somewhat different from the estimates
obtained using the bootstrap. Does this indicate a problem with the
bootstrap? In fact, it suggests the opposite. Recall that the
standard formulas given in
{Equation~\ref{Ch3:se.eqn} on page~\pageref{Ch3:se.eqn}}
Equation 3.8 on page 75
rely on certain assumptions. For example,
they depend on the unknown parameter $\sigma^2$, the noise
variance. We then estimate $\sigma^2$ using the RSS. Now although the
formulas for the standard errors do not rely on the linear model being
correct, the estimate for $\sigma^2$ does. We see
{in Figure~\ref{Ch3:polyplot} on page~\pageref{Ch3:polyplot}} that there is
in Figure 3.8 on page 99 that there is
a non-linear relationship in the data, and so the residuals from a
linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$
@@ -529,7 +529,7 @@ the results from `sm.OLS`.
Below we compute the bootstrap standard error estimates and the
standard linear regression estimates that result from fitting the
quadratic model to the data. Since this model provides a good fit to
the data (Figure~\ref{Ch3:polyplot}), there is now a better
the data (Figure 3.8), there is now a better
correspondence between the bootstrap estimates and the standard
estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
${\rm SE}(\hat{\beta}_2)$.


@@ -5,6 +5,7 @@
"id": "6dde3cef",
"metadata": {},
"source": [
"\n",
"# Cross-Validation and the Bootstrap\n",
"\n",
"<a target=\"_blank\" href=\"https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch05-resample-lab.ipynb\">\n",
@@ -32,10 +33,10 @@
"id": "f1deb5cc",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:38.213402Z",
"iopub.status.busy": "2026-02-02T23:39:38.212455Z",
"iopub.status.idle": "2026-02-02T23:39:39.121524Z",
"shell.execute_reply": "2026-02-02T23:39:39.121125Z"
"iopub.execute_input": "2026-02-05T00:47:05.730449Z",
"iopub.status.busy": "2026-02-05T00:47:05.730329Z",
"iopub.status.idle": "2026-02-05T00:47:06.513542Z",
"shell.execute_reply": "2026-02-05T00:47:06.513033Z"
},
"lines_to_next_cell": 2
},
@@ -64,10 +65,10 @@
"id": "268c41b3",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.123447Z",
"iopub.status.busy": "2026-02-02T23:39:39.123256Z",
"iopub.status.idle": "2026-02-02T23:39:39.125378Z",
"shell.execute_reply": "2026-02-02T23:39:39.125152Z"
"iopub.execute_input": "2026-02-05T00:47:06.515215Z",
"iopub.status.busy": "2026-02-05T00:47:06.515043Z",
"iopub.status.idle": "2026-02-05T00:47:06.517293Z",
"shell.execute_reply": "2026-02-05T00:47:06.516837Z"
},
"lines_to_next_cell": 2
},
@@ -108,10 +109,10 @@
"id": "22f44ae0",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.126552Z",
"iopub.status.busy": "2026-02-02T23:39:39.126448Z",
"iopub.status.idle": "2026-02-02T23:39:39.136455Z",
"shell.execute_reply": "2026-02-02T23:39:39.135968Z"
"iopub.execute_input": "2026-02-05T00:47:06.518601Z",
"iopub.status.busy": "2026-02-05T00:47:06.518485Z",
"iopub.status.idle": "2026-02-05T00:47:06.524299Z",
"shell.execute_reply": "2026-02-05T00:47:06.523724Z"
}
},
"outputs": [],
@@ -136,10 +137,10 @@
"id": "0c32e917",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.138496Z",
"iopub.status.busy": "2026-02-02T23:39:39.138360Z",
"iopub.status.idle": "2026-02-02T23:39:39.144210Z",
"shell.execute_reply": "2026-02-02T23:39:39.143911Z"
"iopub.execute_input": "2026-02-05T00:47:06.525648Z",
"iopub.status.busy": "2026-02-05T00:47:06.525549Z",
"iopub.status.idle": "2026-02-05T00:47:06.531033Z",
"shell.execute_reply": "2026-02-05T00:47:06.530633Z"
}
},
"outputs": [],
@@ -166,17 +167,17 @@
"id": "86ce4f85",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.145874Z",
"iopub.status.busy": "2026-02-02T23:39:39.145769Z",
"iopub.status.idle": "2026-02-02T23:39:39.150170Z",
"shell.execute_reply": "2026-02-02T23:39:39.149908Z"
"iopub.execute_input": "2026-02-05T00:47:06.532517Z",
"iopub.status.busy": "2026-02-05T00:47:06.532422Z",
"iopub.status.idle": "2026-02-05T00:47:06.537369Z",
"shell.execute_reply": "2026-02-05T00:47:06.536814Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"23.61661706966988"
"np.float64(23.616617069669893)"
]
},
"execution_count": 5,
@@ -210,10 +211,10 @@
"id": "50a66a97",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.151449Z",
"iopub.status.busy": "2026-02-02T23:39:39.151374Z",
"iopub.status.idle": "2026-02-02T23:39:39.153368Z",
"shell.execute_reply": "2026-02-02T23:39:39.153165Z"
"iopub.execute_input": "2026-02-05T00:47:06.538690Z",
"iopub.status.busy": "2026-02-05T00:47:06.538577Z",
"iopub.status.idle": "2026-02-05T00:47:06.540965Z",
"shell.execute_reply": "2026-02-05T00:47:06.540491Z"
}
},
"outputs": [],
@@ -253,10 +254,10 @@
"id": "d49b6999",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.154485Z",
"iopub.status.busy": "2026-02-02T23:39:39.154414Z",
"iopub.status.idle": "2026-02-02T23:39:39.167468Z",
"shell.execute_reply": "2026-02-02T23:39:39.167243Z"
"iopub.execute_input": "2026-02-05T00:47:06.542270Z",
"iopub.status.busy": "2026-02-05T00:47:06.542165Z",
"iopub.status.idle": "2026-02-05T00:47:06.560011Z",
"shell.execute_reply": "2026-02-05T00:47:06.559510Z"
}
},
"outputs": [
@@ -297,10 +298,10 @@
"id": "dac8bd54",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.168726Z",
"iopub.status.busy": "2026-02-02T23:39:39.168651Z",
"iopub.status.idle": "2026-02-02T23:39:39.182426Z",
"shell.execute_reply": "2026-02-02T23:39:39.182175Z"
"iopub.execute_input": "2026-02-05T00:47:06.561313Z",
"iopub.status.busy": "2026-02-05T00:47:06.561212Z",
"iopub.status.idle": "2026-02-05T00:47:06.579372Z",
"shell.execute_reply": "2026-02-05T00:47:06.578863Z"
}
},
"outputs": [
@@ -380,10 +381,10 @@
"id": "601ae443",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:39.184522Z",
"iopub.status.busy": "2026-02-02T23:39:39.184425Z",
"iopub.status.idle": "2026-02-02T23:39:40.007057Z",
"shell.execute_reply": "2026-02-02T23:39:40.006826Z"
"iopub.execute_input": "2026-02-05T00:47:06.580908Z",
"iopub.status.busy": "2026-02-05T00:47:06.580793Z",
"iopub.status.idle": "2026-02-05T00:47:07.783382Z",
"shell.execute_reply": "2026-02-05T00:47:07.782912Z"
},
"lines_to_next_cell": 0
},
@@ -391,7 +392,7 @@
{
"data": {
"text/plain": [
"24.23151351792924"
"np.float64(24.231513517929226)"
]
},
"execution_count": 9,
@@ -448,10 +449,10 @@
"id": "11226c85",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.008355Z",
"iopub.status.busy": "2026-02-02T23:39:40.008281Z",
"iopub.status.idle": "2026-02-02T23:39:40.469400Z",
"shell.execute_reply": "2026-02-02T23:39:40.469130Z"
"iopub.execute_input": "2026-02-05T00:47:07.785034Z",
"iopub.status.busy": "2026-02-05T00:47:07.784918Z",
"iopub.status.idle": "2026-02-05T00:47:08.636881Z",
"shell.execute_reply": "2026-02-05T00:47:08.636274Z"
},
"lines_to_next_cell": 0
},
@@ -459,7 +460,7 @@
{
"data": {
"text/plain": [
"array([24.23151352, 19.24821312, 19.33498406, 19.42443033, 19.03323827])"
"array([24.23151352, 19.24821312, 19.33498406, 19.4244303 , 19.03321527])"
]
},
"execution_count": 10,
@@ -486,7 +487,7 @@
"id": "a3a920ae",
"metadata": {},
"source": [
"As in Figure~\\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and\n",
"As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and\n",
"quadratic fits, but then no clear improvement from using higher-degree polynomials.\n",
"\n",
"Above we introduced the `outer()` method of the `np.power()`\n",
@@ -505,10 +506,10 @@
"id": "64b64d97",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.470650Z",
"iopub.status.busy": "2026-02-02T23:39:40.470570Z",
"iopub.status.idle": "2026-02-02T23:39:40.472721Z",
"shell.execute_reply": "2026-02-02T23:39:40.472520Z"
"iopub.execute_input": "2026-02-05T00:47:08.638904Z",
"iopub.status.busy": "2026-02-05T00:47:08.638760Z",
"iopub.status.idle": "2026-02-05T00:47:08.641760Z",
"shell.execute_reply": "2026-02-05T00:47:08.641094Z"
}
},
"outputs": [
@@ -547,10 +548,10 @@
"id": "ca0f972f",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.473794Z",
"iopub.status.busy": "2026-02-02T23:39:40.473725Z",
"iopub.status.idle": "2026-02-02T23:39:40.491806Z",
"shell.execute_reply": "2026-02-02T23:39:40.491547Z"
"iopub.execute_input": "2026-02-05T00:47:08.643305Z",
"iopub.status.busy": "2026-02-05T00:47:08.643162Z",
"iopub.status.idle": "2026-02-05T00:47:08.675141Z",
"shell.execute_reply": "2026-02-05T00:47:08.674740Z"
},
"lines_to_next_cell": 0
},
@@ -558,7 +559,7 @@
{
"data": {
"text/plain": [
"array([24.20766449, 19.18533142, 19.27626666, 19.47848403, 19.13720581])"
"array([24.20766449, 19.18533142, 19.27626666, 19.47848402, 19.13718373])"
]
},
"execution_count": 12,
@@ -589,7 +590,7 @@
"Notice that the computation time is much shorter than that of LOOCV.\n",
"(In principle, the computation time for LOOCV for a least squares\n",
"linear model should be faster than for $k$-fold CV, due to the\n",
"availability of the formula~(\\ref{Ch5:eq:LOOCVform}) for LOOCV;\n",
"availability of the formula~(5.2) for LOOCV;\n",
"however, the generic `cross_validate()` function does not make\n",
"use of this formula.) We still see little evidence that using cubic\n",
"or higher-degree polynomial terms leads to a lower test error than simply\n",
@@ -613,10 +614,10 @@
"id": "080cdb29",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.493215Z",
"iopub.status.busy": "2026-02-02T23:39:40.493138Z",
"iopub.status.idle": "2026-02-02T23:39:40.498978Z",
"shell.execute_reply": "2026-02-02T23:39:40.498768Z"
"iopub.execute_input": "2026-02-05T00:47:08.677151Z",
"iopub.status.busy": "2026-02-05T00:47:08.676983Z",
"iopub.status.idle": "2026-02-05T00:47:08.685777Z",
"shell.execute_reply": "2026-02-05T00:47:08.685174Z"
},
"lines_to_next_cell": 2
},
@@ -657,17 +658,17 @@
"id": "7c46de2b",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.500169Z",
"iopub.status.busy": "2026-02-02T23:39:40.500100Z",
"iopub.status.idle": "2026-02-02T23:39:40.525494Z",
"shell.execute_reply": "2026-02-02T23:39:40.525258Z"
"iopub.execute_input": "2026-02-05T00:47:08.687387Z",
"iopub.status.busy": "2026-02-05T00:47:08.687268Z",
"iopub.status.idle": "2026-02-05T00:47:08.726376Z",
"shell.execute_reply": "2026-02-05T00:47:08.725862Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(23.802232661034168, 1.421845094109185)"
"(np.float64(23.802232661034168), np.float64(1.4218450941091916))"
]
},
"execution_count": 14,
@@ -699,7 +700,7 @@
"\n",
"## The Bootstrap\n",
"We illustrate the use of the bootstrap in the simple example\n",
" {of Section~\\ref{Ch5:sec:bootstrap},} as well as on an example involving\n",
" {of Section 5.2,} as well as on an example involving\n",
"estimating the accuracy of the linear regression model on the `Auto`\n",
"data set.\n",
"### Estimating the Accuracy of a Statistic of Interest\n",
@@ -714,8 +715,8 @@
"To illustrate the bootstrap, we\n",
"start with a simple example.\n",
"The `Portfolio` data set in the `ISLP` package is described\n",
"in Section~\\ref{Ch5:sec:bootstrap}. The goal is to estimate the\n",
"sampling variance of the parameter $\\alpha$ given in formula~(\\ref{Ch5:min.var}). We will\n",
"in Section 5.2. The goal is to estimate the\n",
"sampling variance of the parameter $\\alpha$ given in formula~(5.7). We will\n",
"create a function\n",
"`alpha_func()`, which takes as input a dataframe `D` assumed\n",
"to have columns `X` and `Y`, as well as a\n",
@@ -731,10 +732,10 @@
"id": "a4b6d9b3",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.526599Z",
"iopub.status.busy": "2026-02-02T23:39:40.526530Z",
"iopub.status.idle": "2026-02-02T23:39:40.529182Z",
"shell.execute_reply": "2026-02-02T23:39:40.528979Z"
"iopub.execute_input": "2026-02-05T00:47:08.727991Z",
"iopub.status.busy": "2026-02-05T00:47:08.727860Z",
"iopub.status.idle": "2026-02-05T00:47:08.731089Z",
"shell.execute_reply": "2026-02-05T00:47:08.730582Z"
},
"lines_to_next_cell": 0
},
@@ -754,7 +755,7 @@
"source": [
"This function returns an estimate for $\\alpha$\n",
"based on applying the minimum\n",
" variance formula (\\ref{Ch5:min.var}) to the observations indexed by\n",
" variance formula (5.7) to the observations indexed by\n",
"the argument `idx`. For instance, the following command\n",
"estimates $\\alpha$ using all 100 observations."
]
@@ -765,17 +766,17 @@
"id": "81498a11",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.530362Z",
"iopub.status.busy": "2026-02-02T23:39:40.530277Z",
"iopub.status.idle": "2026-02-02T23:39:40.532680Z",
"shell.execute_reply": "2026-02-02T23:39:40.532462Z"
"iopub.execute_input": "2026-02-05T00:47:08.732467Z",
"iopub.status.busy": "2026-02-05T00:47:08.732368Z",
"iopub.status.idle": "2026-02-05T00:47:08.735439Z",
"shell.execute_reply": "2026-02-05T00:47:08.734958Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.57583207459283"
"np.float64(0.57583207459283)"
]
},
"execution_count": 16,
@@ -804,10 +805,10 @@
"id": "64fe1cb6",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.533876Z",
"iopub.status.busy": "2026-02-02T23:39:40.533810Z",
"iopub.status.idle": "2026-02-02T23:39:40.536977Z",
"shell.execute_reply": "2026-02-02T23:39:40.536767Z"
"iopub.execute_input": "2026-02-05T00:47:08.736816Z",
"iopub.status.busy": "2026-02-05T00:47:08.736715Z",
"iopub.status.idle": "2026-02-05T00:47:08.739777Z",
"shell.execute_reply": "2026-02-05T00:47:08.739384Z"
},
"lines_to_next_cell": 2
},
@@ -815,7 +816,7 @@
{
"data": {
"text/plain": [
"0.6074452469619004"
"np.float64(0.6074452469619002)"
]
},
"execution_count": 17,
@@ -847,10 +848,10 @@
"id": "dd16bbae",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.538089Z",
"iopub.status.busy": "2026-02-02T23:39:40.538010Z",
"iopub.status.idle": "2026-02-02T23:39:40.540142Z",
"shell.execute_reply": "2026-02-02T23:39:40.539916Z"
"iopub.execute_input": "2026-02-05T00:47:08.741328Z",
"iopub.status.busy": "2026-02-05T00:47:08.741207Z",
"iopub.status.idle": "2026-02-05T00:47:08.743628Z",
"shell.execute_reply": "2026-02-05T00:47:08.743222Z"
},
"lines_to_next_cell": 0
},
@@ -892,17 +893,17 @@
"id": "b42b4585",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.541333Z",
"iopub.status.busy": "2026-02-02T23:39:40.541248Z",
"iopub.status.idle": "2026-02-02T23:39:40.761713Z",
"shell.execute_reply": "2026-02-02T23:39:40.761472Z"
"iopub.execute_input": "2026-02-05T00:47:08.744970Z",
"iopub.status.busy": "2026-02-05T00:47:08.744866Z",
"iopub.status.idle": "2026-02-05T00:47:09.034568Z",
"shell.execute_reply": "2026-02-05T00:47:09.034067Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.09118176521277699"
"np.float64(0.09118176521277699)"
]
},
"execution_count": 19,
@@ -934,7 +935,7 @@
"`horsepower` to predict `mpg` in the `Auto` data set. We\n",
"will compare the estimates obtained using the bootstrap to those\n",
"obtained using the formulas for ${\\rm SE}(\\hat{\\beta}_0)$ and\n",
"${\\rm SE}(\\hat{\\beta}_1)$ described in Section~\\ref{Ch3:secoefsec}.\n",
"${\\rm SE}(\\hat{\\beta}_1)$ described in Section 3.1.2.\n",
"\n",
"To use our `boot_SE()` function, we must write a function (its\n",
"first argument)\n",
@@ -958,10 +959,10 @@
"id": "6bc11784",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.762911Z",
"iopub.status.busy": "2026-02-02T23:39:40.762837Z",
"iopub.status.idle": "2026-02-02T23:39:40.764597Z",
"shell.execute_reply": "2026-02-02T23:39:40.764401Z"
"iopub.execute_input": "2026-02-05T00:47:09.036186Z",
"iopub.status.busy": "2026-02-05T00:47:09.036064Z",
"iopub.status.idle": "2026-02-05T00:47:09.038120Z",
"shell.execute_reply": "2026-02-05T00:47:09.037695Z"
},
"lines_to_next_cell": 0
},
@@ -993,10 +994,10 @@
"id": "740cd50c",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.765800Z",
"iopub.status.busy": "2026-02-02T23:39:40.765734Z",
"iopub.status.idle": "2026-02-02T23:39:40.767191Z",
"shell.execute_reply": "2026-02-02T23:39:40.766986Z"
"iopub.execute_input": "2026-02-05T00:47:09.039473Z",
"iopub.status.busy": "2026-02-05T00:47:09.039363Z",
"iopub.status.idle": "2026-02-05T00:47:09.041198Z",
"shell.execute_reply": "2026-02-05T00:47:09.040802Z"
},
"lines_to_next_cell": 0
},
@@ -1026,10 +1027,10 @@
"id": "ffb3ec50",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.768266Z",
"iopub.status.busy": "2026-02-02T23:39:40.768204Z",
"iopub.status.idle": "2026-02-02T23:39:40.789931Z",
"shell.execute_reply": "2026-02-02T23:39:40.789705Z"
"iopub.execute_input": "2026-02-05T00:47:09.042630Z",
"iopub.status.busy": "2026-02-05T00:47:09.042520Z",
"iopub.status.idle": "2026-02-05T00:47:09.074556Z",
"shell.execute_reply": "2026-02-05T00:47:09.074079Z"
},
"lines_to_next_cell": 0
},
@@ -1077,10 +1078,10 @@
"id": "7d561f70",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:40.791196Z",
"iopub.status.busy": "2026-02-02T23:39:40.791129Z",
"iopub.status.idle": "2026-02-02T23:39:42.690473Z",
"shell.execute_reply": "2026-02-02T23:39:42.690189Z"
"iopub.execute_input": "2026-02-05T00:47:09.076044Z",
"iopub.status.busy": "2026-02-05T00:47:09.075921Z",
"iopub.status.idle": "2026-02-05T00:47:11.975218Z",
"shell.execute_reply": "2026-02-05T00:47:11.974653Z"
},
"lines_to_next_cell": 2
},
@@ -1115,7 +1116,7 @@
"0.85, and that the bootstrap\n",
"estimate for ${\\rm SE}(\\hat{\\beta}_1)$ is\n",
"0.0074. As discussed in\n",
"Section~\\ref{Ch3:secoefsec}, standard formulas can be used to compute\n",
"Section 3.1.2, standard formulas can be used to compute\n",
"the standard errors for the regression coefficients in a linear\n",
"model. These can be obtained using the `summarize()` function\n",
"from `ISLP.sm`."
@@ -1127,10 +1128,10 @@
"id": "3888aa0a",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:42.692058Z",
"iopub.status.busy": "2026-02-02T23:39:42.691967Z",
"iopub.status.idle": "2026-02-02T23:39:42.737592Z",
"shell.execute_reply": "2026-02-02T23:39:42.737346Z"
"iopub.execute_input": "2026-02-05T00:47:11.976651Z",
"iopub.status.busy": "2026-02-05T00:47:11.976530Z",
"iopub.status.idle": "2026-02-05T00:47:11.995335Z",
"shell.execute_reply": "2026-02-05T00:47:11.994822Z"
},
"lines_to_next_cell": 2
},
@@ -1160,7 +1161,7 @@
"metadata": {},
"source": [
"The standard error estimates for $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$\n",
"obtained using the formulas from Section~\\ref{Ch3:secoefsec} are\n",
"obtained using the formulas from Section 3.1.2 are\n",
"0.717 for the\n",
"intercept and\n",
"0.006 for the\n",
@@ -1168,13 +1169,13 @@
"obtained using the bootstrap. Does this indicate a problem with the\n",
"bootstrap? In fact, it suggests the opposite. Recall that the\n",
"standard formulas given in\n",
" {Equation~\\ref{Ch3:se.eqn} on page~\\pageref{Ch3:se.eqn}}\n",
"Equation 3.8 on page 75\n",
"rely on certain assumptions. For example,\n",
"they depend on the unknown parameter $\\sigma^2$, the noise\n",
"variance. We then estimate $\\sigma^2$ using the RSS. Now although the\n",
"formulas for the standard errors do not rely on the linear model being\n",
"correct, the estimate for $\\sigma^2$ does. We see\n",
" {in Figure~\\ref{Ch3:polyplot} on page~\\pageref{Ch3:polyplot}} that there is\n",
"in Figure 3.8 on page 99 that there is\n",
"a non-linear relationship in the data, and so the residuals from a\n",
"linear fit will be inflated, and so will $\\hat{\\sigma}^2$. Secondly,\n",
"the standard formulas assume (somewhat unrealistically) that the $x_i$\n",
@@ -1187,7 +1188,7 @@
"Below we compute the bootstrap standard error estimates and the\n",
"standard linear regression estimates that result from fitting the\n",
"quadratic model to the data. Since this model provides a good fit to\n",
"the data (Figure~\\ref{Ch3:polyplot}), there is now a better\n",
"the data (Figure 3.8), there is now a better\n",
"correspondence between the bootstrap estimates and the standard\n",
"estimates of ${\\rm SE}(\\hat{\\beta}_0)$, ${\\rm SE}(\\hat{\\beta}_1)$ and\n",
"${\\rm SE}(\\hat{\\beta}_2)$."
@@ -1199,10 +1200,10 @@
"id": "acc3e32c",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:42.739037Z",
"iopub.status.busy": "2026-02-02T23:39:42.738882Z",
"iopub.status.idle": "2026-02-02T23:39:45.405121Z",
"shell.execute_reply": "2026-02-02T23:39:45.404784Z"
"iopub.execute_input": "2026-02-05T00:47:11.996827Z",
"iopub.status.busy": "2026-02-05T00:47:11.996698Z",
"iopub.status.idle": "2026-02-05T00:47:16.184970Z",
"shell.execute_reply": "2026-02-05T00:47:16.184301Z"
}
},
"outputs": [
@@ -1242,10 +1243,10 @@
"id": "dca5340c",
"metadata": {
"execution": {
"iopub.execute_input": "2026-02-02T23:39:45.406403Z",
"iopub.status.busy": "2026-02-02T23:39:45.406328Z",
"iopub.status.idle": "2026-02-02T23:39:45.416112Z",
"shell.execute_reply": "2026-02-02T23:39:45.415874Z"
"iopub.execute_input": "2026-02-05T00:47:16.186711Z",
"iopub.status.busy": "2026-02-05T00:47:16.186585Z",
"iopub.status.idle": "2026-02-05T00:47:16.197450Z",
"shell.execute_reply": "2026-02-05T00:47:16.197112Z"
},
"lines_to_next_cell": 0
},
@@ -1272,7 +1273,7 @@
},
{
"cell_type": "markdown",
"id": "e98297be",
"id": "d714b6a3",
"metadata": {},
"source": [
"\n",
@@ -1283,8 +1284,7 @@
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "ipynb",
"main_language": "python"
"formats": "ipynb"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
@@ -1301,7 +1301,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.12.10"
}
},
"nbformat": 4,

View File

@@ -86,7 +86,7 @@ Hitters.shape
```
We first choose the best model using forward selection based on $C_p$ (\ref{Ch6:eq:cp}). This score
We first choose the best model using forward selection based on $C_p$ (6.2). This score
is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
it as a scorer. By default, `sklearn` tries to maximize a score, hence
our scoring function computes the negative $C_p$ statistic.
@@ -111,7 +111,7 @@ sigma2 = OLS(Y,X).fit().scale
```
The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section~\ref{Ch5-resample-lab:the-bootstrap} to freeze the first argument with our estimate of $\sigma^2$.
The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$.
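As a sketch of what this freezing accomplishes (the toy estimator and the `sigma2` value below are hypothetical, not the lab's objects):

```python
import numpy as np
from functools import partial

def nCp(sigma2, estimator, X, Y):
    # Negative Cp: -(RSS + 2 * d * sigma2) / n, so that maximizing the
    # score minimizes Cp (a sketch of the lab's scorer).
    n, d = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat) ** 2)
    return -(RSS + 2 * d * sigma2) / n

# partial() freezes sigma2, leaving the (estimator, X, Y) signature that
# sklearn-style scorers expect; 4.0 is a placeholder noise estimate.
neg_Cp = partial(nCp, 4.0)
```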
```{python}
neg_Cp = partial(nCp, sigma2)
@@ -363,7 +363,7 @@ Since we
standardize first, in order to find coefficient
estimates on the original scale, we must *unstandardize*
the coefficient estimates. The parameter
$\lambda$ in (\ref{Ch6:ridge}) and (\ref{Ch6:LASSO}) is called `alphas` in `sklearn`. In order to
$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
be consistent with the rest of this chapter, we use `lambdas`
rather than `alphas` in what follows. {At the time of publication, ridge fits like the one in code chunk [23] issue unwarranted convergence warning messages; we suppressed these when we filtered the warnings above.}
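For a fixed $\lambda$ the ridge solution on a centered, scaled design has a closed form; a minimal numpy sketch (not the `sklearn` solver) makes the role of the penalty concrete:

```python
import numpy as np

def ridge_coefs(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2 for fixed lambda:
    # b = (X'X + lam * I)^{-1} X'y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Setting `lam=0` recovers least squares, and the coefficients shrink toward zero as `lam` grows, mirroring the `lambdas` grid used in this section.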
@@ -540,7 +540,7 @@ grid.best_params_['ridge__alpha']
grid.best_estimator_
```
Recall we set up the `kfold` object for 5-fold cross-validation on page~\pageref{line:choos-among-models}. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
Recall we set up the `kfold` object for 5-fold cross-validation on page 271. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
to right.
```{python}
@@ -640,7 +640,7 @@ not perform variable selection!
### Evaluating Test Error of Cross-Validated Ridge
Choosing $\lambda$ using cross-validation provides a single regression
estimator, similar to fitting a linear regression model as we saw in
Chapter~\ref{Ch3:linreg}. It is therefore reasonable to estimate what its test error
Chapter 3. It is therefore reasonable to estimate what its test error
is. We run into a problem here in that cross-validation will have
*touched* all of its data in choosing $\lambda$, hence we have no
further data to estimate test error. A compromise is to do an initial
@@ -728,7 +728,7 @@ ax.set_ylabel('Standardized coefficiients', fontsize=20);
```
The smallest cross-validated error is lower than the test set MSE of the null model
and of least squares, and very similar to the test MSE of 115526.71 of ridge
regression (page~\pageref{page:MSECVRidge}) with $\lambda$ chosen by cross-validation.
regression (page 278) with $\lambda$ chosen by cross-validation.
```{python}
np.min(tuned_lasso.mse_path_.mean(1))
@@ -776,11 +776,11 @@ Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition`
module. We now apply PCR to the `Hitters` data, in order to
predict `Salary`. Again, ensure that the missing values have
been removed from the data, as described in Section~\ref{Ch6-varselect-lab:lab-1-subset-selection-methods}.
been removed from the data, as described in Section 6.5.1.
We use `LinearRegression()` to fit the regression model
here. Note that it fits an intercept by default, unlike
the `OLS()` function seen earlier in Section~\ref{Ch6-varselect-lab:lab-1-subset-selection-methods}.
the `OLS()` function seen earlier in Section 6.5.1.
```{python}
pca = PCA(n_components=2)
@@ -864,7 +864,7 @@ cv_null = skm.cross_validate(linreg,
The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using
different numbers of components. This concept is discussed in greater
detail in Section~\ref{Ch10:sec:pca}.
detail in Section 12.2.
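These ratios can also be computed directly from the singular values of the centered data matrix; a sketch of what `explained_variance_ratio_` reports:

```python
import numpy as np

def explained_variance_ratio(X):
    # Each component's share of total variance: squared singular values
    # of the centered matrix, normalized to sum to one.
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s**2 / np.sum(s**2)
```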
```{python}
pipe.named_steps['pca'].explained_variance_ratio_

File diff suppressed because one or more lines are too long

View File

@@ -47,7 +47,7 @@ from ISLP.pygam import (approx_lam,
```
## Polynomial Regression and Step Functions
We start by demonstrating how Figure~\ref{Ch7:fig:poly} can be reproduced.
We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.
```{python}
@@ -59,7 +59,7 @@ age = Wage['age']
Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above.
As in Section~\ref{Ch3-linreg-lab:non-linear-transformations-of-the-predictors}, we will use the `poly()` function to create a model matrix
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
that will fit a $4$th degree polynomial in `age`.
```{python}
@@ -73,11 +73,11 @@ summarize(M)
This polynomial is constructed using the function `poly()`,
which creates
a special *transformer* `Poly()` (using `sklearn` terminology
for feature transformations such as `PCA()` seen in Section \ref{Ch6-varselect-lab:principal-components-regression}) which
for feature transformations such as `PCA()` seen in Section 6.5.3) which
allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
the
discussion of transformations on
page~\pageref{Ch3-linreg-lab:using-transformations-fit-and-transform}.
page 118.
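For intuition, the raw (non-orthogonalized) analogue of the degree-4 model matrix can be sketched with numpy; `Poly()` additionally orthogonalizes these columns, but they span the same space of fitted values:

```python
import numpy as np

def raw_poly(x, degree=4):
    # Columns x, x^2, ..., x^degree -- the raw polynomial basis.
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**d for d in range(1, degree + 1)])
```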
In the code above, the first line executes the `fit()` method
using the dataframe
@@ -140,7 +140,7 @@ def plot_wage_fit(age_df,
We include an argument `alpha` to `ax.scatter()`
to add some transparency to the points. This provides a visual indication
of density. Notice the use of the `zip()` function in the
`for` loop above (see Section~\ref{Ch2-statlearn-lab:for-loops}).
`for` loop above (see Section 2.3.8).
We have three lines to plot, each with different colors and line
types. Here `zip()` conveniently bundles these together as
iterators in the loop. {In `Python`{} speak, an "iterator" is an object with a finite number of values, that can be iterated on, as in a loop.}
@@ -243,7 +243,7 @@ anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter~\ref{Ch5:resample}.
the polynomial degree using cross-validation, as discussed in Chapter 5.
Next we consider the task of predicting whether an individual earns
more than $250,000 per year. We proceed much as before, except
@@ -302,7 +302,7 @@ value do not cover each other up. This type of plot is often called a
*rug plot*.
In order to fit a step function, as discussed in
Section~\ref{Ch7:sec:scolstep-function}, we first use the `pd.qcut()`
Section 7.2, we first use the `pd.qcut()`
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
columns of the model matrix for this categorical variable. Note that this function will
include *all* columns for a given categorical, rather than the usual approach which drops one
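What `pd.qcut()` followed by `pd.get_dummies()` builds can be sketched with numpy alone (a didactic approximation, keeping all `q` indicator columns):

```python
import numpy as np

def quantile_dummies(x, q=4):
    # One-hot columns for quantile bins of x -- roughly what
    # pd.get_dummies(pd.qcut(x, q)) produces, with all q levels kept.
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, q + 1))
    bins = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, q - 1)
    return np.eye(q)[bins]
```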
@@ -334,7 +334,7 @@ evaluation functions are in the `scipy.interpolate` package;
we have simply wrapped them as transforms
similar to `Poly()` and `PCA()`.
In Section~\ref{Ch7:sec:scolr-splin}, we saw
In Section 7.4, we saw
that regression splines can be fit by constructing an appropriate
matrix of basis functions. The `BSpline()` function generates the
entire matrix of basis functions for splines with the specified set of
@@ -349,7 +349,7 @@ bs_age.shape
```
This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
We can form this same matrix using the `bs()` object,
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section~\ref{Ch7-nonlin-lab:polynomial-regression-and-step-functions}.
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.
We now fit a cubic spline model to the `Wage` data.
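The dimension count can be checked against the truncated power basis, an equivalent (though numerically less stable) parametrization of the same cubic spline space; a sketch:

```python
import numpy as np

def truncated_power_basis(x, knots):
    # Cubic spline basis: x, x^2, x^3, plus one (x - k)_+^3 term per knot.
    # With 3 interior knots this gives 6 columns; adding an intercept
    # matches the seven degrees of freedom noted above.
    x = np.asarray(x, dtype=float)
    cols = [x, x**2, x**3] + [np.clip(x - k, 0.0, None)**3 for k in knots]
    return np.column_stack(cols)
```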
@@ -458,7 +458,7 @@ of a model matrix with a particular smoothing operation:
`s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables.
The argument `0` passed to `s` below indicates that this smoother will
apply to the first column of a feature matrix. Below, we pass it a
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section~\ref{Ch7:sec5.2}.
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.
```{python}
X_age = np.asarray(age).reshape((-1,1))
@@ -548,7 +548,7 @@ The strength of generalized additive models lies in their ability to fit multiva
We now fit a GAM by hand to predict
`wage` using natural spline functions of `year` and `age`,
treating `education` as a qualitative predictor, as in (\ref{Ch7:nsmod}).
treating `education` as a qualitative predictor, as in (7.16).
Since this is just a big linear regression model
using an appropriate choice of basis functions, we can simply do this
using the `sm.OLS()` function.
@@ -631,9 +631,9 @@ ax.set_title('Partial dependence of year on wage', fontsize=20);
```
We now fit the model (\ref{Ch7:nsmod}) using smoothing splines rather
We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (\ref{Ch7:nsmod}) are fit simultaneously, taking each other
terms in (7.16) are fit simultaneously, taking each other
into account to explain the response. The `pygam` package only works with matrices, so we must convert
the categorical series `education` to its array representation, which can be found
with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we

File diff suppressed because one or more lines are too long

View File

@@ -86,11 +86,11 @@ clf.fit(X, High)
```
In our discussion of qualitative features in Section~\ref{ch3:sec3},
In our discussion of qualitative features in Section 3.3,
we noted that for a linear regression model such a feature could be
represented by including a matrix of dummy variables (one-hot-encoding) in the model
matrix, using the formula notation of `statsmodels`.
As mentioned in Section~\ref{Ch8:decison.tree.sec}, there is a more
As mentioned in Section 8.1, there is a more
natural way to handle qualitative features when building a decision
tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups.
However,
@@ -121,7 +121,7 @@ resid_dev
```
This is closely related to the *entropy*, defined in (\ref{Ch8:eq:cross-entropy}).
This is closely related to the *entropy*, defined in (8.7).
A small deviance indicates a
tree that provides a good fit to the (training) data.
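The entropy of a node's class counts is simple to sketch directly:

```python
import numpy as np

def entropy(counts):
    # Entropy (8.7): -sum_k p_k log(p_k) over the node's class
    # proportions, with the convention 0 * log(0) = 0.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

A pure node has entropy 0, while a 50/50 split of two classes gives $\log 2 \approx 0.693$; smaller values therefore indicate purer nodes, consistent with the deviance interpretation above.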
@@ -158,7 +158,7 @@ on these data, we must estimate the test error rather than simply
computing the training error. We split the observations into a
training set and a test set, build the tree using the training set,
and evaluate its performance on the test data. This pattern is
similar to that in Chapter~\ref{Ch6:varselect}, with the linear models
similar to that in Chapter 6, with the linear models
replaced here by decision trees --- the code for validation
is almost identical. This approach leads to correct predictions
for 68.5% of the locations in the test data set.
@@ -431,7 +431,7 @@ feature_imp.sort_values(by='importance', ascending=False)
```
This
is a relative measure of the total decrease in node impurity that results from
splits over that variable, averaged over all trees (this was plotted in Figure~\ref{Ch8:fig:varimp} for a model fit to the `Heart` data).
splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data).
The results indicate that across all of the trees considered in the
random forest, the wealth level of the community (`lstat`) and the
@@ -495,7 +495,7 @@ np.mean((y_test - y_hat_boost)**2)
The test MSE obtained is 14.48,
similar to the test MSE for bagging. If we want to, we can
perform boosting with a different value of the shrinkage parameter
$\lambda$ in (\ref{Ch8:alphaboost}). The default value is 0.001, but
$\lambda$ in (8.10). The default value is 0.001, but
this is easily modified. Here we take $\lambda=0.2$.
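The role of the shrinkage parameter can be illustrated with a toy L2-boosting loop over one-split stumps (an illustration of the idea in (8.10), not the `sklearn` implementation):

```python
import numpy as np

def fit_stump(x, r):
    # Best single split on a 1-D feature for residuals r (squared error).
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t].mean(), r[x > t].mean()
        sse = np.sum((r - np.where(x <= t, left, right))**2)
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    return best[1:]

def l2_boost(x, y, B=100, lam=0.2):
    # Each iteration fits a stump to the current residuals and adds a
    # lam-shrunken copy of its fit; smaller lam means slower learning.
    f = np.zeros_like(y, dtype=float)
    for _ in range(B):
        t, cl, cr = fit_stump(x, y - f)
        f += lam * np.where(x <= t, cl, cr)
    return f
```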
```{python}

File diff suppressed because one or more lines are too long

View File

@@ -255,9 +255,9 @@ kernel we use `kernel="poly"`, and to fit an SVM with a
radial kernel we use
`kernel="rbf"`. In the former case we also use the
`degree` argument to specify a degree for the polynomial kernel
(this is $d$ in (\ref{Ch9:eq:polyd})), and in the latter case we use
(this is $d$ in (9.22)), and in the latter case we use
`gamma` to specify a value of $\gamma$ for the radial basis
kernel (\ref{Ch9:eq:radial}).
kernel (9.24).
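The radial kernel itself is a one-liner; a sketch:

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    # Radial basis kernel (9.24): K(u, v) = exp(-gamma * ||u - v||^2).
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.exp(-gamma * np.sum((u - v)**2)))
```

Larger `gamma` makes the kernel more local, so training observations far from a test point contribute little to its fitted value.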
We first generate some data with a non-linear class boundary, as follows:
@@ -377,7 +377,7 @@ classifier, the fitted value for an observation $X= (X_1, X_2, \ldots,
X_p)^T$ takes the form $\hat{\beta}_0 + \hat{\beta}_1 X_1 +
\hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$. For an SVM with a
non-linear kernel, the equation that yields the fitted value is given
in (\ref{Ch9:eq:svmip}). The sign of the fitted value
in (9.23). The sign of the fitted value
determines on which side of the decision boundary the observation
lies. Therefore, the relationship between the fitted value and the
class prediction for a given observation is simple: if the fitted

File diff suppressed because one or more lines are too long

View File

@@ -1,5 +1,4 @@
# Deep Learning
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2.1/Ch10-deeplearning-lab.ipynb">
@@ -21,7 +20,8 @@ Much of our code is adapted from there, as well as the `pytorch_lightning` docum
We start with several standard imports that we have seen before.
```{python}
import numpy as np, pandas as pd
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from sklearn.linear_model import \
(LinearRegression,
@@ -57,7 +57,7 @@ the `torchmetrics` package has utilities to compute
various metrics to evaluate performance when fitting
a model. The `torchinfo` package provides a useful
summary of the layers of a model. We use the `read_image()`
function when loading test images in Section~\ref{Ch13-deeplearning-lab:using-pretrained-cnn-models}.
function when loading test images in Section 10.9.4.
If you have not already installed the packages `torchvision`
and `torchinfo` you can install them by running
@@ -153,17 +153,19 @@ in our example applying the `ResNet50` model
to some of our own images.
The `json` module will be used to load
a JSON file for looking up classes to identify the labels of the
pictures in the `ResNet50` example.
pictures in the `ResNet50` example. We'll also import `warnings` to filter
out warnings when fitting the LASSO to the IMDB data.
```{python}
from glob import glob
import json
import warnings
```
## Single Layer Network on Hitters Data
We start by fitting the models in Section~\ref{Ch13:sec:when-use-deep} on the `Hitters` data.
We start by fitting the models in Section 10.6 on the `Hitters` data.
```{python}
Hitters = load_data('Hitters').dropna()
@@ -217,7 +219,7 @@ np.abs(Yhat_test - Y_test).mean()
Next we fit the lasso using `sklearn`. We are using
mean absolute error to select and evaluate a model, rather than mean squared error.
The specialized solver we used in Section~\ref{Ch6-varselect-lab:lab-2-ridge-regression-and-the-lasso} uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
The specialized solver we used in Section 6.5.2 uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly.
We encode a pipeline with two steps: we first normalize the features using a `StandardScaler()` transform,
and then fit the lasso without further normalization.
@@ -439,7 +441,7 @@ hit_module = SimpleModule.regression(hit_model,
```
By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in
(\ref{Ch13:eq:4}).
(10.23).
We have also asked for mean absolute error to be tracked as well
in the metrics that are logged.
@@ -476,7 +478,7 @@ hit_trainer = Trainer(deterministic=True,
hit_trainer.fit(hit_module, datamodule=hit_dm)
```
At each step of SGD, the algorithm randomly selects 32 training observations for
the computation of the gradient. Recall from Section~\ref{Ch13:sec:fitt-neur-netw}
the computation of the gradient. Recall from Section 10.7
that an epoch amounts to the number of SGD steps required to process $n$
observations. Since the training set has
$n=175$, and we specified a `batch_size` of 32 in the construction of `hit_dm`, an epoch is $175/32 \approx 5.5$ SGD steps.
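Since a data loader processes whole minibatches, the step count per epoch can be computed as:

```python
import math

n, batch_size = 175, 32
steps_per_epoch = math.ceil(n / batch_size)
# 175 / 32 = 5.47: five full minibatches plus a final partial one,
# so the loader takes 6 SGD steps per epoch.
```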
@@ -765,8 +767,8 @@ mnist_trainer.test(mnist_module,
datamodule=mnist_dm)
```
Table~\ref{Ch13:tab:mnist} also reports the error rates resulting from LDA (Chapter~\ref{Ch4:classification}) and multiclass logistic
regression. For LDA we refer the reader to Section~\ref{Ch4-classification-lab:linear-discriminant-analysis}.
Table 10.1 also reports the error rates resulting from LDA (Chapter 4) and multiclass logistic
regression. For LDA we refer the reader to Section 4.7.3.
Although we could use the `sklearn` function `LogisticRegression()` to fit
multiclass logistic regression, we are set up here to fit such a model
with `torch`.
@@ -871,7 +873,7 @@ for idx, (X_ ,Y_) in enumerate(cifar_dm.train_dataloader()):
Before we start, we look at some of the training images; similar code produced
Figure~\ref{Ch13:fig:cifar100} on page \pageref{Ch13:fig:cifar100}. The example below also illustrates
Figure 10.5 on page 406. The example below also illustrates
that `TensorDataset` objects can be indexed with integers --- we are choosing
random images from the training data by indexing `cifar_train`. In order to display correctly,
we must reorder the dimensions by a call to `np.transpose()`.
@@ -894,7 +896,7 @@ for i in range(5):
Here the `imshow()` method recognizes from the shape of its argument that it is a 3-dimensional array, with the last dimension indexing the three RGB color channels.
We specify a moderately-sized CNN for
demonstration purposes, similar in structure to Figure~\ref{Ch13:fig:DeepCNN}.
demonstration purposes, similar in structure to Figure 10.8.
We use several layers, each consisting of convolution, ReLU, and max-pooling steps.
We first define a module that defines one of these layers. As in our
previous examples, we overwrite the `__init__()` and `forward()` methods
@@ -1034,7 +1036,7 @@ summary_plot(cifar_results,
ax,
col='accuracy',
ylabel='Accuracy')
ax.set_xticks(np.linspace(0, 10, 6).astype(int))
ax.set_xticks(np.linspace(0, 30, 7).astype(int))
ax.set_ylabel('Accuracy')
ax.set_ylim([0, 1]);
```
@@ -1083,7 +1085,7 @@ clauses; if it works, we get the speedup, if it fails, nothing happens.
## Using Pretrained CNN Models
We now show how to use a CNN pretrained on the `imagenet` database to classify natural
images, and demonstrate how we produced Figure~\ref{Ch13:fig:homeimages}.
images, and demonstrate how we produced Figure 10.10.
We copied six JPEG images from a digital photo album into the
directory `book_images`. These images are available
from the data section of <www.statlearning.com>, the ISLP book website. Download `book_images.zip`; when
@@ -1192,7 +1194,7 @@ del(cifar_test,
## IMDB Document Classification
We now implement models for sentiment classification (Section~\ref{Ch13:sec:docum-class}) on the `IMDB`
We now implement models for sentiment classification (Section 10.4) on the `IMDB`
dataset. As mentioned above code block 8, we are using
a preprocessed version of the `IMDB` dataset found in the
`keras` package. As `keras` uses `tensorflow`, a different
@@ -1346,7 +1348,7 @@ matrix that is recognized by `sklearn.`
```
Similar to what we did in
Section~\ref{Ch13-deeplearning-lab:single-layer-network-on-hitters-data},
Section 10.9.1,
we construct a series of 50 values for the lasso regularization parameter $\lambda$.
```{python}
@@ -1369,16 +1371,20 @@ logit = LogisticRegression(penalty='l1',
```
The path of 50 values takes approximately 40 seconds to run.
As in Chapter 6, we will filter out warnings, this time using a context manager.
```{python}
coefs = []
intercepts = []
with warnings.catch_warnings():
warnings.simplefilter("ignore")
coefs = []
intercepts = []
for l in lam_val:
logit.C = 1/l
logit.fit(X_train, Y_train)
coefs.append(logit.coef_.copy())
intercepts.append(logit.intercept_)
for l in lam_val:
logit.C = 1/l
logit.fit(X_train, Y_train)
coefs.append(logit.coef_.copy())
intercepts.append(logit.intercept_)
```
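The indentation matters here: only code executed inside the `with` block has the filter applied, and the previous filters are restored on exit. A minimal illustration:

```python
import warnings

def noisy():
    # Stand-in for a solver that emits a convergence warning.
    warnings.warn("convergence not reached")
    return 42

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()  # warning suppressed inside the context manager
# outside the block, warnings behave as before
```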
@@ -1454,16 +1460,16 @@ del(imdb_model,
## Recurrent Neural Networks
In this lab we fit the models illustrated in
Section~\ref{Ch13:sec:recurr-neur-netw}.
Section 10.5.
### Sequential Models for Document Classification
Here we fit a simple LSTM RNN for sentiment prediction to
the `IMDb` movie-review data, as discussed in Section~\ref{Ch13:sec:sequ-models-docum}.
the `IMDb` movie-review data, as discussed in Section 10.5.1.
For an RNN we use the sequence of words in a document, taking their
order into account. We loaded the preprocessed
data at the beginning of
Section~\ref{Ch13-deeplearning-lab:imdb-document-classification}.
Section 10.9.5.
A script that details the preprocessing can be found in the
`ISLP` library. Notably, since more than 90% of the documents
had fewer than 500 words, we set the document length to 500. For
@@ -1578,7 +1584,7 @@ del(lstm_model,
### Time Series Prediction
We now show how to fit the models in Section~\ref{Ch13:sec:time-seri-pred}
We now show how to fit the models in Section 10.5.2
for time series prediction.
We first load and standardize the data.

File diff suppressed because one or more lines are too long

View File

@@ -10,9 +10,9 @@
In this lab, we perform survival analyses on three separate data
sets. In Section~\ref{brain.cancer.sec} we analyze the `BrainCancer`
data that was first described in Section~\ref{sec:KM}. In Section~\ref{time.to.pub.sec}, we examine the `Publication`
data from Section~\ref{sec:pub}. Finally, Section~\ref{call.center.sec} explores
sets. In Section 11.8.1 we analyze the `BrainCancer`
data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
data from Section 11.5.4. Finally, Section 11.8.3 explores
a simulated call-center data set.
We begin by importing some of our libraries at this top
@@ -82,7 +82,7 @@ observation. But some scientists might use the opposite coding. For
the `BrainCancer` data set 35 patients died before the end of
the study, so we are using the conventional coding.
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure~\ref{fig:survbrain}. The main
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main
package we will use for survival analysis
is `lifelines`.
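The estimator itself is simple enough to sketch directly (a didactic version, not the `lifelines` implementation):

```python
import numpy as np

def kaplan_meier(time, status):
    # S(t) = prod over event times t_i <= t of (1 - d_i / n_i), where
    # d_i events occur among the n_i subjects still at risk at t_i;
    # status == 1 marks an observed event, 0 a censored observation.
    time, status = np.asarray(time, dtype=float), np.asarray(status)
    event_times = np.unique(time[status == 1])
    surv, s = [], 1.0
    for t in event_times:
        n_i = np.sum(time >= t)
        d_i = np.sum((time == t) & (status == 1))
        s *= 1.0 - d_i / n_i
        surv.append(s)
    return event_times, np.array(surv)
```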
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
@@ -102,7 +102,7 @@ km_brain.plot(label='Kaplan Meier estimate', ax=ax)
```
Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure~\ref{fig:survbrain2}.
`sex`, in order to reproduce Figure 11.3.
We do this using the `groupby()` method of a dataframe.
This method returns a generator that can
be iterated over in the `for` loop. In this case,
@@ -130,7 +130,7 @@ for sex, df in BrainCancer.groupby('sex'):
```
As discussed in Section~\ref{sec:logrank}, we can perform a
As discussed in Section 11.4, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
The first two arguments are the event times, with the second
@@ -266,9 +266,9 @@ predicted_survival.plot(ax=ax);
## Publication Data
The `Publication` data presented in Section~\ref{sec:pub} can be
The `Publication` data presented in Section 11.5.4 can be
found in the `ISLP` package.
We first reproduce Figure~\ref{fig:lauersurv} by plotting the Kaplan-Meier curves
We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves
stratified on the `posres` variable, which records whether the
study had a positive or negative result.
@@ -323,7 +323,7 @@ of the study, and whether the study had positive or negative results.
In this section, we will simulate survival data using the relationship
between cumulative hazard and
the survival function explored in Exercise \ref{ex:all3}.
the survival function explored in Exercise 8.
Our simulated data will represent the observed
wait times (in seconds) for 2,000 customers who have phoned a call
center. In this context, censoring occurs if a customer hangs up
@@ -394,7 +394,7 @@ that the risk of a call being answered at Center B is 0.74 times the
risk that it will be answered at Center A; in other words, the wait
times are a bit longer at Center B.
Recall from Section~\ref{Ch2-statlearn-lab:loading-data} the use of `lambda`
Recall from Section 2.3.7 the use of `lambda`
for creating short functions on the fly.
We use the function
`sim_time()` from the `ISLP.survival` package. This function

File diff suppressed because one or more lines are too long

View File

@@ -164,7 +164,7 @@ for k in range(pcaUS.components_.shape[1]):
USArrests.columns[k])
```
Notice that this figure is a reflection of Figure~\ref{Ch10:fig:USArrests:obs} through the $y$-axis. Recall that the
Notice that this figure is a reflection of Figure 12.1 through the $y$-axis. Recall that the
principal components are only unique up to a sign change, so we can
reproduce that figure by flipping the
signs of the second set of scores and loadings.
@@ -241,7 +241,7 @@ ax.set_xticks(ticks)
fig
```
The result is similar to that shown in Figure~\ref{Ch10:fig:USArrests:scree}. Note
The result is similar to that shown in Figure 12.3. Note
that the method `cumsum()` computes the cumulative sum of
the elements of a numeric vector. For instance:
@@ -253,15 +253,15 @@ np.cumsum(a)
## Matrix Completion
We now re-create the analysis carried out on the `USArrests` data in
Section~\ref{Ch10:sec:princ-comp-with}.
Section 12.3.
We saw in Section~\ref{ch10:sec2.2} that solving the optimization
problem~(\ref{Ch10:eq:mc2}) on a centered data matrix $\bf X$ is
We saw in Section 12.2.2 that solving the optimization
problem (12.6) on a centered data matrix $\bf X$ is
equivalent to computing the first $M$ principal
components of the data. We use our scaled
and centered `USArrests` data as $\bf X$ below. The *singular value decomposition*
(SVD) is a general algorithm for solving
(\ref{Ch10:eq:mc2}).
(12.6).
```{python}
X = USArrests_scaled
@@ -319,9 +319,9 @@ Here the array `r_idx`
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
We now write some code to implement Algorithm~\ref{Ch10:alg:hardimpute}.
We now write some code to implement Algorithm 12.1.
We first write a function that takes in a matrix, and returns an approximation to the matrix using the `svd()` function.
This will be needed in Step 2 of Algorithm~\ref{Ch10:alg:hardimpute}.
This will be needed in Step 2 of Algorithm 12.1.
```{python}
def low_rank(X, M=1):
@@ -330,7 +330,7 @@ def low_rank(X, M=1):
return L.dot(V[:M])
```
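The diff elides the SVD lines of this helper; a complete sketch under the same signature would be:

```python
import numpy as np

def low_rank(X, M=1):
    # Rank-M approximation from the SVD: keep the leading M singular
    # values and vectors (Step 2(a) of Algorithm 12.1).
    U, D, V = np.linalg.svd(X, full_matrices=False)
    L = U[:, :M] * D[None, :M]
    return L.dot(V[:M])
```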
To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm~\ref{Ch10:alg:hardimpute} --- by replacing
To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm 12.1 --- by replacing
the missing values with the column means of the non-missing entries. These are stored in
`Xbar` below after running `np.nanmean()` over the row axis.
We make a copy so that when we assign values to `Xhat` below we do not also overwrite the
@@ -360,11 +360,11 @@ a given element is `True` if the corresponding matrix element is missing. The no
because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in `mss0`.
We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold` (which currently
agrees with `mss0`). We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then
iterate Step 2 of Algorithm~\ref{Ch10:alg:hardimpute} until the *relative error*, defined as
iterate Step 2 of Algorithm 12.1 until the *relative error*, defined as
`(mssold - mss) / mss0`, falls below `thresh = 1e-7`.
{Algorithm~\ref{Ch10:alg:hardimpute} tells us to iterate Step 2 until \eqref{Ch10:eq:mc6} is no longer decreasing. Determining whether \eqref{Ch10:eq:mc6} is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm~\ref{Ch10:alg:hardimpute} to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.}
{Algorithm 12.1 tells us to iterate Step 2 until (12.14) is no longer decreasing. Determining whether (12.14) is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm 12.1 to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.}
In Step 2(a) of Algorithm~\ref{Ch10:alg:hardimpute}, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
In Step 2(a) of Algorithm 12.1, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
```{python}
while rel_err > thresh:
@@ -393,7 +393,7 @@ np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
```
In this lab, we implemented Algorithm~\ref{Ch10:alg:hardimpute} ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python`{} implementations.
In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python`{} implementations.
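For readers who want a single self-contained reference, the iteration described above can be sketched as follows. The helper `low_rank_approx()` (a plain SVD truncation) stands in for the lab's `low_rank()`, whose exact signature is not shown here, and the iteration cap is an added safeguard.

```python
import numpy as np

def low_rank_approx(X, M=1):
    """Best rank-M approximation of X, via the singular value decomposition."""
    U, D, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :M] * D[:M]) @ Vt[:M]

def hard_impute(Xna, M=1, thresh=1e-7, max_iter=1000):
    """Fill the NaNs in Xna by iterating a rank-M approximation (Algorithm 12.1)."""
    ismiss = np.isnan(Xna)
    Xbar = np.nanmean(Xna, axis=0)            # column means of the observed entries
    Xhat = np.where(ismiss, Xbar, Xna)        # Step 1: mean imputation
    mss0 = mssold = np.mean(Xna[~ismiss]**2)
    for _ in range(max_iter):                 # cap iterations as a safeguard
        Xapp = low_rank_approx(Xhat, M=M)     # Step 2(a)
        Xhat[ismiss] = Xapp[ismiss]           # Step 2(b): update only missing cells
        mss = np.mean((Xna[~ismiss] - Xapp[~ismiss])**2)
        rel_err = (mssold - mss) / mss0       # Step 2(c): relative error
        mssold = mss
        if rel_err < thresh:
            break
    return Xhat
```

Restricting the assignment in Step 2(b) to the missing entries is what keeps the observed data untouched.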
## Clustering
@@ -464,7 +464,7 @@ We have used the `n_init` argument to run the $K$-means with 20
initial cluster assignments (the default is 10). If a
value of `n_init` greater than one is used, then $K$-means
clustering will be performed using multiple random assignments in
Step 1 of Algorithm~\ref{Ch10:alg:km}, and the `KMeans()`
Step 1 of Algorithm 12.2, and the `KMeans()`
function will report only the best results. Here we compare using
`n_init=1` to `n_init=20`.
@@ -480,7 +480,7 @@ kmeans1.inertia_, kmeans20.inertia_
```
Note that `kmeans.inertia_` is the total within-cluster sum
of squares, which we seek to minimize by performing $K$-means
clustering \eqref{Ch10:eq:kmeans}.
clustering 12.17.
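To see why multiple random assignments matter, here is a toy, numpy-only sketch of Lloyd's algorithm (an illustration, not `sklearn`'s `KMeans`): keeping the best of several runs can only lower the within-cluster sum of squares.

```python
import numpy as np

def lloyd(X, K, rng, n_iter=50):
    """One K-means run from a random initial assignment (Step 1 of the algorithm)."""
    labels = rng.integers(K, size=X.shape[0])
    for _ in range(n_iter):
        # Recompute centroids; re-seed any cluster that has gone empty.
        centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                            else X[rng.integers(X.shape[0])]
                            for k in range(K)])
        d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        new = d.argmin(axis=1)
        inertia = d[np.arange(X.shape[0]), new].sum()  # within-cluster sum of squares
        if (new == labels).all():
            break
        labels = new
    return new, inertia

def best_of(X, K, n_init, seed=0):
    """Keep the run with the smallest inertia, mimicking KMeans(n_init=...)."""
    rng = np.random.default_rng(seed)
    return min((lloyd(X, K, rng) for _ in range(n_init)), key=lambda r: r[1])
```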
We *strongly* recommend always running $K$-means clustering with
a large value of `n_init`, such as 20 or 50, since otherwise an
@@ -846,7 +846,7 @@ results in four distinct clusters. It is easy to verify that the
resulting clusters are the same as the ones we obtained in
`comp_cut`.
We claimed earlier in Section~\ref{Ch10:subsec:hc} that
We claimed earlier in Section 12.4.2 that
$K$-means clustering and hierarchical clustering with the dendrogram
cut to obtain the same number of clusters can yield very different
results. How do these `NCI60` hierarchical clustering results compare


File diff suppressed because one or more lines are too long


@@ -91,7 +91,7 @@ truth = pd.Categorical(true_mean == 0,
```
Since this is a simulated data set, we can create a $2 \times 2$ table
similar to Table~\ref{Ch12:tab-hypotheses}.
similar to Table 13.2.
```{python}
pd.crosstab(decision,
@@ -102,7 +102,7 @@ pd.crosstab(decision,
```
Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false
null hypotheses, and we incorrectly reject 5 of the true null
hypotheses. Using the notation from Section~\ref{sec:fwer}, we have
hypotheses. Using the notation from Section 13.3, we have
$V=5$, $S=15$, $U=45$ and $W=35$.
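The same $2 \times 2$ bookkeeping can be reproduced with synthetic label vectors chosen only to match the counts quoted above (the arrays below are illustrative, not the lab's simulated data):

```python
import numpy as np
import pandas as pd

# True = the null hypothesis is true; first 50 nulls true, last 50 false.
truth = np.array([True] * 50 + [False] * 50)
# True = reject. Among true nulls: 45 retained (U), 5 rejected (V);
# among false nulls: 35 retained (W), 15 rejected (S).
decision = np.array([False] * 45 + [True] * 5 +
                    [False] * 35 + [True] * 15)
tab = pd.crosstab(decision, truth)
print(tab)
```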
We have set $\alpha=0.05$, which means that we expect to reject around
5% of the true null hypotheses. This is in line with the $2 \times 2$
@@ -140,12 +140,12 @@ pd.crosstab(decision,
## Family-Wise Error Rate
Recall from \eqref{eq:FWER.indep} that if the null hypothesis is true
Recall from 13.5 that if the null hypothesis is true
for each of $m$ independent hypothesis tests, then the FWER is equal
to $1-(1-\alpha)^m$. We can use this expression to compute the FWER
for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.
We plot the FWER for these values of $\alpha$ in order to
reproduce Figure~\ref{Ch12:fwer}.
reproduce Figure 13.2.
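As a quick numerical check of the formula before plotting it: at $\alpha=0.05$, the FWER already exceeds 90% by $m=50$ independent tests.

```python
import numpy as np

# FWER for m independent tests, each at level alpha: 1 - (1 - alpha)^m.
m = np.arange(1, 501)
fwer_05 = 1 - (1 - 0.05) ** m
fwer_001 = 1 - (1 - 0.001) ** m
print(f"alpha=0.05:  FWER at m=50  is {fwer_05[49]:.3f}")   # already above 0.9
print(f"alpha=0.001: FWER at m=500 is {fwer_001[499]:.3f}")
```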
```{python}
m = np.linspace(1, 501)
@@ -263,7 +263,7 @@ However, we decided to perform this test only after examining the data
and noting that Managers One and Two had the highest and lowest mean
performances. In a sense, this means that we have implicitly
performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than
just one, as discussed in Section~\ref{tukey.sec}. Hence, we use the
just one, as discussed in Section 13.3.2. Hence, we use the
`pairwise_tukeyhsd()` function from
`statsmodels.stats.multicomp` to apply Tukey's method
in order to adjust for multiple testing. This function takes
@@ -350,7 +350,7 @@ null hypotheses!
```
Figure~\ref{Ch12:fig:BonferroniBenjamini} displays the ordered
Figure 13.6 displays the ordered
$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
the `Fund` dataset, as well as the threshold for rejection by the
Benjamini--Hochberg procedure. Recall that the Benjamini--Hochberg
@@ -379,7 +379,7 @@ else:
```
We now reproduce the middle panel of Figure~\ref{Ch12:fig:BonferroniBenjamini}.
We now reproduce the middle panel of Figure 13.6.
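For reference, the Benjamini--Hochberg decision rule being plotted can be written as a short generic function (a sketch, not the lab's code): reject the hypotheses with the $k$ smallest $p$-values, where $k$ is the largest $i$ such that $p_{(i)} \leq q\, i / m$.

```python
import numpy as np

def bh_reject(pvals, q=0.1):
    """Benjamini-Hochberg: indices of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()        # largest i with p_(i) <= q*i/m
    return np.sort(order[:k + 1])         # reject the k+1 smallest p-values
```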
```{python}
fig, ax = plt.subplots()
@@ -398,7 +398,7 @@ ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
## A Re-Sampling Approach
Here, we implement the re-sampling approach to hypothesis testing
using the `Khan` dataset, which we investigated in
Section~\ref{sec:permutations}. First, we merge the training and
Section 13.5. First, we merge the training and
testing data, which results in observations on 83 patients for
2,308 genes.
@@ -464,7 +464,7 @@ for b in range(B):
This fraction, 0.0398,
is our re-sampling-based $p$-value.
It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure~\ref{Ch12:fig-permp-1}.
We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure 13.7.
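The same re-sampling idea can be sketched generically; this toy version uses the difference in means as its statistic rather than the lab's two-sample $t$-statistic.

```python
import numpy as np

def perm_pvalue(x, y, B=1000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = np.random.default_rng(seed)
    obs = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)                      # shuffle group labels
        stat = perm[:x.size].mean() - perm[x.size:].mean()
        if abs(stat) >= abs(obs):
            count += 1
    return count / B
```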
```{python}
fig, ax = plt.subplots(figsize=(8,8))
@@ -487,7 +487,7 @@ ax.set_xlabel("Null Distribution of Test Statistic");
The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
Finally, we implement the plug-in re-sampling FDR approach outlined in
Algorithm~\ref{Ch12:alg-plugin-fdr}. Depending on the speed of your
Algorithm 13.4. Depending on the speed of your
computer, calculating the FDR for all 2,308 genes in the `Khan`
dataset may take a while. Hence, we will illustrate the approach on a
random subset of 100 genes. For each gene, we first compute the
@@ -520,7 +520,7 @@ for j in range(m):
Next, we compute the number of rejected null hypotheses $R$, the
estimated number of false positives $\widehat{V}$, and the estimated
FDR, for a range of threshold values $c$ in
Algorithm~\ref{Ch12:alg-plugin-fdr}. The threshold values are chosen
Algorithm 13.4. The threshold values are chosen
using the absolute values of the test statistics from the 100 genes.
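The quantities just described can be sketched as a generic helper (an illustration of the plug-in estimate, not the lab's exact code): for each cutoff $c$, $R(c)$ counts observed statistics with $|T_j| \geq c$, and $\widehat{V}(c)$ averages the analogous count over the $B$ sets of permuted statistics.

```python
import numpy as np

def plugin_fdr(T, T_star, cutoffs):
    """Estimated FDR at each cutoff c: Vhat(c) / R(c)."""
    T = np.abs(np.asarray(T))
    T_star = np.abs(np.asarray(T_star))    # shape (m, B): permuted statistics
    B = T_star.shape[1]
    out = []
    for c in cutoffs:
        R = max((T >= c).sum(), 1)         # number of rejections (avoid 0-division)
        Vhat = (T_star >= c).sum() / B     # estimated false positives under the null
        out.append(Vhat / R)
    return np.array(out)
```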
```{python}
@@ -559,8 +559,8 @@ sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
```
The next line generates Figure~\ref{fig:labfdr}, which is similar
to Figure~\ref{Ch12:fig-plugin-fdr},
The next line generates Figure 13.11, which is similar
to Figure 13.9,
except that it is based on only a subset of the genes.
```{python}

File diff suppressed because one or more lines are too long


@@ -23,7 +23,8 @@ def setup_env(outdir,
uv_executable,
timeout,
kernel,
allow_errors):
allow_errors,
overwrite):
"""
Sets up a student environment for ISLP_labs.
@@ -45,6 +46,8 @@ def setup_env(outdir,
Kernel to use for running notebooks.
allow_errors : bool
Allow errors when running notebooks.
overwrite : bool
Save notebook outputs in place?
"""
repo_url = 'https://github.com/intro-stat-learning/ISLP_labs.git'
@@ -110,7 +113,8 @@ def setup_env(outdir,
if kernel:
nbconvert_command.extend(['--kernel', kernel])
nbconvert_command.append('--allow-errors')
if overwrite:
nbconvert_command.append('--inplace')
run_command(nbconvert_command, cwd=str(outdir))
else:
run_command([str(uv_bin / 'pip'), 'install', 'pytest', 'nbmake'], cwd=str(outdir))
@@ -128,6 +132,9 @@ def setup_env(outdir,
str(nbfile)]
if kernel:
pytest_command.append(f'--nbmake-kernel={kernel}')
if overwrite:
pytest_command.append('--overwrite')
# nbmake does not have --allow-errors in the same way as nbconvert
# If errors are not allowed, nbmake will fail on first error naturally
@@ -152,7 +159,8 @@ def main():
parser.add_argument('--uv-executable', default='uv', help='The `uv` executable to use (default: "uv")')
parser.add_argument('--timeout', type=int, default=3600, help='Timeout for running notebooks (default: 3600)')
parser.add_argument('--kernel', default=None, help='Kernel to use for running notebooks')
parser.add_argument('--allow-errors', action='store_true', help='Allow errors when running notebooks. For older commits, Ch02-statlearn-lab.ipynb raises exceptions. Since v2.2.1 these are handled by appropriate tags in the .ipynb cells.')
parser.add_argument('--allow-errors', action='store_true', default=False, help='Allow errors when running notebooks. For older commits, Ch02-statlearn-lab.ipynb raises exceptions. Since v2.2.1 these are handled by appropriate tags in the .ipynb cells.')
parser.add_argument('--overwrite', action='store_true', default=False, help='If True, store output of cells in notebook(s).')
parser.add_argument('nbfiles', nargs='*', help='Optional list of notebooks to run.')
@@ -165,7 +173,8 @@ def main():
args.uv_executable,
args.timeout,
args.kernel,
args.allow_errors)
args.allow_errors,
args.overwrite)
if __name__ == '__main__':
main()
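The flag plumbing added here reduces to a small pure function. The sketch below mirrors the `nbconvert` branch above in simplified form (it drops the timeout and working-directory handling, and gates `--allow-errors` on its flag):

```python
def build_nbconvert_command(notebook, kernel=None,
                            allow_errors=False, overwrite=False):
    """Assemble a jupyter-nbconvert invocation to execute one notebook.

    --inplace is appended only when the caller asks for cell outputs
    to be saved back into the notebook, matching the new --overwrite flag.
    """
    cmd = ['jupyter', 'nbconvert', '--to', 'notebook', '--execute', notebook]
    if kernel:
        cmd.extend(['--kernel', kernel])
    if allow_errors:
        cmd.append('--allow-errors')
    if overwrite:
        cmd.append('--inplace')
    return cmd
```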