pairing notebooks

Jonathan Taylor
2023-08-20 19:41:01 -07:00
parent c82e9d5067
commit 058e89ef1c
22 changed files with 489 additions and 346 deletions


@@ -1,3 +1,16 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---
# Chapter 7
@@ -17,7 +30,7 @@ from ISLP.models import (summarize,
ModelSpec as MS)
from statsmodels.stats.anova import anova_lm
```
We again collect the new imports
needed for this lab. Many of these are developed specifically for the
`ISLP` package.
@@ -38,7 +51,7 @@ from ISLP.pygam import (approx_lam,
anova as anova_gam)
```
## Polynomial Regression and Step Functions
We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.
@@ -49,7 +62,7 @@ y = Wage['wage']
age = Wage['age']
```
Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above.
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
@@ -61,8 +74,8 @@ M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)
```
This polynomial is constructed using the function `poly()`,
which creates
a special *transformer* `Poly()` (using `sklearn` terminology
@@ -83,7 +96,7 @@ on the second line, as well as in the plotting function developed below.
We now create a grid of values for `age` at which we want
predictions.
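As a minimal sketch of such a grid, assuming a handful of made-up ages in place of the `Wage` data, one could space 100 points evenly between the smallest and largest observed value and wrap them in a data frame with an `age` column. The name `age_grid` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for Wage['age'] (assumption for illustration).
age = pd.Series([18, 25, 33, 41, 58, 80], name='age')

# 100 evenly spaced values covering the observed range of age.
age_grid = np.linspace(age.min(), age.max(), 100)

# Wrap the grid in a data frame whose column name matches the feature.
age_df = pd.DataFrame({'age': age_grid})
print(age_df.head())
```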
@@ -151,7 +164,7 @@ plot_wage_fit(age_df,
With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
@@ -182,7 +195,7 @@ anova_lm(*[sm.OLS(y, X_).fit()
for X_ in Xs])
```
Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be
@@ -207,8 +220,8 @@ that `poly()` creates orthogonal polynomials.
summarize(M)
```
Notice that the p-values are the same, and in fact the squares of
the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example:
@@ -217,8 +230,8 @@ the t-statistics are equal to the F-statistics from the
(-11.983)**2
```
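The relationship between the squared t-statistic and the F-statistic can also be checked numerically. The sketch below uses simulated data and plain `numpy` least squares rather than the `Wage` data and `statsmodels`; the helper `ols()` and all variable names are hypothetical. It compares the t-statistic of the last coefficient in the larger model with the F-statistic for adding that single term.

```python
import numpy as np

# Simulated data (assumption: not the Wage data used in the lab).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(18, 80, size=n)
y = 50 + 0.8 * x - 0.01 * x**2 + rng.normal(scale=5, size=n)

def ols(X, y):
    """Return coefficients, residual sum of squares, and (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss, XtX_inv

# Standardize x so the polynomial columns are well conditioned.
xc = (x - x.mean()) / x.std()
X_small = np.column_stack([np.ones(n), xc])          # linear model
X_big = np.column_stack([np.ones(n), xc, xc**2])     # adds a quadratic term

_, rss_small, _ = ols(X_small, y)
beta, rss_big, XtX_inv = ols(X_big, y)

# Residual variance is estimated from the larger model.
df_resid = n - X_big.shape[1]
sigma2 = rss_big / df_resid

# t-statistic for the quadratic coefficient in the larger model.
t_stat = beta[-1] / np.sqrt(sigma2 * XtX_inv[-1, -1])

# F-statistic for adding one term (numerator degrees of freedom = 1).
f_stat = (rss_small - rss_big) / sigma2

print(t_stat**2, f_stat)
```

The two printed values agree up to floating-point error, because adding a single term reduces the residual sum of squares by exactly the squared coefficient divided by its diagonal entry of $(X^TX)^{-1}$.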
However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three
@@ -233,8 +246,8 @@ XEs = [model.fit_transform(Wage)
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
```
As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter 5.
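A rough sketch of that cross-validation approach follows. It assumes simulated data and uses `np.polyfit()` with hand-rolled folds instead of the `sklearn` tools from Chapter 5; the function `cv_mse()` is a hypothetical helper, not part of `ISLP`.

```python
import numpy as np

# Simulated data with a truly quadratic signal (assumption for illustration).
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-1, 1, size=n)
y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.5, size=n)

def cv_mse(x, y, degree, k=5, seed=0):
    """Estimate test MSE of a degree-`degree` polynomial by k-fold CV."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        # Train on every observation not in the held-out fold.
        train = np.setdiff1d(idx, fold)
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Compare degrees 1 through 5 and pick the one with the lowest CV error.
scores = {d: cv_mse(x, y, d) for d in range(1, 6)}
best_degree = min(scores, key=scores.get)
print(scores)
print(best_degree)
```

Because the same `seed` is used for every degree, all candidate models are scored on identical folds, which keeps the comparison fair.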
@@ -254,8 +267,8 @@ B = glm.fit()
summarize(B)
```
Once again, we make predictions using the `get_prediction()` method.
```{python}
@@ -264,7 +277,7 @@ preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)
```
We now plot the estimated relationship.
```{python}
@@ -306,8 +319,8 @@ cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
```
Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
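The three options described above can be compared on a small made-up series. The ages and cutpoints below are assumptions chosen for illustration, not values from the `Wage` data.

```python
import pandas as pd

# Twelve made-up ages (assumption for illustration).
age = pd.Series([18, 22, 27, 31, 36, 40, 45, 52, 60, 71, 75, 80])

# Four bins with cutpoints at the 25%, 50% and 75% quantiles.
by_quartile = pd.qcut(age, 4)

# Our own quantiles, given as probabilities from 0 to 1.
by_custom_q = pd.qcut(age, [0, 0.1, 0.5, 1.0])

# Cutpoints on the scale of age itself, not based on quantiles.
by_fixed = pd.cut(age, [0, 30, 50, 100])

print(by_quartile.value_counts(sort=False))
print(by_custom_q.value_counts(sort=False))
print(by_fixed.value_counts(sort=False))
```

With `pd.qcut()` the bins hold roughly equal numbers of observations by construction, whereas `pd.cut()` fixes the bin edges and lets the counts vary.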
@@ -364,7 +377,7 @@ M = sm.OLS(y, Xbs).fit()
summarize(M)
```
Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
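The column count can be illustrated with a simpler basis. The sketch below does not use `bs()` or B-splines; it builds a truncated power basis for a cubic spline, which spans the same function space, and drops the constant column when no intercept is requested. The function name and the three knots are assumptions made for this example.

```python
import numpy as np

def truncated_power_basis(x, knots, intercept=False):
    """Cubic-spline basis: 1, x, x^2, x^3, and (x - k)^3_+ for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    X = np.column_stack(cols)
    # Without an intercept, drop the constant column.
    return X if intercept else X[:, 1:]

age = np.linspace(18, 80, 50)
knots = [25, 40, 60]   # three interior knots (assumed for illustration)

with_int = truncated_power_basis(age, knots, intercept=True)
without_int = truncated_power_basis(age, knots)
print(with_int.shape[1], without_int.shape[1])   # prints "7 6"
```

A cubic spline with three interior knots has 3 + 4 = 7 basis functions; removing one of them to make room for the model's overall intercept leaves 6.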
@@ -422,7 +435,7 @@ deciding bin membership.
In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results.
@@ -440,7 +453,7 @@ plot_wage_fit(age_df,
'Natural spline, df=5');
```
## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the
@@ -459,7 +472,7 @@ gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)
```
The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the
size of that dimension based on the remaining entries of the shape tuple.
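A small illustration of this reshape, using a made-up array in place of `age`:

```python
import numpy as np

# Made-up ages (assumption for illustration).
age = np.array([18, 25, 33, 41, 58, 80])
print(age.shape)      # one-dimensional: (6,)

# -1 asks numpy to work out the number of rows; 1 fixes a single column.
X_age = age.reshape((-1, 1))
print(X_age.shape)    # two-dimensional: (6, 1)
```

Since the array has six entries and the second dimension is fixed at one, `numpy` fills in six for the first dimension.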
@@ -482,7 +495,7 @@ ax.set_ylabel('Wage', fontsize=20);
ax.legend(title=r'$\lambda$');
```
The `pygam` package can perform a search for an optimal smoothing parameter.
```{python}
@@ -495,7 +508,7 @@ ax.legend()
fig
```
Alternatively, we can fix the degrees of freedom of the smoothing
spline using a function included in the `ISLP.pygam` package. Below we
find a value of $\lambda$ that gives us roughly four degrees of
@@ -510,8 +523,8 @@ age_term.lam = lam_4
degrees_of_freedom(X_age, age_term)
```
Let's vary the degrees of freedom in a plot similar to the one above. For each desired value `df`, we target
`df` plus one degrees of freedom, to account for the fact that these smoothing
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
@@ -623,7 +636,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```
We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (7.16) are fit simultaneously, taking each other
@@ -715,7 +728,7 @@ gam_linear = LinearGAM(age_term +
gam_linear.fit(Xgam, y)
```
Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom.
@@ -762,7 +775,7 @@ We can make predictions from `gam` objects, just like from
Yhat = gam_full.predict(Xgam)
```
In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`.
@@ -773,7 +786,7 @@ gam_logit = LogisticGAM(age_term +
gam_logit.fit(Xgam, high_earn)
```
```{python}
fig, ax = subplots(figsize=(8, 8))
@@ -825,8 +838,8 @@ gam_logit_ = LogisticGAM(age_term +
gam_logit_.fit(Xgam_, high_earn_)
```
Let's look at the effect of `education`, `year` and `age` on high earner status now that we've
removed those observations.
@@ -859,7 +872,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);
```
## Local Regression
We illustrate the use of local regression using the `lowess()`