pairing notebooks
@@ -1,3 +1,16 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---


# Chapter 7

@@ -17,7 +30,7 @@ from ISLP.models import (summarize,
                         ModelSpec as MS)
from statsmodels.stats.anova import anova_lm

```


We again collect the new imports
needed for this lab. Many of these are developed specifically for the
`ISLP` package.

@@ -38,7 +51,7 @@ from ISLP.pygam import (approx_lam,
                        anova as anova_gam)

```


## Polynomial Regression and Step Functions
We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.

@@ -49,7 +62,7 @@ y = Wage['wage']
age = Wage['age']

```


Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above.
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
@@ -61,8 +74,8 @@ M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)

```


This polynomial is constructed using the function `poly()`,
which creates
a special *transformer* `Poly()` (using `sklearn` terminology
@@ -83,7 +96,7 @@ on the second line, as well as in the plotting function developed below.
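The `Poly()` transformer lives in the `ISLP` package and its internals are not shown in this hunk. As a rough sketch of what an orthogonal polynomial basis is (an illustration, not the `ISLP` implementation), one can QR-decompose a raw Vandermonde matrix:

```python
import numpy as np

def ortho_poly(x, degree):
    """Orthonormal polynomial basis of `x`, built by QR-decomposing
    the raw Vandermonde matrix (illustrative, not ISLP's Poly())."""
    V = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    Q, _ = np.linalg.qr(V)
    return Q[:, 1:]   # drop the constant column; `degree` columns remain

x = np.linspace(18, 80, 100)   # a grid of hypothetical ages
P = ortho_poly(x, 4)
# The columns are mutually orthonormal and orthogonal to the intercept,
# which is what makes the coefficient tests behave as discussed later in this lab.
```

Because the columns are orthogonal, adding a higher-degree column leaves the lower-degree coefficients unchanged.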


We now create a grid of values for `age` at which we want
predictions.

@@ -151,7 +164,7 @@ plot_wage_fit(age_df,


With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
@@ -182,7 +195,7 @@ anova_lm(*[sm.OLS(y, X_).fit()
           for X_ in Xs])

```

Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be
@@ -207,8 +220,8 @@ that `poly()` creates orthogonal polynomials.
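The `*` here is ordinary Python argument unpacking, independent of `statsmodels`; a minimal illustration:

```python
def count_models(*models):
    """Accepts any number of positional arguments, like anova_lm()."""
    return len(models)

fits = ["fit0", "fit1", "fit2"]   # stand-ins for fitted model objects
print(count_models(*fits))        # the * unpacks the list into 3 arguments -> 3
```

Without the `*`, the list itself would be passed as a single argument.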
summarize(M)

```


Notice that the p-values are the same, and in fact the squares of
the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example:
@@ -217,8 +230,8 @@ the t-statistics are equal to the F-statistics from the
(-11.983)**2

```
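This equivalence is easy to check numerically on simulated data (hypothetical data, not the `Wage` set): for nested OLS models that differ by a single coefficient, the F-statistic computed from the two residual sums of squares equals the squared t-statistic of that coefficient in the larger model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + x + 0.5 * x**2 + rng.normal(scale=0.3, size=n)

# Nested design matrices: degree 1 vs. degree 2
X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([np.ones(n), x, x**2])

def fit_rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2), beta

rss1, _ = fit_rss(X1, y)
rss2, beta2 = fit_rss(X2, y)

# F-statistic for adding the quadratic term (one extra parameter)
df2 = n - X2.shape[1]
F = (rss1 - rss2) / (rss2 / df2)

# t-statistic of the quadratic coefficient in the larger model
sigma2 = rss2 / df2
se = np.sqrt(sigma2 * np.linalg.inv(X2.T @ X2)[2, 2])
t = beta2[2] / se
print(np.isclose(t**2, F))  # the identity t^2 = F holds exactly
```

The identity holds whether or not the columns are orthogonal; orthogonality is only needed for the single-model p-values to match the sequential ANOVA ones.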
However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three
@@ -233,8 +246,8 @@ XEs = [model.fit_transform(Wage)
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])

```


As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter 5.
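A minimal sketch of that cross-validation idea on simulated data, with `np.polyfit` standing in for the `poly()`-plus-`sm.OLS` pipeline (the data and fold structure here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-2, 2, n)
y = np.sin(x) + rng.normal(scale=0.2, size=n)

def cv_mse(x, y, degree, k=5):
    """k-fold cross-validated MSE for a polynomial fit of given degree."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[fold])
        errs.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errs)

scores = {d: cv_mse(x, y, d) for d in range(1, 6)}
best = min(scores, key=scores.get)   # degree with the lowest CV error
```

The degree minimizing the CV error is chosen instead of relying on sequential F-tests.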
@@ -254,8 +267,8 @@ B = glm.fit()
summarize(B)

```


Once again, we make predictions using the `get_prediction()` method.

```{python}
@@ -264,7 +277,7 @@ preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)

```

We now plot the estimated relationship.

```{python}
@@ -306,8 +319,8 @@ cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())

```


Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
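The binning that `pd.qcut()` performs, and why regressing on the resulting dummies fits a step function, can be sketched in plain `numpy` (simulated ages, not the `Wage` data):

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.integers(18, 80, 500).astype(float)
y = np.where(age > 40, 120.0, 80.0) + rng.normal(size=500)  # toy response

# Cutpoints at the 25%, 50% and 75% quantiles, as pd.qcut(age, 4) would pick
cuts = np.quantile(age, [0.25, 0.5, 0.75])
bins = np.digitize(age, cuts)        # region index 0..3 per observation
X = np.eye(4)[bins]                  # one-hot indicators, like pd.get_dummies

# OLS on disjoint indicators simply fits the mean response in each region
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Each fitted coefficient is the average of `y` within its region, which is exactly the step function the text describes.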
@@ -364,7 +377,7 @@ M = sm.OLS(y, Xbs).fit()
summarize(M)

```

Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
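The column `bs()` discards is not arbitrary: spline basis functions sum to one at every `x`, so a full basis plus an intercept column is exactly collinear. The same thing happens with one-hot indicator columns, which makes for a compact illustration (hypothetical data, not the spline basis itself):

```python
import numpy as np

rng = np.random.default_rng(3)
D = np.eye(4)[rng.integers(0, 4, 100)]   # four indicator columns; rows sum to 1

full = np.column_stack([np.ones(100), D])            # intercept + all 4 columns
dropped = np.column_stack([np.ones(100), D[:, 1:]])  # one column discarded

print(np.linalg.matrix_rank(full))     # 4 -- the 5 columns are collinear
print(np.linalg.matrix_rank(dropped))  # 4 -- same column span, no redundancy
```

Dropping one column loses nothing: the two matrices span the same space, but only the second has full column rank.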
@@ -422,7 +435,7 @@ deciding bin membership.


In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results.
@@ -440,7 +453,7 @@ plot_wage_fit(age_df,
               'Natural spline, df=5');

```

## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the
@@ -459,7 +472,7 @@ gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)

```

The `pygam` library generally expects a matrix of features, so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to infer the
size of that dimension based on the remaining entries of the shape tuple.
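A minimal illustration of that reshape:

```python
import numpy as np

age = np.array([25, 40, 55, 70])   # one-dimensional, shape (4,)
X_age = age.reshape(-1, 1)         # -1: infer this dimension -> shape (4, 1)

print(age.shape)     # (4,)
print(X_age.shape)   # (4, 1)
```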
@@ -482,7 +495,7 @@ ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='$\lambda$');

```

The `pygam` package can perform a search for an optimal smoothing parameter.

```{python}
@@ -495,7 +508,7 @@ ax.legend()
fig

```

Alternatively, we can fix the degrees of freedom of the smoothing
spline using a function included in the `ISLP.pygam` package. Below we
find a value of $\lambda$ that gives us roughly four degrees of
@@ -510,8 +523,8 @@ age_term.lam = lam_4
degrees_of_freedom(X_age, age_term)

```


Let’s vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
as the desired degrees of freedom plus one to account for the fact that these smoothing
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
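The helpers `approx_lam()` and `degrees_of_freedom()` are defined in `ISLP.pygam` and are not shown here. The idea they rely on: for a penalized linear smoother, the effective degrees of freedom is the trace of the hat ("smoother") matrix, which decreases monotonically in $\lambda$, so one can search over $\lambda$ to hit a target df. A sketch using an identity penalty for simplicity (a ridge penalty, not the roughness penalty smoothing splines actually use):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))   # a hypothetical 8-column basis matrix

def eff_df(X, lam):
    """Effective degrees of freedom: trace of X (X'X + lam*I)^{-1} X'."""
    H = X @ np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T
    return np.trace(H)

print(round(eff_df(X, 0.0)))   # 8 -- no penalty: one df per column
print(eff_df(X, 1e6) < 0.01)   # True -- heavy penalty shrinks df toward 0
```

Because `eff_df` is monotone in `lam`, a simple bisection finds the `lam` giving any target df between 0 and the number of columns.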
@@ -623,7 +636,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);

```

We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (7.16) are fit simultaneously, taking each other
@@ -715,7 +728,7 @@ gam_linear = LinearGAM(age_term +
gam_linear.fit(Xgam, y)

```

Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom.

@@ -762,7 +775,7 @@ We can make predictions from `gam` objects, just like from
Yhat = gam_full.predict(Xgam)

```

In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`.

@@ -773,7 +786,7 @@ gam_logit = LogisticGAM(age_term +
gam_logit.fit(Xgam, high_earn)

```
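The `high_earn` response is constructed outside this hunk; in the ISLP lab it is presumably the indicator that `wage` exceeds 250, the "high earner" threshold used in the text. A stand-in sketch with simulated wages (hypothetical values, not the `Wage` data):

```python
import numpy as np

rng = np.random.default_rng(5)
wage = rng.gamma(shape=9.0, scale=12.0, size=1000)   # stand-in wage values

# Binary response for the logistic GAM: high earner or not
high_earn = (wage > 250).astype(int)
```

`LogisticGAM` then models the log-odds of this 0/1 indicator as a sum of smooth terms.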


```{python}
fig, ax = subplots(figsize=(8, 8))
@@ -825,8 +838,8 @@ gam_logit_ = LogisticGAM(age_term +
gam_logit_.fit(Xgam_, high_earn_)

```


Let’s look at the effect of `education`, `year` and `age` on high earner status now that we’ve
removed those observations.

@@ -859,7 +872,7 @@ ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);

```


## Local Regression
We illustrate the use of local regression using the `lowess()`
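The `lowess()` call itself is truncated in this hunk (in the lab it comes from `statsmodels`). To show what local regression does, here is a minimal tricube-weighted local *linear* smoother in plain `numpy` — an illustration of the idea, not the `lowess` implementation:

```python
import numpy as np

def local_linear(x0, x, y, frac=0.2):
    """Fit a weighted least squares line around x0, using tricube
    weights on the nearest `frac` fraction of the observations."""
    k = int(np.ceil(frac * len(x)))
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]                         # span: k-th nearest distance
    w = np.clip(1 - (d / h) ** 3, 0, None) ** 3   # tricube weights, 0 beyond h
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)
fit = np.array([local_linear(x0, x, y) for x0 in x])  # tracks sin(x) closely
```

At each target point, only the nearest fraction of observations receives positive weight and a weighted least squares line is fit there; `frac` plays the role of the span parameter in `lowess()`.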