---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 8

# Lab: Tree-Based Methods

We import some of our usual libraries at this top
level.

```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS
```

We also collect the new imports
needed for this lab.

```{python}
from sklearn.tree import (DecisionTreeClassifier as DTC,
                          DecisionTreeRegressor as DTR,
                          plot_tree,
                          export_text)
from sklearn.metrics import (accuracy_score,
                             log_loss)
from sklearn.ensemble import \
     (RandomForestRegressor as RF,
      GradientBoostingRegressor as GBR)
from ISLP.bart import BART
```

## Fitting Classification Trees

We first use classification trees to analyze the `Carseats` data set.
In these data, `Sales` is a continuous variable, and so we begin
by recoding it as a binary variable. We use the `where()`
function to create a variable, called `High`, which takes on a
value of `Yes` if the `Sales` variable exceeds 8, and takes
on a value of `No` otherwise.

```{python}
Carseats = load_data('Carseats')
High = np.where(Carseats.Sales > 8,
                "Yes",
                "No")
```

We now use `DecisionTreeClassifier()` to fit a classification tree in
order to predict `High` using all variables but `Sales`.
To do so, we must form a model matrix as we did when fitting regression
models.

```{python}
model = MS(Carseats.columns.drop('Sales'), intercept=False)
D = model.fit_transform(Carseats)
feature_names = list(D.columns)
X = np.asarray(D)
```

We have converted `D` from a data frame to an array `X`, which is needed in some of the analysis below. We also need the `feature_names` for annotating our plots later.

There are several options needed to specify the classifier,
such as `max_depth` (how deep to grow the tree), `min_samples_split`
(minimum number of observations in a node to be eligible for splitting)
and `criterion` (whether to use Gini or cross-entropy as the split criterion).
We also set `random_state` for reproducibility; ties in the split criterion are broken at random.

```{python}
clf = DTC(criterion='entropy',
          max_depth=3,
          random_state=0)
clf.fit(X, High)
```

In our discussion of qualitative features in Section 3.3,
we noted that for a linear regression model such a feature could be
represented by including a matrix of dummy variables (one-hot-encoding) in the model
matrix, using the formula notation of `statsmodels`.
As mentioned in Section 8.1, there is a more
natural way to handle qualitative features when building a decision
tree that does not require such dummy variables; each split amounts to partitioning the levels into two groups.
However,
the `sklearn` implementation of decision trees does not take
advantage of this approach; instead it simply treats the one-hot-encoded levels as separate variables.
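
As a quick illustration (nothing new is fit here), we can list the dummy columns that the model matrix contains for qualitative features such as `ShelveLoc` and `Urban`:

```{python}
# One-hot-encoded level indicators present in the model matrix
[name for name in feature_names if 'ShelveLoc' in name or 'Urban' in name]
```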

```{python}
accuracy_score(High, clf.predict(X))
```

With these arguments, the training error rate is
21%.
For classification trees, we can
access the value of the deviance using `log_loss()`,
\begin{equation*}
-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk},
\end{equation*}
where $n_{mk}$ is the number of observations in the $m$th terminal
node that belong to the $k$th class.

```{python}
resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
resid_dev
```

This is closely related to the *entropy*, defined in (8.7).
A small deviance indicates a
tree that provides a good fit to the (training) data.
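
Note that `log_loss()` returns the average negative log-likelihood over the $n$ observations, so the total deviance displayed above equals $2n$ times `resid_dev`. As a minimal illustrative cross-check (the names `leaf`, `counts`, `props` and `deviance` below are introduced only for this sketch), we can recompute it from the leaf counts:

```{python}
leaf = clf.apply(X)                            # terminal node of each observation
counts = pd.crosstab(leaf, High).values        # n_mk: class counts per terminal node
props = counts / counts.sum(1, keepdims=True)  # p_mk: class proportions per node
deviance = -2 * np.sum(counts * np.log(np.where(counts > 0, props, 1)))
deviance, 2 * len(High) * resid_dev
```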

One of the most attractive properties of trees is that they can
be graphically displayed. Here we use the `plot_tree()` function
to display the tree structure.

```{python}
ax = subplots(figsize=(12,12))[1]
plot_tree(clf,
          feature_names=feature_names,
          ax=ax);
```

The most important indicator of `Sales` appears to be `ShelveLoc`.

We can see a text representation of the tree using
`export_text()`, which displays the split
criterion (e.g. `Price <= 92.5`) for each branch.
For leaf nodes it shows the overall prediction
(`Yes` or `No`).
We can also see the number of observations in that
leaf that take on values of `Yes` and `No` by specifying `show_weights=True`.

```{python}
print(export_text(clf,
                  feature_names=feature_names,
                  show_weights=True))
```

In order to properly evaluate the performance of a classification tree
on these data, we must estimate the test error rather than simply
computing the training error. We split the observations into a
training set and a test set, build the tree using the training set,
and evaluate its performance on the test data. This pattern is
similar to that in Chapter 6, with the linear models
replaced here by decision trees --- the code for validation
is almost identical. This approach leads to correct predictions
for 68.5% of the locations in the test data set.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=200,
                              random_state=0)
results = skm.cross_validate(clf,
                             D,
                             High,
                             cv=validation)
results['test_score']
```

Next, we consider whether pruning the tree might lead to improved
classification performance. We first split the data into a training and
test set. We will use cross-validation to prune the tree on the training
set, and then evaluate the performance of the pruned tree on the test
set.

```{python}
(X_train,
 X_test,
 High_train,
 High_test) = skm.train_test_split(X,
                                   High,
                                   test_size=0.5,
                                   random_state=0)
```

We first refit the full tree on the training set; here we do not set a `max_depth` parameter, since we will learn that through cross-validation.

```{python}
clf = DTC(criterion='entropy', random_state=0)
clf.fit(X_train, High_train)
accuracy_score(High_test, clf.predict(X_test))
```

Next we use the `cost_complexity_pruning_path()` method of
`clf` to extract cost-complexity values.

```{python}
ccp_path = clf.cost_complexity_pruning_path(X_train, High_train)
kfold = skm.KFold(10,
                  random_state=1,
                  shuffle=True)
```

This yields a set of impurities and $\alpha$ values
from which we can extract an optimal one by cross-validation.
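
Purely as an illustration, we can peek at the first few $\alpha$ values and the corresponding total leaf impurities along this path; as $\alpha$ grows the tree is pruned back and the impurity rises. The grid search below then selects the $\alpha$ with the best cross-validated accuracy.

```{python}
# First few (alpha, total impurity) pairs on the pruning path
list(zip(ccp_path.ccp_alphas[:5], ccp_path.impurities[:5]))
```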

```{python}
grid = skm.GridSearchCV(clf,
                        {'ccp_alpha': ccp_path.ccp_alphas},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X_train, High_train)
grid.best_score_
```

Let’s take a look at the pruned tree.

```{python}
ax = subplots(figsize=(12, 12))[1]
best_ = grid.best_estimator_
plot_tree(best_,
          feature_names=feature_names,
          ax=ax);
```

This is quite a bushy tree. We could count the leaves, or query
`best_` instead.

```{python}
best_.tree_.n_leaves
```

The tree with 30 terminal
nodes results in the lowest cross-validation error rate, with an accuracy of
68.5%. How well does this pruned tree perform on the test data set? Once
again, we apply the `predict()` function.

```{python}
print(accuracy_score(High_test,
                     best_.predict(X_test)))
confusion = confusion_table(best_.predict(X_test),
                            High_test)
confusion
```

Now 72.0% of the test observations are correctly classified, which is slightly worse than the accuracy of the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at the cost of slightly lower accuracy. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.
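
As a quick illustrative check of the leaf counts quoted above, we can compare the full tree `clf` fit earlier on the training set with the pruned tree `best_`:

```{python}
# Leaf counts: full tree vs. cross-validation-pruned tree
clf.tree_.n_leaves, best_.tree_.n_leaves
```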

## Fitting Regression Trees

Here we fit a regression tree to the `Boston` data set. The
steps are similar to those for classification trees.

```{python}
Boston = load_data("Boston")
model = MS(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)
```

First, we split the data into training and test sets, and fit the tree
to the training data. Here we use 30% of the data for the test set.

```{python}
(X_train,
 X_test,
 y_train,
 y_test) = skm.train_test_split(X,
                                Boston['medv'],
                                test_size=0.3,
                                random_state=0)
```

Having formed our training and test data sets, we fit the regression tree.

```{python}
reg = DTR(max_depth=3)
reg.fit(X_train, y_train)
ax = subplots(figsize=(12,12))[1]
plot_tree(reg,
          feature_names=feature_names,
          ax=ax);
```

The variable `lstat` measures the percentage of individuals with
lower socioeconomic status. The tree indicates that lower
values of `lstat` correspond to more expensive houses.
The tree predicts a median house price of $12,042 for small-sized homes (`rm < 6.8`), in
suburbs in which residents have low socioeconomic status (`lstat > 14.4`) and the crime rate is moderate (`crim > 5.8`).
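
For a text view of the same splits (purely illustrative), `export_text()` also works for regression trees; the leaves then show the predicted mean of `medv` rather than a class label.

```{python}
# Text representation of the fitted regression tree
print(export_text(reg, feature_names=feature_names))
```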

Now we use the cross-validation function to see whether pruning
the tree will improve performance.

```{python}
ccp_path = reg.cost_complexity_pruning_path(X_train, y_train)
kfold = skm.KFold(5,
                  shuffle=True,
                  random_state=10)
grid = skm.GridSearchCV(reg,
                        {'ccp_alpha': ccp_path.ccp_alphas},
                        refit=True,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
G = grid.fit(X_train, y_train)
```

In keeping with the cross-validation results, we use the pruned tree
to make predictions on the test set.

```{python}
best_ = grid.best_estimator_
np.mean((y_test - best_.predict(X_test))**2)
```

In other words, the test set MSE associated with the regression tree
is 28.07. The square root of
the MSE is therefore around
5.30,
indicating that this model leads to test predictions that are within around
$5,300
of the true median home value for the suburb.
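
(Recall that `medv` is recorded in thousands of dollars.) The line below simply makes the square-root calculation explicit; it introduces nothing new.

```{python}
# Root mean squared error of the pruned tree on the test set
np.sqrt(np.mean((y_test - best_.predict(X_test))**2))
```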

Let’s plot the best tree to see how interpretable it is.

```{python}
ax = subplots(figsize=(12,12))[1]
plot_tree(G.best_estimator_,
          feature_names=feature_names,
          ax=ax);
```

## Bagging and Random Forests

Here we apply bagging and random forests to the `Boston` data, using
the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
that bagging is simply a special case of a random forest with
$m=p$. Therefore, the `RandomForestRegressor()` function can be used to
perform both bagging and random forests. We start with bagging.

```{python}
bag_boston = RF(max_features=X_train.shape[1], random_state=0)
bag_boston.fit(X_train, y_train)
```

The argument `max_features` indicates that all 12 predictors should
be considered for each split of the tree --- in other words, that
bagging should be done. How well does this bagged model perform on
the test set?

```{python}
ax = subplots(figsize=(8,8))[1]
y_hat_bag = bag_boston.predict(X_test)
ax.scatter(y_hat_bag, y_test)
np.mean((y_test - y_hat_bag)**2)
```

The test set MSE associated with the bagged regression tree is
14.63, about half that obtained using an optimally-pruned single
tree. We could change the number of trees grown from the default of
100 by
using the `n_estimators` argument:

```{python}
bag_boston = RF(max_features=X_train.shape[1],
                n_estimators=500,
                random_state=0).fit(X_train, y_train)
y_hat_bag = bag_boston.predict(X_test)
np.mean((y_test - y_hat_bag)**2)
```

There is not much change. Bagging and random forests cannot overfit by
increasing the number of trees, but can underfit if the number is too small.
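
To make this concrete, here is a small illustrative sweep (the loop variable `B` and the throwaway `fit` object are introduced only for this sketch) over the number of bagged trees; the test MSE settles down rather than deteriorating as trees are added.

```{python}
# Test MSE of the bagged model for a few values of n_estimators
for B in [10, 50, 100, 500]:
    fit = RF(max_features=X_train.shape[1],
             n_estimators=B,
             random_state=0).fit(X_train, y_train)
    print(B, np.mean((y_test - fit.predict(X_test))**2))
```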

Growing a random forest proceeds in exactly the same way, except that
we use a smaller value of the `max_features` argument. By default,
`RandomForestRegressor()` uses $p$ variables when building a random
forest of regression trees (i.e. it defaults to bagging), and `RandomForestClassifier()` uses
$\sqrt{p}$ variables when building a
random forest of classification trees. Here we use `max_features=6`.

```{python}
RF_boston = RF(max_features=6,
               random_state=0).fit(X_train, y_train)
y_hat_RF = RF_boston.predict(X_test)
np.mean((y_test - y_hat_RF)**2)
```

The test set MSE is 20.04;
this indicates that random forests did somewhat worse than bagging
in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
importance of each variable.

```{python}
feature_imp = pd.DataFrame(
    {'importance':RF_boston.feature_importances_},
    index=feature_names)
feature_imp.sort_values(by='importance', ascending=False)
```

This
is a relative measure of the total decrease in node impurity that results from
splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data).

The results indicate that across all of the trees considered in the
random forest, the wealth level of the community (`lstat`) and the
house size (`rm`) are by far the two most important variables.
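
For a display similar in spirit to Figure 8.9, we can also plot these importances; this is an illustrative sketch that reuses the `feature_imp` data frame created above (`ordered` is a new name introduced here).

```{python}
# Horizontal bar plot of the relative variable importances
ordered = feature_imp.sort_values(by='importance')
ax = subplots(figsize=(8,8))[1]
ax.barh(ordered.index, ordered['importance'])
ax.set_xlabel('Relative importance');
```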

## Boosting

Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
to fit boosted regression trees to the `Boston` data
set. For classification we would use `GradientBoostingClassifier()`.
The argument `n_estimators=5000`
indicates that we want 5000 trees, and the option
`max_depth=3` limits the depth of each tree. The
argument `learning_rate` is the $\lambda$
mentioned earlier in the description of boosting.

```{python}
boost_boston = GBR(n_estimators=5000,
                   learning_rate=0.001,
                   max_depth=3,
                   random_state=0)
boost_boston.fit(X_train, y_train)
```

We can see how the training error decreases with the `train_score_` attribute.
To get an idea of how the test error decreases we can use the
`staged_predict()` method to get the predicted values along the path.

```{python}
test_error = np.zeros_like(boost_boston.train_score_)
for idx, y_ in enumerate(boost_boston.staged_predict(X_test)):
    test_error[idx] = np.mean((y_test - y_)**2)

plot_idx = np.arange(boost_boston.train_score_.shape[0])
ax = subplots(figsize=(8,8))[1]
ax.plot(plot_idx,
        boost_boston.train_score_,
        'b',
        label='Training')
ax.plot(plot_idx,
        test_error,
        'r',
        label='Test')
ax.legend();
```

We now use the boosted model to predict `medv` on the test set:

```{python}
y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)
```

The test MSE obtained is 14.48,
similar to the test MSE for bagging. If we want to, we can
perform boosting with a different value of the shrinkage parameter
$\lambda$ in (8.10). We used $\lambda=0.001$ above, but
this is easily modified. Here we take $\lambda=0.2$.

```{python}
boost_boston = GBR(n_estimators=5000,
                   learning_rate=0.2,
                   max_depth=3,
                   random_state=0)
boost_boston.fit(X_train,
                 y_train)
y_hat_boost = boost_boston.predict(X_test);
np.mean((y_test - y_hat_boost)**2)
```

In this case, using $\lambda=0.2$ leads to almost the same test MSE
as when using $\lambda=0.001$.

## Bayesian Additive Regression Trees

In this section we demonstrate a `Python` implementation of BART found in the
`ISLP.bart` package. We fit a model
to the `Boston` housing data set. This `BART()` estimator is
designed for quantitative outcome variables, though other implementations are available for
fitting logistic and probit models to categorical outcomes.

```{python}
bart_boston = BART(random_state=0, burnin=5, ndraw=15)
bart_boston.fit(X_train, y_train)
```

On this data set, with this split into test and training, we see that the test error of BART is similar to that of the random forest.

```{python}
yhat_test = bart_boston.predict(X_test.astype(np.float32))
np.mean((y_test - yhat_test)**2)
```

We can check how many times each variable appeared in the collection of trees.
This gives a summary similar to the variable importance plot for boosting and random forests.

```{python}
var_inclusion = pd.Series(bart_boston.variable_inclusion_.mean(0),
                          index=D.columns)
var_inclusion
```