trying with Rmd

2023-08-06 16:39:15 -07:00
parent 6742bd9598
commit dc38c6c526
24 changed files with 10640 additions and 24 deletions
--- a/Rmarkdown/Ch8-baggboost-lab.Rmd
+++ b/Rmarkdown/Ch8-baggboost-lab.Rmd
@@ -0,0 +1,567 @@
+---
+jupyter:
+  jupytext:
+    cell_metadata_filter: -all
+    formats: notebooks///ipynb,Rmarkdown///Rmd
+    main_language: python
+    text_representation:
+      extension: .Rmd
+      format_name: rmarkdown
+      format_version: '1.2'
+      jupytext_version: 1.14.7
+---
+
+
+# Chapter 8
+
+
+
+
+# Lab: Tree-Based Methods
+We import some of our usual libraries at this top
+level.
+
+```{python}
+import numpy as np
+import pandas as pd
+from matplotlib.pyplot import subplots
+import sklearn.model_selection as skm
+from ISLP import load_data, confusion_table
+from ISLP.models import ModelSpec as MS
+
+```
+We also  collect the new imports
+needed for this lab.
+
+```{python}
+from sklearn.tree import (DecisionTreeClassifier as DTC,
+                          DecisionTreeRegressor as DTR,
+                          plot_tree,
+                          export_text)
+from sklearn.metrics import (accuracy_score,
+                             log_loss)
+from sklearn.ensemble import \
+     (RandomForestRegressor as RF,
+      GradientBoostingRegressor as GBR)
+from ISLP.bart import BART
+
+```
+
+
+## Fitting Classification Trees
+
+
+We first use classification trees to analyze the  `Carseats`  data set.
+In these data, `Sales` is a continuous variable, and so we begin
+by recoding it as a binary variable. We use the `where()` 
+function to create a variable, called `High`, which takes on a
+value of `Yes` if the `Sales` variable exceeds 8, and takes
+on a value of `No` otherwise.
+
+```{python}
+Carseats = load_data('Carseats')
+High = np.where(Carseats.Sales > 8,
+                "Yes",
+                "No")
+
+```
+
+We now use `DecisionTreeClassifier()`  to fit a classification tree in
+order to predict `High` using all variables but `Sales`.
+To do so, we must form a model matrix as we did when fitting regression
+models.  
+
+```{python}
+model = MS(Carseats.columns.drop('Sales'), intercept=False)
+D = model.fit_transform(Carseats)
+feature_names = list(D.columns)
+X = np.asarray(D)
+
+```
+We have converted `D` from a data frame to an array `X`, which is needed in some of the analysis below. We also need the `feature_names` for annotating our plots later.
+
+There are several options needed to specify the  classifier,
+such as `max_depth` (how deep to grow the tree), `min_samples_split`
+(minimum number of observations in a node to be eligible for splitting)
+and `criterion` (whether to use Gini or cross-entropy as the split criterion).
+We also set `random_state` for reproducibility; ties in the split criterion are broken at random.
+
+```{python}
+clf = DTC(criterion='entropy',
+          max_depth=3,
+          random_state=0)        
+clf.fit(X, High)
+
+```
+
+
+In our discussion of qualitative features in Section 3.3,
+we noted that for a linear regression model such a feature could be
+represented by including a matrix of dummy variables (one-hot-encoding) in the model
+matrix, using the formula notation of `statsmodels`.
+As mentioned in Section 8.1, there is a more
+natural way to handle qualitative features when building a decision
+tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups.
+However, 
+the `sklearn` implementation of decision trees does not take
+advantage of this approach; instead it simply treats the one-hot-encoded levels as separate variables.
+
+```{python}
+accuracy_score(High, clf.predict(X))
+
+```
+
+
+With only the default arguments, the training error rate is
+21%.
+For classification trees, we can
+access the value of the deviance using `log_loss()`,
+\begin{equation*}
+\begin{split}
+-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk},
+\end{split}
+\end{equation*}
+where $n_{mk}$ is the number of observations in the $m$th terminal
+node that belong to the $k$th class.
+
+```{python}
+resid_dev = np.sum(log_loss(High, clf.predict_proba(X)))
+resid_dev
+
+```
+
+This is closely related to the *entropy*, defined in (8.7).
+A small deviance indicates a
+tree that provides a good fit to the (training) data.
+ 
+One of the most attractive properties of trees is that they can
+be graphically displayed. Here we use the `plot()`  function
+to display the tree structure.
+
+```{python}
+ax = subplots(figsize=(12,12))[1]
+plot_tree(clf,
+          feature_names=feature_names,
+          ax=ax);
+
+```
+The most important indicator of `Sales` appears to be `ShelveLoc`.
+
+We can see a text representation of the tree using
+`export_text()`, which displays the split
+criterion (e.g. `Price <= 92.5`) for each branch.
+For leaf nodes it shows the overall prediction  
+(`Yes` or `No`). 
+ We can also see the number of observations in that
+leaf that take on values of `Yes` and `No` by specifying  `show_weights=True`.
+
+```{python}
+print(export_text(clf,
+                  feature_names=feature_names,
+                  show_weights=True))
+
+```
+
+In order to properly evaluate the performance of a classification tree
+on these data, we must estimate the test error rather than simply
+computing the training error. We split the observations into a
+training set and a test set, build the tree using the training set,
+and evaluate its performance on the test data. This pattern is
+similar to that in Chapter 6, with the linear models
+replaced here by decision trees --- the code for validation
+is almost identical. This approach leads to correct predictions
+for 68.5% of the locations in the test data set.
+
+```{python}
+validation = skm.ShuffleSplit(n_splits=1,
+                              test_size=200,
+                              random_state=0)
+results = skm.cross_validate(clf,
+                             D,
+                             High,
+                             cv=validation)
+results['test_score']
+
+```
+    
+
+
+Next, we consider whether pruning the tree might lead to improved
+classification performance. We first split the data into a training and
+test set. We will use cross-validation to prune the tree on the training
+set, and then evaluate the performance of the pruned tree on the test
+set.
+
+```{python}
+(X_train,
+ X_test,
+ High_train,
+ High_test) = skm.train_test_split(X,
+                                   High,
+                                   test_size=0.5,
+                                   random_state=0)
+                                   
+```
+We first refit the full tree on the training set; here we do not set a `max_depth` parameter, since we will learn that through cross-validation.
+
+
+```{python}
+clf = DTC(criterion='entropy', random_state=0)
+clf.fit(X_train, High_train)
+accuracy_score(High_test, clf.predict(X_test))
+
+```
+Next we use the `cost_complexity_pruning_path()` method of
+`clf` to extract cost-complexity values. 
+
+```{python}
+ccp_path = clf.cost_complexity_pruning_path(X_train, High_train)
+kfold = skm.KFold(10,
+                  random_state=1,
+                  shuffle=True)
+
+```
+This yields a set of impurities and $\alpha$ values
+from which we can extract an optimal one by cross-validation.
+
+```{python}
+grid = skm.GridSearchCV(clf,
+                        {'ccp_alpha': ccp_path.ccp_alphas},
+                        refit=True,
+                        cv=kfold,
+                        scoring='accuracy')
+grid.fit(X_train, High_train)
+grid.best_score_
+
+```
+Let’s take a look at the pruned true.
+
+```{python}
+ax = subplots(figsize=(12, 12))[1]
+best_ = grid.best_estimator_
+plot_tree(best_,
+          feature_names=feature_names,
+          ax=ax);
+
+```
+This is quite a bushy tree. We could count the leaves, or query
+`best_` instead.
+
+```{python}
+best_.tree_.n_leaves
+
+```
+The tree with 30 terminal
+nodes results in the lowest cross-validation error rate, with an accuracy of
+68.5%. How well does this pruned tree perform on the test data set? Once
+again, we apply the `predict()`  function.
+
+```{python}
+print(accuracy_score(High_test,
+                     best_.predict(X_test)))
+confusion = confusion_table(best_.predict(X_test),
+                            High_test)
+confusion
+
+```
+
+
+Now 72.0% of the test observations are correctly classified, which is slightly worse than the error for the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at a cost of a slightly worse error. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance.
+
+  
+
+
+## Fitting Regression Trees
+Here we fit a regression tree to the  `Boston`  data set. The
+steps are similar to those for classification trees.
+
+```{python}
+Boston = load_data("Boston")
+model = MS(Boston.columns.drop('medv'), intercept=False)
+D = model.fit_transform(Boston)
+feature_names = list(D.columns)
+X = np.asarray(D)
+
+```
+
+First, we split the data into training and test sets, and fit the tree
+to the training data. Here we use 30% of the data for the test set.
+
+
+```{python}
+(X_train,
+ X_test,
+ y_train,
+ y_test) = skm.train_test_split(X,
+                                Boston['medv'],
+                                test_size=0.3,
+                                random_state=0)
+
+```
+
+Having formed  our training  and test data sets, we fit the regression tree.
+
+```{python}
+reg = DTR(max_depth=3)
+reg.fit(X_train, y_train)
+ax = subplots(figsize=(12,12))[1]
+plot_tree(reg,
+          feature_names=feature_names,
+          ax=ax);
+
+```
+
+The variable `lstat` measures the percentage of individuals with
+lower socioeconomic status. The tree indicates that lower
+values of `lstat` correspond to more expensive houses.
+The tree predicts a median house price of $12,042 for small-sized homes (`rm < 6.8`), in
+suburbs in which residents have low socioeconomic status (`lstat  > 14.4`) and the crime-rate is moderate (`crim > 5.8`).
+
+
+Now we use the cross-validation function to see whether pruning
+the tree will improve performance.
+
+```{python}
+ccp_path = reg.cost_complexity_pruning_path(X_train, y_train)
+kfold = skm.KFold(5,
+                  shuffle=True,
+                  random_state=10)
+grid = skm.GridSearchCV(reg,
+                        {'ccp_alpha': ccp_path.ccp_alphas},
+                        refit=True,
+                        cv=kfold,
+                        scoring='neg_mean_squared_error')
+G = grid.fit(X_train, y_train)
+
+```
+
+In keeping with the cross-validation results, we use the pruned tree
+to make predictions on the test set.
+
+```{python}
+best_ = grid.best_estimator_
+np.mean((y_test - best_.predict(X_test))**2)
+
+```
+
+
+In other words, the test set MSE associated with the regression tree
+is 28.07.  The square root of
+the MSE is therefore around
+5.30,
+indicating that this model leads to test predictions that are within around
+$5300
+of the true median home value for the suburb.
+
+Let’s plot the best tree to see how interpretable it is.
+
+```{python}
+ax = subplots(figsize=(12,12))[1]
+plot_tree(G.best_estimator_,
+          feature_names=feature_names,
+          ax=ax);
+
+```
+ 
+ 
+
+
+## Bagging and Random Forests
+
+
+Here we apply bagging and random forests to the `Boston` data, using
+the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall
+that bagging is simply a special case of a random forest with
+$m=p$. Therefore, the `RandomForestRegressor()`  function can be used to
+perform both bagging and random forests. We start with bagging.
+
+```{python}
+bag_boston = RF(max_features=X_train.shape[1], random_state=0)
+bag_boston.fit(X_train, y_train)
+
+```
+
+
+The argument `max_features` indicates that all 12 predictors should
+be considered for each split of the tree --- in other words, that
+bagging should be done.  How well does this bagged model perform on
+the test set?
+
+```{python}
+ax = subplots(figsize=(8,8))[1]
+y_hat_bag = bag_boston.predict(X_test)
+ax.scatter(y_hat_bag, y_test)
+np.mean((y_test - y_hat_bag)**2)
+
+```
+
+The test set MSE associated with the bagged regression tree is
+14.63, about half that obtained using an optimally-pruned single
+tree.  We could change the number of trees grown from the default of
+100 by
+using the `n_estimators` argument:
+
+```{python}
+bag_boston = RF(max_features=X_train.shape[1],
+                n_estimators=500,
+                random_state=0).fit(X_train, y_train)
+y_hat_bag = bag_boston.predict(X_test)
+np.mean((y_test - y_hat_bag)**2)
+```
+There is not much change. Bagging and random forests cannot overfit by
+increasing the number of trees, but can underfit if the number is too small.
+
+Growing a random forest proceeds in exactly the same way, except that
+we use a smaller value of the `max_features` argument. By default,
+`RandomForestRegressor()`  uses $p$ variables when building a random
+forest of regression trees (i.e. it defaults to bagging), and `RandomForestClassifier()` uses
+$\sqrt{p}$ variables when building a
+random forest of classification trees. Here we use `max_features=6`.
+
+```{python}
+RF_boston = RF(max_features=6,
+               random_state=0).fit(X_train, y_train)
+y_hat_RF = RF_boston.predict(X_test)
+np.mean((y_test - y_hat_RF)**2)
+
+```
+
+
+The test set MSE is 20.04;
+this indicates that random forests did somewhat worse than bagging
+in this case. Extracting the `feature_importances_` values from the fitted model, we can view the
+importance of each variable.
+
+```{python}
+feature_imp = pd.DataFrame(
+    {'importance':RF_boston.feature_importances_},
+    index=feature_names)
+feature_imp.sort_values(by='importance', ascending=False)
+```
+ This
+is a relative measure of the total decrease in node impurity that results from
+splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data). 
+
+The results indicate that across all of the trees considered in the
+random forest, the wealth level of the community (`lstat`) and the
+house size (`rm`) are by far the two most important variables.
+
+  
+
+
+## Boosting
+
+
+Here we use `GradientBoostingRegressor()` from `sklearn.ensemble`
+to fit boosted regression trees to the `Boston` data
+set. For classification we would  use `GradientBoostingClassifier()`.
+The argument `n_estimators=5000`
+indicates that we want 5000 trees, and the option
+`max_depth=3` limits the depth of each tree. The
+argument `learning_rate` is the $\lambda$
+mentioned earlier in the description of boosting.
+
+```{python}
+boost_boston = GBR(n_estimators=5000,
+                   learning_rate=0.001,
+                   max_depth=3,
+                   random_state=0)
+boost_boston.fit(X_train, y_train)
+
+```
+
+We can see how the training error decreases with the `train_score_` attribute.
+To get an idea of how the test error decreases we can use the
+`staged_predict()` method to get the predicted values along the path.
+
+```{python}
+test_error = np.zeros_like(boost_boston.train_score_)
+for idx, y_ in enumerate(boost_boston.staged_predict(X_test)):
+   test_error[idx] = np.mean((y_test - y_)**2)
+
+plot_idx = np.arange(boost_boston.train_score_.shape[0])
+ax = subplots(figsize=(8,8))[1]
+ax.plot(plot_idx,
+        boost_boston.train_score_,
+        'b',
+        label='Training')
+ax.plot(plot_idx,
+        test_error,
+        'r',
+        label='Test')
+ax.legend();
+
+```
+
+We now use the boosted model to predict `medv` on the test set:
+
+```{python}
+y_hat_boost = boost_boston.predict(X_test);
+np.mean((y_test - y_hat_boost)**2)
+
+```
+
+ The test MSE obtained is 14.48,
+similar to the test MSE for bagging. If we want to, we can
+perform boosting with a different value of the shrinkage parameter
+$\lambda$ in  (8.10). The default value is 0.001, but
+this is easily modified.  Here we take $\lambda=0.2$.
+
+```{python}
+boost_boston = GBR(n_estimators=5000,
+                   learning_rate=0.2,
+                   max_depth=3,
+                   random_state=0)
+boost_boston.fit(X_train,
+                 y_train)
+y_hat_boost = boost_boston.predict(X_test);
+np.mean((y_test - y_hat_boost)**2)
+
+```
+
+
+In this case, using $\lambda=0.2$ leads to a almost the same test MSE
+as when using $\lambda=0.001$.
+
+ 
+
+
+## Bayesian Additive Regression Trees
+
+
+In this section we demonstrate a  `Python` implementation of BART found in the
+`ISLP.bart` package. We fit a  model
+to the `Boston` housing data set. This `BART()` estimator is
+designed for quantitative outcome variables, though other implementations are available for
+fitting logistic and probit models to categorical outcomes.
+
+```{python}
+bart_boston = BART(random_state=0, burnin=5, ndraw=15)
+bart_boston.fit(X_train, y_train)
+
+```
+
+
+On this data set, with this split into test and training, we see that the test error of BART is similar to that of  random forest.
+
+```{python}
+yhat_test = bart_boston.predict(X_test.astype(np.float32))
+np.mean((y_test - yhat_test)**2)
+
+```
+
+
+We can check how many times each variable appeared in the collection of trees.
+This gives a summary similar to the variable importance plot for boosting and random forests.
+
+```{python}
+var_inclusion = pd.Series(bart_boston.variable_inclusion_.mean(0),
+                               index=D.columns)
+var_inclusion
+
+```
+    
+  
+ 
+
+