renamed to correct order on github
This commit is contained in:
619
Ch03-linreg-lab.Rmd
Normal file
619
Ch03-linreg-lab.Rmd
Normal file
@@ -0,0 +1,619 @@
|
||||
---
|
||||
jupyter:
|
||||
jupytext:
|
||||
cell_metadata_filter: -all
|
||||
formats: ipynb,Rmd
|
||||
main_language: python
|
||||
text_representation:
|
||||
extension: .Rmd
|
||||
format_name: rmarkdown
|
||||
format_version: '1.2'
|
||||
jupytext_version: 1.14.7
|
||||
---
|
||||
|
||||
|
||||
# Chapter 3
|
||||
|
||||
|
||||
|
||||
|
||||
# Lab: Linear Regression
|
||||
|
||||
## Importing packages
|
||||
We import our standard libraries at this top
|
||||
level.
|
||||
|
||||
```{python}
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from matplotlib.pyplot import subplots
|
||||
|
||||
```
|
||||
|
||||
|
||||
### New imports
|
||||
Throughout this lab we will introduce new functions and libraries. However,
|
||||
we will import them here to emphasize these are the new
|
||||
code objects in this lab. Keeping imports near the top
|
||||
of a notebook makes the code more readable, since scanning the first few
|
||||
lines tells us what libraries are used.
|
||||
|
||||
```{python}
|
||||
import statsmodels.api as sm
|
||||
|
||||
```
|
||||
We will provide relevant details about the
|
||||
functions below as they are needed.
|
||||
|
||||
Besides importing whole modules, it is also possible
|
||||
to import only a few items from a given module. This
|
||||
will help keep the *namespace* clean.
|
||||
We will use a few specific objects from the `statsmodels` package
|
||||
which we import here.
|
||||
|
||||
```{python}
|
||||
from statsmodels.stats.outliers_influence \
|
||||
import variance_inflation_factor as VIF
|
||||
from statsmodels.stats.anova import anova_lm
|
||||
|
||||
```
|
||||
|
||||
As one of the import statements above is quite a long line, we inserted a line break `\` to
|
||||
ease readability.
|
||||
|
||||
We will also use some functions written for the labs in this book in the `ISLP`
|
||||
package.
|
||||
|
||||
```{python}
|
||||
from ISLP import load_data
|
||||
from ISLP.models import (ModelSpec as MS,
|
||||
summarize,
|
||||
poly)
|
||||
|
||||
```
|
||||
|
||||
### Inspecting Objects and Namespaces
|
||||
The
|
||||
function `dir()`
|
||||
provides a list of
|
||||
objects in a namespace.
|
||||
|
||||
```{python}
|
||||
dir()
|
||||
|
||||
```
|
||||
This shows you everything that `Python` can find at the top level.
|
||||
There are certain objects like `__builtins__` that contain references to built-in
|
||||
functions like `print()`.
|
||||
|
||||
Every python object has its own notion of
|
||||
namespace, also accessible with `dir()`. This will include
|
||||
both the attributes of the object
|
||||
as well as any methods associated with it. For instance, we see `'sum'` in the listing for an
|
||||
array.
|
||||
|
||||
```{python}
|
||||
A = np.array([3,5,11])
|
||||
dir(A)
|
||||
|
||||
```
|
||||
This indicates that the object `A.sum` exists. In this case it is a method
|
||||
that can be used to compute the sum of the array `A` as can be seen by typing `A.sum?`.
|
||||
|
||||
```{python}
|
||||
A.sum()
|
||||
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Simple Linear Regression
|
||||
In this section we will construct model
|
||||
matrices (also called design matrices) using the `ModelSpec()` transform from `ISLP.models`.
|
||||
|
||||
We will use the `Boston` housing data set, which is contained in the `ISLP` package. The `Boston` dataset records `medv` (median house value) for $506$ neighborhoods
|
||||
around Boston. We will build a regression model to predict `medv` using $13$
|
||||
predictors such as `rmvar` (average number of rooms per house),
|
||||
`age` (proportion of owner-occupied units built prior to 1940), and `lstat` (percent of
|
||||
households with low socioeconomic status). We will use `statsmodels` for this
|
||||
task, a `Python` package that implements several commonly used
|
||||
regression methods.
|
||||
|
||||
We have included a simple loading function `load_data()` in the
|
||||
`ISLP` package:
|
||||
|
||||
```{python}
|
||||
Boston = load_data("Boston")
|
||||
Boston.columns
|
||||
|
||||
```
|
||||
|
||||
Type `Boston?` to find out more about these data.
|
||||
|
||||
We start by using the `sm.OLS()` function to fit a
|
||||
simple linear regression model. Our response will be
|
||||
`medv` and `lstat` will be the single predictor.
|
||||
For this model, we can create the model matrix by hand.
|
||||
|
||||
|
||||
```{python}
|
||||
X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
|
||||
'lstat': Boston['lstat']})
|
||||
X[:4]
|
||||
|
||||
```
|
||||
|
||||
We extract the response, and fit the model.
|
||||
|
||||
```{python}
|
||||
y = Boston['medv']
|
||||
model = sm.OLS(y, X)
|
||||
results = model.fit()
|
||||
|
||||
```
|
||||
Note that `sm.OLS()` does
|
||||
not fit the model; it specifies the model, and then `model.fit()` does the actual fitting.
|
||||
|
||||
Our `ISLP` function `summarize()` produces a simple table of the parameter estimates,
|
||||
their standard errors, t-statistics and p-values.
|
||||
The function takes a single argument, such as the object `results`
|
||||
returned here by the `fit`
|
||||
method, and returns such a summary.
|
||||
|
||||
```{python}
|
||||
summarize(results)
|
||||
|
||||
```
|
||||
|
||||
|
||||
Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix~`X`.
|
||||
### Using Transformations: Fit and Transform
|
||||
Our model above has a single predictor, and constructing `X` was straightforward.
|
||||
In practice we often fit models with more than one predictor, typically selected from an array or data frame.
|
||||
We may wish to introduce transformations to the variables before fitting the model, specify interactions between variables, and expand some particular variables into sets of variables (e.g. polynomials).
|
||||
The `sklearn` package has a particular notion
|
||||
for this type of task: a *transform*. A transform is an object
|
||||
that is created with some parameters as arguments. The
|
||||
object has two main methods: `fit()` and `transform()`.
|
||||
|
||||
We provide a general approach for specifying models and constructing
|
||||
the model matrix through the transform `ModelSpec()` in the `ISLP` library.
|
||||
`ModelSpec()`
|
||||
(renamed `MS()` in the preamble) creates a
|
||||
transform object, and then a pair of methods
|
||||
`transform()` and `fit()` are used to construct a
|
||||
corresponding model matrix.
|
||||
|
||||
We first describe this process for our simple regression model using a single predictor `lstat` in
|
||||
the `Boston` data frame, but will use it repeatedly in more
|
||||
complex tasks in this and other labs in this book.
|
||||
In our case the transform is created by the expression
|
||||
`design = MS(['lstat'])`.
|
||||
|
||||
The `fit()` method takes the original array and may do some
|
||||
initial computations on it, as specified in the transform object.
|
||||
For example, it may compute means and standard deviations for centering and scaling.
|
||||
The `transform()`
|
||||
method applies the fitted transformation to the array of data, and produces the model matrix.
|
||||
|
||||
|
||||
```{python}
|
||||
design = MS(['lstat'])
|
||||
design = design.fit(Boston)
|
||||
X = design.transform(Boston)
|
||||
X[:4]
|
||||
```
|
||||
In this simple case, the `fit()` method does very little; it simply checks that the variable `'lstat'` specified in `design` exists in `Boston`. Then `transform()` constructs the model matrix with two columns: an `intercept` and the variable `lstat`.
|
||||
|
||||
These two operations can be combined with the
|
||||
`fit_transform()` method.
|
||||
|
||||
```{python}
|
||||
design = MS(['lstat'])
|
||||
X = design.fit_transform(Boston)
|
||||
X[:4]
|
||||
```
|
||||
Note that, as in the previous code chunk when the two steps were done separately, the `design` object is changed as a result of the `fit()` operation. The power of this pipeline will become clearer when we fit more complex models that involve interactions and transformations.
|
||||
|
||||
|
||||
Let's return to our fitted regression model.
|
||||
The object
|
||||
`results` has several methods that can be used for inference.
|
||||
We already presented a function `summarize()` for showing the essentials of the fit.
|
||||
For a full and somewhat exhaustive summary of the fit, we can use the `summary()`
|
||||
method.
|
||||
|
||||
```{python}
|
||||
results.summary()
|
||||
|
||||
```
|
||||
|
||||
The fitted coefficients can also be retrieved as the
|
||||
`params` attribute of `results`.
|
||||
|
||||
```{python}
|
||||
results.params
|
||||
|
||||
```
|
||||
|
||||
|
||||
The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and
|
||||
prediction intervals for the prediction of `medv` for given values of `lstat`.
|
||||
|
||||
We first create a new data frame, in this case containing only the variable `lstat`, with the values for this variable at which we wish to make predictions.
|
||||
We then use the `transform()` method of `design` to create the corresponding model matrix.
|
||||
|
||||
```{python}
|
||||
new_df = pd.DataFrame({'lstat':[5, 10, 15]})
|
||||
newX = design.transform(new_df)
|
||||
newX
|
||||
|
||||
```
|
||||
|
||||
Next we compute the predictions at `newX`, and view them by extracting the `predicted_mean` attribute.
|
||||
|
||||
```{python}
|
||||
new_predictions = results.get_prediction(newX);
|
||||
new_predictions.predicted_mean
|
||||
|
||||
```
|
||||
We can produce confidence intervals for the predicted values.
|
||||
|
||||
```{python}
|
||||
new_predictions.conf_int(alpha=0.05)
|
||||
|
||||
```
|
||||
Prediction intervals are computing by setting `obs=True`:
|
||||
|
||||
```{python}
|
||||
new_predictions.conf_int(obs=True, alpha=0.05)
|
||||
|
||||
```
|
||||
For instance, the 95% confidence interval associated with an
|
||||
`lstat` value of 10 is (24.47, 25.63), and the 95% prediction
|
||||
interval is (12.82, 37.28). As expected, the confidence and
|
||||
prediction intervals are centered around the same point (a predicted
|
||||
value of 25.05 for `medv` when `lstat` equals
|
||||
10), but the latter are substantially wider.
|
||||
|
||||
Next we will plot `medv` and `lstat`
|
||||
using `DataFrame.plot.scatter()`,
|
||||
and wish to
|
||||
add the regression line to the resulting plot.
|
||||
|
||||
|
||||
### Defining Functions
|
||||
While there is a function
|
||||
within the `ISLP` package that adds a line to an existing plot, we take this opportunity
|
||||
to define our first function to do so.
|
||||
|
||||
```{python}
|
||||
def abline(ax, b, m):
|
||||
"Add a line with slope m and intercept b to ax"
|
||||
xlim = ax.get_xlim()
|
||||
ylim = [m * xlim[0] + b, m * xlim[1] + b]
|
||||
ax.plot(xlim, ylim)
|
||||
|
||||
```
|
||||
A few things are illustrated above. First we see the syntax for defining a function:
|
||||
`def funcname(...)`. The function has arguments `ax, b, m`
|
||||
where `ax` is an axis object for an exisiting plot, `b` is the intercept and
|
||||
`m` is the slope of the desired line. Other plotting options can be passed on to
|
||||
`ax.plot` by including additional optional arguments as follows:
|
||||
|
||||
```{python}
|
||||
def abline(ax, b, m, *args, **kwargs):
|
||||
"Add a line with slope m and intercept b to ax"
|
||||
xlim = ax.get_xlim()
|
||||
ylim = [m * xlim[0] + b, m * xlim[1] + b]
|
||||
ax.plot(xlim, ylim, *args, **kwargs)
|
||||
|
||||
```
|
||||
The addition of `*args` allows any number of
|
||||
non-named arguments to `abline`, while `*kwargs` allows any
|
||||
number of named arguments (such as `linewidth=3`) to `abline`.
|
||||
In our function, we pass
|
||||
these arguments verbatim to `ax.plot` above. Readers
|
||||
interested in learning more about
|
||||
functions are referred to the section on
|
||||
defining functions in [docs.python.org/tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions).
|
||||
|
||||
Let’s use our new function to add this regression line to a plot of
|
||||
`medv` vs. `lstat`.
|
||||
|
||||
```{python}
|
||||
ax = Boston.plot.scatter('lstat', 'medv')
|
||||
abline(ax,
|
||||
results.params[0],
|
||||
results.params[1],
|
||||
'r--',
|
||||
linewidth=3)
|
||||
|
||||
```
|
||||
Thus, the final call to `ax.plot()` is `ax.plot(xlim, ylim, 'r--', linewidth=3)`.
|
||||
We have used the argument `'r--'` to produce a red dashed line, and added
|
||||
an argument to make it of width 3.
|
||||
There is some evidence for non-linearity in the relationship between `lstat` and `medv`. We will explore this issue later in this lab.
|
||||
|
||||
As mentioned above, there is an existing function to add a line to a plot --- `ax.axline()` --- but knowing how to write such functions empowers us to create more expressive displays.
|
||||
|
||||
|
||||
|
||||
|
||||
Next we examine some diagnostic plots, several of which were discussed
|
||||
in Section 3.3.3.
|
||||
We can find the fitted values and residuals
|
||||
of the fit as attributes of the `results` object.
|
||||
Various influence measures describing the regression model
|
||||
are computed with the `get_influence()` method.
|
||||
As we will not use the `fig` component returned
|
||||
as the first value from `subplots()`, we simply
|
||||
capture the second returned value in `ax` below.
|
||||
|
||||
```{python}
|
||||
ax = subplots(figsize=(8,8))[1]
|
||||
ax.scatter(results.fittedvalues, results.resid)
|
||||
ax.set_xlabel('Fitted value')
|
||||
ax.set_ylabel('Residual')
|
||||
ax.axhline(0, c='k', ls='--');
|
||||
|
||||
```
|
||||
We add a horizontal line at 0 for reference using the
|
||||
`ax.axhline()` method, indicating
|
||||
it should be black (`c='k'`) and have a dashed linestyle (`ls='--'`).
|
||||
|
||||
On the basis of the residual plot, there is some evidence of non-linearity.
|
||||
Leverage statistics can be computed for any number of predictors using the
|
||||
`hat_matrix_diag` attribute of the value returned by the
|
||||
`get_influence()` method.
|
||||
|
||||
```{python}
|
||||
infl = results.get_influence()
|
||||
ax = subplots(figsize=(8,8))[1]
|
||||
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
|
||||
ax.set_xlabel('Index')
|
||||
ax.set_ylabel('Leverage')
|
||||
np.argmax(infl.hat_matrix_diag)
|
||||
|
||||
```
|
||||
The `np.argmax()` function identifies the index of the largest element of an array, optionally computed over an axis of the array.
|
||||
In this case, we maximized over the entire array
|
||||
to determine which observation has the largest leverage statistic.
|
||||
|
||||
|
||||
## Multiple Linear Regression
|
||||
In order to fit a multiple linear regression model using least squares, we again use
|
||||
the `ModelSpec()` transform to construct the required
|
||||
model matrix and response. The arguments
|
||||
to `ModelSpec()` can be quite general, but in this case
|
||||
a list of column names suffice. We consider a fit here with
|
||||
the two variables `lstat` and `age`.
|
||||
|
||||
```{python}
|
||||
X = MS(['lstat', 'age']).fit_transform(Boston)
|
||||
model1 = sm.OLS(y, X)
|
||||
results1 = model1.fit()
|
||||
summarize(results1)
|
||||
```
|
||||
Notice how we have compacted the first line into a succinct expression describing the construction of `X`.
|
||||
|
||||
The `Boston` data set contains 12 variables, and so it would be cumbersome
|
||||
to have to type all of these in order to perform a regression using all of the predictors.
|
||||
Instead, we can use the following short-hand:
|
||||
|
||||
```{python}
|
||||
terms = Boston.columns.drop('medv')
|
||||
terms
|
||||
|
||||
```
|
||||
|
||||
We can now fit the model with all the variables in `terms` using
|
||||
the same model matrix builder.
|
||||
|
||||
```{python}
|
||||
X = MS(terms).fit_transform(Boston)
|
||||
model = sm.OLS(y, X)
|
||||
results = model.fit()
|
||||
summarize(results)
|
||||
|
||||
```
|
||||
|
||||
What if we would like to perform a regression using all of the variables but one? For
|
||||
example, in the above regression output, `age` has a high $p$-value.
|
||||
So we may wish to run a regression excluding this predictor.
|
||||
The following syntax results in a regression using all predictors except `age`.
|
||||
|
||||
```{python}
|
||||
minus_age = Boston.columns.drop(['medv', 'age'])
|
||||
Xma = MS(minus_age).fit_transform(Boston)
|
||||
model1 = sm.OLS(y, Xma)
|
||||
summarize(model1.fit())
|
||||
|
||||
```
|
||||
|
||||
## Multivariate Goodness of Fit
|
||||
We can access the individual components of `results` by name
|
||||
(`dir(results)` shows us what is available). Hence
|
||||
`results.rsquared` gives us the $R^2$,
|
||||
and
|
||||
`np.sqrt(results.scale)` gives us the RSE.
|
||||
|
||||
Variance inflation factors (section 3.3.3) are sometimes useful
|
||||
to assess the effect of collinearity in the model matrix of a regression model.
|
||||
We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.
|
||||
|
||||
### List Comprehension
|
||||
Often we encounter a sequence of objects which we would like to transform
|
||||
for some other task. Below, we compute the VIF for each
|
||||
feature in our `X` matrix and produce a data frame
|
||||
whose index agrees with the columns of `X`.
|
||||
The notion of list comprehension can often make such
|
||||
a task easier.
|
||||
|
||||
List comprehensions are simple and powerful ways to form
|
||||
lists of `Python` objects. The language also supports
|
||||
dictionary and *generator* comprehension, though these are
|
||||
beyond our scope here. Let's look at an example. We compute the VIF for each of the variables
|
||||
in the model matrix `X`, using the function `variance_inflation_factor()`.
|
||||
|
||||
|
||||
```{python}
|
||||
vals = [VIF(X, i)
|
||||
for i in range(1, X.shape[1])]
|
||||
vif = pd.DataFrame({'vif':vals},
|
||||
index=X.columns[1:])
|
||||
vif
|
||||
|
||||
```
|
||||
The function `VIF()` takes two arguments: a dataframe or array,
|
||||
and a variable column index. In the code above we call `VIF()` on the fly for all columns in `X`.
|
||||
We have excluded column 0 above (the intercept), which is not of interest. In this case the VIFs are not that exciting.
|
||||
|
||||
The object `vals` above could have been constructed with the following for loop:
|
||||
|
||||
```{python}
|
||||
vals = []
|
||||
for i in range(1, X.values.shape[1]):
|
||||
vals.append(VIF(X.values, i))
|
||||
|
||||
```
|
||||
List comprehension allows us to perform such repetitive operations in a more straightforward way.
|
||||
## Interaction Terms
|
||||
It is easy to include interaction terms in a linear model using `ModelSpec()`.
|
||||
Including a tuple `("lstat","age")` tells the model
|
||||
matrix builder to include an interaction term between
|
||||
`lstat` and `age`.
|
||||
|
||||
```{python}
|
||||
X = MS(['lstat',
|
||||
'age',
|
||||
('lstat', 'age')]).fit_transform(Boston)
|
||||
model2 = sm.OLS(y, X)
|
||||
summarize(model2.fit())
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Non-linear Transformations of the Predictors
|
||||
The model matrix builder can include terms beyond
|
||||
just column names and interactions. For instance,
|
||||
the `poly()` function supplied in `ISLP` specifies that
|
||||
columns representing polynomial functions
|
||||
of its first argument are added to the model matrix.
|
||||
|
||||
```{python}
|
||||
X = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
|
||||
model3 = sm.OLS(y, X)
|
||||
results3 = model3.fit()
|
||||
summarize(results3)
|
||||
|
||||
```
|
||||
The effectively zero *p*-value associated with the quadratic term
|
||||
(i.e. the third row above) suggests that it leads to an improved model.
|
||||
|
||||
By default, `poly()` creates a basis matrix for inclusion in the
|
||||
model matrix whose
|
||||
columns are *orthogonal polynomials*, which are designed for stable
|
||||
least squares computations. {Actually, `poly()` is a wrapper for the workhorse and standalone function `Poly()` that does the work in building the model matrix.}
|
||||
Alternatively, had we included an argument
|
||||
`raw=True` in the above call to `poly()`, the basis matrix would consist simply of
|
||||
`lstat` and `lstat**2`. Since either of these bases
|
||||
represent quadratic polynomials, the fitted values would not
|
||||
change in this case, just the polynomial coefficients. Also by default, the columns
|
||||
created by `poly()` do not include an intercept column as
|
||||
that is automatically added by `MS()`.
|
||||
|
||||
We use the `anova_lm()` function to further quantify the extent to which the quadratic fit is
|
||||
superior to the linear fit.
|
||||
|
||||
```{python}
|
||||
anova_lm(results1, results3)
|
||||
|
||||
```
|
||||
Here `results1` represents the linear submodel containing
|
||||
predictors `lstat` and `age`,
|
||||
while `results3` corresponds to the larger model above with a quadratic
|
||||
term in `lstat`.
|
||||
The `anova_lm()` function performs a hypothesis test
|
||||
comparing the two models. The null hypothesis is that the quadratic
|
||||
term in the bigger model is not needed, and the alternative hypothesis is that the
|
||||
bigger model is superior. Here the *F*-statistic is 177.28 and
|
||||
the associated *p*-value is zero.
|
||||
In this case the *F*-statistic is the square of the
|
||||
*t*-statistic for the quadratic term in the linear model summary
|
||||
for `results3` --- a consequence of the fact that these nested
|
||||
models differ by one degree of freedom.
|
||||
This provides very clear evidence that the quadratic polynomial in
|
||||
`lstat` improves the linear model.
|
||||
This is not surprising, since earlier we saw evidence for non-linearity in the relationship between `medv`
|
||||
and `lstat`.
|
||||
|
||||
The function `anova_lm()` can take more than two nested models
|
||||
as input, in which case it compares every successive pair of models.
|
||||
That also explains why their are `NaN`s in the first row above, since
|
||||
there is no previous model with which to compare the first.
|
||||
|
||||
|
||||
```{python}
|
||||
ax = subplots(figsize=(8,8))[1]
|
||||
ax.scatter(results3.fittedvalues, results3.resid)
|
||||
ax.set_xlabel('Fitted value')
|
||||
ax.set_ylabel('Residual')
|
||||
ax.axhline(0, c='k', ls='--')
|
||||
|
||||
```
|
||||
We see that when the quadratic term is included in the model,
|
||||
there is little discernible pattern in the residuals.
|
||||
In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument
|
||||
to `poly()`.
|
||||
|
||||
|
||||
|
||||
## Qualitative Predictors
|
||||
Here we use the `Carseats` data, which is included in the
|
||||
`ISLP` package. We will attempt to predict `Sales`
|
||||
(child car seat sales) in 400 locations based on a number of
|
||||
predictors.
|
||||
|
||||
```{python}
|
||||
Carseats = load_data('Carseats')
|
||||
Carseats.columns
|
||||
|
||||
```
|
||||
The `Carseats`
|
||||
data includes qualitative predictors such as
|
||||
`ShelveLoc`, an indicator of the quality of the shelving
|
||||
location --- that is,
|
||||
the space within a store in which the car seat is displayed. The predictor
|
||||
`ShelveLoc` takes on three possible values, `Bad`, `Medium`, and `Good`.
|
||||
Given a qualitative variable such as `ShelveLoc`, `ModelSpec()` generates dummy
|
||||
variables automatically.
|
||||
These variables are often referred to as a *one-hot encoding* of the categorical
|
||||
feature. Their columns sum to one, so to avoid collinearity with an intercept, the first column is dropped. Below we see
|
||||
the column `ShelveLoc[Bad]` has been dropped, since `Bad` is the first level of `ShelveLoc`.
|
||||
Below we fit a multiple regression model that includes some interaction terms.
|
||||
|
||||
```{python}
|
||||
allvars = list(Carseats.columns.drop('Sales'))
|
||||
y = Carseats['Sales']
|
||||
final = allvars + [('Income', 'Advertising'),
|
||||
('Price', 'Age')]
|
||||
X = MS(final).fit_transform(Carseats)
|
||||
model = sm.OLS(y, X)
|
||||
summarize(model.fit())
|
||||
|
||||
```
|
||||
In the first line above, we made `allvars` a list, so that we
|
||||
could add the interaction terms two lines down.
|
||||
Our model-matrix builder has created a `ShelveLoc[Good]`
|
||||
dummy variable that takes on a value of 1 if the
|
||||
shelving location is good, and 0 otherwise. It has also created a `ShelveLoc[Medium]`
|
||||
dummy variable that equals 1 if the shelving location is medium, and 0 otherwise.
|
||||
A bad shelving location corresponds to a zero for each of the two dummy variables.
|
||||
The fact that the coefficient for `ShelveLoc[Good]` in the regression output is
|
||||
positive indicates that a good shelving location is associated with high sales (relative to a bad location).
|
||||
And `ShelveLoc[Medium]` has a smaller positive coefficient,
|
||||
indicating that a medium shelving location leads to higher sales than a bad
|
||||
shelving location, but lower sales than a good shelving location.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user