Correct note on Multiple Regression.

This commit is contained in:
Marco Oesting
2023-10-10 07:53:07 +02:00
parent 72e4f8da83
commit c74be9a6d7

View File

@@ -1,319 +1,320 @@
--- ---
editor: editor:
markdown: markdown:
wrap: 72 wrap: 72
--- ---
# Multiple Regression Basics # Multiple Regression Basics
## Motivation ## Motivation
### Introductory Example: tree dataset from R ### Introductory Example: tree dataset from R
``` julia ``` julia
using Statistics using Statistics
using Plots using Plots
using RDatasets using RDatasets
trees = dataset("datasets", "trees") trees = dataset("datasets", "trees")
scatter(trees.Volume, trees.Girth, scatter(trees.Volume, trees.Girth,
legend=false, xlabel="Girth", ylabel="Volume") legend=false, xlabel="Girth", ylabel="Volume")
``` ```
*Aim:* Find relationship between the *response variable* `volume` and *Aim:* Find relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`? Can we predict the volume the *explanatory variable/covariate* `girth`? Can we predict the volume
of a tree given its girth? of a tree given its girth?
``` julia ``` julia
scatter(trees.Girth, trees.Volume, scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume") legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x) plot!(x -> -37 + 5*x)
``` ```
First Guess: There is a linear relation! First Guess: There is a linear relation!
## Simple Linear Regression ## Simple Linear Regression
Main assumption: up to some error term, each measurement of the response Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the variable $y_i$ depends linearly on the corresponding value $x_i$ of the
covariate covariate
$\leadsto$ **(Simple) Linear Model:** $\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
normally distributed errors with unknown variance $\sigma^2$. normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the straight line that fits best, i.e., find the *optimal* *Task:* Find the straight line that fits best, i.e., find the *optimal*
estimators for $\beta_0$ and $\beta_1$. estimators for $\beta_0$ and $\beta_1$.
*Typical choice*: Least squares estimator (= maximum likelihood *Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors) estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ $$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
vector of covariates and $\mathbf{1}$ is a vector of ones. vector of covariates and $\mathbf{1}$ is a vector of ones.
Written in matrix style: Written in matrix style:
$$ $$
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$ $$
Note: There is a closed-form expression for Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem. rather use Julia to solve the problem.
``` julia ``` julia
lm(@formula(Volume ~ Girth), trees) lm(@formula(Volume ~ Girth), trees)
``` ```
*Interpretation of the Julia output:* *Interpretation of the Julia output:*
- column `estimate` : least square estimates for $\hat \beta_0$ and - column `estimate` : least square estimates for $\hat \beta_0$ and
$\hat \beta_1$ $\hat \beta_1$
- column `Std. Error` : estimated standard deviation - column `Std. Error` : estimated standard deviation
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value` : value of the $t$-statistics - column `t value` : value of the $t$-statistics
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
follow a $t$-distribution. follow a $t$-distribution.
- column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for - column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for
$i=0,1$ $i=0,1$
::: callout-tip ::: callout-tip
The command `rand(n)` generates a sample of `n` "random" (i.e., The command `rand(n)` generates a sample of `n` "random" (i.e.,
uniformly distributed) random numbers. If you want to sample from uniformly distributed) random numbers. If you want to sample from
another distribution, use the `Distributions` package, define an object another distribution, use the `Distributions` package, define an object
being the distribution of interest, e.g. `d = Normal(0.0, 2.0)` for a being the distribution of interest, e.g. `d = Normal(0.0, 2.0)` for a
normal distribution with mean 0.0 and standard deviation 2.0, and sample normal distribution with mean 0.0 and standard deviation 2.0, and sample
`n` times from this distribution by `rand(d, n)`. `n` times from this distribution by `rand(d, n)`.
::: :::
::: {.callout-caution collapse="false"} ::: {.callout-caution collapse="false"}
## Task 1 ## Task 1
1. Generate $n=20$ covariates $\mathbf{x}$ randomly. 1. Generate $n=20$ covariates $\mathbf{x}$ randomly.
2. Given these covariates and the true parameters $\beta_0=-3$, 2. Given these covariates and the true parameters $\beta_0=-3$,
$\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model $\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model
(with normally distributed errors) and estimate the coefficients (with normally distributed errors) and estimate the coefficients
$\beta_0$ and $\beta_1$. $\beta_0$ and $\beta_1$.
3. Play with different choices of the parameters above (including the 3. Play with different choices of the parameters above (including the
sample size $n$) to see the effects on the parameter estimates and sample size $n$) to see the effects on the parameter estimates and
the $p$-values. the $p$-values.
:::
::: {callout-note}
The more covariates you add the more variance can be explained by the linear model - $R^2$ increases. In order to balance goodness-of-tit of a model and its complexity, information criteria such as `aic` are considered. ## Multiple Regression Model
:::
*Idea*: Generalize the simple linear regression model to multiple
## Multiple Regression Model covariates, w.g., predict `volume` using `girth` and \`height\`\`.
*Idea*: Generalize the simple linear regression model to multiple $\leadsto$ **Linear Model:**
covariates, w.g., predict `volume` using `girth` and \`height\`\`. $$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
$\leadsto$ **Linear Model:** - $y_i$: $i$-th measurement of the response,
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
- $x_{i1}$: $i$ th value of first covariate,
- $y_i$: $i$-th measurement of the response,
- ...
- $x_{i1}$: $i$ th value of first covariate,
- $x_{ip}$: $i$-th value of $p$-th covariate,
- ...
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
- $x_{ip}$: $i$-th value of $p$-th covariate, distributed errors with unknown variance $\sigma^2$.
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally *Task:* Find the *optimal* estimators for
distributed errors with unknown variance $\sigma^2$. $\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
*Task:* Find the *optimal* estimators for *Our choice again:* Least squares estimator (= maximum likelihood
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. estimator for normal errors)
*Our choice again:* Least squares estimator (= maximum likelihood $$
estimator for normal errors) \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$
$$
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
$$ vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the Written in matrix style:
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
$$
Written in matrix style: \mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 Defining the *design matrix*
$$
$$ \mathbf{X} = \left( \begin{array}{cccc}
Defining the *design matrix* 1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
$$ \mathbf{X} = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p}
1 & x_{11} & \ldots & x_{1p} \\ \end{array}\right) \qquad
\vdots & \vdots & \ddots & \vdots \\ (\text{size } n \times (p+1)), $$
1 & x_{11} & \ldots & x_{1p}
\end{array}\right) \qquad we get the short form
(\text{size } n \times (p+1)), $$
$$
we get the short form \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \[use Julia code (existing package) to perform linear regression for
$$ `volume ~ girth + height`\]
\[use Julia code (existing package) to perform linear regression for The interpretation of the Julia output is similar to the simple linear
`volume ~ girth + height`\] regression model, but we provide explicit formulas now:
The interpretation of the Julia output is similar to the simple linear - parameter estimates:
regression model, but we provide explicit formulas now:
$$
- parameter estimates: (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
$$
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - estimated standard errors:
$$
$$
- estimated standard errors: \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
$$
$$
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} - $t$-statistics:
$$
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $t$-statistics:
- $p$-values:
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
$$
- $p$-values: p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
$$
$$
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} ::: {.callout-caution collapse="false"}
$$
## Task 2
::: {.callout-caution collapse="false"}
1. Implement functions that estimate the $\beta$-parameters,
## Task 2 the corresponding standard errors and the $t$-statistics.
2. Test your functions with the `tree' data set and try to reproduce the
1. Implement functions that estimate the $\beta$-parameters, output above.
the corresponding standard errors and the $t$-statistics. :::
2. Test your functions with the `tree' data set and try to reproduce the
output above. Which model is the best? For linear models, one often uses the $R^2$ characteristic.
::: Roughly speaking, it gives the percentage (between 0 and 1) of the variance that can be explained by the linear model.
Which model is the best? For linear models, one often uses the $R^2$ characteristic. ``` julia
Roughly speaking, it gives the percentage (between 0 and 1) of the variance that can be explained by the linear model. r2(linmod1)
r2(linmod2)
``` julia
r2(linmod1) linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod2)
r2(linmod3)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) ```
r2(linmod3) ::: {callout-note}
``` The more covariates you add the more variance can be explained by the linear model - $R^2$ increases. In order to balance goodness-of-tit of a model and its complexity, information criteria such as `aic` are considered.
:::
## Generalized Linear Models
## Generalized Linear Models
Classical linear model
Classical linear model
$$
\mathbf{y} = \mathbf{X} \beta + \varepsilon $$
$$ \mathbf{y} = \mathbf{X} \beta + \varepsilon
$$
implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include: However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include:
- for counting data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) - for counting data: $$
$$ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
$$
Here, the components are considered to be independent and the
exponential function is applied componentwise. Here, the components are considered to be independent and the
exponential function is applied componentwise.
- for binary data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} - for binary data: $$
$$ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
$$
Again, the components are considered to be independent and all the
operations are applied componentwise. Again, the components are considered to be independent and all the
operations are applied componentwise.
All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that
$$
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). $$
$$ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
$$
For the models above, these are:
For the models above, these are:
+----------------+------------------+--------------------------+
| Type of Data | Distribution | Link Function | +----------------+------------------+--------------------------+
| | Family | | | Type of Data | Distribution | Link Function |
+================+==================+==========================+ | | Family | |
| continuous | Normal | identity: | +================+==================+==========================+
| | | | | continuous | Normal | identity: |
| | | $$ | | | | |
| | | g(x)=x | | | | $$ |
| | | $$ | | | | g(x)=x |
+----------------+------------------+--------------------------+ | | | $$ |
| count | Poisson | log: | +----------------+------------------+--------------------------+
| | | | | count | Poisson | log: |
| | | $$ | | | | |
| | | g(x) = \log(x) | | | | $$ |
| | | $$ | | | | g(x) = \log(x) |
+----------------+------------------+--------------------------+ | | | $$ |
| binary | Bernoulli | logit: | +----------------+------------------+--------------------------+
| | | | | binary | Bernoulli | logit: |
| | | $$ | | | | |
| | | g(x) = \log\left( | | | | $$ |
| | | \frac{x}{1-x} | | | | g(x) = \log\left( |
| | | \right) | | | | \frac{x}{1-x} |
| | | $$ | | | | \right) |
+----------------+------------------+--------------------------+ | | | $$ |
+----------------+------------------+--------------------------+
In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e., In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), $$
$$ \hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$
which is equivalent to the maximization of the log-likelihood, i.e.,
which is equivalent to the maximization of the log-likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), $$
$$ \hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above. In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
``` julia
using CSV ``` julia
using HTTP using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body)) http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink()) model = glm(@formula(participation ~ age^2),
``` SwissLabor, Binomial(), ProbitLink())
```
::: {.callout-caution collapse="false"}
::: {.callout-caution collapse="false"}
## Task 3:
## Task 3:
1. Reproduce the results of our data analysis of the `tree` data set
using a generalized linear model with normal distribution family. 1. Reproduce the results of our data analysis of the `tree` data set
2. Generate $n=20$ random covariates $\mathbf{x}$ and Poisson-distributed counting data with parameters $\beta_0 + \beta_1 x_i$. Re-estimate the parameters by a generalized linear model. using a generalized linear model with normal distribution family.
::: 2. Generate $n=20$ random covariates $\mathbf{x}$ and Poisson-distributed counting data with parameters $\beta_0 + \beta_1 x_i$. Re-estimate the parameters by a generalized linear model.
:::