Updated Text & Code on Regression.

This commit is contained in:
Marco Oesting
2023-10-08 20:56:51 +02:00
parent cc5c76f770
commit f2d84806ea
2 changed files with 344 additions and 253 deletions

View File

@@ -0,0 +1,52 @@
############################################################################
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
############################################################################
using Statistics
using Plots
using RDatasets
using GLM
##
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
##
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
##
linmod1 = lm(@formula(Volume ~ Girth), trees)
##
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
##
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
##
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
##
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())

View File

@@ -1,253 +1,292 @@
--- ---
editor: editor:
markdown: markdown:
wrap: 72 wrap: 72
--- ---
# Multiple Regression Basics # Multiple Regression Basics
## Motivation ## Motivation
### Introductory Example: tree dataset from R ### Introductory Example: tree dataset from R
\[figure of raw data\] ```{julia}
using Statistics
*Aim:* Find relationship between the *response variable* `volume` and using Plots
the *explanatory variable/covariate* `girth`? Can we predict the volume using RDatasets
of a tree given its girth?
trees = dataset("datasets", "trees")
\[figure including a straight line\]
scatter(trees.Volume, trees.Girth,
First Guess: There is a linear relation! legend=false, xlabel="Girth", ylabel="Volume")
```
## Simple Linear Regression
*Aim:* Find relationship between the *response variable* `volume` and
Main assumption: up to some error term, each measurement of the response the *explanatory variable/covariate* `girth`? Can we predict the volume
variable $y_i$ depends linearly on the corresponding value $x_i$ of the of a tree given its girth?
covariate
```{julia}
$\leadsto$ **(Simple) Linear Model:** scatter(trees.Girth, trees.Volume,
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ legend=false, xlabel="Girth", ylabel="Volume")
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent plot!(x -> -37 + 5*x)
normally distributed errors with unknown variance $\sigma^2$. ```
*Task:* Find the straight line that fits best, i.e., find the *optimal* First Guess: There is a linear relation!
estimators for $\beta_0$ and $\beta_1$.
## Simple Linear Regression
*Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors) Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ covariate
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the $\leadsto$ **(Simple) Linear Model:**
vector of covariates and $\mathbf{1}$ is a vector of ones. $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
Written in matrix style: normally distributed errors with unknown variance $\sigma^2$.
$$ *Task:* Find the straight line that fits best, i.e., find the *optimal*
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 estimators for $\beta_0$ and $\beta_1$.
$$
*Typical choice*: Least squares estimator (= maximum likelihood
Note: There is a closed-form expression for estimator for normal errors)
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem. $$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
\[use Julia code (existing package) to perform linear regression for where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
`volume ~ girth`\] vector of covariates and $\mathbf{1}$ is a vector of ones.
*Interpretation of the Julia output:* Written in matrix style:
- column `estimate` : least square estimates for $\hat \beta_0$ and $$
$\hat \beta_1$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$
- column `Std. Error` : estimated standard deviation
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
- column `t value` : value of the $t$-statistics rather use Julia to solve the problem.
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ \[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
follow a $t$-distribution. ```{julia}
lm(@formula(Volume ~ Girth), trees)
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for ```
$i=0,1$
*Interpretation of the Julia output:*
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ - column `estimate` : least square estimates for $\hat \beta_0$ and
(you can choose them)), simulate responses from a linear model and $\hat \beta_1$
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
choices of the parameters to see the effects on the parameter estimates - column `Std. Error` : estimated standard deviation
and the $p$-values. $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
## Multiple Regression Model - column `t value` : value of the $t$-statistics
*Idea*: Generalize the simple linear regression model to multiple $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
covariates, w.g., predict `volume` using `girth` and \`height\`\`.
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
$\leadsto$ **Linear Model:** follow a $t$-distribution.
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
- $y_i$: $i$-th measurement of the response, $i=0,1$
- $x_{i1}$: $i$ th value of first covariate, **Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
- ... (you can choose them)), simulate responses from a linear model and
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
- $x_{ip}$: $i$-th value of $p$-th covariate, choices of the parameters to see the effects on the parameter estimates
and the $p$-values.
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
distributed errors with unknown variance $\sigma^2$. ## Multiple Regression Model
*Task:* Find the *optimal* estimators for *Idea*: Generalize the simple linear regression model to multiple
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. covariates, w.g., predict `volume` using `girth` and \`height\`\`.
*Our choice again:* Least squares estimator (= maximum likelihood $\leadsto$ **Linear Model:**
estimator for normal errors) $$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
$$ - $y_i$: $i$-th measurement of the response,
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$ - $x_{i1}$: $i$ th value of first covariate,
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the - ...
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
- $x_{ip}$: $i$-th value of $p$-th covariate,
Written in matrix style:
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
$$ distributed errors with unknown variance $\sigma^2$.
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$ *Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
Defining the *design matrix*
*Our choice again:* Least squares estimator (= maximum likelihood
$$ \mathbf{X} = \left( \begin{array}{cccc} estimator for normal errors)
1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\ $$
1 & x_{11} & \ldots & x_{1p} \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
\end{array}\right) \qquad $$
(\text{size } n \times (p+1)), $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
we get the short form vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
$$ Written in matrix style:
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$ $$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
\[use Julia code (existing package) to perform linear regression for $$
`volume ~ girth + height`\]
Defining the *design matrix*
The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now: $$ \mathbf{X} = \left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1p} \\
- parameter estimates: \vdots & \vdots & \ddots & \vdots \\
1 & x_{11} & \ldots & x_{1p}
$$ \end{array}\right) \qquad
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} (\text{size } n \times (p+1)), $$
$$
we get the short form
- estimated standard errors:
$$
$$ \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} $$
$$
\[use Julia code (existing package) to perform linear regression for
- $t$-statistics: `volume ~ girth + height`\]
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now:
- $p$-values:
- parameter estimates:
$$
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} $$
$$ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
**Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your - estimated standard errors:
functions with the \`\`\`tree''' data set and try to reproduce the
output above. $$
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
## Generalized Linear Models $$
Classical linear model - $t$-statistics:
$$ $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
\mathbf{y} = \mathbf{X} \beta + \varepsilon
$$ - $p$-values:
implies that $$
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
$$
In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. **Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your
However, the assumption of a normal distribution is unrealistic for functions with the \`\`\`tree''' data set and try to reproduce the
non-continuous data. Popular alternatives include: output above.
- for counting data: $$ ```{julia}
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) r2(linmod1)
$$ r2(linmod2)
Here, the components are considered to be independent and the linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
exponential function is applied componentwise.
r2(linmod3)
- for binary data: $$ ```
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
$$ ## Generalized Linear Models
Again, the components are considered to be independent and all the Classical linear model
operations are applied componentwise.
$$
All these models are defined by the choice of a family of distributions \mathbf{y} = \mathbf{X} \beta + \varepsilon
and a function $g$ (the so-called *link function*) such that $$
$$ implies that
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). $$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
$$
In particular, the conditional expectation satisfies
For the models above, these are: $\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
+----------------------+---------------------+----------------------+ However, the assumption of a normal distribution is unrealistic for
| Type of Data | Distribution Family | Link Function | non-continuous data. Popular alternatives include:
+======================+=====================+======================+
| continuous | Normal | identity: | - for counting data: $$
| | | | \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
| | | $$ | $$
| | | g(x)=x |
| | | $$ | Here, the components are considered to be independent and the
+----------------------+---------------------+----------------------+ exponential function is applied componentwise.
| count | Poisson | log: |
| | | | - for binary data: $$
| | | $$ | \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
| | | g(x) = \log(x) | $$
| | | $$ |
+----------------------+---------------------+----------------------+ Again, the components are considered to be independent and all the
| binary | Bernoulli | logit: | operations are applied componentwise.
| | | |
| | | $$ | All these models are defined by the choice of a family of distributions
| | | g(x) = \log\left | and a function $g$ (the so-called *link function*) such that
| | | ( |
| | | \frac{x}{1-x}\right) | $$
| | | $$ | \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
+----------------------+---------------------+----------------------+ $$
In general, the parameter vector $\beta$ is estimated via maximizing the For the models above, these are:
likelihood, i.e.,
+--------------+---------------------+--------------------+
$$ | Type of Data | Distribution Family | Link Function |
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), +==============+=====================+====================+
$$ | continuous | Normal | identity: |
| | | |
which is equivalent to the maximization of the log-likelihood, i.e., | | | $$ |
| | | g(x)=x |
$$ | | | $$ |
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), +--------------+---------------------+--------------------+
$$ | count | Poisson | log: |
| | | |
In the Gaussian case, the maximum likelihood estimator is identical to | | | $$ |
the least squares estimator considered above. | | | g(x) = \log(x) |
| | | $$ |
\[\[ Example in Julia: maybe `SwissLabor` \]\] +--------------+---------------------+--------------------+
| binary | Bernoulli | logit: |
**Task 3:** Reproduce the results of our data analysis of the `tree` | | | |
data set using a generalized linear model with normal distribution | | | $$ |
family. | | | g(x) = \log\left |
| | | ( |
| | | \ |
| | | f |
| | | rac{x}{1-x}\right) |
| | | $$ |
+--------------+---------------------+--------------------+
In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$
which is equivalent to the maximization of the log-likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
```{julia}
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink())
```
**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution
family.