---
editor:
  markdown:
    wrap: 72
---

# Multiple Regression Basics

## Motivation

### Introductory Example: the `trees` dataset from R

``` julia
using Statistics
using Plots
using RDatasets

trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
        legend=false, xlabel="Girth", ylabel="Volume")
```

*Aim:* Find the relationship between the *response variable* `Volume`
and the *explanatory variable/covariate* `Girth`. Can we predict the
volume of a tree given its girth?

``` julia
scatter(trees.Girth, trees.Volume,
        legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)   # a first guess at a linear fit
```

First Guess: There is a linear relation!

## Simple Linear Regression

Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
covariate.

$\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
normally distributed errors with unknown variance $\sigma^2$.

*Task:* Find the straight line that fits best, i.e., find the *optimal*
estimators for $\beta_0$ and $\beta_1$.

*Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2, $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
vector of covariates and $\mathbf{1}$ is a vector of ones.

Written in matrix style:
$$
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$

Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem.

``` julia
using GLM   # provides lm() and @formula

lm(@formula(Volume ~ Girth), trees)
```

*Interpretation of the Julia output:*

- column `estimate`: least squares estimates for $\hat \beta_0$ and
  $\hat \beta_1$
- column `Std. Error`: estimated standard deviation
  $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value`: value of the $t$-statistic
  $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1. $$
  Under the hypothesis $\beta_i=0$, the test statistic $t_i$ follows a
  $t$-distribution.
- column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for
  $i=0,1$

::: callout-tip
The command `rand(n)` generates a sample of `n` random numbers drawn
uniformly from $[0,1)$. If you want to sample from another
distribution, use the `Distributions` package: define an object
representing the distribution of interest, e.g. `d = Normal(0.0, 2.0)`
for a normal distribution with mean 0.0 and standard deviation 2.0, and
sample `n` times from this distribution by `rand(d, n)`.
:::
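
For example, a minimal sketch (the seed is only for reproducibility):

``` julia
using Distributions
using Random

Random.seed!(123)       # fix the seed so the results are reproducible
u = rand(5)             # 5 draws, uniform on [0, 1)
d = Normal(0.0, 2.0)    # normal distribution with mean 0.0 and sd 2.0
z = rand(d, 5)          # 5 draws from d
```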

::: {.callout-caution collapse="false"}
## Task 1
1. Generate $n=20$ covariates $\mathbf{x}$ randomly.
2. Given these covariates and the true parameters $\beta_0=-3$,
   $\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model
   (with normally distributed errors) and estimate the coefficients
   $\beta_0$ and $\beta_1$.
3. Play with different choices of the parameters above (including the
   sample size $n$) to see the effects on the parameter estimates and
   the $p$-values.

One possible setup for steps 1 and 2 is sketched after this callout.
:::
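
A minimal sketch for Task 1, using the true parameters from the task
(the seed and the uniform choice of covariates are arbitrary):

``` julia
using DataFrames
using Distributions
using GLM
using Random

Random.seed!(42)                      # reproducibility
n = 20
x = rand(n)                           # step 1: random covariates
β0, β1, σ = -3.0, 2.0, 0.5            # true parameters from the task
y = β0 .+ β1 .* x .+ rand(Normal(0.0, σ), n)   # step 2: simulate responses
sim = DataFrame(x = x, y = y)
lm(@formula(y ~ x), sim)              # estimates of β₀ and β₁
```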

## Multiple Regression Model

*Idea*: Generalize the simple linear regression model to multiple
covariates, e.g., predict `Volume` using `Girth` and `Height`.

$\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$
where

- $y_i$: $i$-th measurement of the response,
- $x_{i1}$: $i$-th value of the first covariate,
- ...
- $x_{ip}$: $i$-th value of the $p$-th covariate,
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
  distributed errors with unknown variance $\sigma^2$.

*Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.

*Our choice again:* Least squares estimator (= maximum likelihood
estimator for normal errors)
$$
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2,
$$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}_j$ is the
vector of the $j$-th covariate and $\mathbf{1}$ is a vector of ones.

Written in matrix style:
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$

Defining the *design matrix*
$$ \mathbf{X} = \left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \ldots & x_{np}
\end{array}\right) \qquad
(\text{size } n \times (p+1)), $$
we get the short form
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.
$$

In Julia, we can again use an existing package to perform the linear
regression for `Volume ~ Girth + Height`.
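
A minimal sketch with the `GLM` package; the names `linmod1` and
`linmod2` are chosen here so that the $R^2$ comparison below can refer
to them:

``` julia
using GLM

linmod1 = lm(@formula(Volume ~ Girth), trees)            # simple model from above
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)   # two covariates
```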

The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now:

- parameter estimates:
  $$
  \mathbf{\hat \beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
  $$
- estimated standard errors:
  $$
  \hat s_{\hat \beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \mathbf{\hat \beta}\|^2}
  $$
- $t$-statistics:
  $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $p$-values:
  $$
  p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}.
  $$

These formulas translate directly into code; see the sketch below.
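
A minimal sketch of these formulas for the `trees` data; here the
residual degrees of freedom are written as $n-k$, where $k$ is the
number of columns of $\mathbf{X}$ (i.e., intercept included):

``` julia
using Distributions    # TDist for the p-values
using LinearAlgebra    # diag, inv
using RDatasets

trees = dataset("datasets", "trees")
y = trees.Volume
X = [ones(length(y)) trees.Girth trees.Height]    # design matrix

n, k = size(X)                                # k columns incl. intercept
betahat = (X' * X) \ (X' * y)                 # parameter estimates
s2 = sum(abs2, y - X * betahat) / (n - k)     # residual variance
se = sqrt.(diag(inv(X' * X)) .* s2)           # estimated standard errors
t = betahat ./ se                             # t-statistics
pvals = 2 .* ccdf.(TDist(n - k), abs.(t))     # two-sided p-values
```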

::: {.callout-caution collapse="false"}
## Task 2
1. Implement functions that estimate the $\beta$-parameters,
   the corresponding standard errors and the $t$-statistics.
2. Test your functions with the `trees` data set and try to reproduce
   the output above.
:::

Which model is the best? For linear models, one often uses the $R^2$
characteristic. Roughly speaking, it gives the proportion (between 0
and 1) of the variance that can be explained by the linear model.

``` julia
r2(linmod1)
r2(linmod2)

linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
```

## Generalized Linear Models

The classical linear model
$$
\mathbf{y} = \mathbf{X} \mathbf{\beta} + \varepsilon
$$
implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \mathbf{\beta}$.

However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include:

- for count data:
  $$
  \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \mathbf{\beta})
  $$
  Here, the components are considered to be independent and the
  exponential function is applied componentwise.
- for binary data:
  $$
  \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
  $$
  Again, the components are considered to be independent and all the
  operations are applied componentwise.

All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that
$$
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \mathbf{\beta}).
$$
For the models above, these are:

+----------------+------------------+--------------------------+
| Type of Data   | Distribution     | Link Function            |
|                | Family           |                          |
+================+==================+==========================+
| continuous     | Normal           | identity:                |
|                |                  |                          |
|                |                  | $$                       |
|                |                  | g(x)=x                   |
|                |                  | $$                       |
+----------------+------------------+--------------------------+
| count          | Poisson          | log:                     |
|                |                  |                          |
|                |                  | $$                       |
|                |                  | g(x) = \log(x)           |
|                |                  | $$                       |
+----------------+------------------+--------------------------+
| binary         | Bernoulli        | logit:                   |
|                |                  |                          |
|                |                  | $$                       |
|                |                  | g(x) = \log\left(        |
|                |                  | \frac{x}{1-x}            |
|                |                  | \right)                  |
|                |                  | $$                       |
+----------------+------------------+--------------------------+

In general, the parameter vector $\mathbf{\beta}$ is estimated via
maximizing the likelihood, i.e.,

$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i \cdot}),
$$

where $\mathbf{X}_{i \cdot}$ denotes the $i$-th row of the design
matrix, which is equivalent to the maximization of the log-likelihood,
i.e.,

$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i \cdot}).
$$

In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.

``` julia
using CSV
using DataFrames
using GLM
using HTTP

http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))

# recode the response from "yes"/"no" to a Boolean variable
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")

model = glm(@formula(participation ~ age^2),
            SwissLabor, Binomial(), ProbitLink())
```

::: {.callout-caution collapse="false"}
## Task 3
1. Reproduce the results of our data analysis of the `trees` data set
   using a generalized linear model with normal distribution family.
2. Generate $n=20$ random covariates $\mathbf{x}$ and Poisson-distributed
   count data with parameters $\exp(\beta_0 + \beta_1 x_i)$. Re-estimate
   the parameters by a generalized linear model. One possible setup is
   sketched below.
:::
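
A minimal sketch for the second part of Task 3; the true values of
$\beta_0$ and $\beta_1$ below are arbitrary choices, not prescribed by
the task:

``` julia
using DataFrames
using Distributions
using GLM
using Random

Random.seed!(1)                      # reproducibility
n = 20
x = rand(n)                          # random covariates
β0, β1 = 0.5, 1.5                    # hypothetical true parameters
y = rand.(Poisson.(exp.(β0 .+ β1 .* x)))          # Poisson counts
sim = DataFrame(x = x, y = y)
glm(@formula(y ~ x), sim, Poisson(), LogLink())   # re-estimate β₀, β₁
```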