Update MultipleRegressionBasics.md

Oesting 2023-09-13 14:56:26 +02:00 committed by GitHub
parent 21097e480c
commit fd35257877

# Multiple Regression Basics
## Motivation
### Introductory Example: tree dataset from R
[figure of raw data]
*Aim:* Find the relationship between the *response variable* `volume` and the *explanatory variable/covariate* `girth`.
Can we predict the volume of a tree given its girth?
[figure including a straight line]
First guess: There is a linear relation!
## Simple Linear Regression
Main assumption: up to some error term, each measurement $y_i$ of the response variable depends linearly on the corresponding value $x_i$ of the covariate.
=> **(Simple) Linear Model**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the straight line that fits best, i.e., find the *optimal* estimators for $\beta_0$ and $\beta_1$.
*Typical choice*: Least squares estimator (= maximum likelihood estimator for normal errors)
```math
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2
```
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the vector of covariates and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
```math
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \|^2
```
Note: There is a closed-form expression for $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but rather use Julia to solve the problem.
[use Julia code (existing package) to perform linear regression for ``volume ~ girth``]
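A possible sketch of such code, assuming the GLM.jl and RDatasets.jl packages and the ``trees`` data set from R's ``datasets`` package (the package choice and the column names ``Girth``, ``Height``, ``Volume`` are assumptions, not fixed by this text):
```julia
# Sketch: simple linear regression  volume ~ girth  via GLM.jl
using DataFrames, GLM, RDatasets

trees = dataset("datasets", "trees")          # load the tree data from R
model = lm(@formula(Volume ~ Girth), trees)   # fit y = β0 + β1 * x + ε
coeftable(model)                              # estimates, std. errors, t values, p-values
```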
*Interpretation of the Julia output:*
+ column ``estimate`` : least squares estimates $\hat \beta_0$ and $\hat \beta_1$
+ column ``Std. Error`` : estimated standard deviations $s_{\beta_i}$ of the estimators
+ column ``t value`` : value of the $t$-statistics
```math
t_i = \frac{\hat \beta_i}{s_{\beta_i}}, \quad i=0,1,
```
Under the hypothesis $\beta_i=0$, $t_i$ would follow a $t$-distribution with $n-2$ degrees of freedom.
+ column ``Pr(>|t|)`` : $p$-values for the hypotheses $\beta_i=0$ for $i=0,1$
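The individual columns can also be accessed programmatically; a minimal sketch, reusing the fitted ``model`` object from the sketch above:
```julia
est  = coef(model)        # least squares estimates (β̂_0, β̂_1)
se   = stderror(model)    # estimated standard errors s_β
tval = est ./ se          # t-statistics as defined above
coeftable(model)          # full table, including the Pr(>|t|) column
```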
**Exercise**: Generate a random set of covariates $\mathbf{x}$. Given these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ (you can choose them), simulate responses from a linear model
```math
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\sigma^2), \qquad i=1,...,n,
```
and estimate the coefficients $\beta_0$ and $\beta_1$. Play with different choices of the parameters to see the effects on the parameter estimates and the $p$-values.
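One possible starting point for this exercise (all concrete numbers below are arbitrary choices, not prescribed by the text):
```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(1)                                       # reproducibility
n = 100
x = 10 .* rand(n)                                     # random covariates
β0, β1, σ² = 1.0, 2.5, 0.5                            # "true" parameters, chosen freely
y = β0 .+ β1 .* x .+ rand(Normal(0, sqrt(σ²)), n)     # simulate responses from the linear model

lm(@formula(y ~ x), DataFrame(x = x, y = y))          # estimates should be close to β0 and β1
```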
## Multiple Regression Model
*Idea*: Generalize the simple linear regression model to multiple covariates, e.g., predict ``volume`` using ``girth`` and ``height``.
=> **Linear Model**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$
where
+ $y_i$: $i$-th measurement of the response,
+ $x_{i1}$: $i$-th value of the first covariate,
+ ...
+ $x_{ip}$: $i$-th value of the $p$-th covariate,
+ $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the *optimal* estimators for $\beta_0, \beta_1, \ldots, \beta_p$.
*Our choice again:* Least squares estimator (= maximum likelihood estimator for normal errors)
```math
(\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
```
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}_j$ is the vector of values of the $j$-th covariate and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
```math
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \|^2
```
Defining the *design matrix* $\mathbf{X} = (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p)$ of size $n \times (p+1)$, we get the short form
```math
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
```
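As a minimal sketch of this closed form (assuming the data are available as plain vectors ``volume``, ``girth`` and ``height``; these variable names are purely illustrative):
```julia
# hypothetical vectors volume, girth, height, all of length n
X    = [ones(length(volume)) girth height]   # design matrix of size n × (p+1)
βhat = (X' * X) \ (X' * volume)              # least squares estimate (XᵀX)⁻¹ Xᵀ y
# equivalently (and numerically more stable): βhat = X \ volume
```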
[use Julia code (existing package) to perform linear regression for ``volume ~ girth + height``]
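Again only a sketch, with the same assumptions on packages and column names as above:
```julia
using DataFrames, GLM, RDatasets

trees  = dataset("datasets", "trees")
model2 = lm(@formula(Volume ~ Girth + Height), trees)   # volume ~ girth + height
coeftable(model2)                                       # estimates, std. errors, t values, p-values
```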
Interpretation of the Julia output is similar to the simple linear regression model, but now we give the explicit formulas:
+ parameter estimates: $\mathbf{\hat \beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
+ estimated standard errors: $s_{\beta_j} = \sqrt{\hat \sigma^2 \left[ (\mathbf{X}^\top \mathbf{X})^{-1} \right]_{jj}}$ with $\hat \sigma^2 = \frac{1}{n-p-1} \| \mathbf{y} - \mathbf{X} \mathbf{\hat \beta} \|^2$
+ $t$-statistics: $t_j = \hat \beta_j / s_{\beta_j}$, $j = 0, \ldots, p$
+ $p$-values: for the hypotheses $\beta_j = 0$, obtained from the $t$-distribution with $n-p-1$ degrees of freedom
**Exercise**: Implement functions that estimate the $\beta$-parameters, the corresponding standard errors and the $t$-statistics.
Test your functions with the ``tree`` data set and try to reproduce the output above.
## Potential Add-on: Multiple Regression Models with Categorical Covariates