Update MultipleRegressionBasics.md

Oesting 2023-09-13 14:56:26 +02:00 committed by GitHub
parent 21097e480c
commit fd35257877

# Multiple Regression Basics
## Motivation
### Introductory Example: tree dataset from R
[figure of raw data]
*Aim:* Find the relationship between the *response variable* `volume` and the *explanatory variable/covariate* `girth`.
Can we predict the volume of a tree given its girth?
[figure including a straight line]
First guess: There is a linear relation!
## Simple Linear Regression
Main assumption: up to some error term, each measurement $y_i$ of the response variable depends linearly on the corresponding value $x_i$ of the covariate.
=> **(Simple) Linear Model**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the straight line that fits best, i.e., find the *optimal* estimators for $\beta_0$ and $\beta_1$.
*Typical choice*: Least squares estimator (= maximum likelihood estimator for normal errors)
```math
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2
```
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the vector of covariates and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
```math
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \|^2
```
Note: There is a closed-form expression for $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but rather use Julia to solve the problem.
[use Julia code (existing package) to perform linear regression for ``volume ~ girth``]
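A possible sketch of such code, assuming the GLM.jl and RDatasets.jl packages and the ``trees`` data set from R's ``datasets`` package (the package choice and the column names ``Girth``, ``Height``, ``Volume`` are assumptions, not fixed by this text):
```julia
# Sketch: simple linear regression  volume ~ girth  via GLM.jl
using DataFrames, GLM, RDatasets

trees = dataset("datasets", "trees")          # load the tree data from R
model = lm(@formula(Volume ~ Girth), trees)   # fit y = β0 + β1 * x + ε
coeftable(model)                              # estimates, std. errors, t values, p-values
```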
*Interpretation of the Julia output:*
+ column ``estimate`` : least squares estimates $\hat \beta_0$ and $\hat \beta_1$
+ column ``Std. Error`` : estimated standard deviations $s_{\beta_i}$ of the estimators
+ column ``t value`` : value of the $t$-statistics
```math
t_i = \frac{\hat \beta_i}{s_{\beta_i}}, \quad i=0,1,
```
Under the hypothesis $\beta_i=0$, $t_i$ would follow a $t$-distribution with $n-2$ degrees of freedom.
+ column ``Pr(>|t|)`` : $p$-values for the hypotheses $\beta_i=0$ for $i=0,1$
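The individual columns can also be accessed programmatically; a minimal sketch, reusing the fitted ``model`` object from the sketch above:
```julia
est  = coef(model)        # least squares estimates (β̂_0, β̂_1)
se   = stderror(model)    # estimated standard errors s_β
tval = est ./ se          # t-statistics as defined above
coeftable(model)          # full table, including the Pr(>|t|) column
```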
**Exercise**: Generate a random set of covariates $\mathbf{x}$. Given these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ (you can choose them), simulate responses from a linear model
```math
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\sigma^2), \qquad i=1,...,n,
```
and estimate the coefficients $\beta_0$ and $\beta_1$. Play with different choices of the parameters to see the effects on the parameter estimates and the $p$-values.
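One possible starting point for this exercise (all concrete numbers below are arbitrary choices, not prescribed by the text):
```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(1)                                       # reproducibility
n = 100
x = 10 .* rand(n)                                     # random covariates
β0, β1, σ² = 1.0, 2.5, 0.5                            # "true" parameters, chosen freely
y = β0 .+ β1 .* x .+ rand(Normal(0, sqrt(σ²)), n)     # simulate responses from the linear model

lm(@formula(y ~ x), DataFrame(x = x, y = y))          # estimates should be close to β0 and β1
```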
## Multiple Regression Model
*Idea*: Generalize the simple linear regression model to multiple covariates, e.g., predict ``volume`` using ``girth`` and ``height``.
=> **Linear Model**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$
where
+ $y_i$: $i$-th measurement of the response,
+ $x_{i1}$: $i$-th value of the first covariate,
+ ...
+ $x_{ip}$: $i$-th value of the $p$-th covariate,
+ $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the *optimal* estimators for $\beta_0, \beta_1, \ldots, \beta_p$.
*Our choice again:* Least squares estimator (= maximum likelihood estimator for normal errors)
```math
(\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
```
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}_j$ is the vector of values of the $j$-th covariate and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
```math
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \|^2
```
Defining the *design matrix* $\mathbf{X} = (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p)$ of size $n \times (p+1)$, we get the short form
```math
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
```
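As a minimal sketch of this closed form (assuming the data are available as plain vectors ``volume``, ``girth`` and ``height``; these variable names are purely illustrative):
```julia
# hypothetical vectors volume, girth, height, all of length n
X    = [ones(length(volume)) girth height]   # design matrix of size n × (p+1)
βhat = (X' * X) \ (X' * volume)              # least squares estimate (XᵀX)⁻¹ Xᵀ y
# equivalently (and numerically more stable): βhat = X \ volume
```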
[use Julia code (existing package) to perform linear regression for ``volume ~ girth + height``]
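Again only a sketch, with the same assumptions on packages and column names as above:
```julia
using DataFrames, GLM, RDatasets

trees  = dataset("datasets", "trees")
model2 = lm(@formula(Volume ~ Girth + Height), trees)   # volume ~ girth + height
coeftable(model2)                                       # estimates, std. errors, t values, p-values
```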
Interpretation of the Julia output is similar to the simple linear regression model, but now we give the explicit formulas:
+ parameter estimates: $\mathbf{\hat \beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
+ estimated standard errors: $s_{\beta_j} = \sqrt{\hat \sigma^2 \left[ (\mathbf{X}^\top \mathbf{X})^{-1} \right]_{jj}}$ with $\hat \sigma^2 = \frac{1}{n-p-1} \| \mathbf{y} - \mathbf{X} \mathbf{\hat \beta} \|^2$
+ $t$-statistics: $t_j = \hat \beta_j / s_{\beta_j}$, $j = 0, \ldots, p$
+ $p$-values: for the hypotheses $\beta_j = 0$, obtained from the $t$-distribution with $n-p-1$ degrees of freedom
**Exercise**: Implement functions that estimate the $\beta$-parameters, the corresponding standard errors and the $t$-statistics.
Test your functions with the ``tree`` data set and try to reproduce the output above.
## Potential Add-on: Multiple Regression Models with Categorical Covariates