summerschool_simtech_2023/material/3_wed/regression/MultipleRegressionBasics.qmd

---
editor: 
  markdown: 
    wrap: 72
---

# Multiple Regression Basics

## Motivation

### Introductory Example: tree dataset from R

\[figure of raw data\]

*Aim:* Find relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`? Can we predict the volume
of a tree given its girth?

\[figure including a straight line\]

First Guess: There is a linear relation!

## Simple Linear Regression

Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
covariate

$\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   \qquad  i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
normally distributed errors with unknown variance $\sigma^2$.

*Task:* Find the straight line that fits best, i.e., find the *optimal*
estimators for $\beta_0$ and $\beta_1$.

*Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors)

$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$

where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
vector of covariates and $\mathbf{1}$ is a vector of ones.

Written in matrix style:

$$
 (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$

Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem.

\[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]

*Interpretation of the Julia output:*

-   column `estimate` : least square estimates for $\hat \beta_0$ and
    $\hat \beta_1$

-   column `Std. Error` : estimated standard deviation
    $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$

-   column `t value` : value of the $t$-statistics

    $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$

    Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
    follow a $t$-distribution.

-   column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
    $i=0,1$

**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
(you can choose them)), simulate responses from a linear model and
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
choices of the parameters to see the effects on the parameter estimates
and the $p$-values.

## Multiple Regression Model

*Idea*: Generalize the simple linear regression model to multiple
covariates, w.g., predict `volume` using `girth` and \`height\`\`.

$\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i,   \qquad  i=1,...,n,$$where

-   $y_i$: $i$-th measurement of the response,

-   $x_{i1}$: $i$ th value of first covariate,

-   ...

-   $x_{ip}$: $i$-th value of $p$-th covariate,

-   $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
    distributed errors with unknown variance $\sigma^2$.

*Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.

*Our choice again:* Least squares estimator (= maximum likelihood
estimator for normal errors)

$$
 \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$

where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.

Written in matrix style:

$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$

Defining the *design matrix*

$$ \mathbf{X} = \left( \begin{array}{cccc}
                1 & x_{11} & \ldots & x_{1p} \\
                \vdots & \vdots & \ddots & \vdots \\
                1 & x_{11} & \ldots & x_{1p} 
                \end{array}\right) \qquad 
  (\text{size } n \times (p+1)), $$

we get the short form

$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta}  \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$

\[use Julia code (existing package) to perform linear regression for
`volume ~ girth + height`\]

The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now:

-   parameter estimates:

    $$
     (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
    $$

-   estimated standard errors:

    $$
     \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
    $$

-   $t$-statistics:

    $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$

-   $p$-values:

    $$
    p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
    $$

**Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your
functions with the \`\`\`tree''' data set and try to reproduce the
output above.

## Generalized Linear Models

Classical linear model

$$
 \mathbf{y} = \mathbf{X} \beta + \varepsilon 
$$

implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$

In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.

However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include:

-   for counting data: $$
       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
      $$

    Here, the components are considered to be independent and the
    exponential function is applied componentwise.

-   for binary data: $$
       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
      $$

    Again, the components are considered to be independent and all the
    operations are applied componentwise.

All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that

$$
 \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
$$

For the models above, these are:

+----------------------+---------------------+----------------------+
| Type of Data         | Distribution Family | Link Function        |
+======================+=====================+======================+
| continuous           | Normal              | identity:            |
|                      |                     |                      |
|                      |                     | $$                   |
|                      |                     | g(x)=x               |
|                      |                     | $$                   |
+----------------------+---------------------+----------------------+
| count                | Poisson             | log:                 |
|                      |                     |                      |
|                      |                     | $$                   |
|                      |                     |  g(x) = \log(x)      |
|                      |                     | $$                   |
+----------------------+---------------------+----------------------+
| binary               | Bernoulli           | logit:               |
|                      |                     |                      |
|                      |                     | $$                   |
|                      |                     | g(x) = \log\left     |
|                      |                     | (                    |
|                      |                     | \frac{x}{1-x}\right) |
|                      |                     | $$                   |
+----------------------+---------------------+----------------------+

In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,

$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$

which is equivalent to the maximization of the log-likelihood, i.e.,

$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$

In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.

\[\[ Example in Julia: maybe `SwissLabor` \]\]

**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution
family.
Transformed Multiple Regression Basics to quarto and added more material on GLMs 2023-09-14 17:42:06 +02:00			`---`
			`editor:`
			`markdown:`
			`wrap: 72`
			`---`

			`# Multiple Regression Basics`

			`## Motivation`

			`### Introductory Example: tree dataset from R`

			`\[figure of raw data\]`

			Aim: Find relationship between the response variable `volume` and
			the explanatory variable/covariate `girth`? Can we predict the volume
			`of a tree given its girth?`

			`\[figure including a straight line\]`

			`First Guess: There is a linear relation!`

			`## Simple Linear Regression`

			`Main assumption: up to some error term, each measurement of the response`
			`variable $y_i$ depends linearly on the corresponding value $x_i$ of the`
			`covariate`

			`$\leadsto$ (Simple) Linear Model:`
			`$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$`
			`where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent`
			`normally distributed errors with unknown variance $\sigma^2$.`

			`Task: Find the straight line that fits best, i.e., find the optimal`
			`estimators for $\beta_0$ and $\beta_1$.`

			`Typical choice: Least squares estimator (= maximum likelihood`
			`estimator for normal errors)`

			`$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \\| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\\|^2 $$`

			`where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the`
			`vector of covariates and $\mathbf{1}$ is a vector of ones.`

			`Written in matrix style:`

			`$$`
			`(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\\|^2`
			`$$`

			`Note: There is a closed-form expression for`
			`$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but`
			`rather use Julia to solve the problem.`

			`\[use Julia code (existing package) to perform linear regression for`
			`volume ~ girth`\]

			`Interpretation of the Julia output:`

			- column `estimate` : least square estimates for $\hat \beta_0$ and
			$\hat \beta_1$

			- column `Std. Error` : estimated standard deviation
			`$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$`

			- column `t value` : value of the $t$-statistics

			`$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$`

			`Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would`
			`follow a $t$-distribution.`

			- column `Pr(>\|t\|)`: $p$-values for the hyptheses $\beta_i=0$ for
			$i=0,1$

			`Task 1: Generate a random set of covariates $\mathbf{x}$. Given`
			`these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$`
			`(you can choose them)), simulate responses from a linear model and`
			`estimate the coefficients $\beta_0$ and $\beta_1$. Play with different`
			`choices of the parameters to see the effects on the parameter estimates`
			`and the $p$-values.`

			`## Multiple Regression Model`

			`Idea: Generalize the simple linear regression model to multiple`
			covariates, w.g., predict `volume` using `girth` and \`height\`\`.

			`$\leadsto$ Linear Model:`
			`$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where`

			`- $y_i$: $i$-th measurement of the response,`

			`- $x_{i1}$: $i$ th value of first covariate,`

			`- ...`

			`- $x_{ip}$: $i$-th value of $p$-th covariate,`

			`- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally`
			`distributed errors with unknown variance $\sigma^2$.`

			`Task: Find the optimal estimators for`
			`$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.`

			`Our choice again: Least squares estimator (= maximum likelihood`
			`estimator for normal errors)`

			`$$`
			`\hat \beta = \mathrm{argmin} \ \\| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\\|^2`
			`$$`

			`where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the`
			`vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.`

			`Written in matrix style:`

			`$$`
			`\mathbf{\hat \beta} = \mathrm{argmin} \ \left\\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\\|^2`
			`$$`

			`Defining the design matrix`

			`$$ \mathbf{X} = \left( \begin{array}{cccc}`
			`1 & x_{11} & \ldots & x_{1p} \\`
			`\vdots & \vdots & \ddots & \vdots \\`
			`1 & x_{11} & \ldots & x_{1p}`
			`\end{array}\right) \qquad`
			`(\text{size } n \times (p+1)), $$`

			`we get the short form`

			`$$`
			`\mathbf{\hat \beta} = \mathrm{argmin} \ \\| \mathbf{y} - \mathbf{X} \mathbf{\beta} \\|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}`
			`$$`

			`\[use Julia code (existing package) to perform linear regression for`
			`volume ~ girth + height`\]

			`The interpretation of the Julia output is similar to the simple linear`
			`regression model, but we provide explicit formulas now:`

			`- parameter estimates:`

			`$$`
			`(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}`
			`$$`

			`- estimated standard errors:`

			`$$`
			`\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \\|\mathbf{y} - \mathbf{X} \beta\\|^2}`
			`$$`

			`- $t$-statistics:`

			`$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$`

			`- $p$-values:`

			`$$`
			`p\text{-value} = \mathbb{P}(\|T\| > t_i), \quad \text{where } T \sim t_{n-p}`
			`$$`

			`Task 2: Implement functions that estimate the $\beta$-parameters,`
			`the corresponding standard errors and the $t$-statistics. Test your`
			functions with the \`\`\`tree''' data set and try to reproduce the
			`output above.`

			`## Generalized Linear Models`

			`Classical linear model`

			`$$`
			`\mathbf{y} = \mathbf{X} \beta + \varepsilon`
			`$$`

			`implies that`
			`$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$`

			`In particular, the conditional expectation satisfies`
			`$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.`

			`However, the assumption of a normal distribution is unrealistic for`
			`non-continuous data. Popular alternatives include:`

			`- for counting data: $$`
			`\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)`
			`$$`

			`Here, the components are considered to be independent and the`
			`exponential function is applied componentwise.`

			`- for binary data: $$`
			`\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}`
			`$$`

			`Again, the components are considered to be independent and all the`
			`operations are applied componentwise.`

			`All these models are defined by the choice of a family of distributions`
			`and a function $g$ (the so-called link function) such that`

			`$$`
			`\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).`
			`$$`

			`For the models above, these are:`

			`+----------------------+---------------------+----------------------+`
			`\| Type of Data \| Distribution Family \| Link Function \|`
			`+======================+=====================+======================+`
			`\| continuous \| Normal \| identity: \|`
			`\| \| \| \|`
			`\| \| \| $$ \|`
			`\| \| \| g(x)=x \|`
			`\| \| \| $$ \|`
			`+----------------------+---------------------+----------------------+`
			`\| count \| Poisson \| log: \|`
			`\| \| \| \|`
			`\| \| \| $$ \|`
			`\| \| \| g(x) = \log(x) \|`
			`\| \| \| $$ \|`
			`+----------------------+---------------------+----------------------+`
			`\| binary \| Bernoulli \| logit: \|`
			`\| \| \| \|`
			`\| \| \| $$ \|`
			`\| \| \| g(x) = \log\left \|`
			`\| \| \| ( \|`
			`\| \| \| \frac{x}{1-x}\right) \|`
			`\| \| \| $$ \|`
			`+----------------------+---------------------+----------------------+`

			`In general, the parameter vector $\beta$ is estimated via maximizing the`
			`likelihood, i.e.,`

			`$$`
			`\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),`
			`$$`

			`which is equivalent to the maximization of the log-likelihood, i.e.,`

			`$$`
			`\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),`
			`$$`

			`In the Gaussian case, the maximum likelihood estimator is identical to`
			`the least squares estimator considered above.`

			\[\[ Example in Julia: maybe `SwissLabor` \]\]

			Task 3: Reproduce the results of our data analysis of the `tree`
			`data set using a generalized linear model with normal distribution`
			`family.`