---
editor:
  markdown:
    wrap: 72
---

# Multiple Regression Basics

## Motivation

### Introductory Example: tree dataset from R

``` julia
using Statistics
using Plots
using RDatasets
using GLM            # provides lm, glm and @formula used below

trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume, legend=false, xlabel="Girth", ylabel="Volume")
```

*Aim:* Find the relationship between the *response variable* `Volume` and the *explanatory variable (covariate)* `Girth`. Can we predict the volume of a tree given its girth?

``` julia
scatter(trees.Girth, trees.Volume, legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)    # add a candidate straight line
```

First guess: there is a linear relationship!

## Simple Linear Regression

Main assumption: up to some error term, each measurement of the response variable $y_i$ depends linearly on the corresponding value $x_i$ of the covariate.

$\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\ldots,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent normally distributed errors with unknown variance $\sigma^2$.

*Aim:* Find the straight line that fits best, i.e., find the *optimal* estimators for $\beta_0$ and $\beta_1$.

*Typical choice:* Least squares estimator (= maximum likelihood estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2, $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the vector of covariates and $\mathbf{1}$ is a vector of ones.

Written in matrix style:
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 $$

Note: There is a closed-form expression for $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but rather use Julia to solve the problem.

``` julia
lm(@formula(Volume ~ Girth), trees)
```

*Interpretation of the Julia output:*

- column `Estimate`: least squares estimates $\hat \beta_0$ and $\hat \beta_1$
- column `Std. Error`: estimated standard deviation $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value`: value of the $t$-statistic
  $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1. $$
  Under the hypothesis $\beta_i=0$, the test statistic $t_i$ follows a $t$-distribution (with $n-2$ degrees of freedom).
- column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for $i=0,1$

::: callout-tip
The command `rand(n)` generates a sample of `n` pseudo-random numbers that are uniformly distributed on $[0,1]$. If you want to sample from another distribution, use the `Distributions` package: define an object representing the distribution of interest, e.g. `d = Normal(0.0, 2.0)` for a normal distribution with mean 0.0 and standard deviation 2.0, and draw `n` samples from this distribution via `rand(d, n)`.
:::
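For illustration, a minimal sketch of this sampling workflow (the seed and all parameter values are arbitrary choices):

``` julia
using Distributions
using Random

Random.seed!(42)        # make the draws reproducible
u = rand(5)             # 5 uniform random numbers on [0, 1]
d = Normal(0.0, 2.0)    # normal distribution with mean 0 and standard deviation 2
z = rand(d, 5)          # 5 draws from d
```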
::: {.callout-caution collapse="false"}
## Task 1

1. Generate $n=20$ covariates $\mathbf{x}$ randomly.
2. Given these covariates and the true parameters $\beta_0=-3$, $\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model (with normally distributed errors) and estimate the coefficients $\beta_0$ and $\beta_1$.
3. Play with different choices of the parameters above (including the sample size $n$) to see the effects on the parameter estimates and the $p$-values.
:::

## Multiple Regression Model

*Idea*: Generalize the simple linear regression model to multiple covariates, e.g., predict `Volume` using `Girth` and `Height`.

$\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,\ldots,n,$$
where

- $y_i$: $i$-th measurement of the response,
- $x_{i1}$: $i$-th value of the first covariate,
- ...
- $x_{ip}$: $i$-th value of the $p$-th covariate,
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally distributed errors with unknown variance $\sigma^2$.

*Task:* Find the *optimal* estimators for $\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.

*Our choice again:* Least squares estimator (= maximum likelihood estimator for normal errors)
$$ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2, $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}_j$ is the vector of values of the $j$-th covariate and $\mathbf{1}$ is a vector of ones.

Written in matrix style:
$$ \mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 $$

Defining the *design matrix*
$$ \mathbf{X} = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{array}\right) \qquad (\text{size } n \times (p+1)), $$
we get the short form
$$ \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. $$

In Julia, this regression can be fitted just as before using the GLM package, e.g. `lm(@formula(Volume ~ Girth + Height), trees)`; see also the code block further below.

The interpretation of the Julia output is similar to the simple linear regression model, but we provide explicit formulas now:

- parameter estimates:
  $$ \mathbf{\hat \beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} $$
- estimated standard errors:
  $$ \hat s_{\hat \beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \, \frac 1 {n-p-1} \|\mathbf{y} - \mathbf{X} \mathbf{\hat \beta}\|^2} $$
- $t$-statistics:
  $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $p$-values for the hypotheses $\beta_i=0$:
  $$ p\text{-value} = \mathbb{P}(|T| > |t_i|), \quad \text{where } T \sim t_{n-p-1}. $$

::: {.callout-caution collapse="false"}
## Task 2

1. Implement functions that estimate the $\beta$-parameters, the corresponding standard errors and the $t$-statistics.
2. Test your functions with the `trees` data set and try to reproduce the output above.
:::

Which model is the best? For linear models, one often uses the $R^2$ characteristic. Roughly speaking, it gives the proportion (a number between 0 and 1) of the variance of the response that can be explained by the linear model.

``` julia
linmod1 = lm(@formula(Volume ~ Girth), trees)                           # simple model
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)                  # two covariates
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)   # with interaction

r2(linmod1)
r2(linmod2)
r2(linmod3)
```

::: callout-note
The more covariates you add, the more variance can be explained by the linear model, i.e., $R^2$ increases. In order to balance the goodness of fit of a model against its complexity, information criteria such as the AIC (available via the `aic` function) are considered.
:::
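As a rough sketch of such a comparison for the models fitted above (assuming the generic `aic` function from `StatsBase` applies to them):

``` julia
using StatsBase   # provides aic (and bic), in case GLM does not re-export them

aic(linmod2)      # Volume ~ Girth + Height
aic(linmod3)      # additionally includes the interaction Girth*Height
# The model with the smaller AIC offers the better trade-off between
# goodness of fit and model complexity.
```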
## Generalized Linear Models

The classical linear model
$$ \mathbf{y} = \mathbf{X} \beta + \varepsilon $$
implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}). $$
In particular, the conditional expectation satisfies $\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. However, the assumption of a normal distribution is unrealistic for non-continuous data. Popular alternatives include:

- for count data:
  $$ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \quad \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) $$
  Here, the components are considered to be independent and the exponential function is applied componentwise.
- for binary data:
  $$ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \quad \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} $$
  Again, the components are considered to be independent and all operations are applied componentwise.

All these models are defined by the choice of a family of distributions and a function $g$ (the so-called *link function*) such that
$$ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). $$
For the models above, these are:

| Type of Data | Distribution Family | Link Function                                   |
|--------------|---------------------|-------------------------------------------------|
| continuous   | Normal              | identity: $g(x) = x$                            |
| count        | Poisson             | log: $g(x) = \log(x)$                           |
| binary       | Bernoulli           | logit: $g(x) = \log\left(\frac{x}{1-x}\right)$  |

In general, the parameter vector $\beta$ is estimated by maximizing the likelihood, i.e.,
$$ \hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i \cdot}), $$
which is equivalent to maximizing the log-likelihood, i.e.,
$$ \hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i \cdot}). $$
In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above.

``` julia
using CSV
using DataFrames
using HTTP

http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))

# recode the response as a Boolean (true = participation)
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")

model = glm(@formula(participation ~ age^2), SwissLabor, Binomial(), ProbitLink())
```
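To connect the fitted object to the link-function notation above: the fitted means equal $g^{-1}(\mathbf{X}\hat\beta)$, which for the probit link is the standard normal cdf applied to the linear predictor. A minimal sketch (using the generic accessors `modelmatrix`, `coef` and `predict` on the `model` object from the previous block; `predict` without further arguments is assumed to return values on the response scale):

``` julia
using Distributions

eta = modelmatrix(model) * coef(model)    # linear predictor X * beta-hat
mu  = predict(model)                      # fitted means E(y | X)
maximum(abs.(mu .- cdf.(Normal(), eta)))  # probit: g^{-1} = standard normal cdf, so this is ~ 0
```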
::: {.callout-caution collapse="false"}
## Task 3

1. Reproduce the results of our data analysis of the `trees` data set using a generalized linear model with normal distribution family.
2. Generate $n=20$ random covariates $\mathbf{x}$ and Poisson-distributed count data with parameters $\lambda_i = \exp(\beta_0 + \beta_1 x_i)$. Re-estimate the parameters by a generalized linear model.
:::

## Outlook: Linear Mixed Models

In the linear regression models so far, we assumed that the response variable $\mathbf{y}$ depends on the covariates collected in the design matrix $\mathbf{X}$ (assumed to be given/fixed) through the term $\mathbf{X}\beta$ with the so-called *fixed effects* coefficients $\beta$, plus independent errors $\varepsilon$. However, in many situations, there are additional random effects acting on several components of the response variable. These can be included in the model by adding another design matrix $\mathbf{Z}$ multiplied by a random vector $u$, the so-called *random effects* coefficients, which are assumed to be jointly normally distributed with mean vector $0$ and variance-covariance matrix $\Sigma$ (typically *not* a diagonal matrix).

In matrix notation, the model has the following form:
$$ \mathbf{y} = \mathbf{X} \beta + \mathbf{Z}u + \varepsilon $$
By maximizing the likelihood, we can estimate $\beta$ and optimally predict the random vector $u$.
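To give a first impression of how such a model can be fitted in Julia, here is a minimal sketch based on the `MixedModels` package (not used elsewhere in this document; the simulated data and all parameter values are illustrative assumptions). The term `(1 | g)` in the formula adds a random intercept for each level of the grouping variable `g`:

``` julia
using DataFrames
using Distributions
using MixedModels
using Random

Random.seed!(1)
g = repeat(1:10, inner=5)               # 10 groups with 5 observations each
u = rand(Normal(0.0, 1.5), 10)          # random intercepts u_j ~ N(0, 1.5^2)
x = rand(50)                            # covariate values
y = -3 .+ 2 .* x .+ u[g] .+ rand(Normal(0.0, 0.5), 50)
df = DataFrame(y=y, x=x, g=string.(g))

# fixed effects: intercept and x; random effect: one intercept per group g
fit(MixedModel, @formula(y ~ 1 + x + (1 | g)), df)
```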