---
editor:
  markdown:
    wrap: 72
---

# Multiple Regression Basics

## Motivation

### Introductory Example: tree dataset from R

``` julia
using Statistics
using Plots
using RDatasets
using GLM            # provides lm, glm and @formula used below

trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume, legend=false, xlabel="Girth", ylabel="Volume")
```

*Aim:* Find the relationship between the *response variable* `Volume` and the *explanatory variable (covariate)* `Girth`. Can we predict the volume of a tree given its girth?

``` julia
scatter(trees.Girth, trees.Volume, legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)    # add a candidate straight line
```

First guess: there is a linear relationship!

## Simple Linear Regression

Main assumption: up to some error term, each measurement of the response variable $y_i$ depends linearly on the corresponding value $x_i$ of the covariate.

$\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\ldots,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent normally distributed errors with unknown variance $\sigma^2$.

*Aim:* Find the straight line that fits best, i.e., find the *optimal* estimators for $\beta_0$ and $\beta_1$.

*Typical choice:* Least squares estimator (= maximum likelihood estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2, $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the vector of covariates and $\mathbf{1}$ is a vector of ones.

Written in matrix style:
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 $$

Note: There is a closed-form expression for $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but rather use Julia to solve the problem.

``` julia
lm(@formula(Volume ~ Girth), trees)
```

*Interpretation of the Julia output:*

- column `Estimate`: least squares estimates $\hat \beta_0$ and $\hat \beta_1$
- column `Std. Error`: estimated standard deviation $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value`: value of the $t$-statistic
  $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1. $$
  Under the hypothesis $\beta_i=0$, the test statistic $t_i$ follows a $t$-distribution (with $n-2$ degrees of freedom).
- column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for $i=0,1$

::: callout-tip
The command `rand(n)` generates a sample of `n` pseudo-random numbers that are uniformly distributed on $[0,1]$. If you want to sample from another distribution, use the `Distributions` package: define an object representing the distribution of interest, e.g. `d = Normal(0.0, 2.0)` for a normal distribution with mean 0.0 and standard deviation 2.0, and draw `n` samples from this distribution via `rand(d, n)`.
:::
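For illustration, a minimal sketch of this sampling workflow (the seed and all parameter values are arbitrary choices):

``` julia
using Distributions
using Random

Random.seed!(42)        # make the draws reproducible
u = rand(5)             # 5 uniform random numbers on [0, 1]
d = Normal(0.0, 2.0)    # normal distribution with mean 0 and standard deviation 2
z = rand(d, 5)          # 5 draws from d
```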
::: {.callout-caution collapse="false"}
## Task 1

1. Generate $n=20$ covariates $\mathbf{x}$ randomly.
2. Given these covariates and the true parameters $\beta_0=-3$, $\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model (with normally distributed errors) and estimate the coefficients $\beta_0$ and $\beta_1$.
3. Play with different choices of the parameters above (including the sample size $n$) to see the effects on the parameter estimates and the $p$-values.
:::

## Multiple Regression Model

*Idea*: Generalize the simple linear regression model to multiple covariates, e.g., predict `Volume` using `Girth` and `Height`.

$\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,\ldots,n,$$
where

- $y_i$: $i$-th measurement of the response,
- $x_{i1}$: $i$-th value of the first covariate,
- ...
- $x_{ip}$: $i$-th value of the $p$-th covariate,
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally distributed errors with unknown variance $\sigma^2$.

*Task:* Find the *optimal* estimators for $\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.

*Our choice again:* Least squares estimator (= maximum likelihood estimator for normal errors)
$$ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2, $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}_j$ is the vector of values of the $j$-th covariate and $\mathbf{1}$ is a vector of ones.

Written in matrix style:
$$ \mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 $$

Defining the *design matrix*
$$ \mathbf{X} = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{array}\right) \qquad (\text{size } n \times (p+1)), $$
we get the short form
$$ \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. $$

In Julia, this regression can be fitted just as before using the GLM package, e.g. `lm(@formula(Volume ~ Girth + Height), trees)`; see also the code block further below.

The interpretation of the Julia output is similar to the simple linear regression model, but we provide explicit formulas now:

- parameter estimates:
  $$ \mathbf{\hat \beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} $$
- estimated standard errors:
  $$ \hat s_{\hat \beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \, \frac 1 {n-p-1} \|\mathbf{y} - \mathbf{X} \mathbf{\hat \beta}\|^2} $$
- $t$-statistics:
  $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $p$-values for the hypotheses $\beta_i=0$:
  $$ p\text{-value} = \mathbb{P}(|T| > |t_i|), \quad \text{where } T \sim t_{n-p-1}. $$

::: {.callout-caution collapse="false"}
## Task 2

1. Implement functions that estimate the $\beta$-parameters, the corresponding standard errors and the $t$-statistics.
2. Test your functions with the `trees` data set and try to reproduce the output above.
:::

Which model is the best? For linear models, one often uses the $R^2$ characteristic. Roughly speaking, it gives the proportion (a number between 0 and 1) of the variance of the response that can be explained by the linear model.

``` julia
linmod1 = lm(@formula(Volume ~ Girth), trees)                           # simple model
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)                  # two covariates
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)   # with interaction

r2(linmod1)
r2(linmod2)
r2(linmod3)
```

::: callout-note
The more covariates you add, the more variance can be explained by the linear model, i.e., $R^2$ increases. In order to balance the goodness of fit of a model against its complexity, information criteria such as the AIC (available via the `aic` function) are considered.
:::
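As a rough sketch of such a comparison for the models fitted above (assuming the generic `aic` function from `StatsBase` applies to them):

``` julia
using StatsBase   # provides aic (and bic), in case GLM does not re-export them

aic(linmod2)      # Volume ~ Girth + Height
aic(linmod3)      # additionally includes the interaction Girth*Height
# The model with the smaller AIC offers the better trade-off between
# goodness of fit and model complexity.
```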
## Generalized Linear Models

The classical linear model
$$ \mathbf{y} = \mathbf{X} \beta + \varepsilon $$
implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}). $$
In particular, the conditional expectation satisfies $\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. However, the assumption of a normal distribution is unrealistic for non-continuous data. Popular alternatives include:

- for count data:
  $$ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \quad \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) $$
  Here, the components are considered to be independent and the exponential function is applied componentwise.
- for binary data:
  $$ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \quad \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} $$
  Again, the components are considered to be independent and all operations are applied componentwise.

All these models are defined by the choice of a family of distributions and a function $g$ (the so-called *link function*) such that
$$ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). $$
For the models above, these are:

| Type of Data | Distribution Family | Link Function                                   |
|--------------|---------------------|-------------------------------------------------|
| continuous   | Normal              | identity: $g(x) = x$                            |
| count        | Poisson             | log: $g(x) = \log(x)$                           |
| binary       | Bernoulli           | logit: $g(x) = \log\left(\frac{x}{1-x}\right)$  |

In general, the parameter vector $\beta$ is estimated by maximizing the likelihood, i.e.,
$$ \hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i \cdot}), $$
which is equivalent to maximizing the log-likelihood, i.e.,
$$ \hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i \cdot}). $$
In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above.

``` julia
using CSV
using DataFrames
using HTTP

http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))

# recode the response as a Boolean (true = participation)
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")

model = glm(@formula(participation ~ age^2), SwissLabor, Binomial(), ProbitLink())
```
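To connect the fitted object to the link-function notation above: the fitted means equal $g^{-1}(\mathbf{X}\hat\beta)$, which for the probit link is the standard normal cdf applied to the linear predictor. A minimal sketch (using the generic accessors `modelmatrix`, `coef` and `predict` on the `model` object from the previous block; `predict` without further arguments is assumed to return values on the response scale):

``` julia
using Distributions

eta = modelmatrix(model) * coef(model)    # linear predictor X * beta-hat
mu  = predict(model)                      # fitted means E(y | X)
maximum(abs.(mu .- cdf.(Normal(), eta)))  # probit: g^{-1} = standard normal cdf, so this is ~ 0
```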
::: {.callout-caution collapse="false"}
## Task 3

1. Reproduce the results of our data analysis of the `trees` data set using a generalized linear model with normal distribution family.
2. Generate $n=20$ random covariates $\mathbf{x}$ and Poisson-distributed count data with parameters $\lambda_i = \exp(\beta_0 + \beta_1 x_i)$. Re-estimate the parameters by a generalized linear model.
:::

## Outlook: Linear Mixed Models

In the linear regression models so far, we assumed that the response variable $\mathbf{y}$ depends on the covariates collected in the design matrix $\mathbf{X}$ (assumed to be given/fixed) through the term $\mathbf{X}\beta$ with the so-called *fixed effects* coefficients $\beta$, plus independent errors $\varepsilon$. However, in many situations, there are additional random effects acting on several components of the response variable. These can be included in the model by adding another design matrix $\mathbf{Z}$ multiplied by a random vector $u$, the so-called *random effects* coefficients, which are assumed to be jointly normally distributed with mean vector $0$ and variance-covariance matrix $\Sigma$ (typically *not* a diagonal matrix).

In matrix notation, the model has the following form:
$$ \mathbf{y} = \mathbf{X} \beta + \mathbf{Z}u + \varepsilon $$
By maximizing the likelihood, we can estimate $\beta$ and optimally predict the random vector $u$.
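To give a first impression of how such a model can be fitted in Julia, here is a minimal sketch based on the `MixedModels` package (not used elsewhere in this document; the simulated data and all parameter values are illustrative assumptions). The term `(1 | g)` in the formula adds a random intercept for each level of the grouping variable `g`:

``` julia
using DataFrames
using Distributions
using MixedModels
using Random

Random.seed!(1)
g = repeat(1:10, inner=5)               # 10 groups with 5 observations each
u = rand(Normal(0.0, 1.5), 10)          # random intercepts u_j ~ N(0, 1.5^2)
x = rand(50)                            # covariate values
y = -3 .+ 2 .* x .+ u[g] .+ rand(Normal(0.0, 0.5), 50)
df = DataFrame(y=y, x=x, g=string.(g))

# fixed effects: intercept and x; random effect: one intercept per group g
fit(MixedModel, @formula(y ~ 1 + x + (1 | g)), df)
```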