From 12e0a50946fcdbdfef045b6b4879255fcacd3277 Mon Sep 17 00:00:00 2001
From: Marco Oesting
Date: Mon, 9 Oct 2023 22:11:52 +0200
Subject: [PATCH] Update Multiple Regression.

---
 .../regression/MultipleRegressionBasics.qmd | 640 +++++++++---------
 1 file changed, 316 insertions(+), 324 deletions(-)

diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd
index 294ca02..a54bf3c 100644
--- a/material/3_wed/regression/MultipleRegressionBasics.qmd
+++ b/material/3_wed/regression/MultipleRegressionBasics.qmd
@@ -1,324 +1,316 @@
-::: - -::: {.callout-caution collapse="false"} -## Task 1 - -1. Generate $n=20$ covariates $\mathbf{x}$ randomly. -2. Given these covariates and the true parameters $\beta_0=-3$, - $\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model - (with normally distributed errors) and estimate the coefficients - $\beta_0$ and $\beta_1$. -3. Play with different choices of the parameters above (including the - sample size $n$) to see the effects on the parameter estimates and - the $p$-values. -::: - -## Multiple Regression Model - -*Idea*: Generalize the simple linear regression model to multiple -covariates, w.g., predict `volume` using `girth` and \`height\`\`. - -$\leadsto$ **Linear Model:** -$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where - -- $y_i$: $i$-th measurement of the response, - -- $x_{i1}$: $i$ th value of first covariate, - -- ... - -- $x_{ip}$: $i$-th value of $p$-th covariate, - -- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally - distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the *optimal* estimators for -$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. - -*Our choice again:* Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ - \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 -$$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the -vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 -$$ - -Defining the *design matrix* - -$$ \mathbf{X} = \left( \begin{array}{cccc} - 1 & x_{11} & \ldots & x_{1p} \\ - \vdots & \vdots & \ddots & \vdots \\ - 1 & x_{11} & \ldots & x_{1p} - \end{array}\right) \qquad - (\text{size } n \times (p+1)), $$ - -we get the short form - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} -$$ - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth + height`\] - -The interpretation of the Julia output is similar to the simple linear -regression model, but we provide explicit formulas now: - -- parameter estimates: - - $$ - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - $$ - -- estimated standard errors: - - $$ - \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} - $$ - -- $t$-statistics: - - $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ - -- $p$-values: - - $$ - p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} - $$ - -::: {.callout-caution collapse="false"} - -## Task 2 - -1. Implement functions that estimate the $\beta$-parameters, -the corresponding standard errors and the $t$-statistics. -2. Test your functions with the `tree' data set and try to reproduce the -output above. -::: - -Which model is the best? For linear models, one often uses the $R^2$ characteristic. -Roughly speaking, it gives the percentage (between 0 and 1) of the variance that can be explained by the linear model. 
-::: +--- +editor: + markdown: + wrap: 72 +--- + +# Multiple Regression Basics + +## Motivation + +### Introductory Example: tree dataset from R + +``` julia +using Statistics +using Plots +using RDatasets + +trees = dataset("datasets", "trees") + +scatter(trees.Volume, trees.Girth, + legend=false, xlabel="Girth", ylabel="Volume") +``` + +*Aim:* Find relationship between the *response variable* `volume` and +the *explanatory variable/covariate* `girth`? Can we predict the volume +of a tree given its girth? + +``` julia +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) +``` + +First Guess: There is a linear relation! + +## Simple Linear Regression + +Main assumption: up to some error term, each measurement of the response +variable $y_i$ depends linearly on the corresponding value $x_i$ of the +covariate + +$\leadsto$ **(Simple) Linear Model:** +$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ +where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent +normally distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the straight line that fits best, i.e., find the *optimal* +estimators for $\beta_0$ and $\beta_1$. + +*Typical choice*: Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the +vector of covariates and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ + (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 +$$ + +Note: There is a closed-form expression for +$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but +rather use Julia to solve the problem. + +``` julia +lm(@formula(Volume ~ Girth), trees) +``` + +*Interpretation of the Julia output:* + +- column `estimate` : least square estimates for $\hat \beta_0$ and + $\hat \beta_1$ + +- column `Std. Error` : estimated standard deviation + $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ + +- column `t value` : value of the $t$-statistics + + $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ + + Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would + follow a $t$-distribution. + +- column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for + $i=0,1$ + +::: callout-tip +The command `rand(n)` generates a sample of `n` "random" (i.e., +uniformly distributed) random numbers. If you want to sample from +another distribution, use the `Distributions` package, define an object +being the distribution of interest, e.g. `d = Normal(0.0, 2.0)` for a +normal distribution with mean 0.0 and standard deviation 2.0, and sample +`n` times from this distribution by `rand(d, n)`. +::: + +::: {.callout-caution collapse="false"} +## Task 1 + +1. Generate $n=20$ covariates $\mathbf{x}$ randomly. +2. Given these covariates and the true parameters $\beta_0=-3$, + $\beta_1=2$ and $\sigma=0.5$, simulate responses from a linear model + (with normally distributed errors) and estimate the coefficients + $\beta_0$ and $\beta_1$. +3. Play with different choices of the parameters above (including the + sample size $n$) to see the effects on the parameter estimates and + the $p$-values. 
+::: + +## Multiple Regression Model + +*Idea*: Generalize the simple linear regression model to multiple +covariates, w.g., predict `volume` using `girth` and \`height\`\`. + +$\leadsto$ **Linear Model:** +$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where + +- $y_i$: $i$-th measurement of the response, + +- $x_{i1}$: $i$ th value of first covariate, + +- ... + +- $x_{ip}$: $i$-th value of $p$-th covariate, + +- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally + distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the *optimal* estimators for +$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. + +*Our choice again:* Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ + \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 +$$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the +vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 +$$ + +Defining the *design matrix* + +$$ \mathbf{X} = \left( \begin{array}{cccc} + 1 & x_{11} & \ldots & x_{1p} \\ + \vdots & \vdots & \ddots & \vdots \\ + 1 & x_{11} & \ldots & x_{1p} + \end{array}\right) \qquad + (\text{size } n \times (p+1)), $$ + +we get the short form + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth + height`\] + +The interpretation of the Julia output is similar to the simple linear +regression model, but we provide explicit formulas now: + +- parameter estimates: + + $$ + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} + $$ + +- estimated standard errors: + + $$ + \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} + $$ + +- $t$-statistics: + + $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ + +- $p$-values: + + $$ + p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} + $$ + +::: {.callout-caution collapse="false"} + +## Task 2 + +1. Implement functions that estimate the $\beta$-parameters, +the corresponding standard errors and the $t$-statistics. +2. Test your functions with the `tree' data set and try to reproduce the +output above. +::: + +Which model is the best? For linear models, one often uses the $R^2$ characteristic. +Roughly speaking, it gives the percentage (between 0 and 1) of the variance that can be explained by the linear model. + +``` julia +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) +``` + +## Generalized Linear Models + +Classical linear model + +$$ + \mathbf{y} = \mathbf{X} \beta + \varepsilon +$$ + +implies that +$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ + +In particular, the conditional expectation satisfies +$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. + +However, the assumption of a normal distribution is unrealistic for +non-continuous data. 
## Generalized Linear Models

The classical linear model

$$
 \mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon}
$$

implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$

In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \mathbf{\beta}$.

However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include:

- for count data: $$
  \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \mathbf{\beta})
  $$

  Here, the components are considered to be independent and the
  exponential function is applied componentwise.

- for binary data: $$
  \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
  $$

  Again, the components are considered to be independent and all the
  operations are applied componentwise.

All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that

$$
 \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \mathbf{\beta}).
$$

For the models above, these are:

+----------------+------------------+--------------------------+
| Type of Data   | Distribution     | Link Function            |
|                | Family           |                          |
+================+==================+==========================+
| continuous     | Normal           | identity:                |
|                |                  |                          |
|                |                  | $$                       |
|                |                  | g(x) = x                 |
|                |                  | $$                       |
+----------------+------------------+--------------------------+
| count          | Poisson          | log:                     |
|                |                  |                          |
|                |                  | $$                       |
|                |                  | g(x) = \log(x)           |
|                |                  | $$                       |
+----------------+------------------+--------------------------+
| binary         | Bernoulli        | logit:                   |
|                |                  |                          |
|                |                  | $$                       |
|                |                  | g(x) = \log\left(        |
|                |                  | \frac{x}{1-x}            |
|                |                  | \right)                  |
|                |                  | $$                       |
+----------------+------------------+--------------------------+

In general, the parameter vector $\mathbf{\beta}$ is estimated by
maximizing the likelihood, i.e.,

$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i \cdot}),
$$

which is equivalent to the maximization of the log-likelihood, i.e.,

$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i \cdot}).
$$

In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.

``` julia
using CSV
using DataFrames
using GLM
using HTTP

http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))

# recode the "yes"/"no" strings as a Boolean response
SwissLabor[!, :participation] = SwissLabor.participation .== "yes"

model = glm(@formula(participation ~ age^2),
            SwissLabor, Binomial(), ProbitLink())
```

::: {.callout-caution collapse="false"}
## Task 3

1. Reproduce the results of our data analysis of the `trees` data set
   using a generalized linear model with normal distribution family.
2. Generate $n=20$ random covariates $\mathbf{x}$ and
   Poisson-distributed count data with means
   $\exp(\beta_0 + \beta_1 x_i)$ (cf. the Poisson model above).
   Re-estimate the parameters by a generalized linear model (a sketch
   for this part follows below).
:::
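A possible approach to the second part of Task 3, sketched with our own
choice of true parameters:

``` julia
using DataFrames, Distributions, GLM

n = 20
x = rand(n)                              # random covariates
λ = exp.(0.5 .+ 1.5 .* x)                # true parameters β₀ = 0.5, β₁ = 1.5
y = [rand(Poisson(l)) for l in λ]        # Poisson counts with means λ

sim = DataFrame(x=x, y=y)
glm(@formula(y ~ x), sim, Poisson(), LogLink())  # estimates ≈ (0.5, 1.5)
```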