From 905aea75465825344dea3cc4fb2e1b74b8b685f2 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 15:49:03 +0200 Subject: [PATCH 1/3] Shorten testing lecture --- material/2_tue/testing/slides.md | 152 ++++++++----------------------- 1 file changed, 39 insertions(+), 113 deletions(-) diff --git a/material/2_tue/testing/slides.md b/material/2_tue/testing/slides.md index 539efeb..cde26dd 100644 --- a/material/2_tue/testing/slides.md +++ b/material/2_tue/testing/slides.md @@ -34,11 +34,11 @@ slideOptions: # Learning Goals -- Justify the effort of developing tests to some extent - Get to know a few common terms of testing - Work with the Julia unit testing package `Test.jl` -Material is taken and modified, on the one hand, from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and, on the other hand, from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/). +Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/). + --- @@ -48,30 +48,21 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https:// ## What is Testing? -- Smelling old milk before using it! -- A way to determine if a software is not producing reliable results and if so, what is the reason. -- Manual testing vs. automated testing. +- Smelling old milk before using it +- A way to determine if a software is not producing reliable results and if so, what is the reason +- Manual testing vs. automated testing --- ## Why Should you Test your Software? -- Improve software reliability and reproducibility. -- Make sure that changes (bugfixes, new features) do not affect other parts of software. +- Improve software reliability and reproducibility +- Make sure that changes (bugfixes, new features) do not affect other parts of software - Generally all software is better off being tested regularly. Possible exceptions are very small codes with single users. - Ensure that a released version of a software actually works. --- -## Nomenclature in Software Testing - -- **Fixture**: preparatory set for testing. -- **Actual result**: what the code produces when given the fixture. -- **Expected result**: what the actual result is compared to. -- **Test coverage**: how much of the code do tests touch in one run. - ---- - ## Some Ways to Test Software - Assertions @@ -83,29 +74,27 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https:// ## Assertions -- Principle of *defensive programming*. +```julia +@assert condition "message" +``` + +- Principle of *defensive programming* - Nothing happens when an assertion is true; throws error when false. - Types of assertion statements: - Precondition - Postcondition - Invariant -- A basic but powerful tool to test a software on-the-go. -- Assertion statement syntax in Python - -```julia -@assert condition "message" -``` +- A basic but powerful tool to test a software on-the-go --- ## Unit Testing -- Catching errors with assertions is good but preventing them is better! +- Catching errors with assertions is good but preventing them is better. - A *unit* is a single function in one situation. - A situation is one amongst many possible variations of input parameters. -- User creates the expected result manually. -- Fixture is the set of inputs used to generate an actual result. -- Actual result is compared to the expected result by `@test`. +- User creates the **expected result** manually. +- **Actual result** is compared to the expected result by `@test`. --- @@ -113,109 +102,44 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https:// - Test whether several units work in conjunction. - *Integrate* units and test them together in an *integration* test. -- Often more complicated than a unit test and has more test coverage. -- A fixture is used to generate an actual result. -- Actual result is compared to the expected result by `@test`. +- Often more complicated than a unit test and gives higher test coverage. --- ## Regression Testing - Generating an expected result is not possible in some situations. -- Compare the current actual result with a previous actual result. +- Compare the *current* actual result with a *previous* actual result. - No guarantee that the current actual result is correct. - Risk of a bug being carried over indefinitely. - Main purpose is to identify changes in the current state of the code with respect to a past state. --- -## Test Coverage - -- Coverage is the amount of code a test touches in one run. -- Aim for high test coverage. -- There is a trade-off: high test coverage vs. effort in test development - ---- - -## Comparing Floating-point Variables - -- Very often quantities in math software are `float` / `double`. -- Such quantities cannot be compared to exact values, an approximation is necessary. -- Comparison of floating point variables needs to be done to a certain tolerance. - -```julia -@test 1 ≈ 0.999999 rtol=1e-5 -``` - -- Get `≈` by Latex `\approx` + TAB - ---- - -## Test-driven Development (TDD) - -- Principle is to write a test and then write a code to fulfill the test. -- Advantages: - - In the end user ends up with a test alongside the code. - - Eliminates confirmation bias of the user. - - Writing tests gives clarity on what the code is supposed to do. -- Disadvantage: known to not improve productivity. - ---- - -## Checking-driven Development (CDD) - -- Developer performs spot checks; sanity checks at intermediate stages -- Math software often has heuristics which are easy to determine. -- Keep performing same checks at different stages of development to ensure the code works. - ---- - -## Verifying a Test - -- Test written as part of a bug-fix: - - Reproduce the bug in the test by ensuring that the test fails. - - Fix the bug. - - Rerun the test to ensure that it passes. -- Test written to increase code coverage: - - Make sure that the first iteration of the test passes. - - Try introducing a small fixable bug in the code to verify if the test fails. - ---- - # 2. Unit Testing in Julia with Test.jl --- -## Setup of Tests.jl +## Setup of Test.jl -- Standard library to write and manage tests, `using Test` - Standardized folder structure: -``` -├── Manifest.toml -├── Project.toml -├── src/ -└── test - ├── Manifest.toml - ├── Project.toml - ├── runtests.jl - └── setup.jl -``` + ``` + ├── Manifest.toml + ├── Project.toml + ├── src/ + └── test + ├── Manifest.toml + ├── Project.toml + ├── runtests.jl + └── setup.jl + ``` - Singular `test` vs plural `runtests.jl` - `setup.jl` for all `using XYZ` statements, included in `runtests.jl` -- Additional packages either in `[extra] section` of `./Project.toml` or in a new `./test/Project.toml` environment +- Additional packages in `[extra] section` of `./Project.toml` or in new `./test/Project.toml` - In case of the latter: Do not add the package itself to the `./test/Project.toml` - - ---- - -## Run Tests - -Various options: - -- Directly call `runtests.jl` TODO? -- From Pkg-Manager `]test` when root project is activated +- Run: `]test` when root project is activated --- @@ -234,11 +158,11 @@ Various options: - `@testset`: Structure tests ```julia - julia> @testset "trigonometric identities" begin - θ = 2/3*π - @test sin(-θ) ≈ -sin(θ) - @test cos(-θ) ≈ cos(θ) - end; + @testset "trigonometric identities" begin + θ = 2/3*π + @test sin(-θ) ≈ -sin(θ) + @test cos(-θ) ≈ cos(θ) + end; ``` - `@testset for ... end`: Test in loop @@ -251,6 +175,8 @@ Various options: - [HiRSE-Summer of Testing Part 2b: "Testing with Julia" by Nils Niggemann](https://www.youtube.com/watch?v=gSMKNbZOpZU) - [Official documentation of Test.jl](https://docs.julialang.org/en/v1/stdlib/Test/) +--- + # 3. Test.jl Demo We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree/main/material/2_tue/testing/MyTestPackage), which looks as follows: @@ -274,7 +200,7 @@ We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree - Look at `MyTestPackage.jl` and `find.jl`: We have two functions `find_max` and `find_mean`, which calculate the maximum and mean of all elements of a `::AbstractVector`. - Assertions were added to check for `NaN` values - Look at `runtests.jl`: - - TODO: Why do we need `using MyTestPackage`? + - Why do we need `using MyTestPackage`? - We include dependencies via `setup.jl`: `Test` and `StableRNG`. - Testset "find" - Look at `find.jl` From 08951f610d3ba359a91dab33df62a13d6ec57b90 Mon Sep 17 00:00:00 2001 From: Marco Oesting Date: Mon, 9 Oct 2023 16:07:42 +0200 Subject: [PATCH 2/3] further update on multiple regression --- material/3_wed/regression/Code_Snippets.jl | 98 ++- .../regression/MultipleRegressionBasics.qmd | 584 +++++++++--------- 2 files changed, 339 insertions(+), 343 deletions(-) diff --git a/material/3_wed/regression/Code_Snippets.jl b/material/3_wed/regression/Code_Snippets.jl index 05bb3e8..e7889d9 100644 --- a/material/3_wed/regression/Code_Snippets.jl +++ b/material/3_wed/regression/Code_Snippets.jl @@ -1,52 +1,48 @@ -############################################################################ -#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' #### -############################################################################ - -using Statistics -using Plots -using RDatasets -using GLM - -## - -trees = dataset("datasets", "trees") - -scatter(trees.Girth, trees.Volume, - legend=false, xlabel="Girth", ylabel="Volume") - -## - -scatter(trees.Girth, trees.Volume, - legend=false, xlabel="Girth", ylabel="Volume") -plot!(x -> -37 + 5*x) - -## - -linmod1 = lm(@formula(Volume ~ Girth), trees) - -## - -linmod2 = lm(@formula(Volume ~ Girth + Height), trees) - -## - -r2(linmod1) -r2(linmod2) - -linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) - -r2(linmod3) - -## - -using CSV -using HTTP - -http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") -SwissLabor = DataFrame(CSV.File(http_response.body)) - -SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") - -## - +using Statistics +using Plots +using RDatasets +using GLM + +#--- + +trees = dataset("datasets", "trees") + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") + +#--- + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) + +#--- + +linmod1 = lm(@formula(Volume ~ Girth), trees) + +#--- + +linmod2 = lm(@formula(Volume ~ Girth + Height), trees) + +#--- + +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) + +#--- + +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +#--- + model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink()) \ No newline at end of file diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index 1f23123..7e2f5a6 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -1,292 +1,292 @@ ---- -editor: - markdown: - wrap: 72 ---- - -# Multiple Regression Basics - -## Motivation - -### Introductory Example: tree dataset from R - -```{julia} -using Statistics -using Plots -using RDatasets - -trees = dataset("datasets", "trees") - -scatter(trees.Volume, trees.Girth, - legend=false, xlabel="Girth", ylabel="Volume") -``` - -*Aim:* Find relationship between the *response variable* `volume` and -the *explanatory variable/covariate* `girth`? Can we predict the volume -of a tree given its girth? - -```{julia} -scatter(trees.Girth, trees.Volume, - legend=false, xlabel="Girth", ylabel="Volume") -plot!(x -> -37 + 5*x) -``` - -First Guess: There is a linear relation! - -## Simple Linear Regression - -Main assumption: up to some error term, each measurement of the response -variable $y_i$ depends linearly on the corresponding value $x_i$ of the -covariate - -$\leadsto$ **(Simple) Linear Model:** -$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ -where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent -normally distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the straight line that fits best, i.e., find the *optimal* -estimators for $\beta_0$ and $\beta_1$. - -*Typical choice*: Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the -vector of covariates and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ - (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 -$$ - -Note: There is a closed-form expression for -$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but -rather use Julia to solve the problem. - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth`\] - -```{julia} -lm(@formula(Volume ~ Girth), trees) -``` - -*Interpretation of the Julia output:* - -- column `estimate` : least square estimates for $\hat \beta_0$ and - $\hat \beta_1$ - -- column `Std. Error` : estimated standard deviation - $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ - -- column `t value` : value of the $t$-statistics - - $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ - - Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would - follow a $t$-distribution. - -- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for - $i=0,1$ - -**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given -these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ -(you can choose them)), simulate responses from a linear model and -estimate the coefficients $\beta_0$ and $\beta_1$. Play with different -choices of the parameters to see the effects on the parameter estimates -and the $p$-values. - -## Multiple Regression Model - -*Idea*: Generalize the simple linear regression model to multiple -covariates, w.g., predict `volume` using `girth` and \`height\`\`. - -$\leadsto$ **Linear Model:** -$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where - -- $y_i$: $i$-th measurement of the response, - -- $x_{i1}$: $i$ th value of first covariate, - -- ... - -- $x_{ip}$: $i$-th value of $p$-th covariate, - -- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally - distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the *optimal* estimators for -$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. - -*Our choice again:* Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ - \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 -$$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the -vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 -$$ - -Defining the *design matrix* - -$$ \mathbf{X} = \left( \begin{array}{cccc} - 1 & x_{11} & \ldots & x_{1p} \\ - \vdots & \vdots & \ddots & \vdots \\ - 1 & x_{11} & \ldots & x_{1p} - \end{array}\right) \qquad - (\text{size } n \times (p+1)), $$ - -we get the short form - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} -$$ - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth + height`\] - -The interpretation of the Julia output is similar to the simple linear -regression model, but we provide explicit formulas now: - -- parameter estimates: - - $$ - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - $$ - -- estimated standard errors: - - $$ - \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} - $$ - -- $t$-statistics: - - $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ - -- $p$-values: - - $$ - p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} - $$ - -**Task 2**: Implement functions that estimate the $\beta$-parameters, -the corresponding standard errors and the $t$-statistics. Test your -functions with the \`\`\`tree''' data set and try to reproduce the -output above. - -```{julia} -r2(linmod1) -r2(linmod2) - -linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) - -r2(linmod3) -``` - -## Generalized Linear Models - -Classical linear model - -$$ - \mathbf{y} = \mathbf{X} \beta + \varepsilon -$$ - -implies that -$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ - -In particular, the conditional expectation satisfies -$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. - -However, the assumption of a normal distribution is unrealistic for -non-continuous data. Popular alternatives include: - -- for counting data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) - $$ - - Here, the components are considered to be independent and the - exponential function is applied componentwise. - -- for binary data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} - $$ - - Again, the components are considered to be independent and all the - operations are applied componentwise. - -All these models are defined by the choice of a family of distributions -and a function $g$ (the so-called *link function*) such that - -$$ - \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). -$$ - -For the models above, these are: - -+--------------+---------------------+--------------------+ -| Type of Data | Distribution Family | Link Function | -+==============+=====================+====================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+--------------+---------------------+--------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+--------------+---------------------+--------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \ | -| | | f | -| | | rac{x}{1-x}\right) | -| | | $$ | -+--------------+---------------------+--------------------+ - -In general, the parameter vector $\beta$ is estimated via maximizing the -likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -which is equivalent to the maximization of the log-likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -In the Gaussian case, the maximum likelihood estimator is identical to -the least squares estimator considered above. - -```{julia} -using CSV -using HTTP - -http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") -SwissLabor = DataFrame(CSV.File(http_response.body)) - -SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") - -model = glm(@formula(participation ~ age^2), - SwissLabor, Binomial(), ProbitLink()) -``` - -**Task 3:** Reproduce the results of our data analysis of the `tree` -data set using a generalized linear model with normal distribution -family. +--- +editor: + markdown: + wrap: 72 +--- + +# Multiple Regression Basics + +## Motivation + +### Introductory Example: tree dataset from R + +``` julia +using Statistics +using Plots +using RDatasets + +trees = dataset("datasets", "trees") + +scatter(trees.Volume, trees.Girth, + legend=false, xlabel="Girth", ylabel="Volume") +``` + +*Aim:* Find relationship between the *response variable* `volume` and +the *explanatory variable/covariate* `girth`? Can we predict the volume +of a tree given its girth? + +``` julia +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) +``` + +First Guess: There is a linear relation! + +## Simple Linear Regression + +Main assumption: up to some error term, each measurement of the response +variable $y_i$ depends linearly on the corresponding value $x_i$ of the +covariate + +$\leadsto$ **(Simple) Linear Model:** +$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ +where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent +normally distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the straight line that fits best, i.e., find the *optimal* +estimators for $\beta_0$ and $\beta_1$. + +*Typical choice*: Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the +vector of covariates and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ + (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 +$$ + +Note: There is a closed-form expression for +$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but +rather use Julia to solve the problem. + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth`\] + +``` julia +lm(@formula(Volume ~ Girth), trees) +``` + +*Interpretation of the Julia output:* + +- column `estimate` : least square estimates for $\hat \beta_0$ and + $\hat \beta_1$ + +- column `Std. Error` : estimated standard deviation + $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ + +- column `t value` : value of the $t$-statistics + + $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ + + Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would + follow a $t$-distribution. + +- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for + $i=0,1$ + +**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given +these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ +(you can choose them)), simulate responses from a linear model and +estimate the coefficients $\beta_0$ and $\beta_1$. Play with different +choices of the parameters to see the effects on the parameter estimates +and the $p$-values. + +## Multiple Regression Model + +*Idea*: Generalize the simple linear regression model to multiple +covariates, w.g., predict `volume` using `girth` and \`height\`\`. + +$\leadsto$ **Linear Model:** +$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where + +- $y_i$: $i$-th measurement of the response, + +- $x_{i1}$: $i$ th value of first covariate, + +- ... + +- $x_{ip}$: $i$-th value of $p$-th covariate, + +- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally + distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the *optimal* estimators for +$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. + +*Our choice again:* Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ + \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 +$$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the +vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 +$$ + +Defining the *design matrix* + +$$ \mathbf{X} = \left( \begin{array}{cccc} + 1 & x_{11} & \ldots & x_{1p} \\ + \vdots & \vdots & \ddots & \vdots \\ + 1 & x_{11} & \ldots & x_{1p} + \end{array}\right) \qquad + (\text{size } n \times (p+1)), $$ + +we get the short form + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth + height`\] + +The interpretation of the Julia output is similar to the simple linear +regression model, but we provide explicit formulas now: + +- parameter estimates: + + $$ + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} + $$ + +- estimated standard errors: + + $$ + \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} + $$ + +- $t$-statistics: + + $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ + +- $p$-values: + + $$ + p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} + $$ + +**Task 2**: Implement functions that estimate the $\beta$-parameters, +the corresponding standard errors and the $t$-statistics. Test your +functions with the \`\`\`tree''' data set and try to reproduce the +output above. + +``` julia +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) +``` + +## Generalized Linear Models + +Classical linear model + +$$ + \mathbf{y} = \mathbf{X} \beta + \varepsilon +$$ + +implies that +$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ + +In particular, the conditional expectation satisfies +$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. + +However, the assumption of a normal distribution is unrealistic for +non-continuous data. Popular alternatives include: + +- for counting data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) + $$ + + Here, the components are considered to be independent and the + exponential function is applied componentwise. + +- for binary data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} + $$ + + Again, the components are considered to be independent and all the + operations are applied componentwise. + +All these models are defined by the choice of a family of distributions +and a function $g$ (the so-called *link function*) such that + +$$ + \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). +$$ + +For the models above, these are: + ++--------------+---------------------+--------------------+ +| Type of Data | Distribution Family | Link Function | ++==============+=====================+====================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++--------------+---------------------+--------------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++--------------+---------------------+--------------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g(x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | rac{x}{1-x}\right) | +| | | $$ | ++--------------+---------------------+--------------------+ + +In general, the parameter vector $\beta$ is estimated via maximizing the +likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +which is equivalent to the maximization of the log-likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +In the Gaussian case, the maximum likelihood estimator is identical to +the least squares estimator considered above. + +``` julia +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +model = glm(@formula(participation ~ age^2), + SwissLabor, Binomial(), ProbitLink()) +``` + +**Task 3:** Reproduce the results of our data analysis of the `tree` +data set using a generalized linear model with normal distribution +family. From 39892ad1c1faa4505b3cf2f2c9cad456f45f3528 Mon Sep 17 00:00:00 2001 From: Marco Oesting Date: Mon, 9 Oct 2023 16:14:00 +0200 Subject: [PATCH 3/3] Another update --- .../regression/MultipleRegressionBasics.qmd | 62 ++++++++++--------- 1 file changed, 32 insertions(+), 30 deletions(-) diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index 7e2f5a6..f54f3b5 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -10,7 +10,7 @@ editor: ### Introductory Example: tree dataset from R -``` julia +``` julia using Statistics using Plots using RDatasets @@ -25,7 +25,7 @@ scatter(trees.Volume, trees.Girth, the *explanatory variable/covariate* `girth`? Can we predict the volume of a tree given its girth? -``` julia +``` julia scatter(trees.Girth, trees.Volume, legend=false, xlabel="Girth", ylabel="Volume") plot!(x -> -37 + 5*x) @@ -68,7 +68,7 @@ rather use Julia to solve the problem. \[use Julia code (existing package) to perform linear regression for `volume ~ girth`\] -``` julia +``` julia lm(@formula(Volume ~ Girth), trees) ``` @@ -183,7 +183,7 @@ the corresponding standard errors and the $t$-statistics. Test your functions with the \`\`\`tree''' data set and try to reproduce the output above. -``` julia +``` julia r2(linmod1) r2(linmod2) @@ -232,31 +232,33 @@ $$ For the models above, these are: -+--------------+---------------------+--------------------+ -| Type of Data | Distribution Family | Link Function | -+==============+=====================+====================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+--------------+---------------------+--------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+--------------+---------------------+--------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \ | -| | | f | -| | | rac{x}{1-x}\right) | -| | | $$ | -+--------------+---------------------+--------------------+ ++---------------+-------------------+------------------+ +| Type of Data | Distribution | Link Function | +| | Family | | ++===============+===================+==================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++---------------+-------------------+------------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++---------------+-------------------+------------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g(x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | ra | +| | | c{x}{1-x}\right) | +| | | $$ | ++---------------+-------------------+------------------+ In general, the parameter vector $\beta$ is estimated via maximizing the likelihood, i.e., @@ -274,7 +276,7 @@ $$ In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above. -``` julia +``` julia using CSV using HTTP