This commit is contained in:
behinger (s-ccs 001) 2023-10-09 14:14:56 +00:00
commit 20c4a1e58b
3 changed files with 380 additions and 456 deletions

View File

@ -34,11 +34,11 @@ slideOptions:
# Learning Goals # Learning Goals
- Justify the effort of developing tests to some extent
- Get to know a few common terms of testing - Get to know a few common terms of testing
- Work with the Julia unit testing package `Test.jl` - Work with the Julia unit testing package `Test.jl`
Material is taken and modified, on the one hand, from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and, on the other hand, from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/). Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/).
--- ---
@ -48,30 +48,21 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
## What is Testing? ## What is Testing?
- Smelling old milk before using it! - Smelling old milk before using it
- A way to determine if a software is not producing reliable results and if so, what is the reason. - A way to determine if a software is not producing reliable results and if so, what is the reason
- Manual testing vs. automated testing. - Manual testing vs. automated testing
--- ---
## Why Should you Test your Software? ## Why Should you Test your Software?
- Improve software reliability and reproducibility. - Improve software reliability and reproducibility
- Make sure that changes (bugfixes, new features) do not affect other parts of software. - Make sure that changes (bugfixes, new features) do not affect other parts of software
- Generally all software is better off being tested regularly. Possible exceptions are very small codes with single users. - Generally all software is better off being tested regularly. Possible exceptions are very small codes with single users.
- Ensure that a released version of a software actually works. - Ensure that a released version of a software actually works.
--- ---
## Nomenclature in Software Testing
- **Fixture**: preparatory set for testing.
- **Actual result**: what the code produces when given the fixture.
- **Expected result**: what the actual result is compared to.
- **Test coverage**: how much of the code do tests touch in one run.
---
## Some Ways to Test Software ## Some Ways to Test Software
- Assertions - Assertions
@ -83,29 +74,27 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
## Assertions ## Assertions
- Principle of *defensive programming*. ```julia
@assert condition "message"
```
- Principle of *defensive programming*
- Nothing happens when an assertion is true; throws error when false. - Nothing happens when an assertion is true; throws error when false.
- Types of assertion statements: - Types of assertion statements:
- Precondition - Precondition
- Postcondition - Postcondition
- Invariant - Invariant
- A basic but powerful tool to test a software on-the-go. - A basic but powerful tool to test a software on-the-go
- Assertion statement syntax in Python
```julia
@assert condition "message"
```
--- ---
## Unit Testing ## Unit Testing
- Catching errors with assertions is good but preventing them is better! - Catching errors with assertions is good but preventing them is better.
- A *unit* is a single function in one situation. - A *unit* is a single function in one situation.
- A situation is one amongst many possible variations of input parameters. - A situation is one amongst many possible variations of input parameters.
- User creates the expected result manually. - User creates the **expected result** manually.
- Fixture is the set of inputs used to generate an actual result. - **Actual result** is compared to the expected result by `@test`.
- Actual result is compared to the expected result by `@test`.
--- ---
@ -113,109 +102,44 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
- Test whether several units work in conjunction. - Test whether several units work in conjunction.
- *Integrate* units and test them together in an *integration* test. - *Integrate* units and test them together in an *integration* test.
- Often more complicated than a unit test and has more test coverage. - Often more complicated than a unit test and gives higher test coverage.
- A fixture is used to generate an actual result.
- Actual result is compared to the expected result by `@test`.
--- ---
## Regression Testing ## Regression Testing
- Generating an expected result is not possible in some situations. - Generating an expected result is not possible in some situations.
- Compare the current actual result with a previous actual result. - Compare the *current* actual result with a *previous* actual result.
- No guarantee that the current actual result is correct. - No guarantee that the current actual result is correct.
- Risk of a bug being carried over indefinitely. - Risk of a bug being carried over indefinitely.
- Main purpose is to identify changes in the current state of the code with respect to a past state. - Main purpose is to identify changes in the current state of the code with respect to a past state.
--- ---
## Test Coverage
- Coverage is the amount of code a test touches in one run.
- Aim for high test coverage.
- There is a trade-off: high test coverage vs. effort in test development
---
## Comparing Floating-point Variables
- Very often quantities in math software are `float` / `double`.
- Such quantities cannot be compared to exact values, an approximation is necessary.
- Comparison of floating point variables needs to be done to a certain tolerance.
```julia
@test 1 ≈ 0.999999 rtol=1e-5
```
- Get `≈` by Latex `\approx` + TAB
---
## Test-driven Development (TDD)
- Principle is to write a test and then write a code to fulfill the test.
- Advantages:
- In the end user ends up with a test alongside the code.
- Eliminates confirmation bias of the user.
- Writing tests gives clarity on what the code is supposed to do.
- Disadvantage: known to not improve productivity.
---
## Checking-driven Development (CDD)
- Developer performs spot checks; sanity checks at intermediate stages
- Math software often has heuristics which are easy to determine.
- Keep performing same checks at different stages of development to ensure the code works.
---
## Verifying a Test
- Test written as part of a bug-fix:
- Reproduce the bug in the test by ensuring that the test fails.
- Fix the bug.
- Rerun the test to ensure that it passes.
- Test written to increase code coverage:
- Make sure that the first iteration of the test passes.
- Try introducing a small fixable bug in the code to verify if the test fails.
---
# 2. Unit Testing in Julia with Test.jl # 2. Unit Testing in Julia with Test.jl
--- ---
## Setup of Tests.jl ## Setup of Test.jl
- Standard library to write and manage tests, `using Test`
- Standardized folder structure: - Standardized folder structure:
``` ```
├── Manifest.toml ├── Manifest.toml
├── Project.toml ├── Project.toml
├── src/ ├── src/
└── test └── test
├── Manifest.toml ├── Manifest.toml
├── Project.toml ├── Project.toml
├── runtests.jl ├── runtests.jl
└── setup.jl └── setup.jl
``` ```
- Singular `test` vs plural `runtests.jl` - Singular `test` vs plural `runtests.jl`
- `setup.jl` for all `using XYZ` statements, included in `runtests.jl` - `setup.jl` for all `using XYZ` statements, included in `runtests.jl`
- Additional packages either in `[extra] section` of `./Project.toml` or in a new `./test/Project.toml` environment - Additional packages in `[extra] section` of `./Project.toml` or in new `./test/Project.toml`
- In case of the latter: Do not add the package itself to the `./test/Project.toml` - In case of the latter: Do not add the package itself to the `./test/Project.toml`
- Run: `]test` when root project is activated
---
## Run Tests
Various options:
- Directly call `runtests.jl` TODO?
- From Pkg-Manager `]test` when root project is activated
--- ---
@ -234,11 +158,11 @@ Various options:
- `@testset`: Structure tests - `@testset`: Structure tests
```julia ```julia
julia> @testset "trigonometric identities" begin @testset "trigonometric identities" begin
θ = 2/3*π θ = 2/3*π
@test sin(-θ) ≈ -sin(θ) @test sin(-θ) ≈ -sin(θ)
@test cos(-θ) ≈ cos(θ) @test cos(-θ) ≈ cos(θ)
end; end;
``` ```
- `@testset for ... end`: Test in loop - `@testset for ... end`: Test in loop
@ -251,6 +175,8 @@ Various options:
- [HiRSE-Summer of Testing Part 2b: "Testing with Julia" by Nils Niggemann](https://www.youtube.com/watch?v=gSMKNbZOpZU) - [HiRSE-Summer of Testing Part 2b: "Testing with Julia" by Nils Niggemann](https://www.youtube.com/watch?v=gSMKNbZOpZU)
- [Official documentation of Test.jl](https://docs.julialang.org/en/v1/stdlib/Test/) - [Official documentation of Test.jl](https://docs.julialang.org/en/v1/stdlib/Test/)
---
# 3. Test.jl Demo # 3. Test.jl Demo
We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree/main/material/2_tue/testing/MyTestPackage), which looks as follows: We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree/main/material/2_tue/testing/MyTestPackage), which looks as follows:
@ -274,7 +200,7 @@ We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree
- Look at `MyTestPackage.jl` and `find.jl`: We have two functions `find_max` and `find_mean`, which calculate the maximum and mean of all elements of a `::AbstractVector`. - Look at `MyTestPackage.jl` and `find.jl`: We have two functions `find_max` and `find_mean`, which calculate the maximum and mean of all elements of a `::AbstractVector`.
- Assertions were added to check for `NaN` values - Assertions were added to check for `NaN` values
- Look at `runtests.jl`: - Look at `runtests.jl`:
- TODO: Why do we need `using MyTestPackage`? - Why do we need `using MyTestPackage`?
- We include dependencies via `setup.jl`: `Test` and `StableRNG`. - We include dependencies via `setup.jl`: `Test` and `StableRNG`.
- Testset "find" - Testset "find"
- Look at `find.jl` - Look at `find.jl`

View File

@ -1,52 +1,48 @@
############################################################################ using Statistics
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' #### using Plots
############################################################################ using RDatasets
using GLM
using Statistics
using Plots #---
using RDatasets
using GLM trees = dataset("datasets", "trees")
## scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
trees = dataset("datasets", "trees")
#---
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume") scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
## plot!(x -> -37 + 5*x)
scatter(trees.Girth, trees.Volume, #---
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x) linmod1 = lm(@formula(Volume ~ Girth), trees)
## #---
linmod1 = lm(@formula(Volume ~ Girth), trees) linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
## #---
linmod2 = lm(@formula(Volume ~ Girth + Height), trees) r2(linmod1)
r2(linmod2)
##
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod1)
r2(linmod2) r2(linmod3)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) #---
r2(linmod3) using CSV
using HTTP
##
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
using CSV SwissLabor = DataFrame(CSV.File(http_response.body))
using HTTP
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body)) #---
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
##
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink()) model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())

View File

@ -1,292 +1,294 @@
--- ---
editor: editor:
markdown: markdown:
wrap: 72 wrap: 72
--- ---
# Multiple Regression Basics # Multiple Regression Basics
## Motivation ## Motivation
### Introductory Example: tree dataset from R ### Introductory Example: tree dataset from R
```{julia} ``` julia
using Statistics using Statistics
using Plots using Plots
using RDatasets using RDatasets
trees = dataset("datasets", "trees") trees = dataset("datasets", "trees")
scatter(trees.Volume, trees.Girth, scatter(trees.Volume, trees.Girth,
legend=false, xlabel="Girth", ylabel="Volume") legend=false, xlabel="Girth", ylabel="Volume")
``` ```
*Aim:* Find relationship between the *response variable* `volume` and *Aim:* Find relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`? Can we predict the volume the *explanatory variable/covariate* `girth`? Can we predict the volume
of a tree given its girth? of a tree given its girth?
```{julia} ``` julia
scatter(trees.Girth, trees.Volume, scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume") legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x) plot!(x -> -37 + 5*x)
``` ```
First Guess: There is a linear relation! First Guess: There is a linear relation!
## Simple Linear Regression ## Simple Linear Regression
Main assumption: up to some error term, each measurement of the response Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the variable $y_i$ depends linearly on the corresponding value $x_i$ of the
covariate covariate
$\leadsto$ **(Simple) Linear Model:** $\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
normally distributed errors with unknown variance $\sigma^2$. normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the straight line that fits best, i.e., find the *optimal* *Task:* Find the straight line that fits best, i.e., find the *optimal*
estimators for $\beta_0$ and $\beta_1$. estimators for $\beta_0$ and $\beta_1$.
*Typical choice*: Least squares estimator (= maximum likelihood *Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors) estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ $$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
vector of covariates and $\mathbf{1}$ is a vector of ones. vector of covariates and $\mathbf{1}$ is a vector of ones.
Written in matrix style: Written in matrix style:
$$ $$
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$ $$
Note: There is a closed-form expression for Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem. rather use Julia to solve the problem.
\[use Julia code (existing package) to perform linear regression for \[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\] `volume ~ girth`\]
```{julia} ``` julia
lm(@formula(Volume ~ Girth), trees) lm(@formula(Volume ~ Girth), trees)
``` ```
*Interpretation of the Julia output:* *Interpretation of the Julia output:*
- column `estimate` : least square estimates for $\hat \beta_0$ and - column `estimate` : least square estimates for $\hat \beta_0$ and
$\hat \beta_1$ $\hat \beta_1$
- column `Std. Error` : estimated standard deviation - column `Std. Error` : estimated standard deviation
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value` : value of the $t$-statistics - column `t value` : value of the $t$-statistics
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
follow a $t$-distribution. follow a $t$-distribution.
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for - column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
$i=0,1$ $i=0,1$
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given **Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
(you can choose them)), simulate responses from a linear model and (you can choose them)), simulate responses from a linear model and
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
choices of the parameters to see the effects on the parameter estimates choices of the parameters to see the effects on the parameter estimates
and the $p$-values. and the $p$-values.
## Multiple Regression Model ## Multiple Regression Model
*Idea*: Generalize the simple linear regression model to multiple *Idea*: Generalize the simple linear regression model to multiple
covariates, w.g., predict `volume` using `girth` and \`height\`\`. covariates, w.g., predict `volume` using `girth` and \`height\`\`.
$\leadsto$ **Linear Model:** $\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where $$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
- $y_i$: $i$-th measurement of the response, - $y_i$: $i$-th measurement of the response,
- $x_{i1}$: $i$ th value of first covariate, - $x_{i1}$: $i$ th value of first covariate,
- ... - ...
- $x_{ip}$: $i$-th value of $p$-th covariate, - $x_{ip}$: $i$-th value of $p$-th covariate,
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally - $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
distributed errors with unknown variance $\sigma^2$. distributed errors with unknown variance $\sigma^2$.
*Task:* Find the *optimal* estimators for *Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. $\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
*Our choice again:* Least squares estimator (= maximum likelihood *Our choice again:* Least squares estimator (= maximum likelihood
estimator for normal errors) estimator for normal errors)
$$ $$
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$ $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
Written in matrix style: Written in matrix style:
$$ $$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 \mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$ $$
Defining the *design matrix* Defining the *design matrix*
$$ \mathbf{X} = \left( \begin{array}{cccc} $$ \mathbf{X} = \left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\
1 & x_{11} & \ldots & x_{1p} 1 & x_{11} & \ldots & x_{1p}
\end{array}\right) \qquad \end{array}\right) \qquad
(\text{size } n \times (p+1)), $$ (\text{size } n \times (p+1)), $$
we get the short form we get the short form
$$ $$
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$ $$
\[use Julia code (existing package) to perform linear regression for \[use Julia code (existing package) to perform linear regression for
`volume ~ girth + height`\] `volume ~ girth + height`\]
The interpretation of the Julia output is similar to the simple linear The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now: regression model, but we provide explicit formulas now:
- parameter estimates: - parameter estimates:
$$ $$
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$ $$
- estimated standard errors: - estimated standard errors:
$$ $$
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
$$ $$
- $t$-statistics: - $t$-statistics:
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $p$-values: - $p$-values:
$$ $$
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
$$ $$
**Task 2**: Implement functions that estimate the $\beta$-parameters, **Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your the corresponding standard errors and the $t$-statistics. Test your
functions with the \`\`\`tree''' data set and try to reproduce the functions with the \`\`\`tree''' data set and try to reproduce the
output above. output above.
```{julia} ``` julia
r2(linmod1) r2(linmod1)
r2(linmod2) r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3) r2(linmod3)
``` ```
## Generalized Linear Models ## Generalized Linear Models
Classical linear model Classical linear model
$$ $$
\mathbf{y} = \mathbf{X} \beta + \varepsilon \mathbf{y} = \mathbf{X} \beta + \varepsilon
$$ $$
implies that implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ $$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
In particular, the conditional expectation satisfies In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. $\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
However, the assumption of a normal distribution is unrealistic for However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include: non-continuous data. Popular alternatives include:
- for counting data: $$ - for counting data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
$$ $$
Here, the components are considered to be independent and the Here, the components are considered to be independent and the
exponential function is applied componentwise. exponential function is applied componentwise.
- for binary data: $$ - for binary data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
$$ $$
Again, the components are considered to be independent and all the Again, the components are considered to be independent and all the
operations are applied componentwise. operations are applied componentwise.
All these models are defined by the choice of a family of distributions All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that and a function $g$ (the so-called *link function*) such that
$$ $$
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
$$ $$
For the models above, these are: For the models above, these are:
+--------------+---------------------+--------------------+ +---------------+-------------------+------------------+
| Type of Data | Distribution Family | Link Function | | Type of Data | Distribution | Link Function |
+==============+=====================+====================+ | | Family | |
| continuous | Normal | identity: | +===============+===================+==================+
| | | | | continuous | Normal | identity: |
| | | $$ | | | | |
| | | g(x)=x | | | | $$ |
| | | $$ | | | | g(x)=x |
+--------------+---------------------+--------------------+ | | | $$ |
| count | Poisson | log: | +---------------+-------------------+------------------+
| | | | | count | Poisson | log: |
| | | $$ | | | | |
| | | g(x) = \log(x) | | | | $$ |
| | | $$ | | | | g(x) = \log(x) |
+--------------+---------------------+--------------------+ | | | $$ |
| binary | Bernoulli | logit: | +---------------+-------------------+------------------+
| | | | | binary | Bernoulli | logit: |
| | | $$ | | | | |
| | | g(x) = \log\left | | | | $$ |
| | | ( | | | | g(x) = \log\left |
| | | \ | | | | ( |
| | | f | | | | \ |
| | | rac{x}{1-x}\right) | | | | f |
| | | $$ | | | | ra |
+--------------+---------------------+--------------------+ | | | c{x}{1-x}\right) |
| | | $$ |
In general, the parameter vector $\beta$ is estimated via maximizing the +---------------+-------------------+------------------+
likelihood, i.e.,
In general, the parameter vector $\beta$ is estimated via maximizing the
$$ likelihood, i.e.,
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$ $$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
which is equivalent to the maximization of the log-likelihood, i.e., $$
$$ which is equivalent to the maximization of the log-likelihood, i.e.,
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$ $$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
In the Gaussian case, the maximum likelihood estimator is identical to $$
the least squares estimator considered above.
In the Gaussian case, the maximum likelihood estimator is identical to
```{julia} the least squares estimator considered above.
using CSV
using HTTP ``` julia
using CSV
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") using HTTP
SwissLabor = DataFrame(CSV.File(http_response.body))
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") SwissLabor = DataFrame(CSV.File(http_response.body))
model = glm(@formula(participation ~ age^2), SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
SwissLabor, Binomial(), ProbitLink())
``` model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink())
**Task 3:** Reproduce the results of our data analysis of the `tree` ```
data set using a generalized linear model with normal distribution
family. **Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution
family.