Merge branch 'main' of https://github.com/s-ccs/summerschool_simtech_2023
This commit is contained in:
commit
20c4a1e58b
@ -34,11 +34,11 @@ slideOptions:
|
||||
|
||||
# Learning Goals
|
||||
|
||||
- Justify the effort of developing tests to some extent
|
||||
- Get to know a few common terms of testing
|
||||
- Work with the Julia unit testing package `Test.jl`
|
||||
|
||||
Material is taken and modified, on the one hand, from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and, on the other hand, from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/).
|
||||
Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/).
|
||||
|
||||
|
||||
---
|
||||
|
||||
@ -48,30 +48,21 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
|
||||
|
||||
## What is Testing?
|
||||
|
||||
- Smelling old milk before using it!
|
||||
- A way to determine if a software is not producing reliable results and if so, what is the reason.
|
||||
- Manual testing vs. automated testing.
|
||||
- Smelling old milk before using it
|
||||
- A way to determine if a software is not producing reliable results and if so, what is the reason
|
||||
- Manual testing vs. automated testing
|
||||
|
||||
---
|
||||
|
||||
## Why Should you Test your Software?
|
||||
|
||||
- Improve software reliability and reproducibility.
|
||||
- Make sure that changes (bugfixes, new features) do not affect other parts of software.
|
||||
- Improve software reliability and reproducibility
|
||||
- Make sure that changes (bugfixes, new features) do not affect other parts of software
|
||||
- Generally all software is better off being tested regularly. Possible exceptions are very small codes with single users.
|
||||
- Ensure that a released version of a software actually works.
|
||||
|
||||
---
|
||||
|
||||
## Nomenclature in Software Testing
|
||||
|
||||
- **Fixture**: preparatory set for testing.
|
||||
- **Actual result**: what the code produces when given the fixture.
|
||||
- **Expected result**: what the actual result is compared to.
|
||||
- **Test coverage**: how much of the code do tests touch in one run.
|
||||
|
||||
---
|
||||
|
||||
## Some Ways to Test Software
|
||||
|
||||
- Assertions
|
||||
@ -83,29 +74,27 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
|
||||
|
||||
## Assertions
|
||||
|
||||
- Principle of *defensive programming*.
|
||||
```julia
|
||||
@assert condition "message"
|
||||
```
|
||||
|
||||
- Principle of *defensive programming*
|
||||
- Nothing happens when an assertion is true; throws error when false.
|
||||
- Types of assertion statements:
|
||||
- Precondition
|
||||
- Postcondition
|
||||
- Invariant
|
||||
- A basic but powerful tool to test a software on-the-go.
|
||||
- Assertion statement syntax in Python
|
||||
|
||||
```julia
|
||||
@assert condition "message"
|
||||
```
|
||||
- A basic but powerful tool to test a software on-the-go
|
||||
|
||||
---
|
||||
|
||||
## Unit Testing
|
||||
|
||||
- Catching errors with assertions is good but preventing them is better!
|
||||
- Catching errors with assertions is good but preventing them is better.
|
||||
- A *unit* is a single function in one situation.
|
||||
- A situation is one amongst many possible variations of input parameters.
|
||||
- User creates the expected result manually.
|
||||
- Fixture is the set of inputs used to generate an actual result.
|
||||
- Actual result is compared to the expected result by `@test`.
|
||||
- User creates the **expected result** manually.
|
||||
- **Actual result** is compared to the expected result by `@test`.
|
||||
|
||||
---
|
||||
|
||||
@ -113,109 +102,44 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
|
||||
|
||||
- Test whether several units work in conjunction.
|
||||
- *Integrate* units and test them together in an *integration* test.
|
||||
- Often more complicated than a unit test and has more test coverage.
|
||||
- A fixture is used to generate an actual result.
|
||||
- Actual result is compared to the expected result by `@test`.
|
||||
- Often more complicated than a unit test and gives higher test coverage.
|
||||
|
||||
---
|
||||
|
||||
## Regression Testing
|
||||
|
||||
- Generating an expected result is not possible in some situations.
|
||||
- Compare the current actual result with a previous actual result.
|
||||
- Compare the *current* actual result with a *previous* actual result.
|
||||
- No guarantee that the current actual result is correct.
|
||||
- Risk of a bug being carried over indefinitely.
|
||||
- Main purpose is to identify changes in the current state of the code with respect to a past state.
|
||||
|
||||
---
|
||||
|
||||
## Test Coverage
|
||||
|
||||
- Coverage is the amount of code a test touches in one run.
|
||||
- Aim for high test coverage.
|
||||
- There is a trade-off: high test coverage vs. effort in test development
|
||||
|
||||
---
|
||||
|
||||
## Comparing Floating-point Variables
|
||||
|
||||
- Very often quantities in math software are `float` / `double`.
|
||||
- Such quantities cannot be compared to exact values, an approximation is necessary.
|
||||
- Comparison of floating point variables needs to be done to a certain tolerance.
|
||||
|
||||
```julia
|
||||
@test 1 ≈ 0.999999 rtol=1e-5
|
||||
```
|
||||
|
||||
- Get `≈` by Latex `\approx` + TAB
|
||||
|
||||
---
|
||||
|
||||
## Test-driven Development (TDD)
|
||||
|
||||
- Principle is to write a test and then write a code to fulfill the test.
|
||||
- Advantages:
|
||||
- In the end user ends up with a test alongside the code.
|
||||
- Eliminates confirmation bias of the user.
|
||||
- Writing tests gives clarity on what the code is supposed to do.
|
||||
- Disadvantage: known to not improve productivity.
|
||||
|
||||
---
|
||||
|
||||
## Checking-driven Development (CDD)
|
||||
|
||||
- Developer performs spot checks; sanity checks at intermediate stages
|
||||
- Math software often has heuristics which are easy to determine.
|
||||
- Keep performing same checks at different stages of development to ensure the code works.
|
||||
|
||||
---
|
||||
|
||||
## Verifying a Test
|
||||
|
||||
- Test written as part of a bug-fix:
|
||||
- Reproduce the bug in the test by ensuring that the test fails.
|
||||
- Fix the bug.
|
||||
- Rerun the test to ensure that it passes.
|
||||
- Test written to increase code coverage:
|
||||
- Make sure that the first iteration of the test passes.
|
||||
- Try introducing a small fixable bug in the code to verify if the test fails.
|
||||
|
||||
---
|
||||
|
||||
# 2. Unit Testing in Julia with Test.jl
|
||||
|
||||
---
|
||||
|
||||
## Setup of Tests.jl
|
||||
## Setup of Test.jl
|
||||
|
||||
- Standard library to write and manage tests, `using Test`
|
||||
- Standardized folder structure:
|
||||
|
||||
```
|
||||
├── Manifest.toml
|
||||
├── Project.toml
|
||||
├── src/
|
||||
└── test
|
||||
├── Manifest.toml
|
||||
├── Project.toml
|
||||
├── runtests.jl
|
||||
└── setup.jl
|
||||
```
|
||||
```
|
||||
├── Manifest.toml
|
||||
├── Project.toml
|
||||
├── src/
|
||||
└── test
|
||||
├── Manifest.toml
|
||||
├── Project.toml
|
||||
├── runtests.jl
|
||||
└── setup.jl
|
||||
```
|
||||
|
||||
- Singular `test` vs plural `runtests.jl`
|
||||
- `setup.jl` for all `using XYZ` statements, included in `runtests.jl`
|
||||
- Additional packages either in `[extra] section` of `./Project.toml` or in a new `./test/Project.toml` environment
|
||||
- Additional packages in `[extra] section` of `./Project.toml` or in new `./test/Project.toml`
|
||||
- In case of the latter: Do not add the package itself to the `./test/Project.toml`
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Run Tests
|
||||
|
||||
Various options:
|
||||
|
||||
- Directly call `runtests.jl` TODO?
|
||||
- From Pkg-Manager `]test` when root project is activated
|
||||
- Run: `]test` when root project is activated
|
||||
|
||||
---
|
||||
|
||||
@ -234,11 +158,11 @@ Various options:
|
||||
- `@testset`: Structure tests
|
||||
|
||||
```julia
|
||||
julia> @testset "trigonometric identities" begin
|
||||
θ = 2/3*π
|
||||
@test sin(-θ) ≈ -sin(θ)
|
||||
@test cos(-θ) ≈ cos(θ)
|
||||
end;
|
||||
@testset "trigonometric identities" begin
|
||||
θ = 2/3*π
|
||||
@test sin(-θ) ≈ -sin(θ)
|
||||
@test cos(-θ) ≈ cos(θ)
|
||||
end;
|
||||
```
|
||||
|
||||
- `@testset for ... end`: Test in loop
|
||||
@ -251,6 +175,8 @@ Various options:
|
||||
- [HiRSE-Summer of Testing Part 2b: "Testing with Julia" by Nils Niggemann](https://www.youtube.com/watch?v=gSMKNbZOpZU)
|
||||
- [Official documentation of Test.jl](https://docs.julialang.org/en/v1/stdlib/Test/)
|
||||
|
||||
---
|
||||
|
||||
# 3. Test.jl Demo
|
||||
|
||||
We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree/main/material/2_tue/testing/MyTestPackage), which looks as follows:
|
||||
@ -274,7 +200,7 @@ We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree
|
||||
- Look at `MyTestPackage.jl` and `find.jl`: We have two functions `find_max` and `find_mean`, which calculate the maximum and mean of all elements of a `::AbstractVector`.
|
||||
- Assertions were added to check for `NaN` values
|
||||
- Look at `runtests.jl`:
|
||||
- TODO: Why do we need `using MyTestPackage`?
|
||||
- Why do we need `using MyTestPackage`?
|
||||
- We include dependencies via `setup.jl`: `Test` and `StableRNG`.
|
||||
- Testset "find"
|
||||
- Look at `find.jl`
|
||||
|
@ -1,52 +1,48 @@
|
||||
############################################################################
|
||||
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
|
||||
############################################################################
|
||||
|
||||
using Statistics
|
||||
using Plots
|
||||
using RDatasets
|
||||
using GLM
|
||||
|
||||
##
|
||||
|
||||
trees = dataset("datasets", "trees")
|
||||
|
||||
scatter(trees.Girth, trees.Volume,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
|
||||
##
|
||||
|
||||
scatter(trees.Girth, trees.Volume,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
plot!(x -> -37 + 5*x)
|
||||
|
||||
##
|
||||
|
||||
linmod1 = lm(@formula(Volume ~ Girth), trees)
|
||||
|
||||
##
|
||||
|
||||
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
|
||||
|
||||
##
|
||||
|
||||
r2(linmod1)
|
||||
r2(linmod2)
|
||||
|
||||
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
|
||||
|
||||
r2(linmod3)
|
||||
|
||||
##
|
||||
|
||||
using CSV
|
||||
using HTTP
|
||||
|
||||
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
|
||||
SwissLabor = DataFrame(CSV.File(http_response.body))
|
||||
|
||||
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
|
||||
|
||||
##
|
||||
|
||||
using Statistics
|
||||
using Plots
|
||||
using RDatasets
|
||||
using GLM
|
||||
|
||||
#---
|
||||
|
||||
trees = dataset("datasets", "trees")
|
||||
|
||||
scatter(trees.Girth, trees.Volume,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
|
||||
#---
|
||||
|
||||
scatter(trees.Girth, trees.Volume,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
plot!(x -> -37 + 5*x)
|
||||
|
||||
#---
|
||||
|
||||
linmod1 = lm(@formula(Volume ~ Girth), trees)
|
||||
|
||||
#---
|
||||
|
||||
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
|
||||
|
||||
#---
|
||||
|
||||
r2(linmod1)
|
||||
r2(linmod2)
|
||||
|
||||
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
|
||||
|
||||
r2(linmod3)
|
||||
|
||||
#---
|
||||
|
||||
using CSV
|
||||
using HTTP
|
||||
|
||||
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
|
||||
SwissLabor = DataFrame(CSV.File(http_response.body))
|
||||
|
||||
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
|
||||
|
||||
#---
|
||||
|
||||
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())
|
@ -1,292 +1,294 @@
|
||||
---
|
||||
editor:
|
||||
markdown:
|
||||
wrap: 72
|
||||
---
|
||||
|
||||
# Multiple Regression Basics
|
||||
|
||||
## Motivation
|
||||
|
||||
### Introductory Example: tree dataset from R
|
||||
|
||||
```{julia}
|
||||
using Statistics
|
||||
using Plots
|
||||
using RDatasets
|
||||
|
||||
trees = dataset("datasets", "trees")
|
||||
|
||||
scatter(trees.Volume, trees.Girth,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
```
|
||||
|
||||
*Aim:* Find relationship between the *response variable* `volume` and
|
||||
the *explanatory variable/covariate* `girth`? Can we predict the volume
|
||||
of a tree given its girth?
|
||||
|
||||
```{julia}
|
||||
scatter(trees.Girth, trees.Volume,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
plot!(x -> -37 + 5*x)
|
||||
```
|
||||
|
||||
First Guess: There is a linear relation!
|
||||
|
||||
## Simple Linear Regression
|
||||
|
||||
Main assumption: up to some error term, each measurement of the response
|
||||
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
|
||||
covariate
|
||||
|
||||
$\leadsto$ **(Simple) Linear Model:**
|
||||
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
|
||||
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
|
||||
normally distributed errors with unknown variance $\sigma^2$.
|
||||
|
||||
*Task:* Find the straight line that fits best, i.e., find the *optimal*
|
||||
estimators for $\beta_0$ and $\beta_1$.
|
||||
|
||||
*Typical choice*: Least squares estimator (= maximum likelihood
|
||||
estimator for normal errors)
|
||||
|
||||
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
|
||||
|
||||
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
|
||||
vector of covariates and $\mathbf{1}$ is a vector of ones.
|
||||
|
||||
Written in matrix style:
|
||||
|
||||
$$
|
||||
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
|
||||
$$
|
||||
|
||||
Note: There is a closed-form expression for
|
||||
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
|
||||
rather use Julia to solve the problem.
|
||||
|
||||
\[use Julia code (existing package) to perform linear regression for
|
||||
`volume ~ girth`\]
|
||||
|
||||
```{julia}
|
||||
lm(@formula(Volume ~ Girth), trees)
|
||||
```
|
||||
|
||||
*Interpretation of the Julia output:*
|
||||
|
||||
- column `estimate` : least square estimates for $\hat \beta_0$ and
|
||||
$\hat \beta_1$
|
||||
|
||||
- column `Std. Error` : estimated standard deviation
|
||||
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
|
||||
|
||||
- column `t value` : value of the $t$-statistics
|
||||
|
||||
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
|
||||
|
||||
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
|
||||
follow a $t$-distribution.
|
||||
|
||||
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
|
||||
$i=0,1$
|
||||
|
||||
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
|
||||
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
|
||||
(you can choose them)), simulate responses from a linear model and
|
||||
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
|
||||
choices of the parameters to see the effects on the parameter estimates
|
||||
and the $p$-values.
|
||||
|
||||
## Multiple Regression Model
|
||||
|
||||
*Idea*: Generalize the simple linear regression model to multiple
|
||||
covariates, w.g., predict `volume` using `girth` and \`height\`\`.
|
||||
|
||||
$\leadsto$ **Linear Model:**
|
||||
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
|
||||
|
||||
- $y_i$: $i$-th measurement of the response,
|
||||
|
||||
- $x_{i1}$: $i$ th value of first covariate,
|
||||
|
||||
- ...
|
||||
|
||||
- $x_{ip}$: $i$-th value of $p$-th covariate,
|
||||
|
||||
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
|
||||
distributed errors with unknown variance $\sigma^2$.
|
||||
|
||||
*Task:* Find the *optimal* estimators for
|
||||
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
|
||||
|
||||
*Our choice again:* Least squares estimator (= maximum likelihood
|
||||
estimator for normal errors)
|
||||
|
||||
$$
|
||||
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
|
||||
$$
|
||||
|
||||
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
|
||||
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
|
||||
|
||||
Written in matrix style:
|
||||
|
||||
$$
|
||||
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
|
||||
$$
|
||||
|
||||
Defining the *design matrix*
|
||||
|
||||
$$ \mathbf{X} = \left( \begin{array}{cccc}
|
||||
1 & x_{11} & \ldots & x_{1p} \\
|
||||
\vdots & \vdots & \ddots & \vdots \\
|
||||
1 & x_{11} & \ldots & x_{1p}
|
||||
\end{array}\right) \qquad
|
||||
(\text{size } n \times (p+1)), $$
|
||||
|
||||
we get the short form
|
||||
|
||||
$$
|
||||
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
|
||||
$$
|
||||
|
||||
\[use Julia code (existing package) to perform linear regression for
|
||||
`volume ~ girth + height`\]
|
||||
|
||||
The interpretation of the Julia output is similar to the simple linear
|
||||
regression model, but we provide explicit formulas now:
|
||||
|
||||
- parameter estimates:
|
||||
|
||||
$$
|
||||
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
|
||||
$$
|
||||
|
||||
- estimated standard errors:
|
||||
|
||||
$$
|
||||
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
|
||||
$$
|
||||
|
||||
- $t$-statistics:
|
||||
|
||||
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
|
||||
|
||||
- $p$-values:
|
||||
|
||||
$$
|
||||
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
|
||||
$$
|
||||
|
||||
**Task 2**: Implement functions that estimate the $\beta$-parameters,
|
||||
the corresponding standard errors and the $t$-statistics. Test your
|
||||
functions with the \`\`\`tree''' data set and try to reproduce the
|
||||
output above.
|
||||
|
||||
```{julia}
|
||||
r2(linmod1)
|
||||
r2(linmod2)
|
||||
|
||||
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
|
||||
|
||||
r2(linmod3)
|
||||
```
|
||||
|
||||
## Generalized Linear Models
|
||||
|
||||
Classical linear model
|
||||
|
||||
$$
|
||||
\mathbf{y} = \mathbf{X} \beta + \varepsilon
|
||||
$$
|
||||
|
||||
implies that
|
||||
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
|
||||
|
||||
In particular, the conditional expectation satisfies
|
||||
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
|
||||
|
||||
However, the assumption of a normal distribution is unrealistic for
|
||||
non-continuous data. Popular alternatives include:
|
||||
|
||||
- for counting data: $$
|
||||
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
|
||||
$$
|
||||
|
||||
Here, the components are considered to be independent and the
|
||||
exponential function is applied componentwise.
|
||||
|
||||
- for binary data: $$
|
||||
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
|
||||
$$
|
||||
|
||||
Again, the components are considered to be independent and all the
|
||||
operations are applied componentwise.
|
||||
|
||||
All these models are defined by the choice of a family of distributions
|
||||
and a function $g$ (the so-called *link function*) such that
|
||||
|
||||
$$
|
||||
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
|
||||
$$
|
||||
|
||||
For the models above, these are:
|
||||
|
||||
+--------------+---------------------+--------------------+
|
||||
| Type of Data | Distribution Family | Link Function |
|
||||
+==============+=====================+====================+
|
||||
| continuous | Normal | identity: |
|
||||
| | | |
|
||||
| | | $$ |
|
||||
| | | g(x)=x |
|
||||
| | | $$ |
|
||||
+--------------+---------------------+--------------------+
|
||||
| count | Poisson | log: |
|
||||
| | | |
|
||||
| | | $$ |
|
||||
| | | g(x) = \log(x) |
|
||||
| | | $$ |
|
||||
+--------------+---------------------+--------------------+
|
||||
| binary | Bernoulli | logit: |
|
||||
| | | |
|
||||
| | | $$ |
|
||||
| | | g(x) = \log\left |
|
||||
| | | ( |
|
||||
| | | \ |
|
||||
| | | f |
|
||||
| | | rac{x}{1-x}\right) |
|
||||
| | | $$ |
|
||||
+--------------+---------------------+--------------------+
|
||||
|
||||
In general, the parameter vector $\beta$ is estimated via maximizing the
|
||||
likelihood, i.e.,
|
||||
|
||||
$$
|
||||
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
|
||||
$$
|
||||
|
||||
which is equivalent to the maximization of the log-likelihood, i.e.,
|
||||
|
||||
$$
|
||||
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
|
||||
$$
|
||||
|
||||
In the Gaussian case, the maximum likelihood estimator is identical to
|
||||
the least squares estimator considered above.
|
||||
|
||||
```{julia}
|
||||
using CSV
|
||||
using HTTP
|
||||
|
||||
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
|
||||
SwissLabor = DataFrame(CSV.File(http_response.body))
|
||||
|
||||
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
|
||||
|
||||
model = glm(@formula(participation ~ age^2),
|
||||
SwissLabor, Binomial(), ProbitLink())
|
||||
```
|
||||
|
||||
**Task 3:** Reproduce the results of our data analysis of the `tree`
|
||||
data set using a generalized linear model with normal distribution
|
||||
family.
|
||||
---
|
||||
editor:
|
||||
markdown:
|
||||
wrap: 72
|
||||
---
|
||||
|
||||
# Multiple Regression Basics
|
||||
|
||||
## Motivation
|
||||
|
||||
### Introductory Example: tree dataset from R
|
||||
|
||||
``` julia
|
||||
using Statistics
|
||||
using Plots
|
||||
using RDatasets
|
||||
|
||||
trees = dataset("datasets", "trees")
|
||||
|
||||
scatter(trees.Volume, trees.Girth,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
```
|
||||
|
||||
*Aim:* Find relationship between the *response variable* `volume` and
|
||||
the *explanatory variable/covariate* `girth`? Can we predict the volume
|
||||
of a tree given its girth?
|
||||
|
||||
``` julia
|
||||
scatter(trees.Girth, trees.Volume,
|
||||
legend=false, xlabel="Girth", ylabel="Volume")
|
||||
plot!(x -> -37 + 5*x)
|
||||
```
|
||||
|
||||
First Guess: There is a linear relation!
|
||||
|
||||
## Simple Linear Regression
|
||||
|
||||
Main assumption: up to some error term, each measurement of the response
|
||||
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
|
||||
covariate
|
||||
|
||||
$\leadsto$ **(Simple) Linear Model:**
|
||||
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
|
||||
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
|
||||
normally distributed errors with unknown variance $\sigma^2$.
|
||||
|
||||
*Task:* Find the straight line that fits best, i.e., find the *optimal*
|
||||
estimators for $\beta_0$ and $\beta_1$.
|
||||
|
||||
*Typical choice*: Least squares estimator (= maximum likelihood
|
||||
estimator for normal errors)
|
||||
|
||||
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
|
||||
|
||||
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
|
||||
vector of covariates and $\mathbf{1}$ is a vector of ones.
|
||||
|
||||
Written in matrix style:
|
||||
|
||||
$$
|
||||
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
|
||||
$$
|
||||
|
||||
Note: There is a closed-form expression for
|
||||
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
|
||||
rather use Julia to solve the problem.
|
||||
|
||||
\[use Julia code (existing package) to perform linear regression for
|
||||
`volume ~ girth`\]
|
||||
|
||||
``` julia
|
||||
lm(@formula(Volume ~ Girth), trees)
|
||||
```
|
||||
|
||||
*Interpretation of the Julia output:*
|
||||
|
||||
- column `estimate` : least square estimates for $\hat \beta_0$ and
|
||||
$\hat \beta_1$
|
||||
|
||||
- column `Std. Error` : estimated standard deviation
|
||||
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
|
||||
|
||||
- column `t value` : value of the $t$-statistics
|
||||
|
||||
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
|
||||
|
||||
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
|
||||
follow a $t$-distribution.
|
||||
|
||||
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
|
||||
$i=0,1$
|
||||
|
||||
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
|
||||
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
|
||||
(you can choose them)), simulate responses from a linear model and
|
||||
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
|
||||
choices of the parameters to see the effects on the parameter estimates
|
||||
and the $p$-values.
|
||||
|
||||
## Multiple Regression Model
|
||||
|
||||
*Idea*: Generalize the simple linear regression model to multiple
|
||||
covariates, w.g., predict `volume` using `girth` and \`height\`\`.
|
||||
|
||||
$\leadsto$ **Linear Model:**
|
||||
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
|
||||
|
||||
- $y_i$: $i$-th measurement of the response,
|
||||
|
||||
- $x_{i1}$: $i$ th value of first covariate,
|
||||
|
||||
- ...
|
||||
|
||||
- $x_{ip}$: $i$-th value of $p$-th covariate,
|
||||
|
||||
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
|
||||
distributed errors with unknown variance $\sigma^2$.
|
||||
|
||||
*Task:* Find the *optimal* estimators for
|
||||
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
|
||||
|
||||
*Our choice again:* Least squares estimator (= maximum likelihood
|
||||
estimator for normal errors)
|
||||
|
||||
$$
|
||||
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
|
||||
$$
|
||||
|
||||
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
|
||||
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
|
||||
|
||||
Written in matrix style:
|
||||
|
||||
$$
|
||||
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
|
||||
$$
|
||||
|
||||
Defining the *design matrix*
|
||||
|
||||
$$ \mathbf{X} = \left( \begin{array}{cccc}
|
||||
1 & x_{11} & \ldots & x_{1p} \\
|
||||
\vdots & \vdots & \ddots & \vdots \\
|
||||
1 & x_{11} & \ldots & x_{1p}
|
||||
\end{array}\right) \qquad
|
||||
(\text{size } n \times (p+1)), $$
|
||||
|
||||
we get the short form
|
||||
|
||||
$$
|
||||
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
|
||||
$$
|
||||
|
||||
\[use Julia code (existing package) to perform linear regression for
|
||||
`volume ~ girth + height`\]
|
||||
|
||||
The interpretation of the Julia output is similar to the simple linear
|
||||
regression model, but we provide explicit formulas now:
|
||||
|
||||
- parameter estimates:
|
||||
|
||||
$$
|
||||
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
|
||||
$$
|
||||
|
||||
- estimated standard errors:
|
||||
|
||||
$$
|
||||
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
|
||||
$$
|
||||
|
||||
- $t$-statistics:
|
||||
|
||||
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
|
||||
|
||||
- $p$-values:
|
||||
|
||||
$$
|
||||
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
|
||||
$$
|
||||
|
||||
**Task 2**: Implement functions that estimate the $\beta$-parameters,
|
||||
the corresponding standard errors and the $t$-statistics. Test your
|
||||
functions with the \`\`\`tree''' data set and try to reproduce the
|
||||
output above.
|
||||
|
||||
``` julia
|
||||
r2(linmod1)
|
||||
r2(linmod2)
|
||||
|
||||
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
|
||||
|
||||
r2(linmod3)
|
||||
```
|
||||
|
||||
## Generalized Linear Models
|
||||
|
||||
Classical linear model
|
||||
|
||||
$$
|
||||
\mathbf{y} = \mathbf{X} \beta + \varepsilon
|
||||
$$
|
||||
|
||||
implies that
|
||||
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
|
||||
|
||||
In particular, the conditional expectation satisfies
|
||||
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
|
||||
|
||||
However, the assumption of a normal distribution is unrealistic for
|
||||
non-continuous data. Popular alternatives include:
|
||||
|
||||
- for counting data: $$
|
||||
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
|
||||
$$
|
||||
|
||||
Here, the components are considered to be independent and the
|
||||
exponential function is applied componentwise.
|
||||
|
||||
- for binary data: $$
|
||||
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
|
||||
$$
|
||||
|
||||
Again, the components are considered to be independent and all the
|
||||
operations are applied componentwise.
|
||||
|
||||
All these models are defined by the choice of a family of distributions
|
||||
and a function $g$ (the so-called *link function*) such that
|
||||
|
||||
$$
|
||||
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
|
||||
$$
|
||||
|
||||
For the models above, these are:
|
||||
|
||||
+---------------+-------------------+------------------+
|
||||
| Type of Data | Distribution | Link Function |
|
||||
| | Family | |
|
||||
+===============+===================+==================+
|
||||
| continuous | Normal | identity: |
|
||||
| | | |
|
||||
| | | $$ |
|
||||
| | | g(x)=x |
|
||||
| | | $$ |
|
||||
+---------------+-------------------+------------------+
|
||||
| count | Poisson | log: |
|
||||
| | | |
|
||||
| | | $$ |
|
||||
| | | g(x) = \log(x) |
|
||||
| | | $$ |
|
||||
+---------------+-------------------+------------------+
|
||||
| binary | Bernoulli | logit: |
|
||||
| | | |
|
||||
| | | $$ |
|
||||
| | | g(x) = \log\left |
|
||||
| | | ( |
|
||||
| | | \ |
|
||||
| | | f |
|
||||
| | | ra |
|
||||
| | | c{x}{1-x}\right) |
|
||||
| | | $$ |
|
||||
+---------------+-------------------+------------------+
|
||||
|
||||
In general, the parameter vector $\beta$ is estimated via maximizing the
|
||||
likelihood, i.e.,
|
||||
|
||||
$$
|
||||
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
|
||||
$$
|
||||
|
||||
which is equivalent to the maximization of the log-likelihood, i.e.,
|
||||
|
||||
$$
|
||||
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
|
||||
$$
|
||||
|
||||
In the Gaussian case, the maximum likelihood estimator is identical to
|
||||
the least squares estimator considered above.
|
||||
|
||||
``` julia
|
||||
using CSV
|
||||
using HTTP
|
||||
|
||||
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
|
||||
SwissLabor = DataFrame(CSV.File(http_response.body))
|
||||
|
||||
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
|
||||
|
||||
model = glm(@formula(participation ~ age^2),
|
||||
SwissLabor, Binomial(), ProbitLink())
|
||||
```
|
||||
|
||||
**Task 3:** Reproduce the results of our data analysis of the `tree`
|
||||
data set using a generalized linear model with normal distribution
|
||||
family.
|
||||
|
Loading…
Reference in New Issue
Block a user