Merge branch 'main' of https://github.com/s-ccs/summerschool_simtech_2023

2023-10-09 14:14:56 +00:00 · 2023-10-09 14:14:56 +00:00 · 20c4a1e58b
commit 20c4a1e58b
parent 773c15150d 39892ad1c1
3 changed files with 380 additions and 456 deletions
--- a/material/2_tue/testing/slides.md
+++ b/material/2_tue/testing/slides.md
@ -34,11 +34,11 @@ slideOptions:
 # Learning Goals
 - Justify the effort of developing tests to some extent
 - Get to know a few common terms of testing
 - Work with the Julia unit testing package `Test.jl`
-Material is taken and modified, on the one hand, from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and, on the other hand, from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/).
+Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/).
 ---
@ -48,30 +48,21 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
 ## What is Testing?
- Smelling old milk before using it!
+- Smelling old milk before using it
- A way to determine if a software is not producing reliable results and if so, what is the reason.
+- A way to determine if a software is not producing reliable results and if so, what is the reason
- Manual testing vs. automated testing.
+- Manual testing vs. automated testing
 ---
 ## Why Should you Test your Software?
- Improve software reliability and reproducibility.
+- Improve software reliability and reproducibility
- Make sure that changes (bugfixes, new features) do not affect other parts of software.
+- Make sure that changes (bugfixes, new features) do not affect other parts of software
 - Generally all software is better off being tested regularly. Possible exceptions are very small codes with single users.
 - Ensure that a released version of a software actually works.
 ---
 ## Nomenclature in Software Testing
 - **Fixture**: preparatory set for testing.
 - **Actual result**: what the code produces when given the fixture.
 - **Expected result**: what the actual result is compared to.
 - **Test coverage**: how much of the code do tests touch in one run.
 ---
 ## Some Ways to Test Software
 - Assertions
@ -83,29 +74,27 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
 ## Assertions
- Principle of *defensive programming*.
+```julia
@assert condition "message"
 ```
 - Principle of *defensive programming*
 - Nothing happens when an assertion is true; throws error when false.
 - Types of assertion statements:
    - Precondition
    - Postcondition
    - Invariant
- A basic but powerful tool to test a software on-the-go.
+- A basic but powerful tool to test a software on-the-go
 - Assertion statement syntax in Python
 ```julia
@assert condition "message"
 ```
 ---
 ## Unit Testing
- Catching errors with assertions is good but preventing them is better!
+- Catching errors with assertions is good but preventing them is better.
 - A *unit* is a single function in one situation.
    - A situation is one amongst many possible variations of input parameters.
- User creates the expected result manually.
+- User creates the **expected result** manually.
- Fixture is the set of inputs used to generate an actual result.
+- **Actual result** is compared to the expected result by `@test`.
 - Actual result is compared to the expected result by `@test`.
 ---
@ -113,109 +102,44 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https://
 - Test whether several units work in conjunction.
 - *Integrate* units and test them together in an *integration* test.
- Often more complicated than a unit test and has more test coverage.
+- Often more complicated than a unit test and gives higher test coverage.
 - A fixture is used to generate an actual result.
 - Actual result is compared to the expected result by `@test`.
 ---
 ## Regression Testing
 - Generating an expected result is not possible in some situations.
- Compare the current actual result with a previous actual result.
+- Compare the *current* actual result with a *previous* actual result.
 - No guarantee that the current actual result is correct.
 - Risk of a bug being carried over indefinitely.
 - Main purpose is to identify changes in the current state of the code with respect to a past state.
 ---
 ## Test Coverage
 - Coverage is the amount of code a test touches in one run.
 - Aim for high test coverage.
 - There is a trade-off: high test coverage vs. effort in test development
 ---
 ## Comparing Floating-point Variables
 - Very often quantities in math software are `float` / `double`.
 - Such quantities cannot be compared to exact values, an approximation is necessary.
 - Comparison of floating point variables needs to be done to a certain tolerance.
 ```julia
@test 1 ≈ 0.999999  rtol=1e-5
 ```
 - Get `≈` by Latex `\approx` + TAB
 ---
 ## Test-driven Development (TDD)
 - Principle is to write a test and then write a code to fulfill the test.
 - Advantages:
    - In the end user ends up with a test alongside the code.
    - Eliminates confirmation bias of the user.
    - Writing tests gives clarity on what the code is supposed to do.
 - Disadvantage: known to not improve productivity.
 ---
 ## Checking-driven Development (CDD)
 - Developer performs spot checks; sanity checks at intermediate stages
 - Math software often has heuristics which are easy to determine.
 - Keep performing same checks at different stages of development to ensure the code works.
 ---
 ## Verifying a Test
 - Test written as part of a bug-fix:
    - Reproduce the bug in the test by ensuring that the test fails.
    - Fix the bug.
    - Rerun the test to ensure that it passes.
 - Test written to increase code coverage:
    - Make sure that the first iteration of the test passes.
    - Try introducing a small fixable bug in the code to verify if the test fails.
 ---
 # 2. Unit Testing in Julia with Test.jl
 ---
-## Setup of Tests.jl
+## Setup of Test.jl
 - Standard library to write and manage tests, `using Test`
 - Standardized folder structure:
-```
+  ```
-├── Manifest.toml
+  ├── Manifest.toml
-├── Project.toml
+  ├── Project.toml
-├── src/
+  ├── src/
-└── test
+  └── test
-    ├── Manifest.toml
+      ├── Manifest.toml
-    ├── Project.toml
+      ├── Project.toml
-    ├── runtests.jl
+      ├── runtests.jl
-    └── setup.jl
+      └── setup.jl
-```
+  ```
 - Singular `test` vs plural `runtests.jl`
 - `setup.jl` for all `using XYZ` statements, included in `runtests.jl`
- Additional packages either in `[extra] section` of `./Project.toml` or in a new `./test/Project.toml` environment
+- Additional packages in `[extra] section` of `./Project.toml` or in new `./test/Project.toml`
  - In case of the latter: Do not add the package itself to the `./test/Project.toml`
-
+- Run: `]test` when root project is activated
 ---
 ## Run Tests
 Various options:
 - Directly call `runtests.jl` TODO?
 - From Pkg-Manager `]test` when root project is activated
 ---
@ -234,11 +158,11 @@ Various options:
 - `@testset`: Structure tests
   ```julia
-   julia> @testset "trigonometric identities" begin
+   @testset "trigonometric identities" begin
-           θ = 2/3*π
+       θ = 2/3*π
-           @test sin(-θ) ≈ -sin(θ)
+       @test sin(-θ) ≈ -sin(θ)
-           @test cos(-θ) ≈ cos(θ)
+       @test cos(-θ) ≈ cos(θ)
-       end;
+   end;
   ```
 - `@testset for ... end`: Test in loop
@ -251,6 +175,8 @@ Various options:
 - [HiRSE-Summer of Testing Part 2b: "Testing with Julia" by Nils Niggemann](https://www.youtube.com/watch?v=gSMKNbZOpZU)
 - [Official documentation of Test.jl](https://docs.julialang.org/en/v1/stdlib/Test/)
 ---
 # 3. Test.jl Demo
 We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree/main/material/2_tue/testing/MyTestPackage), which looks as follows:
@ -274,7 +200,7 @@ We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree
 - Look at `MyTestPackage.jl` and `find.jl`: We have two functions `find_max` and `find_mean`, which calculate the maximum and mean of all elements of a `::AbstractVector`.
  - Assertions were added to check for `NaN` values
 - Look at `runtests.jl`:
-  - TODO: Why do we need `using MyTestPackage`?
+  - Why do we need `using MyTestPackage`?
  - We include dependencies via `setup.jl`: `Test` and `StableRNG`.
  - Testset "find"
 - Look at `find.jl`
--- a/material/3_wed/regression/Code_Snippets.jl
+++ b/material/3_wed/regression/Code_Snippets.jl
@ -1,52 +1,48 @@
-############################################################################
+using Statistics
-#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
+using Plots
-############################################################################
+using RDatasets
-
+using GLM
-using Statistics
+
-using Plots
+#---
-using RDatasets
+
-using GLM
+trees = dataset("datasets", "trees")
-
+
-##
+scatter(trees.Girth, trees.Volume,
-
+        legend=false, xlabel="Girth", ylabel="Volume")
-trees = dataset("datasets", "trees")
+
-
+#---
-scatter(trees.Girth, trees.Volume,
+
-        legend=false, xlabel="Girth", ylabel="Volume")
+scatter(trees.Girth, trees.Volume,
-
+        legend=false, xlabel="Girth", ylabel="Volume")
-##
+plot!(x -> -37 + 5*x)
-
+
-scatter(trees.Girth, trees.Volume,
+#---
-        legend=false, xlabel="Girth", ylabel="Volume")
+
-plot!(x -> -37 + 5*x)
+linmod1 = lm(@formula(Volume ~ Girth), trees)
-
+
-##
+#---
-
+
-linmod1 = lm(@formula(Volume ~ Girth), trees)
+linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
-
+
-##
+#---
-
+
-linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
+r2(linmod1)
-
+r2(linmod2)
-##
+
-
+linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
-r2(linmod1)
+
-r2(linmod2)
+r2(linmod3)
-
+
-linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
+#---
-
+
-r2(linmod3)
+using CSV
-
+using HTTP
-##
+
-
+http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
-using CSV
+SwissLabor = DataFrame(CSV.File(http_response.body))
-using HTTP
+
-
+SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
-http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
+
-SwissLabor = DataFrame(CSV.File(http_response.body))
+#---
-
+
 SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
 ##
 model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())
--- a/material/3_wed/regression/MultipleRegressionBasics.qmd
+++ b/material/3_wed/regression/MultipleRegressionBasics.qmd
@ -1,292 +1,294 @@
---
+---
-editor: 
+editor: 
-  markdown: 
+  markdown: 
-    wrap: 72
+    wrap: 72
---
+---
-
+
-# Multiple Regression Basics
+# Multiple Regression Basics
-
+
-## Motivation
+## Motivation
-
+
-### Introductory Example: tree dataset from R
+### Introductory Example: tree dataset from R
-
+
-```{julia}
+``` julia
-using Statistics
+using Statistics
-using Plots
+using Plots
-using RDatasets
+using RDatasets
-
+
-trees = dataset("datasets", "trees")
+trees = dataset("datasets", "trees")
-
+
-scatter(trees.Volume, trees.Girth,
+scatter(trees.Volume, trees.Girth,
-        legend=false, xlabel="Girth", ylabel="Volume")
+        legend=false, xlabel="Girth", ylabel="Volume")
-```
+```
-
+
-*Aim:* Find relationship between the *response variable* `volume` and
+*Aim:* Find relationship between the *response variable* `volume` and
-the *explanatory variable/covariate* `girth`? Can we predict the volume
+the *explanatory variable/covariate* `girth`? Can we predict the volume
-of a tree given its girth?
+of a tree given its girth?
-
+
-```{julia}
+``` julia
-scatter(trees.Girth, trees.Volume,
+scatter(trees.Girth, trees.Volume,
-        legend=false, xlabel="Girth", ylabel="Volume")
+        legend=false, xlabel="Girth", ylabel="Volume")
-plot!(x -> -37 + 5*x)
+plot!(x -> -37 + 5*x)
-```
+```
-
+
-First Guess: There is a linear relation!
+First Guess: There is a linear relation!
-
+
-## Simple Linear Regression
+## Simple Linear Regression
-
+
-Main assumption: up to some error term, each measurement of the response
+Main assumption: up to some error term, each measurement of the response
-variable $y_i$ depends linearly on the corresponding value $x_i$ of the
+variable $y_i$ depends linearly on the corresponding value $x_i$ of the
-covariate
+covariate
-
+
-$\leadsto$ **(Simple) Linear Model:**
+$\leadsto$ **(Simple) Linear Model:**
-$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   \qquad  i=1,...,n,$$
+$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   \qquad  i=1,...,n,$$
-where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
+where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
-normally distributed errors with unknown variance $\sigma^2$.
+normally distributed errors with unknown variance $\sigma^2$.
-
+
-*Task:* Find the straight line that fits best, i.e., find the *optimal*
+*Task:* Find the straight line that fits best, i.e., find the *optimal*
-estimators for $\beta_0$ and $\beta_1$.
+estimators for $\beta_0$ and $\beta_1$.
-
+
-*Typical choice*: Least squares estimator (= maximum likelihood
+*Typical choice*: Least squares estimator (= maximum likelihood
-estimator for normal errors)
+estimator for normal errors)
-
+
-$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
+$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
-
+
-where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
+where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
-vector of covariates and $\mathbf{1}$ is a vector of ones.
+vector of covariates and $\mathbf{1}$ is a vector of ones.
-
+
-Written in matrix style:
+Written in matrix style:
-
+
-$$
+$$
- (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
+ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
-$$
+$$
-
+
-Note: There is a closed-form expression for
+Note: There is a closed-form expression for
-$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
+$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
-rather use Julia to solve the problem.
+rather use Julia to solve the problem.
-
+
-\[use Julia code (existing package) to perform linear regression for
+\[use Julia code (existing package) to perform linear regression for
-`volume ~ girth`\]
+`volume ~ girth`\]
-
+
-```{julia}
+``` julia
-lm(@formula(Volume ~ Girth), trees)
+lm(@formula(Volume ~ Girth), trees)
-```
+```
-
+
-*Interpretation of the Julia output:*
+*Interpretation of the Julia output:*
-
+
-   column `estimate` : least square estimates for $\hat \beta_0$ and
+-   column `estimate` : least square estimates for $\hat \beta_0$ and
-    $\hat \beta_1$
+    $\hat \beta_1$
-
+
-   column `Std. Error` : estimated standard deviation
+-   column `Std. Error` : estimated standard deviation
-    $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
+    $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
-
+
-   column `t value` : value of the $t$-statistics
+-   column `t value` : value of the $t$-statistics
-
+
-    $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
+    $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
-
+
-    Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
+    Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
-    follow a $t$-distribution.
+    follow a $t$-distribution.
-
+
-   column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
+-   column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
-    $i=0,1$
+    $i=0,1$
-
+
-**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
+**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
-these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
+these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
-(you can choose them)), simulate responses from a linear model and
+(you can choose them)), simulate responses from a linear model and
-estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
+estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
-choices of the parameters to see the effects on the parameter estimates
+choices of the parameters to see the effects on the parameter estimates
-and the $p$-values.
+and the $p$-values.
-
+
-## Multiple Regression Model
+## Multiple Regression Model
-
+
-*Idea*: Generalize the simple linear regression model to multiple
+*Idea*: Generalize the simple linear regression model to multiple
-covariates, w.g., predict `volume` using `girth` and \`height\`\`.
+covariates, w.g., predict `volume` using `girth` and \`height\`\`.
-
+
-$\leadsto$ **Linear Model:**
+$\leadsto$ **Linear Model:**
-$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i,   \qquad  i=1,...,n,$$where
+$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i,   \qquad  i=1,...,n,$$where
-
+
-   $y_i$: $i$-th measurement of the response,
+-   $y_i$: $i$-th measurement of the response,
-
+
-   $x_{i1}$: $i$ th value of first covariate,
+-   $x_{i1}$: $i$ th value of first covariate,
-
+
-   ...
+-   ...
-
+
-   $x_{ip}$: $i$-th value of $p$-th covariate,
+-   $x_{ip}$: $i$-th value of $p$-th covariate,
-
+
-   $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
+-   $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
-    distributed errors with unknown variance $\sigma^2$.
+    distributed errors with unknown variance $\sigma^2$.
-
+
-*Task:* Find the *optimal* estimators for
+*Task:* Find the *optimal* estimators for
-$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
+$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
-
+
-*Our choice again:* Least squares estimator (= maximum likelihood
+*Our choice again:* Least squares estimator (= maximum likelihood
-estimator for normal errors)
+estimator for normal errors)
-
+
-$$
+$$
- \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
+ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
-$$
+$$
-
+
-where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
+where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
-vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
+vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
-
+
-Written in matrix style:
+Written in matrix style:
-
+
-$$
+$$
-\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
+\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
-$$
+$$
-
+
-Defining the *design matrix*
+Defining the *design matrix*
-
+
-$$ \mathbf{X} = \left( \begin{array}{cccc}
+$$ \mathbf{X} = \left( \begin{array}{cccc}
-                1 & x_{11} & \ldots & x_{1p} \\
+                1 & x_{11} & \ldots & x_{1p} \\
-                \vdots & \vdots & \ddots & \vdots \\
+                \vdots & \vdots & \ddots & \vdots \\
-                1 & x_{11} & \ldots & x_{1p} 
+                1 & x_{11} & \ldots & x_{1p} 
-                \end{array}\right) \qquad 
+                \end{array}\right) \qquad 
-  (\text{size } n \times (p+1)), $$
+  (\text{size } n \times (p+1)), $$
-
+
-we get the short form
+we get the short form
-
+
-$$
+$$
-\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta}  \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
+\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta}  \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
-$$
+$$
-
+
-\[use Julia code (existing package) to perform linear regression for
+\[use Julia code (existing package) to perform linear regression for
-`volume ~ girth + height`\]
+`volume ~ girth + height`\]
-
+
-The interpretation of the Julia output is similar to the simple linear
+The interpretation of the Julia output is similar to the simple linear
-regression model, but we provide explicit formulas now:
+regression model, but we provide explicit formulas now:
-
+
-   parameter estimates:
+-   parameter estimates:
-
+
-    $$
+    $$
-     (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
+     (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
-    $$
+    $$
-
+
-   estimated standard errors:
+-   estimated standard errors:
-
+
-    $$
+    $$
-     \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
+     \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
-    $$
+    $$
-
+
-   $t$-statistics:
+-   $t$-statistics:
-
+
-    $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
+    $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
-
+
-   $p$-values:
+-   $p$-values:
-
+
-    $$
+    $$
-    p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
+    p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
-    $$
+    $$
-
+
-**Task 2**: Implement functions that estimate the $\beta$-parameters,
+**Task 2**: Implement functions that estimate the $\beta$-parameters,
-the corresponding standard errors and the $t$-statistics. Test your
+the corresponding standard errors and the $t$-statistics. Test your
-functions with the \`\`\`tree''' data set and try to reproduce the
+functions with the \`\`\`tree''' data set and try to reproduce the
-output above.
+output above.
-
+
-```{julia}
+``` julia
-r2(linmod1)
+r2(linmod1)
-r2(linmod2)
+r2(linmod2)
-
+
-linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
+linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
-
+
-r2(linmod3)
+r2(linmod3)
-```
+```
-
+
-## Generalized Linear Models
+## Generalized Linear Models
-
+
-Classical linear model
+Classical linear model
-
+
-$$
+$$
- \mathbf{y} = \mathbf{X} \beta + \varepsilon 
+ \mathbf{y} = \mathbf{X} \beta + \varepsilon 
-$$
+$$
-
+
-implies that
+implies that
-$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
+$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
-
+
-In particular, the conditional expectation satisfies
+In particular, the conditional expectation satisfies
-$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
+$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
-
+
-However, the assumption of a normal distribution is unrealistic for
+However, the assumption of a normal distribution is unrealistic for
-non-continuous data. Popular alternatives include:
+non-continuous data. Popular alternatives include:
-
+
-   for counting data: $$
+-   for counting data: $$
-       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
+       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
-      $$
+      $$
-
+
-    Here, the components are considered to be independent and the
+    Here, the components are considered to be independent and the
-    exponential function is applied componentwise.
+    exponential function is applied componentwise.
-
+
-   for binary data: $$
+-   for binary data: $$
-       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
+       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
-      $$
+      $$
-
+
-    Again, the components are considered to be independent and all the
+    Again, the components are considered to be independent and all the
-    operations are applied componentwise.
+    operations are applied componentwise.
-
+
-All these models are defined by the choice of a family of distributions
+All these models are defined by the choice of a family of distributions
-and a function $g$ (the so-called *link function*) such that
+and a function $g$ (the so-called *link function*) such that
-
+
-$$
+$$
- \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
+ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
-$$
+$$
-
+
-For the models above, these are:
+For the models above, these are:
-
+
-+--------------+---------------------+--------------------+
+---------------+-------------------+------------------+
-| Type of Data | Distribution Family | Link Function      |
+| Type of Data  | Distribution      | Link Function    |
-+==============+=====================+====================+
+|               | Family            |                  |
-| continuous   | Normal              | identity:          |
+===============+===================+==================+
-|              |                     |                    |
+| continuous    | Normal            | identity:        |
-|              |                     | $$                 |
+|               |                   |                  |
-|              |                     | g(x)=x             |
+|               |                   | $$               |
-|              |                     | $$                 |
+|               |                   | g(x)=x           |
-+--------------+---------------------+--------------------+
+|               |                   | $$               |
-| count        | Poisson             | log:               |
+---------------+-------------------+------------------+
-|              |                     |                    |
+| count         | Poisson           | log:             |
-|              |                     | $$                 |
+|               |                   |                  |
-|              |                     |  g(x) = \log(x)    |
+|               |                   | $$               |
-|              |                     | $$                 |
+|               |                   |  g(x) = \log(x)  |
-+--------------+---------------------+--------------------+
+|               |                   | $$               |
-| binary       | Bernoulli           | logit:             |
+---------------+-------------------+------------------+
-|              |                     |                    |
+| binary        | Bernoulli         | logit:           |
-|              |                     | $$                 |
+|               |                   |                  |
-|              |                     | g(x) = \log\left   |
+|               |                   | $$               |
-|              |                     | (                  |
+|               |                   | g(x) = \log\left |
-|              |                     | \                  |
+|               |                   | (                |
-|              |                     | f                  |
+|               |                   | \                |
-|              |                     | rac{x}{1-x}\right) |
+|               |                   | f                |
-|              |                     | $$                 |
+|               |                   | ra               |
-+--------------+---------------------+--------------------+
+|               |                   | c{x}{1-x}\right) |
-
+|               |                   | $$               |
-In general, the parameter vector $\beta$ is estimated via maximizing the
+---------------+-------------------+------------------+
-likelihood, i.e.,
+
-
+In general, the parameter vector $\beta$ is estimated via maximizing the
-$$
+likelihood, i.e.,
-\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
+
-$$
+$$
-
+\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
-which is equivalent to the maximization of the log-likelihood, i.e.,
+$$
-
+
-$$
+which is equivalent to the maximization of the log-likelihood, i.e.,
-\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
+
-$$
+$$
-
+\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
-In the Gaussian case, the maximum likelihood estimator is identical to
+$$
-the least squares estimator considered above.
+
-
+In the Gaussian case, the maximum likelihood estimator is identical to
-```{julia}
+the least squares estimator considered above.
-using CSV
+
-using HTTP
+``` julia
-
+using CSV
-http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
+using HTTP
-SwissLabor = DataFrame(CSV.File(http_response.body))
+
-
+http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
-SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
+SwissLabor = DataFrame(CSV.File(http_response.body))
-
+
-model = glm(@formula(participation ~ age^2), 
+SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
-            SwissLabor, Binomial(), ProbitLink())
+
-```
+model = glm(@formula(participation ~ age^2), 
-
+            SwissLabor, Binomial(), ProbitLink())
-**Task 3:** Reproduce the results of our data analysis of the `tree`
+```
-data set using a generalized linear model with normal distribution
+
-family.
+**Task 3:** Reproduce the results of our data analysis of the `tree`
 data set using a generalized linear model with normal distribution
 family.