diff --git a/_quarto.yml b/_quarto.yml
index 117e3b4..2542c00 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -57,9 +57,9 @@ website:
text: "📝 1 - Advanced Git and Contributing"
- href: "material/2_tue/git/tasks.qmd"
text: "🛠 1 - Git: Exercises"
- - href: "material/2_tue/testing/slides.qmd"
+ - href: "material/2_tue/testing/slides.md"
text: "📝 2 - Testing"
- - href: "material/2_tue/CI/missing.qmd"
+ - href: "material/2_tue/ci/slides.md"
text: "📝 3 - Continuous Integration"
- href: material/2_tue/codereview/slides.qmd
text: "📝 4 - Code Review"
diff --git a/cheatsheets/git.qmd b/cheatsheets/git.qmd
index e69de29..becee2f 100644
--- a/cheatsheets/git.qmd
+++ b/cheatsheets/git.qmd
@@ -0,0 +1 @@
+There are many good ones out there. One we can recommend is the [one from GitHub](https://education.github.com/git-cheat-sheet-education.pdf).
diff --git a/cheatsheets/githubactions.qmd b/cheatsheets/githubactions.qmd
index e69de29..ac8eaee 100644
--- a/cheatsheets/githubactions.qmd
+++ b/cheatsheets/githubactions.qmd
@@ -0,0 +1 @@
+Also [one from GitHub](https://github.github.io/actions-cheat-sheet/actions-cheat-sheet.pdf)
diff --git a/material/2_tue/ci/slides.md b/material/2_tue/ci/slides.md
new file mode 100644
index 0000000..b74324a
--- /dev/null
+++ b/material/2_tue/ci/slides.md
@@ -0,0 +1,395 @@
+---
+type: slide
+slideOptions:
+ transition: slide
+ width: 1400
+ height: 900
+ margin: 0.1
+---
+
+
+
+# Learning Goals
+
+- Name and explain common workflows to automate in RSE.
+- Explain the differences between the various continuous methodologies.
+- Explain why automation is crucial in RSE.
+- Write and understand basic automation scripts for GitHub Actions.
+ - s.t. we understand what `PkgTemplates` generates for us.
+
+
+Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material).
+
+---
+
+# 1. Workflow Automation
+
+---
+
+## Why Automation?
+
+- Automatize tasks
+ - Run tests frequently, give feedback early etc.
+ - Ensure reproducible test environments
+ - Cannot forget automatized tasks
+ - Less burden to developer (and their workstation)
+ - Avoid manual errors
+- Process often integrated in development workflow
+ - Example: Support by Git hooks or Git forges
+
+---
+
+## Typical Automation Tasks in RSE
+
+- Check code formatting and quality
+- Compile and test code for different platforms
+- Generate coverage reports and visualization
+- Build documentation and deploy it
+- Build, package, and upload releases
+
+---
+
+## Continuous Methodologies (1/2)
+
+- **Continuous Integration** (CI)
+ - Continuously integrate changes into "main" branch
+ - Avoids "merge hell"
+ - Relies on testing and checking code continuously
+ - Should be automatized
+
+---
+
+## Continuous Methodologies (2/2)
+
+- **Continuous Delivery** (CD)
+ - Software is in a state that allows new release at any time
+ - Software package is built
+ - Actual release triggered manually
+- **Continuous Deployment** (CD)
+ - Software is in a state that allows new release at any time
+ - Software package is built
+ - Actual release triggered automatically (continuously)
+
+---
+
+## Automation Services/Software
+
+- [GitHub Actions](https://github.com/features/actions)
+- [GitLab CI/CD](https://docs.gitlab.com/ee/ci/)
+- [Circle CI](https://circleci.com/)
+- [Travis CI](https://www.travis-ci.com/)
+- [Jenkins](https://www.jenkins.io/)
+- ...
+
+---
+
+# 2. GitHub Actions
+
+---
+
+## What is "GitHub Actions"?
+
+> Automate, customize, and execute your software development workflows right in your repository with GitHub Actions.
+
+From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions)
+
+---
+
+## General Information
+
+- Usage of GitHub's runners is [limited](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits)
+- Available for public repositories or accounts with subscription
+- By default Actions run on GitHub's runners
+ - Linux, Windows, or MacOS
+- Quickly evolving and significant improvements in recent years
+
+---
+
+## Components (1/2)
+
+- [Workflow](https://docs.github.com/en/actions/using-workflows): Runs one or more jobs
+- [Event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows): Triggers a workflow
+- [Jobs](https://docs.github.com/en/actions/using-jobs): Set of steps (running on same runner)
+ - Steps executed consecutively and share data
+ - Jobs by default executed in parallel
+- [Action](https://docs.github.com/en/actions/creating-actions): Application performing common, complex task (step) often used in workflows
+- [Runner](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners): Server that runs jobs
+- [Artifacts](https://docs.github.com/en/actions/learn-github-actions/essential-features-of-github-actions#sharing-data-between-jobs): Files to be shared between jobs or to be kept after workflow finishes
+
+---
+
+## Components (2/2)
+
+
+
+
+From [GitHub Actions tutorial](https://docs.github.com/en/actions)
+
+---
+
+## Setting up a Workflow
+
+- Workflow file files stored `${REPO_ROOT}/.github/workflows`
+- Configured via YAML file
+
+```yaml
+name: learn-github-actions
+on: [push]
+jobs:
+ check-bats-version:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v2
+ - uses: actions/setup-node@v2
+ with:
+ node-version: '14'
+ - run: npm install -g bats
+ - run: bats -v
+```
+
+---
+
+## Actions
+
+```yaml
+- uses: actions/checkout@v3
+- uses: actions/setup-node@v2
+ with:
+ node-version: '14'
+```
+
+- Integrated via `uses` directive
+- Additional configuration via `with` (options depend on Action)
+- Find actions in [marketplace](https://github.com/marketplace?type=actions)
+- Write [own actions](https://docs.github.com/en/actions/creating-actions)
+
+---
+
+## Some Useful Julia Actions
+
+- Find on [gitHub.com/julia-actions](https://github.com/julia-actions/)
+
+ ```
+ - uses: julia-actions/setup-julia@v1
+ with:
+ version: '1.9'
+ ```
+
+- More:
+ - `cache`: caches `~/.julia/artifacts/*` and `~/.julia/packages/*` to reduce runtime of CI
+ - `julia-buildpkg`: build package
+ - `julia-runtest`: run tests
+ - `julia-format`: format code
+
+---
+
+## User-specified Commands
+
+```yaml
+- name: "Single line command"
+ run: echo "Single line command"
+- name: "Multi line command"
+ run: |
+ echo "First line"
+ echo "Second line. Directory ${PWD}"
+ workdir: tmp/
+ shell: bash
+```
+
+---
+
+## Events
+
+- Single or multiple events
+
+ ```yaml
+ on: [push, fork]
+ ```
+
+- Activities
+
+ ```yaml
+ on:
+ issue:
+ types:
+ - opened
+ - labeled
+ ```
+
+- Filters
+
+ ```yaml
+ on:
+ push:
+ branches:
+ - main
+ - 'releases/**'
+ ```
+
+---
+
+## Artifacts
+
+- Data sharing between jobs and data upload
+- Uploading artifact
+
+ ```yaml
+ - name: "Upload artifact"
+ uses: actions/upload-artifact@v2
+ with:
+ name: my-artifact
+ path: my_file.txt
+ retention-days: 5
+ ```
+
+- Downloading artifact
+
+ ```yaml
+ - name: "Download a single artifact"
+ uses: actions/download-artifact@v2
+ with:
+ name: my-artifact
+ ```
+
+ **Note**: Drop name to download all artifacts
+
+---
+
+## Test Actions Locally
+
+- [act](https://github.com/nektos/act)
+- Relies extensively on Docker
+ - User should be in `docker` group
+- Run `act` from root of the repository
+
+ ```text
+ act (runs all workflows)
+ act --job WORKFLOWNAME
+ ```
+
+- Environment is not 100% identical to GitHub's
+ - Workflows may fail locally, but work on GitHub
+
+---
+
+## Further Reading
+
+- [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration)
+- [GitHub Actions documentation](https://docs.github.com/en/actions)
+- [GitHub Actions quickstart](https://docs.github.com/en/actions/quickstart)
+
+---
+
+# 3. Demo: Automation with GitHub Actions
+
+---
+
+## Setting up a Test Job
+
+- Import [Julia test package repository](https://github.com/uekerman/JuliaTestPackage) (the same code we used for testing)
+- Set up workflow file
+
+ ```bash
+ mkdir -p .github/workflows
+ cd .github/workflows
+ vi format-check.yml
+ ```
+
+- Let's check whether our code is formatted correctly. Edit `format-check.yml` to have following content
+
+ ```yaml
+ name: format-check
+
+ on: [push, pull_request]
+
+ jobs:
+ format:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v3
+ - uses: julia-actions/setup-julia@v1
+ with:
+ version: '1.9'
+ - name: Install JuliaFormatter and format
+ run: |
+ julia -e 'using Pkg; Pkg.add(PackageSpec(name="JuliaFormatter"))'
+ julia -e 'using JuliaFormatter; format(".", verbose=true)'
+ - name: Format check
+ run: |
+ julia -e '
+ out = Cmd(`git diff --name-only`) |> read |> String
+ if out == ""
+ exit(0)
+ else
+ @error "Some files have not been formatted"
+ write(stdout, out)
+ exit(1)
+ end'
+ ```
+
+- `runs-on` does **not** refer to a Docker container, but to a runner tag.
+- Add, commit, push
+- After the push, inspect "Action" panel on GitHub repository
+ - GitHub will schedule a run (yellow dot)
+ - Hooray. We have set up our first action.
+- Failing test example:
+ - Edit settings on GitHub that one can only merge if all tests pass:
+ - Settings -> Branches -> Branch protection rule
+ - Choose `main` branch
+ - Enable "Require status checks to pass before merging". Optionally enable "Require branches to be up to date before merging"
+ - Choose status checks that need to pass: `test`
+ - Click on "Create" at bottom of page.
+ - Create a new branch `break-code`.
+ - Edit some file, violate the formatting, commit it and push it to the branch. Afterwards open a new PR and inspect the failing test. We are also not able to merge the changes as the "Merge" button should be inactive.
+
+---
+
+## act Demo
+
+- `act` is for quick checks while developing workflows, not for developing the code
+- Check available jobs (at root of repository)
+
+ ```bash
+ act -l
+ ```
+
+- Run jobs for `push` event (default event)
+
+ ```bash
+ act
+ ```
+
+- Run a specific job
+
+ ```bash
+ act -j test
+ ```
+
+---
+
+# 4. Exercise
+
+Set up GitHub Actions for your statistics package. They should format your code and run the tests. To structure and parallelize things, you could use two separate jobs.
diff --git a/material/2_tue/git/slides.md b/material/2_tue/git/slides.md
index f1540fc..a1b4f22 100644
--- a/material/2_tue/git/slides.md
+++ b/material/2_tue/git/slides.md
@@ -32,7 +32,7 @@ slideOptions:
}
-## Learning Goals of the Git Lecture
+# Learning Goals
- Refresh and organize students' existing knowledge on Git (learn how to learn more).
- Students can explain difference between merge and rebase and when to use what.
diff --git a/material/2_tue/git/tasks.qmd b/material/2_tue/git/tasks.qmd
index 94b9a45..0ecbf26 100644
--- a/material/2_tue/git/tasks.qmd
+++ b/material/2_tue/git/tasks.qmd
@@ -3,6 +3,7 @@
1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again).
2. Push your package `MyStatsPackage` to a remote repository.
3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded).
-4. Use the package from somebody else in the classroom and verify with `printOwner` that you use the correct package.
-5. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
-6. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.
+4. Start a new Julia environment and use your package through its url: `]add https://github.com/[username]/MyStatsPackage`.
+5. Now use the package from somebody else in the classroom instead and verify with `printOwner` that you use the correct package.
+6. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
+7. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.
diff --git a/material/2_tue/testing/slides.qmd b/material/2_tue/testing/slides.md
similarity index 94%
rename from material/2_tue/testing/slides.qmd
rename to material/2_tue/testing/slides.md
index 96a877d..539efeb 100644
--- a/material/2_tue/testing/slides.qmd
+++ b/material/2_tue/testing/slides.md
@@ -1,9 +1,37 @@
-
---
-format: revealjs
-
+type: slide
+slideOptions:
+ transition: slide
+ width: 1400
+ height: 900
+ margin: 0.1
---
+
+
# Learning Goals
- Justify the effort of developing tests to some extent
diff --git a/material/3_wed/regression/Code_Snippets.jl b/material/3_wed/regression/Code_Snippets.jl
new file mode 100644
index 0000000..05bb3e8
--- /dev/null
+++ b/material/3_wed/regression/Code_Snippets.jl
@@ -0,0 +1,52 @@
+############################################################################
+#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
+############################################################################
+
+using Statistics
+using Plots
+using RDatasets
+using GLM
+
+##
+
+trees = dataset("datasets", "trees")
+
+scatter(trees.Girth, trees.Volume,
+ legend=false, xlabel="Girth", ylabel="Volume")
+
+##
+
+scatter(trees.Girth, trees.Volume,
+ legend=false, xlabel="Girth", ylabel="Volume")
+plot!(x -> -37 + 5*x)
+
+##
+
+linmod1 = lm(@formula(Volume ~ Girth), trees)
+
+##
+
+linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
+
+##
+
+r2(linmod1)
+r2(linmod2)
+
+linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
+
+r2(linmod3)
+
+##
+
+using CSV
+using HTTP
+
+http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
+SwissLabor = DataFrame(CSV.File(http_response.body))
+
+SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
+
+##
+
+model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())
\ No newline at end of file
diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd
index 8af7871..1f23123 100644
--- a/material/3_wed/regression/MultipleRegressionBasics.qmd
+++ b/material/3_wed/regression/MultipleRegressionBasics.qmd
@@ -1,253 +1,292 @@
----
-editor:
- markdown:
- wrap: 72
----
-
-# Multiple Regression Basics
-
-## Motivation
-
-### Introductory Example: tree dataset from R
-
-\[figure of raw data\]
-
-*Aim:* Find relationship between the *response variable* `volume` and
-the *explanatory variable/covariate* `girth`? Can we predict the volume
-of a tree given its girth?
-
-\[figure including a straight line\]
-
-First Guess: There is a linear relation!
-
-## Simple Linear Regression
-
-Main assumption: up to some error term, each measurement of the response
-variable $y_i$ depends linearly on the corresponding value $x_i$ of the
-covariate
-
-$\leadsto$ **(Simple) Linear Model:**
-$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
-where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
-normally distributed errors with unknown variance $\sigma^2$.
-
-*Task:* Find the straight line that fits best, i.e., find the *optimal*
-estimators for $\beta_0$ and $\beta_1$.
-
-*Typical choice*: Least squares estimator (= maximum likelihood
-estimator for normal errors)
-
-$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
-
-where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
-vector of covariates and $\mathbf{1}$ is a vector of ones.
-
-Written in matrix style:
-
-$$
- (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
-$$
-
-Note: There is a closed-form expression for
-$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
-rather use Julia to solve the problem.
-
-\[use Julia code (existing package) to perform linear regression for
-`volume ~ girth`\]
-
-*Interpretation of the Julia output:*
-
-- column `estimate` : least square estimates for $\hat \beta_0$ and
- $\hat \beta_1$
-
-- column `Std. Error` : estimated standard deviation
- $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
-
-- column `t value` : value of the $t$-statistics
-
- $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
-
- Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
- follow a $t$-distribution.
-
-- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
- $i=0,1$
-
-**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
-these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
-(you can choose them)), simulate responses from a linear model and
-estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
-choices of the parameters to see the effects on the parameter estimates
-and the $p$-values.
-
-## Multiple Regression Model
-
-*Idea*: Generalize the simple linear regression model to multiple
-covariates, w.g., predict `volume` using `girth` and \`height\`\`.
-
-$\leadsto$ **Linear Model:**
-$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
-
-- $y_i$: $i$-th measurement of the response,
-
-- $x_{i1}$: $i$ th value of first covariate,
-
-- ...
-
-- $x_{ip}$: $i$-th value of $p$-th covariate,
-
-- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
- distributed errors with unknown variance $\sigma^2$.
-
-*Task:* Find the *optimal* estimators for
-$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
-
-*Our choice again:* Least squares estimator (= maximum likelihood
-estimator for normal errors)
-
-$$
- \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
-$$
-
-where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
-vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
-
-Written in matrix style:
-
-$$
-\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
-$$
-
-Defining the *design matrix*
-
-$$ \mathbf{X} = \left( \begin{array}{cccc}
- 1 & x_{11} & \ldots & x_{1p} \\
- \vdots & \vdots & \ddots & \vdots \\
- 1 & x_{11} & \ldots & x_{1p}
- \end{array}\right) \qquad
- (\text{size } n \times (p+1)), $$
-
-we get the short form
-
-$$
-\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
-$$
-
-\[use Julia code (existing package) to perform linear regression for
-`volume ~ girth + height`\]
-
-The interpretation of the Julia output is similar to the simple linear
-regression model, but we provide explicit formulas now:
-
-- parameter estimates:
-
- $$
- (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
- $$
-
-- estimated standard errors:
-
- $$
- \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
- $$
-
-- $t$-statistics:
-
- $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
-
-- $p$-values:
-
- $$
- p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
- $$
-
-**Task 2**: Implement functions that estimate the $\beta$-parameters,
-the corresponding standard errors and the $t$-statistics. Test your
-functions with the \`\`\`tree''' data set and try to reproduce the
-output above.
-
-## Generalized Linear Models
-
-Classical linear model
-
-$$
- \mathbf{y} = \mathbf{X} \beta + \varepsilon
-$$
-
-implies that
-$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
-
-In particular, the conditional expectation satisfies
-$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
-
-However, the assumption of a normal distribution is unrealistic for
-non-continuous data. Popular alternatives include:
-
-- for counting data: $$
- \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
- $$
-
- Here, the components are considered to be independent and the
- exponential function is applied componentwise.
-
-- for binary data: $$
- \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
- $$
-
- Again, the components are considered to be independent and all the
- operations are applied componentwise.
-
-All these models are defined by the choice of a family of distributions
-and a function $g$ (the so-called *link function*) such that
-
-$$
- \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
-$$
-
-For the models above, these are:
-
-+----------------------+---------------------+----------------------+
-| Type of Data | Distribution Family | Link Function |
-+======================+=====================+======================+
-| continuous | Normal | identity: |
-| | | |
-| | | $$ |
-| | | g(x)=x |
-| | | $$ |
-+----------------------+---------------------+----------------------+
-| count | Poisson | log: |
-| | | |
-| | | $$ |
-| | | g(x) = \log(x) |
-| | | $$ |
-+----------------------+---------------------+----------------------+
-| binary | Bernoulli | logit: |
-| | | |
-| | | $$ |
-| | | g(x) = \log\left |
-| | | ( |
-| | | \frac{x}{1-x}\right) |
-| | | $$ |
-+----------------------+---------------------+----------------------+
-
-In general, the parameter vector $\beta$ is estimated via maximizing the
-likelihood, i.e.,
-
-$$
-\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
-$$
-
-which is equivalent to the maximization of the log-likelihood, i.e.,
-
-$$
-\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
-$$
-
-In the Gaussian case, the maximum likelihood estimator is identical to
-the least squares estimator considered above.
-
-\[\[ Example in Julia: maybe `SwissLabor` \]\]
-
-**Task 3:** Reproduce the results of our data analysis of the `tree`
-data set using a generalized linear model with normal distribution
-family.
+---
+editor:
+ markdown:
+ wrap: 72
+---
+
+# Multiple Regression Basics
+
+## Motivation
+
+### Introductory Example: tree dataset from R
+
+```{julia}
+using Statistics
+using Plots
+using RDatasets
+
+trees = dataset("datasets", "trees")
+
+scatter(trees.Volume, trees.Girth,
+ legend=false, xlabel="Girth", ylabel="Volume")
+```
+
+*Aim:* Find relationship between the *response variable* `volume` and
+the *explanatory variable/covariate* `girth`? Can we predict the volume
+of a tree given its girth?
+
+```{julia}
+scatter(trees.Girth, trees.Volume,
+ legend=false, xlabel="Girth", ylabel="Volume")
+plot!(x -> -37 + 5*x)
+```
+
+First Guess: There is a linear relation!
+
+## Simple Linear Regression
+
+Main assumption: up to some error term, each measurement of the response
+variable $y_i$ depends linearly on the corresponding value $x_i$ of the
+covariate
+
+$\leadsto$ **(Simple) Linear Model:**
+$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
+where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
+normally distributed errors with unknown variance $\sigma^2$.
+
+*Task:* Find the straight line that fits best, i.e., find the *optimal*
+estimators for $\beta_0$ and $\beta_1$.
+
+*Typical choice*: Least squares estimator (= maximum likelihood
+estimator for normal errors)
+
+$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
+
+where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
+vector of covariates and $\mathbf{1}$ is a vector of ones.
+
+Written in matrix style:
+
+$$
+ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
+$$
+
+Note: There is a closed-form expression for
+$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
+rather use Julia to solve the problem.
+
+\[use Julia code (existing package) to perform linear regression for
+`volume ~ girth`\]
+
+```{julia}
+lm(@formula(Volume ~ Girth), trees)
+```
+
+*Interpretation of the Julia output:*
+
+- column `estimate` : least square estimates for $\hat \beta_0$ and
+ $\hat \beta_1$
+
+- column `Std. Error` : estimated standard deviation
+ $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
+
+- column `t value` : value of the $t$-statistics
+
+ $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
+
+ Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
+ follow a $t$-distribution.
+
+- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
+ $i=0,1$
+
+**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
+these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
+(you can choose them)), simulate responses from a linear model and
+estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
+choices of the parameters to see the effects on the parameter estimates
+and the $p$-values.
+
+## Multiple Regression Model
+
+*Idea*: Generalize the simple linear regression model to multiple
+covariates, w.g., predict `volume` using `girth` and \`height\`\`.
+
+$\leadsto$ **Linear Model:**
+$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
+
+- $y_i$: $i$-th measurement of the response,
+
+- $x_{i1}$: $i$ th value of first covariate,
+
+- ...
+
+- $x_{ip}$: $i$-th value of $p$-th covariate,
+
+- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
+ distributed errors with unknown variance $\sigma^2$.
+
+*Task:* Find the *optimal* estimators for
+$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
+
+*Our choice again:* Least squares estimator (= maximum likelihood
+estimator for normal errors)
+
+$$
+ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
+$$
+
+where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
+vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
+
+Written in matrix style:
+
+$$
+\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
+$$
+
+Defining the *design matrix*
+
+$$ \mathbf{X} = \left( \begin{array}{cccc}
+ 1 & x_{11} & \ldots & x_{1p} \\
+ \vdots & \vdots & \ddots & \vdots \\
+ 1 & x_{11} & \ldots & x_{1p}
+ \end{array}\right) \qquad
+ (\text{size } n \times (p+1)), $$
+
+we get the short form
+
+$$
+\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
+$$
+
+\[use Julia code (existing package) to perform linear regression for
+`volume ~ girth + height`\]
+
+The interpretation of the Julia output is similar to the simple linear
+regression model, but we provide explicit formulas now:
+
+- parameter estimates:
+
+ $$
+ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
+ $$
+
+- estimated standard errors:
+
+ $$
+ \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
+ $$
+
+- $t$-statistics:
+
+ $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
+
+- $p$-values:
+
+ $$
+ p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
+ $$
+
+**Task 2**: Implement functions that estimate the $\beta$-parameters,
+the corresponding standard errors and the $t$-statistics. Test your
+functions with the \`\`\`tree''' data set and try to reproduce the
+output above.
+
+```{julia}
+r2(linmod1)
+r2(linmod2)
+
+linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
+
+r2(linmod3)
+```
+
+## Generalized Linear Models
+
+Classical linear model
+
+$$
+ \mathbf{y} = \mathbf{X} \beta + \varepsilon
+$$
+
+implies that
+$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
+
+In particular, the conditional expectation satisfies
+$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
+
+However, the assumption of a normal distribution is unrealistic for
+non-continuous data. Popular alternatives include:
+
+- for counting data: $$
+ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
+ $$
+
+ Here, the components are considered to be independent and the
+ exponential function is applied componentwise.
+
+- for binary data: $$
+ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
+ $$
+
+ Again, the components are considered to be independent and all the
+ operations are applied componentwise.
+
+All these models are defined by the choice of a family of distributions
+and a function $g$ (the so-called *link function*) such that
+
+$$
+ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
+$$
+
+For the models above, these are:
+
++--------------+---------------------+--------------------+
+| Type of Data | Distribution Family | Link Function |
++==============+=====================+====================+
+| continuous | Normal | identity: |
+| | | |
+| | | $$ |
+| | | g(x)=x |
+| | | $$ |
++--------------+---------------------+--------------------+
+| count | Poisson | log: |
+| | | |
+| | | $$ |
+| | | g(x) = \log(x) |
+| | | $$ |
++--------------+---------------------+--------------------+
+| binary | Bernoulli | logit: |
+| | | |
+| | | $$ |
+| | | g(x) = \log\left |
+| | | ( |
+| | | \ |
+| | | f |
+| | | rac{x}{1-x}\right) |
+| | | $$ |
++--------------+---------------------+--------------------+
+
+In general, the parameter vector $\beta$ is estimated via maximizing the
+likelihood, i.e.,
+
+$$
+\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
+$$
+
+which is equivalent to the maximization of the log-likelihood, i.e.,
+
+$$
+\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
+$$
+
+In the Gaussian case, the maximum likelihood estimator is identical to
+the least squares estimator considered above.
+
+```{julia}
+using CSV
+using HTTP
+
+http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
+SwissLabor = DataFrame(CSV.File(http_response.body))
+
+SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
+
+model = glm(@formula(participation ~ age^2),
+ SwissLabor, Binomial(), ProbitLink())
+```
+
+**Task 3:** Reproduce the results of our data analysis of the `tree`
+data set using a generalized linear model with normal distribution
+family.