Merge branch 'main' of https://github.com/s-ccs/summerschool_simtech_2023

2023-10-09 09:51:53 +00:00
parent d888a883b4 314ac6e1af
commit ec20c173aa
9 changed files with 779 additions and 262 deletions
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -57,9 +57,9 @@ website:
                text: "📝 1 - Advanced Git and Contributing"
              - href: "material/2_tue/git/tasks.qmd"
                text: "🛠 1 - Git: Exercises"
-              - href: "material/2_tue/testing/slides.qmd"
+              - href: "material/2_tue/testing/slides.md"
                text: "📝 2 - Testing"
-              - href: "material/2_tue/CI/missing.qmd"
+              - href: "material/2_tue/ci/slides.md"
                text: "📝 3 - Continuous Integration"
              - href: material/2_tue/codereview/slides.qmd
                text: "📝 4 - Code Review"
--- a/cheatsheets/git.qmd
+++ b/cheatsheets/git.qmd
@@ -0,0 +1 @@
 There are many good ones out there. One we can recommend is the [one from GitHub](https://education.github.com/git-cheat-sheet-education.pdf).
--- a/cheatsheets/githubactions.qmd
+++ b/cheatsheets/githubactions.qmd
@@ -0,0 +1 @@
 Also [one from GitHub](https://github.github.io/actions-cheat-sheet/actions-cheat-sheet.pdf)
--- a/material/2_tue/ci/slides.md
+++ b/material/2_tue/ci/slides.md
@@ -0,0 +1,395 @@
 ---
 type: slide
 slideOptions:
  transition: slide
  width: 1400
  height: 900
  margin: 0.1
 ---
 <style>
  .reveal strong {
    font-weight: bold;
    color: orange;
  }
  .reveal p {
    text-align: left;
  }
  .reveal section h1 {
    color: orange;
  }
  .reveal section h2 {
    color: orange;
  }
  .reveal code {
    font-family: 'Ubuntu Mono';
    color: orange;
  }
  .reveal section img {
    background:none;
    border:none;
    box-shadow:none;
  }
 </style>
 # Learning Goals
 - Name and explain common workflows to automate in RSE.
 - Explain the differences between the various continuous methodologies.
 - Explain why automation is crucial in RSE.
 - Write and understand basic automation scripts for GitHub Actions.
  - so that we understand what `PkgTemplates` generates for us.
 Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material).
 ---
 # 1. Workflow Automation
 ---
 ## Why Automation?
 - Automate tasks
    - Run tests frequently, give feedback early etc.
    - Ensure reproducible test environments
    - Automated tasks cannot be forgotten
    - Less burden to developer (and their workstation)
    - Avoid manual errors
 - Process often integrated in development workflow
    - Example: Support by Git hooks or Git forges
 ---
 ## Typical Automation Tasks in RSE
 - Check code formatting and quality
 - Compile and test code for different platforms
 - Generate coverage reports and visualization
 - Build documentation and deploy it
 - Build, package, and upload releases
 ---
 ## Continuous Methodologies (1/2)
 - **Continuous Integration** (CI)
    - Continuously integrate changes into "main" branch
    - Avoids "merge hell"
    - Relies on testing and checking code continuously
        - Should be automated
 ---
 ## Continuous Methodologies (2/2)
 - **Continuous Delivery** (CD)
    - Software is in a state that allows new release at any time
    - Software package is built
    - Actual release triggered manually
 - **Continuous Deployment** (CD)
    - Software is in a state that allows new release at any time
    - Software package is built
    - Actual release triggered automatically (continuously)
 ---
 ## Automation Services/Software
 - [GitHub Actions](https://github.com/features/actions)
 - [GitLab CI/CD](https://docs.gitlab.com/ee/ci/)
 - [Circle CI](https://circleci.com/)
 - [Travis CI](https://www.travis-ci.com/)
 - [Jenkins](https://www.jenkins.io/)
 - ...
 ---
 # 2. GitHub Actions
 ---
 ## What is "GitHub Actions"?
 > Automate, customize, and execute your software development workflows right in your repository with GitHub Actions.
 From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions)
 ---
 ## General Information
 - Usage of GitHub's runners is [limited](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits)
 - Available for public repositories or accounts with subscription
 - By default Actions run on GitHub's runners
    - Linux, Windows, or macOS
 - Quickly evolving and significant improvements in recent years
 ---
 ## Components (1/2)
 - [Workflow](https://docs.github.com/en/actions/using-workflows): Runs one or more jobs
 - [Event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows): Triggers a workflow
 - [Jobs](https://docs.github.com/en/actions/using-jobs): Set of steps (running on same runner)
    - Steps executed consecutively and share data
    - Jobs by default executed in parallel
 - [Action](https://docs.github.com/en/actions/creating-actions): Application performing common, complex task (step) often used in workflows
 - [Runner](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners): Server that runs jobs
 - [Artifacts](https://docs.github.com/en/actions/learn-github-actions/essential-features-of-github-actions#sharing-data-between-jobs): Files to be shared between jobs or to be kept after workflow finishes
 ---
 ## Components (2/2)
 <img src="https://docs.github.com/assets/cb-25535/mw-1440/images/help/actions/overview-actions-simple.webp" width=95%; style="margin-left:auto; margin-right:auto; padding-top: 25px; padding-bottom: 25px; background: #eeeeee">
 From [GitHub Actions tutorial](https://docs.github.com/en/actions)
 ---
 ## Setting up a Workflow
 - Workflow files are stored in `${REPO_ROOT}/.github/workflows`
 - Configured via YAML file
 ```yaml
 name: learn-github-actions
 on: [push]
 jobs:
  check-bats-version:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: '14'
      - run: npm install -g bats
      - run: bats -v
 ```
 ---
 ## Actions
 ```yaml
 - uses: actions/checkout@v3
 - uses: actions/setup-node@v2
  with:
    node-version: '14'
 ```
 - Integrated via `uses` directive
 - Additional configuration via `with` (options depend on Action)
 - Find actions in [marketplace](https://github.com/marketplace?type=actions)
 - Write [own actions](https://docs.github.com/en/actions/creating-actions)
 ---
 ## Some Useful Julia Actions
 - Find on [github.com/julia-actions](https://github.com/julia-actions/)
   ```yaml
   - uses: julia-actions/setup-julia@v1
     with:
       version: '1.9'
   ```
 - More:
  - `cache`: caches `~/.julia/artifacts/*` and `~/.julia/packages/*` to reduce runtime of CI
  - `julia-buildpkg`: build package
  - `julia-runtest`: run tests
  - `julia-format`: format code
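 As a rough orientation, the build and test actions correspond approximately to the following `Pkg` calls. This is a sketch under the assumption that the package lives in the repository root; the actions themselves may set additional options (for example coverage).
 ```julia
 using Pkg
 Pkg.activate(".")    # use the package's own environment (Project.toml in the repository root)
 Pkg.instantiate()    # install the dependencies recorded for this environment
 Pkg.build()          # roughly what `julia-buildpkg` does
 Pkg.test()           # roughly what `julia-runtest` does; runs test/runtests.jl
 ```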
 ---
 ## User-specified Commands
 ```yaml
 - name: "Single line command"
  run: echo "Single line command"
 - name: "Multi line command"
  run: |
    echo "First line"
    echo "Second line. Directory ${PWD}"
  working-directory: tmp/
  shell: bash
 ```
 ---
 ## Events
 - Single or multiple events
  ```yaml
  on: [push, fork]
  ```
 - Activities
  ```yaml
  on:
    issues:
      types:
        - opened
        - labeled
  ```
 - Filters
  ```yaml
  on:
    push:
      branches:
        - main
        - 'releases/**'
  ```
 ---
 ## Artifacts
 - Data sharing between jobs and data upload
 - Uploading artifact
  ```yaml
  - name: "Upload artifact"
    uses: actions/upload-artifact@v2
    with:
      name: my-artifact
      path: my_file.txt
      retention-days: 5
  ```
 - Downloading artifact
  ```yaml
  - name: "Download a single artifact"
    uses: actions/download-artifact@v2
    with:
      name: my-artifact
  ```
  **Note**: Omit `name` to download all artifacts
 ---
 ## Test Actions Locally
 - [act](https://github.com/nektos/act)
 - Relies extensively on Docker
    - User should be in `docker` group
 - Run `act` from root of the repository
  ```text
  act (runs all workflows)
  act --job JOBNAME
  ```
 - Environment is not 100% identical to GitHub's
    - Workflows may fail locally, but work on GitHub
 ---
 ## Further Reading
 - [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration)
 - [GitHub Actions documentation](https://docs.github.com/en/actions)
 - [GitHub Actions quickstart](https://docs.github.com/en/actions/quickstart)
 ---
 # 3. Demo: Automation with GitHub Actions
 ---
 ## Setting up a Test Job
 - Import [Julia test package repository](https://github.com/uekerman/JuliaTestPackage) (the same code we used for testing)
 - Set up workflow file
  ```bash
  mkdir -p .github/workflows
  cd .github/workflows
  vi format-check.yml
  ```
 - Let's check whether our code is formatted correctly. Edit `format-check.yml` to have the following content:
  ```yaml
  name: format-check
  on: [push, pull_request]
  jobs:
    format:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v3
        - uses: julia-actions/setup-julia@v1
          with:
            version: '1.9'
        - name: Install JuliaFormatter and format
          run: |
            julia  -e 'using Pkg; Pkg.add(PackageSpec(name="JuliaFormatter"))'
            julia  -e 'using JuliaFormatter; format(".", verbose=true)'
        - name: Format check
          run: |
            julia -e '
            out = Cmd(`git diff --name-only`) |> read |> String
            if out == ""
                exit(0)
            else
                @error "Some files have not been formatted"
                write(stdout, out)
                exit(1)
            end'
  ```
 - `runs-on` does **not** refer to a Docker container, but to a runner tag.
 - Add, commit, push
 - After the push, inspect the "Actions" tab of the GitHub repository
    - GitHub will schedule a run (yellow dot)
    - Hooray. We have set up our first action.
 - Failing test example:
    - Edit settings on GitHub so that one can only merge if all checks pass:
        - Settings -> Branches -> Branch protection rule
        - Choose `main` branch
        - Enable "Require status checks to pass before merging". Optionally enable "Require branches to be up to date before merging"
        - Choose status checks that need to pass: `format` (the job defined above)
        - Click on "Create" at bottom of page.
    - Create a new branch `break-code`.
    - Edit some file, violate the formatting, commit it, and push it to the branch. Afterwards, open a new PR and inspect the failing check. The changes cannot be merged, since the "Merge" button is inactive.
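 A possible alternative to the `git diff` based check is sketched below. It assumes that `JuliaFormatter.format` returns `true` when everything was already formatted and that `overwrite=false` leaves the files untouched; check the JuliaFormatter documentation for the installed version before relying on it.
 ```julia
 using JuliaFormatter
 # `format` returns `true` if all files were already formatted.
 # `overwrite=false` leaves the files untouched, so this is a pure check.
 already_formatted = format("."; overwrite=false, verbose=true)
 if !already_formatted
     @error "Some files have not been formatted"
     exit(1)
 end
 ```
 With this variant, the check does not depend on the state of the Git working tree.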
 ---
 ## act Demo
 - `act` is for quick checks while developing workflows, not for developing the code
 - Check available jobs (at root of repository) 
  ```bash
  act -l
  ```
 - Run jobs for `push` event (default event)
  ```bash
  act
  ```
 - Run a specific job
  ```bash
  act -j format
  ```
 ---
 # 4. Exercise
 Set up GitHub Actions for your statistics package. They should format your code and run the tests. To structure and parallelize things, you could use two separate jobs.
--- a/material/2_tue/git/slides.md
+++ b/material/2_tue/git/slides.md
@@ -32,7 +32,7 @@ slideOptions:
  }
 </style>
-## Learning Goals of the Git Lecture
+# Learning Goals
 - Refresh and organize students' existing knowledge on Git (learn how to learn more).
 - Students can explain difference between merge and rebase and when to use what.
--- a/material/2_tue/git/tasks.qmd
+++ b/material/2_tue/git/tasks.qmd
@@ -3,6 +3,7 @@
 1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again).
 2. Push your package `MyStatsPackage` to a remote repository.
 3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded).
-4. Use the package from somebody else in the classroom and verify with `printOwner` that you use the correct package.
+4. Start a new Julia environment and use your package through its URL: `]add https://github.com/[username]/MyStatsPackage`.
-5. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
+5. Now use the package from somebody else in the classroom instead and verify with `printOwner` that you use the correct package.
-6. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.
+6. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
 7. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.
--- a/material/2_tue/testing/slides.qmd
+++ b/material/2_tue/testing/slides.qmd
@@ -1,9 +1,37 @@
 ---
-format: revealjs
+type: slide
-
+slideOptions:
  transition: slide
  width: 1400
  height: 900
  margin: 0.1
 ---
 <style>
  .reveal strong {
    font-weight: bold;
    color: orange;
  }
  .reveal p {
    text-align: left;
  }
  .reveal section h1 {
    color: orange;
  }
  .reveal section h2 {
    color: orange;
  }
  .reveal code {
    font-family: 'Ubuntu Mono';
    color: orange;
  }
  .reveal section img {
    background:none;
    border:none;
    box-shadow:none;
  }
 </style>
 # Learning Goals
 - Justify the effort of developing tests to some extent
--- a/material/3_wed/regression/Code_Snippets.jl
+++ b/material/3_wed/regression/Code_Snippets.jl
@@ -0,0 +1,52 @@
 ############################################################################
 #### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
 ############################################################################
 using Statistics
 using Plots
 using RDatasets
 using GLM
 ##
 trees = dataset("datasets", "trees")
 scatter(trees.Girth, trees.Volume,
        legend=false, xlabel="Girth", ylabel="Volume")
 ##
 scatter(trees.Girth, trees.Volume,
        legend=false, xlabel="Girth", ylabel="Volume")
 plot!(x -> -37 + 5*x)
 ##
 linmod1 = lm(@formula(Volume ~ Girth), trees)
 ##
 linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
 ##
 r2(linmod1)
 r2(linmod2)
 linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
 r2(linmod3)
 ##
 using CSV
 using HTTP
 http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
 SwissLabor = DataFrame(CSV.File(http_response.body))
 SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
 ##
 model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())
--- a/material/3_wed/regression/MultipleRegressionBasics.qmd
+++ b/material/3_wed/regression/MultipleRegressionBasics.qmd
@@ -1,253 +1,292 @@
---
+---
-editor: 
+editor: 
-  markdown: 
+  markdown: 
-    wrap: 72
+    wrap: 72
---
+---
-
+
-# Multiple Regression Basics
+# Multiple Regression Basics
-
+
-## Motivation
+## Motivation
-
+
-### Introductory Example: tree dataset from R
+### Introductory Example: tree dataset from R
-
+
-\[figure of raw data\]
+```{julia}
-
+using Statistics
-*Aim:* Find relationship between the *response variable* `volume` and
+using Plots
-the *explanatory variable/covariate* `girth`? Can we predict the volume
+using RDatasets
-of a tree given its girth?
+
-
+trees = dataset("datasets", "trees")
-\[figure including a straight line\]
+
-
+scatter(trees.Girth, trees.Volume,
-First Guess: There is a linear relation!
+        legend=false, xlabel="Girth", ylabel="Volume")
-
+```
-## Simple Linear Regression
+
-
+*Aim:* Find the relationship between the *response variable* `volume` and
-Main assumption: up to some error term, each measurement of the response
+the *explanatory variable/covariate* `girth`. Can we predict the volume
-variable $y_i$ depends linearly on the corresponding value $x_i$ of the
+of a tree given its girth?
-covariate
+
-
+```{julia}
-$\leadsto$ **(Simple) Linear Model:**
+scatter(trees.Girth, trees.Volume,
-$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   \qquad  i=1,...,n,$$
+        legend=false, xlabel="Girth", ylabel="Volume")
-where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
+plot!(x -> -37 + 5*x)
-normally distributed errors with unknown variance $\sigma^2$.
+```
-
+
-*Task:* Find the straight line that fits best, i.e., find the *optimal*
+First Guess: There is a linear relation!
-estimators for $\beta_0$ and $\beta_1$.
+
-
+## Simple Linear Regression
-*Typical choice*: Least squares estimator (= maximum likelihood
+
-estimator for normal errors)
+Main assumption: up to some error term, each measurement of the response
-
+variable $y_i$ depends linearly on the corresponding value $x_i$ of the
-$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
+covariate
-
+
-where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
+$\leadsto$ **(Simple) Linear Model:**
-vector of covariates and $\mathbf{1}$ is a vector of ones.
+$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   \qquad  i=1,...,n,$$
-
+where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
-Written in matrix style:
+normally distributed errors with unknown variance $\sigma^2$.
-
+
-$$
+*Task:* Find the straight line that fits best, i.e., find the *optimal*
- (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
+estimators for $\beta_0$ and $\beta_1$.
-$$
+
-
+*Typical choice*: Least squares estimator (= maximum likelihood
-Note: There is a closed-form expression for
+estimator for normal errors)
-$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
+
-rather use Julia to solve the problem.
+$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
-
+
-\[use Julia code (existing package) to perform linear regression for
+where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
-`volume ~ girth`\]
+vector of covariates and $\mathbf{1}$ is a vector of ones.
-
+
-*Interpretation of the Julia output:*
+Written in matrix style:
-
+
-   column `estimate` : least square estimates for $\hat \beta_0$ and
+$$
-    $\hat \beta_1$
+ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
-
+$$
-   column `Std. Error` : estimated standard deviation
+
-    $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
+Note: There is a closed-form expression for
-
+$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
-   column `t value` : value of the $t$-statistics
+rather use Julia to solve the problem.
-
+
-    $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
+\[use Julia code (existing package) to perform linear regression for
-
+`volume ~ girth`\]
-    Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
+
-    follow a $t$-distribution.
+```{julia}
-
+using GLM
+
+linmod1 = lm(@formula(Volume ~ Girth), trees)
-   column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
+```
-    $i=0,1$
+
-
+*Interpretation of the Julia output:*
-**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
+
-these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
+-   column `Coef.` : least squares estimates $\hat \beta_0$ and
-(you can choose them)), simulate responses from a linear model and
+    $\hat \beta_1$
-estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
+
-choices of the parameters to see the effects on the parameter estimates
+-   column `Std. Error` : estimated standard deviation
-and the $p$-values.
+    $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
-
+
-## Multiple Regression Model
+-   column `t` : value of the $t$-statistic
-
+
-*Idea*: Generalize the simple linear regression model to multiple
+    $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
-covariates, w.g., predict `volume` using `girth` and \`height\`\`.
+
-
+    Under the hypothesis $\beta_i=0$, the test statistic $t_i$ would
-$\leadsto$ **Linear Model:**
+    follow a $t$-distribution.
-$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i,   \qquad  i=1,...,n,$$where
+
-
+-   column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for
-   $y_i$: $i$-th measurement of the response,
+    $i=0,1$
-
+
-   $x_{i1}$: $i$ th value of first covariate,
+**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
-
+these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
-   ...
+(you can choose them), simulate responses from a linear model and
-
+estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
-   $x_{ip}$: $i$-th value of $p$-th covariate,
+choices of the parameters to see the effects on the parameter estimates
-
+and the $p$-values.
-   $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
+
-    distributed errors with unknown variance $\sigma^2$.
+## Multiple Regression Model
-
+
-*Task:* Find the *optimal* estimators for
+*Idea*: Generalize the simple linear regression model to multiple
-$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
+covariates, e.g., predict `volume` using `girth` and `height`.
-
+
-*Our choice again:* Least squares estimator (= maximum likelihood
+$\leadsto$ **Linear Model:**
-estimator for normal errors)
+$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i,   \qquad  i=1,...,n,$$ where
-
+
-$$
+-   $y_i$: $i$-th measurement of the response,
- \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
+
-$$
+-   $x_{i1}$: $i$-th value of the first covariate,
-
+
-where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
+-   ...
-vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
+
-
+-   $x_{ip}$: $i$-th value of $p$-th covariate,
-Written in matrix style:
+
-
+-   $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
-$$
+    distributed errors with unknown variance $\sigma^2$.
-\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
+
-$$
+*Task:* Find the *optimal* estimators for
-
+$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
-Defining the *design matrix*
+
-
+*Our choice again:* Least squares estimator (= maximum likelihood
-$$ \mathbf{X} = \left( \begin{array}{cccc}
+estimator for normal errors)
-                1 & x_{11} & \ldots & x_{1p} \\
+
-                \vdots & \vdots & \ddots & \vdots \\
+$$
-                1 & x_{11} & \ldots & x_{1p} 
+ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
-                \end{array}\right) \qquad 
+$$
-  (\text{size } n \times (p+1)), $$
+
-
+where $\mathbf{y}$ is the vector of responses, $\mathbf{x}_j$ is the
-we get the short form
+vector of the $j$-th covariate and $\mathbf{1}$ is a vector of ones.
-
+
-$$
+Written in matrix style:
-\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta}  \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
+
-$$
+$$
-
+\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
-\[use Julia code (existing package) to perform linear regression for
+$$
-`volume ~ girth + height`\]
+
-
+Defining the *design matrix*
-The interpretation of the Julia output is similar to the simple linear
+
-regression model, but we provide explicit formulas now:
+$$ \mathbf{X} = \left( \begin{array}{cccc}
-
+                1 & x_{11} & \ldots & x_{1p} \\
-   parameter estimates:
+                \vdots & \vdots & \ddots & \vdots \\
-
+                1 & x_{n1} & \ldots & x_{np} 
-    $$
+                \end{array}\right) \qquad 
-     (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
+  (\text{size } n \times (p+1)), $$
-    $$
+
-
+we get the short form
-   estimated standard errors:
+
-
+$$
-    $$
+\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta}  \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
-     \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
+$$
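+
+The short form can be illustrated on toy data. The following is a minimal sketch with made-up, noise-free values (`x1`, `x2` and `beta_true` are hypothetical and not taken from the `trees` data), so that the estimate recovers the true parameters up to rounding.
+
+```julia
+using LinearAlgebra
+
+# Hypothetical toy data: n = 5 observations, p = 2 covariates, no noise.
+x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
+x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
+beta_true = [1.0, 2.0, -0.5]
+
+X = hcat(ones(5), x1, x2)            # design matrix of size n x (p+1)
+y = X * beta_true                    # noise-free responses
+
+beta_normal = (X' * X) \ (X' * y)    # normal equations without forming the inverse
+beta_qr = X \ y                      # least squares via QR, numerically preferable
+
+beta_normal ≈ beta_true && beta_qr ≈ beta_true
+```
+
+Solving the linear system with `\` instead of computing $(\mathbf{X}^\top \mathbf{X})^{-1}$ explicitly is a common design choice, since it is cheaper and more stable.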
-    $$
+
-
+\[use Julia code (existing package) to perform linear regression for
-   $t$-statistics:
+`volume ~ girth + height`\]
-
+
-    $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
+The interpretation of the Julia output is similar to the simple linear
-
+regression model, but we provide explicit formulas now:
-   $p$-values:
+
-
+-   parameter estimates:
-    $$
+
-    p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
+    $$
-    $$
+     (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
-
+    $$
-**Task 2**: Implement functions that estimate the $\beta$-parameters,
+
-the corresponding standard errors and the $t$-statistics. Test your
+-   estimated standard errors:
-functions with the \`\`\`tree''' data set and try to reproduce the
+
-output above.
+    $$
-
+     \hat s_{\hat \beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p-1} \|\mathbf{y} - \mathbf{X} \hat \beta\|^2}
-## Generalized Linear Models
+    $$
-
+
-Classical linear model
+-   $t$-statistics:
-
+
-$$
+    $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- \mathbf{y} = \mathbf{X} \beta + \varepsilon 
+
-$$
+-   $p$-values:
-
+
-implies that
+    $$
-$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
+    p\text{-value} = \mathbb{P}(|T| > |t_i|), \quad \text{where } T \sim t_{n-p-1}
-
+    $$
-In particular, the conditional expectation satisfies
+
-$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
+**Task 2**: Implement functions that estimate the $\beta$-parameters,
-
+the corresponding standard errors and the $t$-statistics. Test your
-However, the assumption of a normal distribution is unrealistic for
+functions with the `trees` data set and try to reproduce the
-non-continuous data. Popular alternatives include:
+output above.
-
+
-   for counting data: $$
+```{julia}
-       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
+r2(linmod1)
-      $$
+linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
+r2(linmod2)
-
+
-    Here, the components are considered to be independent and the
+linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
-    exponential function is applied componentwise.
+
-
+r2(linmod3)
-   for binary data: $$
+```
-       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
+
-      $$
+## Generalized Linear Models
-
+
-    Again, the components are considered to be independent and all the
+Classical linear model
-    operations are applied componentwise.
+
-
+$$
-All these models are defined by the choice of a family of distributions
+ \mathbf{y} = \mathbf{X} \beta + \varepsilon 
-and a function $g$ (the so-called *link function*) such that
+$$
-
+
-$$
+implies that
- \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
+$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
-$$
+
-
+In particular, the conditional expectation satisfies
-For the models above, these are:
+$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
-
+
-+----------------------+---------------------+----------------------+
+However, the assumption of a normal distribution is unrealistic for
-| Type of Data         | Distribution Family | Link Function        |
+non-continuous data. Popular alternatives include:
-+======================+=====================+======================+
+
-| continuous           | Normal              | identity:            |
+-   for count data: $$
-|                      |                     |                      |
+       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
-|                      |                     | $$                   |
+      $$
-|                      |                     | g(x)=x               |
+
-|                      |                     | $$                   |
+    Here, the components are considered to be independent and the
-+----------------------+---------------------+----------------------+
+    exponential function is applied componentwise.
-| count                | Poisson             | log:                 |
+
-|                      |                     |                      |
+-   for binary data: $$
-|                      |                     | $$                   |
+       \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto  \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
-|                      |                     |  g(x) = \log(x)      |
+      $$
-|                      |                     | $$                   |
+
-+----------------------+---------------------+----------------------+
+    Again, the components are considered to be independent and all the
-| binary               | Bernoulli           | logit:               |
+    operations are applied componentwise.
-|                      |                     |                      |
+
-|                      |                     | $$                   |
+All these models are defined by the choice of a family of distributions
-|                      |                     | g(x) = \log\left     |
+and a function $g$ (the so-called *link function*) such that
-|                      |                     | (                    |
+
-|                      |                     | \frac{x}{1-x}\right) |
+$$
-|                      |                     | $$                   |
+ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
-+----------------------+---------------------+----------------------+
+$$
-
+
-In general, the parameter vector $\beta$ is estimated via maximizing the
+For the models above, these are:
-likelihood, i.e.,
+
-
++--------------+---------------------+--------------------+
-$$
+| Type of Data | Distribution Family | Link Function      |
-\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
++==============+=====================+====================+
-$$
+| continuous   | Normal              | identity:          |
-
+|              |                     |                    |
-which is equivalent to the maximization of the log-likelihood, i.e.,
+|              |                     | $$                 |
-
+|              |                     | g(x)=x             |
-$$
+|              |                     | $$                 |
-\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
++--------------+---------------------+--------------------+
-$$
+| count        | Poisson             | log:               |
-
+|              |                     |                    |
-In the Gaussian case, the maximum likelihood estimator is identical to
+|              |                     | $$                 |
-the least squares estimator considered above.
+|              |                     |  g(x) = \log(x)    |
-
+|              |                     | $$                 |
-\[\[ Example in Julia: maybe `SwissLabor` \]\]
++--------------+---------------------+--------------------+
-
+| binary       | Bernoulli           | logit:             |
-**Task 3:** Reproduce the results of our data analysis of the `tree`
+|              |                     |                    |
-data set using a generalized linear model with normal distribution
+|              |                     | $$                 |
-family.
+|              |                     | g(x) = \log\left(  |
 |              |                     | \frac{x}{1-x}      |
 |              |                     | \right)            |
 |              |                     | $$                 |
 +--------------+---------------------+--------------------+
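 The inverse link functions $g^{-1}$ from the table can be sketched in plain Julia as follows. The design matrix `X` and the parameter vector `beta` are hypothetical toy values, chosen only to show the componentwise application.
 ```julia
 # Inverse link functions g^{-1}, mapping the linear predictor to the mean.
 inv_identity(eta) = eta                  # Normal family
 inv_log(eta) = exp(eta)                  # Poisson family
 inv_logit(eta) = 1 / (1 + exp(-eta))     # Bernoulli family
 # Hypothetical toy design matrix (intercept + one covariate) and parameters.
 X = [1.0 0.0; 1.0 1.0; 1.0 2.0]
 beta = [-1.0, 0.5]
 eta = X * beta                           # linear predictor X * beta
 mu_poisson = inv_log.(eta)               # componentwise, all entries positive
 mu_bernoulli = inv_logit.(eta)           # componentwise, all entries in (0, 1)
 ```
 Note that `1 / (1 + exp(-eta))` is the same function as `exp(eta) / (1 + exp(eta))` used in the formula for binary data above.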
 In general, the parameter vector $\beta$ is estimated via maximizing the
 likelihood, i.e.,
 $$
 \hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
 $$
 which is equivalent to the maximization of the log-likelihood, i.e.,
 $$
 \hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
 $$
 In the Gaussian case, the maximum likelihood estimator is identical to
 the least squares estimator considered above.
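 The equivalence of maximizing the likelihood and the log-likelihood can be checked numerically on a toy example. The sketch below assumes a hypothetical binary sample `y` and a Bernoulli model with a single parameter `p` (no covariates), and uses a simple grid search instead of a proper optimizer.
 ```julia
 # Hypothetical binary sample; the model is Bernoulli(p) with a single parameter p.
 y = [1, 0, 1, 1, 0, 1, 1, 1]
 lik(p) = prod(p^yi * (1 - p)^(1 - yi) for yi in y)
 loglik(p) = sum(yi * log(p) + (1 - yi) * log(1 - p) for yi in y)
 grid = 0.01:0.01:0.99                    # candidate values for p
 p_lik = grid[argmax(lik.(grid))]         # maximizer of the likelihood
 p_loglik = grid[argmax(loglik.(grid))]   # maximizer of the log-likelihood
 # Both maximizers coincide and match the sample mean 6/8.
 p_lik == p_loglik
 ```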
 ```{julia}
 using CSV
 using HTTP
 http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
 SwissLabor = DataFrame(CSV.File(http_response.body))
 SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
 model = glm(@formula(participation ~ age^2), 
            SwissLabor, Binomial(), ProbitLink())
 ```
 **Task 3:** Reproduce the results of our data analysis of the `trees`
 data set using a generalized linear model with normal distribution
 family.