diff --git a/_quarto.yml b/_quarto.yml index 117e3b4..2542c00 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -57,9 +57,9 @@ website: text: "📝 1 - Advanced Git and Contributing" - href: "material/2_tue/git/tasks.qmd" text: "🛠 1 - Git: Exercises" - - href: "material/2_tue/testing/slides.qmd" + - href: "material/2_tue/testing/slides.md" text: "📝 2 - Testing" - - href: "material/2_tue/CI/missing.qmd" + - href: "material/2_tue/ci/slides.md" text: "📝 3 - Continuous Integration" - href: material/2_tue/codereview/slides.qmd text: "📝 4 - Code Review" diff --git a/cheatsheets/git.qmd b/cheatsheets/git.qmd index e69de29..becee2f 100644 --- a/cheatsheets/git.qmd +++ b/cheatsheets/git.qmd @@ -0,0 +1 @@ +There are many good ones out there. One we can recommend is the [one from GitHub](https://education.github.com/git-cheat-sheet-education.pdf). diff --git a/cheatsheets/githubactions.qmd b/cheatsheets/githubactions.qmd index e69de29..ac8eaee 100644 --- a/cheatsheets/githubactions.qmd +++ b/cheatsheets/githubactions.qmd @@ -0,0 +1 @@ +Also [one from GitHub](https://github.github.io/actions-cheat-sheet/actions-cheat-sheet.pdf) diff --git a/material/2_tue/ci/slides.md b/material/2_tue/ci/slides.md new file mode 100644 index 0000000..b74324a --- /dev/null +++ b/material/2_tue/ci/slides.md @@ -0,0 +1,395 @@ +--- +type: slide +slideOptions: + transition: slide + width: 1400 + height: 900 + margin: 0.1 +--- + + + +# Learning Goals + +- Name and explain common workflows to automate in RSE. +- Explain the differences between the various continuous methodologies. +- Explain why automation is crucial in RSE. +- Write and understand basic automation scripts for GitHub Actions. + - s.t. we understand what `PkgTemplates` generates for us. + + +Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material). + +--- + +# 1. Workflow Automation + +--- + +## Why Automation? + +- Automatize tasks + - Run tests frequently, give feedback early etc. + - Ensure reproducible test environments + - Cannot forget automatized tasks + - Less burden to developer (and their workstation) + - Avoid manual errors +- Process often integrated in development workflow + - Example: Support by Git hooks or Git forges + +--- + +## Typical Automation Tasks in RSE + +- Check code formatting and quality +- Compile and test code for different platforms +- Generate coverage reports and visualization +- Build documentation and deploy it +- Build, package, and upload releases + +--- + +## Continuous Methodologies (1/2) + +- **Continuous Integration** (CI) + - Continuously integrate changes into "main" branch + - Avoids "merge hell" + - Relies on testing and checking code continuously + - Should be automatized + +--- + +## Continuous Methodologies (2/2) + +- **Continuous Delivery** (CD) + - Software is in a state that allows new release at any time + - Software package is built + - Actual release triggered manually +- **Continuous Deployment** (CD) + - Software is in a state that allows new release at any time + - Software package is built + - Actual release triggered automatically (continuously) + +--- + +## Automation Services/Software + +- [GitHub Actions](https://github.com/features/actions) +- [GitLab CI/CD](https://docs.gitlab.com/ee/ci/) +- [Circle CI](https://circleci.com/) +- [Travis CI](https://www.travis-ci.com/) +- [Jenkins](https://www.jenkins.io/) +- ... + +--- + +# 2. GitHub Actions + +--- + +## What is "GitHub Actions"? + +> Automate, customize, and execute your software development workflows right in your repository with GitHub Actions. + +From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions) + +--- + +## General Information + +- Usage of GitHub's runners is [limited](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits) +- Available for public repositories or accounts with subscription +- By default Actions run on GitHub's runners + - Linux, Windows, or MacOS +- Quickly evolving and significant improvements in recent years + +--- + +## Components (1/2) + +- [Workflow](https://docs.github.com/en/actions/using-workflows): Runs one or more jobs +- [Event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows): Triggers a workflow +- [Jobs](https://docs.github.com/en/actions/using-jobs): Set of steps (running on same runner) + - Steps executed consecutively and share data + - Jobs by default executed in parallel +- [Action](https://docs.github.com/en/actions/creating-actions): Application performing common, complex task (step) often used in workflows +- [Runner](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners): Server that runs jobs +- [Artifacts](https://docs.github.com/en/actions/learn-github-actions/essential-features-of-github-actions#sharing-data-between-jobs): Files to be shared between jobs or to be kept after workflow finishes + +--- + +## Components (2/2) + + + + +From [GitHub Actions tutorial](https://docs.github.com/en/actions) + +--- + +## Setting up a Workflow + +- Workflow file files stored `${REPO_ROOT}/.github/workflows` +- Configured via YAML file + +```yaml +name: learn-github-actions +on: [push] +jobs: + check-bats-version: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - uses: actions/setup-node@v2 + with: + node-version: '14' + - run: npm install -g bats + - run: bats -v +``` + +--- + +## Actions + +```yaml +- uses: actions/checkout@v3 +- uses: actions/setup-node@v2 + with: + node-version: '14' +``` + +- Integrated via `uses` directive +- Additional configuration via `with` (options depend on Action) +- Find actions in [marketplace](https://github.com/marketplace?type=actions) +- Write [own actions](https://docs.github.com/en/actions/creating-actions) + +--- + +## Some Useful Julia Actions + +- Find on [gitHub.com/julia-actions](https://github.com/julia-actions/) + + ``` + - uses: julia-actions/setup-julia@v1 + with: + version: '1.9' + ``` + +- More: + - `cache`: caches `~/.julia/artifacts/*` and `~/.julia/packages/*` to reduce runtime of CI + - `julia-buildpkg`: build package + - `julia-runtest`: run tests + - `julia-format`: format code + +--- + +## User-specified Commands + +```yaml +- name: "Single line command" + run: echo "Single line command" +- name: "Multi line command" + run: | + echo "First line" + echo "Second line. Directory ${PWD}" + workdir: tmp/ + shell: bash +``` + +--- + +## Events + +- Single or multiple events + + ```yaml + on: [push, fork] + ``` + +- Activities + + ```yaml + on: + issue: + types: + - opened + - labeled + ``` + +- Filters + + ```yaml + on: + push: + branches: + - main + - 'releases/**' + ``` + +--- + +## Artifacts + +- Data sharing between jobs and data upload +- Uploading artifact + + ```yaml + - name: "Upload artifact" + uses: actions/upload-artifact@v2 + with: + name: my-artifact + path: my_file.txt + retention-days: 5 + ``` + +- Downloading artifact + + ```yaml + - name: "Download a single artifact" + uses: actions/download-artifact@v2 + with: + name: my-artifact + ``` + + **Note**: Drop name to download all artifacts + +--- + +## Test Actions Locally + +- [act](https://github.com/nektos/act) +- Relies extensively on Docker + - User should be in `docker` group +- Run `act` from root of the repository + + ```text + act (runs all workflows) + act --job WORKFLOWNAME + ``` + +- Environment is not 100% identical to GitHub's + - Workflows may fail locally, but work on GitHub + +--- + +## Further Reading + +- [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration) +- [GitHub Actions documentation](https://docs.github.com/en/actions) +- [GitHub Actions quickstart](https://docs.github.com/en/actions/quickstart) + +--- + +# 3. Demo: Automation with GitHub Actions + +--- + +## Setting up a Test Job + +- Import [Julia test package repository](https://github.com/uekerman/JuliaTestPackage) (the same code we used for testing) +- Set up workflow file + + ```bash + mkdir -p .github/workflows + cd .github/workflows + vi format-check.yml + ``` + +- Let's check whether our code is formatted correctly. Edit `format-check.yml` to have following content + + ```yaml + name: format-check + + on: [push, pull_request] + + jobs: + format: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - uses: julia-actions/setup-julia@v1 + with: + version: '1.9' + - name: Install JuliaFormatter and format + run: | + julia -e 'using Pkg; Pkg.add(PackageSpec(name="JuliaFormatter"))' + julia -e 'using JuliaFormatter; format(".", verbose=true)' + - name: Format check + run: | + julia -e ' + out = Cmd(`git diff --name-only`) |> read |> String + if out == "" + exit(0) + else + @error "Some files have not been formatted" + write(stdout, out) + exit(1) + end' + ``` + +- `runs-on` does **not** refer to a Docker container, but to a runner tag. +- Add, commit, push +- After the push, inspect "Action" panel on GitHub repository + - GitHub will schedule a run (yellow dot) + - Hooray. We have set up our first action. +- Failing test example: + - Edit settings on GitHub that one can only merge if all tests pass: + - Settings -> Branches -> Branch protection rule + - Choose `main` branch + - Enable "Require status checks to pass before merging". Optionally enable "Require branches to be up to date before merging" + - Choose status checks that need to pass: `test` + - Click on "Create" at bottom of page. + - Create a new branch `break-code`. + - Edit some file, violate the formatting, commit it and push it to the branch. Afterwards open a new PR and inspect the failing test. We are also not able to merge the changes as the "Merge" button should be inactive. + +--- + +## act Demo + +- `act` is for quick checks while developing workflows, not for developing the code +- Check available jobs (at root of repository) + + ```bash + act -l + ``` + +- Run jobs for `push` event (default event) + + ```bash + act + ``` + +- Run a specific job + + ```bash + act -j test + ``` + +--- + +# 4. Exercise + +Set up GitHub Actions for your statistics package. They should format your code and run the tests. To structure and parallelize things, you could use two separate jobs. diff --git a/material/2_tue/git/slides.md b/material/2_tue/git/slides.md index f1540fc..a1b4f22 100644 --- a/material/2_tue/git/slides.md +++ b/material/2_tue/git/slides.md @@ -32,7 +32,7 @@ slideOptions: } -## Learning Goals of the Git Lecture +# Learning Goals - Refresh and organize students' existing knowledge on Git (learn how to learn more). - Students can explain difference between merge and rebase and when to use what. diff --git a/material/2_tue/git/tasks.qmd b/material/2_tue/git/tasks.qmd index 94b9a45..0ecbf26 100644 --- a/material/2_tue/git/tasks.qmd +++ b/material/2_tue/git/tasks.qmd @@ -3,6 +3,7 @@ 1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again). 2. Push your package `MyStatsPackage` to a remote repository. 3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded). -4. Use the package from somebody else in the classroom and verify with `printOwner` that you use the correct package. -5. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged. -6. Add more functions to other packages of classmates that print funny things, but always ensure a linear history. +4. Start a new Julia environment and use your package through its url: `]add https://github.com/[username]/MyStatsPackage`. +5. Now use the package from somebody else in the classroom instead and verify with `printOwner` that you use the correct package. +6. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged. +7. Add more functions to other packages of classmates that print funny things, but always ensure a linear history. diff --git a/material/2_tue/testing/slides.qmd b/material/2_tue/testing/slides.md similarity index 94% rename from material/2_tue/testing/slides.qmd rename to material/2_tue/testing/slides.md index 96a877d..539efeb 100644 --- a/material/2_tue/testing/slides.qmd +++ b/material/2_tue/testing/slides.md @@ -1,9 +1,37 @@ - --- -format: revealjs - +type: slide +slideOptions: + transition: slide + width: 1400 + height: 900 + margin: 0.1 --- + + # Learning Goals - Justify the effort of developing tests to some extent diff --git a/material/3_wed/regression/Code_Snippets.jl b/material/3_wed/regression/Code_Snippets.jl new file mode 100644 index 0000000..05bb3e8 --- /dev/null +++ b/material/3_wed/regression/Code_Snippets.jl @@ -0,0 +1,52 @@ +############################################################################ +#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' #### +############################################################################ + +using Statistics +using Plots +using RDatasets +using GLM + +## + +trees = dataset("datasets", "trees") + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") + +## + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) + +## + +linmod1 = lm(@formula(Volume ~ Girth), trees) + +## + +linmod2 = lm(@formula(Volume ~ Girth + Height), trees) + +## + +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) + +## + +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +## + +model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink()) \ No newline at end of file diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index 8af7871..1f23123 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -1,253 +1,292 @@ ---- -editor: - markdown: - wrap: 72 ---- - -# Multiple Regression Basics - -## Motivation - -### Introductory Example: tree dataset from R - -\[figure of raw data\] - -*Aim:* Find relationship between the *response variable* `volume` and -the *explanatory variable/covariate* `girth`? Can we predict the volume -of a tree given its girth? - -\[figure including a straight line\] - -First Guess: There is a linear relation! - -## Simple Linear Regression - -Main assumption: up to some error term, each measurement of the response -variable $y_i$ depends linearly on the corresponding value $x_i$ of the -covariate - -$\leadsto$ **(Simple) Linear Model:** -$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ -where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent -normally distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the straight line that fits best, i.e., find the *optimal* -estimators for $\beta_0$ and $\beta_1$. - -*Typical choice*: Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the -vector of covariates and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ - (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 -$$ - -Note: There is a closed-form expression for -$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but -rather use Julia to solve the problem. - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth`\] - -*Interpretation of the Julia output:* - -- column `estimate` : least square estimates for $\hat \beta_0$ and - $\hat \beta_1$ - -- column `Std. Error` : estimated standard deviation - $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ - -- column `t value` : value of the $t$-statistics - - $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ - - Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would - follow a $t$-distribution. - -- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for - $i=0,1$ - -**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given -these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ -(you can choose them)), simulate responses from a linear model and -estimate the coefficients $\beta_0$ and $\beta_1$. Play with different -choices of the parameters to see the effects on the parameter estimates -and the $p$-values. - -## Multiple Regression Model - -*Idea*: Generalize the simple linear regression model to multiple -covariates, w.g., predict `volume` using `girth` and \`height\`\`. - -$\leadsto$ **Linear Model:** -$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where - -- $y_i$: $i$-th measurement of the response, - -- $x_{i1}$: $i$ th value of first covariate, - -- ... - -- $x_{ip}$: $i$-th value of $p$-th covariate, - -- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally - distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the *optimal* estimators for -$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. - -*Our choice again:* Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ - \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 -$$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the -vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 -$$ - -Defining the *design matrix* - -$$ \mathbf{X} = \left( \begin{array}{cccc} - 1 & x_{11} & \ldots & x_{1p} \\ - \vdots & \vdots & \ddots & \vdots \\ - 1 & x_{11} & \ldots & x_{1p} - \end{array}\right) \qquad - (\text{size } n \times (p+1)), $$ - -we get the short form - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} -$$ - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth + height`\] - -The interpretation of the Julia output is similar to the simple linear -regression model, but we provide explicit formulas now: - -- parameter estimates: - - $$ - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - $$ - -- estimated standard errors: - - $$ - \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} - $$ - -- $t$-statistics: - - $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ - -- $p$-values: - - $$ - p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} - $$ - -**Task 2**: Implement functions that estimate the $\beta$-parameters, -the corresponding standard errors and the $t$-statistics. Test your -functions with the \`\`\`tree''' data set and try to reproduce the -output above. - -## Generalized Linear Models - -Classical linear model - -$$ - \mathbf{y} = \mathbf{X} \beta + \varepsilon -$$ - -implies that -$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ - -In particular, the conditional expectation satisfies -$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. - -However, the assumption of a normal distribution is unrealistic for -non-continuous data. Popular alternatives include: - -- for counting data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) - $$ - - Here, the components are considered to be independent and the - exponential function is applied componentwise. - -- for binary data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} - $$ - - Again, the components are considered to be independent and all the - operations are applied componentwise. - -All these models are defined by the choice of a family of distributions -and a function $g$ (the so-called *link function*) such that - -$$ - \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). -$$ - -For the models above, these are: - -+----------------------+---------------------+----------------------+ -| Type of Data | Distribution Family | Link Function | -+======================+=====================+======================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+----------------------+---------------------+----------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+----------------------+---------------------+----------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \frac{x}{1-x}\right) | -| | | $$ | -+----------------------+---------------------+----------------------+ - -In general, the parameter vector $\beta$ is estimated via maximizing the -likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -which is equivalent to the maximization of the log-likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -In the Gaussian case, the maximum likelihood estimator is identical to -the least squares estimator considered above. - -\[\[ Example in Julia: maybe `SwissLabor` \]\] - -**Task 3:** Reproduce the results of our data analysis of the `tree` -data set using a generalized linear model with normal distribution -family. +--- +editor: + markdown: + wrap: 72 +--- + +# Multiple Regression Basics + +## Motivation + +### Introductory Example: tree dataset from R + +```{julia} +using Statistics +using Plots +using RDatasets + +trees = dataset("datasets", "trees") + +scatter(trees.Volume, trees.Girth, + legend=false, xlabel="Girth", ylabel="Volume") +``` + +*Aim:* Find relationship between the *response variable* `volume` and +the *explanatory variable/covariate* `girth`? Can we predict the volume +of a tree given its girth? + +```{julia} +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) +``` + +First Guess: There is a linear relation! + +## Simple Linear Regression + +Main assumption: up to some error term, each measurement of the response +variable $y_i$ depends linearly on the corresponding value $x_i$ of the +covariate + +$\leadsto$ **(Simple) Linear Model:** +$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ +where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent +normally distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the straight line that fits best, i.e., find the *optimal* +estimators for $\beta_0$ and $\beta_1$. + +*Typical choice*: Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the +vector of covariates and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ + (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 +$$ + +Note: There is a closed-form expression for +$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but +rather use Julia to solve the problem. + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth`\] + +```{julia} +lm(@formula(Volume ~ Girth), trees) +``` + +*Interpretation of the Julia output:* + +- column `estimate` : least square estimates for $\hat \beta_0$ and + $\hat \beta_1$ + +- column `Std. Error` : estimated standard deviation + $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ + +- column `t value` : value of the $t$-statistics + + $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ + + Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would + follow a $t$-distribution. + +- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for + $i=0,1$ + +**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given +these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ +(you can choose them)), simulate responses from a linear model and +estimate the coefficients $\beta_0$ and $\beta_1$. Play with different +choices of the parameters to see the effects on the parameter estimates +and the $p$-values. + +## Multiple Regression Model + +*Idea*: Generalize the simple linear regression model to multiple +covariates, w.g., predict `volume` using `girth` and \`height\`\`. + +$\leadsto$ **Linear Model:** +$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where + +- $y_i$: $i$-th measurement of the response, + +- $x_{i1}$: $i$ th value of first covariate, + +- ... + +- $x_{ip}$: $i$-th value of $p$-th covariate, + +- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally + distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the *optimal* estimators for +$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. + +*Our choice again:* Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ + \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 +$$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the +vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 +$$ + +Defining the *design matrix* + +$$ \mathbf{X} = \left( \begin{array}{cccc} + 1 & x_{11} & \ldots & x_{1p} \\ + \vdots & \vdots & \ddots & \vdots \\ + 1 & x_{11} & \ldots & x_{1p} + \end{array}\right) \qquad + (\text{size } n \times (p+1)), $$ + +we get the short form + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth + height`\] + +The interpretation of the Julia output is similar to the simple linear +regression model, but we provide explicit formulas now: + +- parameter estimates: + + $$ + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} + $$ + +- estimated standard errors: + + $$ + \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} + $$ + +- $t$-statistics: + + $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ + +- $p$-values: + + $$ + p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} + $$ + +**Task 2**: Implement functions that estimate the $\beta$-parameters, +the corresponding standard errors and the $t$-statistics. Test your +functions with the \`\`\`tree''' data set and try to reproduce the +output above. + +```{julia} +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) +``` + +## Generalized Linear Models + +Classical linear model + +$$ + \mathbf{y} = \mathbf{X} \beta + \varepsilon +$$ + +implies that +$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ + +In particular, the conditional expectation satisfies +$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. + +However, the assumption of a normal distribution is unrealistic for +non-continuous data. Popular alternatives include: + +- for counting data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) + $$ + + Here, the components are considered to be independent and the + exponential function is applied componentwise. + +- for binary data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} + $$ + + Again, the components are considered to be independent and all the + operations are applied componentwise. + +All these models are defined by the choice of a family of distributions +and a function $g$ (the so-called *link function*) such that + +$$ + \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). +$$ + +For the models above, these are: + ++--------------+---------------------+--------------------+ +| Type of Data | Distribution Family | Link Function | ++==============+=====================+====================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++--------------+---------------------+--------------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++--------------+---------------------+--------------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g(x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | rac{x}{1-x}\right) | +| | | $$ | ++--------------+---------------------+--------------------+ + +In general, the parameter vector $\beta$ is estimated via maximizing the +likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +which is equivalent to the maximization of the log-likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +In the Gaussian case, the maximum likelihood estimator is identical to +the least squares estimator considered above. + +```{julia} +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +model = glm(@formula(participation ~ age^2), + SwissLabor, Binomial(), ProbitLink()) +``` + +**Task 3:** Reproduce the results of our data analysis of the `tree` +data set using a generalized linear model with normal distribution +family.