This commit is contained in:
behinger (s-ccs 001)
2023-10-09 09:51:53 +00:00
9 changed files with 779 additions and 262 deletions

View File

@@ -57,9 +57,9 @@ website:
text: "📝 1 - Advanced Git and Contributing" text: "📝 1 - Advanced Git and Contributing"
- href: "material/2_tue/git/tasks.qmd" - href: "material/2_tue/git/tasks.qmd"
text: "🛠 1 - Git: Exercises" text: "🛠 1 - Git: Exercises"
- href: "material/2_tue/testing/slides.qmd" - href: "material/2_tue/testing/slides.md"
text: "📝 2 - Testing" text: "📝 2 - Testing"
- href: "material/2_tue/CI/missing.qmd" - href: "material/2_tue/ci/slides.md"
text: "📝 3 - Continuous Integration" text: "📝 3 - Continuous Integration"
- href: material/2_tue/codereview/slides.qmd - href: material/2_tue/codereview/slides.qmd
text: "📝 4 - Code Review" text: "📝 4 - Code Review"

View File

@@ -0,0 +1 @@
There are many good ones out there. One we can recommend is the [one from GitHub](https://education.github.com/git-cheat-sheet-education.pdf).

View File

@@ -0,0 +1 @@
Also [one from GitHub](https://github.github.io/actions-cheat-sheet/actions-cheat-sheet.pdf)

395
material/2_tue/ci/slides.md Normal file
View File

@@ -0,0 +1,395 @@
---
type: slide
slideOptions:
transition: slide
width: 1400
height: 900
margin: 0.1
---
<style>
.reveal strong {
font-weight: bold;
color: orange;
}
.reveal p {
text-align: left;
}
.reveal section h1 {
color: orange;
}
.reveal section h2 {
color: orange;
}
.reveal code {
font-family: 'Ubuntu Mono';
color: orange;
}
.reveal section img {
background:none;
border:none;
box-shadow:none;
}
</style>
# Learning Goals
- Name and explain common workflows to automate in RSE.
- Explain the differences between the various continuous methodologies.
- Explain why automation is crucial in RSE.
- Write and understand basic automation scripts for GitHub Actions.
- s.t. we understand what `PkgTemplates` generates for us.
Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material).
---
# 1. Workflow Automation
---
## Why Automation?
- Automatize tasks
- Run tests frequently, give feedback early etc.
- Ensure reproducible test environments
- Cannot forget automatized tasks
- Less burden to developer (and their workstation)
- Avoid manual errors
- Process often integrated in development workflow
- Example: Support by Git hooks or Git forges
---
## Typical Automation Tasks in RSE
- Check code formatting and quality
- Compile and test code for different platforms
- Generate coverage reports and visualization
- Build documentation and deploy it
- Build, package, and upload releases
---
## Continuous Methodologies (1/2)
- **Continuous Integration** (CI)
- Continuously integrate changes into "main" branch
- Avoids "merge hell"
- Relies on testing and checking code continuously
- Should be automatized
---
## Continuous Methodologies (2/2)
- **Continuous Delivery** (CD)
- Software is in a state that allows new release at any time
- Software package is built
- Actual release triggered manually
- **Continuous Deployment** (CD)
- Software is in a state that allows new release at any time
- Software package is built
- Actual release triggered automatically (continuously)
---
## Automation Services/Software
- [GitHub Actions](https://github.com/features/actions)
- [GitLab CI/CD](https://docs.gitlab.com/ee/ci/)
- [Circle CI](https://circleci.com/)
- [Travis CI](https://www.travis-ci.com/)
- [Jenkins](https://www.jenkins.io/)
- ...
---
# 2. GitHub Actions
---
## What is "GitHub Actions"?
> Automate, customize, and execute your software development workflows right in your repository with GitHub Actions.
From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions)
---
## General Information
- Usage of GitHub's runners is [limited](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits)
- Available for public repositories or accounts with subscription
- By default Actions run on GitHub's runners
- Linux, Windows, or MacOS
- Quickly evolving and significant improvements in recent years
---
## Components (1/2)
- [Workflow](https://docs.github.com/en/actions/using-workflows): Runs one or more jobs
- [Event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows): Triggers a workflow
- [Jobs](https://docs.github.com/en/actions/using-jobs): Set of steps (running on same runner)
- Steps executed consecutively and share data
- Jobs by default executed in parallel
- [Action](https://docs.github.com/en/actions/creating-actions): Application performing common, complex task (step) often used in workflows
- [Runner](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners): Server that runs jobs
- [Artifacts](https://docs.github.com/en/actions/learn-github-actions/essential-features-of-github-actions#sharing-data-between-jobs): Files to be shared between jobs or to be kept after workflow finishes
---
## Components (2/2)
<img src="https://docs.github.com/assets/cb-25535/mw-1440/images/help/actions/overview-actions-simple.webp" width=95%; style="margin-left:auto; margin-right:auto; padding-top: 25px; padding-bottom: 25px; background: #eeeeee">
From [GitHub Actions tutorial](https://docs.github.com/en/actions)
---
## Setting up a Workflow
- Workflow file files stored `${REPO_ROOT}/.github/workflows`
- Configured via YAML file
```yaml
name: learn-github-actions
on: [push]
jobs:
check-bats-version:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-node@v2
with:
node-version: '14'
- run: npm install -g bats
- run: bats -v
```
---
## Actions
```yaml
- uses: actions/checkout@v3
- uses: actions/setup-node@v2
with:
node-version: '14'
```
- Integrated via `uses` directive
- Additional configuration via `with` (options depend on Action)
- Find actions in [marketplace](https://github.com/marketplace?type=actions)
- Write [own actions](https://docs.github.com/en/actions/creating-actions)
---
## Some Useful Julia Actions
- Find on [gitHub.com/julia-actions](https://github.com/julia-actions/)
```
- uses: julia-actions/setup-julia@v1
with:
version: '1.9'
```
- More:
- `cache`: caches `~/.julia/artifacts/*` and `~/.julia/packages/*` to reduce runtime of CI
- `julia-buildpkg`: build package
- `julia-runtest`: run tests
- `julia-format`: format code
---
## User-specified Commands
```yaml
- name: "Single line command"
run: echo "Single line command"
- name: "Multi line command"
run: |
echo "First line"
echo "Second line. Directory ${PWD}"
workdir: tmp/
shell: bash
```
---
## Events
- Single or multiple events
```yaml
on: [push, fork]
```
- Activities
```yaml
on:
issue:
types:
- opened
- labeled
```
- Filters
```yaml
on:
push:
branches:
- main
- 'releases/**'
```
---
## Artifacts
- Data sharing between jobs and data upload
- Uploading artifact
```yaml
- name: "Upload artifact"
uses: actions/upload-artifact@v2
with:
name: my-artifact
path: my_file.txt
retention-days: 5
```
- Downloading artifact
```yaml
- name: "Download a single artifact"
uses: actions/download-artifact@v2
with:
name: my-artifact
```
**Note**: Drop name to download all artifacts
---
## Test Actions Locally
- [act](https://github.com/nektos/act)
- Relies extensively on Docker
- User should be in `docker` group
- Run `act` from root of the repository
```text
act (runs all workflows)
act --job WORKFLOWNAME
```
- Environment is not 100% identical to GitHub's
- Workflows may fail locally, but work on GitHub
---
## Further Reading
- [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration)
- [GitHub Actions documentation](https://docs.github.com/en/actions)
- [GitHub Actions quickstart](https://docs.github.com/en/actions/quickstart)
---
# 3. Demo: Automation with GitHub Actions
---
## Setting up a Test Job
- Import [Julia test package repository](https://github.com/uekerman/JuliaTestPackage) (the same code we used for testing)
- Set up workflow file
```bash
mkdir -p .github/workflows
cd .github/workflows
vi format-check.yml
```
- Let's check whether our code is formatted correctly. Edit `format-check.yml` to have following content
```yaml
name: format-check
on: [push, pull_request]
jobs:
format:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: julia-actions/setup-julia@v1
with:
version: '1.9'
- name: Install JuliaFormatter and format
run: |
julia -e 'using Pkg; Pkg.add(PackageSpec(name="JuliaFormatter"))'
julia -e 'using JuliaFormatter; format(".", verbose=true)'
- name: Format check
run: |
julia -e '
out = Cmd(`git diff --name-only`) |> read |> String
if out == ""
exit(0)
else
@error "Some files have not been formatted"
write(stdout, out)
exit(1)
end'
```
- `runs-on` does **not** refer to a Docker container, but to a runner tag.
- Add, commit, push
- After the push, inspect "Action" panel on GitHub repository
- GitHub will schedule a run (yellow dot)
- Hooray. We have set up our first action.
- Failing test example:
- Edit settings on GitHub that one can only merge if all tests pass:
- Settings -> Branches -> Branch protection rule
- Choose `main` branch
- Enable "Require status checks to pass before merging". Optionally enable "Require branches to be up to date before merging"
- Choose status checks that need to pass: `test`
- Click on "Create" at bottom of page.
- Create a new branch `break-code`.
- Edit some file, violate the formatting, commit it and push it to the branch. Afterwards open a new PR and inspect the failing test. We are also not able to merge the changes as the "Merge" button should be inactive.
---
## act Demo
- `act` is for quick checks while developing workflows, not for developing the code
- Check available jobs (at root of repository)
```bash
act -l
```
- Run jobs for `push` event (default event)
```bash
act
```
- Run a specific job
```bash
act -j test
```
---
# 4. Exercise
Set up GitHub Actions for your statistics package. They should format your code and run the tests. To structure and parallelize things, you could use two separate jobs.

View File

@@ -32,7 +32,7 @@ slideOptions:
} }
</style> </style>
## Learning Goals of the Git Lecture # Learning Goals
- Refresh and organize students' existing knowledge on Git (learn how to learn more). - Refresh and organize students' existing knowledge on Git (learn how to learn more).
- Students can explain difference between merge and rebase and when to use what. - Students can explain difference between merge and rebase and when to use what.

View File

@@ -3,6 +3,7 @@
1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again). 1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again).
2. Push your package `MyStatsPackage` to a remote repository. 2. Push your package `MyStatsPackage` to a remote repository.
3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded). 3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded).
4. Use the package from somebody else in the classroom and verify with `printOwner` that you use the correct package. 4. Start a new Julia environment and use your package through its url: `]add https://github.com/[username]/MyStatsPackage`.
5. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged. 5. Now use the package from somebody else in the classroom instead and verify with `printOwner` that you use the correct package.
6. Add more functions to other packages of classmates that print funny things, but always ensure a linear history. 6. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
7. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.

View File

@@ -1,9 +1,37 @@
--- ---
format: revealjs type: slide
slideOptions:
transition: slide
width: 1400
height: 900
margin: 0.1
--- ---
<style>
.reveal strong {
font-weight: bold;
color: orange;
}
.reveal p {
text-align: left;
}
.reveal section h1 {
color: orange;
}
.reveal section h2 {
color: orange;
}
.reveal code {
font-family: 'Ubuntu Mono';
color: orange;
}
.reveal section img {
background:none;
border:none;
box-shadow:none;
}
</style>
# Learning Goals # Learning Goals
- Justify the effort of developing tests to some extent - Justify the effort of developing tests to some extent

View File

@@ -0,0 +1,52 @@
############################################################################
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
############################################################################
using Statistics
using Plots
using RDatasets
using GLM
##
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
##
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
##
linmod1 = lm(@formula(Volume ~ Girth), trees)
##
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
##
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
##
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
##
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())

View File

@@ -1,253 +1,292 @@
--- ---
editor: editor:
markdown: markdown:
wrap: 72 wrap: 72
--- ---
# Multiple Regression Basics # Multiple Regression Basics
## Motivation ## Motivation
### Introductory Example: tree dataset from R ### Introductory Example: tree dataset from R
\[figure of raw data\] ```{julia}
using Statistics
*Aim:* Find relationship between the *response variable* `volume` and using Plots
the *explanatory variable/covariate* `girth`? Can we predict the volume using RDatasets
of a tree given its girth?
trees = dataset("datasets", "trees")
\[figure including a straight line\]
scatter(trees.Volume, trees.Girth,
First Guess: There is a linear relation! legend=false, xlabel="Girth", ylabel="Volume")
```
## Simple Linear Regression
*Aim:* Find relationship between the *response variable* `volume` and
Main assumption: up to some error term, each measurement of the response the *explanatory variable/covariate* `girth`? Can we predict the volume
variable $y_i$ depends linearly on the corresponding value $x_i$ of the of a tree given its girth?
covariate
```{julia}
$\leadsto$ **(Simple) Linear Model:** scatter(trees.Girth, trees.Volume,
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ legend=false, xlabel="Girth", ylabel="Volume")
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent plot!(x -> -37 + 5*x)
normally distributed errors with unknown variance $\sigma^2$. ```
*Task:* Find the straight line that fits best, i.e., find the *optimal* First Guess: There is a linear relation!
estimators for $\beta_0$ and $\beta_1$.
## Simple Linear Regression
*Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors) Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ covariate
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the $\leadsto$ **(Simple) Linear Model:**
vector of covariates and $\mathbf{1}$ is a vector of ones. $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
Written in matrix style: normally distributed errors with unknown variance $\sigma^2$.
$$ *Task:* Find the straight line that fits best, i.e., find the *optimal*
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 estimators for $\beta_0$ and $\beta_1$.
$$
*Typical choice*: Least squares estimator (= maximum likelihood
Note: There is a closed-form expression for estimator for normal errors)
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem. $$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
\[use Julia code (existing package) to perform linear regression for where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
`volume ~ girth`\] vector of covariates and $\mathbf{1}$ is a vector of ones.
*Interpretation of the Julia output:* Written in matrix style:
- column `estimate` : least square estimates for $\hat \beta_0$ and $$
$\hat \beta_1$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$
- column `Std. Error` : estimated standard deviation
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
- column `t value` : value of the $t$-statistics rather use Julia to solve the problem.
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ \[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
follow a $t$-distribution. ```{julia}
lm(@formula(Volume ~ Girth), trees)
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for ```
$i=0,1$
*Interpretation of the Julia output:*
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ - column `estimate` : least square estimates for $\hat \beta_0$ and
(you can choose them)), simulate responses from a linear model and $\hat \beta_1$
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
choices of the parameters to see the effects on the parameter estimates - column `Std. Error` : estimated standard deviation
and the $p$-values. $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
## Multiple Regression Model - column `t value` : value of the $t$-statistics
*Idea*: Generalize the simple linear regression model to multiple $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
covariates, w.g., predict `volume` using `girth` and \`height\`\`.
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
$\leadsto$ **Linear Model:** follow a $t$-distribution.
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
- $y_i$: $i$-th measurement of the response, $i=0,1$
- $x_{i1}$: $i$ th value of first covariate, **Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
- ... (you can choose them)), simulate responses from a linear model and
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
- $x_{ip}$: $i$-th value of $p$-th covariate, choices of the parameters to see the effects on the parameter estimates
and the $p$-values.
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
distributed errors with unknown variance $\sigma^2$. ## Multiple Regression Model
*Task:* Find the *optimal* estimators for *Idea*: Generalize the simple linear regression model to multiple
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. covariates, w.g., predict `volume` using `girth` and \`height\`\`.
*Our choice again:* Least squares estimator (= maximum likelihood $\leadsto$ **Linear Model:**
estimator for normal errors) $$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
$$ - $y_i$: $i$-th measurement of the response,
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$ - $x_{i1}$: $i$ th value of first covariate,
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the - ...
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
- $x_{ip}$: $i$-th value of $p$-th covariate,
Written in matrix style:
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
$$ distributed errors with unknown variance $\sigma^2$.
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$ *Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
Defining the *design matrix*
*Our choice again:* Least squares estimator (= maximum likelihood
$$ \mathbf{X} = \left( \begin{array}{cccc} estimator for normal errors)
1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\ $$
1 & x_{11} & \ldots & x_{1p} \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
\end{array}\right) \qquad $$
(\text{size } n \times (p+1)), $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
we get the short form vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
$$ Written in matrix style:
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$ $$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
\[use Julia code (existing package) to perform linear regression for $$
`volume ~ girth + height`\]
Defining the *design matrix*
The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now: $$ \mathbf{X} = \left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1p} \\
- parameter estimates: \vdots & \vdots & \ddots & \vdots \\
1 & x_{11} & \ldots & x_{1p}
$$ \end{array}\right) \qquad
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} (\text{size } n \times (p+1)), $$
$$
we get the short form
- estimated standard errors:
$$
$$ \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} $$
$$
\[use Julia code (existing package) to perform linear regression for
- $t$-statistics: `volume ~ girth + height`\]
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now:
- $p$-values:
- parameter estimates:
$$
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} $$
$$ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
**Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your - estimated standard errors:
functions with the \`\`\`tree''' data set and try to reproduce the
output above. $$
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
## Generalized Linear Models $$
Classical linear model - $t$-statistics:
$$ $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
\mathbf{y} = \mathbf{X} \beta + \varepsilon
$$ - $p$-values:
implies that $$
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
$$
In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. **Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your
However, the assumption of a normal distribution is unrealistic for functions with the \`\`\`tree''' data set and try to reproduce the
non-continuous data. Popular alternatives include: output above.
- for counting data: $$ ```{julia}
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) r2(linmod1)
$$ r2(linmod2)
Here, the components are considered to be independent and the linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
exponential function is applied componentwise.
r2(linmod3)
- for binary data: $$ ```
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
$$ ## Generalized Linear Models
Again, the components are considered to be independent and all the Classical linear model
operations are applied componentwise.
$$
All these models are defined by the choice of a family of distributions \mathbf{y} = \mathbf{X} \beta + \varepsilon
and a function $g$ (the so-called *link function*) such that $$
$$ implies that
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). $$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
$$
In particular, the conditional expectation satisfies
For the models above, these are: $\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
+----------------------+---------------------+----------------------+ However, the assumption of a normal distribution is unrealistic for
| Type of Data | Distribution Family | Link Function | non-continuous data. Popular alternatives include:
+======================+=====================+======================+
| continuous | Normal | identity: | - for counting data: $$
| | | | \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
| | | $$ | $$
| | | g(x)=x |
| | | $$ | Here, the components are considered to be independent and the
+----------------------+---------------------+----------------------+ exponential function is applied componentwise.
| count | Poisson | log: |
| | | | - for binary data: $$
| | | $$ | \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
| | | g(x) = \log(x) | $$
| | | $$ |
+----------------------+---------------------+----------------------+ Again, the components are considered to be independent and all the
| binary | Bernoulli | logit: | operations are applied componentwise.
| | | |
| | | $$ | All these models are defined by the choice of a family of distributions
| | | g(x) = \log\left | and a function $g$ (the so-called *link function*) such that
| | | ( |
| | | \frac{x}{1-x}\right) | $$
| | | $$ | \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
+----------------------+---------------------+----------------------+ $$
In general, the parameter vector $\beta$ is estimated via maximizing the For the models above, these are:
likelihood, i.e.,
+--------------+---------------------+--------------------+
$$ | Type of Data | Distribution Family | Link Function |
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), +==============+=====================+====================+
$$ | continuous | Normal | identity: |
| | | |
which is equivalent to the maximization of the log-likelihood, i.e., | | | $$ |
| | | g(x)=x |
$$ | | | $$ |
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), +--------------+---------------------+--------------------+
$$ | count | Poisson | log: |
| | | |
In the Gaussian case, the maximum likelihood estimator is identical to | | | $$ |
the least squares estimator considered above. | | | g(x) = \log(x) |
| | | $$ |
\[\[ Example in Julia: maybe `SwissLabor` \]\] +--------------+---------------------+--------------------+
| binary | Bernoulli | logit: |
**Task 3:** Reproduce the results of our data analysis of the `tree` | | | |
data set using a generalized linear model with normal distribution | | | $$ |
family. | | | g(x) = \log\left |
| | | ( |
| | | \ |
| | | f |
| | | rac{x}{1-x}\right) |
| | | $$ |
+--------------+---------------------+--------------------+
In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$
which is equivalent to the maximization of the log-likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
```{julia}
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink())
```
**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution
family.