This commit is contained in:
behinger (s-ccs 001) 2023-10-09 09:51:53 +00:00
commit ec20c173aa
9 changed files with 779 additions and 262 deletions

View File

@ -57,9 +57,9 @@ website:
text: "📝 1 - Advanced Git and Contributing"
- href: "material/2_tue/git/tasks.qmd"
text: "🛠 1 - Git: Exercises"
- href: "material/2_tue/testing/slides.qmd"
- href: "material/2_tue/testing/slides.md"
text: "📝 2 - Testing"
- href: "material/2_tue/CI/missing.qmd"
- href: "material/2_tue/ci/slides.md"
text: "📝 3 - Continuous Integration"
- href: material/2_tue/codereview/slides.qmd
text: "📝 4 - Code Review"

View File

@ -0,0 +1 @@
There are many good ones out there. One we can recommend is the [one from GitHub](https://education.github.com/git-cheat-sheet-education.pdf).

View File

@ -0,0 +1 @@
Also [one from GitHub](https://github.github.io/actions-cheat-sheet/actions-cheat-sheet.pdf)

395
material/2_tue/ci/slides.md Normal file
View File

@ -0,0 +1,395 @@
---
type: slide
slideOptions:
transition: slide
width: 1400
height: 900
margin: 0.1
---
<style>
.reveal strong {
font-weight: bold;
color: orange;
}
.reveal p {
text-align: left;
}
.reveal section h1 {
color: orange;
}
.reveal section h2 {
color: orange;
}
.reveal code {
font-family: 'Ubuntu Mono';
color: orange;
}
.reveal section img {
background:none;
border:none;
box-shadow:none;
}
</style>
# Learning Goals
- Name and explain common workflows to automate in RSE.
- Explain the differences between the various continuous methodologies.
- Explain why automation is crucial in RSE.
- Write and understand basic automation scripts for GitHub Actions.
- s.t. we understand what `PkgTemplates` generates for us.
Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material).
---
# 1. Workflow Automation
---
## Why Automation?
- Automatize tasks
- Run tests frequently, give feedback early etc.
- Ensure reproducible test environments
- Cannot forget automatized tasks
- Less burden to developer (and their workstation)
- Avoid manual errors
- Process often integrated in development workflow
- Example: Support by Git hooks or Git forges
---
## Typical Automation Tasks in RSE
- Check code formatting and quality
- Compile and test code for different platforms
- Generate coverage reports and visualization
- Build documentation and deploy it
- Build, package, and upload releases
---
## Continuous Methodologies (1/2)
- **Continuous Integration** (CI)
- Continuously integrate changes into "main" branch
- Avoids "merge hell"
- Relies on testing and checking code continuously
- Should be automatized
---
## Continuous Methodologies (2/2)
- **Continuous Delivery** (CD)
- Software is in a state that allows new release at any time
- Software package is built
- Actual release triggered manually
- **Continuous Deployment** (CD)
- Software is in a state that allows new release at any time
- Software package is built
- Actual release triggered automatically (continuously)
---
## Automation Services/Software
- [GitHub Actions](https://github.com/features/actions)
- [GitLab CI/CD](https://docs.gitlab.com/ee/ci/)
- [Circle CI](https://circleci.com/)
- [Travis CI](https://www.travis-ci.com/)
- [Jenkins](https://www.jenkins.io/)
- ...
---
# 2. GitHub Actions
---
## What is "GitHub Actions"?
> Automate, customize, and execute your software development workflows right in your repository with GitHub Actions.
From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions)
---
## General Information
- Usage of GitHub's runners is [limited](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits)
- Available for public repositories or accounts with subscription
- By default Actions run on GitHub's runners
- Linux, Windows, or MacOS
- Quickly evolving and significant improvements in recent years
---
## Components (1/2)
- [Workflow](https://docs.github.com/en/actions/using-workflows): Runs one or more jobs
- [Event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows): Triggers a workflow
- [Jobs](https://docs.github.com/en/actions/using-jobs): Set of steps (running on same runner)
- Steps executed consecutively and share data
- Jobs by default executed in parallel
- [Action](https://docs.github.com/en/actions/creating-actions): Application performing common, complex task (step) often used in workflows
- [Runner](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners): Server that runs jobs
- [Artifacts](https://docs.github.com/en/actions/learn-github-actions/essential-features-of-github-actions#sharing-data-between-jobs): Files to be shared between jobs or to be kept after workflow finishes
---
## Components (2/2)
<img src="https://docs.github.com/assets/cb-25535/mw-1440/images/help/actions/overview-actions-simple.webp" width=95%; style="margin-left:auto; margin-right:auto; padding-top: 25px; padding-bottom: 25px; background: #eeeeee">
From [GitHub Actions tutorial](https://docs.github.com/en/actions)
---
## Setting up a Workflow
- Workflow file files stored `${REPO_ROOT}/.github/workflows`
- Configured via YAML file
```yaml
name: learn-github-actions
on: [push]
jobs:
check-bats-version:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-node@v2
with:
node-version: '14'
- run: npm install -g bats
- run: bats -v
```
---
## Actions
```yaml
- uses: actions/checkout@v3
- uses: actions/setup-node@v2
with:
node-version: '14'
```
- Integrated via `uses` directive
- Additional configuration via `with` (options depend on Action)
- Find actions in [marketplace](https://github.com/marketplace?type=actions)
- Write [own actions](https://docs.github.com/en/actions/creating-actions)
---
## Some Useful Julia Actions
- Find on [gitHub.com/julia-actions](https://github.com/julia-actions/)
```
- uses: julia-actions/setup-julia@v1
with:
version: '1.9'
```
- More:
- `cache`: caches `~/.julia/artifacts/*` and `~/.julia/packages/*` to reduce runtime of CI
- `julia-buildpkg`: build package
- `julia-runtest`: run tests
- `julia-format`: format code
---
## User-specified Commands
```yaml
- name: "Single line command"
run: echo "Single line command"
- name: "Multi line command"
run: |
echo "First line"
echo "Second line. Directory ${PWD}"
workdir: tmp/
shell: bash
```
---
## Events
- Single or multiple events
```yaml
on: [push, fork]
```
- Activities
```yaml
on:
issue:
types:
- opened
- labeled
```
- Filters
```yaml
on:
push:
branches:
- main
- 'releases/**'
```
---
## Artifacts
- Data sharing between jobs and data upload
- Uploading artifact
```yaml
- name: "Upload artifact"
uses: actions/upload-artifact@v2
with:
name: my-artifact
path: my_file.txt
retention-days: 5
```
- Downloading artifact
```yaml
- name: "Download a single artifact"
uses: actions/download-artifact@v2
with:
name: my-artifact
```
**Note**: Drop name to download all artifacts
---
## Test Actions Locally
- [act](https://github.com/nektos/act)
- Relies extensively on Docker
- User should be in `docker` group
- Run `act` from root of the repository
```text
act (runs all workflows)
act --job WORKFLOWNAME
```
- Environment is not 100% identical to GitHub's
- Workflows may fail locally, but work on GitHub
---
## Further Reading
- [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration)
- [GitHub Actions documentation](https://docs.github.com/en/actions)
- [GitHub Actions quickstart](https://docs.github.com/en/actions/quickstart)
---
# 3. Demo: Automation with GitHub Actions
---
## Setting up a Test Job
- Import [Julia test package repository](https://github.com/uekerman/JuliaTestPackage) (the same code we used for testing)
- Set up workflow file
```bash
mkdir -p .github/workflows
cd .github/workflows
vi format-check.yml
```
- Let's check whether our code is formatted correctly. Edit `format-check.yml` to have following content
```yaml
name: format-check
on: [push, pull_request]
jobs:
format:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: julia-actions/setup-julia@v1
with:
version: '1.9'
- name: Install JuliaFormatter and format
run: |
julia -e 'using Pkg; Pkg.add(PackageSpec(name="JuliaFormatter"))'
julia -e 'using JuliaFormatter; format(".", verbose=true)'
- name: Format check
run: |
julia -e '
out = Cmd(`git diff --name-only`) |> read |> String
if out == ""
exit(0)
else
@error "Some files have not been formatted"
write(stdout, out)
exit(1)
end'
```
- `runs-on` does **not** refer to a Docker container, but to a runner tag.
- Add, commit, push
- After the push, inspect "Action" panel on GitHub repository
- GitHub will schedule a run (yellow dot)
- Hooray. We have set up our first action.
- Failing test example:
- Edit settings on GitHub that one can only merge if all tests pass:
- Settings -> Branches -> Branch protection rule
- Choose `main` branch
- Enable "Require status checks to pass before merging". Optionally enable "Require branches to be up to date before merging"
- Choose status checks that need to pass: `test`
- Click on "Create" at bottom of page.
- Create a new branch `break-code`.
- Edit some file, violate the formatting, commit it and push it to the branch. Afterwards open a new PR and inspect the failing test. We are also not able to merge the changes as the "Merge" button should be inactive.
---
## act Demo
- `act` is for quick checks while developing workflows, not for developing the code
- Check available jobs (at root of repository)
```bash
act -l
```
- Run jobs for `push` event (default event)
```bash
act
```
- Run a specific job
```bash
act -j test
```
---
# 4. Exercise
Set up GitHub Actions for your statistics package. They should format your code and run the tests. To structure and parallelize things, you could use two separate jobs.

View File

@ -32,7 +32,7 @@ slideOptions:
}
</style>
## Learning Goals of the Git Lecture
# Learning Goals
- Refresh and organize students' existing knowledge on Git (learn how to learn more).
- Students can explain difference between merge and rebase and when to use what.

View File

@ -3,6 +3,7 @@
1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again).
2. Push your package `MyStatsPackage` to a remote repository.
3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded).
4. Use the package from somebody else in the classroom and verify with `printOwner` that you use the correct package.
5. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
6. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.
4. Start a new Julia environment and use your package through its url: `]add https://github.com/[username]/MyStatsPackage`.
5. Now use the package from somebody else in the classroom instead and verify with `printOwner` that you use the correct package.
6. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged.
7. Add more functions to other packages of classmates that print funny things, but always ensure a linear history.

View File

@ -1,9 +1,37 @@
---
format: revealjs
type: slide
slideOptions:
transition: slide
width: 1400
height: 900
margin: 0.1
---
<style>
.reveal strong {
font-weight: bold;
color: orange;
}
.reveal p {
text-align: left;
}
.reveal section h1 {
color: orange;
}
.reveal section h2 {
color: orange;
}
.reveal code {
font-family: 'Ubuntu Mono';
color: orange;
}
.reveal section img {
background:none;
border:none;
box-shadow:none;
}
</style>
# Learning Goals
- Justify the effort of developing tests to some extent

View File

@ -0,0 +1,52 @@
############################################################################
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
############################################################################
using Statistics
using Plots
using RDatasets
using GLM
##
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
##
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
##
linmod1 = lm(@formula(Volume ~ Girth), trees)
##
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
##
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
##
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
##
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())

View File

@ -1,253 +1,292 @@
---
editor:
markdown:
wrap: 72
---
# Multiple Regression Basics
## Motivation
### Introductory Example: tree dataset from R
\[figure of raw data\]
*Aim:* Find relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`? Can we predict the volume
of a tree given its girth?
\[figure including a straight line\]
First Guess: There is a linear relation!
## Simple Linear Regression
Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
covariate
$\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the straight line that fits best, i.e., find the *optimal*
estimators for $\beta_0$ and $\beta_1$.
*Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
vector of covariates and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
$$
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$
Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem.
\[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]
*Interpretation of the Julia output:*
- column `estimate` : least square estimates for $\hat \beta_0$ and
$\hat \beta_1$
- column `Std. Error` : estimated standard deviation
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value` : value of the $t$-statistics
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
follow a $t$-distribution.
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
$i=0,1$
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
(you can choose them)), simulate responses from a linear model and
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
choices of the parameters to see the effects on the parameter estimates
and the $p$-values.
## Multiple Regression Model
*Idea*: Generalize the simple linear regression model to multiple
covariates, w.g., predict `volume` using `girth` and \`height\`\`.
$\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
- $y_i$: $i$-th measurement of the response,
- $x_{i1}$: $i$ th value of first covariate,
- ...
- $x_{ip}$: $i$-th value of $p$-th covariate,
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
distributed errors with unknown variance $\sigma^2$.
*Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
*Our choice again:* Least squares estimator (= maximum likelihood
estimator for normal errors)
$$
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$
Defining the *design matrix*
$$ \mathbf{X} = \left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{11} & \ldots & x_{1p}
\end{array}\right) \qquad
(\text{size } n \times (p+1)), $$
we get the short form
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
\[use Julia code (existing package) to perform linear regression for
`volume ~ girth + height`\]
The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now:
- parameter estimates:
$$
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
- estimated standard errors:
$$
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
$$
- $t$-statistics:
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $p$-values:
$$
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
$$
**Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your
functions with the \`\`\`tree''' data set and try to reproduce the
output above.
## Generalized Linear Models
Classical linear model
$$
\mathbf{y} = \mathbf{X} \beta + \varepsilon
$$
implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include:
- for counting data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
$$
Here, the components are considered to be independent and the
exponential function is applied componentwise.
- for binary data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
$$
Again, the components are considered to be independent and all the
operations are applied componentwise.
All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that
$$
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
$$
For the models above, these are:
+----------------------+---------------------+----------------------+
| Type of Data | Distribution Family | Link Function |
+======================+=====================+======================+
| continuous | Normal | identity: |
| | | |
| | | $$ |
| | | g(x)=x |
| | | $$ |
+----------------------+---------------------+----------------------+
| count | Poisson | log: |
| | | |
| | | $$ |
| | | g(x) = \log(x) |
| | | $$ |
+----------------------+---------------------+----------------------+
| binary | Bernoulli | logit: |
| | | |
| | | $$ |
| | | g(x) = \log\left |
| | | ( |
| | | \frac{x}{1-x}\right) |
| | | $$ |
+----------------------+---------------------+----------------------+
In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$
which is equivalent to the maximization of the log-likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
\[\[ Example in Julia: maybe `SwissLabor` \]\]
**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution
family.
---
editor:
markdown:
wrap: 72
---
# Multiple Regression Basics
## Motivation
### Introductory Example: tree dataset from R
```{julia}
using Statistics
using Plots
using RDatasets
trees = dataset("datasets", "trees")
scatter(trees.Volume, trees.Girth,
legend=false, xlabel="Girth", ylabel="Volume")
```
*Aim:* Find relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`? Can we predict the volume
of a tree given its girth?
```{julia}
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
```
First Guess: There is a linear relation!
## Simple Linear Regression
Main assumption: up to some error term, each measurement of the response
variable $y_i$ depends linearly on the corresponding value $x_i$ of the
covariate
$\leadsto$ **(Simple) Linear Model:**
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$
where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent
normally distributed errors with unknown variance $\sigma^2$.
*Task:* Find the straight line that fits best, i.e., find the *optimal*
estimators for $\beta_0$ and $\beta_1$.
*Typical choice*: Least squares estimator (= maximum likelihood
estimator for normal errors)
$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the
vector of covariates and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
$$
(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2
$$
Note: There is a closed-form expression for
$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but
rather use Julia to solve the problem.
\[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]
```{julia}
lm(@formula(Volume ~ Girth), trees)
```
*Interpretation of the Julia output:*
- column `estimate` : least square estimates for $\hat \beta_0$ and
$\hat \beta_1$
- column `Std. Error` : estimated standard deviation
$\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$
- column `t value` : value of the $t$-statistics
$$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$
Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would
follow a $t$-distribution.
- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for
$i=0,1$
**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given
these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$
(you can choose them)), simulate responses from a linear model and
estimate the coefficients $\beta_0$ and $\beta_1$. Play with different
choices of the parameters to see the effects on the parameter estimates
and the $p$-values.
## Multiple Regression Model
*Idea*: Generalize the simple linear regression model to multiple
covariates, w.g., predict `volume` using `girth` and \`height\`\`.
$\leadsto$ **Linear Model:**
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where
- $y_i$: $i$-th measurement of the response,
- $x_{i1}$: $i$ th value of first covariate,
- ...
- $x_{ip}$: $i$-th value of $p$-th covariate,
- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally
distributed errors with unknown variance $\sigma^2$.
*Task:* Find the *optimal* estimators for
$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$.
*Our choice again:* Least squares estimator (= maximum likelihood
estimator for normal errors)
$$
\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2
$$
where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the
vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones.
Written in matrix style:
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2
$$
Defining the *design matrix*
$$ \mathbf{X} = \left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{11} & \ldots & x_{1p}
\end{array}\right) \qquad
(\text{size } n \times (p+1)), $$
we get the short form
$$
\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
\[use Julia code (existing package) to perform linear regression for
`volume ~ girth + height`\]
The interpretation of the Julia output is similar to the simple linear
regression model, but we provide explicit formulas now:
- parameter estimates:
$$
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
- estimated standard errors:
$$
\hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2}
$$
- $t$-statistics:
$$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$
- $p$-values:
$$
p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p}
$$
**Task 2**: Implement functions that estimate the $\beta$-parameters,
the corresponding standard errors and the $t$-statistics. Test your
functions with the \`\`\`tree''' data set and try to reproduce the
output above.
```{julia}
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
```
## Generalized Linear Models
Classical linear model
$$
\mathbf{y} = \mathbf{X} \beta + \varepsilon
$$
implies that
$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$
In particular, the conditional expectation satisfies
$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$.
However, the assumption of a normal distribution is unrealistic for
non-continuous data. Popular alternatives include:
- for counting data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)
$$
Here, the components are considered to be independent and the
exponential function is applied componentwise.
- for binary data: $$
\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}
$$
Again, the components are considered to be independent and all the
operations are applied componentwise.
All these models are defined by the choice of a family of distributions
and a function $g$ (the so-called *link function*) such that
$$
\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).
$$
For the models above, these are:
+--------------+---------------------+--------------------+
| Type of Data | Distribution Family | Link Function |
+==============+=====================+====================+
| continuous | Normal | identity: |
| | | |
| | | $$ |
| | | g(x)=x |
| | | $$ |
+--------------+---------------------+--------------------+
| count | Poisson | log: |
| | | |
| | | $$ |
| | | g(x) = \log(x) |
| | | $$ |
+--------------+---------------------+--------------------+
| binary | Bernoulli | logit: |
| | | |
| | | $$ |
| | | g(x) = \log\left |
| | | ( |
| | | \ |
| | | f |
| | | rac{x}{1-x}\right) |
| | | $$ |
+--------------+---------------------+--------------------+
In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}),
$$
which is equivalent to the maximization of the log-likelihood, i.e.,
$$
\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}),
$$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
```{julia}
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink())
```
**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution
family.