From 8adf75b14ce36e9df9f8157b7befaeacc234fe22 Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Sun, 8 Oct 2023 18:33:30 +0000 Subject: [PATCH 01/20] small updates --- material/1_mon/envs/tasks.qmd | 20 +++++++++++++++++--- material/1_mon/rse/rse_basics_slides.qmd | 3 +++ 2 files changed, 20 insertions(+), 3 deletions(-) diff --git a/material/1_mon/envs/tasks.qmd b/material/1_mon/envs/tasks.qmd index be6e862..d133e3b 100644 --- a/material/1_mon/envs/tasks.qmd +++ b/material/1_mon/envs/tasks.qmd @@ -3,6 +3,20 @@ 2. Add your `statistic.jl` & "include" it. 3. Export all functions 4. Create a new environment in a separate folder and add the package. -5. Does `using MyStatsPackage` work now? :tada: congratulations! -6. Go back to your package environment. Now add a dependency (e.g. ProgressMeter) and a `compat`-entry -7. Go back to your project environment, has the dependency been updated? Think: should you use `resolve` or `instantiate`? \ No newline at end of file +5. Does `using MyStatsPackage` work now? + +:::{.callout collapse=true} +## Yes! +:tada: congratulations! +::: + +:::{.callout collapse=true} +## No! +Oh no, better check you activated the right environment - ask for help! +::: +6. Go back to your package environment. Now add a dependency (e.g. ProgressMeter.jl) and a `compat`-entry +7. Go back to your project environment, has the dependency been updated? + +:::{.callout-hint collapse=true} +Should you use `resolve` or `instantiate`? +::: \ No newline at end of file diff --git a/material/1_mon/rse/rse_basics_slides.qmd b/material/1_mon/rse/rse_basics_slides.qmd index af57c9c..e9d119e 100644 --- a/material/1_mon/rse/rse_basics_slides.qmd +++ b/material/1_mon/rse/rse_basics_slides.qmd @@ -2,6 +2,9 @@ format: revealjs: output-file: rse_basics_slides_revealjs.html + scrollable: true + progress: true + history: false html: default --- From 1dd1090685d3aeff7f342c4689ef160d58347cf9 Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Sun, 8 Oct 2023 18:39:03 +0000 Subject: [PATCH 02/20] fix hint --- material/1_mon/envs/tasks.qmd | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/material/1_mon/envs/tasks.qmd b/material/1_mon/envs/tasks.qmd index d133e3b..0af7d10 100644 --- a/material/1_mon/envs/tasks.qmd +++ b/material/1_mon/envs/tasks.qmd @@ -15,8 +15,10 @@ Oh no, better check you activated the right environment - ask for help! ::: 6. Go back to your package environment. Now add a dependency (e.g. ProgressMeter.jl) and a `compat`-entry -7. Go back to your project environment, has the dependency been updated? +7. Go back to your project environment, has the + dependency been updated? -:::{.callout-hint collapse=true} +:::{.callout collapse=true} +## Hint? Should you use `resolve` or `instantiate`? ::: \ No newline at end of file From f2d84806ea664557868af8041cfeb49130a714a7 Mon Sep 17 00:00:00 2001 From: Marco Oesting Date: Sun, 8 Oct 2023 20:56:51 +0200 Subject: [PATCH 03/20] Updated Text & Code on Regression. --- material/3_wed/regression/Code_Snippets.jl | 52 ++ .../regression/MultipleRegressionBasics.qmd | 545 ++++++++++-------- 2 files changed, 344 insertions(+), 253 deletions(-) create mode 100644 material/3_wed/regression/Code_Snippets.jl diff --git a/material/3_wed/regression/Code_Snippets.jl b/material/3_wed/regression/Code_Snippets.jl new file mode 100644 index 0000000..05bb3e8 --- /dev/null +++ b/material/3_wed/regression/Code_Snippets.jl @@ -0,0 +1,52 @@ +############################################################################ +#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' #### +############################################################################ + +using Statistics +using Plots +using RDatasets +using GLM + +## + +trees = dataset("datasets", "trees") + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") + +## + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) + +## + +linmod1 = lm(@formula(Volume ~ Girth), trees) + +## + +linmod2 = lm(@formula(Volume ~ Girth + Height), trees) + +## + +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) + +## + +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +## + +model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink()) \ No newline at end of file diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index 8af7871..1f23123 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -1,253 +1,292 @@ ---- -editor: - markdown: - wrap: 72 ---- - -# Multiple Regression Basics - -## Motivation - -### Introductory Example: tree dataset from R - -\[figure of raw data\] - -*Aim:* Find relationship between the *response variable* `volume` and -the *explanatory variable/covariate* `girth`? Can we predict the volume -of a tree given its girth? - -\[figure including a straight line\] - -First Guess: There is a linear relation! - -## Simple Linear Regression - -Main assumption: up to some error term, each measurement of the response -variable $y_i$ depends linearly on the corresponding value $x_i$ of the -covariate - -$\leadsto$ **(Simple) Linear Model:** -$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ -where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent -normally distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the straight line that fits best, i.e., find the *optimal* -estimators for $\beta_0$ and $\beta_1$. - -*Typical choice*: Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the -vector of covariates and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ - (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 -$$ - -Note: There is a closed-form expression for -$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but -rather use Julia to solve the problem. - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth`\] - -*Interpretation of the Julia output:* - -- column `estimate` : least square estimates for $\hat \beta_0$ and - $\hat \beta_1$ - -- column `Std. Error` : estimated standard deviation - $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ - -- column `t value` : value of the $t$-statistics - - $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ - - Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would - follow a $t$-distribution. - -- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for - $i=0,1$ - -**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given -these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ -(you can choose them)), simulate responses from a linear model and -estimate the coefficients $\beta_0$ and $\beta_1$. Play with different -choices of the parameters to see the effects on the parameter estimates -and the $p$-values. - -## Multiple Regression Model - -*Idea*: Generalize the simple linear regression model to multiple -covariates, w.g., predict `volume` using `girth` and \`height\`\`. - -$\leadsto$ **Linear Model:** -$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where - -- $y_i$: $i$-th measurement of the response, - -- $x_{i1}$: $i$ th value of first covariate, - -- ... - -- $x_{ip}$: $i$-th value of $p$-th covariate, - -- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally - distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the *optimal* estimators for -$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. - -*Our choice again:* Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ - \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 -$$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the -vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 -$$ - -Defining the *design matrix* - -$$ \mathbf{X} = \left( \begin{array}{cccc} - 1 & x_{11} & \ldots & x_{1p} \\ - \vdots & \vdots & \ddots & \vdots \\ - 1 & x_{11} & \ldots & x_{1p} - \end{array}\right) \qquad - (\text{size } n \times (p+1)), $$ - -we get the short form - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} -$$ - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth + height`\] - -The interpretation of the Julia output is similar to the simple linear -regression model, but we provide explicit formulas now: - -- parameter estimates: - - $$ - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - $$ - -- estimated standard errors: - - $$ - \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} - $$ - -- $t$-statistics: - - $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ - -- $p$-values: - - $$ - p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} - $$ - -**Task 2**: Implement functions that estimate the $\beta$-parameters, -the corresponding standard errors and the $t$-statistics. Test your -functions with the \`\`\`tree''' data set and try to reproduce the -output above. - -## Generalized Linear Models - -Classical linear model - -$$ - \mathbf{y} = \mathbf{X} \beta + \varepsilon -$$ - -implies that -$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ - -In particular, the conditional expectation satisfies -$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. - -However, the assumption of a normal distribution is unrealistic for -non-continuous data. Popular alternatives include: - -- for counting data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) - $$ - - Here, the components are considered to be independent and the - exponential function is applied componentwise. - -- for binary data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} - $$ - - Again, the components are considered to be independent and all the - operations are applied componentwise. - -All these models are defined by the choice of a family of distributions -and a function $g$ (the so-called *link function*) such that - -$$ - \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). -$$ - -For the models above, these are: - -+----------------------+---------------------+----------------------+ -| Type of Data | Distribution Family | Link Function | -+======================+=====================+======================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+----------------------+---------------------+----------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+----------------------+---------------------+----------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \frac{x}{1-x}\right) | -| | | $$ | -+----------------------+---------------------+----------------------+ - -In general, the parameter vector $\beta$ is estimated via maximizing the -likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -which is equivalent to the maximization of the log-likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -In the Gaussian case, the maximum likelihood estimator is identical to -the least squares estimator considered above. - -\[\[ Example in Julia: maybe `SwissLabor` \]\] - -**Task 3:** Reproduce the results of our data analysis of the `tree` -data set using a generalized linear model with normal distribution -family. +--- +editor: + markdown: + wrap: 72 +--- + +# Multiple Regression Basics + +## Motivation + +### Introductory Example: tree dataset from R + +```{julia} +using Statistics +using Plots +using RDatasets + +trees = dataset("datasets", "trees") + +scatter(trees.Volume, trees.Girth, + legend=false, xlabel="Girth", ylabel="Volume") +``` + +*Aim:* Find relationship between the *response variable* `volume` and +the *explanatory variable/covariate* `girth`? Can we predict the volume +of a tree given its girth? + +```{julia} +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) +``` + +First Guess: There is a linear relation! + +## Simple Linear Regression + +Main assumption: up to some error term, each measurement of the response +variable $y_i$ depends linearly on the corresponding value $x_i$ of the +covariate + +$\leadsto$ **(Simple) Linear Model:** +$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ +where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent +normally distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the straight line that fits best, i.e., find the *optimal* +estimators for $\beta_0$ and $\beta_1$. + +*Typical choice*: Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the +vector of covariates and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ + (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 +$$ + +Note: There is a closed-form expression for +$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but +rather use Julia to solve the problem. + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth`\] + +```{julia} +lm(@formula(Volume ~ Girth), trees) +``` + +*Interpretation of the Julia output:* + +- column `estimate` : least square estimates for $\hat \beta_0$ and + $\hat \beta_1$ + +- column `Std. Error` : estimated standard deviation + $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ + +- column `t value` : value of the $t$-statistics + + $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ + + Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would + follow a $t$-distribution. + +- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for + $i=0,1$ + +**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given +these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ +(you can choose them)), simulate responses from a linear model and +estimate the coefficients $\beta_0$ and $\beta_1$. Play with different +choices of the parameters to see the effects on the parameter estimates +and the $p$-values. + +## Multiple Regression Model + +*Idea*: Generalize the simple linear regression model to multiple +covariates, w.g., predict `volume` using `girth` and \`height\`\`. + +$\leadsto$ **Linear Model:** +$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where + +- $y_i$: $i$-th measurement of the response, + +- $x_{i1}$: $i$ th value of first covariate, + +- ... + +- $x_{ip}$: $i$-th value of $p$-th covariate, + +- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally + distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the *optimal* estimators for +$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. + +*Our choice again:* Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ + \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 +$$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the +vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 +$$ + +Defining the *design matrix* + +$$ \mathbf{X} = \left( \begin{array}{cccc} + 1 & x_{11} & \ldots & x_{1p} \\ + \vdots & \vdots & \ddots & \vdots \\ + 1 & x_{11} & \ldots & x_{1p} + \end{array}\right) \qquad + (\text{size } n \times (p+1)), $$ + +we get the short form + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth + height`\] + +The interpretation of the Julia output is similar to the simple linear +regression model, but we provide explicit formulas now: + +- parameter estimates: + + $$ + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} + $$ + +- estimated standard errors: + + $$ + \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} + $$ + +- $t$-statistics: + + $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ + +- $p$-values: + + $$ + p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} + $$ + +**Task 2**: Implement functions that estimate the $\beta$-parameters, +the corresponding standard errors and the $t$-statistics. Test your +functions with the \`\`\`tree''' data set and try to reproduce the +output above. + +```{julia} +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) +``` + +## Generalized Linear Models + +Classical linear model + +$$ + \mathbf{y} = \mathbf{X} \beta + \varepsilon +$$ + +implies that +$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ + +In particular, the conditional expectation satisfies +$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. + +However, the assumption of a normal distribution is unrealistic for +non-continuous data. Popular alternatives include: + +- for counting data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) + $$ + + Here, the components are considered to be independent and the + exponential function is applied componentwise. + +- for binary data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} + $$ + + Again, the components are considered to be independent and all the + operations are applied componentwise. + +All these models are defined by the choice of a family of distributions +and a function $g$ (the so-called *link function*) such that + +$$ + \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). +$$ + +For the models above, these are: + ++--------------+---------------------+--------------------+ +| Type of Data | Distribution Family | Link Function | ++==============+=====================+====================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++--------------+---------------------+--------------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++--------------+---------------------+--------------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g(x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | rac{x}{1-x}\right) | +| | | $$ | ++--------------+---------------------+--------------------+ + +In general, the parameter vector $\beta$ is estimated via maximizing the +likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +which is equivalent to the maximization of the log-likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +In the Gaussian case, the maximum likelihood estimator is identical to +the least squares estimator considered above. + +```{julia} +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +model = glm(@formula(participation ~ age^2), + SwissLabor, Binomial(), ProbitLink()) +``` + +**Task 3:** Reproduce the results of our data analysis of the `tree` +data set using a generalized linear model with normal distribution +family. From 08de4cde277098a6e23a9f281ff80cc23b8a4e48 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 08:55:43 +0200 Subject: [PATCH 04/20] Add CI material --- _quarto.yml | 2 +- material/2_tue/ci/slides.md | 406 ++++++++++++++++++++++++++++++++++++ 2 files changed, 407 insertions(+), 1 deletion(-) create mode 100644 material/2_tue/ci/slides.md diff --git a/_quarto.yml b/_quarto.yml index 117e3b4..f3301a1 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -59,7 +59,7 @@ website: text: "🛠 1 - Git: Exercises" - href: "material/2_tue/testing/slides.qmd" text: "📝 2 - Testing" - - href: "material/2_tue/CI/missing.qmd" + - href: "material/2_tue/ci/slides.md" text: "📝 3 - Continuous Integration" - href: material/2_tue/codereview/slides.qmd text: "📝 4 - Code Review" diff --git a/material/2_tue/ci/slides.md b/material/2_tue/ci/slides.md new file mode 100644 index 0000000..19594ec --- /dev/null +++ b/material/2_tue/ci/slides.md @@ -0,0 +1,406 @@ +--- +type: slide +slideOptions: + transition: slide + width: 1400 + height: 900 + margin: 0.1 +--- + + + +# Learning Goals + +- Name and explain common workflows to automate in RSE. +- Explain the differences between the various continuous methodologies. +- Explain why automation is crucial in RSE. +- Write and understand basic automation scripts for GitHub Actions. + - s.t. we are able to understand what `PkgTemplates` generates for us. + + +Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material). + +--- + +# 1. Workflow Automation + +--- + +## Why Automation? + +- Automatize tasks + - Run tests frequently, give feedback early etc. + - Ensure reproducible test environments + - Cannot forget automatized tasks + - Less burden to developer (and their workstation) + - Avoid manual errors +- Process often integrated in development workflow + - Example: Support by Git hooks or Git forges + +--- + +## Typical Automation Tasks in RSE + +- Check code formatting and quality +- Compile and test code for different platforms +- Generate coverage reports and visualization +- Build documentation and deploy it +- Build, package, and upload releases + +--- + +## Continuous Methodologies (1/2) + +- **Continuous Integration** (CI) + - Continuously integrate changes into "main" branch + - Avoids "merge hell" + - Relies on testing and checking code continuously + - Should be automatized + +--- + +## Continuous Methodologies (2/2) + +- **Continuous Delivery** (CD) + - Software is in a state that allows new release at any time + - Software package is built + - Actual release triggered manually +- **Continuous Deployment** (CD) + - Software is in a state that allows new release at any time + - Software package is built + - Actual release triggered automatically (continuously) + +--- + +## Automation Services/Software + +- [GitHub Actions](https://github.com/features/actions) +- [GitLab CI/CD](https://docs.gitlab.com/ee/ci/) +- [Circle CI](https://circleci.com/) +- [Travis CI](https://www.travis-ci.com/) +- [Jenkins](https://www.jenkins.io/) +- ... + +--- + +# 2. GitHub Actions + +--- + +## What is "GitHub Actions"? + +> Automate, customize, and execute your software development workflows right in your repository with GitHub Actions. + +From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions) + +--- + +## General Information + +- Usage of GitHub's runners is [limited](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits) +- Available for public repositories or accounts with subscription +- By default Actions run on GitHub's runners + - Linux, Windows, or MacOS +- Quickly evolving and significant improvements in recent years + +--- + +## Components (1/2) + +- [Workflow](https://docs.github.com/en/actions/using-workflows): Runs one or more jobs +- [Event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows): Triggers a workflow +- [Jobs](https://docs.github.com/en/actions/using-jobs): Set of steps (running on same runner) + - Steps executed consecutively and share data + - Jobs by default executed in parallel +- [Action](https://docs.github.com/en/actions/creating-actions): Application performing common, complex task (step) often used in workflows +- [Runner](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners): Server that runs jobs +- [Artifacts](https://docs.github.com/en/actions/learn-github-actions/essential-features-of-github-actions#sharing-data-between-jobs): Files to be shared between jobs or to be kept after workflow finishes + +--- + +## Components (2/2) + + + + +From [GitHub Actions tutorial](https://docs.github.com/en/actions) + +--- + +## Setting up a Workflow + +- Workflow file files stored `${REPO_ROOT}/.github/workflows` +- Configured via YAML file + +```yaml +name: learn-github-actions +on: [push] +jobs: + check-bats-version: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - uses: actions/setup-node@v2 + with: + node-version: '14' + - run: npm install -g bats + - run: bats -v +``` + +--- + +## Actions + +```yaml +- uses: actions/checkout@v3 +- uses: actions/setup-node@v2 + with: + node-version: '14' +``` + +- Integrated via `uses` directive +- Additional configuration via `with` (options depend on Action) +- Find actions in [marketplace](https://github.com/marketplace?type=actions) +- Write [own actions](https://docs.github.com/en/actions/creating-actions) + +--- + +## Some Useful Julia Actions + +- Find on [gitHub.com/julia-actions](https://github.com/julia-actions/) + + ``` + - uses: julia-actions/setup-julia@v1 + with: + version: '1.9' + ``` + +- More: + - `cache`: caches `~/.julia/artifacts/*` and `~/.julia/packages/*` to reduce runtime of CI + - `julia-buildpkg`: build package + - `julia-runtest`: run tests + - `julia-format`: format code + +--- + +## User-specified Commands + +```yaml +- name: "Single line command" + run: echo "Single line command" +- name: "Multi line command" + run: | + echo "First line" + echo "Second line. Directory ${PWD}" + workdir: tmp/ + shell: bash +``` + +--- + +## Events + +- Single or multiple events + + ```yaml + on: [push, fork] + ``` + +- Activities + + ```yaml + on: + issue: + types: + - opened + - labeled + ``` + +- Filters + + ```yaml + on: + push: + branches: + - main + - 'releases/**' + ``` + +--- + +## Artifacts + +- Data sharing between jobs and data upload +- Uploading artifact + + ```yaml + - name: "Upload artifact" + uses: actions/upload-artifact@v2 + with: + name: my-artifact + path: my_file.txt + retention-days: 5 + ``` + +- Downloading artifact + + ```yaml + - name: "Download a single artifact" + uses: actions/download-artifact@v2 + with: + name: my-artifact + ``` + + **Note**: Drop name to download all artifacts + +--- + +## Test Actions Locally + +- [act](https://github.com/nektos/act) +- Relies extensively on Docker + - User should be in `docker` group +- Run `act` from root of the repository + + ```text + act (runs all workflows) + act --job WORKFLOWNAME + ``` + +- Environment is not 100% identical to GitHub's + - Workflows may fail locally, but work on GitHub + +--- + +## Advanced Topics + +- [Self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners) +- [Secrets and tokens](https://docs.github.com/en/actions/security-guides/encrypted-secrets) +- [Continuous deployment](https://docs.github.com/en/actions/deployment/about-deployments/about-continuous-deployment) +- [Custom actions](https://docs.github.com/en/actions/creating-actions/about-custom-actions) +- [Build matrices](https://docs.github.com/en/actions/using-workflows/advanced-workflow-features#using-a-build-matrix) +- Using own Docker containers + +--- + +## Further Reading + +- [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration) +- [GitHub Actions documentation](https://docs.github.com/en/actions) +- [GitHub Actions quickstart](https://docs.github.com/en/actions/quickstart) + +--- + +# 3. Demo: Automation with GitHub Actions + +--- + +## Setting up a Test Job + +- Import [Julia test package repository](https://github.com/uekerman/JuliaTestPackage) (the same code we used for testing) +- Set up workflow file + + ```bash + mkdir -p .github/workflows + cd .github/workflows + vi format-check.yml + ``` + +- Let's check whether our code is formatted correctly. Edit `format-check.yml` to have following content + + ```yaml + name: format-check + + on: [push, pull_request] + + jobs: + format: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - uses: julia-actions/setup-julia@v1 + with: + version: '1.9' + - name: Install JuliaFormatter and format + run: | + julia -e 'using Pkg; Pkg.add(PackageSpec(name="JuliaFormatter"))' + julia -e 'using JuliaFormatter; format(".", verbose=true)' + - name: Format check + run: | + julia -e ' + out = Cmd(`git diff --name-only`) |> read |> String + if out == "" + exit(0) + else + @error "Some files have not been formatted" + write(stdout, out) + exit(1) + end' + ``` + +- `runs-on` does **not** refer to a Docker container, but to a runner tag. +- Add, commit, push +- After the push, inspect "Action" panel on GitHub repository + - GitHub will schedule a run (yellow dot) + - Hooray. We have set up our first action. +- Failing test example: + - Edit settings on GitHub that one can only merge if all tests pass: + - Settings -> Branches -> Branch protection rule + - Choose `main` branch + - Enable "Require status checks to pass before merging". Optionally enable "Require branches to be up to date before merging" + - Choose status checks that need to pass: `test` + - Click on "Create" at bottom of page. + - Create a new branch `break-code`. + - Edit some file, violate the formatting, commit it and push it to the branch. Afterwards open a new PR and inspect the failing test. We are also not able to merge the changes as the "Merge" button should be inactive. + +--- + +## act Demo + +- `act` is for quick checks while developing workflows, not for developing the code +- Check available jobs (at root of repository) + + ```bash + act -l + ``` + +- Run jobs for `push` event (default event) + + ```bash + act + ``` + +- Run a specific job + + ```bash + act -j test + ``` + +--- + +# 4. Exercise + +Set up GitHub Actions for your statistics package. They should format your code and run the tests. To structure and parallelize things, you could use two separate jobs. From b40a2b0ec875ea4dc3d2a780684ba38b1d728be1 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 09:02:35 +0200 Subject: [PATCH 05/20] Tweak CI content --- material/2_tue/ci/slides.md | 15 ++------------- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/material/2_tue/ci/slides.md b/material/2_tue/ci/slides.md index 19594ec..b74324a 100644 --- a/material/2_tue/ci/slides.md +++ b/material/2_tue/ci/slides.md @@ -38,7 +38,7 @@ slideOptions: - Explain the differences between the various continuous methodologies. - Explain why automation is crucial in RSE. - Write and understand basic automation scripts for GitHub Actions. - - s.t. we are able to understand what `PkgTemplates` generates for us. + - s.t. we understand what `PkgTemplates` generates for us. Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material). @@ -143,7 +143,7 @@ From: [https://docs.github.com/en/actions](https://docs.github.com/en/actions) ## Components (2/2) - + From [GitHub Actions tutorial](https://docs.github.com/en/actions) @@ -295,17 +295,6 @@ jobs: --- -## Advanced Topics - -- [Self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners) -- [Secrets and tokens](https://docs.github.com/en/actions/security-guides/encrypted-secrets) -- [Continuous deployment](https://docs.github.com/en/actions/deployment/about-deployments/about-continuous-deployment) -- [Custom actions](https://docs.github.com/en/actions/creating-actions/about-custom-actions) -- [Build matrices](https://docs.github.com/en/actions/using-workflows/advanced-workflow-features#using-a-build-matrix) -- Using own Docker containers - ---- - ## Further Reading - [What is Continuous Integration?](https://www.atlassian.com/continuous-delivery/continuous-integration) From 40f1eaca1c2504f93f93d8ac5959df4f6d99a028 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 09:03:54 +0200 Subject: [PATCH 06/20] Revert rendering as slides --- .../2_tue/testing/{slides.qmd => slides.md} | 34 +++++++++++++++++-- 1 file changed, 31 insertions(+), 3 deletions(-) rename material/2_tue/testing/{slides.qmd => slides.md} (94%) diff --git a/material/2_tue/testing/slides.qmd b/material/2_tue/testing/slides.md similarity index 94% rename from material/2_tue/testing/slides.qmd rename to material/2_tue/testing/slides.md index 96a877d..539efeb 100644 --- a/material/2_tue/testing/slides.qmd +++ b/material/2_tue/testing/slides.md @@ -1,9 +1,37 @@ - --- -format: revealjs - +type: slide +slideOptions: + transition: slide + width: 1400 + height: 900 + margin: 0.1 --- + + # Learning Goals - Justify the effort of developing tests to some extent From d6172d273536f1b009c74e6a357ac9a7e44e6498 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 09:14:57 +0200 Subject: [PATCH 07/20] Fix broken link --- _quarto.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_quarto.yml b/_quarto.yml index f3301a1..2542c00 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -57,7 +57,7 @@ website: text: "📝 1 - Advanced Git and Contributing" - href: "material/2_tue/git/tasks.qmd" text: "🛠 1 - Git: Exercises" - - href: "material/2_tue/testing/slides.qmd" + - href: "material/2_tue/testing/slides.md" text: "📝 2 - Testing" - href: "material/2_tue/ci/slides.md" text: "📝 3 - Continuous Integration" From e0161b2b1d5f75174b641eb7aba4700db555eb23 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 10:39:18 +0200 Subject: [PATCH 08/20] Tweak rendering --- material/2_tue/git/slides.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/material/2_tue/git/slides.md b/material/2_tue/git/slides.md index f1540fc..a1b4f22 100644 --- a/material/2_tue/git/slides.md +++ b/material/2_tue/git/slides.md @@ -32,7 +32,7 @@ slideOptions: } -## Learning Goals of the Git Lecture +# Learning Goals - Refresh and organize students' existing knowledge on Git (learn how to learn more). - Students can explain difference between merge and rebase and when to use what. From 3905b9f895c9938d0b0c5bde316702cb730e13dc Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 10:52:19 +0200 Subject: [PATCH 09/20] Add cheatsheets --- cheatsheets/git.qmd | 1 + cheatsheets/githubactions.qmd | 1 + 2 files changed, 2 insertions(+) diff --git a/cheatsheets/git.qmd b/cheatsheets/git.qmd index e69de29..becee2f 100644 --- a/cheatsheets/git.qmd +++ b/cheatsheets/git.qmd @@ -0,0 +1 @@ +There are many good ones out there. One we can recommend is the [one from GitHub](https://education.github.com/git-cheat-sheet-education.pdf). diff --git a/cheatsheets/githubactions.qmd b/cheatsheets/githubactions.qmd index e69de29..ac8eaee 100644 --- a/cheatsheets/githubactions.qmd +++ b/cheatsheets/githubactions.qmd @@ -0,0 +1 @@ +Also [one from GitHub](https://github.github.io/actions-cheat-sheet/actions-cheat-sheet.pdf) From 314ac6e1af15231a971b6356faec086edab68649 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 11:11:04 +0200 Subject: [PATCH 10/20] Tweak Git exercise --- material/2_tue/git/tasks.qmd | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/material/2_tue/git/tasks.qmd b/material/2_tue/git/tasks.qmd index 94b9a45..0ecbf26 100644 --- a/material/2_tue/git/tasks.qmd +++ b/material/2_tue/git/tasks.qmd @@ -3,6 +3,7 @@ 1. Work with any forge that you like and create a user account (we strongly recommend GitHub since we will need it later again). 2. Push your package `MyStatsPackage` to a remote repository. 3. Add a function `printOwner` to the package through a pull request. The function should print your (GitHub) user name (hard-coded). -4. Use the package from somebody else in the classroom and verify with `printOwner` that you use the correct package. -5. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged. -6. Add more functions to other packages of classmates that print funny things, but always ensure a linear history. +4. Start a new Julia environment and use your package through its url: `]add https://github.com/[username]/MyStatsPackage`. +5. Now use the package from somebody else in the classroom instead and verify with `printOwner` that you use the correct package. +6. Fork this other package and contribute a function `printContributor` to it via a PR. Get a review and get it merged. +7. Add more functions to other packages of classmates that print funny things, but always ensure a linear history. From d888a883b4788413ab554b20484959b747006d44 Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Mon, 9 Oct 2023 09:51:45 +0000 Subject: [PATCH 11/20] added topic suggestions --- index.qmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/index.qmd b/index.qmd index 3f276e2..2dbeb9d 100644 --- a/index.qmd +++ b/index.qmd @@ -22,6 +22,9 @@ Seminar room in the groundfloor (directly at the entrance) [Link to map](https://www.simtech.uni-stuttgart.de/events/simtech-summer-school/SuSch_2/location/) + +#### ⚗ Advanced topics +We probably have some time to discuss advanced topics towards the end of the summers school. You are welcome to send an email to benedikt and/or put it into [the git issue](https://github.com/s-ccs/summerschool_simtech_2023/issues/15) ---- We wish you all a interesting, safe and fun summerschool. If there are any interpersonal issues (especially regarding [code-of-conduct](https://www.uni-stuttgart.de/en/university/profile/diversity/code-of-conduct/)), please directly contact [Benedikt Ehinger](benedikt.ehinger@vis.uni-stuttgart.de)^[If there are problem with him, please contact **Marco Oesting**]. For organizational issues, please contact [Sina Schorndorfer](sina.schorndorfer@imsb.uni-stuttgart.de) From 8e098136badeaedc13f5ad480972d4c36bfe4f8a Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 14:17:26 +0200 Subject: [PATCH 12/20] Tweak Git slides --- material/2_tue/git/slides.md | 1 - 1 file changed, 1 deletion(-) diff --git a/material/2_tue/git/slides.md b/material/2_tue/git/slides.md index a1b4f22..9fd9320 100644 --- a/material/2_tue/git/slides.md +++ b/material/2_tue/git/slides.md @@ -232,7 +232,6 @@ Which level do you have? - [Official documentation](http://git-scm.com/doc) - [Video: Git in 15 minutes: basics, branching, no remote](https://www.youtube.com/watch?v=USjZcfj8yxE) -- [The GitHub Blog: Commits are snapshots, not diffs](https://github.blog/2020-12-17-commits-are-snapshots-not-diffs/) - Chapters [6](https://merely-useful.tech/py-rse/git-cmdline.html) and [7](https://merely-useful.tech/py-rse/git-advanced.html) of Research Software Engineering with Python - [Podcast All Things Git: History of VC](https://www.allthingsgit.com/episodes/the_history_of_vc_with_eric_sink.html) - [git purr](https://girliemac.com/blog/2017/12/26/git-purr/) From aff64e340db97aa5cdf80e9b4fdf1d8b895be57b Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Mon, 9 Oct 2023 12:40:39 +0000 Subject: [PATCH 13/20] fix std + tstat --- material/1_mon/firststeps/tasks.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/material/1_mon/firststeps/tasks.qmd b/material/1_mon/firststeps/tasks.qmd index 05c4cb7..980ec3b 100644 --- a/material/1_mon/firststeps/tasks.qmd +++ b/material/1_mon/firststeps/tasks.qmd @@ -14,9 +14,9 @@ You can mark some code and execute it using `ctrl` + `enter` - you can also gene 3. implement a second function called `rse_mean`, which calculates the mean of the provided vector. Make sure to use the `rse_sum` function! Test it using `res_mean(-15:17) == 1` -4. Next implement a standard deviation function `rse_std`: $\sqrt{\frac{\sum(x-mean(x))}{n-1}}$, this time you should use elementwise/broadcasting operators. Test it with `rse_std(1:3) == 1` +4. Next implement a standard deviation function `rse_std`: $\sqrt{\frac{\sum((x-mean(x))^2)}{n-1}}$, this time you should use elementwise/broadcasting operators. Test it with `rse_std(1:3) == 1` -5. Finally, we will implement `rse_tstat`, returning the t-value with `length(x)-1` DF, that the provided Array actually has a mean of 0. Test it with `rse_tstat(2:3) == 5`. Add the keyword argument `σ` that allows the user to optionally provide a pre-calculated standard deviation. +5. Finally, we will implement `rse_tstat`, returning the t-value with `length(x)-1` DF, that the provided Array actually has a mean of 0. The formula is $\frac{mean(x)}{std(x) / (sqrt(x))}$ Test it with `rse_tstat(2:3) == 5`. Add the keyword argument `σ` that allows the user to optionally provide a pre-calculated standard deviation. Well done! You now have all functions defined with which we will continue our journey. From bfe7bb318c74189b46a918aded3f2874cf880bb8 Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Mon, 9 Oct 2023 12:42:09 +0000 Subject: [PATCH 14/20] fix typo --- material/1_mon/firststeps/tasks.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/material/1_mon/firststeps/tasks.qmd b/material/1_mon/firststeps/tasks.qmd index 980ec3b..06b4d47 100644 --- a/material/1_mon/firststeps/tasks.qmd +++ b/material/1_mon/firststeps/tasks.qmd @@ -10,9 +10,9 @@ You can mark some code and execute it using `ctrl` + `enter` - you can also gene ## The exercise 1. Open a new script `statistic_functions.jl` in VSCode in a folder of your choice. -2. implement a function called `rse_sum`^[rse = research software engineering, we could use `sum` in a principled way, but it requires some knowledge you likely don't have right now]. This function should return `true` if provided with the following test: `res_sum(1:36) == 666`. You should further make use of a for-loop. +2. implement a function called `rse_sum`^[rse = research software engineering, we could use `sum` in a principled way, but it requires some knowledge you likely don't have right now]. This function should return `true` if provided with the following test: `rse_sum(1:36) == 666`. You should further make use of a for-loop. -3. implement a second function called `rse_mean`, which calculates the mean of the provided vector. Make sure to use the `rse_sum` function! Test it using `res_mean(-15:17) == 1` +3. implement a second function called `rse_mean`, which calculates the mean of the provided vector. Make sure to use the `rse_sum` function! Test it using `rse_mean(-15:17) == 1` 4. Next implement a standard deviation function `rse_std`: $\sqrt{\frac{\sum((x-mean(x))^2)}{n-1}}$, this time you should use elementwise/broadcasting operators. Test it with `rse_std(1:3) == 1` From 96efdd56d1c092df2b6c71893d629c40fdd0e2cd Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Mon, 9 Oct 2023 12:55:18 +0000 Subject: [PATCH 15/20] fix int/float + ength --- material/1_mon/firststeps/tasks.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/material/1_mon/firststeps/tasks.qmd b/material/1_mon/firststeps/tasks.qmd index 06b4d47..5794768 100644 --- a/material/1_mon/firststeps/tasks.qmd +++ b/material/1_mon/firststeps/tasks.qmd @@ -14,9 +14,9 @@ You can mark some code and execute it using `ctrl` + `enter` - you can also gene 3. implement a second function called `rse_mean`, which calculates the mean of the provided vector. Make sure to use the `rse_sum` function! Test it using `rse_mean(-15:17) == 1` -4. Next implement a standard deviation function `rse_std`: $\sqrt{\frac{\sum((x-mean(x))^2)}{n-1}}$, this time you should use elementwise/broadcasting operators. Test it with `rse_std(1:3) == 1` +4. Next implement a standard deviation function `rse_std`: $\sqrt{\frac{\sum((x-mean(x))^2)}{n-1}}$, this time you should use elementwise/broadcasting operators. Test it with `rse_std(1:3) == 1.` -5. Finally, we will implement `rse_tstat`, returning the t-value with `length(x)-1` DF, that the provided Array actually has a mean of 0. The formula is $\frac{mean(x)}{std(x) / (sqrt(x))}$ Test it with `rse_tstat(2:3) == 5`. Add the keyword argument `σ` that allows the user to optionally provide a pre-calculated standard deviation. +5. Finally, we will implement `rse_tstat`, returning the t-value with `length(x)-1` DF, that the provided Array actually has a mean of 0. The formula is $\frac{mean(x)}{std(x) / (sqrt(length(x)))}$ Test it with `rse_tstat(2:3) == 5.`. Add the keyword argument `σ` that allows the user to optionally provide a pre-calculated standard deviation. Well done! You now have all functions defined with which we will continue our journey. From 905aea75465825344dea3cc4fb2e1b74b8b685f2 Mon Sep 17 00:00:00 2001 From: Benjamin Uekermann Date: Mon, 9 Oct 2023 15:49:03 +0200 Subject: [PATCH 16/20] Shorten testing lecture --- material/2_tue/testing/slides.md | 152 ++++++++----------------------- 1 file changed, 39 insertions(+), 113 deletions(-) diff --git a/material/2_tue/testing/slides.md b/material/2_tue/testing/slides.md index 539efeb..cde26dd 100644 --- a/material/2_tue/testing/slides.md +++ b/material/2_tue/testing/slides.md @@ -34,11 +34,11 @@ slideOptions: # Learning Goals -- Justify the effort of developing tests to some extent - Get to know a few common terms of testing - Work with the Julia unit testing package `Test.jl` -Material is taken and modified, on the one hand, from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and, on the other hand, from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/). +Material is taken and modified from the [SSE lecture](https://github.com/Simulation-Software-Engineering/Lecture-Material), which builds partly on the [py-rse book](https://merely-useful.tech/py-rse), and from the [Test.jl docs](https://docs.julialang.org/en/v1/stdlib/Test/). + --- @@ -48,30 +48,21 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https:// ## What is Testing? -- Smelling old milk before using it! -- A way to determine if a software is not producing reliable results and if so, what is the reason. -- Manual testing vs. automated testing. +- Smelling old milk before using it +- A way to determine if a software is not producing reliable results and if so, what is the reason +- Manual testing vs. automated testing --- ## Why Should you Test your Software? -- Improve software reliability and reproducibility. -- Make sure that changes (bugfixes, new features) do not affect other parts of software. +- Improve software reliability and reproducibility +- Make sure that changes (bugfixes, new features) do not affect other parts of software - Generally all software is better off being tested regularly. Possible exceptions are very small codes with single users. - Ensure that a released version of a software actually works. --- -## Nomenclature in Software Testing - -- **Fixture**: preparatory set for testing. -- **Actual result**: what the code produces when given the fixture. -- **Expected result**: what the actual result is compared to. -- **Test coverage**: how much of the code do tests touch in one run. - ---- - ## Some Ways to Test Software - Assertions @@ -83,29 +74,27 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https:// ## Assertions -- Principle of *defensive programming*. +```julia +@assert condition "message" +``` + +- Principle of *defensive programming* - Nothing happens when an assertion is true; throws error when false. - Types of assertion statements: - Precondition - Postcondition - Invariant -- A basic but powerful tool to test a software on-the-go. -- Assertion statement syntax in Python - -```julia -@assert condition "message" -``` +- A basic but powerful tool to test a software on-the-go --- ## Unit Testing -- Catching errors with assertions is good but preventing them is better! +- Catching errors with assertions is good but preventing them is better. - A *unit* is a single function in one situation. - A situation is one amongst many possible variations of input parameters. -- User creates the expected result manually. -- Fixture is the set of inputs used to generate an actual result. -- Actual result is compared to the expected result by `@test`. +- User creates the **expected result** manually. +- **Actual result** is compared to the expected result by `@test`. --- @@ -113,109 +102,44 @@ Material is taken and modified, on the one hand, from the [SSE lecture](https:// - Test whether several units work in conjunction. - *Integrate* units and test them together in an *integration* test. -- Often more complicated than a unit test and has more test coverage. -- A fixture is used to generate an actual result. -- Actual result is compared to the expected result by `@test`. +- Often more complicated than a unit test and gives higher test coverage. --- ## Regression Testing - Generating an expected result is not possible in some situations. -- Compare the current actual result with a previous actual result. +- Compare the *current* actual result with a *previous* actual result. - No guarantee that the current actual result is correct. - Risk of a bug being carried over indefinitely. - Main purpose is to identify changes in the current state of the code with respect to a past state. --- -## Test Coverage - -- Coverage is the amount of code a test touches in one run. -- Aim for high test coverage. -- There is a trade-off: high test coverage vs. effort in test development - ---- - -## Comparing Floating-point Variables - -- Very often quantities in math software are `float` / `double`. -- Such quantities cannot be compared to exact values, an approximation is necessary. -- Comparison of floating point variables needs to be done to a certain tolerance. - -```julia -@test 1 ≈ 0.999999 rtol=1e-5 -``` - -- Get `≈` by Latex `\approx` + TAB - ---- - -## Test-driven Development (TDD) - -- Principle is to write a test and then write a code to fulfill the test. -- Advantages: - - In the end user ends up with a test alongside the code. - - Eliminates confirmation bias of the user. - - Writing tests gives clarity on what the code is supposed to do. -- Disadvantage: known to not improve productivity. - ---- - -## Checking-driven Development (CDD) - -- Developer performs spot checks; sanity checks at intermediate stages -- Math software often has heuristics which are easy to determine. -- Keep performing same checks at different stages of development to ensure the code works. - ---- - -## Verifying a Test - -- Test written as part of a bug-fix: - - Reproduce the bug in the test by ensuring that the test fails. - - Fix the bug. - - Rerun the test to ensure that it passes. -- Test written to increase code coverage: - - Make sure that the first iteration of the test passes. - - Try introducing a small fixable bug in the code to verify if the test fails. - ---- - # 2. Unit Testing in Julia with Test.jl --- -## Setup of Tests.jl +## Setup of Test.jl -- Standard library to write and manage tests, `using Test` - Standardized folder structure: -``` -├── Manifest.toml -├── Project.toml -├── src/ -└── test - ├── Manifest.toml - ├── Project.toml - ├── runtests.jl - └── setup.jl -``` + ``` + ├── Manifest.toml + ├── Project.toml + ├── src/ + └── test + ├── Manifest.toml + ├── Project.toml + ├── runtests.jl + └── setup.jl + ``` - Singular `test` vs plural `runtests.jl` - `setup.jl` for all `using XYZ` statements, included in `runtests.jl` -- Additional packages either in `[extra] section` of `./Project.toml` or in a new `./test/Project.toml` environment +- Additional packages in `[extra] section` of `./Project.toml` or in new `./test/Project.toml` - In case of the latter: Do not add the package itself to the `./test/Project.toml` - - ---- - -## Run Tests - -Various options: - -- Directly call `runtests.jl` TODO? -- From Pkg-Manager `]test` when root project is activated +- Run: `]test` when root project is activated --- @@ -234,11 +158,11 @@ Various options: - `@testset`: Structure tests ```julia - julia> @testset "trigonometric identities" begin - θ = 2/3*π - @test sin(-θ) ≈ -sin(θ) - @test cos(-θ) ≈ cos(θ) - end; + @testset "trigonometric identities" begin + θ = 2/3*π + @test sin(-θ) ≈ -sin(θ) + @test cos(-θ) ≈ cos(θ) + end; ``` - `@testset for ... end`: Test in loop @@ -251,6 +175,8 @@ Various options: - [HiRSE-Summer of Testing Part 2b: "Testing with Julia" by Nils Niggemann](https://www.youtube.com/watch?v=gSMKNbZOpZU) - [Official documentation of Test.jl](https://docs.julialang.org/en/v1/stdlib/Test/) +--- + # 3. Test.jl Demo We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree/main/material/2_tue/testing/MyTestPackage), which looks as follows: @@ -274,7 +200,7 @@ We use [`MyTestPackage`](https://github.com/s-ccs/summerschool_simtech_2023/tree - Look at `MyTestPackage.jl` and `find.jl`: We have two functions `find_max` and `find_mean`, which calculate the maximum and mean of all elements of a `::AbstractVector`. - Assertions were added to check for `NaN` values - Look at `runtests.jl`: - - TODO: Why do we need `using MyTestPackage`? + - Why do we need `using MyTestPackage`? - We include dependencies via `setup.jl`: `Test` and `StableRNG`. - Testset "find" - Look at `find.jl` From 08951f610d3ba359a91dab33df62a13d6ec57b90 Mon Sep 17 00:00:00 2001 From: Marco Oesting Date: Mon, 9 Oct 2023 16:07:42 +0200 Subject: [PATCH 17/20] further update on multiple regression --- material/3_wed/regression/Code_Snippets.jl | 98 ++- .../regression/MultipleRegressionBasics.qmd | 584 +++++++++--------- 2 files changed, 339 insertions(+), 343 deletions(-) diff --git a/material/3_wed/regression/Code_Snippets.jl b/material/3_wed/regression/Code_Snippets.jl index 05bb3e8..e7889d9 100644 --- a/material/3_wed/regression/Code_Snippets.jl +++ b/material/3_wed/regression/Code_Snippets.jl @@ -1,52 +1,48 @@ -############################################################################ -#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' #### -############################################################################ - -using Statistics -using Plots -using RDatasets -using GLM - -## - -trees = dataset("datasets", "trees") - -scatter(trees.Girth, trees.Volume, - legend=false, xlabel="Girth", ylabel="Volume") - -## - -scatter(trees.Girth, trees.Volume, - legend=false, xlabel="Girth", ylabel="Volume") -plot!(x -> -37 + 5*x) - -## - -linmod1 = lm(@formula(Volume ~ Girth), trees) - -## - -linmod2 = lm(@formula(Volume ~ Girth + Height), trees) - -## - -r2(linmod1) -r2(linmod2) - -linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) - -r2(linmod3) - -## - -using CSV -using HTTP - -http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") -SwissLabor = DataFrame(CSV.File(http_response.body)) - -SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") - -## - +using Statistics +using Plots +using RDatasets +using GLM + +#--- + +trees = dataset("datasets", "trees") + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") + +#--- + +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) + +#--- + +linmod1 = lm(@formula(Volume ~ Girth), trees) + +#--- + +linmod2 = lm(@formula(Volume ~ Girth + Height), trees) + +#--- + +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) + +#--- + +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +#--- + model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink()) \ No newline at end of file diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index 1f23123..7e2f5a6 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -1,292 +1,292 @@ ---- -editor: - markdown: - wrap: 72 ---- - -# Multiple Regression Basics - -## Motivation - -### Introductory Example: tree dataset from R - -```{julia} -using Statistics -using Plots -using RDatasets - -trees = dataset("datasets", "trees") - -scatter(trees.Volume, trees.Girth, - legend=false, xlabel="Girth", ylabel="Volume") -``` - -*Aim:* Find relationship between the *response variable* `volume` and -the *explanatory variable/covariate* `girth`? Can we predict the volume -of a tree given its girth? - -```{julia} -scatter(trees.Girth, trees.Volume, - legend=false, xlabel="Girth", ylabel="Volume") -plot!(x -> -37 + 5*x) -``` - -First Guess: There is a linear relation! - -## Simple Linear Regression - -Main assumption: up to some error term, each measurement of the response -variable $y_i$ depends linearly on the corresponding value $x_i$ of the -covariate - -$\leadsto$ **(Simple) Linear Model:** -$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ -where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent -normally distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the straight line that fits best, i.e., find the *optimal* -estimators for $\beta_0$ and $\beta_1$. - -*Typical choice*: Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the -vector of covariates and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ - (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 -$$ - -Note: There is a closed-form expression for -$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but -rather use Julia to solve the problem. - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth`\] - -```{julia} -lm(@formula(Volume ~ Girth), trees) -``` - -*Interpretation of the Julia output:* - -- column `estimate` : least square estimates for $\hat \beta_0$ and - $\hat \beta_1$ - -- column `Std. Error` : estimated standard deviation - $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ - -- column `t value` : value of the $t$-statistics - - $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ - - Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would - follow a $t$-distribution. - -- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for - $i=0,1$ - -**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given -these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ -(you can choose them)), simulate responses from a linear model and -estimate the coefficients $\beta_0$ and $\beta_1$. Play with different -choices of the parameters to see the effects on the parameter estimates -and the $p$-values. - -## Multiple Regression Model - -*Idea*: Generalize the simple linear regression model to multiple -covariates, w.g., predict `volume` using `girth` and \`height\`\`. - -$\leadsto$ **Linear Model:** -$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where - -- $y_i$: $i$-th measurement of the response, - -- $x_{i1}$: $i$ th value of first covariate, - -- ... - -- $x_{ip}$: $i$-th value of $p$-th covariate, - -- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally - distributed errors with unknown variance $\sigma^2$. - -*Task:* Find the *optimal* estimators for -$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. - -*Our choice again:* Least squares estimator (= maximum likelihood -estimator for normal errors) - -$$ - \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 -$$ - -where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the -vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. - -Written in matrix style: - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 -$$ - -Defining the *design matrix* - -$$ \mathbf{X} = \left( \begin{array}{cccc} - 1 & x_{11} & \ldots & x_{1p} \\ - \vdots & \vdots & \ddots & \vdots \\ - 1 & x_{11} & \ldots & x_{1p} - \end{array}\right) \qquad - (\text{size } n \times (p+1)), $$ - -we get the short form - -$$ -\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} -$$ - -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth + height`\] - -The interpretation of the Julia output is similar to the simple linear -regression model, but we provide explicit formulas now: - -- parameter estimates: - - $$ - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - $$ - -- estimated standard errors: - - $$ - \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} - $$ - -- $t$-statistics: - - $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ - -- $p$-values: - - $$ - p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} - $$ - -**Task 2**: Implement functions that estimate the $\beta$-parameters, -the corresponding standard errors and the $t$-statistics. Test your -functions with the \`\`\`tree''' data set and try to reproduce the -output above. - -```{julia} -r2(linmod1) -r2(linmod2) - -linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) - -r2(linmod3) -``` - -## Generalized Linear Models - -Classical linear model - -$$ - \mathbf{y} = \mathbf{X} \beta + \varepsilon -$$ - -implies that -$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ - -In particular, the conditional expectation satisfies -$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. - -However, the assumption of a normal distribution is unrealistic for -non-continuous data. Popular alternatives include: - -- for counting data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) - $$ - - Here, the components are considered to be independent and the - exponential function is applied componentwise. - -- for binary data: $$ - \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} - $$ - - Again, the components are considered to be independent and all the - operations are applied componentwise. - -All these models are defined by the choice of a family of distributions -and a function $g$ (the so-called *link function*) such that - -$$ - \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). -$$ - -For the models above, these are: - -+--------------+---------------------+--------------------+ -| Type of Data | Distribution Family | Link Function | -+==============+=====================+====================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+--------------+---------------------+--------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+--------------+---------------------+--------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \ | -| | | f | -| | | rac{x}{1-x}\right) | -| | | $$ | -+--------------+---------------------+--------------------+ - -In general, the parameter vector $\beta$ is estimated via maximizing the -likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -which is equivalent to the maximization of the log-likelihood, i.e., - -$$ -\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), -$$ - -In the Gaussian case, the maximum likelihood estimator is identical to -the least squares estimator considered above. - -```{julia} -using CSV -using HTTP - -http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") -SwissLabor = DataFrame(CSV.File(http_response.body)) - -SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") - -model = glm(@formula(participation ~ age^2), - SwissLabor, Binomial(), ProbitLink()) -``` - -**Task 3:** Reproduce the results of our data analysis of the `tree` -data set using a generalized linear model with normal distribution -family. +--- +editor: + markdown: + wrap: 72 +--- + +# Multiple Regression Basics + +## Motivation + +### Introductory Example: tree dataset from R + +``` julia +using Statistics +using Plots +using RDatasets + +trees = dataset("datasets", "trees") + +scatter(trees.Volume, trees.Girth, + legend=false, xlabel="Girth", ylabel="Volume") +``` + +*Aim:* Find relationship between the *response variable* `volume` and +the *explanatory variable/covariate* `girth`? Can we predict the volume +of a tree given its girth? + +``` julia +scatter(trees.Girth, trees.Volume, + legend=false, xlabel="Girth", ylabel="Volume") +plot!(x -> -37 + 5*x) +``` + +First Guess: There is a linear relation! + +## Simple Linear Regression + +Main assumption: up to some error term, each measurement of the response +variable $y_i$ depends linearly on the corresponding value $x_i$ of the +covariate + +$\leadsto$ **(Simple) Linear Model:** +$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$$ +where $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ are independent +normally distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the straight line that fits best, i.e., find the *optimal* +estimators for $\beta_0$ and $\beta_1$. + +*Typical choice*: Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 $$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$ is the +vector of covariates and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ + (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 +$$ + +Note: There is a closed-form expression for +$(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but +rather use Julia to solve the problem. + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth`\] + +``` julia +lm(@formula(Volume ~ Girth), trees) +``` + +*Interpretation of the Julia output:* + +- column `estimate` : least square estimates for $\hat \beta_0$ and + $\hat \beta_1$ + +- column `Std. Error` : estimated standard deviation + $\hat s_{\hat \beta_i}$ of the estimator $\hat \beta_i$ + +- column `t value` : value of the $t$-statistics + + $$ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1, $$ + + Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would + follow a $t$-distribution. + +- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for + $i=0,1$ + +**Task 1**: Generate a random set of covariates $\mathbf{x}$. Given +these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ +(you can choose them)), simulate responses from a linear model and +estimate the coefficients $\beta_0$ and $\beta_1$. Play with different +choices of the parameters to see the effects on the parameter estimates +and the $p$-values. + +## Multiple Regression Model + +*Idea*: Generalize the simple linear regression model to multiple +covariates, w.g., predict `volume` using `girth` and \`height\`\`. + +$\leadsto$ **Linear Model:** +$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$$where + +- $y_i$: $i$-th measurement of the response, + +- $x_{i1}$: $i$ th value of first covariate, + +- ... + +- $x_{ip}$: $i$-th value of $p$-th covariate, + +- $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$: independent normally + distributed errors with unknown variance $\sigma^2$. + +*Task:* Find the *optimal* estimators for +$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$. + +*Our choice again:* Least squares estimator (= maximum likelihood +estimator for normal errors) + +$$ + \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 +$$ + +where $\mathbf{y}$ is the vector of responses, $\mathbf{x}$\_j is the +vector of the $j$ th covariate and $\mathbf{1}$ is a vector of ones. + +Written in matrix style: + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 +$$ + +Defining the *design matrix* + +$$ \mathbf{X} = \left( \begin{array}{cccc} + 1 & x_{11} & \ldots & x_{1p} \\ + \vdots & \vdots & \ddots & \vdots \\ + 1 & x_{11} & \ldots & x_{1p} + \end{array}\right) \qquad + (\text{size } n \times (p+1)), $$ + +we get the short form + +$$ +\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +\[use Julia code (existing package) to perform linear regression for +`volume ~ girth + height`\] + +The interpretation of the Julia output is similar to the simple linear +regression model, but we provide explicit formulas now: + +- parameter estimates: + + $$ + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} + $$ + +- estimated standard errors: + + $$ + \hat s_{\beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \frac 1 {n-p} \|\mathbf{y} - \mathbf{X} \beta\|^2} + $$ + +- $t$-statistics: + + $$ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. $$ + +- $p$-values: + + $$ + p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p} + $$ + +**Task 2**: Implement functions that estimate the $\beta$-parameters, +the corresponding standard errors and the $t$-statistics. Test your +functions with the \`\`\`tree''' data set and try to reproduce the +output above. + +``` julia +r2(linmod1) +r2(linmod2) + +linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees) + +r2(linmod3) +``` + +## Generalized Linear Models + +Classical linear model + +$$ + \mathbf{y} = \mathbf{X} \beta + \varepsilon +$$ + +implies that +$$ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$$ + +In particular, the conditional expectation satisfies +$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$. + +However, the assumption of a normal distribution is unrealistic for +non-continuous data. Popular alternatives include: + +- for counting data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) + $$ + + Here, the components are considered to be independent and the + exponential function is applied componentwise. + +- for binary data: $$ + \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} + $$ + + Again, the components are considered to be independent and all the + operations are applied componentwise. + +All these models are defined by the choice of a family of distributions +and a function $g$ (the so-called *link function*) such that + +$$ + \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). +$$ + +For the models above, these are: + ++--------------+---------------------+--------------------+ +| Type of Data | Distribution Family | Link Function | ++==============+=====================+====================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++--------------+---------------------+--------------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++--------------+---------------------+--------------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g(x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | rac{x}{1-x}\right) | +| | | $$ | ++--------------+---------------------+--------------------+ + +In general, the parameter vector $\beta$ is estimated via maximizing the +likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +which is equivalent to the maximization of the log-likelihood, i.e., + +$$ +\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{\cdot i}), +$$ + +In the Gaussian case, the maximum likelihood estimator is identical to +the least squares estimator considered above. + +``` julia +using CSV +using HTTP + +http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv") +SwissLabor = DataFrame(CSV.File(http_response.body)) + +SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes") + +model = glm(@formula(participation ~ age^2), + SwissLabor, Binomial(), ProbitLink()) +``` + +**Task 3:** Reproduce the results of our data analysis of the `tree` +data set using a generalized linear model with normal distribution +family. From 39892ad1c1faa4505b3cf2f2c9cad456f45f3528 Mon Sep 17 00:00:00 2001 From: Marco Oesting Date: Mon, 9 Oct 2023 16:14:00 +0200 Subject: [PATCH 18/20] Another update --- .../regression/MultipleRegressionBasics.qmd | 62 ++++++++++--------- 1 file changed, 32 insertions(+), 30 deletions(-) diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index 7e2f5a6..f54f3b5 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -10,7 +10,7 @@ editor: ### Introductory Example: tree dataset from R -``` julia +``` julia using Statistics using Plots using RDatasets @@ -25,7 +25,7 @@ scatter(trees.Volume, trees.Girth, the *explanatory variable/covariate* `girth`? Can we predict the volume of a tree given its girth? -``` julia +``` julia scatter(trees.Girth, trees.Volume, legend=false, xlabel="Girth", ylabel="Volume") plot!(x -> -37 + 5*x) @@ -68,7 +68,7 @@ rather use Julia to solve the problem. \[use Julia code (existing package) to perform linear regression for `volume ~ girth`\] -``` julia +``` julia lm(@formula(Volume ~ Girth), trees) ``` @@ -183,7 +183,7 @@ the corresponding standard errors and the $t$-statistics. Test your functions with the \`\`\`tree''' data set and try to reproduce the output above. -``` julia +``` julia r2(linmod1) r2(linmod2) @@ -232,31 +232,33 @@ $$ For the models above, these are: -+--------------+---------------------+--------------------+ -| Type of Data | Distribution Family | Link Function | -+==============+=====================+====================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+--------------+---------------------+--------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+--------------+---------------------+--------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \ | -| | | f | -| | | rac{x}{1-x}\right) | -| | | $$ | -+--------------+---------------------+--------------------+ ++---------------+-------------------+------------------+ +| Type of Data | Distribution | Link Function | +| | Family | | ++===============+===================+==================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++---------------+-------------------+------------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++---------------+-------------------+------------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g(x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | ra | +| | | c{x}{1-x}\right) | +| | | $$ | ++---------------+-------------------+------------------+ In general, the parameter vector $\beta$ is estimated via maximizing the likelihood, i.e., @@ -274,7 +276,7 @@ $$ In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above. -``` julia +``` julia using CSV using HTTP From 773c15150d8206ed1d0d4ee1fbeb8828173e57f6 Mon Sep 17 00:00:00 2001 From: "behinger (s-ccs 001)" Date: Mon, 9 Oct 2023 14:14:50 +0000 Subject: [PATCH 19/20] solutions day 1 --- _quarto.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/_quarto.yml b/_quarto.yml index 2542c00..4db53c5 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -47,6 +47,8 @@ website: text: "📝 3 - First Steps: Handout" - href: "material/1_mon/firststeps/tasks.qmd" text: "🛠 3 - First Steps: Exercises" + - href: "https://github.com/s-ccs/summerschool_simtech_2023/blob/main/material/1_mon/firststeps/statistic_functions.jl" + text: "✔ 3 - Solutions" - href: "material/1_mon/envs/envs_handout.qmd" text: "📝 4 - Envs & Pkgs : Handout" - href: "material/1_mon/envs/tasks.qmd" From 58cf2de20df4b96a4461cc55ffccd12156775641 Mon Sep 17 00:00:00 2001 From: Marco Oesting Date: Mon, 9 Oct 2023 16:39:16 +0200 Subject: [PATCH 20/20] Elaborate regression exercises. --- .../regression/MultipleRegressionBasics.qmd | 74 ++++++++++--------- 1 file changed, 41 insertions(+), 33 deletions(-) diff --git a/material/3_wed/regression/MultipleRegressionBasics.qmd b/material/3_wed/regression/MultipleRegressionBasics.qmd index f54f3b5..9e02584 100644 --- a/material/3_wed/regression/MultipleRegressionBasics.qmd +++ b/material/3_wed/regression/MultipleRegressionBasics.qmd @@ -65,8 +65,6 @@ Note: There is a closed-form expression for $(\hat \beta_0, \hat \beta_1)$. We will not make use of it here, but rather use Julia to solve the problem. -\[use Julia code (existing package) to perform linear regression for -`volume ~ girth`\] ``` julia lm(@formula(Volume ~ Girth), trees) @@ -87,8 +85,12 @@ lm(@formula(Volume ~ Girth), trees) Under the hypothesis $\beta_i=0$, the test statistics $t_i$ would follow a $t$-distribution. -- column `Pr(>|t|)`: $p$-values for the hyptheses $\beta_i=0$ for +- column `Pr(>|t|)`: $p$-values for the hypotheses $\beta_i=0$ for $i=0,1$ + +:::callout.tip +The command `rand(n)` generates a sample of `n` "random" (i.e., uniformly distributed) random numbers. +::: **Task 1**: Generate a random set of covariates $\mathbf{x}$. Given these covariates and true parameters $\beta_0$, $\beta_1$ and $\sigma^2$ @@ -232,33 +234,35 @@ $$ For the models above, these are: -+---------------+-------------------+------------------+ -| Type of Data | Distribution | Link Function | -| | Family | | -+===============+===================+==================+ -| continuous | Normal | identity: | -| | | | -| | | $$ | -| | | g(x)=x | -| | | $$ | -+---------------+-------------------+------------------+ -| count | Poisson | log: | -| | | | -| | | $$ | -| | | g(x) = \log(x) | -| | | $$ | -+---------------+-------------------+------------------+ -| binary | Bernoulli | logit: | -| | | | -| | | $$ | -| | | g(x) = \log\left | -| | | ( | -| | | \ | -| | | f | -| | | ra | -| | | c{x}{1-x}\right) | -| | | $$ | -+---------------+-------------------+------------------+ ++----------------+------------------+-----------------+ +| Type of Data | Distribution | Link Function | +| | Family | | ++================+==================+=================+ +| continuous | Normal | identity: | +| | | | +| | | $$ | +| | | g(x)=x | +| | | $$ | ++----------------+------------------+-----------------+ +| count | Poisson | log: | +| | | | +| | | $$ | +| | | g(x) = \log(x) | +| | | $$ | ++----------------+------------------+-----------------+ +| binary | Bernoulli | logit: | +| | | | +| | | $$ | +| | | g | +| | | (x) = \log\left | +| | | ( | +| | | \ | +| | | f | +| | | ra | +| | | c | +| | | {x}{1-x}\right) | +| | | $$ | ++----------------+------------------+-----------------+ In general, the parameter vector $\beta$ is estimated via maximizing the likelihood, i.e., @@ -289,6 +293,10 @@ model = glm(@formula(participation ~ age^2), SwissLabor, Binomial(), ProbitLink()) ``` -**Task 3:** Reproduce the results of our data analysis of the `tree` -data set using a generalized linear model with normal distribution -family. +::: callout-task +**Task 3**: + +1. Reproduce the results of our data analysis of the `tree` data set using +a generalized linear model with normal distribution family. +2. Generate +:::