Updated Text & Code on Regression.

Marco Oesting
2023-10-08 20:56:51 +02:00
parent cc5c76f770
commit f2d84806ea
2 changed files with 344 additions and 253 deletions


@@ -0,0 +1,52 @@
############################################################################
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
############################################################################
using Statistics
using Plots
using RDatasets
using GLM
##
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
##
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
##
linmod1 = lm(@formula(Volume ~ Girth), trees)
##
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
##
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
##
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
##
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())


@@ -10,13 +10,26 @@ editor:
### Introductory Example: tree dataset from R
\[figure of raw data\]

```{julia}
using Statistics
using Plots
using RDatasets
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
```
*Aim:* Find the relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`. Can we predict the volume
of a tree given its girth?
\[figure including a straight line\]

```{julia}
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
```
First Guess: There is a linear relation!
@@ -55,6 +68,10 @@ rather use Julia to solve the problem.
\[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]
```{julia}
using GLM  # provides lm, glm, @formula and r2
linmod1 = lm(@formula(Volume ~ Girth), trees)
```
*Interpretation of the Julia output:*
- column `estimate` : least squares estimates for $\hat \beta_0$ and
@@ -166,6 +183,15 @@ the corresponding standard errors and the $t$-statistics. Test your
functions with the `tree` data set and try to reproduce the
output above.
```{julia}
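# Fit the models here so that this chunk runs on its own
using GLM
linmod1 = lm(@formula(Volume ~ Girth), trees)
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)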
```
## Generalized Linear Models
Classical linear model
@@ -206,29 +232,31 @@ $$
For the models above, these are:
+--------------+---------------------+------------------------+
| Type of Data | Distribution Family | Link Function          |
+==============+=====================+========================+
| continuous   | Normal              | identity:              |
|              |                     |                        |
|              |                     | $$                     |
|              |                     | g(x)=x                 |
|              |                     | $$                     |
+--------------+---------------------+------------------------+
| count        | Poisson             | log:                   |
|              |                     |                        |
|              |                     | $$                     |
|              |                     | g(x) = \log(x)         |
|              |                     | $$                     |
+--------------+---------------------+------------------------+
| binary       | Bernoulli           | logit:                 |
|              |                     |                        |
|              |                     | $$                     |
|              |                     | g(x) = \log\left(      |
|              |                     | \frac{x}{1-x}\right)   |
|              |                     | $$                     |
+--------------+---------------------+------------------------+
In general, the parameter vector $\beta$ is estimated by maximizing the
likelihood, i.e.,
@@ -246,7 +274,18 @@ $$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
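
A short sketch of why this holds, assuming independent observations
$y_i \sim N(x_i^\top \beta, \sigma^2)$, $i = 1, \ldots, n$. The symbols $x_i$,
$y_i$, $n$ and $\sigma^2$ are assumptions of this sketch and may differ from
the notation used earlier. Under these assumptions the log-likelihood is

$$
\log L(\beta) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n \left( y_i - x_i^\top \beta \right)^2 .
$$

The first term does not depend on $\beta$, and the second is a negative
multiple of the residual sum of squares. Maximizing $\log L(\beta)$ over
$\beta$ is therefore the same as minimizing
$\sum_{i=1}^n (y_i - x_i^\top \beta)^2$, whatever the value of $\sigma^2 > 0$.
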
\[\[ Example in Julia: maybe `SwissLabor` \]\]

```{julia}
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink())
```
**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution