Updated Text & Code on Regression.

Marco Oesting
2023-10-08 20:56:51 +02:00
parent cc5c76f770
commit f2d84806ea
2 changed files with 344 additions and 253 deletions


@@ -0,0 +1,52 @@
############################################################################
#### Execute code chunks separately in VSCODE by pressing 'Alt + Enter' ####
############################################################################
using Statistics
using Plots
using RDatasets
using GLM
##
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
##
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
##
linmod1 = lm(@formula(Volume ~ Girth), trees)
##
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
##
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
##
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
##
model = glm(@formula(participation ~ age), SwissLabor, Binomial(), ProbitLink())


@@ -10,13 +10,26 @@ editor:
### Introductory Example: tree dataset from R
\[figure of raw data\]
```{julia}
using Statistics
using Plots
using RDatasets
using GLM  # provides lm, glm and r2, which the later chunks call
trees = dataset("datasets", "trees")
# Girth on the x-axis and Volume on the y-axis, matching the axis labels
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
```
*Aim:* Find the relationship between the *response variable* `volume` and
the *explanatory variable/covariate* `girth`. Can we predict the volume
of a tree given its girth?
\[figure including a straight line\]
```{julia}
scatter(trees.Girth, trees.Volume,
legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
```
First Guess: There is a linear relation!
@@ -55,6 +68,10 @@ rather use Julia to solve the problem.
\[use Julia code (existing package) to perform linear regression for
`volume ~ girth`\]
```{julia}
lm(@formula(Volume ~ Girth), trees)
```
*Interpretation of the Julia output:*
- column `estimate` : least squares estimates for $\hat \beta_0$ and
@@ -166,6 +183,15 @@ the corresponding standard errors and the $t$-statistics. Test your
functions with the `tree` data set and try to reproduce the
output above.
```{julia}
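# Fit the nested models and print R² for each one
linmod1 = lm(@formula(Volume ~ Girth), trees)
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
@show r2(linmod1)
@show r2(linmod2)
@show r2(linmod3)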
```
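The $R^2$ values compared above can also be computed by hand. The following is a minimal sketch, assuming the usual definition $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$; the helper `r_squared` and the toy data are hypothetical and not part of the lecture code. It also illustrates that, for nested least squares models with an intercept, adding a covariate cannot decrease $R^2$.

```{julia}
using LinearAlgebra
using Statistics

# Coefficient of determination: 1 - RSS/TSS (hypothetical helper)
function r_squared(y, yhat)
    rss = sum(abs2, y .- yhat)        # residual sum of squares
    tss = sum(abs2, y .- mean(y))     # total sum of squares
    return 1 - rss / tss
end

# Hypothetical toy data (NOT the trees data set)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y  = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]

X_small = hcat(ones(length(y)), x1)   # design matrix: intercept + x1
X_large = hcat(X_small, x2)           # design matrix: intercept + x1 + x2

# Least squares estimates via the backslash operator
beta_small = X_small \ y
beta_large = X_large \ y

r2_small = r_squared(y, X_small * beta_small)
r2_large = r_squared(y, X_large * beta_large)
```

Because the larger model contains the smaller one, `r2_large` is at least as large as `r2_small`. A higher $R^2$ alone therefore does not show that the additional covariate is useful.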
## Generalized Linear Models
Classical linear model
@@ -206,29 +232,31 @@ $$
For the models above, these are:
+--------------+---------------------+---------------------------------------+
| Type of Data | Distribution Family | Link Function                         |
+==============+=====================+=======================================+
| continuous   | Normal              | identity:                             |
|              |                     |                                       |
|              |                     | $$                                    |
|              |                     | g(x) = x                              |
|              |                     | $$                                    |
+--------------+---------------------+---------------------------------------+
| count        | Poisson             | log:                                  |
|              |                     |                                       |
|              |                     | $$                                    |
|              |                     | g(x) = \log(x)                        |
|              |                     | $$                                    |
+--------------+---------------------+---------------------------------------+
| binary       | Bernoulli           | logit:                                |
|              |                     |                                       |
|              |                     | $$                                    |
|              |                     | g(x) = \log\left(\frac{x}{1-x}\right) |
|              |                     | $$                                    |
+--------------+---------------------+---------------------------------------+
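
As a small illustration of the table, the three link functions and their inverses can be written directly in Julia. This is only a sketch: the function names are hypothetical, and it assumes the usual convention that the link $g$ maps the mean to the linear predictor, so the inverse link maps the linear predictor back to the mean.

```{julia}
# Link functions g from the table (hypothetical names)
identity_link(mu) = mu
log_link(mu)      = log(mu)
logit_link(mu)    = log(mu / (1 - mu))

# Inverse links, mapping a linear predictor eta back to the mean
inv_identity(eta) = eta
inv_log(eta)      = exp(eta)            # always positive, suitable for counts
inv_logit(eta)    = 1 / (1 + exp(-eta)) # always in (0, 1), suitable for probabilities

eta = 0.3                               # hypothetical value of the linear predictor
p   = inv_logit(eta)                    # corresponding success probability
```

The inverse links show why these choices fit the data types: whatever real value the linear predictor takes, `inv_log` returns a valid Poisson mean and `inv_logit` returns a valid Bernoulli probability.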
In general, the parameter vector $\beta$ is estimated via maximizing the
likelihood, i.e.,
@@ -246,7 +274,18 @@ $$
In the Gaussian case, the maximum likelihood estimator is identical to
the least squares estimator considered above.
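
To see why, here is a short sketch. It assumes independent observations $y_1, \ldots, y_n$ that are normally distributed with means $x_i^\top \beta$ and common variance $\sigma^2$; this notation is an assumption and may differ from the symbols used above. The log-likelihood is then

$$
\ell(\beta) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n \left(y_i - x_i^\top \beta\right)^2.
$$

The first term does not depend on $\beta$, and the second is a negative multiple of the residual sum of squares. Maximizing $\ell(\beta)$ over $\beta$ is therefore the same as minimizing $\sum_{i=1}^n (y_i - x_i^\top \beta)^2$.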
\[\[ Example in Julia: maybe `SwissLabor` \]\]
```{julia}
using CSV
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!,"participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2),
SwissLabor, Binomial(), ProbitLink())
```
**Task 3:** Reproduce the results of our data analysis of the `tree`
data set using a generalized linear model with normal distribution