JuliaForDataAnalysis/exercises/exercises11.md
2022-10-14 13:43:12 +02:00

# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 11
# Problems
### Exercise 1
Generate a data frame `df` with one column `x` consisting of 100,000 values
sampled from the uniform distribution on the [0, 1) interval.
Serialize it to disk, and then deserialize it. Check if the deserialized
object is the same as the source data frame.
<details>
<summary>Solution</summary>
```
julia> using DataFrames
julia> df = DataFrame(x=rand(100_000));
julia> using Serialization
julia> serialize("df.bin", df)
julia> deserialize("df.bin") == df
true
```
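The `==` comparison checks that column names and values match; it does not show whether the two objects share memory. A minimal sketch of a slightly more thorough round-trip check follows. It assumes a temporary file obtained from `tempname()` instead of `df.bin`, and the names `path` and `df2` are introduced here only for illustration:

```julia
using DataFrames
using Serialization

df = DataFrame(x=rand(100_000))
path = tempname()        # temporary file, so nothing is left in the working directory
serialize(path, df)
df2 = deserialize(path)

df2 == df                # same column names and equal values
isequal(df2, df)         # like ==, but also well defined if the data held missing or NaN
df2.x !== df.x           # the deserialized column is a new vector, not an alias
rm(path)                 # clean up the temporary file
```

Keep in mind that the Serialization module is mainly meant for short-term storage, ideally read back with the same Julia and package versions that wrote the file.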
</details>
### Exercise 2
Add a column `n` to the `df` data frame that, in each row, holds the
number of observations in column `x` whose distance to the value of `x`
stored in that row is less than `0.1`.
<details>
<summary>Solution</summary>
A simple approach is:
```
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
```
A more sophisticated approach (faster and allocating less memory) would be:
```
df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
```
An even faster solution that is type stable would use a function barrier:
```
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)
```
Finally, you can work on sorted data to get much better performance. Here is an
example (it is a bit more advanced):
```
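function f2(x)
    p = sortperm(x) # p[i] is the position in x of the i-th smallest value
    n = zeros(Int, length(x))
    start = 1 # first sorted position still within 0.1 below the current value
    stop = 1  # first sorted position that is 0.1 or more above the current value
    idx = 0
    while idx < length(x) # you could add @inbounds here but I typically avoid it
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end
df.n = f2(df.x)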
```
In this solution the function barrier matters even more, because we explicitly
use loops inside the function.
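Before trusting the faster variant, it is worth checking that it returns the same counts as the simple one. The sketch below does that on a small random sample; the names `count_brute` and `count_sorted` are hypothetical and used only here, and `count_sorted` repeats the logic of `f2` so that the block runs on its own:

```julia
using Random

# Brute-force reference: compares every pair of values, so it needs O(n^2) work.
count_brute(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Sorted two-pointer version; same logic as `f2` above.
function count_sorted(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    for idx in eachindex(x)
        # drop values that are 0.1 or more below the current one
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        # take in values that are less than 0.1 above the current one
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end

Random.seed!(1234)   # arbitrary seed, only for reproducibility
x = rand(2_000)      # small sample, so the brute-force pass stays fast
count_brute(x) == count_sorted(x)
```

The sorted version needs one sort plus a single pass in which `start` and `stop` only move forward, which is why it scales much better than comparing all pairs.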
</details>
### Exercise 3
Investigate visually how `n` depends on `x` in data frame `df`.
<details>
<summary>Solution</summary>
```
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
```
As expected, the number of neighbors drops near the borders of the domain.
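The shape of the plot can be predicted with a back-of-the-envelope calculation: for uniformly distributed data, the expected count is roughly the number of observations times the length of the part of the window `(x - 0.1, x + 0.1)` that lies inside the unit interval. A rough sketch, in which the helper `expected_n` and its keyword arguments `N` and `r` are hypothetical names and not part of the original solution:

```julia
# Approximate expected number of neighbors of `x` among `N` uniform draws on [0, 1),
# counting points closer than `r`; this ignores the randomness of the sample.
expected_n(x; N=100_000, r=0.1) = N * (min(x + r, 1.0) - max(x - r, 0.0))

expected_n(0.5)   # interior: window of length 0.2, so about 20,000 neighbors
expected_n(0.0)   # border: window of length 0.1, so about 10,000 neighbors
```

If you want to compare it with the data, something like `plot!(expected_n, 0, 1)` after the `scatter` call should overlay the curve on the points.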
</details>
### Exercise 4
Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```
Load this data into a `testdf` data frame.
<details>
<summary>Solution</summary>
```
julia> using CSV
julia> using DataFrames
julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
 Row │ x         sinx
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.139279  0.138829
   2 │ 0.456779  0.441059
   3 │ 0.344034  0.337287
   4 │ 0.140253  0.139794
   5 │ 0.848344  0.750186
   6 │ 0.977512  0.829109
   7 │ 0.032737  0.032731
   8 │ 0.70275   0.646318
   9 │ 0.422339  0.409895
  10 │ 0.393878  0.383772
```
</details>
### Exercise 5
Check the accuracy of the computed sine of `x` in `testdf`.
Print all rows for which the absolute difference is greater than `5e-7`.
For these rows display `x`, `sinx`, the exact value of `sin(x)`, and the absolute
difference.
<details>
<summary>Solution</summary>
Since the data frame is small, we can use `eachrow`:
```
julia> for row in eachrow(testdf)
           sinx = sin(row.x)
           dev = abs(sinx - row.sinx)
           dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
       end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```
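For larger tables a column-wise formulation is usually preferable to a row loop. The sketch below shows one possible way to do it; to stay self-contained it rebuilds a two-row excerpt of `testdf`, and the names `check` and `bad` are hypothetical:

```julia
using DataFrames

# Two rows taken from `testdf` above; only the first one exceeds the 5e-7 threshold.
testdf = DataFrame(x=[0.456779, 0.139279], sinx=[0.441059, 0.138829])

check = copy(testdf)
check.exact = sin.(check.x)                   # exact value of sin(x)
check.dev = abs.(check.exact .- check.sinx)   # absolute difference
bad = check[check.dev .> 5e-7, :]             # keep only the inaccurate rows
```

With the full `testdf` this should select the same two rows that the loop prints.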
</details>
### Exercise 6
Group the data in data frame `df` into buckets of width 0.1 and store the result in
the `gdf` grouped data frame (sort the groups). Use the `cut` function from
CategoricalArrays.jl to do it (check its documentation to learn how to use it).
Check the number of values in each group.
<details>
<summary>Solution</summary>
```
julia> using CategoricalArrays
julia> df.xbins = cut(df.x, 0.0:0.1:1.0);
julia> gdf = groupby(df, :xbins; sort=true);
julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
  9872
  9976
  9968
  9943
 10063
 10173
  9977
 10076
  9908
 10044
julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
 Row │ xbins       nrow
     │ Cat…        Int64
─────┼───────────────────
   1 │ [0.0, 0.1)   9872
   2 │ [0.1, 0.2)   9976
   3 │ [0.2, 0.3)   9968
   4 │ [0.3, 0.4)   9943
   5 │ [0.4, 0.5)  10063
   6 │ [0.5, 0.6)  10173
   7 │ [0.6, 0.7)   9977
   8 │ [0.7, 0.8)  10076
   9 │ [0.8, 0.9)   9908
  10 │ [0.9, 1.0)  10044
```
You might get slightly different numbers, but all should be around 10,000.
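The bins are closed on the left and open on the right, which works here because `rand()` never returns exactly `1.0`. If your data could contain the upper bound, the `extend=true` keyword argument of `cut` is, as far as I know, the documented way to make the last interval closed. A small sketch on made-up values (the vector `v` is hypothetical):

```julia
using CategoricalArrays

v = [0.05, 0.07, 0.95, 1.0]               # note the value equal to the upper bound
bins = cut(v, 0.0:0.1:1.0; extend=true)   # extend=true closes the last interval on the right

bins[1] == bins[2]   # 0.05 and 0.07 share the first bin
bins[3] == bins[4]   # 0.95 and 1.0 share the last bin
```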
</details>
### Exercise 7
Display the grouping keys in the `gdf` grouped data frame. Show them as named tuples.
Check what the group order would be if you asked not to sort them.
<details>
<summary>Solution</summary>
```
julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.6, 0.7)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.9, 1.0)",)
julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.9, 1.0)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.6, 0.7)",)
```
If you pass `sort=false` instead of `sort=true`, you get the groups in their order
of appearance in `df`. If you omit the `sort` keyword argument,
the resulting group order can depend on the type of the grouping column, so if
you rely on the order of groups, always pass the `sort` keyword argument
explicitly.
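The same behavior is easier to see on a tiny example. The sketch below uses a hypothetical data frame `toy` with an integer grouping column; the order-of-appearance result for `sort=false` assumes a DataFrames.jl version that behaves as described above:

```julia
using DataFrames

toy = DataFrame(g=[3, 1, 2, 1])   # made-up data; the value 1 appears twice

# group keys in sorted order: [1, 2, 3]
sorted_keys = [k.g for k in keys(groupby(toy, :g; sort=true))]

# group keys in order of first appearance in `toy`
unsorted_keys = [k.g for k in keys(groupby(toy, :g; sort=false))]
```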
</details>
### Exercise 8
Compute average `n` for each group in `gdf`.
<details>
<summary>Solution</summary>
```
julia> using Statistics
julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
 14845.847751215559
 19835.367882919007
 19919.195826645264
 19993.023936437694
 20105.506111497565
 20222.35761329008
 20151.794727874112
 20022.69610956729
 19909.331550262414
 14944.511449621665
julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
 Row │ xbins       n_mean
     │ Cat…        Float64
─────┼─────────────────────
   1 │ [0.0, 0.1)  14845.8
   2 │ [0.1, 0.2)  19835.4
   3 │ [0.2, 0.3)  19919.2
   4 │ [0.3, 0.4)  19993.0
   5 │ [0.4, 0.5)  20105.5
   6 │ [0.5, 0.6)  20222.4
   7 │ [0.6, 0.7)  20151.8
   8 │ [0.7, 0.8)  20022.7
   9 │ [0.8, 0.9)  19909.3
  10 │ [0.9, 1.0)  14944.5
```
</details>
### Exercise 9
Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as a named tuple having fields `α₀` and `αₓ`.
<details>
<summary>Solution</summary>
```
julia> function fitmodel(x, n)
           X = [ones(length(x)) x]
           α₀, αₓ = X \ n
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitmodel (generic function with 1 method)
julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776916, αₓ = 99131.14394200995)
 (α₀ = 19823.115188829383, αₓ = 81.66979172871368)
 (α₀ = 19812.9822724435, αₓ = 424.00895772216785)
 (α₀ = 19810.726510910834, αₓ = 520.6763238983195)
 (α₀ = 19437.772385484135, αₓ = 1483.333906139938)
 (α₀ = 20187.521449870146, αₓ = 63.30709585406235)
 (α₀ = 20424.362332155855, αₓ = -419.42268710601405)
 (α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
 (α₀ = 20013.690535193662, αₓ = -122.80055110522495)
 (α₀ = 109320.55276082881, αₓ = -99305.18846102979)
julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
```
Note that, indeed, the regression has a very steep slope in the first and last
groups, while in the interior groups the slope is small by comparison.
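To see what `X \ n` computes, it helps to try it on data where the answer is known. A minimal sketch with hypothetical noise-free points lying exactly on the line n = 1 + 2x:

```julia
# Made-up data: four points on the line n = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
n = 1 .+ 2 .* x

X = [ones(length(x)) x]   # design matrix: a column of ones (intercept) and x (slope)
α₀, αₓ = X \ n            # least-squares solution of X * [α₀, αₓ] ≈ n

α₀ ≈ 1 && αₓ ≈ 2          # the fit recovers the intercept and the slope
```

The steep slopes in the border groups are consistent with a simple geometric argument: for `x` in the first bin the window around `x` has length roughly `x + 0.1`, so with 100,000 observations the count should be close to `10_000 + 100_000 * x`, which is near the fitted intercept and slope for that group.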
</details>
### Exercise 10
Repeat exercise 9 but using the GLM.jl package. This time also
extract the p-value of the estimated coefficient for the `x` variable (the slope).
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).
<details>
<summary>Solution</summary>
```
julia> using GLM
julia> function fitlmmodel(group; info=false)
           model = lm(@formula(n~x), group)
           coefdf = DataFrame(coeftable(model))
           info && @show coefdf # to see what the data frame looks like
           α₀, αₓ = coefdf[:, "Coef."]
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitlmmodel (generic function with 1 method)
julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t        Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64  Float64   Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)   9900.19    0.388607  25476.1       0.0    9899.43    9900.95
   2 │ x            99131.1     6.75846   14667.7       0.0   99117.9    99144.4
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)    Lower 95%  Upper 95%
     │ String       Float64     Float64     Float64     Float64     Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19823.1        2.52926  7837.5      0.0         19818.2    19828.1
   2 │ x               81.6698    16.5512      4.93436  8.17139e-7     49.226    114.114
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)      Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64       Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19813.0        2.8427  6969.79    0.0            19807.4   19818.6
   2 │ x              424.009     11.2737    37.6106  1.32368e-289     401.91    446.108
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64   Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19810.7       3.98478  4971.59         0.0  19802.9    19818.5
   2 │ x              520.676    11.3429     45.9033       0.0    498.442    542.911
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t         Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19437.8      6.07925  3197.4         0.0   19425.9    19449.7
   2 │ x             1483.33    13.4768    110.065       0.0    1456.92    1509.75
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)     Lower 95%   Upper 95%
     │ String       Float64     Float64     Float64     Float64      Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20187.5        9.72795  2075.21     0.0          20168.5     20206.6
   2 │ x               63.3071    17.6538      3.58603  0.000337323     28.7022     97.912
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64      Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20424.4       10.2201  1998.45    0.0           20404.3   20444.4
   2 │ x             -419.423     15.7112   -26.6958  1.0356e-151    -450.22   -388.626
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64    Float64   Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20789.7      9.56063  2174.51         0.0   20771.0   20808.4
   2 │ x            -1022.98    12.7417    -80.2856       0.0   -1047.95   -998.001
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t         Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64   Float64      Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20013.7       8.86033  2258.8    0.0          19996.3    20031.1
   2 │ x             -122.801    10.4201    -11.785  7.60822e-32   -143.226   -102.375
coefdf = 2×7 DataFrame
 Row │ Name         Coef.           Std. Error  t         Pr(>|t|)  Lower 95%       Upper 95%
     │ String       Float64         Float64     Float64   Float64   Float64         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)       1.09321e5     5.78343   18902.4       0.0       1.09309e5       1.09332e5
   2 │ x            -99305.2           6.08269  -16325.9       0.0  -99317.1        -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776927, αₓ = 99131.1439420097)
 (α₀ = 19823.115188829663, αₓ = 81.66979172690417)
 (α₀ = 19812.98227244386, αₓ = 424.00895772074136)
 (α₀ = 19810.726510911398, αₓ = 520.6763238966264)
 (α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
 (α₀ = 20187.521449871012, αₓ = 63.307095852511125)
 (α₀ = 20424.36233216108, αₓ = -419.4226871140539)
 (α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
 (α₀ = 20013.69053519897, αₓ = -122.80055111148658)
 (α₀ = 109320.55276074051, αₓ = -99305.18846093686)
julia> combine(gdf, fitlmmodel)
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
```
We got the same results. The p-values of the slope are the entries in the `x` row
of the `Pr(>|t|)` column in the `coefdf` tables printed above.
The `combine(gdf, fitlmmodel)` style of using
the `combine` function is a bit more advanced and is not covered in the book.
It is used in cases, like this one, where you want to pass
a whole group to the function in `combine`. Check the DataFrames.jl documentation
for a more detailed explanation.
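If you want the p-value as a value rather than reading it off the printed tables, you can pick it out of the converted coefficient table. The sketch below is one possible way to do it; the helper `slope_pvalue` and the data frame `toy` are hypothetical, and the column names `Name` and `Pr(>|t|)` are taken from the output shown above:

```julia
using DataFrames
using GLM

# Hypothetical helper: returns the p-value of the `x` coefficient for one group.
function slope_pvalue(group)
    model = lm(@formula(n ~ x), group)
    coefdf = DataFrame(coeftable(model))
    # select the row named "x" and read its entry in the p-value column
    return (pvalue = only(coefdf[coefdf.Name .== "x", "Pr(>|t|)"]),)
end

# Small made-up data set: a clear linear trend plus deterministic noise.
toy = DataFrame(x = collect(1.0:20.0))
toy.n = 3 .+ 2 .* toy.x .+ sin.(toy.x)
slope_pvalue(toy)
```

Applied to the exercise data, `combine(gdf, slope_pvalue)` should then give one p-value per bin.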
</details>