# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 11

# Problems

### Exercise 1

Generate a data frame `df` having one column `x` consisting of 100,000 values
sampled from the uniform distribution on the [0, 1) interval.
Serialize it to disk, and then deserialize it. Check if the deserialized
object is the same as the source data frame.

<details>
<summary>Solution</summary>

```
julia> using DataFrames

julia> df = DataFrame(x=rand(100_000));

julia> using Serialization

julia> serialize("df.bin", df)

julia> deserialize("df.bin") == df
true
```

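The `==` check above compares contents, not identity. As a minimal sketch that uses only the standard library, with a plain vector standing in for the data frame (an assumption made to avoid package dependencies), the round trip below shows that the deserialized object is equal to the source but is a separate object in memory. The names `path`, `same_contents`, and `same_object` are illustrative only:

```julia
using Serialization

# Stand-in for the data frame column: 100,000 draws from the uniform
# distribution on [0, 1). This is an assumption; the exercise uses a DataFrame.
x = rand(100_000)

# Write to a temporary file so that the sketch leaves nothing behind.
path = tempname()
serialize(path, x)
y = deserialize(path)
rm(path)

same_contents = (y == x)   # value equality, the check used in the solution
same_object = (y === x)    # identity: is it literally the same array?
```

Here `same_contents` is `true` while `same_object` is `false`, which is the sense in which the deserialized object is "the same" as the source.
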
</details>

### Exercise 2

Add a column `n` to the `df` data frame that in each row will hold the
number of observations in column `x` that have distance less than `0.1` to
a value stored in a given row of `x`.

<details>
<summary>Solution</summary>

A simple approach is:
```
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
```

A more sophisticated approach (faster and allocating less memory) would be:
```
df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
```

An even faster solution that is type stable would use a function barrier:
```
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)
```

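The function barrier helps because the type of `df` alone does not tell the compiler the element type of `df.x`. The sketch below mimics that situation without DataFrames.jl, using a hypothetical `Dict{Symbol, AbstractVector}` container. This is an assumption made for illustration and is not how `DataFrame` is implemented; the names `cols` and `count_close` are also illustrative:

```julia
# Hypothetical stand-in for a data frame: the container type does not reveal
# the concrete type of the stored column.
cols = Dict{Symbol, AbstractVector}(:x => rand(1_000))

# Kernel function: when it is called, Julia compiles it for the concrete type
# of `x` (here Vector{Float64}), so the code inside is type stable.
count_close(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Without a barrier: the column is looked up in the container inside the closure.
n_direct = map(v -> count(w -> abs(w - v) < 0.1, cols[:x]), cols[:x])

# With a barrier: look the column up once and pass it to the kernel function.
n_barrier = count_close(cols[:x])
```

Both variants compute the same counts; the barrier only changes how well the inner code can be compiled.
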
Finally, you can work on sorted data to get much better performance. Here is an
example (it is a bit more advanced):
```
function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x) # you could add @inbounds here but I typically avoid it
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end
df.n = f2(df.x)
```

In this solution the fact that we used a function barrier is even more relevant,
as we explicitly use loops inside the function.

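One way to gain confidence in the sorted-data version is to compare it with the brute-force count on a small random vector. The sketch below repeats `f2` from above so that it can be run on its own; `brute` is a hypothetical helper name introduced only for this check:

```julia
# Brute-force reference: for each value count all values closer than 0.1.
brute(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Sorted-data version, repeated from the solution above.
function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x)
        idx += 1
        # move the left end of the window past values that are too far below
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        # move the right end of the window to the first value that is too far above
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end

x = rand(1_000)
agree = f2(x) == brute(x)

# A tiny hand-checkable case: 0.0 and 0.05 are neighbors, 0.5 is alone.
small = f2([0.0, 0.05, 0.5])
```

Each value counts itself as a neighbor, so `small` is `[2, 2, 1]`.
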
</details>

### Exercise 3

Investigate visually how `n` depends on `x` in data frame `df`.

<details>
<summary>Solution</summary>

```
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
```

As expected, the number of neighbors drops near the borders of the domain.

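The drop can be estimated directly. Assuming the values are uniform on [0, 1), the expected number of neighbors is the number of points times the length of the window `(x - 0.1, x + 0.1)` that falls inside the domain. The helper `expected_n` below is hypothetical and only sketches this reasoning:

```julia
# Expected number of neighbors of a point at position x, assuming N points
# drawn uniformly from [0, 1) and a neighborhood radius r (hypothetical helper).
expected_n(x; N=100_000, r=0.1) = N * (min(x + r, 1.0) - max(x - r, 0.0))

interior = expected_n(0.5)   # the full window of width 2r lies inside [0, 1)
border = expected_n(0.0)     # half of the window is cut off at the border
```

Under these assumptions the count is about 20,000 in the interior and falls linearly to about 10,000 at either border, which is consistent with the drop noted above.
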
</details>

### Exercise 4

Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```

Load this data into the `testdf` data frame.

<details>
<summary>Solution</summary>

```
julia> using CSV

julia> using DataFrames

julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
 Row │ x         sinx
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.139279  0.138829
   2 │ 0.456779  0.441059
   3 │ 0.344034  0.337287
   4 │ 0.140253  0.139794
   5 │ 0.848344  0.750186
   6 │ 0.977512  0.829109
   7 │ 0.032737  0.032731
   8 │ 0.70275   0.646318
   9 │ 0.422339  0.409895
  10 │ 0.393878  0.383772
```

</details>

### Exercise 5

Check the accuracy of the computed sine of `x` in `testdf`.
Print all rows for which the absolute difference is greater than `5e-7`.
For each such row display `x`, `sinx`, the exact value of `sin(x)`, and the absolute
difference.

<details>
<summary>Solution</summary>

Since the data frame is small we can use `eachrow`:

```
julia> for row in eachrow(testdf)
           sinx = sin(row.x)
           dev = abs(sinx - row.sinx)
           dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
       end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```

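The same check can also be written with broadcasting instead of a row loop. The sketch below is one possible alternative, not the book's solution; it copies the test data from Exercise 4 into plain vectors so that it runs without any packages, and the names `dev` and `bad` are illustrative:

```julia
# Test data copied from Exercise 4 as plain vectors (no packages needed).
x = [0.139279, 0.456779, 0.344034, 0.140253, 0.848344,
     0.977512, 0.032737, 0.702750, 0.422339, 0.393878]
sinx = [0.138829, 0.441059, 0.337287, 0.139794, 0.750186,
        0.829109, 0.032731, 0.646318, 0.409895, 0.383772]

# Absolute deviation of the stored values from sin(x), computed element-wise.
dev = abs.(sin.(x) .- sinx)

# Indices of the rows whose deviation exceeds the threshold.
bad = findall(>(5e-7), dev)

for i in bad
    println((x=x[i], computed=sin(x[i]), data=sinx[i], dev=dev[i]))
end
```

It flags rows 2 and 8, the same two rows as the `eachrow` loop. With the data frame from Exercise 4, a Boolean mask such as `dev .> 5e-7` could be used in the same way to select rows.
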
</details>

### Exercise 6

Group data in data frame `df` into buckets of 0.1 width and store the result in
the `gdf` grouped data frame (sort the groups). Use the `cut` function from
CategoricalArrays.jl to do it (check its documentation to learn how to do it).
Check the number of values in each group.

<details>
<summary>Solution</summary>

```
julia> using CategoricalArrays

julia> df.xbins = cut(df.x, 0.0:0.1:1.0);

julia> gdf = groupby(df, :xbins; sort=true);

julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
  9872
  9976
  9968
  9943
 10063
 10173
  9977
 10076
  9908
 10044

julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
 Row │ xbins       nrow
     │ Cat…        Int64
─────┼───────────────────
   1 │ [0.0, 0.1)   9872
   2 │ [0.1, 0.2)   9976
   3 │ [0.2, 0.3)   9968
   4 │ [0.3, 0.4)   9943
   5 │ [0.4, 0.5)  10063
   6 │ [0.5, 0.6)  10173
   7 │ [0.6, 0.7)   9977
   8 │ [0.7, 0.8)  10076
   9 │ [0.8, 0.9)   9908
  10 │ [0.9, 1.0)  10044
```

You might get slightly different numbers, but all should be around 10,000.

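To see what the left-closed buckets `[0.0, 0.1)`, `[0.1, 0.2)`, and so on mean, the sketch below reproduces the bucketing with Base functions only. It is not how `cut` is implemented; `bin_index` and `bin_counts` are hypothetical helpers that merely mimic the same bucket boundaries:

```julia
# Left-closed bucket edges, matching the breaks passed to `cut` in the solution.
edges = 0.0:0.1:1.0

# Hypothetical helper: index i of the bucket [edges[i], edges[i+1]) containing v,
# found as the position of the last edge that is less than or equal to v.
bin_index(v) = searchsortedlast(edges, v)

# Count how many values fall into each of the 10 buckets.
function bin_counts(x)
    counts = zeros(Int, length(edges) - 1)
    for v in x
        counts[bin_index(v)] += 1
    end
    return counts
end

x = rand(100_000)   # values in [0, 1), as in Exercise 1
counts = bin_counts(x)
```

As in the solution, each of the ten entries of `counts` should be close to 10,000.
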
</details>

### Exercise 7

Display the grouping keys in the `gdf` grouped data frame. Show them as named tuples.
Check what the group order would be if you asked not to sort them.

<details>
<summary>Solution</summary>

```
julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.6, 0.7)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.9, 1.0)",)

julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.9, 1.0)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.6, 0.7)",)
```

If you pass `sort=false` instead of `sort=true` you get groups in their order
of appearance in `df`. If you skip the `sort` keyword argument,
the resulting group order can depend on the type of the grouping column, so if
you want to rely on the order of groups, always pass the `sort` keyword argument
explicitly.

</details>

### Exercise 8

Compute the average `n` for each group in `gdf`.

<details>
<summary>Solution</summary>

```
julia> using Statistics

julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
 14845.847751215559
 19835.367882919007
 19919.195826645264
 19993.023936437694
 20105.506111497565
 20222.35761329008
 20151.794727874112
 20022.69610956729
 19909.331550262414
 14944.511449621665

julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
 Row │ xbins       n_mean
     │ Cat…        Float64
─────┼─────────────────────
   1 │ [0.0, 0.1)  14845.8
   2 │ [0.1, 0.2)  19835.4
   3 │ [0.2, 0.3)  19919.2
   4 │ [0.3, 0.4)  19993.0
   5 │ [0.4, 0.5)  20105.5
   6 │ [0.5, 0.6)  20222.4
   7 │ [0.6, 0.7)  20151.8
   8 │ [0.7, 0.8)  20022.7
   9 │ [0.8, 0.9)  19909.3
  10 │ [0.9, 1.0)  14944.5
```

</details>

### Exercise 9

Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as a named tuple having fields `α₀` and `αₓ`.

<details>
<summary>Solution</summary>

```
julia> function fitmodel(x, n)
           X = [ones(length(x)) x]
           α₀, αₓ = X \ n
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitmodel (generic function with 1 method)

julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776916, αₓ = 99131.14394200995)
 (α₀ = 19823.115188829383, αₓ = 81.66979172871368)
 (α₀ = 19812.9822724435, αₓ = 424.00895772216785)
 (α₀ = 19810.726510910834, αₓ = 520.6763238983195)
 (α₀ = 19437.772385484135, αₓ = 1483.333906139938)
 (α₀ = 20187.521449870146, αₓ = 63.30709585406235)
 (α₀ = 20424.362332155855, αₓ = -419.42268710601405)
 (α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
 (α₀ = 20013.690535193662, αₓ = -122.80055110522495)
 (α₀ = 109320.55276082881, αₓ = -99305.18846102979)

julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
```

We note that, indeed, in the first and last groups the regression has a steep
slope (about 99,131 and -99,305, respectively), while in the interior groups
the slope is much closer to zero.

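The size of the slope in the border groups can be anticipated. Assuming a uniform sample of N = 100,000 points, for `x` in [0, 0.1) the neighborhood is cut off at 0, so the expected count is roughly `N * (x + 0.1)`. The sketch below fits this idealized relationship (not the actual data) with the same design matrix as `fitmodel`; the variable names are illustrative:

```julia
using LinearAlgebra

# Idealized neighbor count near the left border: for x in [0, 0.1) the window
# (x - 0.1, x + 0.1) is cut at 0, so the expected count is N * (x + 0.1).
# This is an assumption-based model, not the data from the exercises.
N = 100_000
x = collect(range(0.0, 0.1; length=101))
n = N .* (x .+ 0.1)

# Same design matrix as in `fitmodel`: a column of ones and the column `x`.
X = [ones(length(x)) x]

# For a tall matrix `\` returns the least-squares solution of X * α ≈ n.
α₀, αₓ = X \ n
```

The idealized fit gives an intercept of about 10,000 and a slope of about 100,000, close to the 9900.19 and 99131.1 estimated for the first group. By the same argument the last group should have a slope of about -100,000.
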
</details>

### Exercise 10

Repeat exercise 9 but using the GLM.jl package. This time
extract the p-value for the estimated coefficient of the `x` variable (the slope).
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).

<details>
<summary>Solution</summary>

```
julia> using GLM

julia> function fitlmmodel(group; info=false)
           model = lm(@formula(n~x), group)
           coefdf = DataFrame(coeftable(model))
           info && @show coefdf # to see what the data frame looks like
           α₀, αₓ = coefdf[:, "Coef."]
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitlmmodel (generic function with 1 method)

julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t        Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64  Float64   Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)   9900.19    0.388607  25476.1       0.0    9899.43    9900.95
   2 │ x            99131.1     6.75846   14667.7       0.0   99117.9    99144.4
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)    Lower 95%  Upper 95%
     │ String       Float64     Float64     Float64     Float64     Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19823.1        2.52926  7837.5      0.0         19818.2    19828.1
   2 │ x               81.6698    16.5512      4.93436  8.17139e-7     49.226    114.114
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)      Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64       Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19813.0        2.8427  6969.79    0.0            19807.4   19818.6
   2 │ x              424.009     11.2737    37.6106  1.32368e-289     401.91    446.108
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64   Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19810.7       3.98478  4971.59         0.0  19802.9    19818.5
   2 │ x              520.676    11.3429     45.9033       0.0    498.442    542.911
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t         Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19437.8      6.07925  3197.4         0.0   19425.9    19449.7
   2 │ x             1483.33    13.4768    110.065       0.0    1456.92    1509.75
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)     Lower 95%   Upper 95%
     │ String       Float64     Float64     Float64     Float64      Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20187.5        9.72795  2075.21     0.0          20168.5     20206.6
   2 │ x               63.3071    17.6538      3.58603  0.000337323     28.7022     97.912
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64      Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20424.4       10.2201  1998.45    0.0           20404.3   20444.4
   2 │ x             -419.423     15.7112   -26.6958  1.0356e-151    -450.22   -388.626
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64    Float64   Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20789.7      9.56063  2174.51         0.0   20771.0   20808.4
   2 │ x            -1022.98    12.7417    -80.2856       0.0  -1047.95   -998.001
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t         Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64   Float64      Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20013.7       8.86033  2258.8    0.0          19996.3    20031.1
   2 │ x             -122.801    10.4201    -11.785  7.60822e-32   -143.226   -102.375
coefdf = 2×7 DataFrame
 Row │ Name         Coef.           Std. Error  t         Pr(>|t|)  Lower 95%       Upper 95%
     │ String       Float64         Float64     Float64   Float64   Float64         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)       1.09321e5     5.78343   18902.4       0.0       1.09309e5       1.09332e5
   2 │ x            -99305.2           6.08269  -16325.9       0.0  -99317.1        -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776927, αₓ = 99131.1439420097)
 (α₀ = 19823.115188829663, αₓ = 81.66979172690417)
 (α₀ = 19812.98227244386, αₓ = 424.00895772074136)
 (α₀ = 19810.726510911398, αₓ = 520.6763238966264)
 (α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
 (α₀ = 20187.521449871012, αₓ = 63.307095852511125)
 (α₀ = 20424.36233216108, αₓ = -419.4226871140539)
 (α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
 (α₀ = 20013.69053519897, αₓ = -122.80055111148658)
 (α₀ = 109320.55276074051, αₓ = -99305.18846093686)

julia> combine(gdf, fitlmmodel)
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
```

We got the same results as in exercise 9 (up to floating-point rounding).
The p-value requested in the exercise is stored in the `Pr(>|t|)` column of
`coefdf`; for the slope it is `coefdf[2, "Pr(>|t|)"]`. The
`combine(gdf, fitlmmodel)` style of using
the `combine` function is a bit more advanced and is not covered in the book.
It is useful in cases like this one, when you want to pass
a whole group to the function in `combine`. Check the DataFrames.jl documentation
for more detailed explanations.

</details>