2022-10-14 12:27:04 +02:00

# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 11

# Problems

### Exercise 1

Generate a data frame `df` having one column `x` consisting of 100,000 values
sampled from the uniform distribution on the [0, 1) interval.
Serialize it to disk, and then deserialize it. Check whether the deserialized
object is the same as the source data frame.

<details>
<summary>Solution</summary>

```
julia> using DataFrames

julia> df = DataFrame(x=rand(100_000));

julia> using Serialization

julia> serialize("df.bin", df)

julia> deserialize("df.bin") == df
true
```
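
Note that `==` compares contents, so the check above confirms equal values rather than object identity. Here is a minimal stdlib-only sketch of the same round trip, using a plain vector as a stand-in for the `x` column and a temporary file. Both choices are assumptions made for illustration and are not part of the original solution:

```julia
using Serialization

# Stand-in for the `x` column of `df`; a plain vector keeps the sketch stdlib-only.
x = rand(100_000)

# Use a temporary path (an illustrative choice) instead of "df.bin".
path = tempname()
serialize(path, x)
x2 = deserialize(path)
rm(path)

same_values = x2 == x    # true: the contents round-trip exactly
same_object = x2 === x   # false: deserialization builds a new object
```

So "the same" here means equal in value; the deserialized object is a fresh copy.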

</details>

### Exercise 2

Add a column `n` to the `df` data frame that, in each row, holds the
number of observations in column `x` whose distance to the value stored
in that row of `x` is less than `0.1`.

<details>
<summary>Solution</summary>

A simple approach is:
```
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
```

A more sophisticated approach (faster and allocating less memory) would be:
```
df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
```

An even faster, type-stable solution uses a function barrier
(inside `f` the compiler knows the concrete type of `x`):
```
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)
```
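
The function-barrier idea can be illustrated without DataFrames.jl. In the sketch below a `Dict{Symbol,Any}` stands in for a data frame, on the assumption that, as with a data frame, the compiler cannot infer the element type of a stored column; the names `cols` and `count_close` are made up for this example:

```julia
# A container whose value type is hidden from the compiler,
# standing in for a data frame whose column types are not part of its type.
cols = Dict{Symbol,Any}(:x => rand(1_000))

# Kernel: compiled separately for the concrete type of `x` it receives.
count_close(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Barrier: the dynamic lookup happens once, here; the loops run in specialized code.
n = count_close(cols[:x])
```

The untyped lookup `cols[:x]` is paid once at the call site, while all the work inside `count_close` operates on a `Vector{Float64}`.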

Finally, you can work on sorted data to get much better performance. Here is an
example (it is a bit more advanced):
```
function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x) # you could add @inbounds here but I typically avoid it
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end
df.n = f2(df.x)
```

In this solution the function barrier is even more relevant,
as we explicitly use loops inside the function.
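
As a sanity check for any optimized variant, it can help to compare it with the brute-force definition on a tiny input. The sketch below does this for a different sorted-data technique, binary search with `searchsortedlast` and `searchsortedfirst`, rather than the two-pointer loop above; the function names and the sample vector are made up for illustration. Because it compares against `v - d` and `v + d` instead of computing `abs(w - v)`, values lying at a distance of almost exactly `0.1` could in principle be classified differently due to floating-point rounding.

```julia
# Brute-force reference: compares every pair of values.
count_close_bruteforce(x; d=0.1) = map(v -> count(w -> abs(w - v) < d, x), x)

# Sorted variant: binary-search the block of sorted values inside (v - d, v + d).
function count_close_sorted(x; d=0.1)
    s = sort(x)
    return map(x) do v
        lo = searchsortedlast(s, v - d) + 1    # first index with s[i] > v - d
        hi = searchsortedfirst(s, v + d) - 1   # last index with s[i] < v + d
        hi - lo + 1
    end
end

# Small hand-checkable input with no distances close to 0.1.
x = [0.25, 0.0, 0.9, 0.05, 0.31, 0.2]
agree = count_close_bruteforce(x) == count_close_sorted(x)
```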

</details>

### Exercise 3

Investigate visually how `n` depends on `x` in data frame `df`.

<details>
<summary>Solution</summary>

```
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
```

As expected, the number of neighbors drops near the borders of the domain.
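
The drop can be quantified with a back-of-the-envelope calculation, assuming the 100,000 values are spread uniformly: the expected count at position `x` is the sample size times the length of the part of `(x - 0.1, x + 0.1)` that lies inside [0, 1). The helper name `expected_n` is hypothetical:

```julia
# Expected neighbor count at position x, assuming N uniform draws on [0, 1):
# N times the length of the overlap between (x - d, x + d) and [0, 1).
expected_n(x; N=100_000, d=0.1) = N * (min(x + d, 1.0) - max(x - d, 0.0))

interior = expected_n(0.5)    # full window of width 0.2 fits inside the domain
border   = expected_n(0.0)    # only half of the window lies inside the domain
halfway  = expected_n(0.05)   # three quarters of the window lie inside
```

Under this approximation the count is flat at about 20,000 in the interior and falls linearly to about 10,000 at either border.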

</details>

### Exercise 4

Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```

Load this data into the `testdf` data frame.

<details>
<summary>Solution</summary>

```
julia> using CSV

julia> using DataFrames

julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
 Row │ x         sinx
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.139279  0.138829
   2 │ 0.456779  0.441059
   3 │ 0.344034  0.337287
   4 │ 0.140253  0.139794
   5 │ 0.848344  0.750186
   6 │ 0.977512  0.829109
   7 │ 0.032737  0.032731
   8 │ 0.70275   0.646318
   9 │ 0.422339  0.409895
  10 │ 0.393878  0.383772
```

</details>

### Exercise 5

Check the accuracy of the computed sine of `x` in `testdf`.
Print all rows for which the absolute difference is greater than `5e-7`.
For these rows display `x`, `sinx`, the exact value of `sin(x)`, and the absolute
difference.

<details>
<summary>Solution</summary>

Since the data frame is small, we can use `eachrow`:

```
julia> for row in eachrow(testdf)
           sinx = sin(row.x)
           dev = abs(sinx - row.sinx)
           dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
       end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```
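
One way to read the threshold, assuming the `sinx` values were meant to be rounded to six decimal places: correct rounding leaves an error of at most `5e-7`, so a larger deviation suggests the stored value is not the correctly rounded one. A stdlib-only sketch for the two flagged rows (the vectors below simply repeat values from `teststr`):

```julia
x    = [0.456779, 0.702750]   # the two flagged inputs from `teststr`
sinx = [0.441059, 0.646318]   # the values stored in the test data

rounded = round.(sin.(x), digits=6)   # what rounding to six digits would give
dev     = abs.(sin.(x) .- sinx)       # deviation of the stored values
```

For both rows `rounded` differs from the stored value in the last digit, which is consistent with the stored values having been truncated rather than rounded.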

</details>

### Exercise 6

Group the data in data frame `df` into buckets of width 0.1 and store the result in
the `gdf` grouped data frame (sort the groups). Use the `cut` function from
CategoricalArrays.jl to do it (check its documentation to learn how).
Check the number of values in each group.

<details>
<summary>Solution</summary>

```
julia> using CategoricalArrays

julia> df.xbins = cut(df.x, 0.0:0.1:1.0);

julia> gdf = groupby(df, :xbins; sort=true);

julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
  9872
  9976
  9968
  9943
 10063
 10173
  9977
 10076
  9908
 10044

julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
 Row │ xbins       nrow
     │ Cat…        Int64
─────┼───────────────────
   1 │ [0.0, 0.1)   9872
   2 │ [0.1, 0.2)   9976
   3 │ [0.2, 0.3)   9968
   4 │ [0.3, 0.4)   9943
   5 │ [0.4, 0.5)  10063
   6 │ [0.5, 0.6)  10173
   7 │ [0.6, 0.7)   9977
   8 │ [0.7, 0.8)  10076
   9 │ [0.8, 0.9)   9908
  10 │ [0.9, 1.0)  10044
```

You might get slightly different numbers, since `x` is random, but all should be around 10,000.
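
To see what the binning amounts to, here is a stdlib-only sketch that assigns each value to a left-closed bucket with `searchsortedlast`. It illustrates the idea only and makes no claim about how `cut` is implemented; the variable names are made up:

```julia
breaks = collect(0.0:0.1:1.0)   # 11 break points delimiting 10 buckets
x = rand(100_000)

# Index i of the last break ≤ v, so that v lies in [breaks[i], breaks[i+1]).
bins = [searchsortedlast(breaks, v) for v in x]

# Number of values falling into each of the 10 buckets.
counts = [count(==(i), bins) for i in 1:length(breaks)-1]
```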

</details>

### Exercise 7

Display the grouping keys in the `gdf` grouped data frame. Show them as named tuples.
Check what the group order would be if you asked not to sort them.

<details>
<summary>Solution</summary>

```
julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.6, 0.7)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.9, 1.0)",)

julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.9, 1.0)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.6, 0.7)",)
```

If you pass `sort=false` instead of `sort=true`, you get the groups in their order
of appearance in `df`. If you skip the `sort` keyword argument,
the resulting group order could depend on the type of the grouping column, so if
you want to rely on the order of groups, always pass the `sort` keyword argument
explicitly.

</details>

### Exercise 8

Compute the average `n` for each group in `gdf`.

<details>
<summary>Solution</summary>

```
julia> using Statistics

julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
 14845.847751215559
 19835.367882919007
 19919.195826645264
 19993.023936437694
 20105.506111497565
 20222.35761329008
 20151.794727874112
 20022.69610956729
 19909.331550262414
 14944.511449621665

julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
 Row │ xbins       n_mean
     │ Cat…        Float64
─────┼─────────────────────
   1 │ [0.0, 0.1)  14845.8
   2 │ [0.1, 0.2)  19835.4
   3 │ [0.2, 0.3)  19919.2
   4 │ [0.3, 0.4)  19993.0
   5 │ [0.4, 0.5)  20105.5
   6 │ [0.5, 0.6)  20222.4
   7 │ [0.6, 0.7)  20151.8
   8 │ [0.7, 0.8)  20022.7
   9 │ [0.8, 0.9)  19909.3
  10 │ [0.9, 1.0)  14944.5
```

</details>

### Exercise 9

Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as a named tuple with fields `α₀` and `αₓ`.

<details>
<summary>Solution</summary>

```
julia> function fitmodel(x, n)
           X = [ones(length(x)) x]
           α₀, αₓ = X \ n
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitmodel (generic function with 1 method)

julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776916, αₓ = 99131.14394200995)
 (α₀ = 19823.115188829383, αₓ = 81.66979172871368)
 (α₀ = 19812.9822724435, αₓ = 424.00895772216785)
 (α₀ = 19810.726510910834, αₓ = 520.6763238983195)
 (α₀ = 19437.772385484135, αₓ = 1483.333906139938)
 (α₀ = 20187.521449870146, αₓ = 63.30709585406235)
 (α₀ = 20424.362332155855, αₓ = -419.42268710601405)
 (α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
 (α₀ = 20013.690535193662, αₓ = -122.80055110522495)
 (α₀ = 109320.55276082881, αₓ = -99305.18846102979)

julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
```

Indeed, in the first and last groups the regression has a steep slope
(roughly ±99,000), whereas in the interior groups the slope is comparatively small.
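
The steep slopes in the border groups fit a simple expectation, assuming uniformly spread values: near the left border the count grows roughly like `100_000 * (x + 0.1)`, that is, an intercept of about 10,000 and a slope of about 100,000. The sketch below uses synthetic noise-free data, chosen purely for illustration, to show that the `\` operator recovers such coefficients:

```julia
using LinearAlgebra

x = collect(0.0:0.01:0.09)     # synthetic positions inside [0, 0.1)
n = 10_000 .+ 100_000 .* x     # exact line: intercept 10_000, slope 100_000

X = [ones(length(x)) x]        # design matrix: constant column and x
α₀, αₓ = X \ n                 # least-squares solution
```

The fitted values for the first group above (about 9,900 and 99,131) are close to these idealized numbers.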

</details>

### Exercise 10

Repeat exercise 9, but this time use the GLM.jl package. Additionally,
extract the p-value for the estimated coefficient of the `x` variable.
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).

<details>
<summary>Solution</summary>

```
julia> using GLM

julia> function fitlmmodel(group; info=false)
           model = lm(@formula(n~x), group)
           coefdf = DataFrame(coeftable(model))
           info && @show coefdf # to see what the data frame looks like
           α₀, αₓ = coefdf[:, "Coef."]
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitlmmodel (generic function with 1 method)

julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t        Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64  Float64   Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)   9900.19    0.388607  25476.1       0.0    9899.43    9900.95
   2 │ x            99131.1     6.75846   14667.7       0.0   99117.9    99144.4
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)    Lower 95%  Upper 95%
     │ String       Float64     Float64     Float64     Float64     Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19823.1        2.52926  7837.5      0.0         19818.2    19828.1
   2 │ x               81.6698    16.5512      4.93436  8.17139e-7     49.226    114.114
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)      Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64       Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19813.0        2.8427  6969.79    0.0            19807.4   19818.6
   2 │ x              424.009     11.2737    37.6106  1.32368e-289     401.91    446.108
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64   Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19810.7       3.98478  4971.59         0.0  19802.9    19818.5
   2 │ x              520.676    11.3429     45.9033       0.0    498.442    542.911
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t         Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19437.8      6.07925  3197.4         0.0   19425.9    19449.7
   2 │ x             1483.33    13.4768    110.065       0.0    1456.92    1509.75
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)     Lower 95%   Upper 95%
     │ String       Float64     Float64     Float64     Float64      Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20187.5        9.72795  2075.21     0.0          20168.5     20206.6
   2 │ x               63.3071    17.6538      3.58603  0.000337323     28.7022     97.912
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64      Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20424.4       10.2201  1998.45    0.0           20404.3   20444.4
   2 │ x             -419.423     15.7112   -26.6958  1.0356e-151    -450.22   -388.626
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64    Float64   Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20789.7      9.56063  2174.51         0.0   20771.0   20808.4
   2 │ x            -1022.98    12.7417    -80.2856       0.0   -1047.95   -998.001
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t         Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64   Float64      Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20013.7       8.86033  2258.8    0.0          19996.3    20031.1
   2 │ x             -122.801    10.4201    -11.785  7.60822e-32   -143.226   -102.375
coefdf = 2×7 DataFrame
 Row │ Name         Coef.           Std. Error  t         Pr(>|t|)  Lower 95%       Upper 95%
     │ String       Float64         Float64     Float64   Float64   Float64         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)       1.09321e5     5.78343   18902.4       0.0       1.09309e5       1.09332e5
   2 │ x            -99305.2           6.08269  -16325.9       0.0  -99317.1        -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776927, αₓ = 99131.1439420097)
 (α₀ = 19823.115188829663, αₓ = 81.66979172690417)
 (α₀ = 19812.98227244386, αₓ = 424.00895772074136)
 (α₀ = 19810.726510911398, αₓ = 520.6763238966264)
 (α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
 (α₀ = 20187.521449871012, αₓ = 63.307095852511125)
 (α₀ = 20424.36233216108, αₓ = -419.4226871140539)
 (α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
 (α₀ = 20013.69053519897, αₓ = -122.80055111148658)
 (α₀ = 109320.55276074051, αₓ = -99305.18846093686)

julia> combine(gdf, fitlmmodel)
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
```

We got the same coefficients as in exercise 9. The `combine(gdf, fitlmmodel)` style of using
the `combine` function is a bit more advanced and is not covered in the book.
It is used in cases like this one, when you want to pass
a whole group to the function in `combine`. Check the DataFrames.jl documentation
for more detailed explanations.
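
The function above returns only the coefficients. Here is a minimal sketch of reading the p-value of the `x` coefficient from the same table, assuming the column is named `Pr(>|t|)` as in the printed output above (the name could differ between GLM.jl versions). The function `slope_pvalue` and the `demo` data are hypothetical:

```julia
using DataFrames, GLM

function slope_pvalue(group)
    model = lm(@formula(n ~ x), group)
    coefdf = DataFrame(coeftable(model))
    # Row 2 holds the `x` coefficient; the column name follows the printed table above.
    return coefdf[2, "Pr(>|t|)"]
end

# Tiny made-up data set with a clear linear trend.
demo = DataFrame(x=[0.0, 0.1, 0.2, 0.3, 0.4], n=[1.0, 2.1, 2.9, 4.2, 4.8])
p = slope_pvalue(demo)
```

With the grouped data from the exercise, `[slope_pvalue(group) for group in gdf]` would then collect one p-value per group.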
</details>