Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 11
Problems
Exercise 1
Generate a data frame df having one column x consisting of 100,000 values sampled from the uniform distribution on the [0, 1[ interval. Serialize it to disk, and then deserialize it. Check whether the deserialized object is the same as the source data frame.
Solution
julia> using DataFrames
julia> df = DataFrame(x=rand(100_000));
julia> using Serialization
julia> serialize("df.bin", df)
julia> deserialize("df.bin") == df
true
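The round trip can also be tried without DataFrames.jl. The sketch below is only an illustration: it serializes a plain vector to a temporary file (the path comes from tempname(), so nothing is left in the working directory) and compares the restored value with the original. The names data, path, restored, same_values, and same_object are made up for this example.

```julia
using Serialization

data = rand(1_000)               # stand-in for the df.x column
path = tempname()                # temporary file path instead of "df.bin"

serialize(path, data)            # write the binary representation to disk
restored = deserialize(path)     # read it back into a new object

same_values = restored == data   # true: the contents are equal
same_object = restored === data  # false: deserialization creates a new object
rm(path)                         # clean up the temporary file
```

Serialization is generally best suited to short-term storage, since the binary format may change between Julia versions.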
Exercise 2
Add a column n to the df data frame that in each row holds the number of observations in column x whose distance to the value stored in that row of x is less than 0.1.
Solution
A simple approach is:
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
A more sophisticated approach (faster and allocating less memory) would be:
df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
An even faster solution that is type stable uses a function barrier:
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)
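To see why the function barrier helps, note that the compiler cannot infer the element type of a column from the type of the data frame alone, so top-level code that touches df.x directly works with a value of unknown type. The sketch below mimics that situation with a Dict{Symbol,Any} as a hypothetical stand-in for the data frame; table and count_close are illustrative names, not part of any package.

```julia
# Hypothetical stand-in for a data frame: column types are hidden behind `Any`.
table = Dict{Symbol,Any}(:x => [0.0, 0.05, 0.12, 0.5, 0.55, 0.95])

# Function barrier: inside count_close, x has a concrete type
# (here Vector{Float64}), so Julia compiles specialized code for it.
count_close(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

n = count_close(table[:x])  # one dynamic dispatch at the call, fast code inside
```

For this small input the result is [2, 3, 2, 2, 2, 1]; each point counts itself as its own neighbor.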
Finally, you can work on sorted data to get much better performance. Here is an example (it is a bit more advanced):
function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x) # you could add @inbounds here but I typically avoid it
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end
df.n = f2(df.x)
In this solution, the use of a function barrier matters even more, since we explicitly use loops inside the function.
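Another way to exploit sorting, shown here only as a sketch and not as the method used above, is to sort a copy of the data once and locate the window boundaries by binary search with searchsortedlast and searchsortedfirst from Base. The function name count_close_sorted and the radius keyword are assumptions made for this example. Because it compares against the rounded values v - radius and v + radius instead of computing abs(w - v), it can disagree with f2 for pairs of points that lie almost exactly radius apart.

```julia
# Count, for every v in x, the elements w with v - radius < w < v + radius.
function count_close_sorted(x; radius=0.1)
    s = sort(x)                                # sorted copy, O(n log n)
    return map(x) do v
        lo = searchsortedlast(s, v - radius)   # number of elements <= v - radius
        hi = searchsortedfirst(s, v + radius)  # index of first element >= v + radius
        hi - lo - 1                            # elements strictly inside the window
    end
end

counts = count_close_sorted([0.0, 0.05, 0.12, 0.5, 0.55, 0.95])
```

Each lookup costs O(log n), so the whole computation stays at O(n log n), the same order as the sorting step.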
Exercise 3
Investigate visually how n depends on x in data frame df.
Solution
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
As expected, the number of neighbors drops near the borders of the domain.
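The shape of the plot can be anticipated with a short calculation. For N points drawn uniformly from the unit interval, the expected number of points within distance r of x is roughly N times the length of the overlap between the window (x - r, x + r) and [0, 1]. The helper expected_neighbors below is a hypothetical name introduced for this sketch, with N = 100,000 and r = 0.1 taken from the exercises.

```julia
# Approximate expected neighbor count for a point at position x.
# Assumes the data are uniform on [0, 1]; N and r default to the exercise values.
expected_neighbors(x; N=100_000, r=0.1) = N * (min(1.0, x + r) - max(0.0, x - r))

interior = expected_neighbors(0.5)   # about 20_000 in the interior
border = expected_neighbors(0.0)     # about 10_000 at the border
halfway = expected_neighbors(0.05)   # about 15_000 halfway into the border region
```

Under this approximation n rises linearly from about 10,000 at x = 0 to about 20,000 at x = 0.1, stays flat in the interior, and falls symmetrically near x = 1.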
Exercise 4
Someone has prepared the following test data for you:
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
Load this data into the testdf data frame.
Solution
julia> using CSV
julia> using DataFrames
julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
 Row │ x         sinx
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.139279  0.138829
   2 │ 0.456779  0.441059
   3 │ 0.344034  0.337287
   4 │ 0.140253  0.139794
   5 │ 0.848344  0.750186
   6 │ 0.977512  0.829109
   7 │ 0.032737  0.032731
   8 │ 0.70275   0.646318
   9 │ 0.422339  0.409895
  10 │ 0.393878  0.383772
Exercise 5
Check the accuracy of the computed sine of x in testdf. Print all rows for which the absolute difference is greater than 5e-7. In this case display x, sinx, the exact value of sin(x), and the absolute difference.
Solution
Since the data frame is small, we can use eachrow:
julia> for row in eachrow(testdf)
           sinx = sin(row.x)
           dev = abs(sinx - row.sinx)
           dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
       end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
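The same check can be written without an explicit loop over rows by broadcasting over columns. The sketch below uses plain vectors copied from the test data so that it runs without CSV.jl or DataFrames.jl; with the data frame at hand, testdf.x and testdf.sinx could be used in their place. The names dev and bad are made up for this example.

```julia
x    = [0.139279, 0.456779, 0.344034, 0.140253, 0.848344,
        0.977512, 0.032737, 0.702750, 0.422339, 0.393878]
sinx = [0.138829, 0.441059, 0.337287, 0.139794, 0.750186,
        0.829109, 0.032731, 0.646318, 0.409895, 0.383772]

dev = abs.(sin.(x) .- sinx)   # element-wise absolute deviation
bad = findall(>(5e-7), dev)   # indices of rows exceeding the tolerance

for i in bad
    println((x=x[i], computed=sin(x[i]), data=sinx[i], dev=dev[i]))
end
```

The indices in bad correspond to the two rows printed by the loop-based solution above.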
Exercise 6
Group the data in data frame df into buckets of width 0.1 and store the result in the gdf data frame (sort the groups). Use the cut function from CategoricalArrays.jl to do it (check its documentation to learn how to do it). Check the number of values in each group.
Solution
julia> using CategoricalArrays
julia> df.xbins = cut(df.x, 0.0:0.1:1.0);
julia> gdf = groupby(df, :xbins; sort=true);
julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
9872
9976
9968
9943
10063
10173
9977
10076
9908
10044
julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
 Row │ xbins       nrow
     │ Cat…        Int64
─────┼───────────────────
   1 │ [0.0, 0.1)   9872
   2 │ [0.1, 0.2)   9976
   3 │ [0.2, 0.3)   9968
   4 │ [0.3, 0.4)   9943
   5 │ [0.4, 0.5)  10063
   6 │ [0.5, 0.6)  10173
   7 │ [0.6, 0.7)   9977
   8 │ [0.7, 0.8)  10076
   9 │ [0.8, 0.9)   9908
  10 │ [0.9, 1.0)  10044
You might get slightly different numbers, but all should be around 10,000.
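To build intuition for what the binning does, here is a rough equivalent in plain Julia. It is not how cut is implemented, and it returns integer bin numbers rather than categorical labels; edges, bin_index, and the sample values are assumptions made for this illustration.

```julia
edges = collect(0.0:0.1:1.0)        # the same break points that were passed to cut

# Bin i covers the half-open interval [edges[i], edges[i + 1]).
bin_index(v) = searchsortedlast(edges, v)

sample = [0.05, 0.15, 0.17, 0.95]   # made-up values for the demo
bins = bin_index.(sample)           # bin number of each value

# Count how many values fall into each of the 10 bins.
counts = [count(==(i), bins) for i in 1:length(edges) - 1]
```

Here the values land in bins 1, 2, 2, and 10, which correspond to the labels [0.0, 0.1), [0.1, 0.2), and [0.9, 1.0).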
Exercise 7
Display the grouping keys in the gdf grouped data frame. Show them as named tuples. Check what the group order would be if you asked not to sort them.
Solution
julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
(xbins = "[0.0, 0.1)",)
(xbins = "[0.1, 0.2)",)
(xbins = "[0.2, 0.3)",)
(xbins = "[0.3, 0.4)",)
(xbins = "[0.4, 0.5)",)
(xbins = "[0.5, 0.6)",)
(xbins = "[0.6, 0.7)",)
(xbins = "[0.7, 0.8)",)
(xbins = "[0.8, 0.9)",)
(xbins = "[0.9, 1.0)",)
julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
(xbins = "[0.4, 0.5)",)
(xbins = "[0.9, 1.0)",)
(xbins = "[0.8, 0.9)",)
(xbins = "[0.0, 0.1)",)
(xbins = "[0.2, 0.3)",)
(xbins = "[0.5, 0.6)",)
(xbins = "[0.7, 0.8)",)
(xbins = "[0.3, 0.4)",)
(xbins = "[0.1, 0.2)",)
(xbins = "[0.6, 0.7)",)
If you pass sort=false instead of sort=true, you get the groups in their order of appearance in df. If you skip the sort keyword argument, the resulting group order could depend on the type of the grouping column, so if you rely on the order of groups, always pass the sort keyword argument explicitly.
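The difference between the two orderings can be mimicked on a plain vector of labels: unique keeps values in order of first appearance, while sorting the unique values gives an ordered variant. This is only an analogy for groupby, not a description of its internals, and the labels are made up.

```julia
labels = ["[0.4, 0.5)", "[0.9, 1.0)", "[0.4, 0.5)", "[0.0, 0.1)", "[0.9, 1.0)"]

appearance_order = unique(labels)    # analogous to sort=false: first-seen order
sorted_order = sort(unique(labels))  # analogous to sort=true: ascending order
```

The first-seen order starts with "[0.4, 0.5)", while the sorted order starts with "[0.0, 0.1)".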
Exercise 8
Compute the average n for each group in gdf.
Solution
julia> using Statistics
julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
14845.847751215559
19835.367882919007
19919.195826645264
19993.023936437694
20105.506111497565
20222.35761329008
20151.794727874112
20022.69610956729
19909.331550262414
14944.511449621665
julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
 Row │ xbins       n_mean
     │ Cat…        Float64
─────┼─────────────────────
   1 │ [0.0, 0.1)  14845.8
   2 │ [0.1, 0.2)  19835.4
   3 │ [0.2, 0.3)  19919.2
   4 │ [0.3, 0.4)  19993.0
   5 │ [0.4, 0.5)  20105.5
   6 │ [0.5, 0.6)  20222.4
   7 │ [0.6, 0.7)  20151.8
   8 │ [0.7, 0.8)  20022.7
   9 │ [0.8, 0.9)  19909.3
  10 │ [0.9, 1.0)  14944.5
Exercise 9
Fit a linear model explaining n by x separately for each group in gdf. Use the \ operator to fit it (recall it from chapter 4). For each group produce the result as a named tuple having fields α₀ and αₓ.
Solution
julia> function fitmodel(x, n)
           X = [ones(length(x)) x]
           α₀, αₓ = X \ n
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitmodel (generic function with 1 method)
julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
(α₀ = 9900.190310776916, αₓ = 99131.14394200995)
(α₀ = 19823.115188829383, αₓ = 81.66979172871368)
(α₀ = 19812.9822724435, αₓ = 424.00895772216785)
(α₀ = 19810.726510910834, αₓ = 520.6763238983195)
(α₀ = 19437.772385484135, αₓ = 1483.333906139938)
(α₀ = 20187.521449870146, αₓ = 63.30709585406235)
(α₀ = 20424.362332155855, αₓ = -419.42268710601405)
(α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
(α₀ = 20013.690535193662, αₓ = -122.80055110522495)
(α₀ = 109320.55276082881, αₓ = -99305.18846102979)
julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
Note that, as expected, the slope is large in magnitude only in the first and last groups.
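To check that the \ operator does what is expected, it can help to run the same construction on a tiny data set whose answer is known in advance. In the sketch below the points lie exactly on the line n = 1 + 2x, so the least-squares fit should recover an intercept of 1 and a slope of 2 up to rounding. The function name fit_line and the data are made up for this example.

```julia
# Least-squares fit of n ≈ α₀ + αₓ * x using the backslash operator.
function fit_line(x, n)
    X = [ones(length(x)) x]   # design matrix: a column of ones and a column of x
    α₀, αₓ = X \ n            # for a tall matrix, \ returns the least-squares solution
    return (; α₀, αₓ)         # named tuple shorthand (requires Julia 1.5 or later)
end

x = [0.0, 1.0, 2.0, 3.0]
n = [1.0, 3.0, 5.0, 7.0]
fit = fit_line(x, n)
```

Comparing fit.α₀ and fit.αₓ with the known values using isapprox is safer than ==, because the solver works in floating point.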
Exercise 10
Repeat exercise 9 but using the GLM.jl package. This time extract the p-value for the estimated coefficient of the x variable (the slope). Use the coeftable function from GLM.jl to get this information. Check the documentation of this function to learn how to do it (it will be easiest for you to first convert its result to a DataFrame).
Solution
julia> using GLM
julia> function fitlmmodel(group; info=false)
           model = lm(@formula(n~x), group)
           coefdf = DataFrame(coeftable(model))
           info && @show coefdf # to see what the data frame looks like
           α₀, αₓ = coefdf[:, "Coef."]
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitlmmodel (generic function with 1 method)
julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 9900.19 0.388607 25476.1 0.0 9899.43 9900.95
2 │ x 99131.1 6.75846 14667.7 0.0 99117.9 99144.4
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19823.1 2.52926 7837.5 0.0 19818.2 19828.1
2 │ x 81.6698 16.5512 4.93436 8.17139e-7 49.226 114.114
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19813.0 2.8427 6969.79 0.0 19807.4 19818.6
2 │ x 424.009 11.2737 37.6106 1.32368e-289 401.91 446.108
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19810.7 3.98478 4971.59 0.0 19802.9 19818.5
2 │ x 520.676 11.3429 45.9033 0.0 498.442 542.911
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19437.8 6.07925 3197.4 0.0 19425.9 19449.7
2 │ x 1483.33 13.4768 110.065 0.0 1456.92 1509.75
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20187.5 9.72795 2075.21 0.0 20168.5 20206.6
2 │ x 63.3071 17.6538 3.58603 0.000337323 28.7022 97.912
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20424.4 10.2201 1998.45 0.0 20404.3 20444.4
2 │ x -419.423 15.7112 -26.6958 1.0356e-151 -450.22 -388.626
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20789.7 9.56063 2174.51 0.0 20771.0 20808.4
2 │ x -1022.98 12.7417 -80.2856 0.0 -1047.95 -998.001
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20013.7 8.86033 2258.8 0.0 19996.3 20031.1
2 │ x -122.801 10.4201 -11.785 7.60822e-32 -143.226 -102.375
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 1.09321e5 5.78343 18902.4 0.0 1.09309e5 1.09332e5
2 │ x -99305.2 6.08269 -16325.9 0.0 -99317.1 -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
(α₀ = 9900.190310776927, αₓ = 99131.1439420097)
(α₀ = 19823.115188829663, αₓ = 81.66979172690417)
(α₀ = 19812.98227244386, αₓ = 424.00895772074136)
(α₀ = 19810.726510911398, αₓ = 520.6763238966264)
(α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
(α₀ = 20187.521449871012, αₓ = 63.307095852511125)
(α₀ = 20424.36233216108, αₓ = -419.4226871140539)
(α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
(α₀ = 20013.69053519897, αₓ = -122.80055111148658)
(α₀ = 109320.55276074051, αₓ = -99305.18846093686)
julia> combine(gdf, fitlmmodel)
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2
We got the same results. The combine(gdf, fitlmmodel) style of using the combine function is a bit more advanced and is not covered in the book. It is used in cases like this one, when you want to pass a whole group to the function in combine. Check the DataFrames.jl documentation for more detailed explanations.
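The fitlmmodel function above returns only the coefficients, while the p-values are visible in the printed coefdf. A possible way to return the p-value of the slope is sketched below, under the assumption that the coefficient table has the Name and Pr(>|t|) columns shown in the output above. The function name slope_pvalue and the toy data are made up for this illustration.

```julia
using DataFrames, GLM

# Made-up data: a clear linear trend with small deterministic deviations.
toy = DataFrame(x = collect(1.0:10.0))
toy.n = 2.0 .* toy.x .+ [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.1]

function slope_pvalue(group)
    model = lm(@formula(n ~ x), group)
    coefdf = DataFrame(coeftable(model))
    # Pick the row for the x coefficient by name and read its p-value.
    return only(coefdf[coefdf.Name .== "x", "Pr(>|t|)"])
end

p = slope_pvalue(toy)
```

With the grouped data frame from the exercises, [slope_pvalue(group) for group in gdf] would collect one p-value per group.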