JuliaForDataAnalysis/exercises/exercises11.md
Bogumił Kamiński 3b8ffa5d40 add exercises
2022-10-14 12:27:04 +02:00

Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 11

Problems

Exercise 1

Generate a data frame df with one column x consisting of 100,000 values sampled from the uniform distribution on the [0, 1) interval. Serialize it to disk and then deserialize it. Check whether the deserialized object is the same as the source data frame.

Exercise 2

Add a column n to the df data frame that, in each row, holds the number of observations in column x whose distance to the value of x in that row is less than 0.1.

Exercise 3

Investigate visually how n depends on x in the data frame df.

Exercise 4

Someone has prepared the following test data for you:

teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""

Load this data into the testdf data frame.

Exercise 5

Check the accuracy of the computation of the sine of x in testdf. Print all rows for which the absolute difference is greater than 5e-7. For each such row, display x, sinx, the exact value of sin(x), and the absolute difference.

Exercise 6

Group the data in data frame df into buckets of width 0.1 and store the result in the gdf grouped data frame (sort the groups). Use the cut function from CategoricalArrays.jl to do it (check its documentation to learn how to use it). Check the number of values in each group.

Exercise 7

Display the grouping keys of the gdf grouped data frame. Show them as named tuples. Check what the group order would be if you asked not to sort them.

Exercise 8

Compute average n for each group in gdf.

Exercise 9

Fit a linear model explaining n by x separately for each group in gdf. Use the \ operator to fit it (recall it from chapter 4). For each group, produce the result as a named tuple with fields α₀ and αₓ.

Exercise 10

Repeat exercise 9, but using the GLM.jl package. This time, extract the p-value of the estimated slope, that is, of the coefficient for the x variable. Use the coeftable function from GLM.jl to get this information. Check the documentation of this function to learn how to do it (it will be easiest to first convert its result to a DataFrame).

Solutions

Exercise 1

Solution:

julia> using DataFrames

julia> df = DataFrame(x=rand(100_000));

julia> using Serialization

julia> serialize("df.bin", df)

julia> deserialize("df.bin") == df
true
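
The check above compares contents, not object identity. A minimal sketch of the difference follows, assuming a temporary path from tempname() in place of df.bin (an illustrative choice, so that no file is left in the working directory); the names df2, same_contents, and same_object are hypothetical:

```julia
using DataFrames
using Serialization

df = DataFrame(x=rand(100_000))

path = tempname()        # assumption: a temporary file instead of "df.bin"
serialize(path, df)
df2 = deserialize(path)
rm(path)                 # clean up the temporary file

same_contents = df2 == df    # compares column names and stored values
same_object = df2 === df     # checks whether both names refer to one object
```

Here same_contents should be true while same_object is false, since deserialization builds a new data frame. If a column contained missing values, == would return missing, and isequal would be the comparison to use.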

Exercise 2

Solution:

A simple approach is:

df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)

A more sophisticated approach (faster and allocating less memory) would be:

df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)

An even faster, type-stable solution uses a function barrier:

f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)

Finally, you can work on sorted data to get much better performance. Here is an example (it is a bit more advanced):

function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x) # you could add @inbounds here but I typically avoid it
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end
df.n = f2(df.x)

In this solution, the function barrier is even more relevant, as we explicitly use loops inside the function.
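
When optimizing like this, it is worth checking the fast version against the brute-force one on a small input. The sketch below does so for a different sorted-data variant, which is an assumption of this example and not the solution above: it replaces the two moving pointers with binary search from Base (searchsortedfirst and searchsortedlast with a custom lt). The names count_naive, count_sorted, and far_below are hypothetical.

```julia
# Brute-force reference with the same condition as the solutions above
count_naive(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Assumed alternative: binary search on a sorted copy of x
function count_sorted(x)
    s = sort(x)
    # far_below(a, b) is true when a lies at least 0.1 below b
    far_below(a, b) = b - a >= 0.1
    return map(x) do v
        # first and last positions in s whose values are closer than 0.1 to v
        first_close = searchsortedfirst(s, v; lt=far_below)
        last_close = searchsortedlast(s, v; lt=far_below)
        last_close - first_close + 1
    end
end

x = rand(1_000)
results_agree = count_sorted(x) == count_naive(x)
```

The same kind of comparison can be applied to f2, for example by checking f2(x) == f(x) on a short vector before running it on the full column.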

Exercise 3

Solution:

using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)

As expected, the number of neighbors drops near the borders of the domain.
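
The shape of the plot can be anticipated with a short calculation. As a rough sketch, assuming N points drawn uniformly from [0, 1) and a radius r, the expected count is approximately N times the length of the part of the window (x - r, x + r) that lies inside the domain. The helper expected_neighbors is hypothetical and is not part of the original solution:

```julia
# Hypothetical helper: approximate expected number of points within distance r of x
expected_neighbors(x; N=100_000, r=0.1) = N * (min(x + r, 1.0) - max(x - r, 0.0))

at_border = expected_neighbors(0.0)    # only one side of the window is inside the domain
in_interior = expected_neighbors(0.5)  # the full window of width 2r is inside
```

Under these assumptions the approximation gives about 10,000 at the borders and about 20,000 in the interior, rising linearly over the first 0.1 of the domain and falling linearly over the last 0.1.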

Exercise 4

Solution:

julia> using CSV

julia> using DataFrames

julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
 Row │ x         sinx
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.139279  0.138829
   2 │ 0.456779  0.441059
   3 │ 0.344034  0.337287
   4 │ 0.140253  0.139794
   5 │ 0.848344  0.750186
   6 │ 0.977512  0.829109
   7 │ 0.032737  0.032731
   8 │ 0.70275   0.646318
   9 │ 0.422339  0.409895
  10 │ 0.393878  0.383772

Exercise 5

Solution:

Since the data frame is small, we can use eachrow:

julia> for row in eachrow(testdf)
           sinx = sin(row.x)
           dev = abs(sinx - row.sinx)
           dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
       end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
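
The same check can also be written column-wise, without an explicit loop. The sketch below is one possible variant, not the original solution; to stay self-contained it uses a hypothetical data frame small holding three rows copied from testdf, and the column names computed and dev are chosen to match the printed output above:

```julia
using DataFrames

# Three rows copied from testdf above
small = DataFrame(x=[0.139279, 0.456779, 0.70275],
                  sinx=[0.138829, 0.441059, 0.646318])

# Add the exact sine and the absolute deviation as new columns
checked = transform(small, :x => ByRow(sin) => :computed)
checked.dev = abs.(checked.computed .- checked.sinx)

# Keep only the rows whose deviation exceeds the threshold
bad = filter(:dev => >(5e-7), checked)
```

A value correctly rounded to six decimal places can be off by at most 5e-7, so the rows kept in bad are those where the stored sinx was probably not rounded to the nearest value.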

Exercise 6

Solution:

julia> using CategoricalArrays

julia> df.xbins = cut(df.x, 0.0:0.1:1.0);

julia> gdf = groupby(df, :xbins; sort=true);

julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
  9872
  9976
  9968
  9943
 10063
 10173
  9977
 10076
  9908
 10044

julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
 Row │ xbins       nrow
     │ Cat…        Int64
─────┼───────────────────
   1 │ [0.0, 0.1)   9872
   2 │ [0.1, 0.2)   9976
   3 │ [0.2, 0.3)   9968
   4 │ [0.3, 0.4)   9943
   5 │ [0.4, 0.5)  10063
   6 │ [0.5, 0.6)  10173
   7 │ [0.6, 0.7)   9977
   8 │ [0.7, 0.8)  10076
   9 │ [0.8, 0.9)   9908
  10 │ [0.9, 1.0)  10044

You might get slightly different numbers, but all should be around 10,000.

Exercise 7

Solution:

julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.6, 0.7)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.9, 1.0)",)

julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
 (xbins = "[0.4, 0.5)",)
 (xbins = "[0.9, 1.0)",)
 (xbins = "[0.8, 0.9)",)
 (xbins = "[0.0, 0.1)",)
 (xbins = "[0.2, 0.3)",)
 (xbins = "[0.5, 0.6)",)
 (xbins = "[0.7, 0.8)",)
 (xbins = "[0.3, 0.4)",)
 (xbins = "[0.1, 0.2)",)
 (xbins = "[0.6, 0.7)",)

If you pass sort=false instead of sort=true, you get the groups in their order of appearance in df. If you do not specify the sort keyword argument, the resulting group order may depend on the type of the grouping column, so if you rely on the order of groups, always pass the sort keyword argument explicitly.

Exercise 8

Solution:

julia> using Statistics

julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
 14845.847751215559
 19835.367882919007
 19919.195826645264
 19993.023936437694
 20105.506111497565
 20222.35761329008
 20151.794727874112
 20022.69610956729
 19909.331550262414
 14944.511449621665

julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
 Row │ xbins       n_mean
     │ Cat…        Float64
─────┼─────────────────────
   1 │ [0.0, 0.1)  14845.8
   2 │ [0.1, 0.2)  19835.4
   3 │ [0.2, 0.3)  19919.2
   4 │ [0.3, 0.4)  19993.0
   5 │ [0.4, 0.5)  20105.5
   6 │ [0.5, 0.6)  20222.4
   7 │ [0.6, 0.7)  20151.8
   8 │ [0.7, 0.8)  20022.7
   9 │ [0.8, 0.9)  19909.3
  10 │ [0.9, 1.0)  14944.5

Exercise 9

Solution:

julia> function fitmodel(x, n)
           X = [ones(length(x)) x]
           α₀, αₓ = X \ n
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitmodel (generic function with 1 method)

julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776916, αₓ = 99131.14394200995)
 (α₀ = 19823.115188829383, αₓ = 81.66979172871368)
 (α₀ = 19812.9822724435, αₓ = 424.00895772216785)
 (α₀ = 19810.726510910834, αₓ = 520.6763238983195)
 (α₀ = 19437.772385484135, αₓ = 1483.333906139938)
 (α₀ = 20187.521449870146, αₓ = 63.30709585406235)
 (α₀ = 20424.362332155855, αₓ = -419.42268710601405)
 (α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
 (α₀ = 20013.690535193662, αₓ = -122.80055110522495)
 (α₀ = 109320.55276082881, αₓ = -99305.18846102979)

julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2

We note that, indeed, the first and last groups have by far the steepest regression slopes.
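
To see what the \ operator does inside fitmodel, it helps to run it on data where the answer is known. This is a small sketch on made-up points that lie exactly on a line with intercept 2 and slope 3 (the data are an assumption of this example):

```julia
x = [0.0, 1.0, 2.0, 3.0]
y = 2.0 .+ 3.0 .* x          # points on an exact line: intercept 2, slope 3

X = [ones(length(x)) x]      # design matrix: a column of ones and a column of x
α₀, αₓ = X \ y               # least-squares solution, since X has more rows than columns
```

Up to floating-point error, α₀ and αₓ recover the intercept and the slope of the line.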

Exercise 10

Solution:

julia> using GLM

julia> function fitlmmodel(group; info=false)
           model = lm(@formula(n~x), group)
           coefdf = DataFrame(coeftable(model))
           info && @show coefdf # to see what the data frame looks like
           α₀, αₓ = coefdf[:, "Coef."]
           return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
       end
fitlmmodel (generic function with 1 method)

julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t        Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64  Float64   Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)   9900.19    0.388607  25476.1       0.0    9899.43    9900.95
   2 │ x            99131.1     6.75846   14667.7       0.0   99117.9    99144.4
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)    Lower 95%  Upper 95%
     │ String       Float64     Float64     Float64     Float64     Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19823.1        2.52926  7837.5      0.0         19818.2    19828.1
   2 │ x               81.6698    16.5512      4.93436  8.17139e-7     49.226    114.114
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)      Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64       Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19813.0        2.8427  6969.79    0.0            19807.4   19818.6
   2 │ x              424.009     11.2737    37.6106  1.32368e-289     401.91    446.108
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64   Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19810.7       3.98478  4971.59         0.0  19802.9    19818.5
   2 │ x              520.676    11.3429     45.9033       0.0    498.442    542.911
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t         Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  19437.8      6.07925  3197.4         0.0   19425.9    19449.7
   2 │ x             1483.33    13.4768    110.065       0.0    1456.92    1509.75
coefdf = 2×7 DataFrame
 Row │ Name         Coef.       Std. Error  t           Pr(>|t|)     Lower 95%   Upper 95%
     │ String       Float64     Float64     Float64     Float64      Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20187.5        9.72795  2075.21     0.0          20168.5     20206.6
   2 │ x               63.3071    17.6538      3.58603  0.000337323     28.7022     97.912
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t          Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64    Float64      Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20424.4       10.2201  1998.45    0.0           20404.3   20444.4
   2 │ x             -419.423     15.7112   -26.6958  1.0356e-151    -450.22   -388.626
coefdf = 2×7 DataFrame
 Row │ Name         Coef.     Std. Error  t          Pr(>|t|)  Lower 95%  Upper 95%
     │ String       Float64   Float64     Float64    Float64   Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20789.7      9.56063  2174.51         0.0   20771.0   20808.4
   2 │ x            -1022.98    12.7417    -80.2856       0.0   -1047.95   -998.001
coefdf = 2×7 DataFrame
 Row │ Name         Coef.      Std. Error  t         Pr(>|t|)     Lower 95%  Upper 95%
     │ String       Float64    Float64     Float64   Float64      Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)  20013.7       8.86033  2258.8    0.0          19996.3    20031.1
   2 │ x             -122.801    10.4201    -11.785  7.60822e-32   -143.226   -102.375
coefdf = 2×7 DataFrame
 Row │ Name         Coef.           Std. Error  t         Pr(>|t|)  Lower 95%       Upper 95%
     │ String       Float64         Float64     Float64   Float64   Float64         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ (Intercept)       1.09321e5     5.78343   18902.4       0.0       1.09309e5       1.09332e5
   2 │ x            -99305.2           6.08269  -16325.9       0.0  -99317.1        -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
 (α₀ = 9900.190310776927, αₓ = 99131.1439420097)
 (α₀ = 19823.115188829663, αₓ = 81.66979172690417)
 (α₀ = 19812.98227244386, αₓ = 424.00895772074136)
 (α₀ = 19810.726510911398, αₓ = 520.6763238966264)
 (α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
 (α₀ = 20187.521449871012, αₓ = 63.307095852511125)
 (α₀ = 20424.36233216108, αₓ = -419.4226871140539)
 (α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
 (α₀ = 20013.69053519897, αₓ = -122.80055111148658)
 (α₀ = 109320.55276074051, αₓ = -99305.18846093686)

julia> combine(gdf, fitlmmodel)
10×3 DataFrame
 Row │ xbins       α₀             αₓ
     │ Cat…        Float64        Float64
─────┼────────────────────────────────────────
   1 │ [0.0, 0.1)   9900.19        99131.1
   2 │ [0.1, 0.2)  19823.1            81.6698
   3 │ [0.2, 0.3)  19813.0           424.009
   4 │ [0.3, 0.4)  19810.7           520.676
   5 │ [0.4, 0.5)  19437.8          1483.33
   6 │ [0.5, 0.6)  20187.5            63.3071
   7 │ [0.6, 0.7)  20424.4          -419.423
   8 │ [0.7, 0.8)  20789.7         -1022.98
   9 │ [0.8, 0.9)  20013.7          -122.801
  10 │ [0.9, 1.0)      1.09321e5  -99305.2

We got the same results. The p-values requested in the exercise are in the Pr(>|t|) column of the coefdf data frames printed above. The combine(gdf, fitlmmodel) style of using the combine function is a bit more advanced and is not covered in the book. It is used in cases like this one, where you want to pass a whole group to the function in combine. Check the DataFrames.jl documentation for more detailed explanations.
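
The function above returns the coefficients, while the exercise asks for the p-value of the slope. A minimal sketch of extracting it is shown below, assuming the coefficient table has the layout printed above (row 2 is x, and the p-value column is named Pr(>|t|)). The helper slope_pvalue and the data frame toy are hypothetical and serve only to make the example self-contained:

```julia
using DataFrames
using GLM

# Hypothetical helper: p-value of the slope for a data frame (or group) with columns x and n
function slope_pvalue(group)
    model = lm(@formula(n ~ x), group)
    coefdf = DataFrame(coeftable(model))
    # Row 2 holds the x coefficient; "Pr(>|t|)" is the p-value column shown above
    return (pvalue = coefdf[2, "Pr(>|t|)"],)
end

# Made-up data: a line with a small alternating disturbance
toy = DataFrame(x=collect(0.0:0.1:0.9))
toy.n = 1 .+ 2 .* toy.x .+ 0.05 .* [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

result = slope_pvalue(toy)
```

Because the helper takes a whole group and returns a named tuple, it should fit the same combine(gdf, fitlmmodel) pattern, that is combine(gdf, slope_pvalue), to produce one p-value per group.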