# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 9

# Problems

In this problem set we will use the `puzzles.csv` file that was created in chapter 8.
Please first load it into your Julia session using the commands:

```
using CSV
using DataFrames
puzzles = CSV.read("puzzles.csv", DataFrame);
```

### Exercise 1

Create a `matein2` data frame that contains only the puzzles that have `"mateIn2"` in the `Themes` column.
Use the `contains` function (check its documentation first).
Solution

```
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
274135×9 DataFrame
    Row │ PuzzleId  FEN                                Moves                Rating  RatingDeviation  Popularity  NbPlays  Themes                             GameUrl    ⋯
        │ String7   String                             String               Int64   Int64            Int64       Int64    String                             String     ⋯
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │ 000hf     r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1…  e8f7 e2e6 f7f8 e6f7    1560               76          88      441  mate mateIn2 middlegame short      https://li ⋯
      2 │ 001Wz     4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P…  e8e5 d1d8 e5e8 d8e8    1128               81          87       54  backRankMate endgame mate mateIn…  https://li
      3 │ 001om     5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1…  g4h4 c5f2 g2g3 f2g3     991               78          89      215  mate mateIn2 middlegame short      https://li
      4 │ 003Tx     2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P…  e1e4 f3d2 b1a1 c8c1    1716               77          87      476  backRankMate endgame fork mate m…  https://li
   ⋮    │    ⋮                      ⋮                           ⋮             ⋮            ⋮             ⋮          ⋮                     ⋮                             ⋱
 274132 │ zzxQS     2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn…  c8f8 d6d1 e3e1 d1e1    1149               75          96     1722  mate mateIn2 middlegame short      https://li ⋯
 274133 │ zzxvB     5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3…  f6g4 c7f7 f8f7 a7a8    1695               74          95     4857  endgame mate mateIn2 pin sacrifi…  https://li
 274134 │ zzzRN     4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3…  g1h1 e3e1 f7f1 e1f1     830              108          67       31  endgame mate mateIn2 short         https://li
 274135 │ zzzco     5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P…  f7f2 b2c2 c1b1 e2d1    1783               75          90      763  endgame mate mateIn2 queensideAt…  https://li
                                                                                                                                      1 column and 274127 rows omitted
```
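To make the `contains` call in this solution easier to inspect, here is a minimal sketch on a made-up vector. The `themes` strings below are illustrative and are not rows of `puzzles.csv`. It shows the broadcasted two-argument form, which produces a Boolean mask, and the single-argument form `contains(needle)`, which returns a predicate that can be passed to `filter` (this form assumes Julia 1.5 or later).

```julia
# Illustrative data only; these strings are not taken from puzzles.csv.
themes = ["mate mateIn2 middlegame short",
          "crushing endgame fork",
          "backRankMate endgame mate mateIn2"]

# The dot broadcasts contains over the vector and yields one Bool per element.
mask = contains.(themes, "mateIn2")

# contains("mateIn2") is a function that checks its argument for the substring.
selected = filter(contains("mateIn2"), themes)
```

The mask is what gets used as the row selector in `puzzles[mask, :]`.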
### Exercise 2

What is the fraction of puzzles that are mate in 2 in relation to all puzzles in the `puzzles` data frame?
Solution

Two ways to do it:

```
julia> using Statistics

julia> nrow(matein2) / nrow(puzzles)
0.12852152542746353

julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
```
### Exercise 3

Create a `small` data frame that holds the first 10 rows of the `matein2` data frame and the columns `Rating`, `RatingDeviation`, and `NbPlays`.
Solution

```
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10×3 DataFrame
 Row │ Rating  RatingDeviation  NbPlays
     │ Int64   Int64            Int64
─────┼──────────────────────────────────
   1 │   1560               76      441
   2 │   1128               81       54
   3 │    991               78      215
   4 │   1716               77      476
   5 │    711               81      111
   6 │    723               86      806
   7 │    754               92      248
   8 │   1177               76      827
   9 │    994               81       71
  10 │    979              144       14
```
### Exercise 4

Iterate over the rows of the `small` data frame and print the ratio of `RatingDeviation` to `NbPlays` for each row.
Solution

```
julia> for row in eachrow(small)
           println(row.RatingDeviation / row.NbPlays)
       end
0.17233560090702948
1.5
0.3627906976744186
0.16176470588235295
0.7297297297297297
0.10669975186104218
0.3709677419354839
0.09189842805320435
1.1408450704225352
10.285714285714286
```
### Exercise 5

Get the names of the columns of the `matein2` data frame that end with `n` (ignore case).
Solution

Several options:

```
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
 "FEN"
 "RatingDeviation"

julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
2-element Vector{String}:
 "FEN"
 "RatingDeviation"

julia> names(matein2, r"[nN]$")
2-element Vector{String}:
 "FEN"
 "RatingDeviation"
```
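The regular-expression option can also be written with the case-insensitive flag `i` instead of the character class `[nN]`. The sketch below checks this on a plain vector of strings, so it runs without the data file. The `colnames` vector is typed in by hand from the column headers shown in the `matein2` output above, and the variable names are only illustrative.

```julia
# Column names copied by hand from the matein2 output shown earlier.
colnames = ["PuzzleId", "FEN", "Moves", "Rating", "RatingDeviation",
            "Popularity", "NbPlays", "Themes", "GameUrl"]

# r"n$"i matches an "n" at the end of the string, ignoring case.
ends_with_n = filter(name -> occursin(r"n$"i, name), colnames)
```

Since `names` accepts a `Regex` as shown in the solution, `names(matein2, r"n$"i)` should presumably select the same two columns.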
### Exercise 6

Write a function `collatz` that runs the following process. Start with a positive integer `n`. If it is even, divide it by two. If it is odd, multiply it by 3 and add one. Repeat this until the value reaches 1. The function should return the number of steps needed to reach 1. Create a dictionary `d` that maps the number of steps needed to a list of numbers from the range `1:100` that require this number of steps.
Solution

```
julia> function collatz(n)
           i = 0
           while n != 1
               i += 1
               n = iseven(n) ? div(n, 2) : 3 * n + 1
           end
           return i
       end
collatz (generic function with 1 method)

julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()

julia> for n in 1:100
           i = collatz(n)
           if haskey(d, i)
               push!(d[i], n)
           else
               d[i] = [n]
           end
       end

julia> d
Dict{Int64, Vector{Int64}} with 45 entries:
  5   => [5, 32]
  35  => [78, 79]
  110 => [82, 83]
  30  => [86, 87, 89]
  32  => [57, 59]
  6   => [10, 64]
  115 => [73]
  112 => [54, 55]
  4   => [16]
  13  => [34, 35]
  104 => [47]
  12  => [17, 96]
  23  => [25]
  111 => [27]
  92  => [91]
  11  => [48, 52, 53]
  118 => [97]
  ⋮   => ⋮
```

As we can see even for small `n` the number of steps required to reach `1` can get quite large.
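The `haskey` branch in the loop can be folded into a single call with `get!`, which returns the value stored under a key and inserts a default first if the key is missing. The sketch below is an alternative formulation, not the solution's own code. The names `collatz_steps` and `steps_to_numbers` are hypothetical and chosen so that they do not clash with `collatz` and `d`.

```julia
# Same step-counting logic as the solution, under a hypothetical name.
function collatz_steps(n)
    steps = 0
    while n != 1
        steps += 1
        n = iseven(n) ? div(n, 2) : 3 * n + 1
    end
    return steps
end

steps_to_numbers = Dict{Int, Vector{Int}}()
for n in 1:100
    # get! returns the vector stored under the key,
    # inserting an empty Int[] first when the key is new.
    push!(get!(steps_to_numbers, collatz_steps(n), Int[]), n)
end
```

The resulting dictionary should have the same contents as `d`, for example `steps_to_numbers[5]` holds `5` and `32`.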
### Exercise 7

Using the `d` dictionary, make a scatter plot of the number of steps required vs. the average value of the numbers that require this number of steps.
Solution

```
using Plots
using Statistics
steps = collect(keys(d))
mean_number = mean.(values(d))
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
```

Note that we needed to use `collect` on `keys` as `scatter` expects an array not just an iterator.
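The solution relies on `keys(d)` and `values(d)` iterating in the same order, so that each step count lines up with its mean. A small sketch, using an illustrative dictionary `demo` with three of the entries printed in exercise 6, checks this alignment without plotting and also shows how to sort both vectors by step count. All names except `mean` and the `Base` functions are hypothetical.

```julia
using Statistics

# Illustrative dictionary in the same shape as d (steps => numbers).
demo = Dict(4 => [16], 5 => [5, 32], 6 => [10, 64])

steps = collect(keys(demo))                      # a Vector{Int}, not an iterator
mean_number = [mean(v) for v in values(demo)]    # one mean per key, same order

# Check that each mean belongs to the key at the same position.
aligned = all(mean(demo[s]) == m for (s, m) in zip(steps, mean_number))

# Sorting by steps gives a reproducible order, e.g. for printing or line plots.
order = sortperm(steps)
sorted_steps = steps[order]
sorted_means = mean_number[order]
```

For a scatter plot the order of the points does not matter, so the sorting step is optional there.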
### Exercise 8

Repeat the process from exercises 6 and 7, but this time use a data frame and try to write an appropriate expression using the `combine` and `groupby` functions (as explained in the last part of chapter 9). This time perform the computations for numbers ranging from one to one million.
Solution

```
df = DataFrame(n=1:10^6);
df.collatz = collatz.(df.n);
agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
```
### Exercise 9

Set the seed of the random number generator to `1234`. Draw 100 random points from the interval `[0, 1]`. Store this vector in a data frame as the `x` column. Now compute the `y` column using the formula `4 * (x - 0.5) ^ 2`. Add random noise that has a normal distribution with mean 0 and standard deviation 0.25 to column `y`, and store the result as column `z`. Make a scatter plot with `x` on the x-axis and `y` and `z` on the y-axis.
Solution

```
using Random
Random.seed!(1234)
df = DataFrame(x=rand(100))
df.y = 4 .* (df.x .- 0.5) .^ 2
df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
```
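The noise term `randn(100) / 4` works because `randn` draws from the standard normal distribution, and dividing by 4 scales the standard deviation from 1 to 0.25 while keeping the mean at 0. The sketch below checks this empirically. It uses a much larger sample than the exercise (an assumption made only so that the empirical values are close to the theoretical ones), and the variable names are illustrative.

```julia
using Random
using Statistics

Random.seed!(1234)

# Large sample so that the empirical statistics are close to the theoretical ones.
noise = randn(10^6) / 4

noise_mean = mean(noise)   # should be close to 0
noise_std = std(noise)     # should be close to 0.25
```

With only 100 draws, as in the exercise, the empirical standard deviation can differ noticeably from 0.25.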
### Exercise 10

Add a LOESS regression line of `x` explaining `z` to the figure produced in exercise 9.
Solution

```
using Loess
model = loess(df.x, df.z);
x_predict = sort(df.x)
z_predict = predict(model, x_predict)
plot!(x_predict, z_predict; label="z predicted")
```