JuliaForDataAnalysis/exercises/exercises09.md
2022-10-14 13:43:12 +02:00

Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 9

Problems

In this problem set we use the puzzles.csv file that was created in chapter 8. First load it into your Julia session using the following commands:

using CSV
using DataFrames
puzzles = CSV.read("puzzles.csv", DataFrame);

Exercise 1

Create a matein2 data frame that contains only the puzzles that have "mateIn2" in the Themes column. Use the contains function (check its documentation first).

Solution
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
274135×9 DataFrame
    Row │ PuzzleId  FEN                                Moves                Rating  RatingDeviation  Popularity  NbPlays  Themes                             GameUrl    ⋯
        │ String7   String                             String               Int64   Int64            Int64       Int64    String                             String     ⋯
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │ 000hf     r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1…  e8f7 e2e6 f7f8 e6f7    1560               76          88      441  mate mateIn2 middlegame short      https://li ⋯
      2 │ 001Wz     4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P…  e8e5 d1d8 e5e8 d8e8    1128               81          87       54  backRankMate endgame mate mateIn…  https://li
      3 │ 001om     5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1…  g4h4 c5f2 g2g3 f2g3     991               78          89      215  mate mateIn2 middlegame short      https://li
      4 │ 003Tx     2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P…  e1e4 f3d2 b1a1 c8c1    1716               77          87      476  backRankMate endgame fork mate m…  https://li
   ⋮    │    ⋮                      ⋮                           ⋮             ⋮            ⋮             ⋮          ⋮                     ⋮                             ⋱
 274132 │ zzxQS     2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn…  c8f8 d6d1 e3e1 d1e1    1149               75          96     1722  mate mateIn2 middlegame short      https://li ⋯
 274133 │ zzxvB     5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3…  f6g4 c7f7 f8f7 a7a8    1695               74          95     4857  endgame mate mateIn2 pin sacrifi…  https://li
 274134 │ zzzRN     4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3…  g1h1 e3e1 f7f1 e1f1     830              108          67       31  endgame mate mateIn2 short         https://li
 274135 │ zzzco     5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P…  f7f2 b2c2 c1b1 e2d1    1783               75          90      763  endgame mate mateIn2 queensideAt…  https://li
                                                                                                                                         1 column and 274127 rows omitted
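
The row selection above relies on broadcasting contains over the Themes column, which produces a Boolean mask. A minimal sketch of that step, using a short made-up vector of theme strings (the themes vector is hypothetical and not taken from puzzles.csv), could look as follows:

```julia
# Hypothetical theme strings in the same space-separated format as the Themes column.
themes = ["mate mateIn2 middlegame short",
          "endgame fork",
          "backRankMate endgame mate mateIn2"]

# Broadcasting contains over the vector gives one Bool per element.
mask = contains.(themes, "mateIn2")

# Indexing with the mask keeps only the matching elements,
# just as puzzles[mask, :] keeps only the matching rows.
selected = themes[mask]

# contains(needle) called with a single argument returns a predicate function,
# so the same selection can also be written with filter.
selected_alt = filter(contains("mateIn2"), themes)
```

Both forms should select the same elements; which one to use is largely a matter of taste.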

Exercise 2

What fraction of all puzzles in the puzzles data frame are mate in 2 puzzles?

Solution

Two ways to do it:

julia> using Statistics

julia> nrow(matein2) / nrow(puzzles)
0.12852152542746353

julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
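
The second approach works because Julia treats true as 1 and false as 0 in arithmetic, so the mean of a Boolean vector is the share of true values. A small sketch with a made-up mask (an assumption for illustration only) shows the equivalence:

```julia
using Statistics

# Hypothetical Boolean mask: true marks a matching puzzle.
mask = [true, false, false, true]

# Counting the true values and dividing by the length...
frac_count = count(mask) / length(mask)

# ...gives the same number as taking the mean of the mask.
frac_mean = mean(mask)
```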

Exercise 3

Create a small data frame that holds the first 10 rows of the matein2 data frame and the columns Rating, RatingDeviation, and NbPlays.

Solution
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10×3 DataFrame
 Row │ Rating  RatingDeviation  NbPlays
     │ Int64   Int64            Int64
─────┼──────────────────────────────────
   1 │   1560               76      441
   2 │   1128               81       54
   3 │    991               78      215
   4 │   1716               77      476
   5 │    711               81      111
   6 │    723               86      806
   7 │    754               92      248
   8 │   1177               76      827
   9 │    994               81       71
  10 │    979              144       14

Exercise 4

Iterate over the rows of the small data frame and print the ratio of RatingDeviation to NbPlays for each row.

Solution
julia> for row in eachrow(small)
           println(row.RatingDeviation / row.NbPlays)
       end
0.17233560090702948
1.5
0.3627906976744186
0.16176470588235295
0.7297297297297297
0.10669975186104218
0.3709677419354839
0.09189842805320435
1.1408450704225352
10.285714285714286
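
The exercise asks for row iteration, but because the ratio is an elementwise operation the same numbers can also be obtained by broadcasting. The sketch below uses plain vectors holding the first three rows of the two columns of small (copied from the output of exercise 3); with the data frame itself the expression would presumably be small.RatingDeviation ./ small.NbPlays.

```julia
# First three rows of the RatingDeviation and NbPlays columns of small.
rating_deviation = [76, 81, 78]
nb_plays = [441, 54, 215]

# Elementwise division computes all ratios at once, without an explicit loop.
ratios = rating_deviation ./ nb_plays
```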

Exercise 5

Get the names of the columns of the matein2 data frame that end with n (ignoring case).

Solution

Several options:

julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
 "FEN"
 "RatingDeviation"

julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
2-element Vector{String}:
 "FEN"
 "RatingDeviation"

julia> names(matein2, r"[nN]$")
2-element Vector{String}:
 "FEN"
 "RatingDeviation"

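All three solutions express the same predicate on the column name. As a sketch that does not need a data frame, the check can be tried on a plain vector of the column names shown in exercise 1; the r"n$"i form is an alternative spelling of the regular expression that uses the case-insensitive flag instead of a character class.

```julia
# Column names of the puzzles data frame, as shown in the output of exercise 1.
cols = ["PuzzleId", "FEN", "Moves", "Rating", "RatingDeviation",
        "Popularity", "NbPlays", "Themes", "GameUrl"]

# Lowercase the name first, then test its ending.
by_endswith = filter(col -> endswith(lowercase(col), "n"), cols)

# The i flag makes the regular expression case insensitive.
by_regex = filter(col -> occursin(r"n$"i, col), cols)
```
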
Exercise 6

Write a function collatz that runs the following process. Start with a positive integer n. If it is even, divide it by two. If it is odd, multiply it by 3 and add one. Repeat until n equals 1. The function should return the number of steps needed to reach 1.

Create a dictionary d that maps each number of steps to the list of numbers from the range 1:100 that require this number of steps.

Solution
julia> function collatz(n)
           i = 0
           while n != 1
               i += 1
               n = iseven(n) ? div(n, 2) : 3 * n + 1
           end
           return i
       end
collatz (generic function with 1 method)

julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()

julia> for n in 1:100
           i = collatz(n)
           if haskey(d, i)
               push!(d[i], n)
           else
               d[i] = [n]
           end
       end

julia> d
Dict{Int64, Vector{Int64}} with 45 entries:
  5   => [5, 32]
  35  => [78, 79]
  110 => [82, 83]
  30  => [86, 87, 89]
  32  => [57, 59]
  6   => [10, 64]
  115 => [73]
  112 => [54, 55]
  4   => [16]
  13  => [34, 35]
  104 => [47]
  12  => [17, 96]
  23  => [25]
  111 => [27]
  92  => [91]
  11  => [48, 52, 53]
  118 => [97]
  ⋮   => ⋮

As we can see, even for small n the number of steps required to reach 1 can be quite large.
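
The haskey branch in the loop can be written more compactly with get!, which returns the value stored under a key and inserts a default first if the key is missing. The following is one possible variant of the solution, not a replacement for it; the variable longest is added here only for illustration.

```julia
# Same collatz function as in the solution above.
function collatz(n)
    i = 0
    while n != 1
        i += 1
        n = iseven(n) ? div(n, 2) : 3 * n + 1
    end
    return i
end

d = Dict{Int, Vector{Int}}()
for n in 1:100
    # get! returns the vector stored under the key,
    # inserting an empty vector first if the key is missing.
    push!(get!(d, collatz(n), Int[]), n)
end

# The largest key is the longest run observed for n in 1:100.
longest = maximum(keys(d))
```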

Exercise 7

Using the d dictionary, make a scatter plot of the number of steps required vs. the average of the numbers that require this number of steps.

Solution
using Plots
using Statistics
steps = collect(keys(d))
mean_number = mean.(values(d))
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)

Note that we need to call collect on the keys because scatter expects an array, not just an iterator.
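
The solution relies on keys(d) and values(d) iterating in the same order, which holds as long as the dictionary is not modified in between. If an explicit pairing is preferred, one option is to collect the key-value pairs and sort them by key before splitting them. A sketch on three entries copied from the dictionary printed in exercise 6:

```julia
using Statistics

# Three entries copied from the dictionary printed in exercise 6.
d = Dict(4 => [16], 5 => [5, 32], 6 => [10, 64])

# Collect the key => value pairs and order them by key (number of steps).
ps = sort(collect(d); by=first)

steps = first.(ps)               # numbers of steps
mean_number = mean.(last.(ps))   # mean of the numbers for each number of steps
```

The resulting steps and mean_number vectors can then be passed to scatter in the same way as in the solution.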

Exercise 8

Repeat the process from exercises 6 and 7, but this time use a data frame and write an appropriate expression using the combine and groupby functions (as explained in the last part of chapter 9). This time perform the computations for numbers ranging from one to one million.

Solution
df = DataFrame(n=1:10^6);
df.collatz = collatz.(df.n);
agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)

Exercise 9

Set the seed of the random number generator to 1234. Draw 100 random points from the interval [0, 1]. Store this vector in a data frame as column x. Now compute column y using the formula 4 * (x - 0.5) ^ 2. Add random noise that has a normal distribution with mean 0 and standard deviation 0.25 to column y, and call the resulting column z. Make a scatter plot with x on the x-axis and y and z on the y-axis.

Solution
using Random
Random.seed!(1234)
df = DataFrame(x=rand(100))
df.y = 4 .* (df.x .- 0.5) .^ 2
df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
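
Two details of this solution may be worth checking separately: the dotted operators apply the formula elementwise, and dividing standard normal draws by 4 gives noise with standard deviation 0.25. The sketch below uses plain vectors instead of a data frame; the sample of 100_000 noise draws is an assumption made here only so that the empirical standard deviation lands close to 0.25.

```julia
using Random
using Statistics

Random.seed!(1234)

# rand draws uniformly from the half-open interval [0, 1).
x = rand(100)

# The dots broadcast the formula over every element of x.
y = 4 .* (x .- 0.5) .^ 2

# randn draws from N(0, 1); dividing by 4 scales the standard deviation to 0.25.
noise = randn(100_000) ./ 4
noise_sd = std(noise)
```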

Exercise 10

Add a LOESS regression line of z explained by x to the figure produced in exercise 9.

Solution
using Loess
model = loess(df.x, df.z);
x_predict = sort(df.x)
z_predict = predict(model, x_predict)
plot!(x_predict, z_predict; label="z predicted")