Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 9
Problems
In this problem set we will use the puzzles.csv file that was created in
chapter 8. First load it into your Julia session using the following
commands:
using CSV
using DataFrames
puzzles = CSV.read("puzzles.csv", DataFrame);
Exercise 1
Create a matein2 data frame that contains only the puzzles that have
"mateIn2" in the Themes column. Use the contains function (check its
documentation first).
Solution
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
274135×9 DataFrame
Row │ PuzzleId FEN Moves Rating RatingDeviation Popularity NbPlays Themes GameUrl ⋯
│ String7 String String Int64 Int64 Int64 Int64 String String ⋯
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 000hf r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1… e8f7 e2e6 f7f8 e6f7 1560 76 88 441 mate mateIn2 middlegame short https://li ⋯
2 │ 001Wz 4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P… e8e5 d1d8 e5e8 d8e8 1128 81 87 54 backRankMate endgame mate mateIn… https://li
3 │ 001om 5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1… g4h4 c5f2 g2g3 f2g3 991 78 89 215 mate mateIn2 middlegame short https://li
4 │ 003Tx 2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P… e1e4 f3d2 b1a1 c8c1 1716 77 87 476 backRankMate endgame fork mate m… https://li
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
274132 │ zzxQS 2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn… c8f8 d6d1 e3e1 d1e1 1149 75 96 1722 mate mateIn2 middlegame short https://li ⋯
274133 │ zzxvB 5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3… f6g4 c7f7 f8f7 a7a8 1695 74 95 4857 endgame mate mateIn2 pin sacrifi… https://li
274134 │ zzzRN 4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3… g1h1 e3e1 f7f1 e1f1 830 108 67 31 endgame mate mateIn2 short https://li
274135 │ zzzco 5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P… f7f2 b2c2 c1b1 e2d1 1783 75 90 763 endgame mate mateIn2 queensideAt… https://li
1 column and 274127 rows omitted
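To see what the row selector in this solution does, here is a minimal sketch on a plain vector. The themes vector is made up for illustration and only mimics the format of the Themes column.

```julia
# Hypothetical theme strings, mimicking the format of the Themes column.
themes = ["mate mateIn2 middlegame short",
          "endgame fork",
          "backRankMate endgame mate mateIn2"]

# contains(haystack, needle) tests a single string;
# the dot broadcasts the call over every element of the vector.
mask = contains.(themes, "mateIn2")

# A Boolean mask selects the matching elements, in the same way
# it selects rows when used as the row index of a data frame.
selected = themes[mask]
```

Here mask has one Boolean entry per string, so selected keeps the first and the third string.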
Exercise 2
What fraction of all puzzles in the puzzles data frame are mate in 2
puzzles?
Solution
Two ways to do it:
julia> using Statistics
julia> nrow(matein2) / nrow(puzzles)
0.12852152542746353
julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
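Both approaches agree because true counts as 1 and false as 0 when a Boolean vector is averaged. A minimal sketch with a made-up flags vector:

```julia
using Statistics

# Hypothetical Boolean vector standing in for the result of contains.(...).
flags = [true, false, true, true]

# Fraction computed by counting the true entries explicitly.
frac_count = count(flags) / length(flags)

# Fraction computed as the mean of the Boolean values.
frac_mean = mean(flags)
```

Both frac_count and frac_mean are equal to 0.75 in this example.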
Exercise 3
Create a data frame named small that holds the first 10 rows of the
matein2 data frame and the columns Rating, RatingDeviation, and NbPlays.
Solution
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10×3 DataFrame
Row │ Rating RatingDeviation NbPlays
│ Int64 Int64 Int64
─────┼──────────────────────────────────
1 │ 1560 76 441
2 │ 1128 81 54
3 │ 991 78 215
4 │ 1716 77 476
5 │ 711 81 111
6 │ 723 86 806
7 │ 754 92 248
8 │ 1177 76 827
9 │ 994 81 71
10 │ 979 144 14
Exercise 4
Iterate over the rows of the small data frame and print the ratio of
RatingDeviation to NbPlays for each row.
Solution
julia> for row in eachrow(small)
           println(row.RatingDeviation / row.NbPlays)
       end
0.17233560090702948
1.5
0.3627906976744186
0.16176470588235295
0.7297297297297297
0.10669975186104218
0.3709677419354839
0.09189842805320435
1.1408450704225352
10.285714285714286
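The exercise asks for row iteration, but the same ratios can also be computed column-wise with broadcasting. The sketch below is an alternative, not the requested solution, and it rebuilds only the first three rows of small by hand so that it can be run on its own.

```julia
using DataFrames

# First three rows of the small data frame, copied from the output above.
small3 = DataFrame(Rating=[1560, 1128, 991],
                   RatingDeviation=[76, 81, 78],
                   NbPlays=[441, 54, 215])

# Element-wise division of two columns; no explicit loop is needed.
ratios = small3.RatingDeviation ./ small3.NbPlays
```

The second element of ratios is 1.5, which matches the second value printed by the loop.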
Exercise 5
Get the names of the columns of the matein2 data frame that end with n
(ignoring case).
Solution
Several options:
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, r"[nN]$")
2-element Vector{String}:
"FEN"
"RatingDeviation"
Exercise 6
Write a function collatz that runs the following process. Start with a
positive integer n. If it is even, divide it by two. If it is odd,
multiply it by 3 and add one. Repeat until n equals 1. The function
should return the number of steps needed to reach 1.
Create a dictionary d that maps the number of steps needed to a list of
numbers from the range 1:100 that require this number of steps.
Solution
julia> function collatz(n)
           i = 0
           while n != 1
               i += 1
               n = iseven(n) ? div(n, 2) : 3 * n + 1
           end
           return i
       end
collatz (generic function with 1 method)
julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()
julia> for n in 1:100
           i = collatz(n)
           if haskey(d, i)
               push!(d[i], n)
           else
               d[i] = [n]
           end
       end
julia> d
Dict{Int64, Vector{Int64}} with 45 entries:
5 => [5, 32]
35 => [78, 79]
110 => [82, 83]
30 => [86, 87, 89]
32 => [57, 59]
6 => [10, 64]
115 => [73]
112 => [54, 55]
4 => [16]
13 => [34, 35]
104 => [47]
12 => [17, 96]
23 => [25]
111 => [27]
92 => [91]
11 => [48, 52, 53]
118 => [97]
⋮ => ⋮
As we can see, even for small n the number of steps required to reach 1
can get quite large.
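The haskey check in the loop can be written more compactly with get!, which returns the value stored under a key and inserts a default first if the key is missing. The following is a sketch of this alternative, with collatz repeated so that the block runs on its own:

```julia
function collatz(n)
    i = 0
    while n != 1
        i += 1
        n = iseven(n) ? div(n, 2) : 3 * n + 1
    end
    return i
end

d = Dict{Int, Vector{Int}}()
for n in 1:100
    # get! returns the vector stored under the key collatz(n),
    # first storing an empty Int vector there if the key is missing.
    push!(get!(d, collatz(n), Int[]), n)
end
```

This builds the same 45-entry dictionary as the loop above; for example, d[111] is again [27]. The only cost of the shorter form is that an empty vector is allocated on every iteration, even when the key already exists.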
Exercise 7
Using the dictionary d, make a scatter plot of the number of steps
required against the mean of the numbers that require this number of
steps.
Solution
using Plots
using Statistics
steps = collect(keys(d))
mean_number = mean.(values(d))
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
Note that we needed to use collect on keys because scatter expects an
array, not just an iterator.
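The solution relies on keys and values traversing the dictionary in the same order, so the two vectors line up element by element. A small sketch without plotting, using three entries taken from the dictionary printed in exercise 6:

```julia
using Statistics

# Three entries copied from the dictionary d shown in exercise 6.
d_small = Dict(5 => [5, 32], 4 => [16], 6 => [10, 64])

steps = collect(keys(d_small))          # materialize the key iterator
mean_number = mean.(values(d_small))    # one mean per value vector

# Dictionaries are unordered, so sort both vectors by the number of steps
# if an ordered view of the pairs is needed.
order = sortperm(steps)
sorted_steps = steps[order]
sorted_means = mean_number[order]
```

After sorting, the steps 4, 5, and 6 are paired with the means 16.0, 18.5, and 37.0. A scatter plot does not need this sorting, but a line plot would.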
Exercise 8
Repeat the process from exercises 6 and 7, but this time use a data
frame and write an appropriate expression using the combine and groupby
functions (as explained in the last part of chapter 9). This time
perform the computations for numbers ranging from one to one million.
Solution
df = DataFrame(n=1:10^6);
df.collatz = collatz.(df.n);
agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
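As a sanity check, the same expression can be run on the range 1:100 and compared with the dictionary from exercise 6. The sketch below assumes only DataFrames and Statistics, and it repeats the collatz function so that it runs on its own.

```julia
using DataFrames
using Statistics

function collatz(n)
    i = 0
    while n != 1
        i += 1
        n = iseven(n) ? div(n, 2) : 3 * n + 1
    end
    return i
end

df100 = DataFrame(n=1:100)
df100.collatz = collatz.(df100.n)

# One output row per distinct number of steps; the aggregated column
# gets the automatically generated name n_mean.
agg100 = combine(groupby(df100, :collatz), :n => mean)

# Mean of the numbers that need exactly 5 steps (5 and 32).
mean_for_5 = only(agg100.n_mean[agg100.collatz .== 5])
```

The result has 45 rows, one for each key of the dictionary d, and mean_for_5 equals 18.5.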
Exercise 9
Set the seed of the random number generator to 1234. Draw 100 random
points from the interval [0, 1]. Store this vector in a data frame as
the x column. Now compute the y column using the formula
4 * (x - 0.5) ^ 2. Add random noise that has a normal distribution with
mean 0 and standard deviation 0.25 to column y, and call the resulting
column z. Make a scatter plot with x on the x-axis and y and z on the
y-axis.
Solution
using Random
Random.seed!(1234)
df = DataFrame(x=rand(100))
df.y = 4 .* (df.x .- 0.5) .^ 2
df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
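The noise term randn(100) / 4 works because randn draws from the standard normal distribution, and dividing by 4 scales the standard deviation from 1 to 0.25. A quick empirical check, using a much larger sample than the exercise so that the estimates are stable:

```julia
using Random
using Statistics

Random.seed!(1234)

# A large sample of scaled standard normal draws.
noise = randn(10^6) / 4

noise_mean = mean(noise)   # should be close to 0
noise_std = std(noise)     # should be close to 0.25
```

The exact values depend on the Julia version, since the random number stream for a given seed is not guaranteed to stay the same between releases.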
Exercise 10
Add a LOESS regression line of x explaining z to the figure produced in
exercise 9.
Solution
using Loess
model = loess(df.x, df.z);
x_predict = sort(df.x)
z_predict = predict(model, x_predict)
plot!(x_predict, z_predict; label="z predicted")