# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 9
# Problems
In this problem set we use the `puzzles.csv` file that was
created in chapter 8. First load it into your Julia
session using the following commands:
```
using CSV
using DataFrames
puzzles = CSV.read("puzzles.csv", DataFrame);
```
### Exercise 1
Create a `matein2` data frame that contains only the puzzles that have `"mateIn2"`
in the `Themes` column.
Use the `contains` function (check its documentation first).
<details>
<summary>Solution</summary>
```
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
274135×9 DataFrame
   Row │ PuzzleId  FEN                                Moves                Rating  RatingDeviation  Popularity  NbPlays  Themes                             GameUrl    ⋯
       │ String7   String                             String               Int64   Int64            Int64       Int64    String                             String     ⋯
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │ 000hf     r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1…  e8f7 e2e6 f7f8 e6f7    1560               76          88      441  mate mateIn2 middlegame short      https://li ⋯
     2 │ 001Wz     4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P…  e8e5 d1d8 e5e8 d8e8    1128               81          87       54  backRankMate endgame mate mateIn…  https://li
     3 │ 001om     5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1…  g4h4 c5f2 g2g3 f2g3     991               78          89      215  mate mateIn2 middlegame short      https://li
     4 │ 003Tx     2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P…  e1e4 f3d2 b1a1 c8c1    1716               77          87      476  backRankMate endgame fork mate m…  https://li
   ⋮   │    ⋮                      ⋮                           ⋮              ⋮           ⋮              ⋮         ⋮                     ⋮                       ⋮     ⋱
274132 │ zzxQS     2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn…  c8f8 d6d1 e3e1 d1e1    1149               75          96     1722  mate mateIn2 middlegame short      https://li ⋯
274133 │ zzxvB     5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3…  f6g4 c7f7 f8f7 a7a8    1695               74          95     4857  endgame mate mateIn2 pin sacrifi…  https://li
274134 │ zzzRN     4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3…  g1h1 e3e1 f7f1 e1f1     830              108          67       31  endgame mate mateIn2 short         https://li
274135 │ zzzco     5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P…  f7f2 b2c2 c1b1 e2d1    1783               75          90      763  endgame mate mateIn2 queensideAt…  https://li
                                                                                                                                        1 column and 274127 rows omitted
```
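The same selection can also be written with `subset` and `ByRow` instead of a broadcasted Boolean index. The sketch below runs on a tiny made-up data frame; the `toy` name and its three rows are illustrative only and are not taken from `puzzles.csv`:

```julia
using DataFrames

# Hypothetical three-row stand-in for the `puzzles` data frame.
toy = DataFrame(PuzzleId=["a", "b", "c"],
                Themes=["mate mateIn2 short", "endgame fork", "mateIn2 pin"])

# `contains("mateIn2")` returns a function that tests a single string;
# `ByRow` applies it to every element of the `Themes` column.
toy_matein2 = subset(toy, :Themes => ByRow(contains("mateIn2")))
```

Applied to the real data, the equivalent call would be `subset(puzzles, :Themes => ByRow(contains("mateIn2")))`.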
</details>
### Exercise 2
What fraction of all puzzles in the `puzzles` data frame
are mate-in-2 puzzles?
<details>
<summary>Solution</summary>
Two ways to do it:
```
julia> using Statistics
julia> nrow(matein2) / nrow(puzzles)
0.12852152542746353
julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
```
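The second approach works because `mean` treats `true` as 1 and `false` as 0, so the mean of a Boolean vector is the fraction of `true` values. A minimal sketch on a made-up vector (the `flags` name and its contents are illustrative):

```julia
using Statistics

# Hypothetical Boolean vector: 2 of 8 entries are `true`.
flags = [true, false, false, true, false, false, false, false]

# Three equivalent ways to compute the fraction of `true` values.
frac_mean = mean(flags)
frac_count = count(flags) / length(flags)
frac_sum = sum(flags) / length(flags)
```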
</details>
### Exercise 3
Create a `small` data frame that holds the first 10 rows of the `matein2` data frame
and the columns `Rating`, `RatingDeviation`, and `NbPlays`.
<details>
<summary>Solution</summary>
```
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10×3 DataFrame
 Row │ Rating  RatingDeviation  NbPlays
     │ Int64   Int64            Int64
─────┼──────────────────────────────────
   1 │   1560               76      441
   2 │   1128               81       54
   3 │    991               78      215
   4 │   1716               77      476
   5 │    711               81      111
   6 │    723               86      806
   7 │    754               92      248
   8 │   1177               76      827
   9 │    994               81       71
  10 │    979              144       14
```
</details>
### Exercise 4
Iterate over the rows of the `small` data frame and print the ratio of
`RatingDeviation` to `NbPlays` for each row.
<details>
<summary>Solution</summary>
```
julia> for row in eachrow(small)
           println(row.RatingDeviation / row.NbPlays)
       end
0.17233560090702948
1.5
0.3627906976744186
0.16176470588235295
0.7297297297297297
0.10669975186104218
0.3709677419354839
0.09189842805320435
1.1408450704225352
10.285714285714286
```
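If the goal is only to compute the ratios rather than to practice row iteration, broadcasting over the two columns gives the same numbers. The sketch below rebuilds the first three rows of `small` by hand; the `small3` name is an assumption made for this example:

```julia
using DataFrames

# First three rows of `small`, copied from the output of exercise 3.
small3 = DataFrame(Rating=[1560, 1128, 991],
                   RatingDeviation=[76, 81, 78],
                   NbPlays=[441, 54, 215])

# Element-wise division of two columns, without an explicit loop.
ratios = small3.RatingDeviation ./ small3.NbPlays

# The same values collected from the row iterator used in the solution.
ratios_by_row = [row.RatingDeviation / row.NbPlays for row in eachrow(small3)]
```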
</details>
### Exercise 5
Get the names of the columns in the `matein2` data frame that end with `n` (ignoring case).
<details>
<summary>Solution</summary>
Several options:
```
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, r"[nN]$")
2-element Vector{String}:
"FEN"
"RatingDeviation"
```
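A regular expression can also carry the case-insensitive flag `i`, which avoids spelling out both letters. A small sketch on a one-row data frame with made-up contents (the `cols_demo` name and its values are hypothetical; only the column names matter here):

```julia
using DataFrames

# Hypothetical data frame whose column names mirror some of those in `puzzles`.
cols_demo = DataFrame(PuzzleId=["000hf"], FEN=["placeholder"],
                      RatingDeviation=[76], NbPlays=[441])

# `r"n$"i` matches names whose last character is `n` or `N`.
n_names = names(cols_demo, r"n$"i)
```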
</details>
### Exercise 6
Write a function `collatz` that runs the following process. Start with a
positive integer `n`. If it is even, divide it by two. If it is odd, multiply
it by 3 and add one. Repeat these steps until `n` equals 1. The function should
return the number of steps needed to reach 1.
Create a dictionary `d` that maps each number of steps to the list of numbers from
the range `1:100` that require this number of steps.
<details>
<summary>Solution</summary>
```
julia> function collatz(n)
           i = 0
           while n != 1
               i += 1
               n = iseven(n) ? div(n, 2) : 3 * n + 1
           end
           return i
       end
collatz (generic function with 1 method)
julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()
julia> for n in 1:100
           i = collatz(n)
           if haskey(d, i)
               push!(d[i], n)
           else
               d[i] = [n]
           end
       end
julia> d
Dict{Int64, Vector{Int64}} with 45 entries:
5 => [5, 32]
35 => [78, 79]
110 => [82, 83]
30 => [86, 87, 89]
32 => [57, 59]
6 => [10, 64]
115 => [73]
112 => [54, 55]
4 => [16]
13 => [34, 35]
104 => [47]
12 => [17, 96]
23 => [25]
111 => [27]
92 => [91]
11 => [48, 52, 53]
118 => [97]
⋮ => ⋮
```
As we can see, even for small `n` the number of steps required to reach `1`
can get quite large.
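The `haskey` branch in the loop can be collapsed with `get!`, which returns the value stored under a key and, if the key is missing, first inserts a default. The following sketch repeats the `collatz` definition so that it runs on its own; `d2` is a name introduced here to avoid overwriting `d`:

```julia
function collatz(n)
    i = 0
    while n != 1
        i += 1
        n = iseven(n) ? div(n, 2) : 3 * n + 1
    end
    return i
end

d2 = Dict{Int, Vector{Int}}()
for n in 1:100
    # Fetch the vector stored under this step count, creating an empty one
    # on first use, and append `n` to it.
    push!(get!(() -> Int[], d2, collatz(n)), n)
end
```

The function form `get!(() -> Int[], d2, key)` only builds the empty vector when the key is actually missing.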
</details>
### Exercise 7
Using the `d` dictionary, make a scatter plot of the number of steps required
versus the average of the numbers that require this number of steps.
<details>
<summary>Solution</summary>
```
using Plots
using Statistics
steps = collect(keys(d))
mean_number = mean.(values(d))
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
```
Note that we needed to use `collect` on `keys` because `scatter` expects an array,
not just an iterator.
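This solution relies on `keys(d)` and `values(d)` iterating in the same order, which Julia documents for a dictionary that is not modified in between. An alternative sketch that makes the pairing explicit and sorts by the number of steps is shown below. The small `d_demo` dictionary is a hand-picked subset of `d`, and no plotting package is needed to run it:

```julia
using Statistics

# Hand-picked entries copied from the printed `d` above.
d_demo = Dict(5 => [5, 32], 4 => [16], 6 => [10, 64], 13 => [34, 35])

# Sort the key-value pairs by key so that the x values are increasing.
sorted_pairs = sort(collect(d_demo), by=first)
steps_demo = first.(sorted_pairs)
mean_demo = [mean(v) for (_, v) in sorted_pairs]
```

The two resulting vectors can be passed to `scatter` in the same way as `steps` and `mean_number`.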
</details>
### Exercise 8
Repeat the process from exercises 6 and 7, but this time use a data frame
and write an appropriate expression using the `combine` and `groupby`
functions (as explained in the last part of chapter 9). This time
perform the computations for numbers ranging from one to one million.
<details>
<summary>Solution</summary>
```
df = DataFrame(n=1:10^6);
df.collatz = collatz.(df.n);
agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
```
</details>
### Exercise 9
Set the seed of the random number generator to `1234`. Draw 100 random points
from the interval `[0, 1]`. Store this vector in a data frame as the `x` column.
Now compute the `y` column using the formula `4 * (x - 0.5) ^ 2`.
Add random noise to column `y` that has a normal distribution with mean 0 and
standard deviation 0.25. Call the resulting column `z`.
Make a scatter plot with `x` on the x-axis and `y` and `z` on the y-axis.
<details>
<summary>Solution</summary>
```
using Random
Random.seed!(1234)
df = DataFrame(x=rand(100))
df.y = 4 .* (df.x .- 0.5) .^ 2
df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
```
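Dividing standard normal draws by 4 gives noise with standard deviation 1/4 = 0.25, because scaling a random variable by a constant scales its standard deviation by the same factor. A quick empirical check follows; the sample size `10^6` and the names `noise`, `noise_mean`, and `noise_std` are choices made here only to keep the estimate stable and the example readable:

```julia
using Random
using Statistics

Random.seed!(1234)

# One million draws from N(0, 1), scaled to have standard deviation 0.25.
noise = randn(10^6) / 4

noise_mean = mean(noise)  # expected to be close to 0
noise_std = std(noise)    # expected to be close to 0.25
```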
</details>
### Exercise 10
Add a LOESS regression line of `x` explaining `z` to the figure produced in exercise 9.
<details>
<summary>Solution</summary>
```
using Loess
model = loess(df.x, df.z);
x_predict = sort(df.x)
z_predict = predict(model, x_predict)
plot!(x_predict, z_predict; label="z predicted")
```
</details>