add exercises

2022-10-14 12:27:04 +02:00
parent 67f17f270f
commit 3b8ffa5d40
17 changed files with 4341 additions and 7 deletions
--- a/exercises/exercises04.md
+++ b/exercises/exercises04.md
@@ -0,0 +1,536 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 4
+
+# Problems
+
+### Exercise 1
+
+Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix
+columnwise with consecutive numbers). Next calculate sum, mean and standard
+deviation of each row and each column of this matrix.
+
+### Exercise 2
+
+For each column of the matrix created in exercise 1 compute its range
+(i.e. the difference between maximum and minimum element stored in it).
+
+### Exercise 3
+
+This is data for car speed (mph) and distance taken to stop (ft)
+from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
+
+```
+speed  dist
+    4     2
+    4    10
+    7     4
+    7    22
+    8    16
+    9    10
+   10    18
+   10    26
+   10    34
+   11    17
+   11    28
+   12    14
+   12    20
+   12    24
+   12    28
+   13    26
+   13    34
+   13    34
+   13    46
+   14    26
+   14    36
+   14    60
+   14    80
+   15    20
+   15    26
+   15    54
+   16    32
+   16    40
+   17    32
+   17    40
+   17    50
+   18    42
+   18    56
+   18    76
+   18    84
+   19    36
+   19    46
+   19    68
+   20    32
+   20    48
+   20    52
+   20    56
+   20    64
+   22    66
+   23    54
+   24    70
+   24    92
+   24    93
+   24   120
+   25    85
+```
+
+Load this data into Julia (this is part of the exercise) and fit a linear
+regression where speed is a feature and distance is target variable.
+
+### Exercise 4
+
+Plot the data loaded in exercise 4. Additionally plot the fitted regression
+(you need to check Plots.jl documentation to find a way to do this).
+
+### Exercise 5
+
+A simple code for calculation of Fibonacci numbers for positive
+arguments is as follows:
+
+```
+fib(n) =n < 3 ? 1 : fib(n-1) + fib(n-2)
+```
+
+Using the BenchmarkTools.jl package measure runtime of this function for
+`n` ranging from `1` to `20`.
+
+### Exercise 6
+
+Improve the speed of code from exercise 5 by using a dictionary where you
+store a mapping of `n` to `fib(n)`. Measure the performance of this function
+for the same range of values as in exercise 5.
+
+### Exercise 7
+
+Create a vector containing named tuples representing elements of a 4x4 grid.
+So the first element of this vector should be `(x=1, y=1)` and last should be
+`(x=4, y=4)`. Store the vector in variable `v`.
+
+### Exercise 8
+
+The `filter` function allows you to select some values of an input collection.
+Check its documentation first. Next, use it to keep from the vector `v` from
+exercise 7 only elements whose sum is even.
+
+### Exercise 9
+
+Check the documentation of the `filter!` function. Perform the same operation
+as asked in exercise 8 but using `filter!`. What is the difference?
+
+### Exercise 10
+
+Write a function that takes a number `n`. Next it generates two independent
+random vectors of length `n` and returns their correlation coefficient.
+Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
+and `10000`.
+Create a plot with four histograms of distribution of computed Pearson
+correlation coefficient. Check in the Plots.jl package which function can be
+used to plot histograms.
+
+# Solutions
+
+<details>
+
+<summary>Show!</summary>
+
+### Exercise 1
+
+Write:
+```
+julia> using Statistics
+
+julia> mat = [1 3 5
+              2 4 6]
+2×3 Matrix{Int64}:
+ 1  3  5
+ 2  4  6
+
+julia> sum(mat, dims=1)
+1×3 Matrix{Int64}:
+ 3  7  11
+
+julia> sum(mat, dims=2)
+2×1 Matrix{Int64}:
+  9
+ 12
+
+julia> mean(mat, dims=1)
+1×3 Matrix{Float64}:
+ 1.5  3.5  5.5
+
+julia> mean(mat, dims=2)
+2×1 Matrix{Float64}:
+ 3.0
+ 4.0
+
+julia> std(mat, dims=1)
+1×3 Matrix{Float64}:
+ 0.707107  0.707107  0.707107
+
+julia> std(mat, dims=2)
+2×1 Matrix{Float64}:
+ 2.0
+ 2.0
+```
+
+Observe that the returned statistics are also stored in matrices.
+If we compute them for columns (`dims=1`) then the produced matrix has one row.
+If we compute them for rows (`dims=2`) then the produced matrix has one column.
+
+### Exercise 2
+
+Here are some ways you can do it:
+```
+julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
+3-element Vector{Int64}:
+ 1
+ 1
+ 1
+
+julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
+3-element Vector{Int64}:
+ 1
+ 1
+ 1
+```
+
+Observe that if we used `eachcol` the produced result is a vector (not a matrix
+like in exercise 1).
+
+### Exercise 3
+
+First create a matrix with source data by copy pasting it from the exercise
+like this:
+```
+data = [
+    4     2
+    4    10
+    7     4
+    7    22
+    8    16
+    9    10
+   10    18
+   10    26
+   10    34
+   11    17
+   11    28
+   12    14
+   12    20
+   12    24
+   12    28
+   13    26
+   13    34
+   13    34
+   13    46
+   14    26
+   14    36
+   14    60
+   14    80
+   15    20
+   15    26
+   15    54
+   16    32
+   16    40
+   17    32
+   17    40
+   17    50
+   18    42
+   18    56
+   18    76
+   18    84
+   19    36
+   19    46
+   19    68
+   20    32
+   20    48
+   20    52
+   20    56
+   20    64
+   22    66
+   23    54
+   24    70
+   24    92
+   24    93
+   24   120
+   25    85
+]
+```
+
+Now use the GLM.jl package to fit the model:
+
+```
+julia> using GLM
+
+julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
+StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64, Matrix{Float64}}
+
+distance ~ 1 + speed
+
+Coefficients:
+─────────────────────────────────────────────────────────────────────────
+                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
+─────────────────────────────────────────────────────────────────────────
+(Intercept)  -17.5791     6.75844   -2.60    0.0123  -31.1678    -3.99034
+speed          3.93241    0.415513   9.46    <1e-11    3.09696    4.76785
+─────────────────────────────────────────────────────────────────────────
+```
+
+You can get the same estimates using the `\` operator like this:
+```
+julia> [ones(50) data[:, 1]] \ data[:, 2]
+2-element Vector{Float64}:
+ -17.579094890510966
+   3.9324087591240877
+```
+
+### Exercise 4
+
+Run the following:
+```
+using Plots
+scatter(data[:, 1], data[:, 2];
+        xlab="speed", ylab="distance", legend=false, smooth=true)
+```
+
+The `smooth=true` keyword argument adds the linear regression line to the plot.
+
+### Exercise 5
+
+Use the following code:
+```
+julia> using BenchmarkTools
+
+julia> for i in 1:40
+           print(i, " ")
+           @btime fib($i)
+       end
+1   2.500 ns (0 allocations: 0 bytes)
+2   2.700 ns (0 allocations: 0 bytes)
+3   4.800 ns (0 allocations: 0 bytes)
+4   7.500 ns (0 allocations: 0 bytes)
+5   12.112 ns (0 allocations: 0 bytes)
+6   19.980 ns (0 allocations: 0 bytes)
+7   32.125 ns (0 allocations: 0 bytes)
+8   52.696 ns (0 allocations: 0 bytes)
+9   85.010 ns (0 allocations: 0 bytes)
+10   140.311 ns (0 allocations: 0 bytes)
+11   222.177 ns (0 allocations: 0 bytes)
+12   359.903 ns (0 allocations: 0 bytes)
+13   582.123 ns (0 allocations: 0 bytes)
+14   1.000 μs (0 allocations: 0 bytes)
+15   1.560 μs (0 allocations: 0 bytes)
+16   2.522 μs (0 allocations: 0 bytes)
+17   4.000 μs (0 allocations: 0 bytes)
+18   6.600 μs (0 allocations: 0 bytes)
+19   11.400 μs (0 allocations: 0 bytes)
+20   18.100 μs (0 allocations: 0 bytes)
+```
+
+Notice that execution time for number `n` is roughly sum of ececution times
+for numbers `n-1` and `n-2`.
+
+### Exercise 6
+
+Use the following code:
+
+```
+julia> fib_dict = Dict{Int, Int}()
+Dict{Int64, Int64}()
+
+julia> function fib2(n)
+           haskey(fib_dict, n) && return fib_dict[n]
+           fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
+           fib_dict[n] = fib_n
+           return fib_n
+       end
+fib2 (generic function with 1 method)
+
+julia> for i in 1:20
+           print(i, " ")
+           @btime fib2($i)
+       end
+1   40.808 ns (0 allocations: 0 bytes)
+2   40.101 ns (0 allocations: 0 bytes)
+3   40.101 ns (0 allocations: 0 bytes)
+4   40.707 ns (0 allocations: 0 bytes)
+5   42.727 ns (0 allocations: 0 bytes)
+6   40.909 ns (0 allocations: 0 bytes)
+7   40.404 ns (0 allocations: 0 bytes)
+8   40.707 ns (0 allocations: 0 bytes)
+9   40.808 ns (0 allocations: 0 bytes)
+10   39.798 ns (0 allocations: 0 bytes)
+11   40.909 ns (0 allocations: 0 bytes)
+12   40.404 ns (0 allocations: 0 bytes)
+13   42.872 ns (0 allocations: 0 bytes)
+14   42.626 ns (0 allocations: 0 bytes)
+15   47.972 ns (1 allocation: 16 bytes)
+16   46.505 ns (1 allocation: 16 bytes)
+17   46.302 ns (1 allocation: 16 bytes)
+18   45.390 ns (1 allocation: 16 bytes)
+19   47.160 ns (1 allocation: 16 bytes)
+20   46.201 ns (1 allocation: 16 bytes)
+```
+
+Note that benchmarking essentially gives us a time of dictionary lookup.
+The reason is that `@btime` executes the same expression many times, so
+for the fastest execution time the value for each `n` is already stored in
+`fib_dict`.
+
+It would be more interesting to see the runtime of `fib2` for some large value
+of `n` executed once:
+
+```
+julia> @time fib2(100)
+  0.000018 seconds (107 allocations: 1.672 KiB)
+3736710778780434371
+
+julia> @time fib2(200)
+  0.000025 seconds (204 allocations: 20.453 KiB)
+-1123705814761610347
+```
+
+As you can see things are indeed fast. Note that for `n=200` we get a negative
+values because of integer overflow.
+
+As a more advanced topic (not covered in the book) it is worth to comment that
+`fib2` is not type stable. If we wanted to make it type stable we need to
+declare `fib_dict` dictionary as `const`. Here is the code and benchmarks
+(you need to restart Julia to run this test):
+
+```
+julia> const fib_dict = Dict{Int, Int}()
+Dict{Int64, Int64}()
+
+julia> function fib2(n)
+           haskey(fib_dict, n) && return fib_dict[n]
+           fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
+           fib_dict[n] = fib_n
+           return fib_n
+       end
+fib2 (generic function with 1 method)
+
+julia> @time fib2(100)
+  0.000014 seconds (6 allocations: 5.828 KiB)
+3736710778780434371
+
+julia> @time fib2(200)
+  0.000011 seconds (3 allocations: 17.312 KiB)
+-1123705814761610347
+```
+
+As you can see the code does less allocations and is faster now.
+
+### Exercise 7
+
+Since we are asked to create a vector we can write:
+
+```
+julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4]
+16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
+ (x = 1, y = 1)
+ (x = 1, y = 2)
+ (x = 1, y = 3)
+ (x = 1, y = 4)
+ (x = 2, y = 1)
+ (x = 2, y = 2)
+ (x = 2, y = 3)
+ (x = 2, y = 4)
+ (x = 3, y = 1)
+ (x = 3, y = 2)
+ (x = 3, y = 3)
+ (x = 3, y = 4)
+ (x = 4, y = 1)
+ (x = 4, y = 2)
+ (x = 4, y = 3)
+ (x = 4, y = 4)
+```
+
+Note (not covered in the book) that you could create a matrix by changing
+the syntax a bit:
+
+```
+julia> [(x=x, y=y) for x in 1:4, y in 1:4]
+4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
+ (x = 1, y = 1)  (x = 1, y = 2)  (x = 1, y = 3)  (x = 1, y = 4)
+ (x = 2, y = 1)  (x = 2, y = 2)  (x = 2, y = 3)  (x = 2, y = 4)
+ (x = 3, y = 1)  (x = 3, y = 2)  (x = 3, y = 3)  (x = 3, y = 4)
+ (x = 4, y = 1)  (x = 4, y = 2)  (x = 4, y = 3)  (x = 4, y = 4)
+```
+
+Finally, we can use a bit shorter syntax (covered in chapter 14 of the book):
+
+```
+julia> [(; x, y) for x in 1:4, y in 1:4]
+4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
+ (x = 1, y = 1)  (x = 1, y = 2)  (x = 1, y = 3)  (x = 1, y = 4)
+ (x = 2, y = 1)  (x = 2, y = 2)  (x = 2, y = 3)  (x = 2, y = 4)
+ (x = 3, y = 1)  (x = 3, y = 2)  (x = 3, y = 3)  (x = 3, y = 4)
+ (x = 4, y = 1)  (x = 4, y = 2)  (x = 4, y = 3)  (x = 4, y = 4)
+```
+
+### Exercise 8
+
+To get help on the `filter` function write `?filter`. Next run:
+
+```
+julia> filter(e -> iseven(e.x + e.y), v)
+8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
+ (x = 1, y = 1)
+ (x = 1, y = 3)
+ (x = 2, y = 2)
+ (x = 2, y = 4)
+ (x = 3, y = 1)
+ (x = 3, y = 3)
+ (x = 4, y = 2)
+ (x = 4, y = 4)
+```
+
+### Exercise 9
+
+To get help on the `filter!` function write `?filter!`. Next run:
+
+```
+julia> filter!(e -> iseven(e.x + e.y), v)
+8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
+ (x = 1, y = 1)
+ (x = 1, y = 3)
+ (x = 2, y = 2)
+ (x = 2, y = 4)
+ (x = 3, y = 1)
+ (x = 3, y = 3)
+ (x = 4, y = 2)
+ (x = 4, y = 4)
+
+julia> v
+8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
+ (x = 1, y = 1)
+ (x = 1, y = 3)
+ (x = 2, y = 2)
+ (x = 2, y = 4)
+ (x = 3, y = 1)
+ (x = 3, y = 3)
+ (x = 4, y = 2)
+ (x = 4, y = 4)
+```
+
+Notice that `filter` allocated a new vector, while `filter!` updated the `v`
+vector in place.
+
+### Exercise 10
+
+You can use for example the following code:
+
+```
+using Statistics
+using Plots
+rand_cor(n) = cor(rand(n), rand(n))
+plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false)
+      for n in [10, 100, 1000, 10000]]...)
+```
+
+Observe that as you increase `n` the dispersion of the correlation coefficient
+decreases.
+
+</details>