# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 4
# Problems
### Exercise 1
Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix
columnwise with consecutive numbers). Next calculate sum, mean and standard
deviation of each row and each column of this matrix.
Solution
Write:
```
julia> using Statistics
julia> mat = [1 3 5
2 4 6]
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> sum(mat, dims=1)
1×3 Matrix{Int64}:
3 7 11
julia> sum(mat, dims=2)
2×1 Matrix{Int64}:
9
12
julia> mean(mat, dims=1)
1×3 Matrix{Float64}:
1.5 3.5 5.5
julia> mean(mat, dims=2)
2×1 Matrix{Float64}:
3.0
4.0
julia> std(mat, dims=1)
1×3 Matrix{Float64}:
0.707107 0.707107 0.707107
julia> std(mat, dims=2)
2×1 Matrix{Float64}:
2.0
2.0
```
Observe that the returned statistics are also stored in matrices.
If we compute them for columns (`dims=1`) then the produced matrix has one row.
If we compute them for rows (`dims=2`) then the produced matrix has one column.
### Exercise 2
For each column of the matrix created in exercise 1 compute its range
(i.e. the difference between maximum and minimum element stored in it).
Solution
Here are some ways you can do it:
```
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
3-element Vector{Int64}:
1
1
1
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
3-element Vector{Int64}:
1
1
1
```
Observe that if we used `eachcol` the produced result is a vector (not a matrix
like in exercise 1).
### Exercise 3
This is data for car speed (mph) and distance taken to stop (ft)
from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
```
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
```
Load this data into Julia (this is part of the exercise) and fit a linear
regression where speed is a feature and distance is target variable.
Solution
First create a matrix with source data by copy pasting it from the exercise
like this:
```
data = [
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
]
```
Now use the GLM.jl package to fit the model:
```
julia> using GLM
julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64, Matrix{Float64}}
distance ~ 1 + speed
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -17.5791 6.75844 -2.60 0.0123 -31.1678 -3.99034
speed 3.93241 0.415513 9.46 <1e-11 3.09696 4.76785
─────────────────────────────────────────────────────────────────────────
```
You can get the same estimates using the `\` operator like this:
```
julia> [ones(50) data[:, 1]] \ data[:, 2]
2-element Vector{Float64}:
-17.579094890510966
3.9324087591240877
```
### Exercise 4
Plot the data loaded in exercise 4. Additionally plot the fitted regression
(you need to check Plots.jl documentation to find a way to do this).
Solution
Run the following:
```
using Plots
scatter(data[:, 1], data[:, 2];
xlab="speed", ylab="distance", legend=false, smooth=true)
```
The `smooth=true` keyword argument adds the linear regression line to the plot.
### Exercise 5
A simple code for calculation of Fibonacci numbers for positive
arguments is as follows:
```
fib(n) =n < 3 ? 1 : fib(n-1) + fib(n-2)
```
Using the BenchmarkTools.jl package measure runtime of this function for
`n` ranging from `1` to `20`.
Solution
Use the following code:
```
julia> using BenchmarkTools
julia> for i in 1:40
print(i, " ")
@btime fib($i)
end
1 2.500 ns (0 allocations: 0 bytes)
2 2.700 ns (0 allocations: 0 bytes)
3 4.800 ns (0 allocations: 0 bytes)
4 7.500 ns (0 allocations: 0 bytes)
5 12.112 ns (0 allocations: 0 bytes)
6 19.980 ns (0 allocations: 0 bytes)
7 32.125 ns (0 allocations: 0 bytes)
8 52.696 ns (0 allocations: 0 bytes)
9 85.010 ns (0 allocations: 0 bytes)
10 140.311 ns (0 allocations: 0 bytes)
11 222.177 ns (0 allocations: 0 bytes)
12 359.903 ns (0 allocations: 0 bytes)
13 582.123 ns (0 allocations: 0 bytes)
14 1.000 μs (0 allocations: 0 bytes)
15 1.560 μs (0 allocations: 0 bytes)
16 2.522 μs (0 allocations: 0 bytes)
17 4.000 μs (0 allocations: 0 bytes)
18 6.600 μs (0 allocations: 0 bytes)
19 11.400 μs (0 allocations: 0 bytes)
20 18.100 μs (0 allocations: 0 bytes)
```
Notice that execution time for number `n` is roughly sum of ececution times
for numbers `n-1` and `n-2`.
### Exercise 6
Improve the speed of code from exercise 5 by using a dictionary where you
store a mapping of `n` to `fib(n)`. Measure the performance of this function
for the same range of values as in exercise 5.
Solution
Use the following code:
```
julia> fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> for i in 1:20
print(i, " ")
@btime fib2($i)
end
1 40.808 ns (0 allocations: 0 bytes)
2 40.101 ns (0 allocations: 0 bytes)
3 40.101 ns (0 allocations: 0 bytes)
4 40.707 ns (0 allocations: 0 bytes)
5 42.727 ns (0 allocations: 0 bytes)
6 40.909 ns (0 allocations: 0 bytes)
7 40.404 ns (0 allocations: 0 bytes)
8 40.707 ns (0 allocations: 0 bytes)
9 40.808 ns (0 allocations: 0 bytes)
10 39.798 ns (0 allocations: 0 bytes)
11 40.909 ns (0 allocations: 0 bytes)
12 40.404 ns (0 allocations: 0 bytes)
13 42.872 ns (0 allocations: 0 bytes)
14 42.626 ns (0 allocations: 0 bytes)
15 47.972 ns (1 allocation: 16 bytes)
16 46.505 ns (1 allocation: 16 bytes)
17 46.302 ns (1 allocation: 16 bytes)
18 45.390 ns (1 allocation: 16 bytes)
19 47.160 ns (1 allocation: 16 bytes)
20 46.201 ns (1 allocation: 16 bytes)
```
Note that benchmarking essentially gives us a time of dictionary lookup.
The reason is that `@btime` executes the same expression many times, so
for the fastest execution time the value for each `n` is already stored in
`fib_dict`.
It would be more interesting to see the runtime of `fib2` for some large value
of `n` executed once:
```
julia> @time fib2(100)
0.000018 seconds (107 allocations: 1.672 KiB)
3736710778780434371
julia> @time fib2(200)
0.000025 seconds (204 allocations: 20.453 KiB)
-1123705814761610347
```
As you can see things are indeed fast. Note that for `n=200` we get a negative
values because of integer overflow.
As a more advanced topic (not covered in the book) it is worth to comment that
`fib2` is not type stable. If we wanted to make it type stable we need to
declare `fib_dict` dictionary as `const`. Here is the code and benchmarks
(you need to restart Julia to run this test):
```
julia> const fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> @time fib2(100)
0.000014 seconds (6 allocations: 5.828 KiB)
3736710778780434371
julia> @time fib2(200)
0.000011 seconds (3 allocations: 17.312 KiB)
-1123705814761610347
```
As you can see the code does less allocations and is faster now.
### Exercise 7
Create a vector containing named tuples representing elements of a 4x4 grid.
So the first element of this vector should be `(x=1, y=1)` and last should be
`(x=4, y=4)`. Store the vector in variable `v`.
Solution
Since we are asked to create a vector we can write:
```
julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4]
16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 2)
(x = 1, y = 3)
(x = 1, y = 4)
(x = 2, y = 1)
(x = 2, y = 2)
(x = 2, y = 3)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 2)
(x = 3, y = 3)
(x = 3, y = 4)
(x = 4, y = 1)
(x = 4, y = 2)
(x = 4, y = 3)
(x = 4, y = 4)
```
Note (not covered in the book) that you could create a matrix by changing
the syntax a bit:
```
julia> [(x=x, y=y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
Finally, we can use a bit shorter syntax (covered in chapter 14 of the book):
```
julia> [(; x, y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
### Exercise 8
The `filter` function allows you to select some values of an input collection.
Check its documentation first. Next, use it to keep from the vector `v` from
exercise 7 only elements whose sum is even.
Solution
To get help on the `filter` function write `?filter`. Next run:
```
julia> filter(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
```
### Exercise 9
Check the documentation of the `filter!` function. Perform the same operation
as asked in exercise 8 but using `filter!`. What is the difference?
Solution
To get help on the `filter!` function write `?filter!`. Next run:
```
julia> filter!(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
julia> v
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
```
Notice that `filter` allocated a new vector, while `filter!` updated the `v`
vector in place.
### Exercise 10
Write a function that takes a number `n`. Next it generates two independent
random vectors of length `n` and returns their correlation coefficient.
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
and `10000`.
Create a plot with four histograms of distribution of computed Pearson
correlation coefficient. Check in the Plots.jl package which function can be
used to plot histograms.
Solution
You can use for example the following code:
```
using Statistics
using Plots
rand_cor(n) = cor(rand(n), rand(n))
plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false)
for n in [10, 100, 1000, 10000]]...)
```
Observe that as you increase `n` the dispersion of the correlation coefficient
decreases.