13 KiB
Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 4
Problems
Exercise 1
Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix columnwise with consecutive numbers). Next calculate sum, mean and standard deviation of each row and each column of this matrix.
Solution
Write:
julia> using Statistics
julia> mat = [1 3 5
2 4 6]
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> sum(mat, dims=1)
1×3 Matrix{Int64}:
3 7 11
julia> sum(mat, dims=2)
2×1 Matrix{Int64}:
9
12
julia> mean(mat, dims=1)
1×3 Matrix{Float64}:
1.5 3.5 5.5
julia> mean(mat, dims=2)
2×1 Matrix{Float64}:
3.0
4.0
julia> std(mat, dims=1)
1×3 Matrix{Float64}:
0.707107 0.707107 0.707107
julia> std(mat, dims=2)
2×1 Matrix{Float64}:
2.0
2.0
Observe that the returned statistics are also stored in matrices. If
we compute them for columns (dims=1
) then the produced
matrix has one row. If we compute them for rows (dims=2
)
then the produced matrix has one column.
Exercise 2
For each column of the matrix created in exercise 1 compute its range (i.e. the difference between maximum and minimum element stored in it).
Solution
Here are some ways you can do it:
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
3-element Vector{Int64}:
1
1
1
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
3-element Vector{Int64}:
1
1
1
Observe that if we used eachcol
the produced result is a
vector (not a matrix like in exercise 1).
Exercise 3
This is data for car speed (mph) and distance taken to stop (ft) from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
Load this data into Julia (this is part of the exercise) and fit a linear regression where speed is a feature and distance is target variable.
Solution
First create a matrix with source data by copy pasting it from the exercise like this:
data = [
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
]
Now use the GLM.jl package to fit the model:
julia> using GLM
julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64, Matrix{Float64}}
distance ~ 1 + speed
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -17.5791 6.75844 -2.60 0.0123 -31.1678 -3.99034
speed 3.93241 0.415513 9.46 <1e-11 3.09696 4.76785
─────────────────────────────────────────────────────────────────────────
You can get the same estimates using the \
operator like
this:
julia> [ones(50) data[:, 1]] \ data[:, 2]
2-element Vector{Float64}:
-17.579094890510966
3.9324087591240877
Exercise 4
Plot the data loaded in exercise 4. Additionally plot the fitted regression (you need to check Plots.jl documentation to find a way to do this).
Solution
Run the following:
using Plots
scatter(data[:, 1], data[:, 2];
xlab="speed", ylab="distance", legend=false, smooth=true)
The smooth=true
keyword argument adds the linear
regression line to the plot.
Exercise 5
A simple code for calculation of Fibonacci numbers for positive arguments is as follows:
fib(n) =n < 3 ? 1 : fib(n-1) + fib(n-2)
Using the BenchmarkTools.jl package measure runtime of this function
for n
ranging from 1
to 20
.
Solution
Use the following code:
julia> using BenchmarkTools
julia> for i in 1:40
print(i, " ")
@btime fib($i)
end
1 2.500 ns (0 allocations: 0 bytes)
2 2.700 ns (0 allocations: 0 bytes)
3 4.800 ns (0 allocations: 0 bytes)
4 7.500 ns (0 allocations: 0 bytes)
5 12.112 ns (0 allocations: 0 bytes)
6 19.980 ns (0 allocations: 0 bytes)
7 32.125 ns (0 allocations: 0 bytes)
8 52.696 ns (0 allocations: 0 bytes)
9 85.010 ns (0 allocations: 0 bytes)
10 140.311 ns (0 allocations: 0 bytes)
11 222.177 ns (0 allocations: 0 bytes)
12 359.903 ns (0 allocations: 0 bytes)
13 582.123 ns (0 allocations: 0 bytes)
14 1.000 μs (0 allocations: 0 bytes)
15 1.560 μs (0 allocations: 0 bytes)
16 2.522 μs (0 allocations: 0 bytes)
17 4.000 μs (0 allocations: 0 bytes)
18 6.600 μs (0 allocations: 0 bytes)
19 11.400 μs (0 allocations: 0 bytes)
20 18.100 μs (0 allocations: 0 bytes)
Notice that execution time for number n
is roughly sum
of execution times for numbers n-1
and
n-2
.
Exercise 6
Improve the speed of code from exercise 5 by using a dictionary where
you store a mapping of n
to fib(n)
. Measure
the performance of this function for the same range of values as in
exercise 5.
Solution
Use the following code:
julia> fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> for i in 1:20
print(i, " ")
@btime fib2($i)
end
1 40.808 ns (0 allocations: 0 bytes)
2 40.101 ns (0 allocations: 0 bytes)
3 40.101 ns (0 allocations: 0 bytes)
4 40.707 ns (0 allocations: 0 bytes)
5 42.727 ns (0 allocations: 0 bytes)
6 40.909 ns (0 allocations: 0 bytes)
7 40.404 ns (0 allocations: 0 bytes)
8 40.707 ns (0 allocations: 0 bytes)
9 40.808 ns (0 allocations: 0 bytes)
10 39.798 ns (0 allocations: 0 bytes)
11 40.909 ns (0 allocations: 0 bytes)
12 40.404 ns (0 allocations: 0 bytes)
13 42.872 ns (0 allocations: 0 bytes)
14 42.626 ns (0 allocations: 0 bytes)
15 47.972 ns (1 allocation: 16 bytes)
16 46.505 ns (1 allocation: 16 bytes)
17 46.302 ns (1 allocation: 16 bytes)
18 45.390 ns (1 allocation: 16 bytes)
19 47.160 ns (1 allocation: 16 bytes)
20 46.201 ns (1 allocation: 16 bytes)
Note that benchmarking essentially gives us a time of dictionary
lookup. The reason is that @btime
executes the same
expression many times, so for the fastest execution time the value for
each n
is already stored in fib_dict
.
It would be more interesting to see the runtime of fib2
for some large value of n
executed once:
julia> @time fib2(100)
0.000018 seconds (107 allocations: 1.672 KiB)
3736710778780434371
julia> @time fib2(200)
0.000025 seconds (204 allocations: 20.453 KiB)
-1123705814761610347
As you can see things are indeed fast. Note that for
n=200
we get a negative values because of integer
overflow.
As a more advanced topic (not covered in the book) it is worth to
comment that fib2
is not type stable. If we wanted to make
it type stable we need to declare fib_dict
dictionary as
const
. Here is the code and benchmarks (you need to restart
Julia to run this test):
julia> const fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> @time fib2(100)
0.000014 seconds (6 allocations: 5.828 KiB)
3736710778780434371
julia> @time fib2(200)
0.000011 seconds (3 allocations: 17.312 KiB)
-1123705814761610347
As you can see the code does less allocations and is faster now.
Exercise 7
Create a vector containing named tuples representing elements of a
4x4 grid. So the first element of this vector should be
(x=1, y=1)
and last should be (x=4, y=4)
.
Store the vector in variable v
.
Solution
Since we are asked to create a vector we can write:
julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4]
16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 2)
(x = 1, y = 3)
(x = 1, y = 4)
(x = 2, y = 1)
(x = 2, y = 2)
(x = 2, y = 3)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 2)
(x = 3, y = 3)
(x = 3, y = 4)
(x = 4, y = 1)
(x = 4, y = 2)
(x = 4, y = 3)
(x = 4, y = 4)
Note (not covered in the book) that you could create a matrix by changing the syntax a bit:
julia> [(x=x, y=y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
Finally, we can use a bit shorter syntax (covered in chapter 14 of the book):
julia> [(; x, y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
Exercise 8
The filter
function allows you to select some values of
an input collection. Check its documentation first. Next, use it to keep
from the vector v
from exercise 7 only elements whose sum
is even.
Solution
To get help on the filter
function write
?filter
. Next run:
julia> filter(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
Exercise 9
Check the documentation of the filter!
function. Perform
the same operation as asked in exercise 8 but using
filter!
. What is the difference?
Solution
To get help on the filter!
function write
?filter!
. Next run:
julia> filter!(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
julia> v
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
Notice that filter
allocated a new vector, while
filter!
updated the v
vector in place.
Exercise 10
Write a function that takes a number n
. Next it
generates two independent random vectors of length n
and
returns their correlation coefficient. Run this function
10000
times for n
equal to 10
,
100
, 1000
, and 10000
. Create a
plot with four histograms of distribution of computed Pearson
correlation coefficient. Check in the Plots.jl package which function
can be used to plot histograms.
Solution
You can use for example the following code:
using Statistics
using Plots
rand_cor(n) = cor(rand(n), rand(n))
plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false)
for n in [10, 100, 1000, 10000]]...)
Observe that as you increase n
the dispersion of the
correlation coefficient decreases.