add exercises

This commit is contained in:
Bogumił Kamiński
2022-10-14 12:27:04 +02:00
parent 67f17f270f
commit 3b8ffa5d40
17 changed files with 4341 additions and 7 deletions


@@ -5,6 +5,11 @@ This repository contains source codes for the
book that is written by Bogumił Kamiński and is planned to be published in 2022
by [Manning Publications Co.](https://www.manning.com/)
Extras:
* in the `/exercises` folder, for each book chapter, you can find 10 additional
  exercises with solutions (they are meant for self-study and are not discussed
  in the book)
## Setting up your environment
In order to prepare the Julia environment before working with the materials
@@ -71,9 +76,9 @@ The codes for each chapter are stored in files named *chXX.jl*, where *XX* is
chapter number. The exceptions are
* chapter 14, where additionally a separate *ch14_server.jl* is present along
with *ch14.jl* (the reason is that in this chapter we create a web service and
the *ch14_server.jl* contains the server-side code that should be run in a
separate Julia process);
* appendix A, where the file name used is *appA.txt* because it also
contains other instructions than only Julia code (in particular package
manager mode instructions).
@@ -113,8 +118,6 @@ There are the following videos that feature material related to this book:
* [Analysis of GitHub developer graph](https://www.twitch.tv/videos/1527593261)
(a shortened version of material covered in chapter 12)
## Data used in the book
For your convenience I additionally stored data files that we use in this book.
@@ -130,5 +133,3 @@ They are respectively:
<https://snap.stanford.edu/data/github-social.html> under GPL-3.0 License)
* owensboro.zip (for chapter 13, available at The Stanford Open Policing Project
under the Open Data Commons Attribution License)
<!-- markdownlint-disable-file MD033 -->

exercises/README.md Normal file

@@ -0,0 +1,33 @@
# Julia for Data Analysis
This folder contains additional exercises accompanying the book
["Julia for Data Analysis"](https://www.manning.com/books/julia-for-data-analysis?utm_source=bkamins&utm_medium=affiliate&utm_campaign=book_kaminski2_julia_3_17_22)
written by Bogumił Kamiński and planned to be published in 2022
by [Manning Publications Co.](https://www.manning.com/).
The exercises were prepared by [Bogumił Kamiński](https://github.com/bkamins) and [Daniel Kaszyński](https://www.linkedin.com/in/daniel-kaszy%C5%84ski-3a7807113/).
For each book chapter you can find 10 additional exercises with solutions.
The exercises are meant for self study and are not discussed in the book.
The exercises contained in this folder have varying levels of difficulty. The
problems for the first few chapters are meant to be relatively easy, while you
might find the exercises for the later chapters more challenging (if you are
learning Julia I still encourage you to try these exercises and walk through
the solutions, trying to understand them). In particular, in some exercises I
deliberately require using more functionality of the Julia ecosystem than is
covered in the book. This is meant to teach you how to use help and
documentation, as this is a very important skill to master.
The files containing exercises follow the naming convention `exercisesDD.md`,
where `DD` is the number of the book chapter for which the exercises were
prepared.
All the exercises should be solvable using the project environment that is used
throughout the book (please check the global README.md file of this repository
for details).
---
*Preparation of these exercises has been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*
![SGH & NAWA](logo.png)

exercises/example8.csv.gz Normal file (binary file not shown)

exercises/exercises02.md Normal file

@@ -0,0 +1,267 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 2
# Problems
### Exercise 1
Consider the following code:
```
x = [1, 2]
y = x
y[1] = 10
```
What is the value of `x[1]` and why?
### Exercise 2
How can you type `⚡ = 1`? Check whether this operation succeeds and what its result is.
### Exercise 3
What will be the value of variable `x` after running the following code and why?
```
x = 0.0
for i in 1:7_000_000
global x += 1/7
end
x /= 1_000_000
```
### Exercise 4
Express the type `Matrix{Bool}` using `Array` type.
### Exercise 5
Let `x` be a vector. Write code that prints an error message if `x` is empty
(has zero elements).
### Exercise 6
Write a function called `exec` that takes two values `x` and `y` and a function
accepting two arguments, call it `op`, and returns `op(x, y)`. Make `+`
the default value of `op`.
### Exercise 7
Write a function that calculates the sum of the absolute values of the elements
stored in a collection passed to it.
### Exercise 8
Write a function that swaps the first and last elements of an array in place.
### Exercise 9
Write a loop in global scope that calculates the sum of cubes of numbers from
`1` to `10^6`. Next use the `sum` function to perform the same computation.
What is the difference in timing of these operations?
### Exercise 10
Explain the value of the result of summation obtained in exercise 9.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
`x[1]` will be `10` because `y = x` does not copy the data but binds
the same vector to both variables `x` and `y`.
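*Extra* (a small illustration, not part of the original solution): to get an
independent vector instead, use the `copy` function:
```
julia> x = [1, 2];

julia> y = copy(x); # copy allocates a new vector

julia> y[1] = 10;

julia> x[1] # x is unaffected by the mutation of y
1
```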
### Exercise 2
In help mode (activated by `?`) copy-paste `⚡` to get:
```
help?> ⚡
"⚡" can be typed by \:zap:<tab>
```
After the `⚡ = 1` operation a new variable `⚡` is defined and it is bound
to value `1`.
### Exercise 3
`x` will have the value `0.9999999999242748`. This value is below `1.0` because
the representation of `1/7` in the `Float64` type is less than the rational
number 1/7, and the error accumulates when we perform the addition multiple
times.
*Extra*: You can check that the `Float64` representation is indeed a bit less
than the rational 1/7 by increasing the precision of the computations using the
`big` function:
```
julia> big(1/7) # convert Float64 to high-precision float
0.142857142857142849212692681248881854116916656494140625
julia> 1/big(7) # construct high-precision float directly
0.1428571428571428571428571428571428571428571428571428571428571428571428571428568
```
As you can see, there is a difference at the 17th place after the decimal
point, where we have `4` vs `5`.
### Exercise 4
It is `Array{Bool, 2}`. You immediately get this information in REPL:
```
julia> Matrix{Bool}
Matrix{Bool} (alias for Array{Bool, 2})
```
### Exercise 5
You can do it like this:
```
length(x) == 0 && println("x is empty")
```
*Extra*: Typically in such a case one would use the `isempty` function and
throw an exception instead of just printing a message (here I assume that `x`
was passed as an argument to a function):
```
isempty(x) && throw(ArgumentError("x is not allowed to be empty"))
```
### Exercise 6
Here are two ways to define the `exec` function:
```
exec1(x, y, op=+) = op(x, y)
exec2(x, y; op=+) = op(x, y)
```
The first of them uses a positional argument for `op`, and the second a keyword
argument. Here is the difference in how they are called:
```
julia> exec1(2, 3, *)
6
julia> exec2(2, 3; op=*)
6
```
### Exercise 7
Such a function can be written as:
```
sumabs(x) = sum(abs, x)
```
### Exercise 8
This can be written for example as:
```
function swap!(x)
f = x[1]
x[1] = x[end]
x[end] = f
return x
end
```
*Extra*: A more advanced way to write this function would be:
```
function swap!(x)
if length(x) > 1
x[begin], x[end] = x[end], x[begin]
end
return x
end
```
Note the differences in the code:
* we use `begin` instead of `1` to get the first element. This is a safer
practice since some collections in Julia do not use 1-based indexing (in
practice you are not likely to see them, so this comment is most relevant
for package developers)
* if there are `0` or `1` elements in the collection the function does nothing
  (depending on the context we might want to throw an error instead)
* in `x[begin], x[end] = x[end], x[begin]` we perform two assignments at the
  same time to avoid having to use a temporary variable `f` (this operation
  is technically called tuple destructuring; we discuss it in later chapters of
  the book)
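*Extra*: tuple destructuring can also be seen in isolation (a small example,
not part of the original solution):
```
julia> a, b = 1, 2;

julia> a, b = b, a; # two assignments at the same time, no temporary needed

julia> (a, b)
(2, 1)
```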
### Exercise 9
We use the `@time` macro that was introduced in chapter 1.
Version in global scope:
```
julia> s = 0
0
julia> @time for i in 1:10^6
global s += i^3
end
0.076299 seconds (2.00 M allocations: 30.517 MiB, 10.47% gc time)
```
Version with a function using a `sum` function:
```
julia> sum3(n) = sum(x -> x^3, 1:n)
sum3 (generic function with 1 method)
julia> @time sum3(10^6)
0.000012 seconds
-8222430735553051648
```
Version with `sum` function in global scope:
```
julia> @time sum(x -> x^3, 1:10^6)
0.027436 seconds (48.61 k allocations: 2.558 MiB, 99.75% compilation time)
-8222430735553051648
julia> @time sum(x -> x^3, 1:10^6)
0.025744 seconds (48.61 k allocations: 2.557 MiB, 99.76% compilation time)
-8222430735553051648
```
As you can see, using a loop in global scope is inefficient. It leads to
many allocations and slow execution.
Using the `sum3` function leads to the fastest execution. You might ask why
using `sum(x -> x^3, 1:10^6)` in global scope is slower. The reason is that the
anonymous function `x -> x^3` is defined anew each time this expression is
evaluated, which forces recompilation of the `sum` call (but it is still faster
than the loop in global scope).
For reference, check the function with a loop inside it:
```
julia> function sum3loop(n)
s = 0
for i in 1:n
s += i^3
end
return s
end
sum3loop (generic function with 1 method)
julia> @time sum3loop(10^6)
0.001378 seconds
-8222430735553051648
```
This is also much faster than a loop in global scope.
### Exercise 10
In exercise 9 we note that the result is `-8222430735553051648` which is a
negative value, although we are adding cubes of positive values. The
reason of the problem is that operations on integers overflow. If you
are working with numbers larger that can be stored in `Int` type, which is:
```
julia> typemax(Int)
9223372036854775807
```
use `big` numbers that we discussed in *Exercise 3*:
```
julia> @time sum(x -> big(x)^3, 1:10^6)
0.833234 seconds (11.05 M allocations: 236.113 MiB, 23.77% gc time, 2.63% compilation time)
250000500000250000000000
```
Now we get a correct result, at the cost of slower computation.
</details>

exercises/exercises03.md Normal file

@@ -0,0 +1,345 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 3
# Problems
### Exercise 1
Check what methods the `repeat` function has.
Are they all covered in the help for this function?
### Exercise 2
Write a function `fun2` that takes any vector and returns the difference between
the largest and the smallest element in this vector.
### Exercise 3
Generate a vector of one million random numbers from the `[0, 1]` interval.
Check which is the faster way to get the maximum and minimum elements of it.
One option is to use the `maximum` and `minimum` functions; the other is to use
the `extrema` function.
### Exercise 4
Assume you have accidentally typed `+x = 1` when intending to assign `1` to
the variable `x`. What effects can this operation have?
### Exercise 5
What is the result of calling `subtypes` on `Union{Bool, Missing}` and why?
### Exercise 6
Define two identical anonymous functions `x -> x + 1` in global scope. Do they
have the same type?
### Exercise 7
Define the `wrap` function taking one argument `i` and returning the anonymous
function `x -> x + i`. Is the type of such anonymous function the same across
calls to `wrap` function?
### Exercise 8
Write a function that accepts any `Integer` except `Bool` and returns
the passed value. If a `Bool` is passed, an error should be thrown.
### Exercise 9
The `@time` macro measures the time taken to run an expression, prints it,
and returns the value of the expression.
The `@elapsed` macro works differently: it does not print anything but returns
the time taken to evaluate an expression. Use the `@elapsed` macro to see how
long it takes to shuffle a vector of one million floats. Use the `shuffle`
function from the `Random` module.
### Exercise 10
Using the `@btime` macro benchmark the time of calculating the sum of one million
random floats.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Write:
```
julia> methods(repeat)
# 6 methods for generic function "repeat":
[1] repeat(A::AbstractArray; inner, outer) in Base at abstractarraymath.jl:392
[2] repeat(A::AbstractArray, counts...) in Base at abstractarraymath.jl:355
[3] repeat(c::Char, r::Integer) in Base at strings/string.jl:336
[4] repeat(c::AbstractChar, r::Integer) in Base at strings/string.jl:335
[5] repeat(s::Union{SubString{String}, String}, r::Integer) in Base at strings/substring.jl:248
[6] repeat(s::AbstractString, r::Integer) in Base at strings/basic.jl:715
```
Now type `?repeat` and you will see that there are four entries in the help.
The reason is that for `Char` and `AbstractChar`, as well as for
`AbstractString` and `Union{SubString{String}, String}`, there is a single
help entry.
Why do these cases have two methods defined?
The reason is performance. For example, `repeat(c::AbstractChar, r::Integer)`
is a generic method that accepts any character value,
and `repeat(c::Char, r::Integer)` is a faster version
that accepts only values of type `Char` (it is invoked by Julia
when a value of type `Char` is passed as an argument to `repeat`).
### Exercise 2
You can define it as follows:
```
fun2(x::AbstractVector) = maximum(x) - minimum(x)
```
or as follows:
```
function fun2(x::AbstractVector)
lo, hi = extrema(x)
return hi - lo
end
```
Note that these two functions will work with vectors of any elements that
are ordered and support subtraction (they do not have to be numbers).
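For example (assuming one of the `fun2` definitions above is in scope),
characters are ordered and their subtraction yields an integer:
```
julia> fun2(['a', 'd', 'b'])
3
```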
### Exercise 3
Here is a way to compare the performance of both options:
```
julia> using BenchmarkTools
julia> x = rand(10^6);
julia> @btime minimum($x), maximum($x)
860.700 μs (0 allocations: 0 bytes)
(1.489173560242918e-6, 0.9999984347293639)
julia> @btime extrema($x)
2.185 ms (0 allocations: 0 bytes)
(1.489173560242918e-6, 0.9999984347293639)
```
As you can see, in this situation, although `extrema` performs the operation
in a single pass over `x`, it is slower than computing `minimum` and `maximum`
in two passes.
### Exercise 4
If it is a fresh Julia session, you define a new `+` function in `Main` that
shadows the `+` from `Base`:
```
julia> +x=1
+ (generic function with 1 method)
julia> methods(+)
# 1 method for generic function "+":
[1] +(x) in Main at REPL[1]:1
julia> +(10)
1
```
This will also break any further uses of `+` in your programs:
```
julia> 1 + 2
ERROR: MethodError: no method matching +(::Int64, ::Int64)
You may have intended to import Base.:+
Closest candidates are:
+(::Any) at REPL[1]:1
```
If you have used addition earlier in the Julia session, the definition itself
will error instead. Start a fresh Julia session and try:
```
julia> 1 + 2
3
julia> +x=1
ERROR: error in method definition: function Base.+ must be explicitly imported to be extended
```
### Exercise 5
You get an empty vector:
```
julia> subtypes(Union{Bool, Missing})
Type[]
```
The reason is that the `subtypes` function returns subtypes only of explicitly
declared types that have names (the type of such types is `DataType` in Julia),
and a `Union` is not such a type.
*Extra*: For this reason `subtypes` has limited use. To check if one type
is a subtype of another type, use the `<:` operator.
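A quick illustration of the `<:` operator (a small example, not part of the
original solution):
```
julia> Bool <: Integer
true

julia> Missing <: Integer
false

julia> Union{Bool, Missing} <: Union{Integer, Missing}
true
```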
### Exercise 6
No, each of them has a different type:
```
julia> f1 = x -> x + 1
#1 (generic function with 1 method)
julia> f2 = x -> x + 1
#3 (generic function with 1 method)
julia> typeof(f1)
var"#1#2"
julia> typeof(f2)
var"#3#4"
```
This is the reason why a function call like `sum(x -> x^2, 1:10)` in global
scope triggers compilation each time:
```
julia> @time sum(x -> x^2, 1:10)
0.070714 seconds (167.41 k allocations: 8.815 MiB, 14.29% gc time, 93.91% compilation time)
385
julia> @time sum(x -> x^2, 1:10)
0.020971 seconds (47.82 k allocations: 2.529 MiB, 99.75% compilation time)
385
julia> @time sum(x -> x^2, 1:10)
0.021184 seconds (47.81 k allocations: 2.529 MiB, 99.77% compilation time)
385
```
### Exercise 7
Yes, the type is the same:
```
julia> wrap(i) = x -> x + i
wrap (generic function with 1 method)
julia> typeof(wrap(1))
var"#11#12"
julia> typeof(wrap(2))
var"#11#12"
```
Julia defines a new type for such an anonymous function only once.
The consequence of this is that expressions like
`sum(x -> x^i, 1:10)` inside a function, where `i` is an argument of the
function, do not trigger compilation on each call (as opposed to similar
expressions in global scope; see exercise 6).
```
julia> sumi(i) = sum(x -> x^i, 1:10)
sumi (generic function with 1 method)
julia> @time sumi(1)
0.000004 seconds
55
julia> @time sumi(2)
0.000001 seconds
385
julia> @time sumi(3)
0.000003 seconds
3025
```
### Exercise 8
We check subtypes of `Integer`:
```
julia> subtypes(Integer)
3-element Vector{Any}:
Bool
Signed
Unsigned
```
The first way to write such a function is then:
```
fun1(i::Union{Signed, Unsigned}) = i
```
and now we have:
```
julia> fun1(1)
1
julia> fun1(true)
ERROR: MethodError: no method matching fun1(::Bool)
```
The second way is:
```
fun2(i::Integer) = i
fun2(::Bool) = throw(ArgumentError("Bool is not supported"))
```
and now you have:
```
julia> fun2(1)
1
julia> fun2(true)
ERROR: ArgumentError: Bool is not supported
```
### Exercise 9
Here is the code that performs the task:
```
julia> using Random # needed to get access to shuffle
julia> x = rand(10^6); # generate random floats
julia> @elapsed shuffle(x)
0.0518085
julia> @elapsed shuffle(x)
0.01257
julia> @elapsed shuffle(x)
0.012483
```
Note that the first time we run `shuffle` it takes longer due to compilation.
### Exercise 10
The code you can use is:
```
julia> using BenchmarkTools
julia> @btime sum($(rand(10^6)))
155.300 μs (0 allocations: 0 bytes)
500330.6375697419
```
Note that the following:
```
julia> @btime sum(rand(10^6))
1.644 ms (2 allocations: 7.63 MiB)
500266.9457722128
```
would be an incorrect timing, as you would also measure the time of generating
the vector.
Alternatively you can e.g. write:
```
julia> x = rand(10^6);
julia> @btime sum($x)
154.700 μs (0 allocations: 0 bytes)
500151.95875364926
```
</details>

exercises/exercises04.md Normal file

@@ -0,0 +1,536 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 4
# Problems
### Exercise 1
Create a matrix of shape 2x3 containing the numbers from 1 to 6 (fill the
matrix columnwise with consecutive numbers). Next calculate the sum, mean, and
standard deviation of each row and each column of this matrix.
### Exercise 2
For each column of the matrix created in exercise 1 compute its range
(i.e. the difference between maximum and minimum element stored in it).
### Exercise 3
This is data for car speed (mph) and distance taken to stop (ft)
from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
```
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
```
Load this data into Julia (this is part of the exercise) and fit a linear
regression where speed is the feature and distance is the target variable.
### Exercise 4
Plot the data loaded in exercise 3. Additionally plot the fitted regression
line (you need to check the Plots.jl documentation to find a way to do this).
### Exercise 5
A simple code for calculation of Fibonacci numbers for positive
arguments is as follows:
```
fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
```
Using the BenchmarkTools.jl package measure runtime of this function for
`n` ranging from `1` to `20`.
### Exercise 6
Improve the speed of code from exercise 5 by using a dictionary where you
store a mapping of `n` to `fib(n)`. Measure the performance of this function
for the same range of values as in exercise 5.
### Exercise 7
Create a vector containing named tuples representing elements of a 4x4 grid.
So the first element of this vector should be `(x=1, y=1)` and last should be
`(x=4, y=4)`. Store the vector in variable `v`.
### Exercise 8
The `filter` function allows you to select some values of an input collection.
Check its documentation first. Next, use it to keep from the vector `v` from
exercise 7 only elements whose sum is even.
### Exercise 9
Check the documentation of the `filter!` function. Perform the same operation
as asked in exercise 8 but using `filter!`. What is the difference?
### Exercise 10
Write a function that takes a number `n`, generates two independent
random vectors of length `n`, and returns their correlation coefficient.
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
and `10000`.
Create a plot with four histograms of distribution of computed Pearson
correlation coefficient. Check in the Plots.jl package which function can be
used to plot histograms.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Write:
```
julia> using Statistics
julia> mat = [1 3 5
2 4 6]
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> sum(mat, dims=1)
1×3 Matrix{Int64}:
3 7 11
julia> sum(mat, dims=2)
2×1 Matrix{Int64}:
9
12
julia> mean(mat, dims=1)
1×3 Matrix{Float64}:
1.5 3.5 5.5
julia> mean(mat, dims=2)
2×1 Matrix{Float64}:
3.0
4.0
julia> std(mat, dims=1)
1×3 Matrix{Float64}:
0.707107 0.707107 0.707107
julia> std(mat, dims=2)
2×1 Matrix{Float64}:
2.0
2.0
```
Observe that the returned statistics are also stored in matrices.
If we compute them for columns (`dims=1`) then the produced matrix has one row.
If we compute them for rows (`dims=2`) then the produced matrix has one column.
### Exercise 2
Here are some ways you can do it:
```
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
3-element Vector{Int64}:
1
1
1
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
3-element Vector{Int64}:
1
1
1
```
Observe that when we use `eachcol` the produced result is a vector (not a
matrix as in exercise 1).
### Exercise 3
First create a matrix with the source data by copy-pasting it from the
exercise like this:
```
data = [
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
]
```
Now use the GLM.jl package to fit the model:
```
julia> using GLM
julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
distance ~ 1 + speed
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -17.5791 6.75844 -2.60 0.0123 -31.1678 -3.99034
speed 3.93241 0.415513 9.46 <1e-11 3.09696 4.76785
─────────────────────────────────────────────────────────────────────────
```
You can get the same estimates using the `\` operator like this:
```
julia> [ones(50) data[:, 1]] \ data[:, 2]
2-element Vector{Float64}:
-17.579094890510966
3.9324087591240877
```
### Exercise 4
Run the following:
```
using Plots
scatter(data[:, 1], data[:, 2];
xlab="speed", ylab="distance", legend=false, smooth=true)
```
The `smooth=true` keyword argument adds the linear regression line to the plot.
### Exercise 5
Use the following code:
```
julia> using BenchmarkTools
julia> for i in 1:20
print(i, " ")
@btime fib($i)
end
1 2.500 ns (0 allocations: 0 bytes)
2 2.700 ns (0 allocations: 0 bytes)
3 4.800 ns (0 allocations: 0 bytes)
4 7.500 ns (0 allocations: 0 bytes)
5 12.112 ns (0 allocations: 0 bytes)
6 19.980 ns (0 allocations: 0 bytes)
7 32.125 ns (0 allocations: 0 bytes)
8 52.696 ns (0 allocations: 0 bytes)
9 85.010 ns (0 allocations: 0 bytes)
10 140.311 ns (0 allocations: 0 bytes)
11 222.177 ns (0 allocations: 0 bytes)
12 359.903 ns (0 allocations: 0 bytes)
13 582.123 ns (0 allocations: 0 bytes)
14 1.000 μs (0 allocations: 0 bytes)
15 1.560 μs (0 allocations: 0 bytes)
16 2.522 μs (0 allocations: 0 bytes)
17 4.000 μs (0 allocations: 0 bytes)
18 6.600 μs (0 allocations: 0 bytes)
19 11.400 μs (0 allocations: 0 bytes)
20 18.100 μs (0 allocations: 0 bytes)
```
Notice that the execution time for number `n` is roughly the sum of the
execution times for numbers `n-1` and `n-2`.
### Exercise 6
Use the following code:
```
julia> fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> for i in 1:20
print(i, " ")
@btime fib2($i)
end
1 40.808 ns (0 allocations: 0 bytes)
2 40.101 ns (0 allocations: 0 bytes)
3 40.101 ns (0 allocations: 0 bytes)
4 40.707 ns (0 allocations: 0 bytes)
5 42.727 ns (0 allocations: 0 bytes)
6 40.909 ns (0 allocations: 0 bytes)
7 40.404 ns (0 allocations: 0 bytes)
8 40.707 ns (0 allocations: 0 bytes)
9 40.808 ns (0 allocations: 0 bytes)
10 39.798 ns (0 allocations: 0 bytes)
11 40.909 ns (0 allocations: 0 bytes)
12 40.404 ns (0 allocations: 0 bytes)
13 42.872 ns (0 allocations: 0 bytes)
14 42.626 ns (0 allocations: 0 bytes)
15 47.972 ns (1 allocation: 16 bytes)
16 46.505 ns (1 allocation: 16 bytes)
17 46.302 ns (1 allocation: 16 bytes)
18 45.390 ns (1 allocation: 16 bytes)
19 47.160 ns (1 allocation: 16 bytes)
20 46.201 ns (1 allocation: 16 bytes)
```
Note that this benchmark essentially gives us the time of a dictionary lookup.
The reason is that `@btime` executes the same expression many times, so
after the first execution the value for each `n` is already stored in
`fib_dict`.
It would be more interesting to see the runtime of `fib2` for some large value
of `n` executed once:
```
julia> @time fib2(100)
0.000018 seconds (107 allocations: 1.672 KiB)
3736710778780434371
julia> @time fib2(200)
0.000025 seconds (204 allocations: 20.453 KiB)
-1123705814761610347
```
As you can see, things are indeed fast. Note that for `n=200` we get a
negative value because of integer overflow.
As a more advanced topic (not covered in the book) it is worth commenting that
`fib2` is not type stable. If we wanted to make it type stable, we would need
to declare the `fib_dict` dictionary as `const`. Here are the code and the
benchmarks (you need to restart Julia to run this test):
```
julia> const fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> @time fib2(100)
0.000014 seconds (6 allocations: 5.828 KiB)
3736710778780434371
julia> @time fib2(200)
0.000011 seconds (3 allocations: 17.312 KiB)
-1123705814761610347
```
As you can see, the code makes fewer allocations and is faster now.
### Exercise 7
Since we are asked to create a vector we can write:
```
julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4]
16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 2)
(x = 1, y = 3)
(x = 1, y = 4)
(x = 2, y = 1)
(x = 2, y = 2)
(x = 2, y = 3)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 2)
(x = 3, y = 3)
(x = 3, y = 4)
(x = 4, y = 1)
(x = 4, y = 2)
(x = 4, y = 3)
(x = 4, y = 4)
```
Note (this is not covered in the book) that you could create a matrix by
changing the syntax a bit:
```
julia> [(x=x, y=y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
Finally, we can use a bit shorter syntax (covered in chapter 14 of the book):
```
julia> [(; x, y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
### Exercise 8
To get help on the `filter` function write `?filter`. Next run:
```
julia> filter(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
```
### Exercise 9
To get help on the `filter!` function write `?filter!`. Next run:
```
julia> filter!(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
julia> v
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
```
Notice that `filter` allocated a new vector, while `filter!` updated the `v`
vector in place.
### Exercise 10
You can use for example the following code:
```
using Statistics
using Plots
rand_cor(n) = cor(rand(n), rand(n))
plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false)
for n in [10, 100, 1000, 10000]]...)
```
Observe that as you increase `n` the dispersion of the correlation coefficient
decreases.
</details>

exercises/exercises05.md Normal file

@@ -0,0 +1,287 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 5
# Problems
### Exercise 1
Create a matrix containing truth table for `&&` and `||` operations.
### Exercise 2
The `issubset` function checks if one collection is a subset of another
collection.
Now take the range `4:6` and check if it is a subset of the ranges `4-k:4+k`
for `k` varying from `1` to `3`. Store the result in a vector.
### Exercise 3
Write a function that accepts two vectors and returns `true` if they have
equal lengths and `false` otherwise.
### Exercise 4
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
Calculate their Adjusted Mutual Information using scikit-learn.
### Exercise 5
Using the Adjusted Mutual Information function from exercise 4, generate
a pair of random vectors of length 100 containing integers from the
range `1:5` and compute their AMI. Repeat this experiment 1000 times and plot
a histogram of the AMI.
Check in the documentation of the `rand` function how you can draw a sample
from a collection of values.
### Exercise 6
Adjust the code from exercise 5 by replacing the first 50 elements of each
vector with zero. Repeat the experiment.
### Exercise 7
Write a function that takes a vector of integer values and returns a
dictionary giving information on how many times each integer is present in the
passed vector.
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
and `v3 = 3:5`.
### Exercise 8
Write code that creates a `Bool` diagonal matrix of size 5x5.
### Exercise 9
Write a code comparing performance of calculation of sum of logarithms of
elements of a vector `1:100` using broadcasting and the `sum` function vs only
the `sum` function taking a function as a first argument.
### Exercise 10
Create a dictionary in which, for each number from `1` to `10`, you store
a vector of its positive divisors. You can check the remainder of the division
of two values using the `rem` function.
Additionally (not covered in the book), you can drop elements
from a comprehension if you add an `if` clause after the `for` clause, for
example to keep only odd numbers from range `1:10` do:
```
julia> [i for i in 1:10 if isodd(i)]
5-element Vector{Int64}:
1
3
5
7
9
```
You can populate a dictionary by passing a vector of pairs to it (not covered in
the book), for example:
```
julia> Dict(["a" => 1, "b" => 2])
Dict{String, Int64} with 2 entries:
"b" => 2
"a" => 1
```
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
You can do it as follows:
```
julia> [true, false] .&& [true false]
2×2 BitMatrix:
1 0
0 0
julia> [true, false] .|| [true false]
2×2 BitMatrix:
1 1
1 0
```
Note that the first array is a vector, while the second array is a 1-row matrix.
### Exercise 2
You can do it like this using broadcasting:
```
julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
3-element BitVector:
0
1
1
```
Note that you need to use `Ref` to protect `4:6` from being broadcasted over.
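As a side note (not covered in the book), compare what happens without `Ref`:
broadcasting then pairs each element of `4:6` with one range, silently testing
element membership instead of set inclusion, because a number iterates as a
one-element collection.
```
# without Ref: element-wise pairing, so this only checks 4 in 3:5, 5 in 2:6, 6 in 1:7
issubset.(4:6, [4-k:4+k for k in 1:3])       # [true, true, true] -- not what we want
# with Ref: the whole range 4:6 is compared against every range
issubset.(Ref(4:6), [4-k:4+k for k in 1:3])  # [false, true, true]
```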
### Exercise 3
This function can be written as follows:
```
equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
```
### Exercise 4
You can do this exercise as follows:
```
julia> using PyCall
julia> metrics = pyimport("sklearn.metrics");
julia> metrics.adjusted_mutual_info_score(x, y)
-0.11111111111111087
julia> metrics.adjusted_mutual_info_score(x, z)
0.7276079390930807
julia> metrics.adjusted_mutual_info_score(y, z)
-0.21267989848846763
```
### Exercise 5
You can create such a plot using the following commands:
```
using Plots
histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100))
for i in 1:1000], label="AMI")
```
You can check that AMI oscillates around 0.
### Exercise 6
This time it is convenient to write a helper function. Note that we use
broadcasting to update values in the vectors.
```
function exampleAMI()
x = rand(1:5, 100)
y = rand(1:5, 100)
x[1:50] .= 0
y[1:50] .= 0
return metrics.adjusted_mutual_info_score(x, y)
end
histogram([exampleAMI() for i in 1:1000], label="AMI")
```
Note that this time AMI is a bit below 0.5, which shows a better match between
vectors.
### Exercise 7
```
julia> function counter(v::AbstractVector{<:Integer})
d = Dict{eltype(v), Int}()
for x in v
if haskey(d, x)
d[x] += 1
else
d[x] = 1
end
end
return d
end
counter (generic function with 1 method)
julia> counter(v1)
Dict{Int64, Int64} with 3 entries:
2 => 2
3 => 3
1 => 1
julia> counter(v2)
Dict{Bool, Int64} with 2 entries:
0 => 1
1 => 1
julia> counter(v3)
Dict{Int64, Int64} with 3 entries:
5 => 1
4 => 1
3 => 1
```
Note that we used the `eltype` function to set a proper key type for
dictionary `d`.
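As a side note (not covered in the book), the `if`/`else` branch can be
replaced by the `get` function with a default value; a minimal sketch of this
common counting idiom (the name `counter2` is illustrative):
```
function counter2(v::AbstractVector{<:Integer})
    d = Dict{eltype(v), Int}()
    for x in v
        # get returns d[x] if the key is present and 0 otherwise
        d[x] = get(d, x, 0) + 1
    end
    return d
end
```
For example, `counter2([1, 2, 3, 2, 3, 3])` produces the same dictionary as
`counter(v1)` above.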
### Exercise 8
This is a way to do it:
```
julia> 1:5 .== (1:5)'
5×5 BitMatrix:
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
```
Using the `LinearAlgebra` module you could also write:
```
julia> using LinearAlgebra
julia> I(5)
5×5 Diagonal{Bool, Vector{Bool}}:
1 ⋅ ⋅ ⋅ ⋅
⋅ 1 ⋅ ⋅ ⋅
⋅ ⋅ 1 ⋅ ⋅
⋅ ⋅ ⋅ 1 ⋅
⋅ ⋅ ⋅ ⋅ 1
```
### Exercise 9
Here is how you can do it:
```
julia> using BenchmarkTools
julia> @btime sum(log.(1:100))
1.620 μs (1 allocation: 896 bytes)
363.7393755555635
julia> @btime sum(log, 1:100)
1.570 μs (0 allocations: 0 bytes)
363.7393755555636
```
As you can see, using the `sum` function with `log` as its first argument
is a bit faster, as it does not allocate.
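As a side note (not covered in the book), `sum(log, 1:100)` is essentially a
`mapreduce`, which makes explicit that the mapping and the reduction are fused
without a temporary array:
```
# log is applied element by element and the results are added on the fly
a = sum(log, 1:100)
b = mapreduce(log, +, 1:100)  # equivalent formulation
```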
### Exercise 10
Here is how you can do it:
```
julia> Dict([i => [j for j in 1:i if rem(i, j) == 0] for i in 1:10])
Dict{Int64, Vector{Int64}} with 10 entries:
5 => [1, 5]
4 => [1, 2, 4]
6 => [1, 2, 3, 6]
7 => [1, 7]
2 => [1, 2]
10 => [1, 2, 5, 10]
9 => [1, 3, 9]
8 => [1, 2, 4, 8]
3 => [1, 3]
1 => [1]
```
</details>
exercises/exercises06.md Normal file
@@ -0,0 +1,271 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 6
# Problems
### Exercise 1
Interpolate the expression `1 + 2` into the string `"I have apples worth 3USD"`
(replace `3` with a proper interpolation expression) and replace `USD` with `$`.
### Exercise 2
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
as `iris.csv` to your local folder.
### Exercise 3
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
in two lines so that it takes less horizontal space.
### Exercise 4
Load data stored in `iris.csv` file into a `data` vector where each element
should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed
as floats).
### Exercise 5
The `data` structure is a vector of named tuples, change it to a named tuple
of vectors (with the same field names) and call it `data2`.
### Exercise 6
Calculate the frequency of each Iris type (`c` field in `data2`).
### Exercise 7
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
Compare sizes of the three objects.
### Exercise 8
You know that `refs` field of `PooledArray` stores an integer index of a given
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
in `data2`, but for each Iris type give a different point color (check the
`color` keyword argument meaning in the Plots.jl manual; you can use the
`plot_color` function).
### Exercise 9
Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
the `str` variable (do not copy-paste the string, but type it).
### Exercise 10
In the `str` string from exercise 9 find all matches of a pattern where `a`
is followed by `b` but there can be some characters between them.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
```
### Exercise 2
Solution:
```
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
"iris.csv")
```
### Exercise 3
Solution:
```
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
```
### Exercise 4
Solution:
```
julia> function line_parser(line)
elements = split(line, ",")
@assert length(elements) == 5
return (sl=parse(Float64, elements[1]),
sw=parse(Float64, elements[2]),
pl=parse(Float64, elements[3]),
pw=parse(Float64, elements[4]),
c=elements[5])
end
line_parser (generic function with 1 method)
julia> data = line_parser.(readlines("iris.csv")[1:end-1])
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
(sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
(sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
(sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
(sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
(sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
(sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")
```
Note that we used the `1:end-1` selector to drop the last element from the read
lines since it is empty. This empty line is also the reason why adding the
`@assert length(elements) == 5` check in the `line_parser` function is useful.
### Exercise 5
Later in the book you will learn more advanced ways to do it. Here let us
use the most basic approach:
```
data2 = (sl=[d.sl for d in data],
sw=[d.sw for d in data],
pl=[d.pl for d in data],
pw=[d.pw for d in data],
c=[d.c for d in data])
```
### Exercise 6
Solution:
```
julia> using FreqTables
julia> freqtable(data2.c)
3-element Named Vector{Int64}
Dim1 │
──────────────────┼───
"Iris-setosa" │ 50
"Iris-versicolor" │ 50
"Iris-virginica" │ 50
```
### Exercise 7
Solution:
```
julia> using InlineStrings
julia> c2 = inlinestrings(data2.c)
150-element Vector{String15}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> using PooledArrays
julia> c3 = PooledArray(data2.c)
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> c4 = Symbol.(data2.c)
150-element Vector{Symbol}:
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
julia> Base.summarysize(data2.c)
12840
julia> Base.summarysize(c2)
2440
julia> Base.summarysize(c3)
1696
julia> Base.summarysize(c4)
1240
```
### Exercise 8
Solution:
```
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
```
### Exercise 9
The hard part is typing `²`, `⟺`, and `∨`. You can check how to do it using help:
```
help?> ²
"²" can be typed by \^2<tab>
help?> ⟺
"⟺" can be typed by \iff<tab>
help?> ∨
"∨" can be typed by \vee<tab>
```
Save the string in the `str` variable as we will use it in the next exercise.
### Exercise 10
The exercise does not specify how the matching should be done. If we
want it to be greedy (match as much as possible), we write:
```
julia> m = match(r"a.*b", str)
RegexMatch("a²=b² ⟺ a=b ∨ a=-b")
```
As you can see, we have matched the whole string.
If we want it to be lazy (match as little as possible) we write:
```
julia> m = match(r"a.*?b", str)
RegexMatch("a²=b")
```
This finds the first such match.
If we want to find all lazy matches we can write (not covered in the book):
```
julia> collect(eachmatch(r"a.*?b", str))
3-element Vector{RegexMatch}:
RegexMatch("a²=b")
RegexMatch("a=b")
RegexMatch("a=-b")
```
</details>
exercises/exercises07.md Normal file
@@ -0,0 +1,314 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 7
# Problems
### Exercise 1
Random.org provides a service that returns random numbers. One of the ways
how you can use it is by sending HTTP GET requests. Here is an example request:
> https://www.random.org/integers/?num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new
If you want to understand all the parameters please check their meaning
[here](https://www.random.org/clients/http/).
For us it is enough that this request generates 10 random integers in the range
from 1 to 6. Run this query in Julia and parse the result.
### Exercise 2
Write a function that tries to parse a string as an integer.
If it succeeds it should return the integer; otherwise it should return `0`
and print an error message.
### Exercise 3
Create a matrix containing truth table for `&&` operation including `missing`.
If some operation errors store `"error"` in the table. As an extra feature (this
is harder so you can skip it) in each cell store both inputs and output to make
reading the table easier.
### Exercise 4
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
missing values in it by the mean of the non-missing values.
### Exercise 5
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
strings stored in it as `Float64`, while keeping `missing` values unchanged.
### Exercise 6
Print to the terminal all days in January 2023 that are Mondays.
### Exercise 7
Compute the dates that are one month later than January 15, 2023, February 15,
2023, March 15, 2023, and April 15, 2023. How many days pass during each such
month? Print the results to the screen.
### Exercise 8
Parse the following string as JSON:
```
str = """
[{"x":1,"y":1},
{"x":2,"y":4},
{"x":3,"y":9},
{"x":4,"y":16},
{"x":5,"y":25}]
"""
```
into a `json` variable.
### Exercise 9
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
that correspond to the fields stored in the JSON structure.
Plot `y` as a function of `x`.
### Exercise 10
Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`.
Use linear interpolation for filling missing values. For the extreme values
use nearest available observation (you will need to consult Impute.jl
documentation to find all required functions).
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution (example run):
```
julia> using HTTP
julia> response = HTTP.get("https://www.random.org/integers/?\
num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new");
julia> parse.(Int, split(String(response.body)))
10-element Vector{Int64}:
6
2
6
3
4
2
5
2
3
6
```
### Exercise 2
Example function:
```
function str2int(s::AbstractString)
try
return parse(Int, s)
catch e
println(e)
end
return 0
end
```
Let us check it:
```
julia> str2int("10")
10
julia> str2int(" -1 ")
-1
julia> str2int("12345678901234567890")
OverflowError("overflow parsing \"12345678901234567890\"")
0
julia> str2int("1.3")
ArgumentError("invalid base 10 digit '.' in \"1.3\"")
0
julia> str2int("a")
ArgumentError("invalid base 10 digit 'a' in \"a\"")
0
```
An alternative solution would use `tryparse` (not covered in the book):
```
function str2int(s::AbstractString)
v = tryparse(Int, s)
if isnothing(v)
println("error while parsing")
return 0
end
return v
end
```
But this time we do not see the cause of the error.
### Exercise 3
Solution:
```
julia> function apply_and(x, y)
try
return "$x && $y = $(x && y)"
catch e
return "$x && $y = error"
end
end
apply_and (generic function with 1 method)
julia> apply_and.([true, false, missing], [true false missing])
3×3 Matrix{String}:
"true && true = true" "true && false = false" "true && missing = missing"
"false && true = false" "false && false = false" "false && missing = false"
"missing && true = error" "missing && false = error" "missing && missing = error"
```
### Exercise 4
Solution:
```
julia> using Statistics
julia> coalesce.(v, mean(skipmissing(v)))
6-element Vector{Float64}:
1.5
2.5
3.5
4.5
5.5
3.5
```
### Exercise 5
Solution:
```
julia> using Missings
julia> passmissing(parse).(Float64, s)
6-element Vector{Union{Missing, Float64}}:
1.5
2.5
missing
4.5
5.5
missing
```
### Exercise 6
Example solution:
```
julia> using Dates
julia> for day in Date.(2023, 01, 1:31)
dayofweek(day) == 1 && println(day)
end
2023-01-02
2023-01-09
2023-01-16
2023-01-23
2023-01-30
```
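An alternative (not covered in the book) is to build the days as a range of
dates and filter it; the `Dates` standard library also defines day-of-week
constants such as `Dates.Monday` that can replace the literal `1`:
```
using Dates

# keep only the days of January 2023 that fall on a Monday
mondays = filter(d -> dayofweek(d) == Dates.Monday,
                 Date(2023, 1, 1):Day(1):Date(2023, 1, 31))
```
The resulting vector holds the same five dates that the loop above prints.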
### Exercise 7
Example solution:
```
julia> for day in Date.(2023, 1:4, 15)
day_next = day + Month(1)
println("$day + 1 month = $day_next (difference: $(day_next - day))")
end
2023-01-15 + 1 month = 2023-02-15 (difference: 31 days)
2023-02-15 + 1 month = 2023-03-15 (difference: 28 days)
2023-03-15 + 1 month = 2023-04-15 (difference: 31 days)
2023-04-15 + 1 month = 2023-05-15 (difference: 30 days)
```
### Exercise 8
Solution:
```
julia> using JSON3
julia> json = JSON3.read(str)
5-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, Vector{UInt64}}:
{
"x": 1,
"y": 1
}
{
"x": 2,
"y": 4
}
{
"x": 3,
"y": 9
}
{
"x": 4,
"y": 16
}
{
"x": 5,
"y": 25
}
```
### Exercise 9
Solution:
```
using Plots
x = [el.x for el in json]
y = [el.y for el in json]
plot(x, y, xlabel="x", ylabel="y", legend=false)
```
### Exercise 10
Solution:
```
julia> using Impute
julia> Impute.nocb!(Impute.locf!(Impute.interp(m)))
8-element Vector{Union{Missing, Int64}}:
1
1
2
3
4
5
6
6
```
Note that we use the `locf!` and `nocb!` functions (with `!`) to perform
operation in place (a new vector was already allocated by `Impute.interp`).
</details>
exercises/exercises08.md Normal file
@@ -0,0 +1,294 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 8
# Problems
### Exercise 1
Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
called `df`.
### Exercise 2
Get number of rows, columns, column names and summary statistics of the
`df` data frame from exercise 1.
### Exercise 3
Make a plot of `number` against `square` columns of `df` data frame.
### Exercise 4
Add a column named `name string` to the `df` data frame containing string
representations of the numbers in column `number`, i.e.
`["one", "two", "three", "four"]`.
### Exercise 5
Check if `df` contains column `square2`.
### Exercise 6
Extract column `number` from `df` and empty it (recall `empty!` function
discussed in chapter 4).
### Exercise 7
In the `Random` module the `randexp` function is defined; it samples numbers
from the exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in the `x` and `y` vectors. Plot histograms of the element-wise maximum of the
sampled pairs and of the sum of vector `x` and half of vector `y`.
### Exercise 8
Using vectors `x` and `y` from exercise 7 create a `df` data frame storing them
along with the element-wise maximum of the sampled pairs and the sum of vector
`x` and half of vector `y`.
Compute all standard descriptive statistics of columns of this data frame.
### Exercise 9
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
Compare the size of created files using the `filesize` function.
### Exercise 10
Write the `df` data frame into SQLite database. Next find information about
tables in this database. Run a query against a table representing the `df` data
frame to calculate the mean of column `x`. Does it match the result we got in
exercise 8?
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
CSV.jl supports reading gzip-compressed files so you can just do:
```
julia> using CSV
julia> using DataFrames
julia> df = CSV.read("example8.csv.gz", DataFrame)
4×2 DataFrame
Row │ number square
│ Int64 Int64
─────┼────────────────
1 │ 1 2
2 │ 2 4
3 │ 3 9
4 │ 4 16
```
You can also do it manually:
```
julia> using CodecZlib # you might need to install this package
julia> compressed = read("example8.csv.gz");
julia> plain = transcode(GzipDecompressor, compressed);
julia> df = CSV.read(plain, DataFrame)
4×2 DataFrame
Row │ number square
│ Int64 Int64
─────┼────────────────
1 │ 1 2
2 │ 2 4
3 │ 3 9
4 │ 4 16
```
### Exercise 2
Solution:
```
julia> nrow(df)
4
julia> ncol(df)
2
julia> names(df)
2-element Vector{String}:
"number"
"square"
julia> describe(df)
2×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Int64 Float64 Int64 Int64 DataType
─────┼──────────────────────────────────────────────────────────────
1 │ number 2.5 1 2.5 4 0 Int64
2 │ square 7.75 2 6.5 16 0 Int64
```
### Exercise 3
Solution:
```
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
```
### Exercise 4
Solution:
```
julia> df."name string" = ["one", "two", "three", "four"]
4-element Vector{String}:
"one"
"two"
"three"
"four"
julia> df
4×3 DataFrame
Row │ number square name string
│ Int64 Int64 String
─────┼─────────────────────────────
1 │ 1 2 one
2 │ 2 4 two
3 │ 3 9 three
4 │ 4 16 four
```
Note that we needed to use a string as we have space in column name.
### Exercise 5
You can use either `hasproperty` or `columnindex`:
```
julia> hasproperty(df, :square2)
false
julia> columnindex(df, :square2)
0
```
Note that if you try to access this column you will get a hint what was the
mistake you most likely made:
```
julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
```
### Exercise 6
Solution:
```
julia> empty!(df[:, :number])
Int64[]
```
Note that you must not do `empty!(df[!, :number])` nor `empty!(df.number)`
as it would corrupt the `df` data frame (these operations do non-copying
extraction of a column from a data frame as opposed to `df[:, :number]`
which makes a copy).
### Exercise 7
Solution:
```
using Random
using Plots
x = randexp(100_000);
y = randexp(100_000);
histogram(x + y / 2, label="mean")
histogram!(max.(x, y), label="maximum")
```
I have put both histograms on the same plot to show that they overlap.
### Exercise 8
Solution (you might get slightly different results because we did not set
the seed of random number generator when creating `x` and `y` vectors):
```
julia> df = DataFrame(x=x, y=y);
julia> df."x+y/2" = x + y / 2;
julia> df."max.(x,y)" = max.(x, y);
julia> describe(df, :all)
4×13 DataFrame
Row │ variable mean std min q25 median q75 max nunique nmissing first last eltype
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64 Nothing Int64 Float64 Float64 DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ x 0.997023 0.999119 3.01389e-6 0.285129 0.68856 1.38414 12.1556 0 0.250502 0.077737 Float64
2 │ y 1.00109 0.995904 2.78828e-6 0.289371 0.6957 1.38491 12.0445 0 0.689659 0.486246 Float64
3 │ x+y/2 1.49757 1.11676 0.00217486 0.688598 1.2235 2.0113 14.2046 0 0.595331 0.32086 Float64
4 │ max.(x,y) 1.49872 1.11295 0.00187844 0.691588 1.22466 2.01257 12.1556 0 0.689659 0.486246 Float64
```
We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary
statistics except `first` and `last` as expected.
### Exercise 9
```
julia> using Arrow
julia> CSV.write("df.csv", df)
"df.csv"
julia> Arrow.write("df.arrow", df)
"df.arrow"
julia> filesize("df.csv")
7587820
julia> filesize("df.arrow")
3200874
```
In this case Apache Arrow file is smaller.
### Exercise 10
```
julia> using SQLite
julia> db = SQLite.DB("df.db")
SQLite.DB("df.db")
julia> SQLite.load!(df, db, "df")
"df"
julia> SQLite.tables(db)
1-element Vector{SQLite.DBTable}:
SQLite.DBTable("df", Tables.Schema:
:x Union{Missing, Float64}
:y Union{Missing, Float64}
Symbol("x+y/2") Union{Missing, Float64}
Symbol("max.(x,y)") Union{Missing, Float64})
julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");
julia> DataFrame(query)
1×1 DataFrame
Row │ AVG(x)
│ Float64
─────┼──────────
1 │ 0.997023
julia> close(db)
```
The computed mean of column `x` is the same as we got in exercise 8.
</details>
exercises/exercises09.md Normal file
@@ -0,0 +1,293 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 9
# Problems
In this problemset we will use the `puzzles.csv` file that was
created in chapter 8. Please first load it into your Julia
session using the commands:
```
using CSV
using DataFrames
puzzles = CSV.read("puzzles.csv", DataFrame);
```
### Exercise 1
Create `matein2` data frame that will have only puzzles that have `"mateIn2"`
in the `Themes` column.
Use the `contains` function (check its documentation first).
### Exercise 2
What is the fraction of puzzles that are mate in 2 in relation to all puzzles
in the `puzzles` data frame?
### Exercise 3
Create a `small` data frame that holds the first 10 rows of the `matein2` data
frame and columns `Rating`, `RatingDeviation`, and `NbPlays`.
### Exercise 4
Iterate rows of `small` data frame and print the ratio of
`RatingDeviation` and `NbPlays` for each row.
### Exercise 5
Get names of columns from the `matein2` data frame that end with `n` (ignore case).
### Exercise 6
Write a function `collatz` that runs the following process. Start with a
positive number `n`. If it is even divide it by two. If it is odd multiply
it by 3 and add one. The function should return the number of steps needed to
reach 1.
Create a `d` dictionary that maps the number of steps needed to the list of
numbers from the range `1:100` that require this number of steps.
### Exercise 7
Using the `d` dictionary make a scatter plot of number of steps required
vs average value of numbers that require this number of steps.
### Exercise 8
Repeat the process from exercises 6 and 7, but this time use a data frame
and try to write an appropriate expression using the `combine` and `groupby`
functions (as it was explained in the last part of chapter 9). This time
perform computations for numbers ranging from one to one million.
### Exercise 9
Set seed of random number generator to `1234`. Draw 100 random points
from the interval `[0, 1]`. Store this vector in a data frame as `x` column.
Now compute `y` column using a formula `4 * (x - 0.5) ^ 2`.
Add random noise to column `y` that has normal distribution with mean 0 and
standard deviation 0.25. Call this column `z`.
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
### Exercise 10
Add a LOESS regression line of `z` as a function of `x` to the figure produced
in exercise 9.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
274135×9 DataFrame
Row │ PuzzleId FEN Moves Rating RatingDeviation Popularity NbPlays Themes GameUrl ⋯
│ String7 String String Int64 Int64 Int64 Int64 String String ⋯
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 000hf r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1… e8f7 e2e6 f7f8 e6f7 1560 76 88 441 mate mateIn2 middlegame short https://li ⋯
2 │ 001Wz 4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P… e8e5 d1d8 e5e8 d8e8 1128 81 87 54 backRankMate endgame mate mateIn… https://li
3 │ 001om 5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1… g4h4 c5f2 g2g3 f2g3 991 78 89 215 mate mateIn2 middlegame short https://li
4 │ 003Tx 2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P… e1e4 f3d2 b1a1 c8c1 1716 77 87 476 backRankMate endgame fork mate m… https://li
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
274132 │ zzxQS 2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn… c8f8 d6d1 e3e1 d1e1 1149 75 96 1722 mate mateIn2 middlegame short https://li ⋯
274133 │ zzxvB 5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3… f6g4 c7f7 f8f7 a7a8 1695 74 95 4857 endgame mate mateIn2 pin sacrifi… https://li
274134 │ zzzRN 4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3… g1h1 e3e1 f7f1 e1f1 830 108 67 31 endgame mate mateIn2 short https://li
274135 │ zzzco 5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P… f7f2 b2c2 c1b1 e2d1 1783 75 90 763 endgame mate mateIn2 queensideAt… https://li
1 column and 274127 rows omitted
```
### Exercise 2
Solution (two ways to do it):
```
julia> using Statistics
julia> nrow(matein2) / nrow(puzzles)
0.12852152542746353
julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
```
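The second approach works because (as a side note, not covered in the book) the
mean of a `Bool` vector is the fraction of `true` values: `true` converts to 1
and `false` to 0 when averaged.
```
using Statistics

# two of four entries are true, so the mean is the share of true values
frac = mean([true, false, false, true])  # 0.5
```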
### Exercise 3
Solution:
```
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10×3 DataFrame
Row │ Rating RatingDeviation NbPlays
│ Int64 Int64 Int64
─────┼──────────────────────────────────
1 │ 1560 76 441
2 │ 1128 81 54
3 │ 991 78 215
4 │ 1716 77 476
5 │ 711 81 111
6 │ 723 86 806
7 │ 754 92 248
8 │ 1177 76 827
9 │ 994 81 71
10 │ 979 144 14
```
### Exercise 4
Solution:
```
julia> for row in eachrow(small)
println(row.RatingDeviation / row.NbPlays)
end
0.17233560090702948
1.5
0.3627906976744186
0.16176470588235295
0.7297297297297297
0.10669975186104218
0.3709677419354839
0.09189842805320435
1.1408450704225352
10.285714285714286
```
### Exercise 5
Solution (several options):
```
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, r"[nN]$")
2-element Vector{String}:
"FEN"
"RatingDeviation"
```
### Exercise 6
Solution:
```
julia> function collatz(n)
i = 0
while n != 1
i += 1
n = iseven(n) ? div(n, 2) : 3 * n + 1
end
return i
end
collatz (generic function with 1 method)
julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()
julia> for n in 1:100
i = collatz(n)
if haskey(d, i)
push!(d[i], n)
else
d[i] = [n]
end
end
julia> d
Dict{Int64, Vector{Int64}} with 45 entries:
5 => [5, 32]
35 => [78, 79]
110 => [82, 83]
30 => [86, 87, 89]
32 => [57, 59]
6 => [10, 64]
115 => [73]
112 => [54, 55]
4 => [16]
13 => [34, 35]
104 => [47]
12 => [17, 96]
23 => [25]
111 => [27]
92 => [91]
11 => [48, 52, 53]
118 => [97]
⋮ => ⋮
```
As we can see even for small `n` the number of steps required to reach `1`
can get quite large.
### Exercise 7
Solution:
```
using Plots
using Statistics
steps = collect(keys(d))
mean_number = mean.(values(d))
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
```
Note that we needed to use `collect` on `keys` as `scatter` expects an array
not just an iterator.
### Exercise 8
Solution:
```
df = DataFrame(n=1:10^6);
df.collatz = collatz.(df.n);
agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
```
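Exercise 8 evaluates `collatz` for a million starting points, and many
trajectories pass through numbers that were already seen. As a side note (not
covered in the book), a memoized variant can reuse those results; the cache
and function names below are illustrative:
```
# cache mapping a number to its known step count; 1 needs 0 steps
const COLLATZ_CACHE = Dict{Int, Int}(1 => 0)

function collatz_memo(n::Int)
    # return a cached step count if we already know it
    haskey(COLLATZ_CACHE, n) && return COLLATZ_CACHE[n]
    next = iseven(n) ? div(n, 2) : 3 * n + 1
    steps = collatz_memo(next) + 1
    COLLATZ_CACHE[n] = steps  # the cache also fills up for intermediate values
    return steps
end
```
For example, `collatz_memo(27)` returns `111`, matching the `111 => [27]` entry
of the dictionary built in exercise 6.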
### Exercise 9
Solution:
```
using Random
Random.seed!(1234)
df = DataFrame(x=rand(100))
df.y = 4 .* (df.x .- 0.5) .^ 2
df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
```
### Exercise 10
Solution:
```
using Loess
model = loess(df.x, df.z);
x_predict = sort(df.x)
z_predict = predict(model, x_predict)
plot!(x_predict, z_predict; label="z predicted")
```
</details>
exercises/exercises10.md Normal file
@@ -0,0 +1,303 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 10
# Problems
### Exercise 1
Generate a random matrix `mat` of size 5x4 with all elements drawn
independently and uniformly from the [0,1[ interval.
Create a data frame using data from this matrix using auto-generated
column names.
### Exercise 2
Now, using matrix `mat` create a data frame with randomly generated
column names. Use the `randstring` function from the `Random` module
to generate them. Store this data frame in `df` variable.
### Exercise 3
Create a new data frame, taking `df` as a source that will have the same
columns but its column names will be `y1`, `y2`, `y3`, `y4`.
### Exercise 4
Create a dictionary holding `column_name => column_vector` pairs
using data stored in data frame `df`. Save this dictionary in variable `d`.
### Exercise 5
Create a data frame back from dictionary `d` from exercise 4. Compare it
with `df`.
### Exercise 6
For data frame `df` compute the dot product between all pairs of its columns.
Use the `dot` function from the `LinearAlgebra` module.
### Exercise 7
Given two data frames:
```
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 101
2 │ 2 102
```
vertically concatenate them so that only columns that are present in both
data frames are kept. Check the documentation of `vcat` to see how to
do it.
### Exercise 8
Now append to `df1` table `df2`, but add only the columns from `df2` that
are present in `df1`. Check the documentation of `append!` to see how to
do it.
### Exercise 9
Create a `circle` data frame, using the `push!` function that will store
1000 samples of the following process:
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
and is `false` otherwise.
Compute summary statistics of this data frame.
### Exercise 10
Create a scatterplot of `circle` data frame where its `x` and `y` axis
will be the plotted points and `inside` variable will determine the color
of the plotted point.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> using DataFrames
julia> mat = rand(5, 4)
5×4 Matrix{Float64}:
0.8386 0.83612 0.0353994 0.15547
0.590172 0.611815 0.0691152 0.915788
0.879395 0.07271 0.980079 0.655158
0.340435 0.756196 0.0697535 0.388578
0.714515 0.861872 0.971521 0.176768
julia> DataFrame(mat, :auto)
5×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
### Exercise 2
Solution:
```
julia> using Random
julia> df = DataFrame(mat, [randstring() for _ in 1:4])
5×4 DataFrame
Row │ 6mTK5evn K8Inf7ER 5Caz55k0 SRiGemsa
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
### Exercise 3
Solution:
```
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
Row │ y1 y2 y3 y4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
You could also use the `rename` function:
```
julia> rename(df, string.("y", 1:4))
5×4 DataFrame
Row │ y1 y2 y3 y4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
### Exercise 4
Solution:
```
julia> d = Dict([n => df[:, n] for n in names(df)])
Dict{String, Vector{Float64}} with 4 entries:
"6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
"5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
"K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
"SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
```
or (using the `pairs` function; note that this time column names are `Symbol`):
```
julia> Dict(pairs(eachcol(df)))
Dict{Symbol, AbstractVector} with 4 entries:
Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
:SRiGemsa => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
:K8Inf7ER => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
```
### Exercise 5
Solution:
```
julia> DataFrame(d)
5×4 DataFrame
Row │ 5Caz55k0 6mTK5evn K8Inf7ER SRiGemsa
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.0353994 0.8386 0.83612 0.15547
2 │ 0.0691152 0.590172 0.611815 0.915788
3 │ 0.980079 0.879395 0.07271 0.655158
4 │ 0.0697535 0.340435 0.756196 0.388578
5 │ 0.971521 0.714515 0.861872 0.176768
```
Note that the columns of the data frame are now sorted by their names.
This is done for `Dict` objects because such dictionaries do not have
a defined order of keys.
### Exercise 6
Solution:
```
julia> using LinearAlgebra
julia> using StatsBase
julia> pairwise(dot, eachcol(df))
4×4 Matrix{Float64}:
2.45132 1.99944 1.65026 1.50558
1.99944 2.39336 1.03322 1.18411
1.65026 1.03322 1.9153 0.909744
1.50558 1.18411 0.909744 1.47431
```
### Exercise 7
Solution:
```
julia> vcat(df1, df2, cols=:intersect)
4×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 1
4 │ 2
```
By default you will get an error:
```
julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
```
### Exercise 8
Solution:
```
julia> append!(df1, df2, cols=:subset)
4×2 DataFrame
Row │ a b
│ Int64 Int64?
─────┼────────────────
1 │ 1 11
2 │ 2 12
3 │ 1 missing
4 │ 2 missing
```
### Exercise 9
Solution:
```
circle=DataFrame()
for _ in 1:1000
x, y = 2rand()-1, 2rand()-1
inside = x^2 + y^2 < 1
push!(circle, (x=x, y=y, inside=inside))
end
describe(circle)
```
We note that the mean of the `inside` variable is approximately π/4 (the ratio of the area of the unit disk to the area of the square), so multiplying it by 4 gives a Monte Carlo estimate of π.
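This can be checked directly with a minimal Monte Carlo sketch (illustrative code, not part of the book's solution):

```
using Statistics

# The square [-1, 1[ x [-1, 1[ has area 4 and the unit disk has area pi,
# so the fraction of uniformly drawn points falling inside the disk
# approximates pi/4.
x = 2 .* rand(100_000) .- 1
y = 2 .* rand(100_000) .- 1
inside = @. x^2 + y^2 < 1
pi_est = 4 * mean(inside)
```

With 100,000 samples the estimate is typically within about 0.01 of π.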
### Exercise 10
Solution:
```
using Plots
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)
```
The solution shows two ways of plotting that ensure that the ratio between the
x and y axes is 1 (a square plot `size` and `aspect_ratio=:equal`). Note the
differences in the produced output between the two methods.
</details>

exercises/exercises11.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 11
# Problems
### Exercise 1
Generate a data frame `df` having one column `x` consisting of 100,000 values
sampled from the uniform distribution on the [0, 1[ interval.
Serialize it to disk, and then deserialize it. Check if the deserialized
object is the same as the source data frame.
### Exercise 2
Add a column `n` to the `df` data frame that in each row will hold the
number of observations in column `x` whose distance from the value stored
in that row of `x` is less than `0.1`.
### Exercise 3
Investigate visually how `n` depends on `x` in the data frame `df`.
### Exercise 4
Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```
Load this data into `testdf` data frame.
### Exercise 5
Check the accuracy of the computed sine of `x` in `testdf`.
Print all rows for which the absolute difference between `sinx` and the exact
`sin(x)` is greater than `5e-7`. For these rows display `x`, `sinx`, the exact
value of `sin(x)`, and the absolute difference.
### Exercise 6
Group the data in the data frame `df` into buckets of width 0.1 and store the
result in the `gdf` grouped data frame (sort the groups). Use the `cut`
function from CategoricalArrays.jl to do it (check its documentation to learn
how). Check the number of values in each group.
### Exercise 7
Display the grouping keys in `gdf` grouped data frame. Show them as named tuples.
Check what would be the group order if you asked not to sort them.
### Exercise 8
Compute average `n` for each group in `gdf`.
### Exercise 9
Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as named tuple having fields `α₀` and `αₓ`.
### Exercise 10
Repeat exercise 9 but using the GLM.jl package. This time
extract the p-value for the slope of estimated coefficient for `x` variable.
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> using DataFrames
julia> df = DataFrame(x=rand(100_000));
julia> using Serialization
julia> serialize("df.bin", df)
julia> deserialize("df.bin") == df
true
```
### Exercise 2
Solution:
A simple approach is:
```
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
```
A more sophisticated approach (faster and allocating less memory) would be:
```
df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
```
An even faster solution that is type stable would use a function barrier:
```
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)
```
Finally, you can work on sorted data to get much better performance. Here is an
example (it is a bit more advanced):
```
function f2(x)
p = sortperm(x)
n = zeros(Int, length(x))
start = 1
stop = 1
idx = 0
while idx < length(x) # you could add @inbounds here but I typically avoid it
idx += 1
while x[p[idx]] - x[p[start]] >= 0.1
start += 1
end
while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
stop += 1
end
n[p[idx]] = stop - start
end
return n
end
df.n = f2(df.x)
```
In this solution the function barrier is even more important,
as we explicitly use loops inside the function.
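As a sanity check, we can confirm that the simple and the sorted implementations agree on random data (an illustrative test, not part of the book's solution; the definitions are copied from above):

```
# Simple O(n^2) counting of observations closer than 0.1.
f(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Faster two-pointer version working on sorted data.
function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x)
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end

x = rand(1_000)
@assert f(x) == f2(x)  # both approaches give identical counts
```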
### Exercise 3
Solution:
```
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
```
As expected, near the borders of the domain the number of neighbors drops.
### Exercise 4
Solution:
```
julia> using CSV
julia> using DataFrames
julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
Row │ x sinx
│ Float64 Float64
─────┼────────────────────
1 │ 0.139279 0.138829
2 │ 0.456779 0.441059
3 │ 0.344034 0.337287
4 │ 0.140253 0.139794
5 │ 0.848344 0.750186
6 │ 0.977512 0.829109
7 │ 0.032737 0.032731
8 │ 0.70275 0.646318
9 │ 0.422339 0.409895
10 │ 0.393878 0.383772
```
### Exercise 5
Solution (since the data frame is small, we can use `eachrow`):
```
julia> for row in eachrow(testdf)
sinx = sin(row.x)
dev = abs(sinx - row.sinx)
dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```
### Exercise 6
Solution:
```
julia> using CategoricalArrays
julia> df.xbins = cut(df.x, 0.0:0.1:1.0);
julia> gdf = groupby(df, :xbins; sort=true);
julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
9872
9976
9968
9943
10063
10173
9977
10076
9908
10044
julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
Row │ xbins nrow
│ Cat… Int64
─────┼───────────────────
1 │ [0.0, 0.1) 9872
2 │ [0.1, 0.2) 9976
3 │ [0.2, 0.3) 9968
4 │ [0.3, 0.4) 9943
5 │ [0.4, 0.5) 10063
6 │ [0.5, 0.6) 10173
7 │ [0.6, 0.7) 9977
8 │ [0.7, 0.8) 10076
9 │ [0.8, 0.9) 9908
10 │ [0.9, 1.0) 10044
```
You might get slightly different numbers, but all should be around 10,000.
### Exercise 7
Solution:
```
julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
(xbins = "[0.0, 0.1)",)
(xbins = "[0.1, 0.2)",)
(xbins = "[0.2, 0.3)",)
(xbins = "[0.3, 0.4)",)
(xbins = "[0.4, 0.5)",)
(xbins = "[0.5, 0.6)",)
(xbins = "[0.6, 0.7)",)
(xbins = "[0.7, 0.8)",)
(xbins = "[0.8, 0.9)",)
(xbins = "[0.9, 1.0)",)
julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
(xbins = "[0.4, 0.5)",)
(xbins = "[0.9, 1.0)",)
(xbins = "[0.8, 0.9)",)
(xbins = "[0.0, 0.1)",)
(xbins = "[0.2, 0.3)",)
(xbins = "[0.5, 0.6)",)
(xbins = "[0.7, 0.8)",)
(xbins = "[0.3, 0.4)",)
(xbins = "[0.1, 0.2)",)
(xbins = "[0.6, 0.7)",)
```
If you pass `sort=false` instead of `sort=true` you get the groups in their
order of appearance in `df`. If you skipped the `sort` keyword argument,
the resulting group order could depend on the type of the grouping column, so
if you rely on the order of the groups, always pass the `sort` keyword argument
explicitly.
### Exercise 8
Solution:
```
julia> using Statistics
julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
14845.847751215559
19835.367882919007
19919.195826645264
19993.023936437694
20105.506111497565
20222.35761329008
20151.794727874112
20022.69610956729
19909.331550262414
14944.511449621665
julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
Row │ xbins n_mean
│ Cat… Float64
─────┼─────────────────────
1 │ [0.0, 0.1) 14845.8
2 │ [0.1, 0.2) 19835.4
3 │ [0.2, 0.3) 19919.2
4 │ [0.3, 0.4) 19993.0
5 │ [0.4, 0.5) 20105.5
6 │ [0.5, 0.6) 20222.4
7 │ [0.6, 0.7) 20151.8
8 │ [0.7, 0.8) 20022.7
9 │ [0.8, 0.9) 19909.3
10 │ [0.9, 1.0) 14944.5
```
### Exercise 9
Solution:
```
julia> function fitmodel(x, n)
X = [ones(length(x)) x]
α₀, αₓ = X \ n
return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
end
fitmodel (generic function with 1 method)
julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
(α₀ = 9900.190310776916, αₓ = 99131.14394200995)
(α₀ = 19823.115188829383, αₓ = 81.66979172871368)
(α₀ = 19812.9822724435, αₓ = 424.00895772216785)
(α₀ = 19810.726510910834, αₓ = 520.6763238983195)
(α₀ = 19437.772385484135, αₓ = 1483.333906139938)
(α₀ = 20187.521449870146, αₓ = 63.30709585406235)
(α₀ = 20424.362332155855, αₓ = -419.42268710601405)
(α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
(α₀ = 20013.690535193662, αₓ = -122.80055110522495)
(α₀ = 109320.55276082881, αₓ = -99305.18846102979)
julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
Row │ xbins α₀ αₓ
│ Cat… Float64 Float64
─────┼────────────────────────────────────────
1 │ [0.0, 0.1) 9900.19 99131.1
2 │ [0.1, 0.2) 19823.1 81.6698
3 │ [0.2, 0.3) 19813.0 424.009
4 │ [0.3, 0.4) 19810.7 520.676
5 │ [0.4, 0.5) 19437.8 1483.33
6 │ [0.5, 0.6) 20187.5 63.3071
7 │ [0.6, 0.7) 20424.4 -419.423
8 │ [0.7, 0.8) 20789.7 -1022.98
9 │ [0.8, 0.9) 20013.7 -122.801
10 │ [0.9, 1.0) 1.09321e5 -99305.2
```
We note that indeed in the first and the last group the regression has a steep
slope (the number of neighbors changes quickly near the borders of the domain).
### Exercise 10
Solution:
```
julia> using GLM
julia> function fitlmmodel(group; info=false)
model = lm(@formula(n~x), group)
coefdf = DataFrame(coeftable(model))
info && @show coefdf # to see how the data frame looks like
α₀, αₓ = coefdf[:, "Coef."]
return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
end
fitlmmodel (generic function with 1 method)
julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 9900.19 0.388607 25476.1 0.0 9899.43 9900.95
2 │ x 99131.1 6.75846 14667.7 0.0 99117.9 99144.4
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19823.1 2.52926 7837.5 0.0 19818.2 19828.1
2 │ x 81.6698 16.5512 4.93436 8.17139e-7 49.226 114.114
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19813.0 2.8427 6969.79 0.0 19807.4 19818.6
2 │ x 424.009 11.2737 37.6106 1.32368e-289 401.91 446.108
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19810.7 3.98478 4971.59 0.0 19802.9 19818.5
2 │ x 520.676 11.3429 45.9033 0.0 498.442 542.911
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19437.8 6.07925 3197.4 0.0 19425.9 19449.7
2 │ x 1483.33 13.4768 110.065 0.0 1456.92 1509.75
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20187.5 9.72795 2075.21 0.0 20168.5 20206.6
2 │ x 63.3071 17.6538 3.58603 0.000337323 28.7022 97.912
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20424.4 10.2201 1998.45 0.0 20404.3 20444.4
2 │ x -419.423 15.7112 -26.6958 1.0356e-151 -450.22 -388.626
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20789.7 9.56063 2174.51 0.0 20771.0 20808.4
2 │ x -1022.98 12.7417 -80.2856 0.0 -1047.95 -998.001
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20013.7 8.86033 2258.8 0.0 19996.3 20031.1
2 │ x -122.801 10.4201 -11.785 7.60822e-32 -143.226 -102.375
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 1.09321e5 5.78343 18902.4 0.0 1.09309e5 1.09332e5
2 │ x -99305.2 6.08269 -16325.9 0.0 -99317.1 -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
(α₀ = 9900.190310776927, αₓ = 99131.1439420097)
(α₀ = 19823.115188829663, αₓ = 81.66979172690417)
(α₀ = 19812.98227244386, αₓ = 424.00895772074136)
(α₀ = 19810.726510911398, αₓ = 520.6763238966264)
(α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
(α₀ = 20187.521449871012, αₓ = 63.307095852511125)
(α₀ = 20424.36233216108, αₓ = -419.4226871140539)
(α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
(α₀ = 20013.69053519897, αₓ = -122.80055111148658)
(α₀ = 109320.55276074051, αₓ = -99305.18846093686)
julia> combine(gdf, fitlmmodel)
10×3 DataFrame
Row │ xbins α₀ αₓ
│ Cat… Float64 Float64
─────┼────────────────────────────────────────
1 │ [0.0, 0.1) 9900.19 99131.1
2 │ [0.1, 0.2) 19823.1 81.6698
3 │ [0.2, 0.3) 19813.0 424.009
4 │ [0.3, 0.4) 19810.7 520.676
5 │ [0.4, 0.5) 19437.8 1483.33
6 │ [0.5, 0.6) 20187.5 63.3071
7 │ [0.6, 0.7) 20424.4 -419.423
8 │ [0.7, 0.8) 20789.7 -1022.98
9 │ [0.8, 0.9) 20013.7 -122.801
10 │ [0.9, 1.0) 1.09321e5 -99305.2
```
We got the same results. The `combine(gdf, fitlmmodel)` style of using
the `combine` function is a bit more advanced and is not covered in the book.
It is used in cases, like the one we have here, when you want to pass
a whole group to the function called by `combine`. Check the DataFrames.jl
documentation for a more detailed explanation.
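For reference, here is a minimal, self-contained sketch of this whole-group style on hypothetical toy data: the function receives each group as a `SubDataFrame` and returns a `NamedTuple` that `combine` turns into a result row:

```
using DataFrames

df = DataFrame(g=[1, 1, 2], v=[1.0, 3.0, 10.0])
# The anonymous function gets a whole SubDataFrame for each group;
# the NamedTuple it returns becomes the columns of the result.
res = combine(groupby(df, :g), sdf -> (vmean=sum(sdf.v) / nrow(sdf),))
```

Here `res` has one row per group, with the grouping column `g` followed by the computed `vmean` column.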
</details>

exercises/exercises12.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 12
# Problems
### Exercise 1
The https://go.dev/dl/go1.19.2.src.tar.gz link contains the source code of
the Go language version 1.19.2. As you can check on its website, its SHA-256
checksum is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`.
Download this file and check if it indeed has this checksum.
You might need to read the documentation of the `string` and `join` functions.
### Exercise 2
Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
that contains the ego-nets of Eastern European users collected from the music
streaming service Deezer in February 2020. Nodes are users and edges are mutual
follower relationships.
From the file extract deezer_edges.json and deezer_target.csv files and
save them to disk.
### Exercise 3
Load deezer_edges.json and deezer_target.csv files to Julia.
The JSON file should be loaded as JSON3.jl object `edges_json`.
The CSV file should be loaded into a data frame `target_df`.
### Exercise 4
Check that keys in the `edges_json` are in the same order as `id` column
in `target_df`.
### Exercise 5
From every value stored in `edges_json` create a graph representing
ego-net of the given node. Store these graphs in a vector that will become the
`egonet` column of the `target_df` data frame.
### Exercise 6
An ego-net in our data set is a subgraph of the full Deezer graph in which, for
some node, all of its neighbors are included, together with all edges between
those neighbors.
Therefore we expect that the diameter of every ego-net is at most 2 (as any
two nodes are either connected directly or through a common friend).
Check if this is indeed the case. Use the `diameter` function.
### Exercise 7
For each ego-net find a central node that is connected to every other node
in this network. Use the `degree` and `findall` functions to achieve this.
Add `center` column with numbers of nodes that are connected to all other
nodes in the ego-net to `target_df` data frame.
Next add a column `center_len` that gives the number of such nodes.
Check how many times different numbers of center nodes are found.
### Exercise 8
Add the following ego-net features to the `target_df` data frame:
* `size`: number of nodes in ego-net
* `mean_degree`: average node degree in ego-net
Check mean values of these two columns by `target` column.
### Exercise 9
Continuing to work with `target_df` data frame create a logistic regression
explaining `target` by `size` and `mean_degree`.
### Exercise 10
Continuing to work with `target_df`, create a scatterplot with `size` on
one axis and `mean_degree` rounded to the nearest integer on the other axis.
Plot the mean of `target` for each point being a combination of `size` and
rounded `mean_degree`.
Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
for values in range from 5% to 95% quantile (to concentrate on typical values
of size).
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
using Downloads
using SHA
Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz")
shavec = open(sha256, "go.tar.gz")
shastr = join(string.(shavec; base=16, pad=2))
sha == shastr
```
The last line should produce `true`.
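As a side note (not required by the exercise), Julia's `Base.bytes2hex` performs the same bytes-to-hex conversion as the `join`/`string` idiom in one call. A small sketch using the standard SHA-256 test vector for the string `"abc"`:

```
using SHA

digest = sha256("abc")  # returns a Vector{UInt8}
# bytes2hex produces a lowercase hex string, equivalent to
# join(string.(digest; base=16, pad=2)).
@assert bytes2hex(digest) ==
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```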
### Exercise 2
Solution:
```
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
import ZipFile
archive = ZipFile.Reader("ego.zip")
idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files))
open("deezer_edges.json", "w") do io
write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files))
open("deezer_target.csv", "w") do io
write(io, read(archive.files[idx]))
end
close(archive)
```
### Exercise 3
Solution:
```
using CSV
using DataFrames
using JSON3
edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)
```
### Exercise 4
Solution (short, but you need to have a good understanding of Julia types
and standard functions to properly write it):
```
Symbol.(target_df.id) == keys(edges_json)
```
### Exercise 5
Solution:
```
using Graphs
function edgelist2graph(edgelist)
nodes = sort!(unique(reduce(vcat, edgelist)))
@assert 0:length(nodes)-1 == nodes
g = SimpleGraph(length(nodes))
for (a, b) in edgelist
add_edge!(g, a+1, b+1)
end
return g
end
target_df.egonet = edgelist2graph.(values(edges_json))
```
### Exercise 6
Solution:
```
julia> extrema(diameter.(target_df.egonet))
(2, 2)
```
Indeed, we see that every ego-net has a diameter of 2 (both the minimum and the maximum are 2).
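The lower bound of 2 is easy to see on a toy example (illustrative, not part of the solution): in a star graph the center is adjacent to every leaf, while any two leaves can reach each other only through the center:

```
using Graphs

g = star_graph(5)  # node 1 is the center, nodes 2-5 are leaves
@assert diameter(g) == 2  # leaves are at distance 2 from each other
```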
### Exercise 7
Solution:
```
target_df.center = map(target_df.egonet) do g
findall(==(nv(g) - 1), degree(g))
end
target_df.center_len = length.(target_df.center)
combine(groupby(target_df, :center_len, sort=true), nrow)
```
Note that we used `map` since in this case it gives a convenient way to express
the condition we want to check.
We notice that in some cases it is impossible to identify the center of the
ego-net uniquely.
### Exercise 8
Solution:
```
using Statistics
target_df.size = nv.(target_df.egonet)
target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size
combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
```
It seems that for `target` equal to `0` both the size and the average degree of
the network are a bit larger.
### Exercise 9
Solution:
```
using GLM
glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
```
We see that only `size` is statistically significant.
### Exercise 10
Solution:
```
using Plots
target_df.round_degree = round.(Int, target_df.mean_degree)
agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean)
scatter(agg_df.size, agg_df.round_degree;
zcolor=agg_df.target_mean,
xlabel="size", ylabel="rounded degree",
label="mean target", xaxis=:log)
```
It is hard to visually see any strong relationship in the data.
```
import Loess
model = Loess.loess(target_df.size, target_df.target)
size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95)
target_predict = Loess.predict(model, size_predict)
plot(size_predict, target_predict;
xlabel="size", ylabel="predicted target", legend=false)
```
Between the 5% and 95% quantiles we see a downward-sloping relationship.
</details>

exercises/exercises13.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 13
# Problems
### Exercise 1
Download
https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip
archive and extract primary_data.csv and secondary_data.csv files from it.
Save the files to disk.
### Exercise 2
Load primary_data.csv into the `primary` data frame.
Load secondary_data.csv into the `secondary` data frame.
Describe the contents of both data frames.
### Exercise 3
Start with `primary` data. Note that columns starting from column 4 have
their data encoded using vector notation, but they have been read in as strings.
Convert these columns to hold proper vectors. Note that some columns have
`missing` values. Most of the columns hold nominal data, but three columns,
i.e. `cap-diameter`, `stem-height`, and `stem-width` have numeric data. These
should be parsed as vectors storing numeric values. After parsing, put these
three columns just after `class` column in the `parsed_primary` data frame.
Check `renamecols` keyword argument of `select` to
avoid renaming of the produced columns.
### Exercise 4
In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be
confused and have a different class, using the information about them we have
(so all information except their family, name, and class).
Use the following rules:
* if for some pair of mushrooms the data in some column for either of them is
`missing` then skip matching on this column; for numeric columns if there
is only one value in a vector then treat it as `missing`;
* otherwise:
- for numeric columns check if there is an overlap in the interval
specified by the min and max values for the range passed;
- for nominal columns check if the intersection of nominal values is nonempty.
For each found pair print to the screen the row number, family, name, and class.
### Exercise 5
Still using `parsed_primary`, find the average probability of the class being
`p` by `family`. Additionally add the number of observations in each group.
Sort these results by the probability. Try using DataFramesMeta.jl to do this
exercise (this requirement is optional).
Store the result in the `agg_primary` data frame.
### Exercise 6
Now using `agg_primary` data frame collapse it so that for each unique `pr_p`
it gives us a total number of rows that had this probability and a tuple
of mushroom family names.
Optionally: try to display the produced table so that the tuple containing the
list of families for each group is not cropped (this will require large
terminal).
### Exercise 7
From our preliminary analysis of `primary` data we see that `missing` value in
the primary data is non-informative, so in `secondary` data we should be
cautious when building a model if we allowed for missing data (in practice
if we were investigating some real mushroom we most likely would know its
characteristics).
Therefore as a first step drop in-place all columns in `secondary` data frame
that have missing values.
### Exercise 8
Create a logistic regression predicting `class` based on all remaining features
in the data frame. You might need to check the `Term` usage in StatsModels.jl
documentation.
You will notice that for `stem-color` and `habitat` columns you get strange
estimation results (large absolute values of estimated parameters and even
larger standard errors). Explain why this happens by analyzing frequency tables
of these variables against `class` column.
### Exercise 9
Add `class_p` column to `secondary` as a second column that will contain
predicted probability from the model created in exercise 8 of a given
observation having class `p`.
Print descriptive statistics of column `class_p` by `class`.
### Exercise 10
Plot FPR-TPR ROC curve for our model and compute associated AUC value.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
write(io, read(archive.files[idx]))
end
close(archive)
```
### Exercise 2
Solution:
```
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)
```
### Exercise 3
Solution:
```
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
idcols,
numericcols .=> ByRow(parse_numeric),
Not([idcols; numericcols]) .=> ByRow(parse_nominal);
renamecols=false)
```
### Exercise 4
Solution:
```
function overlap_numeric(v1, v2)
# there are no missing values in numeric columns
if length(v1) == 1 || length(v2) == 1
return true
else
return max(v1[1], v2[1]) <= min(v1[2], v2[2])
end
end
function overlap_nominal(v1, v2)
if ismissing(v1) || ismissing(v2)
return true
else
return !isempty(intersect(v1, v2))
end
end
function rows_overlap(row1, row2)
# note that in parsed_primary numeric columns have indices 4 to 6
# and nominal columns have indices 7 to 23
return all(i -> overlap_numeric(row1[i], row2[i]), 4:6) &&
all(i -> overlap_nominal(row1[i], row2[i]), 7:23)
end
for i in 1:nrow(parsed_primary), j in i+1:nrow(parsed_primary)
row1 = parsed_primary[i, :]
row2 = parsed_primary[j, :]
if rows_overlap(row1, row2) && row1.class != row2.class
println((i, Tuple(row1[1:3]), j, Tuple(row2[1:3])))
end
end
```
Note that in this exercise using `eachrow` is not a problem
(although it is not type stable) because the data is small.
### Exercise 5
Solution:
```
using Statistics
using DataFramesMeta
agg_primary = @chain parsed_primary begin
groupby(:family)
@combine(:pr_p = mean(:class .== "p"), $nrow)
sort(:pr_p)
end
```
### Exercise 6
Solution:
```
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families), truncate=140)
```
### Exercise 7
Solution:
```
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
```
Note that we select based on actual contents of the columns and not by their
element type (column could allow for missing values but not have them).
### Exercise 8
Solution:
```
using GLM
using FreqTables
secondary.class = secondary.class .== "p"
model = glm(Term(:class)~sum(Term.(Symbol.(names(secondary, Not(:class))))),
secondary, Binomial(), LogitLink())
freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")
```
We can see that for certain levels of the `stem-color` and `habitat` variables
there is a perfect separation of classes.
### Exercise 9
Solution:
```
insertcols!(secondary, 2, :class_p => predict(model))
combine(groupby(secondary, :class)) do sdf
return describe(sdf, :detailed, cols=:class_p)
end
```
We can see that the model has some discriminatory power, but there
is still a significant overlap between classes.
### Exercise 10
Solution:
```
using Plots
using ROCAnalysis
roc_data = roc(secondary; score=:class_p, target=:class)
plot(roc_data.pfa, 1 .- roc_data.pmiss;
title="AUC=$(round(100*(1 - auc(roc_data)), digits=2))%",
xlabel="FPR", ylabel="TPR", legend=false)
```
</details>

exercises/exercises14.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 14
# Problems
### Exercise 1
Write a simulator that takes one parameter `n` defining a set of `n` distinct
items. Next we assume we draw `n` times from this set with replacement. The
simulator should return the number of items from this set that were drawn at
least once. Call the function running this simulation `boot`.
### Exercise 2
Now write a function `simboot` that takes parameters `k` and `n` and runs
the simulation defined in the `boot` function `k` times. It should return
a named tuple storing `k`, `n`, the mean of the produced values, and the ends of
an approximate 95% confidence interval of the result.
Make this function single-threaded. Check how long
this function runs for `k=1000` and `n=1_000_000`.
### Exercise 3
Now rewrite this simulator to be multi-threaded. Use 4 threads for benchmarking.
Call the function `simbootT`. Check how long this function runs for `k=1000` and
`n=1_000_000`.
### Exercise 4
Now rewrite `boot` and `simbootT` to perform fewer allocations. Achieve this by
making sure that all allocated objects are passed to the `boot` function (so that
it does not perform any allocations internally). Call these new functions `boot!`
and `simbootT2`. You might need to use the `Threads.threadid` and
`Threads.nthreads` functions.
### Exercise 5
Use either of the solutions developed in the previous exercises to create a web
service that takes `k` and `n` parameters and returns the values produced by the
simulation along with the time it took to run. You might want to use the
`@timed` macro in your solution.
Start the server.
### Exercise 6
Query the server started in exercise 5 with
the following parameters:
* `k=1000` and `n=1000`
* `k=1.5` and `n=1000`
### Exercise 7
Collect the data generated by the web service into the `df` data frame for
`k = [10^i for i in 3:6]` and `n = [10^i for i in 1:3]`.
### Exercise 8
Replace the `value` column in the `df` data frame with the columns it contains,
in place.
### Exercise 9
Check that the execution time roughly scales proportionally to the product
of `k` and `n`.
### Exercise 10
Plot the expected fraction of seen elements in the set as a function of
`n` for each value of `k`, along with a 95% confidence interval around these
values.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution (there are many other approaches you could use):
```
using Statistics
function boot(n::Integer)
table = falses(n)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
```
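As an optional sanity check we can compare the simulation with the exact
expectation: a given item is missed in one draw with probability `1 - 1/n`, so
the expected fraction of items seen is `1 - (1 - 1/n)^n`, which tends to
`1 - exp(-1)` as `n` grows:
```
n = 1000
1 - (1 - 1/n)^n  # ≈ 0.6323
1 - exp(-1)      # ≈ 0.6321
```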
### Exercise 2
Solution:
```
function simboot(k::Integer, n::Integer)
result = [boot(n) for _ in 1:k]
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
```
We run it twice to make sure everything is compiled:
```
julia> @time simboot(1000, 1_000_000)
7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time)
(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944)
julia> @time simboot(1000, 1_000_000)
7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time)
(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754)
```
We see that on my computer the run time is around 7 seconds.
### Exercise 3
Solution:
```
using ThreadsX
function simbootT(k::Integer, n::Integer)
result = ThreadsX.map(i -> boot(n), 1:k)
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
```
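Before benchmarking, make sure Julia was started with the desired number of
threads (for example with `julia -t 4`); otherwise `ThreadsX.map` will still run
on a single thread:
```
Threads.nthreads()  # should report 4 for the benchmark below
```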
Here is the timing for four threads:
```
julia> @time simbootT(1000, 1_000_000)
2.390795 seconds (3.37 k allocations: 119.434 MiB)
(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484)
julia> @time simbootT(1000, 1_000_000)
2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time)
(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554)
```
Indeed we see a significant performance improvement.
### Exercise 4
Solution:
```
function boot!(n::Integer, pool)
table = pool[Threads.threadid()]
fill!(table, false)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
function simbootT2(k::Integer, n::Integer)
pool = [falses(n) for _ in 1:Threads.nthreads()]
result = ThreadsX.map(i -> boot!(n, pool), 1:k)
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
```
In this solution the `pool` vector stores a separate `table` vector for each
thread. Let us test the timing:
```
julia> @time simbootT2(1000, 1_000_000)
2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)
julia> @time simbootT2(1000, 1_000_000)
2.340694 seconds (391 allocations: 586.453 KiB)
(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)
```
Indeed, we see that the number of allocations decreased, which should lower GC
pressure. However, the runtime of the simulation is similar, since in this task
memory allocation does not account for a significant portion of the runtime.
### Exercise 5
Solution (I used the simplest single-threaded code here; this is the complete
code of the web service):
```
using Genie
using Statistics
function boot(n::Integer)
table = falses(n)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
function simboot(k::Integer, n::Integer)
result = [boot(n) for _ in 1:k]
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
Genie.config.run_as_server = true
Genie.Router.route("/", method=POST) do
message = Genie.Requests.jsonpayload()
return try
k = message["k"]
n = message["n"]
value, time = @timed simboot(k, n)
Genie.Renderer.Json.json((status="OK", time=time, value=value))
catch
Genie.Renderer.Json.json((status="ERROR", time="", value=""))
end
end
Genie.Server.up()
```
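The `try`/`catch` block is important because `simboot` accepts only integers. A
JSON payload such as `k=1.5` is parsed as a `Float64`, so without the `catch`
branch the handler would fail with an unhandled error:
```
simboot(1.5, 1000)  # throws MethodError: no method matching simboot(::Float64, ::Int64)
```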
### Exercise 6
Solution:
```
julia> using HTTP
julia> using JSON3
julia> HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((k=1000, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked
{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}"""
julia> HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((k=1.5, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked
{"status":"ERROR","time":"","value":""}"""
```
As expected, we got a positive answer on the first call and an error on the
second.
### Exercise 7
Solution:
```
using DataFrames
df = DataFrame()
for k in [10^i for i in 3:6], n in [10^i for i in 1:3]
@show k, n
req = HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((; k, n)))
push!(df, NamedTuple(JSON3.read(req.body)))
end
```
Note that I convert the `JSON3.Object` into a `NamedTuple` to easily `push!`
it into the `df` data frame.
Let us have a look at the produced data frame:
```
julia> df
12×3 DataFrame
Row │ status time value
│ String Float64 Object…
─────┼──────────────────────────────────────────────────────
1 │ OK 0.0006784 {\n "k": 1000,\n "n": …
2 │ OK 0.0038374 {\n "k": 1000,\n "n": …
3 │ OK 0.0150844 {\n "k": 1000,\n "n": …
4 │ OK 0.0014071 {\n "k": 10000,\n "n":…
5 │ OK 0.008443 {\n "k": 10000,\n "n":…
6 │ OK 0.0700319 {\n "k": 10000,\n "n":…
7 │ OK 0.0253826 {\n "k": 100000,\n "n"…
8 │ OK 0.0795937 {\n "k": 100000,\n "n"…
9 │ OK 0.708287 {\n "k": 100000,\n "n"…
10 │ OK 0.160286 {\n "k": 1000000,\n "n…
11 │ OK 0.803433 {\n "k": 1000000,\n "n…
12 │ OK 7.23958 {\n "k": 1000000,\n "n…
```
### Exercise 8
Solution:
```
julia> select!(df, :status, :time, :value => AsTable)
12×7 DataFrame
Row │ status time k n mv lo95 hi95
│ String Float64 Int64 Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ OK 0.0006784 1000 10 0.6469 0.640745 0.653055
2 │ OK 0.0038374 1000 100 0.63508 0.633035 0.637125
3 │ OK 0.0150844 1000 1000 0.632178 0.631581 0.632775
4 │ OK 0.0014071 10000 10 0.65239 0.650425 0.654355
5 │ OK 0.008443 10000 100 0.634456 0.633845 0.635067
6 │ OK 0.0700319 10000 1000 0.63207 0.631878 0.632262
7 │ OK 0.0253826 100000 10 0.651411 0.650793 0.652029
8 │ OK 0.0795937 100000 100 0.634 0.633807 0.634193
9 │ OK 0.708287 100000 1000 0.63224 0.632179 0.632302
10 │ OK 0.160286 1000000 10 0.65129 0.651095 0.651486
11 │ OK 0.803433 1000000 100 0.633995 0.633934 0.634056
12 │ OK 7.23958 1000000 1000 0.63232 0.632301 0.63234
```
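The `:value => AsTable` transformation expands a column whose elements have
named fields into separate columns. The same pattern on a minimal, hypothetical
data frame:
```
using DataFrames
toy = DataFrame(id=[1, 2], value=[(a=10, b=20), (a=30, b=40)])
select(toy, :id, :value => AsTable)  # yields columns :id, :a, and :b
```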
### Exercise 9
Solution:
```
julia> using DataFramesMeta
julia> @chain df begin
@rselect(:k, :n, :avg_time = :time / (:k * :n))
unstack(:k, :n, :avg_time)
end
4×4 DataFrame
Row │ k 10 100 1000
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────
1 │ 1000 6.784e-8 3.8374e-8 1.50844e-8
2 │ 10000 1.4071e-8 8.443e-9 7.00319e-9
3 │ 100000 2.53826e-8 7.95937e-9 7.08287e-9
4 │ 1000000 1.60286e-8 8.03433e-9 7.23958e-9
```
We see that this is indeed the case. For large `k` and `n` the average time per
single sample stabilizes (for small values the runtime is low, so the timing is
more affected by external noise, and the other operations that the functions
perform have a bigger impact on the results).
### Exercise 10
Solution:
```
using Plots
gdf = groupby(df, :k, sort=true)
plot([bar(string.(g.n), g.mv;
ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
legend=false, title=first(g.k),
yerror=(g.mv - g.lo95, g.hi95-g.mv)) for g in gdf]...)
```
As expected, the error bandwidth gets smaller as `k` increases.
Note that as `n` increases the estimated value tends to `1-exp(-1)`.
</details>
