add exercises

This commit is contained in:
Bogumił Kamiński
2022-10-14 12:27:04 +02:00
parent 67f17f270f
commit 3b8ffa5d40
17 changed files with 4341 additions and 7 deletions


@@ -5,6 +5,11 @@ This repository contains source codes for the
book that is written by Bogumił Kamiński and is planned to be published in 2022
by [Manning Publications Co.](https://www.manning.com/)
Extras:
* in the `/exercises` folder, for each book chapter, you can find 10 additional
  exercises with solutions (they are meant for self-study and are not discussed
  in the book)
## Setting up your environment
In order to prepare the Julia environment before working with the materials
@@ -71,9 +76,9 @@ The codes for each chapter are stored in files named *chXX.jl*, where *XX* is
chapter number. The exceptions are
* chapter 14, where additionally a separate *ch14_server.jl* is present along
with *ch14.jl* (the reason is that in this chapter we create a web service and
the *ch14_server.jl* contains the server-side code that should be run in a
separate Julia process);
* appendix A, where the file name used is *appA.txt* because it also
contains other instructions than only Julia code (in particular package
manager mode instructions).
@@ -113,8 +118,6 @@ There are the following videos that feature material related to this book:
* [Analysis of GitHub developer graph](https://www.twitch.tv/videos/1527593261)
(a shortened version of material covered in chapter 12)
## Data used in the book
For your convenience I additionally stored data files that we use in this book.
@@ -130,5 +133,3 @@ They are respectively:
<https://snap.stanford.edu/data/github-social.html> under GPL-3.0 License)
* owensboro.zip (for chapter 13, available at The Stanford Open Policing Project
under the Open Data Commons Attribution License)
<!-- markdownlint-disable-file MD033 -->

exercises/README.md Normal file

@@ -0,0 +1,33 @@
# Julia for Data Analysis
This folder contains additional exercises accompanying the book
["Julia for Data Analysis"](https://www.manning.com/books/julia-for-data-analysis?utm_source=bkamins&utm_medium=affiliate&utm_campaign=book_kaminski2_julia_3_17_22)
written by Bogumił Kamiński and planned to be published in 2022
by [Manning Publications Co.](https://www.manning.com/).
The exercises were prepared by [Bogumił Kamiński](https://github.com/bkamins) and [Daniel Kaszyński](https://www.linkedin.com/in/daniel-kaszy%C5%84ski-3a7807113/).
For each book chapter you can find 10 additional exercises with solutions.
The exercises are meant for self study and are not discussed in the book.
The exercises contained in this folder have varying levels of difficulty. The
problems for the first few chapters are meant to be relatively easy, while you
might find the exercises for the later chapters more challenging (if you are
learning Julia I still encourage you to try these exercises and walk through
the solutions, trying to understand them). In particular, in some exercises I
deliberately require using more functionality of the Julia ecosystem than is
covered in the book. This is meant to teach you how to use help and
documentation, as this is a very important skill to master.
The files containing exercises follow the naming convention `exercisesDD.md`,
where `DD` is the number of the book chapter for which the exercises were
prepared.
All the exercises should be solvable using the project environment that is used
throughout the book (please check the global README.md file of this repository
for details).
---
*Preparation of these exercises has been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*
![SGH & NAWA](logo.png)

exercises/example8.csv.gz Normal file (binary file not shown)

exercises/exercises02.md Normal file

@@ -0,0 +1,267 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 2
# Problems
### Exercise 1
Consider the following code:
```
x = [1, 2]
y = x
y[1] = 10
```
What is the value of `x[1]` and why?
### Exercise 2
How can you type `⚡ = 1`? Check whether this operation succeeds and what its result is.
### Exercise 3
What will be the value of variable `x` after running the following code and why?
```
x = 0.0
for i in 1:7_000_000
global x += 1/7
end
x /= 1_000_000
```
### Exercise 4
Express the type `Matrix{Bool}` using `Array` type.
### Exercise 5
Let `x` be a vector. Write code that prints an error message if `x` is empty
(has zero elements).
### Exercise 6
Write a function called `exec` that takes two values `x` and `y` and a function
accepting two arguments, call it `op`, and returns `op(x, y)`. Make `+`
the default value of `op`.
### Exercise 7
Write a function that calculates the sum of the absolute values of the elements
stored in a collection passed to it.
### Exercise 8
Write a function that swaps the first and last elements of an array in place.
### Exercise 9
Write a loop in global scope that calculates the sum of cubes of numbers from
`1` to `10^6`. Next use the `sum` function to perform the same computation.
What is the difference in timing of these operations?
### Exercise 10
Explain the value of the result of summation obtained in exercise 9.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
`x[1]` will be `10` because `y = x` does not copy the data but binds
the same vector to both variables `x` and `y`.
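*Extra* (a small illustration, not part of the original solution): to get an
independent vector instead, use the `copy` function:
```
julia> x = [1, 2];

julia> y = copy(x); # copy allocates a new vector

julia> y[1] = 10;

julia> x[1] # x is unaffected by the mutation of y
1
```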
### Exercise 2
In help mode (activated by `?`) copy-paste `⚡` to get:
```
help?> ⚡
"⚡" can be typed by \:zap:<tab>
```
After the `⚡ = 1` operation a new variable `⚡` is defined and it is bound
to value `1`.
### Exercise 3
`x` will have the value `0.9999999999242748`. This value is below `1.0` because
the representation of `1/7` in the `Float64` type is less than the rational
number 1/7, and the error accumulates when we perform the addition multiple
times.
*Extra*: You can check that the `Float64` representation is indeed a bit less
than the rational 1/7 by increasing the precision of the computations using the
`big` function:
```
julia> big(1/7) # convert Float64 to high-precision float
0.142857142857142849212692681248881854116916656494140625
julia> 1/big(7) # construct high-precision float directly
0.1428571428571428571428571428571428571428571428571428571428571428571428571428568
```
As you can see, there is a difference at the 17th place after the decimal
point, where we have `4` vs `5`.
### Exercise 4
It is `Array{Bool, 2}`. You immediately get this information in REPL:
```
julia> Matrix{Bool}
Matrix{Bool} (alias for Array{Bool, 2})
```
### Exercise 5
You can do it like this:
```
length(x) == 0 && println("x is empty")
```
*Extra*: Typically in such a case one would use the `isempty` function and
throw an exception instead of just printing a message (here I assume that `x`
was passed as an argument to a function):
```
isempty(x) && throw(ArgumentError("x is not allowed to be empty"))
```
### Exercise 6
Here are two ways to define the `exec` function:
```
exec1(x, y, op=+) = op(x, y)
exec2(x, y; op=+) = op(x, y)
```
The first of them uses a positional argument for `op`, and the second a keyword
argument. Here is the difference in how they are called:
```
julia> exec1(2, 3, *)
6
julia> exec2(2, 3; op=*)
6
```
### Exercise 7
Such a function can be written as:
```
sumabs(x) = sum(abs, x)
```
### Exercise 8
This can be written for example as:
```
function swap!(x)
f = x[1]
x[1] = x[end]
x[end] = f
return x
end
```
*Extra*: A more advanced way to write this function would be:
```
function swap!(x)
if length(x) > 1
x[begin], x[end] = x[end], x[begin]
end
return x
end
```
Note the differences in the code:
* we use `begin` instead of `1` to get the first element. This is a safer
practice since some collections in Julia do not use 1-based indexing (in
practice you are not likely to see them, so this comment is most relevant
for package developers)
* if there are `0` or `1` elements in the collection the function does nothing
  (depending on the context we might want to throw an error instead)
* in `x[begin], x[end] = x[end], x[begin]` we perform two assignments at the
  same time to avoid having to use a temporary variable `f` (this operation
  is technically called tuple destructuring; we discuss it in later chapters of
  the book)
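*Extra*: tuple destructuring can also be seen in isolation (a small example,
not part of the original solution):
```
julia> a, b = 1, 2;

julia> a, b = b, a; # two assignments at the same time, no temporary needed

julia> (a, b)
(2, 1)
```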
### Exercise 9
We use the `@time` macro that was introduced in chapter 1.
Version in global scope:
```
julia> s = 0
0
julia> @time for i in 1:10^6
global s += i^3
end
0.076299 seconds (2.00 M allocations: 30.517 MiB, 10.47% gc time)
```
Version with a function using a `sum` function:
```
julia> sum3(n) = sum(x -> x^3, 1:n)
sum3 (generic function with 1 method)
julia> @time sum3(10^6)
0.000012 seconds
-8222430735553051648
```
Version with `sum` function in global scope:
```
julia> @time sum(x -> x^3, 1:10^6)
0.027436 seconds (48.61 k allocations: 2.558 MiB, 99.75% compilation time)
-8222430735553051648
julia> @time sum(x -> x^3, 1:10^6)
0.025744 seconds (48.61 k allocations: 2.557 MiB, 99.76% compilation time)
-8222430735553051648
```
As you can see, using a loop in global scope is inefficient. It leads to
many allocations and slow execution.
Using the `sum3` function leads to the fastest execution. You might ask why
using `sum(x -> x^3, 1:10^6)` in global scope is slower. The reason is that the
anonymous function `x -> x^3` is defined anew each time this expression is
evaluated, which forces recompilation of the `sum` call (but it is still faster
than the loop in global scope).
For reference, check the function with a loop inside it:
```
julia> function sum3loop(n)
s = 0
for i in 1:n
s += i^3
end
return s
end
sum3loop (generic function with 1 method)
julia> @time sum3loop(10^6)
0.001378 seconds
-8222430735553051648
```
This is also much faster than a loop in global scope.
### Exercise 10
In exercise 9 we note that the result is `-8222430735553051648` which is a
negative value, although we are adding cubes of positive values. The
reason of the problem is that operations on integers overflow. If you
are working with numbers larger that can be stored in `Int` type, which is:
```
julia> typemax(Int)
9223372036854775807
```
use `big` numbers that we discussed in *Exercise 3*:
```
julia> @time sum(x -> big(x)^3, 1:10^6)
0.833234 seconds (11.05 M allocations: 236.113 MiB, 23.77% gc time, 2.63% compilation time)
250000500000250000000000
```
Now we get a correct result, at the cost of slower computation.
</details>

exercises/exercises03.md Normal file

@@ -0,0 +1,345 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 3
# Problems
### Exercise 1
Check what methods the `repeat` function has.
Are they all covered in the help for this function?
### Exercise 2
Write a function `fun2` that takes any vector and returns the difference between
the largest and the smallest element in this vector.
### Exercise 3
Generate a vector of one million random numbers from the `[0, 1]` interval.
Check which is the faster way to get the maximum and minimum elements of it.
One option is to use the `maximum` and `minimum` functions; the other is to use
the `extrema` function.
### Exercise 4
Assume you have accidentally typed `+x = 1` when intending to assign `1` to
the variable `x`. What effects can this operation have?
### Exercise 5
What is the result of calling `subtypes` on `Union{Bool, Missing}` and why?
### Exercise 6
Define two identical anonymous functions `x -> x + 1` in global scope. Do they
have the same type?
### Exercise 7
Define the `wrap` function taking one argument `i` and returning the anonymous
function `x -> x + i`. Is the type of such anonymous function the same across
calls to `wrap` function?
### Exercise 8
Write a function that accepts any `Integer` except `Bool` and returns
the passed value. If a `Bool` is passed, an error should be thrown.
### Exercise 9
The `@time` macro measures the time taken to run an expression, prints it,
and returns the value of the expression.
The `@elapsed` macro works differently: it does not print anything but returns
the time taken to evaluate an expression. Use the `@elapsed` macro to see how
long it takes to shuffle a vector of one million floats. Use the `shuffle`
function from the `Random` module.
### Exercise 10
Using the `@btime` macro benchmark the time of calculating the sum of one million
random floats.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Write:
```
julia> methods(repeat)
# 6 methods for generic function "repeat":
[1] repeat(A::AbstractArray; inner, outer) in Base at abstractarraymath.jl:392
[2] repeat(A::AbstractArray, counts...) in Base at abstractarraymath.jl:355
[3] repeat(c::Char, r::Integer) in Base at strings/string.jl:336
[4] repeat(c::AbstractChar, r::Integer) in Base at strings/string.jl:335
[5] repeat(s::Union{SubString{String}, String}, r::Integer) in Base at strings/substring.jl:248
[6] repeat(s::AbstractString, r::Integer) in Base at strings/basic.jl:715
```
Now type `?repeat` and you will see that there are four entries in the help.
The reason is that for `Char` and `AbstractChar`, as well as for
`AbstractString` and `Union{SubString{String}, String}`, there is a single
help entry.
Why do these cases have two methods defined?
The reason is performance. For example, `repeat(c::AbstractChar, r::Integer)`
is a generic method that accepts any character value,
and `repeat(c::Char, r::Integer)` is a faster version
that accepts only values of type `Char` (it is invoked by Julia
when a value of type `Char` is passed as an argument to `repeat`).
### Exercise 2
You can define it as follows:
```
fun2(x::AbstractVector) = maximum(x) - minimum(x)
```
or as follows:
```
function fun2(x::AbstractVector)
lo, hi = extrema(x)
return hi - lo
end
```
Note that these two functions will work with vectors of any elements that
are ordered and support subtraction (they do not have to be numbers).
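For example (assuming one of the `fun2` definitions above is in scope),
characters are ordered and their subtraction yields an integer:
```
julia> fun2(['a', 'd', 'b'])
3
```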
### Exercise 3
Here is a way to compare the performance of both options:
```
julia> using BenchmarkTools
julia> x = rand(10^6);
julia> @btime minimum($x), maximum($x)
860.700 μs (0 allocations: 0 bytes)
(1.489173560242918e-6, 0.9999984347293639)
julia> @btime extrema($x)
2.185 ms (0 allocations: 0 bytes)
(1.489173560242918e-6, 0.9999984347293639)
```
As you can see, in this situation, although `extrema` performs the operation
in a single pass over `x`, it is slower than computing `minimum` and `maximum`
in two passes.
### Exercise 4
If it is a fresh Julia session, you define a new `+` function in `Main` that
shadows the `+` from `Base`:
```
julia> +x=1
+ (generic function with 1 method)
julia> methods(+)
# 1 method for generic function "+":
[1] +(x) in Main at REPL[1]:1
julia> +(10)
1
```
This will also break any further uses of `+` in your programs:
```
julia> 1 + 2
ERROR: MethodError: no method matching +(::Int64, ::Int64)
You may have intended to import Base.:+
Closest candidates are:
+(::Any) at REPL[1]:1
```
If you have used addition earlier in the Julia session, the definition itself
will error instead. Start a fresh Julia session and try:
```
julia> 1 + 2
3
julia> +x=1
ERROR: error in method definition: function Base.+ must be explicitly imported to be extended
```
### Exercise 5
You get an empty vector:
```
julia> subtypes(Union{Bool, Missing})
Type[]
```
The reason is that the `subtypes` function returns subtypes only of explicitly
declared types that have names (the type of such types is `DataType` in Julia),
and a `Union` is not such a type.
*Extra*: For this reason `subtypes` has limited use. To check if one type
is a subtype of another type, use the `<:` operator.
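A quick illustration of the `<:` operator (a small example, not part of the
original solution):
```
julia> Bool <: Integer
true

julia> Missing <: Integer
false

julia> Union{Bool, Missing} <: Union{Integer, Missing}
true
```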
### Exercise 6
No, each of them has a different type:
```
julia> f1 = x -> x + 1
#1 (generic function with 1 method)
julia> f2 = x -> x + 1
#3 (generic function with 1 method)
julia> typeof(f1)
var"#1#2"
julia> typeof(f2)
var"#3#4"
```
This is the reason why a function call like `sum(x -> x^2, 1:10)` in global
scope triggers compilation each time:
```
julia> @time sum(x -> x^2, 1:10)
0.070714 seconds (167.41 k allocations: 8.815 MiB, 14.29% gc time, 93.91% compilation time)
385
julia> @time sum(x -> x^2, 1:10)
0.020971 seconds (47.82 k allocations: 2.529 MiB, 99.75% compilation time)
385
julia> @time sum(x -> x^2, 1:10)
0.021184 seconds (47.81 k allocations: 2.529 MiB, 99.77% compilation time)
385
```
### Exercise 7
Yes, the type is the same:
```
julia> wrap(i) = x -> x + i
wrap (generic function with 1 method)
julia> typeof(wrap(1))
var"#11#12"
julia> typeof(wrap(2))
var"#11#12"
```
Julia defines a new type for such an anonymous function only once.
The consequence of this is that expressions like
`sum(x -> x^i, 1:10)` inside a function, where `i` is an argument of the
function, do not trigger compilation on each call (as opposed to similar
expressions in global scope; see exercise 6).
```
julia> sumi(i) = sum(x -> x^i, 1:10)
sumi (generic function with 1 method)
julia> @time sumi(1)
0.000004 seconds
55
julia> @time sumi(2)
0.000001 seconds
385
julia> @time sumi(3)
0.000003 seconds
3025
```
### Exercise 8
We check subtypes of `Integer`:
```
julia> subtypes(Integer)
3-element Vector{Any}:
Bool
Signed
Unsigned
```
The first way to write such a function is then:
```
fun1(i::Union{Signed, Unsigned}) = i
```
and now we have:
```
julia> fun1(1)
1
julia> fun1(true)
ERROR: MethodError: no method matching fun1(::Bool)
```
The second way is:
```
fun2(i::Integer) = i
fun2(::Bool) = throw(ArgumentError("Bool is not supported"))
```
and now you have:
```
julia> fun2(1)
1
julia> fun2(true)
ERROR: ArgumentError: Bool is not supported
```
### Exercise 9
Here is the code that performs the task:
```
julia> using Random # needed to get access to shuffle
julia> x = rand(10^6); # generate random floats
julia> @elapsed shuffle(x)
0.0518085
julia> @elapsed shuffle(x)
0.01257
julia> @elapsed shuffle(x)
0.012483
```
Note that the first time we run `shuffle` it takes longer due to compilation.
### Exercise 10
The code you can use is:
```
julia> using BenchmarkTools
julia> @btime sum($(rand(10^6)))
155.300 μs (0 allocations: 0 bytes)
500330.6375697419
```
Note that the following:
```
julia> @btime sum(rand(10^6))
1.644 ms (2 allocations: 7.63 MiB)
500266.9457722128
```
would be an incorrect timing, as you would also measure the time of generating
the vector.
Alternatively you can e.g. write:
```
julia> x = rand(10^6);
julia> @btime sum($x)
154.700 μs (0 allocations: 0 bytes)
500151.95875364926
```
</details>

exercises/exercises04.md Normal file

@@ -0,0 +1,536 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 4
# Problems
### Exercise 1
Create a matrix of shape 2x3 containing the numbers from 1 to 6 (fill the
matrix columnwise with consecutive numbers). Next calculate the sum, mean, and
standard deviation of each row and each column of this matrix.
### Exercise 2
For each column of the matrix created in exercise 1 compute its range
(i.e. the difference between maximum and minimum element stored in it).
### Exercise 3
This is data for car speed (mph) and distance taken to stop (ft)
from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
```
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
```
Load this data into Julia (this is part of the exercise) and fit a linear
regression where speed is the feature and distance is the target variable.
### Exercise 4
Plot the data loaded in exercise 3. Additionally plot the fitted regression
line (you need to check the Plots.jl documentation to find a way to do this).
### Exercise 5
A simple code for calculation of Fibonacci numbers for positive
arguments is as follows:
```
fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
```
Using the BenchmarkTools.jl package measure runtime of this function for
`n` ranging from `1` to `20`.
### Exercise 6
Improve the speed of code from exercise 5 by using a dictionary where you
store a mapping of `n` to `fib(n)`. Measure the performance of this function
for the same range of values as in exercise 5.
### Exercise 7
Create a vector containing named tuples representing elements of a 4x4 grid.
So the first element of this vector should be `(x=1, y=1)` and last should be
`(x=4, y=4)`. Store the vector in variable `v`.
### Exercise 8
The `filter` function allows you to select some values of an input collection.
Check its documentation first. Next, use it to keep from the vector `v` from
exercise 7 only elements whose sum is even.
### Exercise 9
Check the documentation of the `filter!` function. Perform the same operation
as asked in exercise 8 but using `filter!`. What is the difference?
### Exercise 10
Write a function that takes a number `n`, generates two independent
random vectors of length `n`, and returns their correlation coefficient.
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
and `10000`.
Create a plot with four histograms of distribution of computed Pearson
correlation coefficient. Check in the Plots.jl package which function can be
used to plot histograms.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Write:
```
julia> using Statistics
julia> mat = [1 3 5
2 4 6]
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> sum(mat, dims=1)
1×3 Matrix{Int64}:
3 7 11
julia> sum(mat, dims=2)
2×1 Matrix{Int64}:
9
12
julia> mean(mat, dims=1)
1×3 Matrix{Float64}:
1.5 3.5 5.5
julia> mean(mat, dims=2)
2×1 Matrix{Float64}:
3.0
4.0
julia> std(mat, dims=1)
1×3 Matrix{Float64}:
0.707107 0.707107 0.707107
julia> std(mat, dims=2)
2×1 Matrix{Float64}:
2.0
2.0
```
Observe that the returned statistics are also stored in matrices.
If we compute them for columns (`dims=1`) then the produced matrix has one row.
If we compute them for rows (`dims=2`) then the produced matrix has one column.
### Exercise 2
Here are some ways you can do it:
```
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
3-element Vector{Int64}:
1
1
1
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
3-element Vector{Int64}:
1
1
1
```
Observe that when we use `eachcol` the produced result is a vector (not a
matrix as in exercise 1).
### Exercise 3
First create a matrix with the source data by copy-pasting it from the
exercise like this:
```
data = [
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34
13 34
13 46
14 26
14 36
14 60
14 80
15 20
15 26
15 54
16 32
16 40
17 32
17 40
17 50
18 42
18 56
18 76
18 84
19 36
19 46
19 68
20 32
20 48
20 52
20 56
20 64
22 66
23 54
24 70
24 92
24 93
24 120
25 85
]
```
Now use the GLM.jl package to fit the model:
```
julia> using GLM
julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
distance ~ 1 + speed
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -17.5791 6.75844 -2.60 0.0123 -31.1678 -3.99034
speed 3.93241 0.415513 9.46 <1e-11 3.09696 4.76785
─────────────────────────────────────────────────────────────────────────
```
You can get the same estimates using the `\` operator like this:
```
julia> [ones(50) data[:, 1]] \ data[:, 2]
2-element Vector{Float64}:
-17.579094890510966
3.9324087591240877
```
### Exercise 4
Run the following:
```
using Plots
scatter(data[:, 1], data[:, 2];
xlab="speed", ylab="distance", legend=false, smooth=true)
```
The `smooth=true` keyword argument adds the linear regression line to the plot.
### Exercise 5
Use the following code:
```
julia> using BenchmarkTools
julia> for i in 1:20
print(i, " ")
@btime fib($i)
end
1 2.500 ns (0 allocations: 0 bytes)
2 2.700 ns (0 allocations: 0 bytes)
3 4.800 ns (0 allocations: 0 bytes)
4 7.500 ns (0 allocations: 0 bytes)
5 12.112 ns (0 allocations: 0 bytes)
6 19.980 ns (0 allocations: 0 bytes)
7 32.125 ns (0 allocations: 0 bytes)
8 52.696 ns (0 allocations: 0 bytes)
9 85.010 ns (0 allocations: 0 bytes)
10 140.311 ns (0 allocations: 0 bytes)
11 222.177 ns (0 allocations: 0 bytes)
12 359.903 ns (0 allocations: 0 bytes)
13 582.123 ns (0 allocations: 0 bytes)
14 1.000 μs (0 allocations: 0 bytes)
15 1.560 μs (0 allocations: 0 bytes)
16 2.522 μs (0 allocations: 0 bytes)
17 4.000 μs (0 allocations: 0 bytes)
18 6.600 μs (0 allocations: 0 bytes)
19 11.400 μs (0 allocations: 0 bytes)
20 18.100 μs (0 allocations: 0 bytes)
```
Notice that the execution time for number `n` is roughly the sum of the
execution times for numbers `n-1` and `n-2`.
### Exercise 6
Use the following code:
```
julia> fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> for i in 1:20
print(i, " ")
@btime fib2($i)
end
1 40.808 ns (0 allocations: 0 bytes)
2 40.101 ns (0 allocations: 0 bytes)
3 40.101 ns (0 allocations: 0 bytes)
4 40.707 ns (0 allocations: 0 bytes)
5 42.727 ns (0 allocations: 0 bytes)
6 40.909 ns (0 allocations: 0 bytes)
7 40.404 ns (0 allocations: 0 bytes)
8 40.707 ns (0 allocations: 0 bytes)
9 40.808 ns (0 allocations: 0 bytes)
10 39.798 ns (0 allocations: 0 bytes)
11 40.909 ns (0 allocations: 0 bytes)
12 40.404 ns (0 allocations: 0 bytes)
13 42.872 ns (0 allocations: 0 bytes)
14 42.626 ns (0 allocations: 0 bytes)
15 47.972 ns (1 allocation: 16 bytes)
16 46.505 ns (1 allocation: 16 bytes)
17 46.302 ns (1 allocation: 16 bytes)
18 45.390 ns (1 allocation: 16 bytes)
19 47.160 ns (1 allocation: 16 bytes)
20 46.201 ns (1 allocation: 16 bytes)
```
Note that this benchmark essentially gives us the time of a dictionary lookup.
The reason is that `@btime` executes the same expression many times, so
after the first execution the value for each `n` is already stored in
`fib_dict`.
It would be more interesting to see the runtime of `fib2` for some large value
of `n` executed once:
```
julia> @time fib2(100)
0.000018 seconds (107 allocations: 1.672 KiB)
3736710778780434371
julia> @time fib2(200)
0.000025 seconds (204 allocations: 20.453 KiB)
-1123705814761610347
```
As you can see, things are indeed fast. Note that for `n=200` we get a
negative value because of integer overflow.
As a more advanced topic (not covered in the book) it is worth commenting that
`fib2` is not type stable. If we wanted to make it type stable, we would need
to declare the `fib_dict` dictionary as `const`. Here are the code and the
benchmarks (you need to restart Julia to run this test):
```
julia> const fib_dict = Dict{Int, Int}()
Dict{Int64, Int64}()
julia> function fib2(n)
haskey(fib_dict, n) && return fib_dict[n]
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
fib_dict[n] = fib_n
return fib_n
end
fib2 (generic function with 1 method)
julia> @time fib2(100)
0.000014 seconds (6 allocations: 5.828 KiB)
3736710778780434371
julia> @time fib2(200)
0.000011 seconds (3 allocations: 17.312 KiB)
-1123705814761610347
```
As you can see, the code makes fewer allocations and is faster now.
### Exercise 7
Since we are asked to create a vector we can write:
```
julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4]
16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 2)
(x = 1, y = 3)
(x = 1, y = 4)
(x = 2, y = 1)
(x = 2, y = 2)
(x = 2, y = 3)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 2)
(x = 3, y = 3)
(x = 3, y = 4)
(x = 4, y = 1)
(x = 4, y = 2)
(x = 4, y = 3)
(x = 4, y = 4)
```
Note (this is not covered in the book) that you could create a matrix by
changing the syntax a bit:
```
julia> [(x=x, y=y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
Finally, we can use a bit shorter syntax (covered in chapter 14 of the book):
```
julia> [(; x, y) for x in 1:4, y in 1:4]
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
### Exercise 8
To get help on the `filter` function write `?filter`. Next run:
```
julia> filter(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
```
### Exercise 9
To get help on the `filter!` function write `?filter!`. Next run:
```
julia> filter!(e -> iseven(e.x + e.y), v)
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
julia> v
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 1)
(x = 1, y = 3)
(x = 2, y = 2)
(x = 2, y = 4)
(x = 3, y = 1)
(x = 3, y = 3)
(x = 4, y = 2)
(x = 4, y = 4)
```
Notice that `filter` allocated a new vector, while `filter!` updated the `v`
vector in place.
### Exercise 10
You can use for example the following code:
```
using Statistics
using Plots
rand_cor(n) = cor(rand(n), rand(n))
plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false)
for n in [10, 100, 1000, 10000]]...)
```
Observe that as you increase `n` the dispersion of the correlation coefficient
decreases.
</details>

exercises/exercises05.md Normal file

@@ -0,0 +1,287 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 5
# Problems
### Exercise 1
Create a matrix containing truth table for `&&` and `||` operations.
### Exercise 2
The `issubset` function checks if one collection is a subset of another
collection.
Now take the range `4:6` and check if it is a subset of the ranges `4-k:4+k`
for `k` varying from `1` to `3`. Store the result in a vector.
### Exercise 3
Write a function that accepts two vectors and returns `true` if they have
equal lengths and `false` otherwise.
### Exercise 4
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
Calculate their Adjusted Mutual Information using scikit-learn.
### Exercise 5
Using the Adjusted Mutual Information function from exercise 4, generate
a pair of random vectors of length 100 containing integers from the
range `1:5` and compute their AMI. Repeat this experiment 1000 times and plot
a histogram of the AMI.
Check in the documentation of the `rand` function how you can draw a sample
from a collection of values.
### Exercise 6
Adjust the code from exercise 5 by replacing the first 50 elements of each
vector with zero. Repeat the experiment.
### Exercise 7
Write a function that takes a vector of integer values and returns a
dictionary giving information on how many times each integer is present in the
passed vector.
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
and `v3 = 3:5`.
### Exercise 8
Write code that creates a `Bool` diagonal matrix of size 5x5.
### Exercise 9
Write a code comparing performance of calculation of sum of logarithms of
elements of a vector `1:100` using broadcasting and the `sum` function vs only
the `sum` function taking a function as a first argument.
### Exercise 10
Create a dictionary in which, for each number from `1` to `10`, you store
a vector of its positive divisors. You can check the remainder of the division
of two values using the `rem` function.
Additionally (not covered in the book), you can drop elements
from a comprehension if you add an `if` clause after the `for` clause, for
example to keep only odd numbers from range `1:10` do:
```
julia> [i for i in 1:10 if isodd(i)]
5-element Vector{Int64}:
1
3
5
7
9
```
You can populate a dictionary by passing a vector of pairs to it (not covered in
the book), for example:
```
julia> Dict(["a" => 1, "b" => 2])
Dict{String, Int64} with 2 entries:
"b" => 2
"a" => 1
```
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
You can do it as follows:
```
julia> [true, false] .&& [true false]
2×2 BitMatrix:
1 0
0 0
julia> [true, false] .|| [true false]
2×2 BitMatrix:
1 1
1 0
```
Note that the first array is a vector, while the second array is a 1-row matrix.
### Exercise 2
You can do it like this using broadcasting:
```
julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
3-element BitVector:
0
1
1
```
Note that you need to use `Ref` to protect `4:6` from being broadcasted over.
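As a side note (not covered in the book), compare what happens without `Ref`:
broadcasting then pairs each element of `4:6` with one range, silently testing
element membership instead of set inclusion, because a number iterates as a
one-element collection.
```
# without Ref: element-wise pairing, so this only checks 4 in 3:5, 5 in 2:6, 6 in 1:7
issubset.(4:6, [4-k:4+k for k in 1:3])       # [true, true, true] -- not what we want
# with Ref: the whole range 4:6 is compared against every range
issubset.(Ref(4:6), [4-k:4+k for k in 1:3])  # [false, true, true]
```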
### Exercise 3
This function can be written as follows:
```
equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
```
### Exercise 4
You can do this exercise as follows:
```
julia> using PyCall
julia> metrics = pyimport("sklearn.metrics");
julia> metrics.adjusted_mutual_info_score(x, y)
-0.11111111111111087
julia> metrics.adjusted_mutual_info_score(x, z)
0.7276079390930807
julia> metrics.adjusted_mutual_info_score(y, z)
-0.21267989848846763
```
### Exercise 5
You can create such a plot using the following commands:
```
using Plots
histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100))
for i in 1:1000], label="AMI")
```
You can check that AMI oscillates around 0.
### Exercise 6
This time it is convenient to write a helper function. Note that we use
broadcasting to update values in the vectors.
```
function exampleAMI()
x = rand(1:5, 100)
y = rand(1:5, 100)
x[1:50] .= 0
y[1:50] .= 0
return metrics.adjusted_mutual_info_score(x, y)
end
histogram([exampleAMI() for i in 1:1000], label="AMI")
```
Note that this time AMI is a bit below 0.5, which shows a better match between
vectors.
### Exercise 7
```
julia> function counter(v::AbstractVector{<:Integer})
d = Dict{eltype(v), Int}()
for x in v
if haskey(d, x)
d[x] += 1
else
d[x] = 1
end
end
return d
end
counter (generic function with 1 method)
julia> counter(v1)
Dict{Int64, Int64} with 3 entries:
2 => 2
3 => 3
1 => 1
julia> counter(v2)
Dict{Bool, Int64} with 2 entries:
0 => 1
1 => 1
julia> counter(v3)
Dict{Int64, Int64} with 3 entries:
5 => 1
4 => 1
3 => 1
```
Note that we used the `eltype` function to set a proper key type for
dictionary `d`.
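As a side note (not covered in the book), the `if`/`else` branch can be
replaced by the `get` function with a default value; a minimal sketch of this
common counting idiom (the name `counter2` is illustrative):
```
function counter2(v::AbstractVector{<:Integer})
    d = Dict{eltype(v), Int}()
    for x in v
        # get returns d[x] if the key is present and 0 otherwise
        d[x] = get(d, x, 0) + 1
    end
    return d
end
```
For example, `counter2([1, 2, 3, 2, 3, 3])` produces the same dictionary as
`counter(v1)` above.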
### Exercise 8
This is a way to do it:
```
julia> 1:5 .== (1:5)'
5×5 BitMatrix:
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
```
Using the `LinearAlgebra` module you could also write:
```
julia> using LinearAlgebra
julia> I(5)
5×5 Diagonal{Bool, Vector{Bool}}:
1 ⋅ ⋅ ⋅ ⋅
⋅ 1 ⋅ ⋅ ⋅
⋅ ⋅ 1 ⋅ ⋅
⋅ ⋅ ⋅ 1 ⋅
⋅ ⋅ ⋅ ⋅ 1
```
### Exercise 9
Here is how you can do it:
```
julia> using BenchmarkTools
julia> @btime sum(log.(1:100))
1.620 μs (1 allocation: 896 bytes)
363.7393755555635
julia> @btime sum(log, 1:100)
1.570 μs (0 allocations: 0 bytes)
363.7393755555636
```
As you can see, using the `sum` function with `log` as its first argument
is a bit faster, as it does not allocate.
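As a side note (not covered in the book), `sum(log, 1:100)` is essentially a
`mapreduce`, which makes explicit that the mapping and the reduction are fused
without a temporary array:
```
# log is applied element by element and the results are added on the fly
a = sum(log, 1:100)
b = mapreduce(log, +, 1:100)  # equivalent formulation
```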
### Exercise 10
Here is how you can do it:
```
julia> Dict([i => [j for j in 1:i if rem(i, j) == 0] for i in 1:10])
Dict{Int64, Vector{Int64}} with 10 entries:
5 => [1, 5]
4 => [1, 2, 4]
6 => [1, 2, 3, 6]
7 => [1, 7]
2 => [1, 2]
10 => [1, 2, 5, 10]
9 => [1, 3, 9]
8 => [1, 2, 4, 8]
3 => [1, 3]
1 => [1]
```
</details>
exercises/exercises06.md Normal file
@@ -0,0 +1,271 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 6
# Problems
### Exercise 1
Interpolate the expression `1 + 2` into the string `"I have apples worth 3USD"`
(replace `3` with a proper interpolation expression) and replace `USD` with `$`.
### Exercise 2
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
as `iris.csv` to your local folder.
### Exercise 3
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
in two lines so that it takes less horizontal space.
### Exercise 4
Load data stored in `iris.csv` file into a `data` vector where each element
should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed
as floats).
### Exercise 5
The `data` structure is a vector of named tuples, change it to a named tuple
of vectors (with the same field names) and call it `data2`.
### Exercise 6
Calculate the frequency of each Iris type (`c` field in `data2`).
### Exercise 7
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
Compare sizes of the three objects.
### Exercise 8
You know that `refs` field of `PooledArray` stores an integer index of a given
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
in `data2`, but for each Iris type give a different point color (check the
`color` keyword argument meaning in the Plots.jl manual; you can use the
`plot_color` function).
### Exercise 9
Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
the `str` variable (do not copy-paste the string, but type it).
### Exercise 10
In the `str` string from exercise 9 find all matches of a pattern where `a`
is followed by `b` but there can be some characters between them.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
```
### Exercise 2
Solution:
```
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
"iris.csv")
```
### Exercise 3
Solution:
```
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
```
### Exercise 4
Solution:
```
julia> function line_parser(line)
elements = split(line, ",")
@assert length(elements) == 5
return (sl=parse(Float64, elements[1]),
sw=parse(Float64, elements[2]),
pl=parse(Float64, elements[3]),
pw=parse(Float64, elements[4]),
c=elements[5])
end
line_parser (generic function with 1 method)
julia> data = line_parser.(readlines("iris.csv")[1:end-1])
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
(sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
(sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
(sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
(sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
(sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
(sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")
```
Note that we used the `1:end-1` selector to drop the last element from the read
lines since it is empty. This empty line is also the reason why adding the
`@assert length(elements) == 5` check in the `line_parser` function is useful.
### Exercise 5
Later in the book you will learn more advanced ways to do it. Here let us
use the most basic approach:
```
data2 = (sl=[d.sl for d in data],
sw=[d.sw for d in data],
pl=[d.pl for d in data],
pw=[d.pw for d in data],
c=[d.c for d in data])
```
### Exercise 6
Solution:
```
julia> using FreqTables
julia> freqtable(data2.c)
3-element Named Vector{Int64}
Dim1 │
──────────────────┼───
"Iris-setosa" │ 50
"Iris-versicolor" │ 50
"Iris-virginica" │ 50
```
### Exercise 7
Solution:
```
julia> using InlineStrings
julia> c2 = inlinestrings(data2.c)
150-element Vector{String15}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> using PooledArrays
julia> c3 = PooledArray(data2.c)
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> c4 = Symbol.(data2.c)
150-element Vector{Symbol}:
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
julia> Base.summarysize(data2.c)
12840
julia> Base.summarysize(c2)
2440
julia> Base.summarysize(c3)
1696
julia> Base.summarysize(c4)
1240
```
### Exercise 8
Solution:
```
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
```
### Exercise 9
The hard part is typing `²`, `⟺`, and `∨`. You can check how to do it using help:
```
help?> ²
"²" can be typed by \^2<tab>
help?> ⟺
"⟺" can be typed by \iff<tab>
help?> ∨
"∨" can be typed by \vee<tab>
```
Save the string in the `str` variable as we will use it in the next exercise.
### Exercise 10
The exercise does not specify how the matching should be done. If we
want it to be greedy (match as much as possible), we write:
```
julia> m = match(r"a.*b", str)
RegexMatch("a²=b² ⟺ a=b ∨ a=-b")
```
As you can see, we have matched the whole string.
If we want it to be lazy (match as little as possible) we write:
```
julia> m = match(r"a.*?b", str)
RegexMatch("a²=b")
```
This finds the first such match.
If we want to find all lazy matches we can write (not covered in the book):
```
julia> collect(eachmatch(r"a.*?b", str))
3-element Vector{RegexMatch}:
RegexMatch("a²=b")
RegexMatch("a=b")
RegexMatch("a=-b")
```
</details>
exercises/exercises07.md Normal file
@@ -0,0 +1,314 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 7
# Problems
### Exercise 1
Random.org provides a service that returns random numbers. One of the ways
how you can use it is by sending HTTP GET requests. Here is an example request:
> https://www.random.org/integers/?num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new
If you want to understand all the parameters please check their meaning
[here](https://www.random.org/clients/http/).
For us it is enough that this request generates 10 random integers in the range
from 1 to 6. Run this query in Julia and parse the result.
### Exercise 2
Write a function that tries to parse a string as an integer.
If it succeeds it should return the integer; otherwise it should return `0`
and print an error message.
### Exercise 3
Create a matrix containing truth table for `&&` operation including `missing`.
If some operation errors store `"error"` in the table. As an extra feature (this
is harder so you can skip it) in each cell store both inputs and output to make
reading the table easier.
### Exercise 4
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
missing values in it by the mean of the non-missing values.
### Exercise 5
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
strings stored in it as `Float64`, while keeping `missing` values unchanged.
### Exercise 6
Print to the terminal all days in January 2023 that are Mondays.
### Exercise 7
Compute the dates that are one month later than January 15, 2023, February 15,
2023, March 15, 2023, and April 15, 2023. How many days pass during each such
month? Print the results to the screen.
### Exercise 8
Parse the following string as JSON:
```
str = """
[{"x":1,"y":1},
{"x":2,"y":4},
{"x":3,"y":9},
{"x":4,"y":16},
{"x":5,"y":25}]
"""
```
into a `json` variable.
### Exercise 9
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
that correspond to the fields stored in the JSON structure.
Plot `y` as a function of `x`.
### Exercise 10
Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`.
Use linear interpolation for filling missing values. For the extreme values
use nearest available observation (you will need to consult Impute.jl
documentation to find all required functions).
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution (example run):
```
julia> using HTTP
julia> response = HTTP.get("https://www.random.org/integers/?\
num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new");
julia> parse.(Int, split(String(response.body)))
10-element Vector{Int64}:
6
2
6
3
4
2
5
2
3
6
```
### Exercise 2
Example function:
```
function str2int(s::AbstractString)
try
return parse(Int, s)
catch e
println(e)
end
return 0
end
```
Let us check it:
```
julia> str2int("10")
10
julia> str2int(" -1 ")
-1
julia> str2int("12345678901234567890")
OverflowError("overflow parsing \"12345678901234567890\"")
0
julia> str2int("1.3")
ArgumentError("invalid base 10 digit '.' in \"1.3\"")
0
julia> str2int("a")
ArgumentError("invalid base 10 digit 'a' in \"a\"")
0
```
An alternative solution would use `tryparse` (not covered in the book):
```
function str2int(s::AbstractString)
v = tryparse(Int, s)
if isnothing(v)
println("error while parsing")
return 0
end
return v
end
```
But this time we do not see the cause of the error.
### Exercise 3
Solution:
```
julia> function apply_and(x, y)
try
return "$x && $y = $(x && y)"
catch e
return "$x && $y = error"
end
end
apply_and (generic function with 1 method)
julia> apply_and.([true, false, missing], [true false missing])
3×3 Matrix{String}:
"true && true = true" "true && false = false" "true && missing = missing"
"false && true = false" "false && false = false" "false && missing = false"
"missing && true = error" "missing && false = error" "missing && missing = error"
```
### Exercise 4
Solution:
```
julia> using Statistics
julia> coalesce.(v, mean(skipmissing(v)))
6-element Vector{Float64}:
1.5
2.5
3.5
4.5
5.5
3.5
```
### Exercise 5
Solution:
```
julia> using Missings
julia> passmissing(parse).(Float64, s)
6-element Vector{Union{Missing, Float64}}:
1.5
2.5
missing
4.5
5.5
missing
```
### Exercise 6
Example solution:
```
julia> using Dates
julia> for day in Date.(2023, 01, 1:31)
dayofweek(day) == 1 && println(day)
end
2023-01-02
2023-01-09
2023-01-16
2023-01-23
2023-01-30
```
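An alternative (not covered in the book) is to build the days as a range of
dates and filter it; the `Dates` standard library also defines day-of-week
constants such as `Dates.Monday` that can replace the literal `1`:
```
using Dates

# keep only the days of January 2023 that fall on a Monday
mondays = filter(d -> dayofweek(d) == Dates.Monday,
                 Date(2023, 1, 1):Day(1):Date(2023, 1, 31))
```
The resulting vector holds the same five dates that the loop above prints.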
### Exercise 7
Example solution:
```
julia> for day in Date.(2023, 1:4, 15)
day_next = day + Month(1)
println("$day + 1 month = $day_next (difference: $(day_next - day))")
end
2023-01-15 + 1 month = 2023-02-15 (difference: 31 days)
2023-02-15 + 1 month = 2023-03-15 (difference: 28 days)
2023-03-15 + 1 month = 2023-04-15 (difference: 31 days)
2023-04-15 + 1 month = 2023-05-15 (difference: 30 days)
```
### Exercise 8
Solution:
```
julia> using JSON3
julia> json = JSON3.read(str)
5-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, Vector{UInt64}}:
{
"x": 1,
"y": 1
}
{
"x": 2,
"y": 4
}
{
"x": 3,
"y": 9
}
{
"x": 4,
"y": 16
}
{
"x": 5,
"y": 25
}
```
### Exercise 9
Solution:
```
using Plots
x = [el.x for el in json]
y = [el.y for el in json]
plot(x, y, xlabel="x", ylabel="y", legend=false)
```
### Exercise 10
Solution:
```
julia> using Impute
julia> Impute.nocb!(Impute.locf!(Impute.interp(m)))
8-element Vector{Union{Missing, Int64}}:
1
1
2
3
4
5
6
6
```
Note that we use the `locf!` and `nocb!` functions (with `!`) to perform
operation in place (a new vector was already allocated by `Impute.interp`).
</details>
exercises/exercises08.md Normal file
@@ -0,0 +1,294 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 8
# Problems
### Exercise 1
Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
called `df`.
### Exercise 2
Get number of rows, columns, column names and summary statistics of the
`df` data frame from exercise 1.
### Exercise 3
Make a plot of `number` against `square` columns of `df` data frame.
### Exercise 4
Add a column named `name string` to the `df` data frame containing string
representations of the numbers in column `number`, i.e.
`["one", "two", "three", "four"]`.
### Exercise 5
Check if `df` contains column `square2`.
### Exercise 6
Extract column `number` from `df` and empty it (recall `empty!` function
discussed in chapter 4).
### Exercise 7
In the `Random` module the `randexp` function is defined; it samples numbers
from the exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in the `x` and `y` vectors. Plot histograms of the element-wise maximum of the
sampled pairs and of the sum of vector `x` and half of vector `y`.
### Exercise 8
Using vectors `x` and `y` from exercise 7 create a `df` data frame storing them
along with the element-wise maximum of the sampled pairs and the sum of vector
`x` and half of vector `y`.
Compute all standard descriptive statistics of columns of this data frame.
### Exercise 9
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
Compare the size of created files using the `filesize` function.
### Exercise 10
Write the `df` data frame into SQLite database. Next find information about
tables in this database. Run a query against a table representing the `df` data
frame to calculate the mean of column `x`. Does it match the result we got in
exercise 8?
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
CSV.jl supports reading gzip-compressed files so you can just do:
```
julia> using CSV
julia> using DataFrames
julia> df = CSV.read("example8.csv.gz", DataFrame)
4×2 DataFrame
Row │ number square
│ Int64 Int64
─────┼────────────────
1 │ 1 2
2 │ 2 4
3 │ 3 9
4 │ 4 16
```
You can also do it manually:
```
julia> using CodecZlib # you might need to install this package
julia> compressed = read("example8.csv.gz");
julia> plain = transcode(GzipDecompressor, compressed);
julia> df = CSV.read(plain, DataFrame)
4×2 DataFrame
Row │ number square
│ Int64 Int64
─────┼────────────────
1 │ 1 2
2 │ 2 4
3 │ 3 9
4 │ 4 16
```
### Exercise 2
Solution:
```
julia> nrow(df)
4
julia> ncol(df)
2
julia> names(df)
2-element Vector{String}:
"number"
"square"
julia> describe(df)
2×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Int64 Float64 Int64 Int64 DataType
─────┼──────────────────────────────────────────────────────────────
1 │ number 2.5 1 2.5 4 0 Int64
2 │ square 7.75 2 6.5 16 0 Int64
```
### Exercise 3
Solution:
```
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
```
### Exercise 4
Solution:
```
julia> df."name string" = ["one", "two", "three", "four"]
4-element Vector{String}:
"one"
"two"
"three"
"four"
julia> df
4×3 DataFrame
Row │ number square name string
│ Int64 Int64 String
─────┼─────────────────────────────
1 │ 1 2 one
2 │ 2 4 two
3 │ 3 9 three
4 │ 4 16 four
```
Note that we needed to use a string as we have space in column name.
### Exercise 5
You can use either `hasproperty` or `columnindex`:
```
julia> hasproperty(df, :square2)
false
julia> columnindex(df, :square2)
0
```
Note that if you try to access this column you will get a hint what was the
mistake you most likely made:
```
julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
```
### Exercise 6
Solution:
```
julia> empty!(df[:, :number])
Int64[]
```
Note that you must not do `empty!(df[!, :number])` nor `empty!(df.number)`
as it would corrupt the `df` data frame (these operations do non-copying
extraction of a column from a data frame as opposed to `df[:, :number]`
which makes a copy).
### Exercise 7
Solution:
```
using Random
using Plots
x = randexp(100_000);
y = randexp(100_000);
histogram(x + y / 2, label="mean")
histogram!(max.(x, y), label="maximum")
```
I have put both histograms on the same plot to show that they overlap.
### Exercise 8
Solution (you might get slightly different results because we did not set
the seed of random number generator when creating `x` and `y` vectors):
```
julia> df = DataFrame(x=x, y=y);
julia> df."x+y/2" = x + y / 2;
julia> df."max.(x,y)" = max.(x, y);
julia> describe(df, :all)
4×13 DataFrame
Row │ variable mean std min q25 median q75 max nunique nmissing first last eltype
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64 Nothing Int64 Float64 Float64 DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ x 0.997023 0.999119 3.01389e-6 0.285129 0.68856 1.38414 12.1556 0 0.250502 0.077737 Float64
2 │ y 1.00109 0.995904 2.78828e-6 0.289371 0.6957 1.38491 12.0445 0 0.689659 0.486246 Float64
3 │ x+y/2 1.49757 1.11676 0.00217486 0.688598 1.2235 2.0113 14.2046 0 0.595331 0.32086 Float64
4 │ max.(x,y) 1.49872 1.11295 0.00187844 0.691588 1.22466 2.01257 12.1556 0 0.689659 0.486246 Float64
```
We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary
statistics except `first` and `last` as expected.
### Exercise 9
```
julia> using Arrow
julia> CSV.write("df.csv", df)
"df.csv"
julia> Arrow.write("df.arrow", df)
"df.arrow"
julia> filesize("df.csv")
7587820
julia> filesize("df.arrow")
3200874
```
In this case Apache Arrow file is smaller.
### Exercise 10
```
julia> using SQLite
julia> db = SQLite.DB("df.db")
SQLite.DB("df.db")
julia> SQLite.load!(df, db, "df")
"df"
julia> SQLite.tables(db)
1-element Vector{SQLite.DBTable}:
SQLite.DBTable("df", Tables.Schema:
:x Union{Missing, Float64}
:y Union{Missing, Float64}
Symbol("x+y/2") Union{Missing, Float64}
Symbol("max.(x,y)") Union{Missing, Float64})
julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");
julia> DataFrame(query)
1×1 DataFrame
Row │ AVG(x)
│ Float64
─────┼──────────
1 │ 0.997023
julia> close(db)
```
The computed mean of column `x` is the same as we got in exercise 8.
</details>
exercises/exercises09.md Normal file
@@ -0,0 +1,293 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 9
# Problems
In this problemset we will use the `puzzles.csv` file that was
created in chapter 8. Please first load it into your Julia
session using the commands:
```
using CSV
using DataFrames
puzzles = CSV.read("puzzles.csv", DataFrame);
```
### Exercise 1
Create `matein2` data frame that will have only puzzles that have `"mateIn2"`
in the `Themes` column.
Use the `contains` function (check its documentation first).
### Exercise 2
What is the fraction of puzzles that are mate in 2 in relation to all puzzles
in the `puzzles` data frame?
### Exercise 3
Create a `small` data frame that holds the first 10 rows of the `matein2` data
frame and columns `Rating`, `RatingDeviation`, and `NbPlays`.
### Exercise 4
Iterate rows of `small` data frame and print the ratio of
`RatingDeviation` and `NbPlays` for each row.
### Exercise 5
Get names of columns from the `matein2` data frame that end with `n` (ignore case).
### Exercise 6
Write a function `collatz` that runs the following process. Start with a
positive number `n`. If it is even divide it by two. If it is odd multiply
it by 3 and add one. The function should return the number of steps needed to
reach 1.
Create a `d` dictionary that maps the number of steps needed to the list of
numbers from the range `1:100` that require this number of steps.
### Exercise 7
Using the `d` dictionary make a scatter plot of number of steps required
vs average value of numbers that require this number of steps.
### Exercise 8
Repeat the process from exercises 6 and 7, but this time use a data frame
and try to write an appropriate expression using the `combine` and `groupby`
functions (as it was explained in the last part of chapter 9). This time
perform computations for numbers ranging from one to one million.
### Exercise 9
Set seed of random number generator to `1234`. Draw 100 random points
from the interval `[0, 1]`. Store this vector in a data frame as `x` column.
Now compute `y` column using a formula `4 * (x - 0.5) ^ 2`.
Add random noise to column `y` that has normal distribution with mean 0 and
standard deviation 0.25. Call this column `z`.
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
### Exercise 10
Add a LOESS regression line of `z` as a function of `x` to the figure produced
in exercise 9.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
274135×9 DataFrame
Row │ PuzzleId FEN Moves Rating RatingDeviation Popularity NbPlays Themes GameUrl ⋯
│ String7 String String Int64 Int64 Int64 Int64 String String ⋯
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 000hf r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1… e8f7 e2e6 f7f8 e6f7 1560 76 88 441 mate mateIn2 middlegame short https://li ⋯
2 │ 001Wz 4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P… e8e5 d1d8 e5e8 d8e8 1128 81 87 54 backRankMate endgame mate mateIn… https://li
3 │ 001om 5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1… g4h4 c5f2 g2g3 f2g3 991 78 89 215 mate mateIn2 middlegame short https://li
4 │ 003Tx 2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P… e1e4 f3d2 b1a1 c8c1 1716 77 87 476 backRankMate endgame fork mate m… https://li
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
274132 │ zzxQS 2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn… c8f8 d6d1 e3e1 d1e1 1149 75 96 1722 mate mateIn2 middlegame short https://li ⋯
274133 │ zzxvB 5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3… f6g4 c7f7 f8f7 a7a8 1695 74 95 4857 endgame mate mateIn2 pin sacrifi… https://li
274134 │ zzzRN 4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3… g1h1 e3e1 f7f1 e1f1 830 108 67 31 endgame mate mateIn2 short https://li
274135 │ zzzco 5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P… f7f2 b2c2 c1b1 e2d1 1783 75 90 763 endgame mate mateIn2 queensideAt… https://li
1 column and 274127 rows omitted
```
### Exercise 2
Solution (two ways to do it):
```
julia> using Statistics
julia> nrow(matein2) / nrow(puzzles)
0.12852152542746353
julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
```
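The second approach works because (as a side note, not covered in the book) the
mean of a `Bool` vector is the fraction of `true` values: `true` converts to 1
and `false` to 0 when averaged.
```
using Statistics

# two of four entries are true, so the mean is the share of true values
frac = mean([true, false, false, true])  # 0.5
```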
### Exercise 3
Solution:
```
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10×3 DataFrame
Row │ Rating RatingDeviation NbPlays
│ Int64 Int64 Int64
─────┼──────────────────────────────────
1 │ 1560 76 441
2 │ 1128 81 54
3 │ 991 78 215
4 │ 1716 77 476
5 │ 711 81 111
6 │ 723 86 806
7 │ 754 92 248
8 │ 1177 76 827
9 │ 994 81 71
10 │ 979 144 14
```
### Exercise 4
Solution:
```
julia> for row in eachrow(small)
println(row.RatingDeviation / row.NbPlays)
end
0.17233560090702948
1.5
0.3627906976744186
0.16176470588235295
0.7297297297297297
0.10669975186104218
0.3709677419354839
0.09189842805320435
1.1408450704225352
10.285714285714286
```
### Exercise 5
Solution (several options):
```
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
2-element Vector{String}:
"FEN"
"RatingDeviation"
julia> names(matein2, r"[nN]$")
2-element Vector{String}:
"FEN"
"RatingDeviation"
```
### Exercise 6
Solution:
```
julia> function collatz(n)
i = 0
while n != 1
i += 1
n = iseven(n) ? div(n, 2) : 3 * n + 1
end
return i
end
collatz (generic function with 1 method)
julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()
julia> for n in 1:100
i = collatz(n)
if haskey(d, i)
push!(d[i], n)
else
d[i] = [n]
end
end
julia> d
Dict{Int64, Vector{Int64}} with 45 entries:
5 => [5, 32]
35 => [78, 79]
110 => [82, 83]
30 => [86, 87, 89]
32 => [57, 59]
6 => [10, 64]
115 => [73]
112 => [54, 55]
4 => [16]
13 => [34, 35]
104 => [47]
12 => [17, 96]
23 => [25]
111 => [27]
92 => [91]
11 => [48, 52, 53]
118 => [97]
⋮ => ⋮
```
As we can see even for small `n` the number of steps required to reach `1`
can get quite large.
### Exercise 7
Solution:
```
using Plots
using Statistics
steps = collect(keys(d))
mean_number = mean.(values(d))
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
```
Note that we needed to use `collect` on `keys` as `scatter` expects an array
not just an iterator.
### Exercise 8
Solution:
```
df = DataFrame(n=1:10^6);
df.collatz = collatz.(df.n);
agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
```
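Exercise 8 evaluates `collatz` for a million starting points, and many
trajectories pass through numbers that were already seen. As a side note (not
covered in the book), a memoized variant can reuse those results; the cache
and function names below are illustrative:
```
# cache mapping a number to its known step count; 1 needs 0 steps
const COLLATZ_CACHE = Dict{Int, Int}(1 => 0)

function collatz_memo(n::Int)
    # return a cached step count if we already know it
    haskey(COLLATZ_CACHE, n) && return COLLATZ_CACHE[n]
    next = iseven(n) ? div(n, 2) : 3 * n + 1
    steps = collatz_memo(next) + 1
    COLLATZ_CACHE[n] = steps  # the cache also fills up for intermediate values
    return steps
end
```
For example, `collatz_memo(27)` returns `111`, matching the `111 => [27]` entry
of the dictionary built in exercise 6.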
### Exercise 9
Solution:
```
using Random
Random.seed!(1234)
df = DataFrame(x=rand(100))
df.y = 4 .* (df.x .- 0.5) .^ 2
df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
```
### Exercise 10
Solution:
```
using Loess
model = loess(df.x, df.z);
x_predict = sort(df.x)
z_predict = predict(model, x_predict)
plot!(x_predict, z_predict; label="z predicted")
```
</details>
exercises/exercises10.md Normal file
@@ -0,0 +1,303 @@
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 10
# Problems
### Exercise 1
Generate a random matrix `mat` of size 5x4 with all elements drawn
independently and uniformly from the [0,1[ interval.
Create a data frame using data from this matrix using auto-generated
column names.
### Exercise 2
Now, using matrix `mat` create a data frame with randomly generated
column names. Use the `randstring` function from the `Random` module
to generate them. Store this data frame in `df` variable.
### Exercise 3
Create a new data frame, taking `df` as a source that will have the same
columns but its column names will be `y1`, `y2`, `y3`, `y4`.
### Exercise 4
Create a dictionary holding `column_name => column_vector` pairs
using data stored in data frame `df`. Save this dictionary in variable `d`.
### Exercise 5
Create a data frame back from dictionary `d` from exercise 4. Compare it
with `df`.
### Exercise 6
For data frame `df` compute the dot product between all pairs of its columns.
Use the `dot` function from the `LinearAlgebra` module.
### Exercise 7
Given two data frames:
```
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 101
2 │ 2 102
```
vertically concatenate them so that only columns that are present in both
data frames are kept. Check the documentation of `vcat` to see how to
do it.
### Exercise 8
Now append to `df1` table `df2`, but add only the columns from `df2` that
are present in `df1`. Check the documentation of `append!` to see how to
do it.
### Exercise 9
Create a `circle` data frame, using the `push!` function that will store
1000 samples of the following process:
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
and is `false` otherwise.
Compute summary statistics of this data frame.
### Exercise 10
Create a scatterplot of `circle` data frame where its `x` and `y` axis
will be the plotted points and `inside` variable will determine the color
of the plotted point.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> using DataFrames
julia> mat = rand(5, 4)
5×4 Matrix{Float64}:
0.8386 0.83612 0.0353994 0.15547
0.590172 0.611815 0.0691152 0.915788
0.879395 0.07271 0.980079 0.655158
0.340435 0.756196 0.0697535 0.388578
0.714515 0.861872 0.971521 0.176768
julia> DataFrame(mat, :auto)
5×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
### Exercise 2
Solution:
```
julia> using Random
julia> df = DataFrame(mat, [randstring() for _ in 1:4])
5×4 DataFrame
Row │ 6mTK5evn K8Inf7ER 5Caz55k0 SRiGemsa
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
### Exercise 3
Solution:
```
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
Row │ y1 y2 y3 y4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
You could also use the `rename` function:
```
julia> rename(df, string.("y", 1:4))
5×4 DataFrame
Row │ y1 y2 y3 y4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
```
### Exercise 4
Solution:
```
julia> d = Dict([n => df[:, n] for n in names(df)])
Dict{String, Vector{Float64}} with 4 entries:
"6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
"5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
"K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
"SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
```
or (using the `pairs` function; note that this time column names are `Symbol`):
```
julia> Dict(pairs(eachcol(df)))
Dict{Symbol, AbstractVector} with 4 entries:
Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
:SRiGemsa => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
:K8Inf7ER => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
```
### Exercise 5
Solution:
```
julia> DataFrame(d)
5×4 DataFrame
Row │ 5Caz55k0 6mTK5evn K8Inf7ER SRiGemsa
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.0353994 0.8386 0.83612 0.15547
2 │ 0.0691152 0.590172 0.611815 0.915788
3 │ 0.980079 0.879395 0.07271 0.655158
4 │ 0.0697535 0.340435 0.756196 0.388578
5 │ 0.971521 0.714515 0.861872 0.176768
```
Note that the columns of the data frame are now sorted by their names.
This is done for `Dict` objects because such dictionaries do not have
a defined order of keys.
### Exercise 6
Solution:
```
julia> using LinearAlgebra
julia> using StatsBase
julia> pairwise(dot, eachcol(df))
4×4 Matrix{Float64}:
2.45132 1.99944 1.65026 1.50558
1.99944 2.39336 1.03322 1.18411
1.65026 1.03322 1.9153 0.909744
1.50558 1.18411 0.909744 1.47431
```
### Exercise 7
Solution:
```
julia> vcat(df1, df2, cols=:intersect)
4×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 1
4 │ 2
```
By default you will get an error:
```
julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
```
### Exercise 8
Solution:
```
julia> append!(df1, df2, cols=:subset)
4×2 DataFrame
Row │ a b
│ Int64 Int64?
─────┼────────────────
1 │ 1 11
2 │ 2 12
3 │ 1 missing
4 │ 2 missing
```
### Exercise 9
Solution:
```
circle=DataFrame()
for _ in 1:1000
x, y = 2rand()-1, 2rand()-1
inside = x^2 + y^2 < 1
push!(circle, (x=x, y=y, inside=inside))
end
describe(circle)
```
We note that the mean of the `inside` variable is approximately π/4 (the ratio of the area of the unit disk to the area of the square), so multiplying it by 4 gives a Monte Carlo estimate of π.
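This can be checked directly with a minimal Monte Carlo sketch (illustrative code, not part of the book's solution):

```
using Statistics

# The square [-1, 1[ x [-1, 1[ has area 4 and the unit disk has area pi,
# so the fraction of uniformly drawn points falling inside the disk
# approximates pi/4.
x = 2 .* rand(100_000) .- 1
y = 2 .* rand(100_000) .- 1
inside = @. x^2 + y^2 < 1
pi_est = 4 * mean(inside)
```

With 100,000 samples the estimate is typically within about 0.01 of π.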
### Exercise 10
Solution:
```
using Plots
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)
```
The solution shows two ways of plotting that ensure that the ratio between the
x and y axes is 1 (a square plot `size` and `aspect_ratio=:equal`). Note the
differences in the produced output between the two methods.
</details>

exercises/exercises11.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 11
# Problems
### Exercise 1
Generate a data frame `df` having one column `x` consisting of 100,000 values
sampled from the uniform distribution on the [0, 1[ interval.
Serialize it to disk, and then deserialize it. Check if the deserialized
object is the same as the source data frame.
### Exercise 2
Add a column `n` to the `df` data frame that in each row will hold the
number of observations in column `x` whose distance from the value stored
in that row of `x` is less than `0.1`.
### Exercise 3
Investigate visually how `n` depends on `x` in the data frame `df`.
### Exercise 4
Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```
Load this data into `testdf` data frame.
### Exercise 5
Check the accuracy of the computed sine of `x` in `testdf`.
Print all rows for which the absolute difference between `sinx` and the exact
`sin(x)` is greater than `5e-7`. For these rows display `x`, `sinx`, the exact
value of `sin(x)`, and the absolute difference.
### Exercise 6
Group the data in the data frame `df` into buckets of width 0.1 and store the
result in the `gdf` grouped data frame (sort the groups). Use the `cut`
function from CategoricalArrays.jl to do it (check its documentation to learn
how). Check the number of values in each group.
### Exercise 7
Display the grouping keys in `gdf` grouped data frame. Show them as named tuples.
Check what would be the group order if you asked not to sort them.
### Exercise 8
Compute average `n` for each group in `gdf`.
### Exercise 9
Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as named tuple having fields `α₀` and `αₓ`.
### Exercise 10
Repeat exercise 9 but using the GLM.jl package. This time
extract the p-value for the slope of estimated coefficient for `x` variable.
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> using DataFrames
julia> df = DataFrame(x=rand(100_000));
julia> using Serialization
julia> serialize("df.bin", df)
julia> deserialize("df.bin") == df
true
```
### Exercise 2
Solution:
A simple approach is:
```
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
```
A more sophisticated approach (faster and allocating less memory) would be:
```
df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
```
An even faster solution that is type stable would use a function barrier:
```
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
df.n = f(df.x)
```
Finally, you can work on sorted data to get much better performance. Here is an
example (it is a bit more advanced):
```
function f2(x)
p = sortperm(x)
n = zeros(Int, length(x))
start = 1
stop = 1
idx = 0
while idx < length(x) # you could add @inbounds here but I typically avoid it
idx += 1
while x[p[idx]] - x[p[start]] >= 0.1
start += 1
end
while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
stop += 1
end
n[p[idx]] = stop - start
end
return n
end
df.n = f2(df.x)
```
In this solution the function barrier is even more important,
as we explicitly use loops inside the function.
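As a sanity check, we can confirm that the simple and the sorted implementations agree on random data (an illustrative test, not part of the book's solution; the definitions are copied from above):

```
# Simple O(n^2) counting of observations closer than 0.1.
f(x) = map(v -> count(w -> abs(w - v) < 0.1, x), x)

# Faster two-pointer version working on sorted data.
function f2(x)
    p = sortperm(x)
    n = zeros(Int, length(x))
    start = 1
    stop = 1
    idx = 0
    while idx < length(x)
        idx += 1
        while x[p[idx]] - x[p[start]] >= 0.1
            start += 1
        end
        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
            stop += 1
        end
        n[p[idx]] = stop - start
    end
    return n
end

x = rand(1_000)
@assert f(x) == f2(x)  # both approaches give identical counts
```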
### Exercise 3
Solution:
```
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
```
As expected, near the borders of the domain the number of neighbors drops.
### Exercise 4
Solution:
```
julia> using CSV
julia> using DataFrames
julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10×2 DataFrame
Row │ x sinx
│ Float64 Float64
─────┼────────────────────
1 │ 0.139279 0.138829
2 │ 0.456779 0.441059
3 │ 0.344034 0.337287
4 │ 0.140253 0.139794
5 │ 0.848344 0.750186
6 │ 0.977512 0.829109
7 │ 0.032737 0.032731
8 │ 0.70275 0.646318
9 │ 0.422339 0.409895
10 │ 0.393878 0.383772
```
### Exercise 5
Solution (since the data frame is small, we can use `eachrow`):
```
julia> for row in eachrow(testdf)
sinx = sin(row.x)
dev = abs(sinx - row.sinx)
dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
end
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```
### Exercise 6
Solution:
```
julia> using CategoricalArrays
julia> df.xbins = cut(df.x, 0.0:0.1:1.0);
julia> gdf = groupby(df, :xbins; sort=true);
julia> [nrow(group) for group in gdf]
10-element Vector{Int64}:
9872
9976
9968
9943
10063
10173
9977
10076
9908
10044
julia> combine(gdf, nrow) # alternative way to do it
10×2 DataFrame
Row │ xbins nrow
│ Cat… Int64
─────┼───────────────────
1 │ [0.0, 0.1) 9872
2 │ [0.1, 0.2) 9976
3 │ [0.2, 0.3) 9968
4 │ [0.3, 0.4) 9943
5 │ [0.4, 0.5) 10063
6 │ [0.5, 0.6) 10173
7 │ [0.6, 0.7) 9977
8 │ [0.7, 0.8) 10076
9 │ [0.8, 0.9) 9908
10 │ [0.9, 1.0) 10044
```
You might get slightly different numbers, but all should be around 10,000.
### Exercise 7
Solution:
```
julia> NamedTuple.(keys(gdf))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
(xbins = "[0.0, 0.1)",)
(xbins = "[0.1, 0.2)",)
(xbins = "[0.2, 0.3)",)
(xbins = "[0.3, 0.4)",)
(xbins = "[0.4, 0.5)",)
(xbins = "[0.5, 0.6)",)
(xbins = "[0.6, 0.7)",)
(xbins = "[0.7, 0.8)",)
(xbins = "[0.8, 0.9)",)
(xbins = "[0.9, 1.0)",)
julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
(xbins = "[0.4, 0.5)",)
(xbins = "[0.9, 1.0)",)
(xbins = "[0.8, 0.9)",)
(xbins = "[0.0, 0.1)",)
(xbins = "[0.2, 0.3)",)
(xbins = "[0.5, 0.6)",)
(xbins = "[0.7, 0.8)",)
(xbins = "[0.3, 0.4)",)
(xbins = "[0.1, 0.2)",)
(xbins = "[0.6, 0.7)",)
```
If you pass `sort=false` instead of `sort=true` you get the groups in their
order of appearance in `df`. If you skipped the `sort` keyword argument,
the resulting group order could depend on the type of the grouping column, so
if you rely on the order of the groups, always pass the `sort` keyword argument
explicitly.
### Exercise 8
Solution:
```
julia> using Statistics
julia> [mean(group.n) for group in gdf]
10-element Vector{Float64}:
14845.847751215559
19835.367882919007
19919.195826645264
19993.023936437694
20105.506111497565
20222.35761329008
20151.794727874112
20022.69610956729
19909.331550262414
14944.511449621665
julia> combine(gdf, :n => mean) # alternative way to do it
10×2 DataFrame
Row │ xbins n_mean
│ Cat… Float64
─────┼─────────────────────
1 │ [0.0, 0.1) 14845.8
2 │ [0.1, 0.2) 19835.4
3 │ [0.2, 0.3) 19919.2
4 │ [0.3, 0.4) 19993.0
5 │ [0.4, 0.5) 20105.5
6 │ [0.5, 0.6) 20222.4
7 │ [0.6, 0.7) 20151.8
8 │ [0.7, 0.8) 20022.7
9 │ [0.8, 0.9) 19909.3
10 │ [0.9, 1.0) 14944.5
```
### Exercise 9
Solution:
```
julia> function fitmodel(x, n)
X = [ones(length(x)) x]
α₀, αₓ = X \ n
return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
end
fitmodel (generic function with 1 method)
julia> [fitmodel(group.x, group.n) for group in gdf]
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
(α₀ = 9900.190310776916, αₓ = 99131.14394200995)
(α₀ = 19823.115188829383, αₓ = 81.66979172871368)
(α₀ = 19812.9822724435, αₓ = 424.00895772216785)
(α₀ = 19810.726510910834, αₓ = 520.6763238983195)
(α₀ = 19437.772385484135, αₓ = 1483.333906139938)
(α₀ = 20187.521449870146, αₓ = 63.30709585406235)
(α₀ = 20424.362332155855, αₓ = -419.42268710601405)
(α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
(α₀ = 20013.690535193662, αₓ = -122.80055110522495)
(α₀ = 109320.55276082881, αₓ = -99305.18846102979)
julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
10×3 DataFrame
Row │ xbins α₀ αₓ
│ Cat… Float64 Float64
─────┼────────────────────────────────────────
1 │ [0.0, 0.1) 9900.19 99131.1
2 │ [0.1, 0.2) 19823.1 81.6698
3 │ [0.2, 0.3) 19813.0 424.009
4 │ [0.3, 0.4) 19810.7 520.676
5 │ [0.4, 0.5) 19437.8 1483.33
6 │ [0.5, 0.6) 20187.5 63.3071
7 │ [0.6, 0.7) 20424.4 -419.423
8 │ [0.7, 0.8) 20789.7 -1022.98
9 │ [0.8, 0.9) 20013.7 -122.801
10 │ [0.9, 1.0) 1.09321e5 -99305.2
```
We note that indeed in the first and the last group the regression has a steep
slope (the number of neighbors changes quickly near the borders of the domain).
### Exercise 10
Solution:
```
julia> using GLM
julia> function fitlmmodel(group; info=false)
model = lm(@formula(n~x), group)
coefdf = DataFrame(coeftable(model))
info && @show coefdf # to see how the data frame looks like
α₀, αₓ = coefdf[:, "Coef."]
return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
end
fitlmmodel (generic function with 1 method)
julia> [fitlmmodel(group; info = true) for group in gdf]
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 9900.19 0.388607 25476.1 0.0 9899.43 9900.95
2 │ x 99131.1 6.75846 14667.7 0.0 99117.9 99144.4
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19823.1 2.52926 7837.5 0.0 19818.2 19828.1
2 │ x 81.6698 16.5512 4.93436 8.17139e-7 49.226 114.114
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19813.0 2.8427 6969.79 0.0 19807.4 19818.6
2 │ x 424.009 11.2737 37.6106 1.32368e-289 401.91 446.108
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19810.7 3.98478 4971.59 0.0 19802.9 19818.5
2 │ x 520.676 11.3429 45.9033 0.0 498.442 542.911
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 19437.8 6.07925 3197.4 0.0 19425.9 19449.7
2 │ x 1483.33 13.4768 110.065 0.0 1456.92 1509.75
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20187.5 9.72795 2075.21 0.0 20168.5 20206.6
2 │ x 63.3071 17.6538 3.58603 0.000337323 28.7022 97.912
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20424.4 10.2201 1998.45 0.0 20404.3 20444.4
2 │ x -419.423 15.7112 -26.6958 1.0356e-151 -450.22 -388.626
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20789.7 9.56063 2174.51 0.0 20771.0 20808.4
2 │ x -1022.98 12.7417 -80.2856 0.0 -1047.95 -998.001
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 20013.7 8.86033 2258.8 0.0 19996.3 20031.1
2 │ x -122.801 10.4201 -11.785 7.60822e-32 -143.226 -102.375
coefdf = 2×7 DataFrame
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
│ String Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
1 │ (Intercept) 1.09321e5 5.78343 18902.4 0.0 1.09309e5 1.09332e5
2 │ x -99305.2 6.08269 -16325.9 0.0 -99317.1 -99293.3
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
(α₀ = 9900.190310776927, αₓ = 99131.1439420097)
(α₀ = 19823.115188829663, αₓ = 81.66979172690417)
(α₀ = 19812.98227244386, αₓ = 424.00895772074136)
(α₀ = 19810.726510911398, αₓ = 520.6763238966264)
(α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
(α₀ = 20187.521449871012, αₓ = 63.307095852511125)
(α₀ = 20424.36233216108, αₓ = -419.4226871140539)
(α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
(α₀ = 20013.69053519897, αₓ = -122.80055111148658)
(α₀ = 109320.55276074051, αₓ = -99305.18846093686)
julia> combine(gdf, fitlmmodel)
10×3 DataFrame
Row │ xbins α₀ αₓ
│ Cat… Float64 Float64
─────┼────────────────────────────────────────
1 │ [0.0, 0.1) 9900.19 99131.1
2 │ [0.1, 0.2) 19823.1 81.6698
3 │ [0.2, 0.3) 19813.0 424.009
4 │ [0.3, 0.4) 19810.7 520.676
5 │ [0.4, 0.5) 19437.8 1483.33
6 │ [0.5, 0.6) 20187.5 63.3071
7 │ [0.6, 0.7) 20424.4 -419.423
8 │ [0.7, 0.8) 20789.7 -1022.98
9 │ [0.8, 0.9) 20013.7 -122.801
10 │ [0.9, 1.0) 1.09321e5 -99305.2
```
We got the same results. The `combine(gdf, fitlmmodel)` style of using
the `combine` function is a bit more advanced and is not covered in the book.
It is used in cases, like the one we have here, when you want to pass
a whole group to the function called by `combine`. Check the DataFrames.jl
documentation for a more detailed explanation.
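For reference, here is a minimal, self-contained sketch of this whole-group style on hypothetical toy data: the function receives each group as a `SubDataFrame` and returns a `NamedTuple` that `combine` turns into a result row:

```
using DataFrames

df = DataFrame(g=[1, 1, 2], v=[1.0, 3.0, 10.0])
# The anonymous function gets a whole SubDataFrame for each group;
# the NamedTuple it returns becomes the columns of the result.
res = combine(groupby(df, :g), sdf -> (vmean=sum(sdf.v) / nrow(sdf),))
```

Here `res` has one row per group, with the grouping column `g` followed by the computed `vmean` column.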
</details>

exercises/exercises12.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 12
# Problems
### Exercise 1
The https://go.dev/dl/go1.19.2.src.tar.gz link contains the source code of
the Go language version 1.19.2. As you can check on its website, its SHA-256
checksum is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`.
Download this file and check if it indeed has this checksum.
You might need to read the documentation of the `string` and `join` functions.
### Exercise 2
Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
that contains the ego-nets of Eastern European users collected from the music
streaming service Deezer in February 2020. Nodes are users and edges are mutual
follower relationships.
From the file extract deezer_edges.json and deezer_target.csv files and
save them to disk.
### Exercise 3
Load deezer_edges.json and deezer_target.csv files to Julia.
The JSON file should be loaded as JSON3.jl object `edges_json`.
The CSV file should be loaded into a data frame `target_df`.
### Exercise 4
Check that keys in the `edges_json` are in the same order as `id` column
in `target_df`.
### Exercise 5
From every value stored in `edges_json` create a graph representing
ego-net of the given node. Store these graphs in a vector that will become the
`egonet` column of the `target_df` data frame.
### Exercise 6
An ego-net in our data set is a subgraph of the full Deezer graph in which, for
some node, all of its neighbors are included, together with all edges between
those neighbors.
Therefore we expect that the diameter of every ego-net is at most 2 (as any
two nodes are either connected directly or through a common friend).
Check if this is indeed the case. Use the `diameter` function.
### Exercise 7
For each ego-net find a central node that is connected to every other node
in this network. Use the `degree` and `findall` functions to achieve this.
Add `center` column with numbers of nodes that are connected to all other
nodes in the ego-net to `target_df` data frame.
Next add a column `center_len` that gives the number of such nodes.
Check how many times different numbers of center nodes are found.
### Exercise 8
Add the following ego-net features to the `target_df` data frame:
* `size`: number of nodes in ego-net
* `mean_degree`: average node degree in ego-net
Check mean values of these two columns by `target` column.
### Exercise 9
Continuing to work with `target_df` data frame create a logistic regression
explaining `target` by `size` and `mean_degree`.
### Exercise 10
Continuing to work with `target_df`, create a scatterplot with `size` on
one axis and `mean_degree` rounded to the nearest integer on the other axis.
Plot the mean of `target` for each point being a combination of `size` and
rounded `mean_degree`.
Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
for values in range from 5% to 95% quantile (to concentrate on typical values
of size).
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
using Downloads
using SHA
Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz")
shavec = open(sha256, "go.tar.gz")
shastr = join(string.(shavec; base=16, pad=2))
sha == shastr
```
The last line should produce `true`.
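As a side note (not required by the exercise), Julia's `Base.bytes2hex` performs the same bytes-to-hex conversion as the `join`/`string` idiom in one call. A small sketch using the standard SHA-256 test vector for the string `"abc"`:

```
using SHA

digest = sha256("abc")  # returns a Vector{UInt8}
# bytes2hex produces a lowercase hex string, equivalent to
# join(string.(digest; base=16, pad=2)).
@assert bytes2hex(digest) ==
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```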
### Exercise 2
Solution:
```
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
import ZipFile
archive = ZipFile.Reader("ego.zip")
idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files))
open("deezer_edges.json", "w") do io
write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files))
open("deezer_target.csv", "w") do io
write(io, read(archive.files[idx]))
end
close(archive)
```
### Exercise 3
Solution:
```
using CSV
using DataFrames
using JSON3
edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)
```
### Exercise 4
Solution (short, but you need to have a good understanding of Julia types
and standard functions to properly write it):
```
Symbol.(target_df.id) == keys(edges_json)
```
### Exercise 5
Solution:
```
using Graphs
function edgelist2graph(edgelist)
nodes = sort!(unique(reduce(vcat, edgelist)))
@assert 0:length(nodes)-1 == nodes
g = SimpleGraph(length(nodes))
for (a, b) in edgelist
add_edge!(g, a+1, b+1)
end
return g
end
target_df.egonet = edgelist2graph.(values(edges_json))
```
### Exercise 6
Solution:
```
julia> extrema(diameter.(target_df.egonet))
(2, 2)
```
Indeed, we see that every ego-net has a diameter of 2 (both the minimum and the maximum are 2).
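The lower bound of 2 is easy to see on a toy example (illustrative, not part of the solution): in a star graph the center is adjacent to every leaf, while any two leaves can reach each other only through the center:

```
using Graphs

g = star_graph(5)  # node 1 is the center, nodes 2-5 are leaves
@assert diameter(g) == 2  # leaves are at distance 2 from each other
```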
### Exercise 7
Solution:
```
target_df.center = map(target_df.egonet) do g
findall(==(nv(g) - 1), degree(g))
end
target_df.center_len = length.(target_df.center)
combine(groupby(target_df, :center_len, sort=true), nrow)
```
Note that we used `map` since in this case it gives a convenient way to express
the condition we want to check.
We notice that in some cases it is impossible to identify the center of the
ego-net uniquely.
### Exercise 8
Solution:
```
using Statistics
target_df.size = nv.(target_df.egonet)
target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size
combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
```
It seems that for `target` equal to `0` both the size and the average degree of
the network are a bit larger.
### Exercise 9
Solution:
```
using GLM
glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
```
We see that only `size` is statistically significant.
### Exercise 10
Solution:
```
using Plots
target_df.round_degree = round.(Int, target_df.mean_degree)
agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean)
scatter(agg_df.size, agg_df.round_degree;
zcolor=agg_df.target_mean,
xlabel="size", ylabel="rounded degree",
label="mean target", xaxis=:log)
```
It is hard to visually see any strong relationship in the data.
```
import Loess
model = Loess.loess(target_df.size, target_df.target)
size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95)
target_predict = Loess.predict(model, size_predict)
plot(size_predict, target_predict;
xlabel="size", ylabel="predicted target", legend=false)
```
Between the 5% and 95% quantiles we see a downward-sloping relationship.
</details>

exercises/exercises13.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 13
# Problems
### Exercise 1
Download
https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip
archive and extract primary_data.csv and secondary_data.csv files from it.
Save the files to disk.
### Exercise 2
Load primary_data.csv into the `primary` data frame.
Load secondary_data.csv into the `secondary` data frame.
Describe the contents of both data frames.
### Exercise 3
Start with `primary` data. Note that columns starting from column 4 have
their data encoded using vector notation, but they have been read in as strings.
Convert these columns to hold proper vectors. Note that some columns have
`missing` values. Most of the columns hold nominal data, but three columns,
i.e. `cap-diameter`, `stem-height`, and `stem-width` have numeric data. These
should be parsed as vectors storing numeric values. After parsing, put these
three columns just after `class` column in the `parsed_primary` data frame.
Check `renamecols` keyword argument of `select` to
avoid renaming of the produced columns.
### Exercise 4
In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be
confused and have a different class, using the information about them we have
(so all information except their family, name, and class).
Use the following rules:
* if for some pair of mushrooms the data in some column for either of them is
`missing` then skip matching on this column; for numeric columns if there
is only one value in a vector then treat it as `missing`;
* otherwise:
- for numeric columns check if there is an overlap in the interval
specified by the min and max values for the range passed;
- for nominal columns check if the intersection of nominal values is nonempty.
For each found pair print to the screen the row number, family, name, and class.
### Exercise 5
Still using `parsed_primary`, find the average probability of the class being
`p` by `family`. Additionally add the number of observations in each group.
Sort these results by the probability. Try using DataFramesMeta.jl to do this
exercise (this requirement is optional).
Store the result in the `agg_primary` data frame.
### Exercise 6
Now using `agg_primary` data frame collapse it so that for each unique `pr_p`
it gives us a total number of rows that had this probability and a tuple
of mushroom family names.
Optionally: try to display the produced table so that the tuple containing the
list of families for each group is not cropped (this will require large
terminal).
### Exercise 7
From our preliminary analysis of `primary` data we see that `missing` value in
the primary data is non-informative, so in `secondary` data we should be
cautious when building a model if we allowed for missing data (in practice
if we were investigating some real mushroom we most likely would know its
characteristics).
Therefore as a first step drop in-place all columns in `secondary` data frame
that have missing values.
### Exercise 8
Create a logistic regression predicting `class` based on all remaining features
in the data frame. You might need to check the `Term` usage in StatsModels.jl
documentation.
You will notice that for `stem-color` and `habitat` columns you get strange
estimation results (large absolute values of estimated parameters and even
larger standard errors). Explain why this happens by analyzing frequency tables
of these variables against `class` column.
### Exercise 9
Add `class_p` column to `secondary` as a second column that will contain
predicted probability from the model created in exercise 8 of a given
observation having class `p`.
Print descriptive statistics of column `class_p` by `class`.
### Exercise 10
Plot FPR-TPR ROC curve for our model and compute associated AUC value.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
write(io, read(archive.files[idx]))
end
close(archive)
```
### Exercise 2
Solution:
```
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)
```
### Exercise 3
Solution:
```
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
idcols,
numericcols .=> ByRow(parse_numeric),
Not([idcols; numericcols]) .=> ByRow(parse_nominal);
renamecols=false)
```
### Exercise 4
Solution:
```
function overlap_numeric(v1, v2)
# there are no missing values in numeric columns
if length(v1) == 1 || length(v2) == 1
return true
else
return max(v1[1], v2[1]) <= min(v1[2], v2[2])
end
end
function overlap_nominal(v1, v2)
if ismissing(v1) || ismissing(v2)
return true
else
return !isempty(intersect(v1, v2))
end
end
function rows_overlap(row1, row2)
# note that in parsed_primary numeric columns have indices 4 to 6
# and nominal columns have indices 7 to 23
return all(i -> overlap_numeric(row1[i], row2[i]), 4:6) &&
all(i -> overlap_nominal(row1[i], row2[i]), 7:23)
end
for i in 1:nrow(parsed_primary), j in i+1:nrow(parsed_primary)
row1 = parsed_primary[i, :]
row2 = parsed_primary[j, :]
if rows_overlap(row1, row2) && row1.class != row2.class
println((i, Tuple(row1[1:3]), j, Tuple(row2[1:3])))
end
end
```
Note that in this exercise using `eachrow` is not a problem
(although it is not type stable) because the data is small.
### Exercise 5
Solution:
```
using Statistics
using DataFramesMeta
agg_primary = @chain parsed_primary begin
groupby(:family)
@combine(:pr_p = mean(:class .== "p"), $nrow)
sort(:pr_p)
end
```
### Exercise 6
Solution:
```
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families), truncate=140)
```
### Exercise 7
Solution:
```
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
```
Note that we select based on actual contents of the columns and not by their
element type (column could allow for missing values but not have them).
### Exercise 8
Solution:
```
using GLM
using FreqTables
secondary.class = secondary.class .== "p"
model = glm(Term(:class)~sum(Term.(Symbol.(names(secondary, Not(:class))))),
secondary, Binomial(), LogitLink())
freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")
```
We can see that for certain levels of the `stem-color` and `habitat` variables
there is a perfect separation of classes.
### Exercise 9
Solution:
```
insertcols!(secondary, 2, :class_p => predict(model))
combine(groupby(secondary, :class)) do sdf
return describe(sdf, :detailed, cols=:class_p)
end
```
We can see that the model has some discriminatory power, but there
is still a significant overlap between classes.
### Exercise 10
Solution:
```
using Plots
using ROCAnalysis
roc_data = roc(secondary; score=:class_p, target=:class)
plot(roc_data.pfa, 1 .- roc_data.pmiss;
title="AUC=$(round(100*(1 - auc(roc_data)), digits=2))%",
xlabel="FPR", ylabel="TPR", legend=false)
```
</details>

exercises/exercises14.md
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 14
# Problems
### Exercise 1
Write a simulator that takes one parameter `n` defining a set of `n` distinct
items. Next we assume we draw `n` times from this set with replacement. The
simulator should return the number of items from this set that were drawn at
least once. Call the function running this simulation `boot`.
### Exercise 2
Now write a function `simboot` that takes parameters `k` and `n` and runs
the simulation defined in the `boot` function `k` times. It should return
a named tuple storing `k`, `n`, the mean of the produced values, and the ends of
an approximate 95% confidence interval of the result.
Make this function single-threaded. Check how long
this function runs for `k=1000` and `n=1_000_000`.
### Exercise 3
Now rewrite this simulator to be multi-threaded. Use 4 threads for benchmarking.
Call the function `simbootT`. Check how long this function runs for `k=1000` and
`n=1_000_000`.
### Exercise 4
Now rewrite `boot` and `simbootT` to perform fewer allocations. Achieve this by
making sure that all allocated objects are passed to the `boot` function (so that
it does not perform any allocations internally). Call these new functions `boot!`
and `simbootT2`. You might need to use the `Threads.threadid` and
`Threads.nthreads` functions.
### Exercise 5
Use either of the solutions developed in the previous exercises to create a web
service that takes `k` and `n` parameters and returns the values produced by the
simulation along with the time it took to run. You might want to use the
`@timed` macro in your solution.
Start the server.
### Exercise 6
Query the server started in exercise 5 with
the following parameters:
* `k=1000` and `n=1000`
* `k=1.5` and `n=1000`
### Exercise 7
Collect the data generated by the web service into the `df` data frame for
`k = [10^i for i in 3:6]` and `n = [10^i for i in 1:3]`.
### Exercise 8
Replace the `value` column in the `df` data frame with the columns it contains,
in place.
### Exercise 9
Check that the execution time roughly scales proportionally to the product
of `k` and `n`.
### Exercise 10
Plot the expected fraction of seen elements in the set as a function of
`n` for each value of `k`, along with a 95% confidence interval around these
values.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution (there are many other approaches you could use):
```
using Statistics
function boot(n::Integer)
table = falses(n)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
```
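As an optional sanity check we can compare the simulation with the exact
expectation: a given item is missed in one draw with probability `1 - 1/n`, so
the expected fraction of items seen is `1 - (1 - 1/n)^n`, which tends to
`1 - exp(-1)` as `n` grows:
```
n = 1000
1 - (1 - 1/n)^n  # ≈ 0.6323
1 - exp(-1)      # ≈ 0.6321
```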
### Exercise 2
Solution:
```
function simboot(k::Integer, n::Integer)
result = [boot(n) for _ in 1:k]
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
```
We run it twice to make sure everything is compiled:
```
julia> @time simboot(1000, 1_000_000)
7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time)
(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944)
julia> @time simboot(1000, 1_000_000)
7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time)
(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754)
```
We see that on my computer the run time is around 7 seconds.
### Exercise 3
Solution:
```
using ThreadsX
function simbootT(k::Integer, n::Integer)
result = ThreadsX.map(i -> boot(n), 1:k)
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
```
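Before benchmarking, make sure Julia was started with the desired number of
threads (for example with `julia -t 4`); otherwise `ThreadsX.map` will still run
on a single thread:
```
Threads.nthreads()  # should report 4 for the benchmark below
```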
Here is the timing for four threads:
```
julia> @time simbootT(1000, 1_000_000)
2.390795 seconds (3.37 k allocations: 119.434 MiB)
(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484)
julia> @time simbootT(1000, 1_000_000)
2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time)
(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554)
```
Indeed we see a significant performance improvement.
### Exercise 4
Solution:
```
function boot!(n::Integer, pool)
table = pool[Threads.threadid()]
fill!(table, false)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
function simbootT2(k::Integer, n::Integer)
pool = [falses(n) for _ in 1:Threads.nthreads()]
result = ThreadsX.map(i -> boot!(n, pool), 1:k)
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
```
In this solution the `pool` vector stores a separate `table` vector for each
thread. Let us test the timing:
```
julia> @time simbootT2(1000, 1_000_000)
2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)
julia> @time simbootT2(1000, 1_000_000)
2.340694 seconds (391 allocations: 586.453 KiB)
(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)
```
Indeed, we see that the number of allocations decreased, which should lower GC
pressure. However, the runtime of the simulation is similar, since in this task
memory allocation does not account for a significant portion of the runtime.
### Exercise 5
Solution (I used the simplest single-threaded code here; this is the complete
code of the web service):
```
using Genie
using Statistics
function boot(n::Integer)
table = falses(n)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
function simboot(k::Integer, n::Integer)
result = [boot(n) for _ in 1:k]
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
Genie.config.run_as_server = true
Genie.Router.route("/", method=POST) do
message = Genie.Requests.jsonpayload()
return try
k = message["k"]
n = message["n"]
value, time = @timed simboot(k, n)
Genie.Renderer.Json.json((status="OK", time=time, value=value))
catch
Genie.Renderer.Json.json((status="ERROR", time="", value=""))
end
end
Genie.Server.up()
```
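The `try`/`catch` block is important because `simboot` accepts only integers. A
JSON payload such as `k=1.5` is parsed as a `Float64`, so without the `catch`
branch the handler would fail with an unhandled error:
```
simboot(1.5, 1000)  # throws MethodError: no method matching simboot(::Float64, ::Int64)
```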
### Exercise 6
Solution:
```
julia> using HTTP
julia> using JSON3
julia> HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((k=1000, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked
{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}"""
julia> HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((k=1.5, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked
{"status":"ERROR","time":"","value":""}"""
```
As expected, we got a positive answer on the first call and an error on the
second.
### Exercise 7
Solution:
```
using DataFrames
df = DataFrame()
for k in [10^i for i in 3:6], n in [10^i for i in 1:3]
@show k, n
req = HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((; k, n)))
push!(df, NamedTuple(JSON3.read(req.body)))
end
```
Note that I convert the `JSON3.Object` into a `NamedTuple` to easily `push!`
it into the `df` data frame.
Let us have a look at the produced data frame:
```
julia> df
12×3 DataFrame
Row │ status time value
│ String Float64 Object…
─────┼──────────────────────────────────────────────────────
1 │ OK 0.0006784 {\n "k": 1000,\n "n": …
2 │ OK 0.0038374 {\n "k": 1000,\n "n": …
3 │ OK 0.0150844 {\n "k": 1000,\n "n": …
4 │ OK 0.0014071 {\n "k": 10000,\n "n":…
5 │ OK 0.008443 {\n "k": 10000,\n "n":…
6 │ OK 0.0700319 {\n "k": 10000,\n "n":…
7 │ OK 0.0253826 {\n "k": 100000,\n "n"…
8 │ OK 0.0795937 {\n "k": 100000,\n "n"…
9 │ OK 0.708287 {\n "k": 100000,\n "n"…
10 │ OK 0.160286 {\n "k": 1000000,\n "n…
11 │ OK 0.803433 {\n "k": 1000000,\n "n…
12 │ OK 7.23958 {\n "k": 1000000,\n "n…
```
### Exercise 8
Solution:
```
julia> select!(df, :status, :time, :value => AsTable)
12×7 DataFrame
Row │ status time k n mv lo95 hi95
│ String Float64 Int64 Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ OK 0.0006784 1000 10 0.6469 0.640745 0.653055
2 │ OK 0.0038374 1000 100 0.63508 0.633035 0.637125
3 │ OK 0.0150844 1000 1000 0.632178 0.631581 0.632775
4 │ OK 0.0014071 10000 10 0.65239 0.650425 0.654355
5 │ OK 0.008443 10000 100 0.634456 0.633845 0.635067
6 │ OK 0.0700319 10000 1000 0.63207 0.631878 0.632262
7 │ OK 0.0253826 100000 10 0.651411 0.650793 0.652029
8 │ OK 0.0795937 100000 100 0.634 0.633807 0.634193
9 │ OK 0.708287 100000 1000 0.63224 0.632179 0.632302
10 │ OK 0.160286 1000000 10 0.65129 0.651095 0.651486
11 │ OK 0.803433 1000000 100 0.633995 0.633934 0.634056
12 │ OK 7.23958 1000000 1000 0.63232 0.632301 0.63234
```
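The `:value => AsTable` transformation expands a column whose elements have
named fields into separate columns. The same pattern on a minimal, hypothetical
data frame:
```
using DataFrames
toy = DataFrame(id=[1, 2], value=[(a=10, b=20), (a=30, b=40)])
select(toy, :id, :value => AsTable)  # yields columns :id, :a, and :b
```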
### Exercise 9
Solution:
```
julia> using DataFramesMeta
julia> @chain df begin
@rselect(:k, :n, :avg_time = :time / (:k * :n))
unstack(:k, :n, :avg_time)
end
4×4 DataFrame
Row │ k 10 100 1000
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────
1 │ 1000 6.784e-8 3.8374e-8 1.50844e-8
2 │ 10000 1.4071e-8 8.443e-9 7.00319e-9
3 │ 100000 2.53826e-8 7.95937e-9 7.08287e-9
4 │ 1000000 1.60286e-8 8.03433e-9 7.23958e-9
```
We see that this is indeed the case. For large `k` and `n` the average time per
single sample stabilizes (for small values the runtime is low, so the timing is
more affected by external noise, and the other operations that the functions
perform have a bigger impact on the results).
### Exercise 10
Solution:
```
using Plots
gdf = groupby(df, :k, sort=true)
plot([bar(string.(g.n), g.mv;
ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
legend=false, title=first(g.k),
yerror=(g.mv - g.lo95, g.hi95-g.mv)) for g in gdf]...)
```
As expected, the error bandwidth gets smaller as `k` increases.
Note that as `n` increases the estimated value tends to `1-exp(-1)`.
</details>
