update layout of all exercises

This commit is contained in:
Bogumił Kamiński 2022-10-14 13:43:12 +02:00
parent 38398729ce
commit 31d8428f6a
11 changed files with 1042 additions and 925 deletions

View File

@ -11,64 +11,8 @@
Check what methods the `repeat` function has.
Are they all covered in the help for this function?
### Exercise 2
Write a function `fun2` that takes any vector and returns the difference between
the largest and the smallest element in this vector.
### Exercise 3
Generate a vector of one million random numbers from `[0, 1]` interval.
Check which is the faster way to get the maximum and minimum element in it: one
option is using the `maximum` and `minimum` functions, and the other is using
the `extrema` function.
### Exercise 4
Assume you have accidentally typed `+x = 1` when wanting to assign `1` to
variable `x`. What effects can this operation have?
### Exercise 5
What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why?
### Exercise 6
Define two identical anonymous functions `x -> x + 1` in global scope. Do they
have the same type?
### Exercise 7
Define the `wrap` function taking one argument `i` and returning the anonymous
function `x -> x + i`. Is the type of this anonymous function the same across
calls to the `wrap` function?
### Exercise 8
You want to write a function that accepts any `Integer` except `Bool` and returns
the passed value. If `Bool` is passed an error should be thrown.
### Exercise 9
The `@time` macro measures the time taken to run an expression and prints it,
while returning the value of the expression.
The `@elapsed` macro works differently: it does not print anything, but returns
the time taken to evaluate an expression. Use the `@elapsed` macro to see how
long it takes to shuffle a vector of one million floats. Use the `shuffle` function
from the `Random` module.
### Exercise 10
Using the `@btime` macro benchmark the time of calculating the sum of one million
random floats.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
<summary>Solution</summary>
Write:
```
@ -93,8 +37,16 @@ and `repeat(c::Char, r::Integer)` is its faster version
that accepts values that have `Char` type only (and it is invoked by Julia
if value of type `Char` is passed as an argument to `repeat`).
</details>
### Exercise 2
Write a function `fun2` that takes any vector and returns the difference between
the largest and the smallest element in this vector.
<details>
<summary>Solution</summary>
You can define it as follows:
```
fun2(x::AbstractVector) = maximum(x) - minimum(x)
@ -109,8 +61,18 @@ end
Note that these two functions will work with vectors of any elements that
are ordered and support subtraction (they do not have to be numbers).
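For instance, `Char` values are ordered and subtracting two of them yields an `Int`, so the same definition works for a vector of characters (a small illustration, not part of the original exercise):

```julia
fun2(x::AbstractVector) = maximum(x) - minimum(x)

# subtracting Chars yields the distance between their code points:
fun2(['a', 'd', 'b'])  # 'd' - 'a', i.e. 3
```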
</details>
### Exercise 3
Generate a vector of one million random numbers from `[0, 1]` interval.
Check which is the faster way to get the maximum and minimum element in it: one
option is using the `maximum` and `minimum` functions, and the other is using
the `extrema` function.
<details>
<summary>Solution</summary>
Here is a way to compare the performance of both options:
```
julia> using BenchmarkTools
@ -130,8 +92,16 @@ As you can see in this situation, although `extrema` does the operation
in a single pass over `x` it is slower than computing `minimum` and `maximum`
in two passes.
</details>
### Exercise 4
Assume you have accidentally typed `+x = 1` when wanting to assign `1` to
variable `x`. What effects can this operation have?
<details>
<summary>Solution</summary>
In a fresh Julia session this defines a new function for the `+` operator in `Main`:
```
@ -167,8 +137,15 @@ julia> +x=1
ERROR: error in method definition: function Base.+ must be explicitly imported to be extended
```
</details>
### Exercise 5
What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why?
<details>
<summary>Solution</summary>
You get an empty vector:
```
julia> subtypes(Union{Bool, Missing})
@ -181,8 +158,16 @@ declared types that have names (type of such types is `DataType` in Julia).
*Extra*: for this reason `subtypes` has limited use. To check if one type
is a subtype of another type use the `<:` operator.
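For example, a quick sketch of `<:` in action:

```julia
# `<:` checks the subtype relation directly, also for unions:
Bool <: Integer                  # true
Missing <: Union{Bool, Missing}  # true
Union{Bool, Missing} <: Integer  # false, as Missing is not an Integer
```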
</details>
### Exercise 6
Define two identical anonymous functions `x -> x + 1` in global scope. Do they
have the same type?
<details>
<summary>Solution</summary>
No, each of them has a different type:
```
julia> f1 = x -> x + 1
@ -215,8 +200,17 @@ julia> @time sum(x -> x^2, 1:10)
385
```
</details>
### Exercise 7
Define the `wrap` function taking one argument `i` and returning the anonymous
function `x -> x + i`. Is the type of this anonymous function the same across
calls to the `wrap` function?
<details>
<summary>Solution</summary>
Yes, the type is the same:
```
@ -252,8 +246,16 @@ julia> @time sumi(3)
3025
```
</details>
### Exercise 8
You want to write a function that accepts any `Integer` except `Bool` and returns
the passed value. If `Bool` is passed an error should be thrown.
<details>
<summary>Solution</summary>
We check subtypes of `Integer`:
```
@ -292,8 +294,20 @@ julia> fun2(true)
ERROR: ArgumentError: Bool is not supported
```
</details>
### Exercise 9
The `@time` macro measures the time taken to run an expression and prints it,
while returning the value of the expression.
The `@elapsed` macro works differently: it does not print anything, but returns
the time taken to evaluate an expression. Use the `@elapsed` macro to see how
long it takes to shuffle a vector of one million floats. Use the `shuffle` function
from the `Random` module.
<details>
<summary>Solution</summary>
Here is the code that performs the task:
```
julia> using Random # needed to get access to shuffle
@ -312,8 +326,16 @@ julia> @elapsed shuffle(x)
Note that the first time we run `shuffle` it takes longer due to compilation.
</details>
### Exercise 10
Using the `@btime` macro benchmark the time of calculating the sum of one million
random floats.
<details>
<summary>Solution</summary>
The code you can use is:
```

View File

@ -12,11 +12,81 @@ Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix
columnwise with consecutive numbers). Next calculate sum, mean and standard
deviation of each row and each column of this matrix.
<details>
<summary>Solution</summary>
Write:
```
julia> using Statistics
julia> mat = [1 3 5
2 4 6]
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> sum(mat, dims=1)
1×3 Matrix{Int64}:
3 7 11
julia> sum(mat, dims=2)
2×1 Matrix{Int64}:
9
12
julia> mean(mat, dims=1)
1×3 Matrix{Float64}:
1.5 3.5 5.5
julia> mean(mat, dims=2)
2×1 Matrix{Float64}:
3.0
4.0
julia> std(mat, dims=1)
1×3 Matrix{Float64}:
0.707107 0.707107 0.707107
julia> std(mat, dims=2)
2×1 Matrix{Float64}:
2.0
2.0
```
Observe that the returned statistics are also stored in matrices.
If we compute them for columns (`dims=1`) then the produced matrix has one row.
If we compute them for rows (`dims=2`) then the produced matrix has one column.
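As a side note (not required by the exercise), if you prefer plain vectors over one-row or one-column matrices, you can flatten the result with `vec`:

```julia
mat = [1 3 5
       2 4 6]

vec(sum(mat, dims=1))  # column sums as a plain vector: [3, 7, 11]
vec(sum(mat, dims=2))  # row sums as a plain vector: [9, 12]
```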
</details>
### Exercise 2
For each column of the matrix created in exercise 1 compute its range
(i.e. the difference between maximum and minimum element stored in it).
<details>
<summary>Solution</summary>
Here are some ways you can do it:
```
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
3-element Vector{Int64}:
1
1
1
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
3-element Vector{Int64}:
1
1
1
```
Observe that if we used `eachcol` the produced result is a vector (not a matrix
like in exercise 1).
</details>
### Exercise 3
This is data for car speed (mph) and distance taken to stop (ft)
@ -79,127 +149,8 @@ speed dist
Load this data into Julia (this is part of the exercise) and fit a linear
regression where speed is a feature and distance is target variable.
### Exercise 4
Plot the data loaded in exercise 3. Additionally plot the fitted regression
(you need to check Plots.jl documentation to find a way to do this).
### Exercise 5
A simple code for calculation of Fibonacci numbers for positive
arguments is as follows:
```
fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
```
Using the BenchmarkTools.jl package measure runtime of this function for
`n` ranging from `1` to `20`.
### Exercise 6
Improve the speed of code from exercise 5 by using a dictionary where you
store a mapping of `n` to `fib(n)`. Measure the performance of this function
for the same range of values as in exercise 5.
### Exercise 7
Create a vector containing named tuples representing elements of a 4x4 grid.
So the first element of this vector should be `(x=1, y=1)` and last should be
`(x=4, y=4)`. Store the vector in variable `v`.
### Exercise 8
The `filter` function allows you to select some values of an input collection.
Check its documentation first. Next, use it to keep only those elements of the
vector `v` from exercise 7 whose coordinates sum to an even number.
### Exercise 9
Check the documentation of the `filter!` function. Perform the same operation
as asked in exercise 8 but using `filter!`. What is the difference?
### Exercise 10
Write a function that takes a number `n`. Next it generates two independent
random vectors of length `n` and returns their correlation coefficient.
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
and `10000`.
Create a plot with four histograms of distribution of computed Pearson
correlation coefficient. Check in the Plots.jl package which function can be
used to plot histograms.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Write:
```
julia> using Statistics
julia> mat = [1 3 5
2 4 6]
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> sum(mat, dims=1)
1×3 Matrix{Int64}:
3 7 11
julia> sum(mat, dims=2)
2×1 Matrix{Int64}:
9
12
julia> mean(mat, dims=1)
1×3 Matrix{Float64}:
1.5 3.5 5.5
julia> mean(mat, dims=2)
2×1 Matrix{Float64}:
3.0
4.0
julia> std(mat, dims=1)
1×3 Matrix{Float64}:
0.707107 0.707107 0.707107
julia> std(mat, dims=2)
2×1 Matrix{Float64}:
2.0
2.0
```
Observe that the returned statistics are also stored in matrices.
If we compute them for columns (`dims=1`) then the produced matrix has one row.
If we compute them for rows (`dims=2`) then the produced matrix has one column.
### Exercise 2
Here are some ways you can do it:
```
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
3-element Vector{Int64}:
1
1
1
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
3-element Vector{Int64}:
1
1
1
```
Observe that if we used `eachcol` the produced result is a vector (not a matrix
like in exercise 1).
### Exercise 3
<summary>Solution</summary>
First create a matrix with source data by copy pasting it from the exercise
like this:
@ -285,8 +236,16 @@ julia> [ones(50) data[:, 1]] \ data[:, 2]
3.9324087591240877
```
</details>
### Exercise 4
Plot the data loaded in exercise 3. Additionally plot the fitted regression
(you need to check Plots.jl documentation to find a way to do this).
<details>
<summary>Solution</summary>
Run the following:
```
using Plots
@ -296,8 +255,23 @@ scatter(data[:, 1], data[:, 2];
The `smooth=true` keyword argument adds the linear regression line to the plot.
</details>
### Exercise 5
A simple code for calculation of Fibonacci numbers for positive
arguments is as follows:
```
fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
```
Using the BenchmarkTools.jl package measure runtime of this function for
`n` ranging from `1` to `20`.
<details>
<summary>Solution</summary>
Use the following code:
```
julia> using BenchmarkTools
@ -331,8 +305,17 @@ julia> for i in 1:40
Notice that the execution time for number `n` is roughly the sum of execution
times for numbers `n-1` and `n-2`.
</details>
### Exercise 6
Improve the speed of code from exercise 5 by using a dictionary where you
store a mapping of `n` to `fib(n)`. Measure the performance of this function
for the same range of values as in exercise 5.
<details>
<summary>Solution</summary>
Use the following code:
```
@ -422,8 +405,17 @@ julia> @time fib2(200)
As you can see the code makes fewer allocations and is faster now.
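The caching idea can be sketched as follows (`fib2` and the cache layout here are one possible shape; the book's exact implementation may differ):

```julia
# cache mapping n to fib(n); get! computes and stores each value only once
const FIB_CACHE = Dict{Int, Int}()

function fib2(n)
    n < 3 && return 1
    return get!(FIB_CACHE, n) do
        fib2(n - 1) + fib2(n - 2)
    end
end

fib2(40)  # 102334155, computed without exponential recursion
```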
</details>
### Exercise 7
Create a vector containing named tuples representing elements of a 4x4 grid.
So the first element of this vector should be `(x=1, y=1)` and last should be
`(x=4, y=4)`. Store the vector in variable `v`.
<details>
<summary>Solution</summary>
Since we are asked to create a vector we can write:
```
@ -470,8 +462,17 @@ julia> [(; x, y) for x in 1:4, y in 1:4]
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
```
</details>
### Exercise 8
The `filter` function allows you to select some values of an input collection.
Check its documentation first. Next, use it to keep only those elements of the
vector `v` from exercise 7 whose coordinates sum to an even number.
<details>
<summary>Solution</summary>
To get help on the `filter` function write `?filter`. Next run:
```
@ -487,8 +488,16 @@ julia> filter(e -> iseven(e.x + e.y), v)
(x = 4, y = 4)
```
</details>
### Exercise 9
Check the documentation of the `filter!` function. Perform the same operation
as asked in exercise 8 but using `filter!`. What is the difference?
<details>
<summary>Solution</summary>
To get help on the `filter!` function write `?filter!`. Next run:
```
@ -518,8 +527,21 @@ julia> v
Notice that `filter` allocated a new vector, while `filter!` updated the `v`
vector in place.
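The difference can be sketched on a plain integer vector (made-up data for illustration):

```julia
v = collect(1:6)
w = filter(iseven, v)   # allocates a new vector; v is left untouched
# w is [2, 4, 6] while v still has all 6 elements

filter!(iseven, v)      # mutates v in place (and returns it)
# now v itself is [2, 4, 6]
```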
</details>
### Exercise 10
Write a function that takes a number `n`. Next it generates two independent
random vectors of length `n` and returns their correlation coefficient.
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
and `10000`.
Create a plot with four histograms of distribution of computed Pearson
correlation coefficient. Check in the Plots.jl package which function can be
used to plot histograms.
<details>
<summary>Solution</summary>
You can use for example the following code:
```

View File

@ -10,93 +10,8 @@
Create a matrix containing the truth table for the `&&` and `||` operations.
### Exercise 2
The `issubset` function checks if one collection is a subset of another
collection.
Now take the range `4:6` and check if it is a subset of the ranges `4-k:4+k` for
`k` varying from `1` to `3`. Store the result in a vector.
### Exercise 3
Write a function that accepts two vectors and returns `true` if they have equal
length and otherwise returns `false`.
### Exercise 4
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
Calculate their Adjusted Mutual Information using scikit-learn.
### Exercise 5
Using the Adjusted Mutual Information function from exercise 4, generate
a pair of random vectors of length 100 containing integers from the
range `1:5`. Repeat this experiment 1000 times and plot a histogram of the AMI.
Check in the documentation of the `rand` function how you can draw a sample
from a collection of values.
### Exercise 6
Adjust the code from exercise 5 but replace the first 50 elements of each vector
with zero. Repeat the experiment.
### Exercise 7
Write a function that takes a vector of integer values and returns a dictionary
telling how many times each integer is present in the passed vector.
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
and `v3 = 3:5`.
### Exercise 8
Write code that creates a `Bool` diagonal matrix of size 5x5.
### Exercise 9
Write code comparing the performance of calculating the sum of logarithms of
the elements of the vector `1:100` using broadcasting with the `sum` function vs.
the `sum` function taking a function as its first argument.
### Exercise 10
Create a dictionary in which for each number from `1` to `10` you will store
a vector of its positive divisors. You can check the remainder of division
of two values using the `rem` function.
Additionally (not covered in the book), you can drop elements
from a comprehension if you add an `if` clause after the `for` clause, for
example to keep only odd numbers from range `1:10` do:
```
julia> [i for i in 1:10 if isodd(i)]
5-element Vector{Int64}:
1
3
5
7
9
```
You can populate a dictionary by passing a vector of pairs to it (not covered in
the book), for example:
```
julia> Dict(["a" => 1, "b" => 2])
Dict{String, Int64} with 2 entries:
"b" => 2
"a" => 1
```
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
<summary>Solution</summary>
You can do it as follows:
```
@ -113,8 +28,19 @@ julia> [true, false] .|| [true false]
Note that the first array is a vector, while the second array is a 1-row matrix.
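Broadcasting a length-2 vector against a 1-row matrix expands along both dimensions, which is exactly what builds the 2x2 table (a minimal sketch; broadcasted `&&` requires Julia 1.7 or newer):

```julia
tab = [true, false] .&& [true false]
# tab[i, j] pairs element i of the vector with element j of the 1-row matrix,
# producing a 2×2 Matrix{Bool}
```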
</details>
### Exercise 2
The `issubset` function checks if one collection is a subset of another
collection.
Now take the range `4:6` and check if it is a subset of the ranges `4-k:4+k` for
`k` varying from `1` to `3`. Store the result in a vector.
<details>
<summary>Solution</summary>
You can do it like this using broadcasting:
```
julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
@ -125,16 +51,33 @@ julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
```
Note that you need to use `Ref` to protect `4:6` from being broadcasted over.
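A smaller example of the same pattern (illustrative values): wrapping an argument in `Ref` makes broadcasting treat it as a single object instead of iterating over its elements.

```julia
# check one fixed range against several candidate ranges at once:
issubset.(Ref(1:2), [1:1, 1:2, 1:3])  # [false, true, true]
```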
</details>
### Exercise 3
Write a function that accepts two vectors and returns `true` if they have equal
length and otherwise returns `false`.
<details>
<summary>Solution</summary>
This function can be written as follows:
```
equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
```
</details>
### Exercise 4
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
Calculate their Adjusted Mutual Information using scikit-learn.
<details>
<summary>Solution</summary>
You can do this exercise as follows:
```
julia> using PyCall
@ -151,8 +94,19 @@ julia> metrics.adjusted_mutual_info_score(y, z)
-0.21267989848846763
```
</details>
### Exercise 5
Using the Adjusted Mutual Information function from exercise 4, generate
a pair of random vectors of length 100 containing integers from the
range `1:5`. Repeat this experiment 1000 times and plot a histogram of the AMI.
Check in the documentation of the `rand` function how you can draw a sample
from a collection of values.
<details>
<summary>Solution</summary>
You can create such a plot using the following commands:
```
@ -163,8 +117,16 @@ histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100))
You can check that AMI oscillates around 0.
</details>
### Exercise 6
Adjust the code from exercise 5 but replace the first 50 elements of each vector
with zero. Repeat the experiment.
<details>
<summary>Solution</summary>
This time it is convenient to write a helper function. Note that we use
broadcasting to update values in the vectors.
@ -182,8 +144,19 @@ histogram([exampleAMI() for i in 1:1000], label="AMI")
Note that this time AMI is a bit below 0.5, which shows a better match between
vectors.
</details>
### Exercise 7
Write a function that takes a vector of integer values and returns a dictionary
telling how many times each integer is present in the passed vector.
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
and `v3 = 3:5`.
<details>
<summary>Solution</summary>
```
julia> function counter(v::AbstractVector{<:Integer})
d = Dict{eltype(v), Int}()
@ -219,8 +192,15 @@ Dict{Int64, Int64} with 3 entries:
Note that we used the `eltype` function to set a proper key type for
dictionary `d`.
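The effect of `eltype` on the key type can be checked directly (a quick illustration):

```julia
# the dictionary key type follows the element type of the input vector:
v2 = [true, false]
d2 = Dict{eltype(v2), Int}()  # Dict{Bool, Int64}

v3 = 3:5
d3 = Dict{eltype(v3), Int}()  # Dict{Int64, Int64} on 64-bit systems
```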
</details>
### Exercise 8
Write code that creates a `Bool` diagonal matrix of size 5x5.
<details>
<summary>Solution</summary>
This is a way to do it:
```
julia> 1:5 .== (1:5)'
@ -246,8 +226,17 @@ julia> I(5)
⋅ ⋅ ⋅ ⋅ 1
```
</details>
### Exercise 9
Write code comparing the performance of calculating the sum of logarithms of
the elements of the vector `1:100` using broadcasting with the `sum` function vs.
the `sum` function taking a function as its first argument.
<details>
<summary>Solution</summary>
Here is how you can do it:
```
@ -265,8 +254,41 @@ julia> @btime sum(log, 1:100)
As you can see using the `sum` function with `log` as its first argument
is a bit faster as it is not allocating.
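Both approaches give the same value up to floating-point rounding; only the allocation behavior differs:

```julia
a = sum(log.(1:100))  # materializes a temporary 100-element vector
b = sum(log, 1:100)   # applies log element by element, no temporary array
a ≈ b                 # true
```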
</details>
### Exercise 10
Create a dictionary in which for each number from `1` to `10` you will store
a vector of its positive divisors. You can check the remainder of division
of two values using the `rem` function.
Additionally (not covered in the book), you can drop elements
from a comprehension if you add an `if` clause after the `for` clause, for
example to keep only odd numbers from range `1:10` do:
```
julia> [i for i in 1:10 if isodd(i)]
5-element Vector{Int64}:
1
3
5
7
9
```
You can populate a dictionary by passing a vector of pairs to it (not covered in
the book), for example:
```
julia> Dict(["a" => 1, "b" => 2])
Dict{String, Int64} with 2 entries:
"b" => 2
"a" => 1
```
<details>
<summary>Solution</summary>
Here is how you can do it:
```

View File

@ -11,16 +11,47 @@
Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"`
(replace `3` by a proper interpolation expression) and replace `USD` by `$`.
<details>
<summary>Solution</summary>
```
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
```
</details>
### Exercise 2
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
as `iris.csv` to your local folder.
<details>
<summary>Solution</summary>
```
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
"iris.csv")
```
</details>
### Exercise 3
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
in two lines so that it takes less horizontal space.
<details>
<summary>Solution</summary>
```
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
```
</details>
### Exercise 4
Load data stored in `iris.csv` file into a `data` vector where each element
@ -28,73 +59,9 @@ should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
the source line had data `1.0,2.0,3.0,4.0,x` (note that the first four elements
are parsed as floats).
### Exercise 5
The `data` structure is a vector of named tuples, change it to a named tuple
of vectors (with the same field names) and call it `data2`.
### Exercise 6
Calculate the frequency of each Iris type (`c` field in `data2`).
### Exercise 7
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
Compare sizes of the three objects.
### Exercise 8
You know that `refs` field of `PooledArray` stores an integer index of a given
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
in `data2`, but for each Iris type give a different point color (check the
`color` keyword argument meaning in the Plots.jl manual; you can use the
`plot_color` function).
### Exercise 9
Type the following string `"a²=b² ⟺ a=b a=-b"` in your terminal and bind it to
`str` variable (do not copy paste the string, but type it).
### Exercise 10
In the `str` string from exercise 9 find all matches of a pattern where `a`
is followed by `b` but there can be some characters between them.
# Solutions
<details>
<summary>Solution</summary>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
```
### Exercise 2
Solution:
```
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
"iris.csv")
```
### Exercise 3
Solution:
```
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
```
### Exercise 4
Solution:
```
julia> function line_parser(line)
elements = split(line, ",")
@ -125,8 +92,16 @@ Note that we used the `1:end-1` selector to drop the last element from the read lines
since it is empty. This is the reason why adding the
`@assert length(elements) == 5` check in the `line_parser` function is useful.
</details>
### Exercise 5
The `data` structure is a vector of named tuples, change it to a named tuple
of vectors (with the same field names) and call it `data2`.
<details>
<summary>Solution</summary>
Later in the book you will learn more advanced ways to do it. Here let us
use the most basic approach:
@ -138,9 +113,15 @@ data2 = (sl=[d.sl for d in data],
c=[d.c for d in data])
```
</details>
### Exercise 6
Solution:
Calculate the frequency of each Iris type (`c` field in `data2`).
<details>
<summary>Solution</summary>
```
julia> using FreqTables
@ -153,9 +134,17 @@ Dim1 │
"Iris-virginica" │ 50
```
</details>
### Exercise 7
Solution:
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
Compare sizes of the three objects.
<details>
<summary>Solution</summary>
```
julia> using InlineStrings
@ -213,16 +202,34 @@ julia> Base.summarysize(c4)
1240
```
</details>
### Exercise 8
Solution:
You know that `refs` field of `PooledArray` stores an integer index of a given
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
in `data2`, but for each Iris type give a different point color (check the
`color` keyword argument meaning in the Plots.jl manual; you can use the
`plot_color` function).
<details>
<summary>Solution</summary>
```
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
```
</details>
### Exercise 9
Type the following string `"a²=b² ⟺ a=b a=-b"` in your terminal and bind it to
`str` variable (do not copy paste the string, but type it).
<details>
<summary>Solution</summary>
The hard part is typing `²`, `⟺` and ``. You can check how to do it using help:
```
help?> ²
@ -237,8 +244,16 @@ help?>
Save the string in the `str` variable as we will use it in the next exercise.
</details>
### Exercise 10
In the `str` string from exercise 9 find all matches of a pattern where `a`
is followed by `b` but there can be some characters between them.
<details>
<summary>Solution</summary>
The exercise does not specify how the matching should be done. If we
want it to be eager (match as much as possible), we write:
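For example, on a simplified ASCII stand-in for `str` (the idea carries over to the original string), a greedy pattern grabs the longest span from the first `a` to the last `b`, while adding `?` makes the quantifier lazy:

```julia
s = "a=b a=-b"  # simplified stand-in for the string from exercise 9
match(r"a.*b", s).match   # greedy match: "a=b a=-b"
match(r"a.*?b", s).match  # lazy match: "a=b"
```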

View File

@ -19,75 +19,10 @@ If you want to understand all the parameters please check their meaning
For us it is enough that this request generates 10 random integers in the range
from 1 to 6. Run this query in Julia and parse the result.
### Exercise 2
Write a function that tries to parse a string as an integer.
If it succeeds it should return the integer; otherwise it should return `0`
and print an error message.
### Exercise 3
Create a matrix containing the truth table for the `&&` operation including `missing`.
If some operation errors, store `"error"` in the table. As an extra feature (this
is harder, so you can skip it) in each cell store both inputs and the output to make
reading the table easier.
### Exercise 4
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
missing values in it by the mean of the non-missing values.
### Exercise 5
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
strings stored in it as `Float64`, while keeping `missing` values unchanged.
### Exercise 6
Print to the terminal all days in January 2023 that are Mondays.
### Exercise 7
Compute the dates that are one month later than January 15, 2023, February 15,
2023, March 15, 2023, and April 15, 2023. How many days pass during each such
month? Print the results to the screen.
### Exercise 8
Parse the following string as JSON:
```
str = """
[{"x":1,"y":1},
{"x":2,"y":4},
{"x":3,"y":9},
{"x":4,"y":16},
{"x":5,"y":25}]
"""
```
into a `json` variable.
### Exercise 9
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
that correspond to the fields stored in the JSON structure.
Plot `y` as a function of `x`.
### Exercise 10
Given the vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`,
use linear interpolation to fill the missing values. For the extreme values
use the nearest available observation (you will need to consult the Impute.jl
documentation to find all required functions).
# Solutions
<details>
<summary>Solution</summary>
<summary>Show!</summary>
### Exercise 1
Solution (example run):
Example run:
```
julia> using HTTP
@ -109,8 +44,17 @@ julia> parse.(Int, split(String(response.body)))
6
```
</details>
### Exercise 2
Write a function that tries to parse a string as an integer.
If it succeeds it should return the integer; otherwise it should return `0`
and print an error message.
<details>
<summary>Solution</summary>
Example function:
```
@ -160,9 +104,17 @@ end
```
But this time we do not see the cause of the error.
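An alternative that avoids `try`/`catch` altogether is `tryparse`, which returns `nothing` when parsing fails (a sketch; the function name `parse_or_zero` is made up):

```julia
function parse_or_zero(str::AbstractString)
    x = tryparse(Int, str)
    if isnothing(x)
        println("could not parse \"$str\" as an integer")
        return 0
    end
    return x
end

parse_or_zero("42")    # 42
parse_or_zero("oops")  # prints a message and returns 0
```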
</details>
### Exercise 3
Solution:
Create a matrix containing the truth table for the `&&` operation including `missing`.
If some operation errors, store `"error"` in the table. As an extra feature (this
is harder, so you can skip it) in each cell store both inputs and the output to make
reading the table easier.
<details>
<summary>Solution</summary>
```
julia> function apply_and(x, y)
@ -181,9 +133,15 @@ julia> apply_and.([true, false, missing], [true false missing])
"missing && true = error" "missing && false = error" "missing && missing = error"
```
</details>
### Exercise 4
Solution:
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
missing values in it by the mean of the non-missing values.
<details>
<summary>Solution</summary>
```
julia> using Statistics
@ -198,9 +156,15 @@ julia> coalesce.(v, mean(skipmissing(v)))
3.5
```
</details>
### Exercise 5
Solution:
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
strings stored in it as `Float64`, while keeping `missing` values unchanged.
<details>
<summary>Solution</summary>
```
julia> using Missings
@ -215,9 +179,16 @@ julia> passmissing(parse).(Float64, s)
missing
```
</details>
### Exercise 6
Example solution:
Print to the terminal all days in January 2023 that are Mondays.
<details>
<summary>Solution</summary>
Example:
```
julia> using Dates
@ -232,9 +203,18 @@ julia> for day in Date.(2023, 01, 1:31)
2023-01-30
```
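An alternative formulation (illustrative) builds a range of `Date`s and keeps only the Mondays with `filter` and `dayofweek`:

```julia
using Dates

# all days of January 2023, filtered down to the Mondays
mondays = filter(d -> dayofweek(d) == Dates.Monday,
                 Date(2023, 1, 1):Day(1):Date(2023, 1, 31))
# five dates: 2023-01-02, 2023-01-09, 2023-01-16, 2023-01-23, 2023-01-30
```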
</details>
### Exercise 7
Example solution:
Compute the dates that are one month later than January 15, 2023, February 15,
2023, March 15, 2023, and April 15, 2023. How many days pass during each such
month? Print the results to the screen.
<details>
<summary>Solution</summary>
Example:
```
julia> for day in Date.(2023, 1:4, 15)
@ -247,9 +227,24 @@ julia> for day in Date.(2023, 1:4, 15)
2023-04-15 + 1 month = 2023-05-15 (difference: 30 days)
```
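The core of the computation can be sketched for a single date (using 2023, as the code above does):

```julia
using Dates

d = Date(2023, 1, 15)
d2 = d + Month(1)            # 2023-02-15: month arithmetic keeps the day of month
ndays = Dates.value(d2 - d)  # 31, since January has 31 days
```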
</details>
### Exercise 8
Solution:
Parse the following string as JSON:
```
str = """
[{"x":1,"y":1},
{"x":2,"y":4},
{"x":3,"y":9},
{"x":4,"y":16},
{"x":5,"y":25}]
"""
```
into a `json` variable.
<details>
<summary>Solution</summary>
```
julia> using JSON3
@ -278,9 +273,16 @@ julia> json = JSON3.read(str)
}
```
</details>
### Exercise 9
Solution:
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
that correspond to the fields stored in the JSON structure.
Plot `y` as a function of `x`.
<details>
<summary>Solution</summary>
```
using Plots
@ -289,9 +291,17 @@ y = [el.y for el in json]
plot(x, y, xlabel="x", ylabel="y", legend=false)
```
</details>
### Exercise 10
Solution:
Given the vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`,
use linear interpolation to fill the missing values. For the extreme values
use the nearest available observation (you will need to consult the Impute.jl
documentation to find all required functions).
<details>
<summary>Solution</summary>
```
julia> using Impute

View File

@ -11,63 +11,8 @@
Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
called `df`.
### Exercise 2
Get number of rows, columns, column names and summary statistics of the
`df` data frame from exercise 1.
### Exercise 3
Make a plot of `number` against `square` columns of `df` data frame.
### Exercise 4
Add a column to `df` data frame with name `name string` containing string
representation of numbers in column `number`, i.e.
`["one", "two", "three", "four"]`.
### Exercise 5
Check if `df` contains column `square2`.
### Exercise 6
Extract column `number` from `df` and empty it (recall `empty!` function
discussed in chapter 4).
### Exercise 7
In the `Random` module the `randexp` function is defined; it samples numbers
from the exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in vectors `x` and `y`. Plot histograms of the maximum of each pair of sampled
values and of the sum of vector `x` and half of vector `y`.
### Exercise 8
Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them,
and maximum of pairs of sampled values and sum of vector `x` and half of vector `y`.
Compute all standard descriptive statistics of columns of this data frame.
### Exercise 9
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
Compare the size of created files using the `filesize` function.
### Exercise 10
Write the `df` data frame into SQLite database. Next find information about
tables in this database. Run a query against a table representing the `df` data
frame to calculate the mean of column `x`. Does it match the result we got in
exercise 8?
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
<details>
<summary>Solution</summary>
CSV.jl supports reading gzip-compressed files so you can just do:
@ -106,9 +51,15 @@ julia> df = CSV.read(plain, DataFrame)
4 │ 4 16
```
</details>
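For reference, the direct read can be sketched as (CSV.jl detects the gzip
compression from the file contents):
```
using CSV
using DataFrames
df = CSV.read("example8.csv.gz", DataFrame)
```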
### Exercise 2
Get number of rows, columns, column names and summary statistics of the
`df` data frame from exercise 1.
<details>
<summary>Solution</summary>
```
julia> nrow(df)
@ -131,17 +82,30 @@ julia> describe(df)
2 │ square 7.75 2 6.5 16 0 Int64
```
</details>
### Exercise 3
Make a plot of `number` against `square` columns of `df` data frame.
<details>
<summary>Solution</summary>
```
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
```
</details>
### Exercise 4
Add a column to `df` data frame with name `name string` containing string
representation of numbers in column `number`, i.e.
`["one", "two", "three", "four"]`.
<details>
<summary>Solution</summary>
```
julia> df."name string" = ["one", "two", "three", "four"]
@ -164,8 +128,15 @@ julia> df
Note that we needed to use a string because there is a space in the column name.
</details>
### Exercise 5
Check if `df` contains column `square2`.
<details>
<summary>Solution</summary>
You can use either `hasproperty` or `columnindex`:
```
@ -184,9 +155,15 @@ julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
```
</details>
### Exercise 6
Extract column `number` from `df` and empty it (recall `empty!` function
discussed in chapter 4).
<details>
<summary>Solution</summary>
```
julia> empty!(df[:, :number])
@ -198,9 +175,19 @@ as it would corrupt the `df` data frame (these operations do non-copying
extraction of a column from a data frame as opposed to `df[:, :number]`
which makes a copy).
</details>
### Exercise 7
In `Random` module the `randexp` function is defined that samples numbers
from exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in `x` and `y` vectors. Plot histograms of maximum of pairs of sampled values
and sum of vector `x` and half of vector `y`.
<details>
<summary>Solution</summary>
```
using Random
using Plots
@ -212,10 +199,19 @@ histogram!(max.(x, y), label="maximum")
I have put both histograms on the same plot to show that they overlap.
</details>
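The cropped code above can be sketched in full as:
```
using Random
using Plots
x = randexp(100_000)
y = randexp(100_000)
histogram(x + y / 2, label="x+y/2")
histogram!(max.(x, y), label="maximum")
```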
### Exercise 8
Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them,
and maximum of pairs of sampled values and sum of vector `x` and half of vector `y`.
Compute all standard descriptive statistics of columns of this data frame.
<details>
<summary>Solution</summary>
You might get slightly different results because we did not set
the seed of random number generator when creating `x` and `y` vectors:
```
julia> df = DataFrame(x=x, y=y);
@ -238,8 +234,16 @@ julia> describe(df, :all)
We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary
statistics except `first` and `last` as expected.
</details>
### Exercise 9
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
Compare the size of created files using the `filesize` function.
<details>
<summary>Solution</summary>
```
julia> using Arrow
@ -258,8 +262,18 @@ julia> filesize("df.arrow")
In this case Apache Arrow file is smaller.
</details>
### Exercise 10
Write the `df` data frame into SQLite database. Next find information about
tables in this database. Run a query against a table representing the `df` data
frame to calculate the mean of column `x`. Does it match the result we got in
exercise 8?
<details>
<summary>Solution</summary>
```
julia> using SQLite
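# A hedged sketch (the table name "df" and file name "df.db" are our
# assumptions); `SQLite.load!`, `SQLite.tables`, and `DBInterface.execute`
# come from SQLite.jl and its DBInterface dependency:
julia> db = SQLite.DB("df.db");

julia> SQLite.load!(df, db, "df");

julia> SQLite.tables(db)

julia> DataFrame(SQLite.DBInterface.execute(db, "SELECT AVG(x) AS mean_x FROM df"))
```
</details>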
@ -22,69 +22,8 @@ Create `matein2` data frame that will have only puzzles that have `"mateIn2"`
in the `Themes` column.
Use the `contains` function (check its documentation first).
### Exercise 1
<details>
<summary>Solution</summary>
```
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
@ -104,9 +43,17 @@ julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
1 column and 274127 rows omitted
```
</details>
### Exercise 2
What is the fraction of puzzles that are mate in 2 in relation to all puzzles
in the `puzzles` data frame?
<details>
<summary>Solution</summary>
Two ways to do it:
```
julia> using Statistics
@ -118,9 +65,15 @@ julia> mean(contains.(puzzles.Themes, "mateIn2"))
0.12852152542746353
```
</details>
### Exercise 3
Create `small` data frame that holds first 10 rows of `matein2` data frame
and columns `Rating`, `RatingDeviation`, and `NbPlays`.
<details>
<summary>Solution</summary>
```
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
@ -140,9 +93,15 @@ julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
10 │ 979 144 14
```
</details>
### Exercise 4
Iterate rows of `small` data frame and print the ratio of
`RatingDeviation` and `NbPlays` for each row.
<details>
<summary>Solution</summary>
```
julia> for row in eachrow(small)
@ -160,9 +119,16 @@ julia> for row in eachrow(small)
10.285714285714286
```
</details>
### Exercise 5
Get names of columns from the `matein2` data frame that end with `n` (ignore case).
<details>
<summary>Solution</summary>
Several options:
```
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
2-element Vector{String}:
@ -180,9 +146,20 @@ julia> names(matein2, r"[nN]$")
"RatingDeviation"
```
</details>
### Exercise 6
Write a function `collatz` that runs the following process. Start with a
positive number `n`. If it is even divide it by two. If it is odd multiply
it by 3 and add one. The function should return the number of steps needed to
reach 1.
Create a `d` dictionary that maps number of steps needed to a list of numbers from
the range `1:100` that required this number of steps.
<details>
<summary>Solution</summary>
```
julia> function collatz(n)
@ -232,9 +209,15 @@ Dict{Int64, Vector{Int64}} with 45 entries:
As we can see even for small `n` the number of steps required to reach `1`
can get quite large.
</details>
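For reference, the cropped definition above can be sketched in full as
(a minimal implementation consistent with the exercise statement):
```
function collatz(n)
    steps = 0
    while n != 1
        n = iseven(n) ? div(n, 2) : 3n + 1
        steps += 1
    end
    return steps
end

# map each step count to the numbers from 1:100 that require it
d = Dict{Int,Vector{Int}}()
for i in 1:100
    push!(get!(d, collatz(i), Int[]), i)
end
```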
### Exercise 7
Using the `d` dictionary make a scatter plot of number of steps required
vs average value of numbers that require this number of steps.
<details>
<summary>Solution</summary>
```
using Plots
@ -247,9 +230,17 @@ scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=fal
Note that we needed to use `collect` on `keys`, as `scatter` expects an array,
not just an iterator.
</details>
### Exercise 8
Repeat the process from exercises 6 and 7, but this time use a data frame
and try to write an appropriate expression using the `combine` and `groupby`
functions (as it was explained in the last part of chapter 9). This time
perform computations for numbers ranging from one to one million.
<details>
<summary>Solution</summary>
```
df = DataFrame(n=1:10^6);
@ -258,6 +249,8 @@ agg = combine(groupby(df, :collatz), :n => mean);
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
```
</details>
### Exercise 9
Set seed of random number generator to `1234`. Draw 100 random points
@ -267,7 +260,8 @@ Add random noise to column `y` that has normal distribution with mean 0 and
standard deviation 0.25. Call this column `z`.
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
<details>
<summary>Solution</summary>
```
using Random
@ -278,9 +272,14 @@ df.z = df.y + randn(100) / 4
scatter(df.x, [df.y df.z], labels=["y" "z"])
```
</details>
### Exercise 10
Add a LOESS regression line of `x` explaining `z` to the figure produced in exercise 9.
<details>
<summary>Solution</summary>
```
using Loess
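# A hedged sketch assuming Loess.jl's `loess` and `predict`; the scatter
# plot from exercise 9 is assumed to be the current figure:
model = loess(df.x, df.z)
xs = sort(df.x)
plot!(xs, predict(model, xs))
```
</details>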
@ -13,89 +13,8 @@ independently and uniformly from the [0,1[ interval.
Create a data frame using data from this matrix using auto-generated
column names.
### Exercise 1
<details>
<summary>Solution</summary>
```
julia> using DataFrames
@ -120,9 +39,16 @@ julia> DataFrame(mat, :auto)
5 │ 0.714515 0.861872 0.971521 0.176768
```
</details>
### Exercise 2
Now, using matrix `mat` create a data frame with randomly generated
column names. Use the `randstring` function from the `Random` module
to generate them. Store this data frame in `df` variable.
<details>
<summary>Solution</summary>
```
julia> using Random
@ -139,10 +65,16 @@ julia> df = DataFrame(mat, [randstring() for _ in 1:4])
5 │ 0.714515 0.861872 0.971521 0.176768
```
</details>
### Exercise 3
Create a new data frame, taking `df` as a source that will have the same
columns but its column names will be `y1`, `y2`, `y3`, `y4`.
<details>
<summary>Solution</summary>
```
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
@ -170,9 +102,15 @@ julia> rename(df, string.("y", 1:4))
5 │ 0.714515 0.861872 0.971521 0.176768
```
</details>
### Exercise 4
Create a dictionary holding `column_name => column_vector` pairs
using data stored in data frame `df`. Save this dictionary in variable `d`.
<details>
<summary>Solution</summary>
```
julia> d = Dict([n => df[:, n] for n in names(df)])
@ -194,9 +132,15 @@ Dict{Symbol, AbstractVector} with 4 entries:
Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
```
</details>
### Exercise 5
Create a data frame back from dictionary `d` from exercise 4. Compare it
with `df`.
<details>
<summary>Solution</summary>
```
julia> DataFrame(d)
@ -215,9 +159,15 @@ Note that columns of a data frame are now sorted by their names.
This is done for `Dict` objects because such dictionaries do not have
a defined order of keys.
</details>
### Exercise 6
For data frame `df` compute the dot product between all pairs of its columns.
Use the `dot` function from the `LinearAlgebra` module.
<details>
<summary>Solution</summary>
```
julia> using LinearAlgebra
@ -232,9 +182,36 @@ julia> pairwise(dot, eachcol(df))
1.50558 1.18411 0.909744 1.47431
```
</details>
### Exercise 7
Given two data frames:
```
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 101
2 │ 2 102
```
vertically concatenate them so that only columns that are present in both
data frames are kept. Check the documentation of `vcat` to see how to
do it.
<details>
<summary>Solution</summary>
```
julia> vcat(df1, df2, cols=:intersect)
@ -255,9 +232,16 @@ julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
```
</details>
### Exercise 8
Now append to `df1` table `df2`, but add only the columns from `df2` that
are present in `df1`. Check the documentation of `append!` to see how to
do it.
<details>
<summary>Solution</summary>
```
julia> append!(df1, df2, cols=:subset)
@ -271,9 +255,20 @@ julia> append!(df1, df2, cols=:subset)
4 │ 2 missing
```
</details>
### Exercise 9
Create a `circle` data frame, using the `push!` function that will store
1000 samples of the following process:
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
and is `false` otherwise.
Compute summary statistics of this data frame.
<details>
<summary>Solution</summary>
```
circle=DataFrame()
@ -287,9 +282,16 @@ describe(circle)
We note that the mean of variable `inside` is approximately π/4
(the unit circle covers π/4 of the area of the [-1,1[ × [-1,1[ square).
</details>
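A minimal sketch of the sampling loop described above (names as in the
exercise):
```
using DataFrames
circle = DataFrame()
for _ in 1:1000
    x, y = 2 .* rand(2) .- 1               # uniform on [-1, 1[
    push!(circle, (x=x, y=y, inside=x^2 + y^2 < 1))
end
describe(circle)
```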
### Exercise 10
Create a scatterplot of `circle` data frame where its `x` and `y` axis
will be the plotted points and `inside` variable will determine the color
of the plotted point.
<details>
<summary>Solution</summary>
```
using Plots
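# A hedged sketch: `group` colors the points by the `inside` indicator
scatter(circle.x, circle.y;
        group=circle.inside, markersize=2,
        aspect_ratio=:equal, xlabel="x", ylabel="y")
```
</details>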
@ -13,83 +13,8 @@ sampled from uniform distribution on [0, 1[ interval.
Serialize it to disk, and next deserialize. Check if the deserialized
object is the same as the source data frame.
### Exercise 1
<details>
<summary>Solution</summary>
```
julia> using DataFrames
@ -104,9 +29,16 @@ julia> deserialize("df.bin") == df
true
```
</details>
### Exercise 2
Add a column `n` to the `df` data frame that in each row will hold the
number of observations in column `x` that have distance less than `0.1` to
a value stored in a given row of `x`.
<details>
<summary>Solution</summary>
A simple approach is:
```
@ -151,9 +83,14 @@ df.n = f2(df.x)
In this solution using a function barrier is even more relevant,
as we explicitly use loops inside.
</details>
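For reference, one possible efficient sketch (not necessarily the approach
cropped above) avoids the quadratic scan by sorting first; points at exactly
0.1 distance are negligible for continuous data:
```
s = sort(df.x)
df.n = [searchsortedlast(s, xi + 0.1) - searchsortedfirst(s, xi - 0.1) + 1
        for xi in df.x]
```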
### Exercise 3
Investigate visually how `n` depends on `x` in the data frame `df`.
<details>
<summary>Solution</summary>
```
using Plots
@ -162,9 +99,31 @@ scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
As expected, at the border of the domain the number of neighbors drops.
</details>
### Exercise 4
Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```
Load this data into `testdf` data frame.
<details>
<summary>Solution</summary>
```
julia> using CSV
@ -188,8 +147,18 @@ julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
10 │ 0.393878 0.383772
```
</details>
### Exercise 5
Check the accuracy of the computed sine values of `x` in `testdf`.
Print all rows for which the absolute difference is greater than `5e-7`.
In this case display `x`, `sinx`, the exact value of `sin(x)` and the absolute
difference.
<details>
<summary>Solution</summary>
Since data frame is small we can use `eachrow`:
```
@ -202,9 +171,18 @@ julia> for row in eachrow(testdf)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```
</details>
### Exercise 6
Group data in data frame `df` into buckets of 0.1 width and store the result in
`gdf` data frame (sort the groups). Use the `cut` function from
CategoricalArrays.jl to do it (check its documentation to learn how to do it).
Check the number of values in each group.
<details>
<summary>Solution</summary>
```
julia> using CategoricalArrays
@ -244,9 +222,15 @@ julia> combine(gdf, nrow) # alternative way to do it
You might get slightly different numbers, but all should be around 10,000.
</details>
### Exercise 7
Display the grouping keys in `gdf` grouped data frame. Show them as named tuples.
Check what would be the group order if you asked not to sort them.
<details>
<summary>Solution</summary>
```
julia> NamedTuple.(keys(gdf))
@ -282,9 +266,14 @@ the resulting group order could depend on the type of grouping column, so if
you want to rely on the order of groups, always pass the `sort` keyword argument
explicitly.
</details>
### Exercise 8
Compute average `n` for each group in `gdf`.
<details>
<summary>Solution</summary>
```
julia> using Statistics
@ -319,9 +308,16 @@ julia> combine(gdf, :n => mean) # alternative way to do it
10 │ [0.9, 1.0) 14944.5
```
</details>
### Exercise 9
Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as named tuple having fields `α₀` and `αₓ`.
<details>
<summary>Solution</summary>
```
julia> function fitmodel(x, n)
@ -364,9 +360,18 @@ julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that y
We note that indeed in the first and last group the regression has a significant
slope.
</details>
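A minimal sketch of such a `fitmodel` function, consistent with the
`AsTable` syntax shown above:
```
function fitmodel(x, n)
    X = [ones(length(x)) x]            # design matrix with intercept
    α₀, αₓ = X \ n                     # least-squares fit via `\`
    return (α₀=α₀, αₓ=αₓ)
end
```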
### Exercise 10
Repeat exercise 9 but using the GLM.jl package. This time
extract the p-value for the slope of estimated coefficient for `x` variable.
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).
<details>
<summary>Solution</summary>
```
julia> using GLM
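# A hedged sketch (the name `xpvalue` is ours): fit `lm` per group and take
# the p-value for `x` from `coeftable` converted to a `DataFrame` first
julia> function xpvalue(sdf)
           ct = DataFrame(coeftable(lm(@formula(n ~ x), sdf)))
           return ct[2, "Pr(>|t|)"]
       end;

julia> combine(gdf, xpvalue)
```
</details>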
@ -14,86 +14,8 @@ is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`.
Download this file and check if it indeed has this checksum.
You might need to read documentation of `string` and `join` functions.
### Exercise 1
<details>
<summary>Solution</summary>
```
using Downloads
@ -106,9 +28,20 @@ sha == shastr
The last line should produce `true`.
</details>
### Exercise 2
Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
that contains the ego-nets of Eastern European users collected from the music
streaming service Deezer in February 2020. Nodes are users and edges are mutual
follower relationships.
From the file extract deezer_edges.json and deezer_target.csv files and
save them to disk.
<details>
<summary>Solution</summary>
```
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
@ -125,9 +58,16 @@ end
close(archive)
```
</details>
### Exercise 3
Load deezer_edges.json and deezer_target.csv files to Julia.
The JSON file should be loaded as JSON3.jl object `edges_json`.
The CSV file should be loaded into a data frame `target_df`.
<details>
<summary>Solution</summary>
```
using CSV
@ -137,17 +77,32 @@ edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)
```
</details>
### Exercise 4
Check that keys in the `edges_json` are in the same order as `id` column
in `target_df`.
<details>
<summary>Solution</summary>
This is short, but you need a good understanding of Julia types
and standard functions to write it properly:
```
Symbol.(target_df.id) == keys(edges_json)
```
</details>
### Exercise 5
From every value stored in `edges_json` create a graph representing
ego-net of the given node. Store these graphs in a vector that will make the
`egonet` column of the `target_df` data frame.
<details>
<summary>Solution</summary>
```
using Graphs
@ -163,9 +118,19 @@ end
target_df.egonet = edgelist2graph(values(edges_json))
```
</details>
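A hedged sketch of `edgelist2graph` (assuming each value in `edges_json`
is a list of 0-based `[a, b]` edge pairs):
```
using Graphs
function edgelist2graph(edgelists)
    return map(edgelists) do edges
        n = maximum(maximum.(edges)) + 1   # nodes are numbered from 0
        g = SimpleGraph(n)
        for (a, b) in edges
            add_edge!(g, a + 1, b + 1)     # shift to 1-based vertices
        end
        g
    end
end
```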
### Exercise 6
Ego-net in our data set is a subgraph of a full Deezer graph where for some
node all its neighbors are included, but also it contains all edges between the
neighbors.
Therefore we expect that diameter of every ego-net is at most 2 (as every
two nodes are either connected directly or by a common friend).
Check if this is indeed the case. Use the `diameter` function.
<details>
<summary>Solution</summary>
```
julia> extrema(diameter.(target_df.egonet))
@ -174,9 +139,21 @@ julia> extrema(diameter.(target_df.egonet))
Indeed we see that the diameter of each ego-net is 2.
</details>
### Exercise 7
For each ego-net find a central node that is connected to every other node
in this network. Use the `degree` and `findall` functions to achieve this.
Add `center` column with numbers of nodes that are connected to all other
nodes in the ego-net to `target_df` data frame.
Next add a column `center_len` that gives the number of such nodes.
Check how many times different numbers of center nodes are found.
<details>
<summary>Solution</summary>
```
target_df.center = map(target_df.egonet) do g
@ -192,9 +169,18 @@ the condition we want to check.
We notice that in some cases it is impossible to identify the center of the
ego-net uniquely.
</details>
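One way to sketch this computation (a center node has degree `nv(g) - 1`):
```
using Graphs
target_df.center = [findall(==(nv(g) - 1), degree(g)) for g in target_df.egonet]
target_df.center_len = length.(target_df.center)
combine(groupby(target_df, :center_len), nrow)
```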
### Exercise 8
Add the following ego-net features to the `target_df` data frame:
* `size`: number of nodes in ego-net
* `mean_degree`: average node degree in ego-net
Check mean values of these two columns by `target` column.
<details>
<summary>Solution</summary>
```
using Statistics
@ -206,9 +192,15 @@ combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
It seems that for target equal to `0` size and average degree in the network are
a bit larger.
</details>
### Exercise 9
Continuing to work with `target_df` data frame create a logistic regression
explaining `target` by `size` and `mean_degree`.
<details>
<summary>Solution</summary>
```
using GLM
@ -217,9 +209,21 @@ glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
We see that only `size` is statistically significant.
</details>
### Exercise 10
Continuing to work with `target_df` create a scatterplot where `size` will be on
one axis and `mean_degree` rounded to nearest integer on the other axis.
Plot the mean of `target` for each point being a combination of `size` and
rounded `mean_degree`.
Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
for values in range from 5% to 95% quantile (to concentrate on typical values
of size).
<details>
<summary>Solution</summary>
```
using Plots
@ -242,6 +246,6 @@ plot(size_predict, target_predict;
xlabel="size", ylabel="predicted target", legend=false)
```
Between quantiles 5% and 95% of `size` we see a downward shaped relationship.
</details>
@ -13,12 +13,47 @@ https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.
archive and extract primary_data.csv and secondary_data.csv files from it.
Save the files to disk.
<details>
<summary>Solution</summary>
```
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
write(io, read(archive.files[idx]))
end
close(archive)
```
</details>
### Exercise 2
Load primary_data.csv into the `primary` data frame.
Load secondary_data.csv into the `secondary` data frame.
Describe the contents of both data frames.
<details>
<summary>Solution</summary>
```
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)
```
</details>
### Exercise 3
Start with `primary` data. Note that columns starting from column 4 have
@ -32,6 +67,25 @@ three columns just after `class` column in the `parsed_primary` data frame.
Check the `renamecols` keyword argument of `select` to
avoid renaming the produced columns.
<details>
<summary>Solution</summary>
```
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
idcols,
numericcols .=> ByRow(parse_numeric),
Not([idcols; numericcols]) .=> ByRow(parse_nominal);
renamecols=false)
```
</details>
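A quick illustration of what the two helper parsers defined in the solution above return for raw values from the file:

```julia
parse_nominal("[x, f]")         # vector of substrings: ["x", "f"]
parse_numeric("[10.95, 25.0]")  # [10.95, 25.0]
parse_nominal(missing)          # missing is propagated
```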
### Exercise 4
In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be
Use the following rules:
For each found pair print to the screen the row number, family, name, and class.
<details>
<summary>Solution</summary>
```
function overlap_numeric(v1, v2)
end
```
Note that in this exercise using `eachrow` is not a problem
(although it is not type stable) because the data is small.
</details>
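The body of `overlap_numeric` is truncated above. One plausible way to write it, treating each numeric trait as a closed `[min, max]` interval (a sketch, not necessarily the original code):

```julia
# Two closed intervals overlap iff each one starts no later than the other
# ends; a scalar value works too, as numbers are length-1 iterables in Julia.
overlap_numeric(v1, v2) =
    max(minimum(v1), minimum(v2)) <= min(maximum(v1), maximum(v2))

overlap_numeric([10.0, 20.0], [15.0, 30.0])  # true
overlap_numeric([10.0, 20.0], [25.0, 30.0])  # false
```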
### Exercise 5
Still using `parsed_primary`, find the average probability that `class` is
`p` by `family`. Additionally, add the number of observations in each group.
Sort these results by the probability. Try using DataFramesMeta.jl to do this
exercise (this requirement is optional).
Store the result in the `agg_primary` data frame.
<details>
<summary>Solution</summary>
```
using Statistics
agg_primary = @chain parsed_primary begin
end
```
</details>
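Since DataFramesMeta.jl is optional in this exercise, the same aggregation can be sketched with plain DataFrames.jl (assuming class `"p"` denotes poisonous):

```julia
using DataFrames, Statistics

agg_primary = sort(
    combine(groupby(parsed_primary, :family),
            :class => (c -> mean(c .== "p")) => :pr_p,  # share of class "p"
            nrow),                                      # group size
    :pr_p)
```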
### Exercise 6
Now collapse the `agg_primary` data frame so that for each unique `pr_p`
it gives the total number of rows that had this probability and a tuple
of mushroom family names.
Optionally: try to display the produced table so that the tuple containing the
list of families for each group is not cropped (this will require a wide
terminal).
<details>
<summary>Solution</summary>
```
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families); truncate=140)
```
</details>
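An alternative way to avoid cropping (a sketch): join the family names into a single string, which the default display handles better than a tuple:

```julia
combine(groupby(agg_primary, :pr_p),
        :nrow => sum => :nrow,
        :family => (f -> join(f, ", ")) => :families)
```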
### Exercise 7
From our preliminary analysis of the `primary` data we see that a `missing`
value there is non-informative, so in the `secondary` data we should be
cautious when building a model if we allowed for missing data (in practice,
if we were investigating some real mushroom we would most likely know its
characteristics).
Therefore, as a first step, drop in place all columns in the `secondary` data
frame that have missing values.
<details>
<summary>Solution</summary>
```
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
```
Note that we select based on the actual contents of the columns and not on
their element type (a column could allow missing values but not contain any).
</details>
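A small illustration of the remark above (hypothetical data): a column may allow `Missing` in its element type while containing no missing values, so selecting by contents and selecting by type differ:

```julia
using DataFrames

df = DataFrame(a=Union{Missing,Int}[1, 2, 3], b=[1, missing, 3])
eltype(df.a)          # Union{Missing, Int64} -- the type allows missing
any(ismissing, df.a)  # false -> :a would be kept by the contents check
any(ismissing, df.b)  # true  -> :b would be dropped
```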
### Exercise 8
Create a logistic regression model predicting `class` from all remaining
features in the data frame. You might need to check `Term` usage in the
StatsModels.jl documentation.
You will notice that for the `stem-color` and `habitat` columns you get strange
estimation results (large absolute values of the estimated parameters and even
larger standard errors). Explain why this happens by analyzing frequency tables
of these variables against the `class` column.
<details>
<summary>Solution</summary>
```
using GLM
freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")
```
We can see that for certain levels of `stem-color` and `habitat` variables
there is a perfect separation of classes.
</details>
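Perfect separation can be reproduced on a toy example (hypothetical data, not from the exercise): when some level of a predictor co-occurs with only one class, the likelihood keeps improving as that level's coefficient grows, so the optimizer returns a huge estimate with an even larger standard error:

```julia
using DataFrames, GLM

toy = DataFrame(x=["a", "a", "a", "b", "b"], y=[0, 1, 1, 0, 0])
# level "b" occurs only with y == 0, so its coefficient diverges
glm(@formula(y ~ x), toy, Binomial(), LogitLink())
```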
### Exercise 9
Add a `class_p` column to `secondary` as its second column, containing the
probability, predicted by the model created in exercise 8, that a given
observation has class `p`.
Print descriptive statistics of the `class_p` column by `class`.
<details>
<summary>Solution</summary>
```
insertcols!(secondary, 2, :class_p => predict(model))
end
```
We can see that the model has some discriminatory power, but there
is still a significant overlap between classes.
</details>
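The grouped-summary code is partly elided above; the same statistics can be sketched with a single `combine` call:

```julia
using DataFrames, Statistics

combine(groupby(secondary, :class),
        :class_p .=> [minimum, mean, median, maximum, std])
```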
### Exercise 10
Plot the FPR-TPR ROC curve for our model and compute the associated AUC value.
<details>
<summary>Solution</summary>
```
using Plots
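# The rest of the solution is truncated here. What follows is a sketch of
# one way to finish it (not necessarily the original code): sweep a
# threshold over the predicted probabilities from exercise 9, plot FPR
# against TPR, and compute AUC with the trapezoidal rule.
y = secondary.class .== "p"
scores = secondary.class_p
thr = [Inf; sort(unique(scores); rev=true)]
tpr = [sum((scores .>= t) .& y) / sum(y) for t in thr]
fpr = [sum((scores .>= t) .& .!y) / sum(.!y) for t in thr]
plot(fpr, tpr; xlabel="FPR", ylabel="TPR", legend=false)
auc = sum(diff(fpr) .* (tpr[1:end-1] .+ tpr[2:end]) ./ 2)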