update layout of all exercises

This commit is contained in:
parent 38398729ce
commit 31d8428f6a

@ -11,64 +11,8 @@
Check what methods the `repeat` function has.
Are they all covered in the help for this function?
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Write a function `fun2` that takes any vector and returns the difference between
|
||||
the largest and the smallest element in this vector.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Generate a vector of one million random numbers from `[0, 1]` interval.
|
||||
Check what is a faster way to get a maximum and minimum element in it. One
|
||||
option is by using the `maximum` and `minimum` functions and the other is by
|
||||
using the `extrema` function.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Assume you have accidentally typed `+x = 1` when wanting to assign `1` to
|
||||
variable `x`. What effects can this operation have?
|
||||
|
||||
### Exercise 5
|
||||
|
||||
What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why?
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Define two identical anonymous functions `x -> x + 1` in global scope. Do they
have the same type?
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Define the `wrap` function taking one argument `i` and returning the anonymous
|
||||
function `x -> x + i`. Is the type of such anonymous function the same across
|
||||
calls to `wrap` function?
|
||||
|
||||
### Exercise 8
|
||||
|
||||
You want to write a function that accepts any `Integer` except `Bool` and returns
|
||||
the passed value. If `Bool` is passed an error should be thrown.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
The `@time` macro measures the time taken to run an expression and prints it,
but returns the value of the expression.
The `@elapsed` macro works differently - it does not print anything, but returns
the time taken to evaluate an expression. Use the `@elapsed` macro to see how
long it takes to shuffle a vector of one million floats. Use the `shuffle` function
from the `Random` module.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Using the `@btime` macro benchmark the time of calculating the sum of one million
|
||||
random floats.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
<summary>Solution</summary>
|
||||
|
||||
Write:
|
||||
```
|
||||
@ -93,8 +37,16 @@ and `repeat(c::Char, r::Integer)` is its faster version
|
||||
that accepts values that have `Char` type only (and it is invoked by Julia
|
||||
if value of type `Char` is passed as an argument to `repeat`).
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Write a function `fun2` that takes any vector and returns the difference between
|
||||
the largest and the smallest element in this vector.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can define it as follows:
|
||||
```
|
||||
fun2(x::AbstractVector) = maximum(x) - minimum(x)
|
||||
@ -109,8 +61,18 @@ end
|
||||
Note that these two functions will work with vectors of any elements that
|
||||
are ordered and support subtraction (they do not have to be numbers).
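For instance (a small illustration), `fun2` also works on a vector of dates:

```
julia> using Dates

julia> fun2(Date.(2020, 1, [5, 1, 20]))
19 days
```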
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Generate a vector of one million random numbers from `[0, 1]` interval.
|
||||
Check what is a faster way to get a maximum and minimum element in it. One
|
||||
option is by using the `maximum` and `minimum` functions and the other is by
|
||||
using the `extrema` function.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Here is a way to compare the performance of both options:
|
||||
```
|
||||
julia> using BenchmarkTools
|
||||
@ -130,8 +92,16 @@ As you can see in this situation, although `extrema` does the operation
|
||||
in a single pass over `x` it is slower than computing `minimum` and `maximum`
|
||||
in two passes.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Assume you have accidentally typed `+x = 1` when wanting to assign `1` to
|
||||
variable `x`. What effects can this operation have?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
If it is a fresh Julia session, you define a new function for the `+` operator in `Main`:
|
||||
|
||||
```
|
||||
@ -167,8 +137,15 @@ julia> +x=1
|
||||
ERROR: error in method definition: function Base.+ must be explicitly imported to be extended
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You get an empty vector:
|
||||
```
|
||||
julia> subtypes(Union{Bool, Missing})
|
||||
@ -181,8 +158,16 @@ declared types that have names (type of such types is `DataType` in Julia).
|
||||
*Extra*: for this reason `subtypes` has limited use. To check if one type
is a subtype of some other type, use the `<:` operator.
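For example:

```
julia> Bool <: Integer
true

julia> Int <: Union{Bool, Missing}
false
```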
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Define two identical anonymous functions `x -> x + 1` in global scope. Do they
have the same type?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
No, each of them has a different type:
|
||||
```
|
||||
julia> f1 = x -> x + 1
|
||||
@ -215,8 +200,17 @@ julia> @time sum(x -> x^2, 1:10)
|
||||
385
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Define the `wrap` function taking one argument `i` and returning the anonymous
|
||||
function `x -> x + i`. Is the type of such anonymous function the same across
|
||||
calls to `wrap` function?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Yes, the type is the same:
|
||||
|
||||
```
|
||||
@ -252,8 +246,16 @@ julia> @time sumi(3)
|
||||
3025
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
You want to write a function that accepts any `Integer` except `Bool` and returns
|
||||
the passed value. If `Bool` is passed an error should be thrown.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
We check subtypes of `Integer`:
|
||||
|
||||
```
|
||||
@ -292,8 +294,20 @@ julia> fun2(true)
|
||||
ERROR: ArgumentError: Bool is not supported
|
||||
```
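One way to implement this with multiple dispatch (a sketch; the elided solution may differ) is:

```
fun2(x::Integer) = x
fun2(::Bool) = throw(ArgumentError("Bool is not supported"))
```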
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
The `@time` macro measures the time taken to run an expression and prints it,
but returns the value of the expression.
The `@elapsed` macro works differently - it does not print anything, but returns
the time taken to evaluate an expression. Use the `@elapsed` macro to see how
long it takes to shuffle a vector of one million floats. Use the `shuffle` function
from the `Random` module.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Here is the code that performs the task:
|
||||
```
|
||||
julia> using Random # needed to get access to shuffle
|
||||
@ -312,8 +326,16 @@ julia> @elapsed shuffle(x)
|
||||
|
||||
Note that the first time we run `shuffle` it takes longer due to compilation.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Using the `@btime` macro benchmark the time of calculating the sum of one million
|
||||
random floats.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
The code you can use is:
|
||||
|
||||
```
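# a sketch of what this benchmark could look like (assumed setup, not the book's exact code):
julia> using BenchmarkTools

julia> x = rand(10^6);

julia> @btime sum($x);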
|
||||
|
@ -12,11 +12,81 @@ Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix
|
||||
columnwise with consecutive numbers). Next calculate sum, mean and standard
|
||||
deviation of each row and each column of this matrix.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Write:
|
||||
```
|
||||
julia> using Statistics
|
||||
|
||||
julia> mat = [1 3 5
|
||||
2 4 6]
|
||||
2×3 Matrix{Int64}:
|
||||
1 3 5
|
||||
2 4 6
|
||||
|
||||
julia> sum(mat, dims=1)
|
||||
1×3 Matrix{Int64}:
|
||||
3 7 11
|
||||
|
||||
julia> sum(mat, dims=2)
|
||||
2×1 Matrix{Int64}:
|
||||
9
|
||||
12
|
||||
|
||||
julia> mean(mat, dims=1)
|
||||
1×3 Matrix{Float64}:
|
||||
1.5 3.5 5.5
|
||||
|
||||
julia> mean(mat, dims=2)
|
||||
2×1 Matrix{Float64}:
|
||||
3.0
|
||||
4.0
|
||||
|
||||
julia> std(mat, dims=1)
|
||||
1×3 Matrix{Float64}:
|
||||
0.707107 0.707107 0.707107
|
||||
|
||||
julia> std(mat, dims=2)
|
||||
2×1 Matrix{Float64}:
|
||||
2.0
|
||||
2.0
|
||||
```
|
||||
|
||||
Observe that the returned statistics are also stored in matrices.
|
||||
If we compute them for columns (`dims=1`) then the produced matrix has one row.
|
||||
If we compute them for rows (`dims=2`) then the produced matrix has one column.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
For each column of the matrix created in exercise 1 compute its range
|
||||
(i.e. the difference between maximum and minimum element stored in it).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Here are some ways you can do it:
|
||||
```
|
||||
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
|
||||
3-element Vector{Int64}:
|
||||
1
|
||||
1
|
||||
1
|
||||
|
||||
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
|
||||
3-element Vector{Int64}:
|
||||
1
|
||||
1
|
||||
1
|
||||
```
|
||||
|
||||
Observe that if we used `eachcol` the produced result is a vector (not a matrix
|
||||
like in exercise 1).
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
This is data for car speed (mph) and distance taken to stop (ft)
|
||||
@ -79,127 +149,8 @@ speed dist
|
||||
Load this data into Julia (this is part of the exercise) and fit a linear
|
||||
regression where speed is a feature and distance is target variable.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Plot the data loaded in exercise 3. Additionally plot the fitted regression
|
||||
(you need to check Plots.jl documentation to find a way to do this).
|
||||
|
||||
### Exercise 5
|
||||
|
||||
A simple code for calculation of Fibonacci numbers for positive
|
||||
arguments is as follows:
|
||||
|
||||
```
|
||||
fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
|
||||
```
|
||||
|
||||
Using the BenchmarkTools.jl package measure runtime of this function for
|
||||
`n` ranging from `1` to `20`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Improve the speed of code from exercise 5 by using a dictionary where you
|
||||
store a mapping of `n` to `fib(n)`. Measure the performance of this function
|
||||
for the same range of values as in exercise 5.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Create a vector containing named tuples representing elements of a 4x4 grid.
|
||||
So the first element of this vector should be `(x=1, y=1)` and last should be
|
||||
`(x=4, y=4)`. Store the vector in variable `v`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
The `filter` function allows you to select some values of an input collection.
|
||||
Check its documentation first. Next, use it to keep only those elements of the vector `v`
from exercise 7 whose sum of `x` and `y` is even.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Check the documentation of the `filter!` function. Perform the same operation
|
||||
as asked in exercise 8 but using `filter!`. What is the difference?
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Write a function that takes a number `n`. Next it generates two independent
|
||||
random vectors of length `n` and returns their correlation coefficient.
|
||||
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
|
||||
and `10000`.
|
||||
Create a plot with four histograms of distribution of computed Pearson
|
||||
correlation coefficient. Check in the Plots.jl package which function can be
|
||||
used to plot histograms.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Write:
|
||||
```
|
||||
julia> using Statistics
|
||||
|
||||
julia> mat = [1 3 5
|
||||
2 4 6]
|
||||
2×3 Matrix{Int64}:
|
||||
1 3 5
|
||||
2 4 6
|
||||
|
||||
julia> sum(mat, dims=1)
|
||||
1×3 Matrix{Int64}:
|
||||
3 7 11
|
||||
|
||||
julia> sum(mat, dims=2)
|
||||
2×1 Matrix{Int64}:
|
||||
9
|
||||
12
|
||||
|
||||
julia> mean(mat, dims=1)
|
||||
1×3 Matrix{Float64}:
|
||||
1.5 3.5 5.5
|
||||
|
||||
julia> mean(mat, dims=2)
|
||||
2×1 Matrix{Float64}:
|
||||
3.0
|
||||
4.0
|
||||
|
||||
julia> std(mat, dims=1)
|
||||
1×3 Matrix{Float64}:
|
||||
0.707107 0.707107 0.707107
|
||||
|
||||
julia> std(mat, dims=2)
|
||||
2×1 Matrix{Float64}:
|
||||
2.0
|
||||
2.0
|
||||
```
|
||||
|
||||
Observe that the returned statistics are also stored in matrices.
|
||||
If we compute them for columns (`dims=1`) then the produced matrix has one row.
|
||||
If we compute them for rows (`dims=2`) then the produced matrix has one column.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Here are some ways you can do it:
|
||||
```
|
||||
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
|
||||
3-element Vector{Int64}:
|
||||
1
|
||||
1
|
||||
1
|
||||
|
||||
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
|
||||
3-element Vector{Int64}:
|
||||
1
|
||||
1
|
||||
1
|
||||
```
|
||||
|
||||
Observe that if we used `eachcol` the produced result is a vector (not a matrix
|
||||
like in exercise 1).
|
||||
|
||||
### Exercise 3
|
||||
<summary>Solution</summary>
|
||||
|
||||
First create a matrix with source data by copy pasting it from the exercise
|
||||
like this:
|
||||
@ -285,8 +236,16 @@ julia> [ones(50) data[:, 1]] \ data[:, 2]
|
||||
3.9324087591240877
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Plot the data loaded in exercise 3. Additionally plot the fitted regression
|
||||
(you need to check Plots.jl documentation to find a way to do this).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Run the following:
|
||||
```
|
||||
using Plots
|
||||
@ -296,8 +255,23 @@ scatter(data[:, 1], data[:, 2];
|
||||
|
||||
The `smooth=true` keyword argument adds the linear regression line to the plot.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
A simple code for calculation of Fibonacci numbers for positive
|
||||
arguments is as follows:
|
||||
|
||||
```
|
||||
fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
|
||||
```
|
||||
|
||||
Using the BenchmarkTools.jl package measure runtime of this function for
|
||||
`n` ranging from `1` to `20`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Use the following code:
|
||||
```
|
||||
julia> using BenchmarkTools
|
||||
@ -331,8 +305,17 @@ julia> for i in 1:40
|
||||
Notice that the execution time for number `n` is roughly the sum of execution times
for numbers `n-1` and `n-2`.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Improve the speed of code from exercise 5 by using a dictionary where you
|
||||
store a mapping of `n` to `fib(n)`. Measure the performance of this function
|
||||
for the same range of values as in exercise 5.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Use the following code:
|
||||
|
||||
```
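# a possible memoized implementation (sketch; the book's exact code may differ):
julia> const fibmap = Dict{Int, Int}()

julia> function fib2(n)
           n < 3 && return 1
           return get!(() -> fib2(n - 1) + fib2(n - 2), fibmap, n)
       end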
|
||||
@ -422,8 +405,17 @@ julia> @time fib2(200)
|
||||
|
||||
As you can see, the code does fewer allocations and is faster now.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Create a vector containing named tuples representing elements of a 4x4 grid.
|
||||
So the first element of this vector should be `(x=1, y=1)` and last should be
|
||||
`(x=4, y=4)`. Store the vector in variable `v`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Since we are asked to create a vector we can write:
|
||||
|
||||
```
|
||||
@ -470,8 +462,17 @@ julia> [(; x, y) for x in 1:4, y in 1:4]
|
||||
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
The `filter` function allows you to select some values of an input collection.
|
||||
Check its documentation first. Next, use it to keep only those elements of the vector `v`
from exercise 7 whose sum of `x` and `y` is even.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
To get help on the `filter` function write `?filter`. Next run:
|
||||
|
||||
```
|
||||
@ -487,8 +488,16 @@ julia> filter(e -> iseven(e.x + e.y), v)
|
||||
(x = 4, y = 4)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Check the documentation of the `filter!` function. Perform the same operation
|
||||
as asked in exercise 8 but using `filter!`. What is the difference?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
To get help on the `filter!` function write `?filter!`. Next run:
|
||||
|
||||
```
|
||||
@ -518,8 +527,21 @@ julia> v
|
||||
Notice that `filter` allocated a new vector, while `filter!` updated the `v`
|
||||
vector in place.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Write a function that takes a number `n`. Next it generates two independent
|
||||
random vectors of length `n` and returns their correlation coefficient.
|
||||
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
|
||||
and `10000`.
|
||||
Create a plot with four histograms of distribution of computed Pearson
|
||||
correlation coefficient. Check in the Plots.jl package which function can be
|
||||
used to plot histograms.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can use for example the following code:
|
||||
|
||||
```
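# a sketch of one possible approach (helper name and plot layout are assumptions):
using Statistics, Plots
simcor(n) = cor(rand(n), rand(n))
plot([histogram([simcor(n) for _ in 1:10_000], title="n=$n", legend=false)
      for n in [10, 100, 1000, 10_000]]..., layout=(2, 2))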
|
||||
|
@ -10,93 +10,8 @@
|
||||
|
||||
Create a matrix containing truth table for `&&` and `||` operations.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
The `issubset` function checks if one collection is a subset of another
collection.
|
||||
|
||||
Now take a range `4:6` and check if it is a subset of ranges `4-k:4+k` for
|
||||
`k` varying from `1` to `3`. Store the result in a vector.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Write a function that accepts two vectors and returns `true` if they have equal
|
||||
length and otherwise returns `false`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
|
||||
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
|
||||
Calculate their Adjusted Mutual Information using scikit-learn.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Using Adjusted Mutual Information function from exercise 4 generate
|
||||
a pair of random vectors of length 100 containing integer numbers from the
|
||||
range `1:5`. Repeat this exercise 1000 times and plot a histogram of AMI.
|
||||
Check in the documentation of the `rand` function how you can draw a sample
|
||||
from a collection of values.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Adjust the code from exercise 5 but replace first 50 elements of each vector
|
||||
with zero. Repeat the experiment.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Write a function that takes a vector of integer values and returns a dictionary
giving information on how many times each integer was present in the passed vector.
|
||||
|
||||
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
|
||||
and `v3 = 3:5`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Write code that creates a `Bool` diagonal matrix of size 5x5.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Write a code comparing performance of calculation of sum of logarithms of
|
||||
elements of a vector `1:100` using broadcasting and the `sum` function vs only
|
||||
the `sum` function taking a function as a first argument.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Create a dictionary in which for each number from `1` to `10` you will store
|
||||
a vector of its positive divisors. You can check the remainder of division
|
||||
of two values using the `rem` function.
|
||||
|
||||
Additionally (not covered in the book), you can drop elements
|
||||
from a comprehension if you add an `if` clause after the `for` clause, for
|
||||
example to keep only odd numbers from range `1:10` do:
|
||||
|
||||
```
|
||||
julia> [i for i in 1:10 if isodd(i)]
|
||||
5-element Vector{Int64}:
|
||||
1
|
||||
3
|
||||
5
|
||||
7
|
||||
9
|
||||
```
|
||||
|
||||
You can populate a dictionary by passing a vector of pairs to it (not covered in
|
||||
the book), for example:
|
||||
|
||||
```
|
||||
julia> Dict(["a" => 1, "b" => 2])
|
||||
Dict{String, Int64} with 2 entries:
|
||||
"b" => 2
|
||||
"a" => 1
|
||||
```
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can do it as follows:
|
||||
```
|
||||
@ -113,8 +28,19 @@ julia> [true, false] .|| [true false]
|
||||
|
||||
Note that the first array is a vector, while the second array is a 1-row matrix.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
The `issubset` function checks if one collection is a subset of another
collection.
|
||||
|
||||
Now take a range `4:6` and check if it is a subset of ranges `4-k:4+k` for
|
||||
`k` varying from `1` to `3`. Store the result in a vector.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can do it like this using broadcasting:
|
||||
```
|
||||
julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
|
||||
@ -125,16 +51,33 @@ julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
|
||||
```
|
||||
Note that you need to use `Ref` to protect `4:6` from being broadcasted over.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Write a function that accepts two vectors and returns `true` if they have equal
|
||||
length and otherwise returns `false`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
This function can be written as follows:
|
||||
|
||||
```
|
||||
equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
|
||||
```
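For example, `equallength([1, 2], [3, 4])` returns `true`, while `equallength([1], [3, 4])` returns `false`.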
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
|
||||
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
|
||||
Calculate their Adjusted Mutual Information using scikit-learn.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can do this exercise as follows:
|
||||
```
|
||||
julia> using PyCall
|
||||
@ -151,8 +94,19 @@ julia> metrics.adjusted_mutual_info_score(y, z)
|
||||
-0.21267989848846763
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Using Adjusted Mutual Information function from exercise 4 generate
|
||||
a pair of random vectors of length 100 containing integer numbers from the
|
||||
range `1:5`. Repeat this exercise 1000 times and plot a histogram of AMI.
|
||||
Check in the documentation of the `rand` function how you can draw a sample
|
||||
from a collection of values.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can create such a plot using the following commands:
|
||||
|
||||
```
|
||||
@ -163,8 +117,16 @@ histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100))
|
||||
|
||||
You can check that AMI oscillates around 0.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Adjust the code from exercise 5 but replace first 50 elements of each vector
|
||||
with zero. Repeat the experiment.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
This time it is convenient to write a helper function. Note that we use
|
||||
broadcasting to update values in the vectors.
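A sketch of such a helper (the name `exampleAMI` appears in the code below; its body here is an assumption):

```
function exampleAMI()
    x = rand(1:5, 100)
    y = rand(1:5, 100)
    x[1:50] .= 0
    y[1:50] .= 0
    return metrics.adjusted_mutual_info_score(x, y)
end
```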
|
||||
|
||||
@ -182,8 +144,19 @@ histogram([exampleAMI() for i in 1:1000], label="AMI")
|
||||
Note that this time AMI is a bit below 0.5, which shows a better match between
|
||||
vectors.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Write a function that takes a vector of integer values and returns a dictionary
giving information on how many times each integer was present in the passed vector.
|
||||
|
||||
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
|
||||
and `v3 = 3:5`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> function counter(v::AbstractVector{<:Integer})
|
||||
d = Dict{eltype(v), Int}()
|
||||
@ -219,8 +192,15 @@ Dict{Int64, Int64} with 3 entries:
|
||||
Note that we used the `eltype` function to set a proper key type for
|
||||
dictionary `d`.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Write code that creates a `Bool` diagonal matrix of size 5x5.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
This is a way to do it:
|
||||
```
|
||||
julia> 1:5 .== (1:5)'
|
||||
@ -246,8 +226,17 @@ julia> I(5)
|
||||
⋅ ⋅ ⋅ ⋅ 1
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Write a code comparing performance of calculation of sum of logarithms of
|
||||
elements of a vector `1:100` using broadcasting and the `sum` function vs only
|
||||
the `sum` function taking a function as a first argument.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Here is how you can do it:
|
||||
|
||||
```
|
||||
@ -265,8 +254,41 @@ julia> @btime sum(log, 1:100)
|
||||
As you can see using the `sum` function with `log` as its first argument
|
||||
is a bit faster as it is not allocating.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Create a dictionary in which for each number from `1` to `10` you will store
|
||||
a vector of its positive divisors. You can check the remainder of division
|
||||
of two values using the `rem` function.
|
||||
|
||||
Additionally (not covered in the book), you can drop elements
|
||||
from a comprehension if you add an `if` clause after the `for` clause, for
|
||||
example to keep only odd numbers from range `1:10` do:
|
||||
|
||||
```
|
||||
julia> [i for i in 1:10 if isodd(i)]
|
||||
5-element Vector{Int64}:
|
||||
1
|
||||
3
|
||||
5
|
||||
7
|
||||
9
|
||||
```
|
||||
|
||||
You can populate a dictionary by passing a vector of pairs to it (not covered in
|
||||
the book), for example:
|
||||
|
||||
```
|
||||
julia> Dict(["a" => 1, "b" => 2])
|
||||
Dict{String, Int64} with 2 entries:
|
||||
"b" => 2
|
||||
"a" => 1
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Here is how you can do it:
|
||||
|
||||
```
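# one possible approach (sketch):
julia> Dict(i => [j for j in 1:i if rem(i, j) == 0] for i in 1:10)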
|
||||
|
@ -11,16 +11,47 @@
|
||||
Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"`
|
||||
(replace `3` by a proper interpolation expression) and replace `USD` by `$`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> "I have apples worth $(1+2)\$"
|
||||
"I have apples worth 3\$"
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
|
||||
as `iris.csv` to your local folder.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
import Downloads
|
||||
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
|
||||
"iris.csv")
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
|
||||
in two lines so that it takes less horizontal space.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
"https://archive.ics.uci.edu/ml/\
|
||||
machine-learning-databases/iris/iris.data"
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Load data stored in `iris.csv` file into a `data` vector where each element
|
||||
@ -28,73 +59,9 @@ should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
|
||||
the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed
|
||||
as floats).
|
||||
|
||||
### Exercise 5
|
||||
|
||||
The `data` structure is a vector of named tuples, change it to a named tuple
|
||||
of vectors (with the same field names) and call it `data2`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Calculate the frequency of each Iris type (`c` field in `data2`).
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
|
||||
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
|
||||
Compare sizes of the three objects.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
You know that `refs` field of `PooledArray` stores an integer index of a given
|
||||
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
|
||||
in `data2`, but for each Iris type give a different point color (check the
|
||||
`color` keyword argument meaning in the Plots.jl manual; you can use the
|
||||
`plot_color` function).
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
|
||||
`str` variable (do not copy paste the string, but type it).
|
||||
|
||||
### Exercise 10
|
||||
|
||||
In the `str` string from exercise 9 find all matches of a pattern where `a`
|
||||
is followed by `b` but there can be some characters between them.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> "I have apples worth $(1+2)\$"
|
||||
"I have apples worth 3\$"
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
```
|
||||
import Downloads
|
||||
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
|
||||
"iris.csv")
|
||||
```
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
```
|
||||
"https://archive.ics.uci.edu/ml/\
|
||||
machine-learning-databases/iris/iris.data"
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> function line_parser(line)
|
||||
elements = split(line, ",")
|
||||
@ -125,8 +92,16 @@ Note that we used `1:end-1` selector to drop last element from the read lines
|
||||
since it is empty. This is the reason why adding the
|
||||
`@assert length(elements) == 5` check in the `line_parser` function is useful.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
The `data` structure is a vector of named tuples, change it to a named tuple
|
||||
of vectors (with the same field names) and call it `data2`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Later in the book you will learn more advanced ways to do it. Here let us
|
||||
use a most basic approach:
|
||||
|
||||
@ -138,9 +113,15 @@ data2 = (sl=[d.sl for d in data],
|
||||
c=[d.c for d in data])
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
Calculate the frequency of each Iris type (`c` field in `data2`).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using FreqTables
|
||||
|
||||
@ -153,9 +134,17 @@ Dim1 │
|
||||
"Iris-virginica" │ 50
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
|
||||
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
|
||||
Compare sizes of the three objects.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using InlineStrings
|
||||
|
||||
@ -213,16 +202,34 @@ julia> Base.summarysize(c4)
|
||||
1240
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
You know that `refs` field of `PooledArray` stores an integer index of a given
|
||||
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
|
||||
in `data2`, but for each Iris type give a different point color (check the
|
||||
`color` keyword argument meaning in the Plots.jl manual; you can use the
|
||||
`plot_color` function).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Plots
|
||||
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
|
||||
`str` variable (do not copy paste the string, but type it).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
The hard part is typing `²`, `⟺` and `∨`. You can check how to do it using help:
|
||||
```
|
||||
help?> ²
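# In the Julia REPL these characters can be typed with LaTeX-like tab completions:
# \^2 then TAB gives ², \iff then TAB gives ⟺, \vee then TAB gives ∨.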
|
||||
@ -237,8 +244,16 @@ help?> ∨
|
||||
|
||||
Save the string in the `str` variable as we will use it in the next exercise.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
In the `str` string from exercise 9 find all matches of a pattern where `a`
|
||||
is followed by `b` but there can be some characters between them.
|
||||
|
||||
<details>
|
||||
<summary>Show!</summary>
|
||||
|
||||
The exercise does not specify how the matching should be done. If we
|
||||
want it to be eager (match as much as possible), we write:
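A sketch of the greedy variant (assuming `str` holds the string from exercise 9):

```
julia> collect(eachmatch(r"a.*b", str))  # matches as much as possible
```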
|
||||
|
||||
|
@ -19,75 +19,10 @@ If you want to understand all the parameters please check their meaning
|
||||
For us it is enough that this request generates 10 random integers in the range
|
||||
from 1 to 6. Run this query in Julia and parse the result.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Write a function that tries to parse a string as an integer.
|
||||
If it succeeds it should return the integer, otherwise it should return `0`
|
||||
but print error message.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Create a matrix containing truth table for `&&` operation including `missing`.
|
||||
If some operation errors store `"error"` in the table. As an extra feature (this
|
||||
is harder so you can skip it) in each cell store both inputs and output to make
|
||||
reading the table easier.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
|
||||
missing values in it by the mean of the non-missing values.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
|
||||
strings stored in it as `Float64`, while keeping `missing` values unchanged.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Print to the terminal all days in January 2023 that are Mondays.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Compute the dates that are one month later than January 15, 2020, February 15,
2020, March 15, 2020, and April 15, 2020. How many days pass during this one
month? Print the results to the screen.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Parse the following string as JSON:
|
||||
```
|
||||
str = """
|
||||
[{"x":1,"y":1},
|
||||
{"x":2,"y":4},
|
||||
{"x":3,"y":9},
|
||||
{"x":4,"y":16},
|
||||
{"x":5,"y":25}]
|
||||
"""
|
||||
```
|
||||
into a `json` variable.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
|
||||
that correspond to the fields stored in the JSON structure.
|
||||
Plot `y` as a function of `x`.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`.
|
||||
Use linear interpolation for filling missing values. For the extreme values
|
||||
use nearest available observation (you will need to consult Impute.jl
|
||||
documentation to find all required functions).
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution (example run):
|
||||
Example run:
|
||||
|
||||
```
|
||||
julia> using HTTP
|
||||
@ -109,8 +44,17 @@ julia> parse.(Int, split(String(response.body)))
|
||||
6
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Write a function that tries to parse a string as an integer.
|
||||
If it succeeds it should return the integer, otherwise it should return `0`
|
||||
but print error message.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Example function:
|
||||
|
||||
```
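# a possible implementation (sketch; the book's exact code may differ):
julia> function intparse(str)
           try
               return parse(Int, str)
           catch e
               println("error when parsing: ", e)
               return 0
           end
       end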
|
||||
@ -160,9 +104,17 @@ end
|
||||
```
|
||||
But this time we do not see the cause of the error.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
Create a matrix containing truth table for `&&` operation including `missing`.
|
||||
If some operation errors store `"error"` in the table. As an extra feature (this
|
||||
is harder so you can skip it) in each cell store both inputs and output to make
|
||||
reading the table easier.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> function apply_and(x, y)
|
||||
@ -181,9 +133,15 @@ julia> apply_and.([true, false, missing], [true false missing])
|
||||
"missing && true = error" "missing && false = error" "missing && missing = error"
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
|
||||
missing values in it by the mean of the non-missing values.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using Statistics
|
||||
@ -198,9 +156,15 @@ julia> coalesce.(v, mean(skipmissing(v)))
|
||||
3.5
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution:
|
||||
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
|
||||
strings stored in it as `Float64`, while keeping `missing` values unchanged.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using Missings
|
||||
@ -215,9 +179,16 @@ julia> passmissing(parse).(Float64, s)
|
||||
missing
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Example solution:
|
||||
Print to the terminal all days in January 2023 that are Mondays.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Example:
|
||||
|
||||
```
|
||||
julia> using Dates
|
||||
@ -232,9 +203,18 @@ julia> for day in Date.(2023, 01, 1:31)
|
||||
2023-01-30
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Example solution:
|
||||
Compute the dates that are one month later than January 15, 2020, February 15,
2020, March 15, 2020, and April 15, 2020. How many days pass during this one
month? Print the results to the screen.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Example:
|
||||
|
||||
```
|
||||
julia> for day in Date.(2023, 1:4, 15)
|
||||
@ -247,9 +227,24 @@ julia> for day in Date.(2023, 1:4, 15)
|
||||
2023-04-15 + 1 month = 2023-05-15 (difference: 30 days)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
Parse the following string as JSON:
|
||||
```
|
||||
str = """
|
||||
[{"x":1,"y":1},
|
||||
{"x":2,"y":4},
|
||||
{"x":3,"y":9},
|
||||
{"x":4,"y":16},
|
||||
{"x":5,"y":25}]
|
||||
"""
|
||||
```
|
||||
into a `json` variable.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using JSON3
|
||||
@ -278,9 +273,16 @@ julia> json = JSON3.read(str)
|
||||
}
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution:
|
||||
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
|
||||
that correspond to the fields stored in the JSON structure.
|
||||
Plot `y` as a function of `x`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Plots
|
||||
@ -289,9 +291,17 @@ y = [el.y for el in json]
|
||||
plot(x, y, xlabel="x", ylabel="y", legend=false)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`.
|
||||
Use linear interpolation for filling missing values. For the extreme values
|
||||
use nearest available observation (you will need to consult Impute.jl
|
||||
documentation to find all required functions).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using Impute
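# presumably something along these lines (sketch; Impute.jl function names assumed):
julia> Impute.nocb(Impute.locf(Impute.interp(m)))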
|
||||
|
@ -11,63 +11,8 @@
|
||||
Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
|
||||
called `df`.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Get number of rows, columns, column names and summary statistics of the
|
||||
`df` data frame from exercise 1.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Make a plot of `number` against `square` columns of `df` data frame.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Add a column to `df` data frame with name `name string` containing string
|
||||
representation of numbers in column `number`, i.e.
|
||||
`["one", "two", "three", "four"]`.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Check if `df` contains column `square2`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Extract column `number` from `df` and empty it (recall `empty!` function
|
||||
discussed in chapter 4).
|
||||
|
||||
### Exercise 7
|
||||
|
||||
In the `Random` module the `randexp` function is defined, which samples numbers
from the exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in `x` and `y` vectors. Plot histograms of the maximum of pairs of sampled values
and of the sum of vector `x` and half of vector `y`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them,
|
||||
and maximum of pairs of sampled values and sum of vector `x` and half of vector `y`.
|
||||
Compute all standard descriptive statistics of columns of this data frame.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
|
||||
Compare the size of created files using the `filesize` function.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Write the `df` data frame into SQLite database. Next find information about
|
||||
tables in this database. Run a query against a table representing the `df` data
|
||||
frame to calculate the mean of column `x`. Does it match the result we got in
|
||||
exercise 8?
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
<summary>Solution</summary>
|
||||
|
||||
CSV.jl supports reading gzip-compressed files so you can just do:
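For example (a sketch, assuming the file is in the current working directory):

```
julia> using CSV, DataFrames

julia> df = CSV.read("example8.csv.gz", DataFrame)
```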
|
||||
|
||||
@ -106,9 +51,15 @@ julia> df = CSV.read(plain, DataFrame)
|
||||
4 │ 4 16
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
Get number of rows, columns, column names and summary statistics of the
|
||||
`df` data frame from exercise 1.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> nrow(df)
|
||||
@ -131,17 +82,30 @@ julia> describe(df)
|
||||
2 │ square 7.75 2 6.5 16 0 Int64
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
Make a plot of `number` against `square` columns of `df` data frame.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Plots
|
||||
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
Add a column to `df` data frame with name `name string` containing string
|
||||
representation of numbers in column `number`, i.e.
|
||||
`["one", "two", "three", "four"]`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> df."name string" = ["one", "two", "three", "four"]
|
||||
@ -164,8 +128,15 @@ julia> df
|
||||
|
||||
Note that we needed to use a string as we have space in column name.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Check if `df` contains column `square2`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You can use either `hasproperty` or `columnindex`:
|
||||
|
||||
```
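# for example (sketch):
julia> hasproperty(df, :square2)
false

julia> columnindex(df, :square2)  # 0 means that the column is not present
0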
|
||||
@ -184,9 +155,15 @@ julia> df.square2
|
||||
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
Extract column `number` from `df` and empty it (recall `empty!` function
|
||||
discussed in chapter 4).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> empty!(df[:, :number])
|
||||
@ -198,9 +175,19 @@ as it would corrupt the `df` data frame (these operations do non-copying
|
||||
extraction of a column from a data frame as opposed to `df[:, :number]`
|
||||
which makes a copy).
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
In the `Random` module the `randexp` function is defined, which samples numbers
from the exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in `x` and `y` vectors. Plot histograms of the maximum of pairs of sampled values
and of the sum of vector `x` and half of vector `y`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Random
|
||||
using Plots
|
||||
@ -212,10 +199,19 @@ histogram!(max.(x, y), label="maximum")
|
||||
|
||||
I have put both histograms on the same plot to show that they overlap.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution (you might get slightly different results because we did not set
|
||||
the seed of random number generator when creating `x` and `y` vectors):
|
||||
Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them,
|
||||
and maximum of pairs of sampled values and sum of vector `x` and half of vector `y`.
|
||||
Compute all standard descriptive statistics of columns of this data frame.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
You might get slightly different results because we did not set
|
||||
the seed of random number generator when creating `x` and `y` vectors:
|
||||
|
||||
```
|
||||
julia> df = DataFrame(x=x, y=y);
|
||||
@ -238,8 +234,16 @@ julia> describe(df, :all)
|
||||
We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary
|
||||
statistics except `first` and `last` as expected.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
|
||||
Compare the size of created files using the `filesize` function.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using Arrow
|
||||
|
||||
@ -258,8 +262,18 @@ julia> filesize("df.arrow")
|
||||
|
||||
In this case Apache Arrow file is smaller.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Write the `df` data frame into SQLite database. Next find information about
|
||||
tables in this database. Run a query against a table representing the `df` data
|
||||
frame to calculate the mean of column `x`. Does it match the result we got in
|
||||
exercise 8?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> using SQLite
|
||||
|
||||
|
@ -22,69 +22,8 @@ Create `matein2` data frame that will have only puzzles that have `"mateIn2"`
|
||||
in the `Themes` column.
|
||||
Use the `contains` function (check its documentation first).
|
||||
|
||||
### Exercise 2
|
||||
|
||||
What is the fraction of puzzles that are mate in 2 in relation to all puzzles
|
||||
in the `puzzles` data frame?
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Create `small` data frame that holds first 10 rows of `matein2` data frame
|
||||
and columns `Rating`, `RatingDeviation`, and `NbPlays`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Iterate rows of `small` data frame and print the ratio of
|
||||
`RatingDeviation` and `NbPlays` for each row.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Get names of columns from the `matein2` data frame that end with `n` (ignore case).
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Write a function `collatz` that runs the following process. Start with a
|
||||
positive number `n`. If it is even divide it by two. If it is odd multiply
|
||||
it by 3 and add one. The function should return the number of steps needed to
|
||||
reach 1.
|
||||
|
||||
Create a `d` dictionary that maps number of steps needed to a list of numbers from
|
||||
the range `1:100` that required this number of steps.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Using the `d` dictionary make a scatter plot of number of steps required
|
||||
vs average value of numbers that require this number of steps.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Repeat the process from exercises 6 and 7, but this time use a data frame
|
||||
and try to write an appropriate expression using the `combine` and `groupby`
|
||||
functions (as it was explained in the last part of chapter 9). This time
|
||||
perform computations for numbers ranging from one to one million.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Set seed of random number generator to `1234`. Draw 100 random points
|
||||
from the interval `[0, 1]`. Store this vector in a data frame as `x` column.
|
||||
Now compute `y` column using a formula `4 * (x - 0.5) ^ 2`.
|
||||
Add random noise to column `y` that has normal distribution with mean 0 and
|
||||
standard deviation 0.25. Call this column `z`.
|
||||
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Add a LOESS regression line (with `x` explaining `z`) to the figure produced in exercise 9.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
|
||||
@ -104,9 +43,17 @@ julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
|
||||
1 column and 274127 rows omitted
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution (two ways to do it):
|
||||
What is the fraction of puzzles that are mate in 2 in relation to all puzzles
|
||||
in the `puzzles` data frame?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Two ways to do it:
|
||||
|
||||
```
|
||||
julia> using Statistics
|
||||
@ -118,9 +65,15 @@ julia> mean(contains.(puzzles.Themes, "mateIn2"))
|
||||
0.12852152542746353
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
Create `small` data frame that holds first 10 rows of `matein2` data frame
|
||||
and columns `Rating`, `RatingDeviation`, and `NbPlays`.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
|
||||
@ -140,9 +93,15 @@ julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
|
||||
10 │ 979 144 14
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
Iterate rows of `small` data frame and print the ratio of
|
||||
`RatingDeviation` and `NbPlays` for each row.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> for row in eachrow(small)
|
||||
@ -160,9 +119,16 @@ julia> for row in eachrow(small)
|
||||
10.285714285714286
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution (several options):
|
||||
Get names of columns from the `matein2` data frame that end with `n` (ignore case).
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Several options:
|
||||
```
|
||||
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
|
||||
2-element Vector{String}:
|
||||
@ -180,9 +146,20 @@ julia> names(matein2, r"[nN]$")
|
||||
"RatingDeviation"
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
Write a function `collatz` that runs the following process. Start with a
|
||||
positive number `n`. If it is even divide it by two. If it is odd multiply
|
||||
it by 3 and add one. The function should return the number of steps needed to
|
||||
reach 1.
|
||||
|
||||
Create a `d` dictionary that maps number of steps needed to a list of numbers from
|
||||
the range `1:100` that required this number of steps.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
julia> function collatz(n)
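           # a sketch of a possible body (the book's exact code may differ)
           steps = 0
           while n != 1
               n = iseven(n) ? n ÷ 2 : 3n + 1
               steps += 1
           end
           return steps
       end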
|
||||
@ -232,9 +209,15 @@ Dict{Int64, Vector{Int64}} with 45 entries:
|
||||
As we can see even for small `n` the number of steps required to reach `1`
|
||||
can get quite large.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
Using the `d` dictionary make a scatter plot of number of steps required
|
||||
vs average value of numbers that require this number of steps.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Plots
|
||||
@ -247,9 +230,17 @@ scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=fal
|
||||
Note that we needed to use `collect` on `keys` as `scatter` expects an array
|
||||
not just an iterator.
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
Repeat the process from exercises 6 and 7, but this time use a data frame
|
||||
and try to write an appropriate expression using the `combine` and `groupby`
|
||||
functions (as it was explained in the last part of chapter 9). This time
|
||||
perform computations for numbers ranging from one to one million.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
df = DataFrame(n=1:10^6);
|
||||
@ -258,6 +249,8 @@ agg = combine(groupby(df, :collatz), :n => mean);
|
||||
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Set seed of random number generator to `1234`. Draw 100 random points
|
||||
@ -267,7 +260,8 @@ Add random noise to column `y` that has normal distribution with mean 0 and
|
||||
standard deviation 0.25. Call this column `z`.
|
||||
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
|
||||
|
||||
Solution:
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Random
|
||||
@ -278,9 +272,14 @@ df.z = df.y + randn(100) / 4
|
||||
scatter(df.x, [df.y df.z], labels=["y" "z"])
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
Add a LOESS regression line (with `x` explaining `z`) to the figure produced in exercise 9.
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
```
|
||||
using Loess
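# a possible continuation (sketch; `loess` and `predict` come from Loess.jl):
model = loess(df.x, df.z)
xs = sort(df.x)
plot!(xs, predict(model, xs))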
|
||||
|
@ -13,89 +13,8 @@ independently and uniformly from the [0,1[ interval.
|
||||
Create a data frame using data from this matrix using auto-generated
|
||||
column names.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Now, using matrix `mat` create a data frame with randomly generated
|
||||
column names. Use the `randstring` function from the `Random` module
|
||||
to generate them. Store this data frame in `df` variable.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Create a new data frame, taking `df` as a source that will have the same
|
||||
columns but its column names will be `y1`, `y2`, `y3`, `y4`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Create a dictionary holding `column_name => column_vector` pairs
|
||||
using data stored in data frame `df`. Save this dictionary in variable `d`.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Create a data frame back from dictionary `d` from exercise 4. Compare it
|
||||
with `df`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
For data frame `df` compute the dot product between all pairs of its columns.
|
||||
Use the `dot` function from the `LinearAlgebra` module.
|
||||
|
||||

### Exercise 7

Given two data frames:

```
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12

julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1    101
   2 │     2    102
```

vertically concatenate them so that only columns that are present in both
data frames are kept. Check the documentation of `vcat` to see how to
do it.

### Exercise 8

Now append to `df1` table `df2`, but add only the columns from `df2` that
are present in `df1`. Check the documentation of `append!` to see how to
do it.

### Exercise 9

Create a `circle` data frame, using the `push!` function that will store
1000 samples of the following process:
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
  and is `false` otherwise.

Compute summary statistics of this data frame.

### Exercise 10

Create a scatterplot of `circle` data frame where its `x` and `y` axis
will be the plotted points and `inside` variable will determine the color
of the plotted point.

# Solutions

<details>

<summary>Show!</summary>

### Exercise 1

Solution:
<summary>Solution</summary>

```
julia> using DataFrames
@ -120,9 +39,16 @@ julia> DataFrame(mat, :auto)
   5 │ 0.714515  0.861872  0.971521   0.176768
```

</details>

### Exercise 2

Solution:
Now, using matrix `mat` create a data frame with randomly generated
column names. Use the `randstring` function from the `Random` module
to generate them. Store this data frame in `df` variable.

<details>
<summary>Solution</summary>

```
julia> using Random
@ -139,10 +65,16 @@ julia> df = DataFrame(mat, [randstring() for _ in 1:4])
   5 │ 0.714515  0.861872  0.971521   0.176768
```

</details>

### Exercise 3

Solution:
Create a new data frame, taking `df` as a source that will have the same
columns but its column names will be `y1`, `y2`, `y3`, `y4`.

<details>
<summary>Solution</summary>

```
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
@ -170,9 +102,15 @@ julia> rename(df, string.("y", 1:4))
   5 │ 0.714515  0.861872  0.971521   0.176768
```

</details>

### Exercise 4

Solution:
Create a dictionary holding `column_name => column_vector` pairs
using data stored in data frame `df`. Save this dictionary in variable `d`.

<details>
<summary>Solution</summary>

```
julia> d = Dict([n => df[:, n] for n in names(df)])
@ -194,9 +132,15 @@ Dict{Symbol, AbstractVector} with 4 entries:
  Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
```

</details>

### Exercise 5

Solution:
Create a data frame back from dictionary `d` from exercise 4. Compare it
with `df`.

<details>
<summary>Solution</summary>

```
julia> DataFrame(d)
@ -215,9 +159,15 @@ Note that columns of a data frame are now sorted by their names.
This is done for `Dict` objects because such dictionaries do not have
a defined order of keys.

</details>

### Exercise 6

Solution:
For data frame `df` compute the dot product between all pairs of its columns.
Use the `dot` function from the `LinearAlgebra` module.

<details>
<summary>Solution</summary>

```
julia> using LinearAlgebra
@ -232,9 +182,36 @@ julia> pairwise(dot, eachcol(df))
 1.50558   1.18411   0.909744  1.47431
```

</details>

### Exercise 7

Solution:
Given two data frames:

```
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12

julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1    101
   2 │     2    102
```

vertically concatenate them so that only columns that are present in both
data frames are kept. Check the documentation of `vcat` to see how to
do it.

<details>
<summary>Solution</summary>

```
julia> vcat(df1, df2, cols=:intersect)
@ -255,9 +232,16 @@ julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
```

</details>

### Exercise 8

Solution:
Now append to `df1` table `df2`, but add only the columns from `df2` that
are present in `df1`. Check the documentation of `append!` to see how to
do it.

<details>
<summary>Solution</summary>

```
julia> append!(df1, df2, cols=:subset)
@ -271,9 +255,20 @@ julia> append!(df1, df2, cols=:subset)
   4 │     2  missing
```

</details>

### Exercise 9

Solution:
Create a `circle` data frame, using the `push!` function that will store
1000 samples of the following process:
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
  and is `false` otherwise.

Compute summary statistics of this data frame.

<details>
<summary>Solution</summary>

```
circle=DataFrame()
@ -287,9 +282,16 @@ describe(circle)

We note that the mean of variable `inside` is approximately π/4, since the unit
circle covers π/4 ≈ 0.785 of the area of the [-1,1[ × [-1,1[ square.

</details>
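
The code block above is truncated by the diff; a minimal sketch of the `push!`-based loop it describes could look like this (column names follow the exercise statement):

```
circle = DataFrame()
for _ in 1:1000
    x = 2 * rand() - 1   # uniform on [-1, 1)
    y = 2 * rand() - 1
    push!(circle, (x=x, y=y, inside=x^2 + y^2 < 1))
end
describe(circle)
```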

### Exercise 10

Solution:
Create a scatterplot of `circle` data frame where its `x` and `y` axis
will be the plotted points and `inside` variable will determine the color
of the plotted point.

<details>
<summary>Solution</summary>

```
using Plots
@ -13,83 +13,8 @@ sampled from uniform distribution on [0, 1[ interval.
Serialize it to disk, and next deserialize. Check if the deserialized
object is the same as the source data frame.

### Exercise 2

Add a column `n` to the `df` data frame that in each row will hold the
number of observations in column `x` that have distance less than `0.1` to
a value stored in a given row of `x`.

### Exercise 3

Investigate visually how `n` depends on `x` in data frame `df`.

### Exercise 4

Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```

Load this data into `testdf` data frame.

### Exercise 5

Check the accuracy of computations of the sine of `x` in `testdf`.
Print all rows for which the absolute difference is greater than `5e-7`.
In this case display `x`, `sinx`, the exact value of `sin(x)` and the absolute
difference.

### Exercise 6

Group data in data frame `df` into buckets of 0.1 width and store the result in
`gdf` data frame (sort the groups). Use the `cut` function from
CategoricalArrays.jl to do it (check its documentation to learn how to do it).
Check the number of values in each group.

### Exercise 7

Display the grouping keys in `gdf` grouped data frame. Show them as named tuples.
Check what would be the group order if you asked not to sort them.

### Exercise 8

Compute average `n` for each group in `gdf`.

### Exercise 9

Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as named tuple having fields `α₀` and `αₓ`.

### Exercise 10

Repeat exercise 9 but using the GLM.jl package. This time
extract the p-value for the slope of estimated coefficient for `x` variable.
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).

# Solutions

<details>

<summary>Show!</summary>

### Exercise 1

Solution:
<summary>Solution</summary>

```
julia> using DataFrames
@ -104,9 +29,16 @@ julia> deserialize("df.bin") == df
true
```

</details>

### Exercise 2

Solution:
Add a column `n` to the `df` data frame that in each row will hold the
number of observations in column `x` that have distance less than `0.1` to
a value stored in a given row of `x`.

<details>
<summary>Solution</summary>

A simple approach is:
```
@ -151,9 +83,14 @@ df.n = f2(df.x)
In this solution the fact that we used a function barrier is even more relevant,
as we explicitly use loops inside.

</details>
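
Since the diff cuts out the body of this solution, here is a minimal sketch of the two variants it alludes to (a comprehension-based one-liner and the loop-based function barrier), assuming `df.x` holds the random draws:

```
# simple comprehension-based version
f(x) = [count(v -> abs(v - xi) < 0.1, x) for xi in x]
df.n = f(df.x)

# explicit-loop version behind a function barrier
function f2(x::AbstractVector)
    n = zeros(Int, length(x))
    for i in eachindex(x), j in eachindex(x)
        n[i] += abs(x[i] - x[j]) < 0.1
    end
    return n
end
df.n = f2(df.x)
```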

### Exercise 3

Solution:
Investigate visually how `n` depends on `x` in data frame `df`.

<details>
<summary>Solution</summary>

```
using Plots
@ -162,9 +99,31 @@ scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)

As expected, near the border of the domain the number of neighbors drops.

</details>

### Exercise 4

Solution:
Someone has prepared the following test data for you:
```
teststr = """
"x","sinx"
0.139279,0.138829
0.456779,0.441059
0.344034,0.337287
0.140253,0.139794
0.848344,0.750186
0.977512,0.829109
0.032737,0.032731
0.702750,0.646318
0.422339,0.409895
0.393878,0.383772
"""
```

Load this data into `testdf` data frame.

<details>
<summary>Solution</summary>

```
julia> using CSV
@ -188,8 +147,18 @@ julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
  10 │ 0.393878  0.383772
```

</details>

### Exercise 5

Check the accuracy of computations of the sine of `x` in `testdf`.
Print all rows for which the absolute difference is greater than `5e-7`.
In this case display `x`, `sinx`, the exact value of `sin(x)` and the absolute
difference.

<details>
<summary>Solution</summary>

Since the data frame is small we can use `eachrow`:

```
@ -202,9 +171,18 @@ julia> for row in eachrow(testdf)
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
```

</details>
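
The loop body is elided in the hunk above; a sketch consistent with the printed output could be:

```
for row in eachrow(testdf)
    computed = sin(row.x)
    dev = abs(computed - row.sinx)
    if dev > 5e-7
        println((x=row.x, computed=computed, data=row.sinx, dev=dev))
    end
end
```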

### Exercise 6

Solution:
Group data in data frame `df` into buckets of 0.1 width and store the result in
`gdf` data frame (sort the groups). Use the `cut` function from
CategoricalArrays.jl to do it (check its documentation to learn how to do it).
Check the number of values in each group.

<details>
<summary>Solution</summary>

```
julia> using CategoricalArrays

@ -244,9 +222,15 @@ julia> combine(gdf, nrow) # alternative way to do it

You might get slightly different numbers but all should be around 10,000.

</details>
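
As the middle of this solution is cut out by the diff, a minimal sketch of the grouping step could be (assuming `df.x` lies in the [0, 1) interval):

```
using CategoricalArrays
df.x_bucket = cut(df.x, 0:0.1:1)          # buckets of width 0.1
gdf = groupby(df, :x_bucket, sort=true)
combine(gdf, nrow)
```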

### Exercise 7

Solution:
Display the grouping keys in `gdf` grouped data frame. Show them as named tuples.
Check what would be the group order if you asked not to sort them.

<details>
<summary>Solution</summary>

```
julia> NamedTuple.(keys(gdf))
@ -282,9 +266,14 @@ the resulting group order could depend on the type of grouping column, so if
you want to depend on the order of groups, always pass the `sort` keyword argument
explicitly.

</details>

### Exercise 8

Solution:
Compute average `n` for each group in `gdf`.

<details>
<summary>Solution</summary>

```
julia> using Statistics
@ -319,9 +308,16 @@ julia> combine(gdf, :n => mean) # alternative way to do it
 10 │ [0.9, 1.0)  14944.5
```

</details>

### Exercise 9

Solution:
Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
Use the `\` operator to fit it (recall it from chapter 4).
For each group produce the result as named tuple having fields `α₀` and `αₓ`.

<details>
<summary>Solution</summary>

```
julia> function fitmodel(x, n)
@ -364,9 +360,18 @@ julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that y
We note that indeed in the first and last group the regression has a significant
slope.

</details>
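
The body of `fitmodel` is not visible in the hunk above; a minimal sketch of the `\`-based fit it describes could be:

```
function fitmodel(x, n)
    X = [ones(length(x)) x]   # design matrix with an intercept
    α₀, αₓ = X \ n            # least-squares fit via the \ operator
    return (α₀=α₀, αₓ=αₓ)
end

combine(gdf, [:x, :n] => fitmodel => AsTable)
```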

### Exercise 10

Solution:
Repeat exercise 9 but using the GLM.jl package. This time
extract the p-value for the slope of estimated coefficient for `x` variable.
Use the `coeftable` function from GLM.jl to get this information.
Check the documentation of this function to learn how to do it (it will be
easiest for you to first convert its result to a `DataFrame`).

<details>
<summary>Solution</summary>

```
julia> using GLM
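# The rest of this block is truncated by the diff. A sketch of the idea
# (the coefficient-table column names may differ across GLM.jl versions):
combine(gdf) do sdf
    m = lm(@formula(n ~ x), sdf)
    ct = DataFrame(coeftable(m))
    (; p_x = ct[2, "Pr(>|t|)"])   # p-value of the slope on x
end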
@ -14,86 +14,8 @@ is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`.
Download this file and check if it indeed has this checksum.
You might need to read the documentation of the `string` and `join` functions.

### Exercise 2

Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
that contains the ego-nets of Eastern European users collected from the music
streaming service Deezer in February 2020. Nodes are users and edges are mutual
follower relationships.

From the file extract deezer_edges.json and deezer_target.csv files and
save them to disk.

### Exercise 3

Load deezer_edges.json and deezer_target.csv files to Julia.
The JSON file should be loaded as JSON3.jl object `edges_json`.
The CSV file should be loaded into a data frame `target_df`.

### Exercise 4

Check that keys in the `edges_json` are in the same order as `id` column
in `target_df`.

### Exercise 5

From every value stored in `edges_json` create a graph representing
ego-net of the given node. Store these graphs in a vector that will make the
`egonet` column of the `target_df` data frame.

### Exercise 6

Ego-net in our data set is a subgraph of a full Deezer graph where for some
node all its neighbors are included, but also it contains all edges between the
neighbors.
Therefore we expect that the diameter of every ego-net is at most 2 (as every
two nodes are either connected directly or by a common friend).
Check if this is indeed the case. Use the `diameter` function.

### Exercise 7

For each ego-net find a central node that is connected to every other node
in this network. Use the `degree` and `findall` functions to achieve this.
Add `center` column with numbers of nodes that are connected to all other
nodes in the ego-net to `target_df` data frame.

Next add a column `center_len` that gives the number of such nodes.

Check how many times different numbers of center nodes are found.

### Exercise 8

Add the following ego-net features to the `target_df` data frame:
* `size`: number of nodes in ego-net
* `mean_degree`: average node degree in ego-net

Check mean values of these two columns by `target` column.

### Exercise 9

Continuing to work with `target_df` data frame create a logistic regression
explaining `target` by `size` and `mean_degree`.

### Exercise 10

Continuing to work with `target_df` create a scatterplot where `size` will be on
one axis and `mean_degree` rounded to nearest integer on the other axis.
Plot the mean of `target` for each point being a combination of `size` and
rounded `mean_degree`.

Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
for values in range from 5% to 95% quantile (to concentrate on typical values
of size).

# Solutions

<details>

<summary>Show!</summary>

### Exercise 1

Solution:
<summary>Solution</summary>

```
using Downloads
@ -106,9 +28,20 @@ sha == shastr

The last line should produce `true`.

</details>
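
The downloading and hashing steps are hidden by the hunk above; a sketch of the idea, with the file URL left as a placeholder because it is not visible in this part of the diff:

```
using Downloads
using SHA
path = Downloads.download("<file URL from the exercise>")   # placeholder URL
shastr = open(path) do io
    join(string.(sha256(io), base=16, pad=2))   # hex-encode the digest bytes
end
sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"
sha == shastr
```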

### Exercise 2

Solution:
Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
that contains the ego-nets of Eastern European users collected from the music
streaming service Deezer in February 2020. Nodes are users and edges are mutual
follower relationships.

From the file extract deezer_edges.json and deezer_target.csv files and
save them to disk.

<details>
<summary>Solution</summary>

```
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
@ -125,9 +58,16 @@ end
close(archive)
```

</details>

### Exercise 3

Solution:
Load deezer_edges.json and deezer_target.csv files to Julia.
The JSON file should be loaded as JSON3.jl object `edges_json`.
The CSV file should be loaded into a data frame `target_df`.

<details>
<summary>Solution</summary>

```
using CSV
@ -137,17 +77,32 @@ edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)
```

</details>

### Exercise 4

Solution (short, but you need to have a good understanding of Julia types
and standard functions to properly write it):
Check that keys in the `edges_json` are in the same order as `id` column
in `target_df`.

<details>
<summary>Solution</summary>

This is short, but you need to have a good understanding of Julia types
and standard functions to properly write it:
```
Symbol.(target_df.id) == keys(edges_json)
```

</details>

### Exercise 5

Solution:
From every value stored in `edges_json` create a graph representing
ego-net of the given node. Store these graphs in a vector that will make the
`egonet` column of the `target_df` data frame.

<details>
<summary>Solution</summary>

```
using Graphs
@ -163,9 +118,19 @@ end
target_df.egonet = edgelist2graph(values(edges_json))
```

</details>

### Exercise 6

Solution:
Ego-net in our data set is a subgraph of a full Deezer graph where for some
node all its neighbors are included, but also it contains all edges between the
neighbors.
Therefore we expect that the diameter of every ego-net is at most 2 (as every
two nodes are either connected directly or by a common friend).
Check if this is indeed the case. Use the `diameter` function.

<details>
<summary>Solution</summary>

```
julia> extrema(diameter.(target_df.egonet))
@ -174,9 +139,21 @@ julia> extrema(diameter.(target_df.egonet))

Indeed we see that for each ego-net the diameter is 2.

</details>

### Exercise 7

Solution:
For each ego-net find a central node that is connected to every other node
in this network. Use the `degree` and `findall` functions to achieve this.
Add `center` column with numbers of nodes that are connected to all other
nodes in the ego-net to `target_df` data frame.

Next add a column `center_len` that gives the number of such nodes.

Check how many times different numbers of center nodes are found.

<details>
<summary>Solution</summary>

```
target_df.center = map(target_df.egonet) do g
@ -192,9 +169,18 @@ the condition we want to check.
We notice that in some cases it is impossible to identify the center of the
ego-net uniquely.

</details>
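
The body of the `map` call is not shown above; a sketch that matches the described `degree`/`findall` approach could be:

```
target_df.center = map(target_df.egonet) do g
    # a center is connected to all other nodes, i.e. has degree nv(g) - 1
    findall(==(nv(g) - 1), degree(g))
end
target_df.center_len = length.(target_df.center)
combine(groupby(target_df, :center_len), nrow)
```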

### Exercise 8

Solution:
Add the following ego-net features to the `target_df` data frame:
* `size`: number of nodes in ego-net
* `mean_degree`: average node degree in ego-net

Check mean values of these two columns by `target` column.

<details>
<summary>Solution</summary>

```
using Statistics
@ -206,9 +192,15 @@ combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
It seems that for target equal to `0`, size and average degree in the network are
a bit larger.

</details>
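
The feature-construction lines are hidden by the hunk above; a minimal sketch consistent with the visible `combine` call:

```
using Statistics
target_df.size = nv.(target_df.egonet)
target_df.mean_degree = [mean(degree(g)) for g in target_df.egonet]
combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
```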

### Exercise 9

Solution:
Continuing to work with `target_df` data frame create a logistic regression
explaining `target` by `size` and `mean_degree`.

<details>
<summary>Solution</summary>

```
using GLM
@ -217,9 +209,21 @@ glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())

We see that only `size` is statistically significant.

</details>

### Exercise 10

Solution:
Continuing to work with `target_df` create a scatterplot where `size` will be on
one axis and `mean_degree` rounded to nearest integer on the other axis.
Plot the mean of `target` for each point being a combination of `size` and
rounded `mean_degree`.

Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
for values in range from 5% to 95% quantile (to concentrate on typical values
of size).

<details>
<summary>Solution</summary>

```
using Plots
@ -242,6 +246,6 @@ plot(size_predict, target_predict;
    xlabel="size", ylabel="predicted target", legend=false)
```

Between quantiles 5% and 95% we see a downward shaped relationship.
Between quantiles 5% and 95% of `size` we see a downward shaped relationship.

</details>
@ -13,12 +13,47 @@ https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.
archive and extract primary_data.csv and secondary_data.csv files from it.
Save the files to disk.

<details>
<summary>Solution</summary>

```
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
close(archive)
```

</details>

### Exercise 2

Load primary_data.csv into the `primary` data frame.
Load secondary_data.csv into the `secondary` data frame.
Describe the contents of both data frames.

<details>
<summary>Solution</summary>

```
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)
```

</details>

### Exercise 3

Start with `primary` data. Note that columns starting from column 4 have
@ -32,6 +67,25 @@ three columns just after `class` column in the `parsed_primary` data frame.
Check `renamecols` keyword argument of `select` to
avoid renaming of the produced columns.

<details>
<summary>Solution</summary>

```
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
    idcols,
    numericcols .=> ByRow(parse_numeric),
    Not([idcols; numericcols]) .=> ByRow(parse_nominal);
    renamecols=false)
```

</details>

### Exercise 4

In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be
@ -49,119 +103,8 @@ Use the following rules:

For each found pair print to the screen the row number, family, name, and class.

### Exercise 5

Still using `parsed_primary` find what is the average probability of class being
`p` by `family`. Additionally add number of observations in each group. Sort
these results by the probability. Try using DataFramesMeta.jl to do this
exercise (this requirement is optional).

Store the result in `agg_primary` data frame.

### Exercise 6

Now using `agg_primary` data frame collapse it so that for each unique `pr_p`
it gives us a total number of rows that had this probability and a tuple
of mushroom family names.

Optionally: try to display the produced table so that the tuple containing the
list of families for each group is not cropped (this will require large
terminal).

### Exercise 7

From our preliminary analysis of `primary` data we see that `missing` value in
the primary data is non-informative, so in `secondary` data we should be
cautious when building a model if we allowed for missing data (in practice
if we were investigating some real mushroom we most likely would know its
characteristics).

Therefore as a first step drop in-place all columns in `secondary` data frame
that have missing values.

### Exercise 8

Create a logistic regression predicting `class` based on all remaining features
in the data frame. You might need to check the `Term` usage in StatsModels.jl
documentation.

You will notice that for `stem-color` and `habitat` columns you get strange
estimation results (large absolute values of estimated parameters and even
larger standard errors). Explain why this happens by analyzing frequency tables
of these variables against `class` column.

### Exercise 9

Add `class_p` column to `secondary` as a second column that will contain
predicted probability from the model created in exercise 8 of a given
observation having class `p`.

Print descriptive statistics of column `class_p` by `class`.

### Exercise 10

Plot FPR-TPR ROC curve for our model and compute associated AUC value.

# Solutions

<details>

<summary>Show!</summary>

### Exercise 1

Solution:

```
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
close(archive)
```

### Exercise 2

Solution:

```
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)
```

### Exercise 3

Solution:

```
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
    idcols,
    numericcols .=> ByRow(parse_numeric),
    Not([idcols; numericcols]) .=> ByRow(parse_nominal);
    renamecols=false)
```

### Exercise 4

Solution:
<summary>Solution</summary>

```
function overlap_numeric(v1, v2)
@ -200,9 +143,19 @@ end
Note that in this exercise using `eachrow` is not a problem
(although it is not type stable) because the data is small.

</details>
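
The matching logic is elided in the hunk above; a sketch of the helper functions it could rely on, assuming each parsed numeric attribute is a `[min, max]` vector and each parsed nominal attribute is a vector of allowed levels:

```
# two numeric ranges overlap when the [min, max] intervals intersect
overlap_numeric(v1, v2) = v1[1] <= v2[2] && v2[1] <= v1[2]

# two nominal attributes overlap when they share at least one level
overlap_nominal(v1, v2) = !isempty(intersect(v1, v2))
```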

### Exercise 5

Solution:
Still using `parsed_primary` find what is the average probability of class being
`p` by `family`. Additionally add number of observations in each group. Sort
these results by the probability. Try using DataFramesMeta.jl to do this
exercise (this requirement is optional).

Store the result in `agg_primary` data frame.

<details>
<summary>Solution</summary>

```
using Statistics
@ -214,17 +167,40 @@ agg_primary = @chain parsed_primary begin
end
```

</details>
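
The middle of the `@chain` block is not visible above; a sketch of how it could look with DataFramesMeta.jl (the `pr_p` and `nrow` column names follow the next exercise, which consumes them):

```
using DataFramesMeta, Statistics
agg_primary = @chain parsed_primary begin
    groupby(:family)
    @combine(:pr_p = mean(:class .== "p"), :nrow = length(:class))
    sort(:pr_p)
end
```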

### Exercise 6

Solution:
Now using `agg_primary` data frame collapse it so that for each unique `pr_p`
it gives us a total number of rows that had this probability and a tuple
of mushroom family names.

Optionally: try to display the produced table so that the tuple containing the
list of families for each group is not cropped (this will require large
terminal).

<details>
<summary>Solution</summary>

```
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families), truncate=140)
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families); truncate=140)
```

</details>

### Exercise 7

Solution:
From our preliminary analysis of `primary` data we see that `missing` value in
the primary data is non-informative, so in `secondary` data we should be
cautious when building a model if we allowed for missing data (in practice
if we were investigating some real mushroom we most likely would know its
characteristics).

Therefore as a first step drop in-place all columns in `secondary` data frame
that have missing values.

<details>
<summary>Solution</summary>

```
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
@ -233,9 +209,21 @@ select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
Note that we select based on actual contents of the columns and not by their
element type (a column could allow for missing values but not contain them).

</details>

### Exercise 8

Solution:
Create a logistic regression predicting `class` based on all remaining features
in the data frame. You might need to check the `Term` usage in StatsModels.jl
documentation.

You will notice that for `stem-color` and `habitat` columns you get strange
estimation results (large absolute values of estimated parameters and even
larger standard errors). Explain why this happens by analyzing frequency tables
of these variables against `class` column.

<details>
<summary>Solution</summary>

```
using GLM
@ -247,12 +235,21 @@ freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")
```

We can see that for cetrain levels of `stem-color` and `habitat` variables
We can see that for certain levels of `stem-color` and `habitat` variables
there is a perfect separation of classes.

</details>

### Exercise 9

Solution:
Add `class_p` column to `secondary` as a second column that will contain
predicted probability from the model created in exercise 8 of a given
observation having class `p`.

Print descriptive statistics of column `class_p` by `class`.

<details>
<summary>Solution</summary>

```
insertcols!(secondary, 2, :class_p => predict(model))
@ -264,9 +261,14 @@ end
We can see that the model has some discriminatory power, but there
is still a significant overlap between classes.

</details>

### Exercise 10

Solution:
Plot FPR-TPR ROC curve for our model and compute associated AUC value.

<details>
<summary>Solution</summary>

```
using Plots
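# The rest of this block is cut off at the end of the diff. A sketch of a
# package-free ROC computation (assuming the positive class is "p"):
using Statistics
ths = sort(unique(secondary.class_p))
tpr = [mean((secondary.class_p .>= t)[secondary.class .== "p"]) for t in ths]
fpr = [mean((secondary.class_p .>= t)[secondary.class .!= "p"]) for t in ths]
plot(fpr, tpr, xlabel="FPR", ylabel="TPR", legend=false)
auc = sum(-diff(fpr) .* (tpr[1:end-1] .+ tpr[2:end]) ./ 2)   # trapezoidal rule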