From 31d8428f6ad35a85d522a10dfa95fd099bdef7e6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Fri, 14 Oct 2022 13:43:12 +0200 Subject: [PATCH] update layout of all exercises --- exercises/exercises03.md | 136 +++++++++++--------- exercises/exercises04.md | 262 +++++++++++++++++++++------------------ exercises/exercises05.md | 194 ++++++++++++++++------------- exercises/exercises06.md | 151 ++++++++++++---------- exercises/exercises07.md | 160 +++++++++++++----------- exercises/exercises08.md | 140 +++++++++++---------- exercises/exercises09.md | 141 +++++++++++---------- exercises/exercises10.md | 184 +++++++++++++-------------- exercises/exercises11.md | 173 +++++++++++++------------- exercises/exercises12.md | 184 +++++++++++++-------------- exercises/exercises13.md | 242 ++++++++++++++++++------------------ 11 files changed, 1042 insertions(+), 925 deletions(-) diff --git a/exercises/exercises03.md b/exercises/exercises03.md index ce039f3..e719284 100644 --- a/exercises/exercises03.md +++ b/exercises/exercises03.md @@ -11,64 +11,8 @@ Check what methods does the `repeat` function have. Are they all covered in help for this function? -### Exercise 2 - -Write a function `fun2` that takes any vector and returns the difference between -the largest and the smallest element in this vector. - -### Exercise 3 - -Generate a vector of one million random numbers from `[0, 1]` interval. -Check what is a faster way to get a maximum and minimum element in it. One -option is by using the `maximum` and `minimum` functions and the other is by -using the `extrema` function. - -### Exercise 4 - -Assume you have accidentally typed `+x = 1` when wanting to assign `1` to -variable `x`. What effects can this operation have? - -### Exercise 5 - -What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why? - -### Exercise 6 - -Define two identical anonymous functions `x -> x + 1` in global scope? Do they -have the same type? - -### Exercise 7 - -Define the `wrap` function taking one argument `i` and returning the anonymous -function `x -> x + i`. Is the type of such anonymous function the same across -calls to `wrap` function? - -### Exercise 8 - -You want to write a function that accepts any `Integer` except `Bool` and returns -the passed value. If `Bool` is passed an error should be thrown. - -### Exercise 9 - -The `@time` macro measures time taken by an expression run and prints it, -but returns the value of the expression. -The `@elapsed` macro works differently - it does not print anything, but returns -time taken to evaluate an expression. Test the `@elapsed` macro by to see how -long it takes to shuffle a vector of one million floats. Use the `shuffle` function -from `Random` module. - -### Exercise 10 - -Using the `@btime` macro benchmark the time of calculating the sum of one million -random floats. - -# Solutions -
- -Show! - -### Exercise 1 +Solution Write: ``` @@ -93,8 +37,16 @@ and `repeat(c::Char, r::Integer)` is its faster version that accepts values that have `Char` type only (and it is invoked by Julia if value of type `Char` is passed as an argument to `repeat`). +
+ ### Exercise 2 +Write a function `fun2` that takes any vector and returns the difference between +the largest and the smallest element in this vector. + +
+Solution + You can define is as follows: ``` fun2(x::AbstractVector) = maximum(x) - minimum(x) @@ -109,8 +61,18 @@ end Note that these two functions will work with vectors of any elements that are ordered and support subtraction (they do not have to be numbers). +
+ ### Exercise 3 +Generate a vector of one million random numbers from `[0, 1]` interval. +Check what is a faster way to get a maximum and minimum element in it. One +option is by using the `maximum` and `minimum` functions and the other is by +using the `extrema` function. + +
+Solution + Here is a way to compare the performance of both options: ``` julia> using BenchmarkTools @@ -130,8 +92,16 @@ As you can see in this situation, although `extrema` does the operation in a single pass over `x` it is slower than computing `minimum` and `maximum` in two passes. +
+ ### Exercise 4 +Assume you have accidentally typed `+x = 1` when wanting to assign `1` to +variable `x`. What effects can this operation have? + +
+Solution + If it is a fresh Julia session you define a new function in `Main` for `+` operator: ``` @@ -167,8 +137,15 @@ julia> +x=1 ERROR: error in method definition: function Base.+ must be explicitly imported to be extended ``` +
+ ### Exercise 5 +What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why? + +
+Solution + You get an empty vector: ``` julia> subtypes(Union{Float64, Missing}) @@ -181,8 +158,16 @@ declared types that have names (type of such types is `DataType` in Julia). *Extra* for this reason `subtypes` has a limited use. To check if one type is a subtype of some other type use the `<:` operator. +
+
### Exercise 6

+Define two identical anonymous functions `x -> x + 1` in global scope. Do they
+have the same type?
+
+Solution + No, each of them has a different type: ``` julia> f1 = x -> x + 1 @@ -215,8 +200,17 @@ julia> @time sum(x -> x^2, 1:10) 385 ``` +
+ ### Exercise 7 +Define the `wrap` function taking one argument `i` and returning the anonymous +function `x -> x + i`. Is the type of such anonymous function the same across +calls to `wrap` function? + +
+Solution + Yes, the type is the same: ``` @@ -252,8 +246,16 @@ julia> @time sumi(3) 3025 ``` +
+ ### Exercise 8 +You want to write a function that accepts any `Integer` except `Bool` and returns +the passed value. If `Bool` is passed an error should be thrown. + +
+Solution + We check subtypes of `Integer`: ``` @@ -292,8 +294,20 @@ julia> fun2(true) ERROR: ArgumentError: Bool is not supported ``` +
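*Extra* (a sketch, not from the book): you can get the same effect with multiple
dispatch alone, because a method for `Bool` is more specific than a method for
`Integer` (the name `fun3` is made up for this illustration):

```
fun3(x::Integer) = x  # catches all integers
fun3(::Bool) = throw(ArgumentError("Bool is not supported"))  # more specific, wins for Bool

fun3(5)     # returns 5
fun3(true)  # throws ArgumentError
```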
+
### Exercise 9

+The `@time` macro measures the time taken to run an expression and prints it,
+but returns the value of the expression.
+The `@elapsed` macro works differently - it does not print anything, but returns
+the time taken to evaluate an expression. Use the `@elapsed` macro to see how
+long it takes to shuffle a vector of one million floats. Use the `shuffle` function
+from the `Random` module.
+
+Solution + Here is the code that performs the task: ``` julia> using Random # needed to get access to shuffle @@ -312,8 +326,16 @@ julia> @elapsed shuffle(x) Note that the first time we run `shuffle` it takes longer due to compilation. +
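*Extra* (an assumed alternative, not required by the exercise): if you want a timing
that is not distorted by compilation of the first call, BenchmarkTools.jl provides
the `@belapsed` macro, which runs the expression many times and returns the minimum
time in seconds:

```
using BenchmarkTools
using Random
x = rand(10^6)
@belapsed shuffle($x)  # minimum time in seconds across many samples
```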
+ ### Exercise 10 +Using the `@btime` macro benchmark the time of calculating the sum of one million +random floats. + +
+Solution + The code you can use is: ``` diff --git a/exercises/exercises04.md b/exercises/exercises04.md index 26ebcef..e2e324b 100644 --- a/exercises/exercises04.md +++ b/exercises/exercises04.md @@ -12,11 +12,81 @@ Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix columnwise with consecutive numbers). Next calculate sum, mean and standard deviation of each row and each column of this matrix. +
+Solution + +Write: +``` +julia> using Statistics + +julia> mat = [1 3 5 + 2 4 6] +2×3 Matrix{Int64}: + 1 3 5 + 2 4 6 + +julia> sum(mat, dims=1) +1×3 Matrix{Int64}: + 3 7 11 + +julia> sum(mat, dims=2) +2×1 Matrix{Int64}: + 9 + 12 + +julia> mean(mat, dims=1) +1×3 Matrix{Float64}: + 1.5 3.5 5.5 + +julia> mean(mat, dims=2) +2×1 Matrix{Float64}: + 3.0 + 4.0 + +julia> std(mat, dims=1) +1×3 Matrix{Float64}: + 0.707107 0.707107 0.707107 + +julia> std(mat, dims=2) +2×1 Matrix{Float64}: + 2.0 + 2.0 +``` + +Observe that the returned statistics are also stored in matrices. +If we compute them for columns (`dims=1`) then the produced matrix has one row. +If we compute them for rows (`dims=2`) then the produced matrix has one column. + +
+ ### Exercise 2 For each column of the matrix created in exercise 1 compute its range (i.e. the difference between maximum and minimum element stored in it). +
+Solution + +Here are some ways you can do it: +``` +julia> [maximum(x) - minimum(x) for x in eachcol(mat)] +3-element Vector{Int64}: + 1 + 1 + 1 + +julia> map(x -> maximum(x) - minimum(x), eachcol(mat)) +3-element Vector{Int64}: + 1 + 1 + 1 +``` + +Observe that if we used `eachcol` the produced result is a vector (not a matrix +like in exercise 1). + +
+ ### Exercise 3 This is data for car speed (mph) and distance taken to stop (ft) @@ -79,127 +149,8 @@ speed dist Load this data into Julia (this is part of the exercise) and fit a linear regression where speed is a feature and distance is target variable. -### Exercise 4 - -Plot the data loaded in exercise 4. Additionally plot the fitted regression -(you need to check Plots.jl documentation to find a way to do this). - -### Exercise 5 - -A simple code for calculation of Fibonacci numbers for positive -arguments is as follows: - -``` -fib(n) =n < 3 ? 1 : fib(n-1) + fib(n-2) -``` - -Using the BenchmarkTools.jl package measure runtime of this function for -`n` ranging from `1` to `20`. - -### Exercise 6 - -Improve the speed of code from exercise 5 by using a dictionary where you -store a mapping of `n` to `fib(n)`. Measure the performance of this function -for the same range of values as in exercise 5. - -### Exercise 7 - -Create a vector containing named tuples representing elements of a 4x4 grid. -So the first element of this vector should be `(x=1, y=1)` and last should be -`(x=4, y=4)`. Store the vector in variable `v`. - -### Exercise 8 - -The `filter` function allows you to select some values of an input collection. -Check its documentation first. Next, use it to keep from the vector `v` from -exercise 7 only elements whose sum is even. - -### Exercise 9 - -Check the documentation of the `filter!` function. Perform the same operation -as asked in exercise 8 but using `filter!`. What is the difference? - -### Exercise 10 - -Write a function that takes a number `n`. Next it generates two independent -random vectors of length `n` and returns their correlation coefficient. -Run this function `10000` times for `n` equal to `10`, `100`, `1000`, -and `10000`. -Create a plot with four histograms of distribution of computed Pearson -correlation coefficient. Check in the Plots.jl package which function can be -used to plot histograms. - -# Solutions -
- -Show! - -### Exercise 1 - -Write: -``` -julia> using Statistics - -julia> mat = [1 3 5 - 2 4 6] -2×3 Matrix{Int64}: - 1 3 5 - 2 4 6 - -julia> sum(mat, dims=1) -1×3 Matrix{Int64}: - 3 7 11 - -julia> sum(mat, dims=2) -2×1 Matrix{Int64}: - 9 - 12 - -julia> mean(mat, dims=1) -1×3 Matrix{Float64}: - 1.5 3.5 5.5 - -julia> mean(mat, dims=2) -2×1 Matrix{Float64}: - 3.0 - 4.0 - -julia> std(mat, dims=1) -1×3 Matrix{Float64}: - 0.707107 0.707107 0.707107 - -julia> std(mat, dims=2) -2×1 Matrix{Float64}: - 2.0 - 2.0 -``` - -Observe that the returned statistics are also stored in matrices. -If we compute them for columns (`dims=1`) then the produced matrix has one row. -If we compute them for rows (`dims=2`) then the produced matrix has one column. - -### Exercise 2 - -Here are some ways you can do it: -``` -julia> [maximum(x) - minimum(x) for x in eachcol(mat)] -3-element Vector{Int64}: - 1 - 1 - 1 - -julia> map(x -> maximum(x) - minimum(x), eachcol(mat)) -3-element Vector{Int64}: - 1 - 1 - 1 -``` - -Observe that if we used `eachcol` the produced result is a vector (not a matrix -like in exercise 1). - -### Exercise 3 +Solution First create a matrix with source data by copy pasting it from the exercise like this: @@ -285,8 +236,16 @@ julia> [ones(50) data[:, 1]] \ data[:, 2] 3.9324087591240877 ``` +
+
### Exercise 4

+Plot the data loaded in exercise 3. Additionally plot the fitted regression
+(you need to check the Plots.jl documentation to find a way to do this).
+
+Solution + Run the following: ``` using Plots @@ -296,8 +255,23 @@ scatter(data[:, 1], data[:, 2]; The `smooth=true` keyword argument adds the linear regression line to the plot. +
+
### Exercise 5

+A simple function computing Fibonacci numbers for positive
+arguments is as follows:
+
+```
+fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
+```
+
+Using the BenchmarkTools.jl package measure the runtime of this function for
+`n` ranging from `1` to `20`.
+
+Solution + Use the following code: ``` julia> using BenchmarkTools @@ -331,8 +305,17 @@ julia> for i in 1:40 Notice that execution time for number `n` is roughly sum of ececution times for numbers `n-1` and `n-2`. +
+ ### Exercise 6 +Improve the speed of code from exercise 5 by using a dictionary where you +store a mapping of `n` to `fib(n)`. Measure the performance of this function +for the same range of values as in exercise 5. + +
+Solution + Use the following code: ``` @@ -422,8 +405,17 @@ julia> @time fib2(200) As you can see the code does less allocations and is faster now. +
+ ### Exercise 7 +Create a vector containing named tuples representing elements of a 4x4 grid. +So the first element of this vector should be `(x=1, y=1)` and last should be +`(x=4, y=4)`. Store the vector in variable `v`. + +
+Solution + Since we are asked to create a vector we can write: ``` @@ -470,8 +462,17 @@ julia> [(; x, y) for x in 1:4, y in 1:4] (x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4) ``` +
+ ### Exercise 8 +The `filter` function allows you to select some values of an input collection. +Check its documentation first. Next, use it to keep from the vector `v` from +exercise 7 only elements whose sum is even. + +
+Solution + To get help on the `filter` function write `?filter`. Next run: ``` @@ -487,8 +488,16 @@ julia> filter(e -> iseven(e.x + e.y), v) (x = 4, y = 4) ``` +
+ ### Exercise 9 +Check the documentation of the `filter!` function. Perform the same operation +as asked in exercise 8 but using `filter!`. What is the difference? + +
+Solution + To get help on the `filter!` function write `?filter!`. Next run: ``` @@ -518,8 +527,21 @@ julia> v Notice that `filter` allocated a new vector, while `filter!` updated the `v` vector in place. +
+ ### Exercise 10 +Write a function that takes a number `n`. Next it generates two independent +random vectors of length `n` and returns their correlation coefficient. +Run this function `10000` times for `n` equal to `10`, `100`, `1000`, +and `10000`. +Create a plot with four histograms of distribution of computed Pearson +correlation coefficient. Check in the Plots.jl package which function can be +used to plot histograms. + +
+Solution + You can use for example the following code: ``` diff --git a/exercises/exercises05.md b/exercises/exercises05.md index f91a442..24b46cb 100644 --- a/exercises/exercises05.md +++ b/exercises/exercises05.md @@ -10,93 +10,8 @@ Create a matrix containing truth table for `&&` and `||` operations. -### Exercise 2 - -The `issubset` function checks if one collection is a subset of other -collection. - -Now take a range `4:6` and check if it is a subset of ranges `4+k:4-k` for -`k` varying from `1` to `3`. Store the result in a vector. - -### Exercise 3 - -Write a function that accepts two vectors and returns `true` if they have equal -length and otherwise returns `false`. - -### Exercise 4 - -Consider the vectors `x = [1, 2, 1, 2, 1, 2]`, -`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`. -Calculate their Adjusted Mutual Information using scikit-learn. - -### Exercise 5 - -Using Adjusted Mutual Information function from exercise 4 generate -a pair of random vectors of length 100 containing integer numbers from the -range `1:5`. Repeat this exercise 1000 times and plot a histogram of AMI. -Check in the documentation of the `rand` function how you can draw a sample -from a collection of values. - -### Exercise 6 - -Adjust the code from exercise 5 but replace first 50 elements of each vector -with zero. Repeat the experiment. - -### Exercise 7 - -Write a function that takes a vector of integer values and returns a dictionary -giving information how many times each integer was present in the passed vector. - -Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`, -and `v3 = 3:5`. - -### Exercise 8 - -Write code that creates a `Bool` diagonal matrix of size 5x5. - -### Exercise 9 - -Write a code comparing performance of calculation of sum of logarithms of -elements of a vector `1:100` using broadcasting and the `sum` function vs only -the `sum` function taking a function as a first argument. - -### Exercise 10 - -Create a dictionary in which for each number from `1` to `10` you will store -a vector of its positive divisors. You can check the reminder of division -of two values using the `rem` function. - -Additionally (not covered in the book), you can drop elements -from a comprehension if you add an `if` clause after the `for` clause, for -example to keep only odd numbers from range `1:10` do: - -``` -julia> [i for i in 1:10 if isodd(i)] -5-element Vector{Int64}: - 1 - 3 - 5 - 7 - 9 -``` - -You can populate a dictionary by passing a vector of pairs to it (not covered in -the book), for example: - -``` -julia> Dict(["a" => 1, "b" => 2]) -Dict{String, Int64} with 2 entries: - "b" => 2 - "a" => 1 -``` - -# Solutions -
- -Show! - -### Exercise 1 +Solution You can do it as follows: ``` @@ -113,8 +28,19 @@ julia> [true, false] .|| [true false] Note that the first array is a vector, while the second array is a 1-row matrix. +
+
### Exercise 2

+The `issubset` function checks if one collection is a subset of another
+collection.
+
+Now take a range `4:6` and check if it is a subset of ranges `4-k:4+k` for
+`k` varying from `1` to `3`. Store the result in a vector.
+
+Solution + You can do it like this using broadcasting: ``` julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3]) @@ -125,16 +51,33 @@ julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3]) ``` Note that you need to use `Ref` to protect `4:6` from being broadcasted over. +
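*Extra* (an equivalent sketch): the same result can be computed without `Ref` by
using a comprehension and the infix form `⊆` of `issubset`:

```
[4:6 ⊆ 4-k:4+k for k in 1:3]  # [false, true, true], matching the broadcasted result
```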
+ ### Exercise 3 +Write a function that accepts two vectors and returns `true` if they have equal +length and otherwise returns `false`. + +
+Solution
+
This function can be written as follows:
```
equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
```
+
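A quick sanity check of this definition (hypothetical example calls):

```
equallength([1, 2, 3], [4, 5, 6])  # true, both have length 3
equallength([1, 2], [1, 2, 3])     # false, lengths 2 and 3 differ
```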
+ ### Exercise 4 +Consider the vectors `x = [1, 2, 1, 2, 1, 2]`, +`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`. +Calculate their Adjusted Mutual Information using scikit-learn. + +
+Solution + You can do this exercise as follows: ``` julia> using PyCall @@ -151,8 +94,19 @@ julia> metrics.adjusted_mutual_info_score(y, z) -0.21267989848846763 ``` +
+ ### Exercise 5 +Using Adjusted Mutual Information function from exercise 4 generate +a pair of random vectors of length 100 containing integer numbers from the +range `1:5`. Repeat this exercise 1000 times and plot a histogram of AMI. +Check in the documentation of the `rand` function how you can draw a sample +from a collection of values. + +
+Solution + You can create such a plot using the following commands: ``` @@ -163,8 +117,16 @@ histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100)) You can check that AMI oscillates around 0. +
+ ### Exercise 6 +Adjust the code from exercise 5 but replace first 50 elements of each vector +with zero. Repeat the experiment. + +
+Solution + This time it is convenient to write a helper function. Note that we use broadcasting to update values in the vectors. @@ -182,8 +144,19 @@ histogram([exampleAMI() for i in 1:1000], label="AMI") Note that this time AMI is a bit below 0.5, which shows a better match between vectors. +
+ ### Exercise 7 +Write a function that takes a vector of integer values and returns a dictionary +giving information how many times each integer was present in the passed vector. + +Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`, +and `v3 = 3:5`. + +
+Solution + ``` julia> function counter(v::AbstractVector{<:Integer}) d = Dict{eltype(v), Int}() @@ -219,8 +192,15 @@ Dict{Int64, Int64} with 3 entries: Note that we used the `eltype` function to set a proper key type for dictionary `d`. +
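*Extra* (an assumed alternative, not the book's approach): the `countmap` function
from the StatsBase.jl package builds a very similar dictionary of counts:

```
using StatsBase
countmap([1, 2, 3, 2, 3, 3])  # Dict(1 => 1, 2 => 2, 3 => 3)
```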
+ ### Exercise 8 +Write code that creates a `Bool` diagonal matrix of size 5x5. + +
+Solution + This is a way to do it: ``` julia> 1:5 .== (1:5)' @@ -246,8 +226,17 @@ julia> I(5) ⋅ ⋅ ⋅ ⋅ 1 ``` +
+ ### Exercise 9 +Write a code comparing performance of calculation of sum of logarithms of +elements of a vector `1:100` using broadcasting and the `sum` function vs only +the `sum` function taking a function as a first argument. + +
+Solution + Here is how you can do it: ``` @@ -265,8 +254,41 @@ julia> @btime sum(log, 1:100) As you can see using the `sum` function with `log` as its first argument is a bit faster as it is not allocating. +
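*Extra* (an assumed alternative): `mapreduce` gives yet another non-allocating way
to compute the same quantity:

```
mapreduce(log, +, 1:100)  # same value as sum(log, 1:100)
```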
+
### Exercise 10

+Create a dictionary in which for each number from `1` to `10` you will store
+a vector of its positive divisors. You can check the remainder of division
+of two values using the `rem` function.
+
+Additionally (not covered in the book), you can drop elements
+from a comprehension if you add an `if` clause after the `for` clause, for
+example to keep only odd numbers from the range `1:10` do:
+
+```
+julia> [i for i in 1:10 if isodd(i)]
+5-element Vector{Int64}:
+ 1
+ 3
+ 5
+ 7
+ 9
+```
+
+You can populate a dictionary by passing a vector of pairs to it (not covered in
+the book), for example:
+
+```
+julia> Dict(["a" => 1, "b" => 2])
+Dict{String, Int64} with 2 entries:
+  "b" => 2
+  "a" => 1
+```
+
+Solution + Here is how you can do it: ``` diff --git a/exercises/exercises06.md b/exercises/exercises06.md index 76d55eb..656a5cc 100644 --- a/exercises/exercises06.md +++ b/exercises/exercises06.md @@ -11,16 +11,47 @@ Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"` (replace `3` by a proper interpolation expression) and replace `USD` by `$`. +
+Solution + +``` +julia> "I have apples worth $(1+2)\$" +"I have apples worth 3\$" +``` + +
+ ### Exercise 2 Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data` as `iris.csv` to your local folder. +
+Solution + +``` +import Downloads +Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", + "iris.csv") +``` + +
+ ### Exercise 3 Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"` in two lines so that it takes less horizontal space. +
+Solution + +``` +"https://archive.ics.uci.edu/ml/\ + machine-learning-databases/iris/iris.data" +``` + +
+ ### Exercise 4 Load data stored in `iris.csv` file into a `data` vector where each element @@ -28,73 +59,9 @@ should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed as floats). -### Exercise 5 - -The `data` structure is a vector of named tuples, change it to a named tuple -of vectors (with the same field names) and call it `data2`. - -### Exercise 6 - -Calculate the frequency of each type of Iris type (`c` field in `data2`). - -### Exercise 7 - -Create a vector `c2` that is derived from `c` in `data2` but holds inline strings, -vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s. -Compare sizes of the three objects. - -### Exercise 8 - -You know that `refs` field of `PooledArray` stores an integer index of a given -value in it. Using this information make a scatter plot of `pl` vs `pw` vectors -in `data2`, but for each Iris type give a different point color (check the -`color` keyword argument meaning in the Plots.jl manual; you can use the -`plot_color` function). - -### Exercise 9 - -Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to -`str` variable (do not copy paste the string, but type it). - -### Exercise 10 - -In the `str` string from exercise 9 find all matches of a pattern where `a` -is followed by `b` but there can be some characters between them. - -# Solutions -
+Solution -Show! - -### Exercise 1 - -Solution: -``` -julia> "I have apples worth $(1+2)\$" -"I have apples worth 3\$" -``` - -### Exercise 2 - -Solution: -``` -import Downloads -Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", - "iris.csv") -``` - -### Exercise 3 - -Solution: -``` -"https://archive.ics.uci.edu/ml/\ - machine-learning-databases/iris/iris.data" -``` - -### Exercise 4 - -Solution: ``` julia> function line_parser(line) elements = split(line, ",") @@ -125,8 +92,16 @@ Note that we used `1:end-1` selector to drop last element from the read lines since it is empty. This is the reason why adding the `@assert length(elements) == 5` check in the `line_parser` function is useful. +
+ ### Exercise 5 +The `data` structure is a vector of named tuples, change it to a named tuple +of vectors (with the same field names) and call it `data2`. + +
+Solution + Later in the book you will learn more advanced ways to do it. Here let us use a most basic approach: @@ -138,9 +113,15 @@ data2 = (sl=[d.sl for d in data], c=[d.c for d in data]) ``` +
+ ### Exercise 6 -Solution: +Calculate the frequency of each type of Iris type (`c` field in `data2`). + +
+Solution + ``` julia> using FreqTables @@ -153,9 +134,17 @@ Dim1 │ "Iris-virginica" │ 50 ``` +
+ ### Exercise 7 -Solution: +Create a vector `c2` that is derived from `c` in `data2` but holds inline strings, +vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s. +Compare sizes of the three objects. + +
+Solution + ``` julia> using InlineStrings @@ -213,16 +202,34 @@ julia> Base.summarysize(c4) 1240 ``` +
+ ### Exercise 8 -Solution: +You know that `refs` field of `PooledArray` stores an integer index of a given +value in it. Using this information make a scatter plot of `pl` vs `pw` vectors +in `data2`, but for each Iris type give a different point color (check the +`color` keyword argument meaning in the Plots.jl manual; you can use the +`plot_color` function). + +
+Solution + ``` using Plots scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false) ``` +
+ ### Exercise 9 +Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to +`str` variable (do not copy paste the string, but type it). + +
+Solution + The hard part is typing `²`, `⟺` and `∨`. You can check how to do it using help: ``` help?> ² @@ -237,8 +244,16 @@ help?> ∨ Save the string in the `str` variable as we will use it in the next exercise. +
+ ### Exercise 10 +In the `str` string from exercise 9 find all matches of a pattern where `a` +is followed by `b` but there can be some characters between them. + +
+Show! + The exercise does not specify how the matching should be done. If we want it to be eager (match as much as possible), we write: diff --git a/exercises/exercises07.md b/exercises/exercises07.md index 8dd7cb8..c031aad 100644 --- a/exercises/exercises07.md +++ b/exercises/exercises07.md @@ -19,75 +19,10 @@ If you want to understand all the parameters plese check their meaning For us it is enough that this request generates 10 random integers in the range from 1 to 6. Run this query in Julia and parse the result. -### Exercise 2 - -Write a function that tries to parse a string as an integer. -If it succeeds it should return the integer, otherwise it should return `0` -but print error message. - -### Exercise 3 - -Create a matrix containing truth table for `&&` operation including `missing`. -If some operation errors store `"error"` in the table. As an extra feature (this -is harder so you can skip it) in each cell store both inputs and output to make -reading the table easier. - -### Exercise 4 - -Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all -missing values in it by the mean of the non-missing values. - -### Exercise 5 - -Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse -strings stored in it as `Float64`, while keeping `missing` values unchanged. - -### Exercise 6 - -Print to the terminal all days in January 2023 that are Mondays. - -### Exercise 7 - -Compute the dates that are one month later than January 15, 2020, February 15 -2020, March 15, 2020, and April 15, 2020. How many days pass during this one -month. Print the results to the screen? - -### Exercise 8 - -Parse the following string as JSON: -``` -str = """ -[{"x":1,"y":1}, - {"x":2,"y":4}, - {"x":3,"y":9}, - {"x":4,"y":16}, - {"x":5,"y":25}] -""" -``` -into a `json` variable. - -### Exercise 9 - -Extract from the `json` variable from exercise 8 two vectors `x` and `y` -that correspond to the fields stored in the JSON structure. -Plot `y` as a function of `x`. - -### Exercise 10 - -Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`. -Use linear interpolation for filling missing values. For the extreme values -use nearest available observation (you will need to consult Impute.jl -documentation to find all required functions). - -# Solutions -
+Solution -Show! - -### Exercise 1 - -Solution (example run): +Example run: ``` julia> using HTTP @@ -109,8 +44,17 @@ julia> parse.(Int, split(String(response.body))) 6 ``` +
+
### Exercise 2

+Write a function that tries to parse a string as an integer.
+If it succeeds it should return the integer, otherwise it should return `0`
+but print an error message.
+
+Solution + Example function: ``` @@ -160,9 +104,17 @@ end ``` But this time we do not see the cause of the error. +
+ ### Exercise 3 -Solution: +Create a matrix containing truth table for `&&` operation including `missing`. +If some operation errors store `"error"` in the table. As an extra feature (this +is harder so you can skip it) in each cell store both inputs and output to make +reading the table easier. + +
+Solution ``` julia> function apply_and(x, y) @@ -181,9 +133,15 @@ julia> apply_and.([true, false, missing], [true false missing]) "missing && true = error" "missing && false = error" "missing && missing = error" ``` +
+ ### Exercise 4 -Solution: +Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all +missing values in it by the mean of the non-missing values. + +
+Solution ``` julia> using Statistics @@ -198,9 +156,15 @@ julia> coalesce.(v, mean(skipmissing(v))) 3.5 ``` +
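*Extra* (an equivalent sketch): the same replacement can be expressed with the
`replace` function, which returns a new vector with the missing values filled in:

```
using Statistics
replace(v, missing => mean(skipmissing(v)))
```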
+ ### Exercise 5 -Solution: +Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse +strings stored in it as `Float64`, while keeping `missing` values unchanged. + +
+Solution ``` julia> using Missings @@ -215,9 +179,16 @@ julia> passmissing(parse).(Float64, s) missing ``` +
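*Extra* (an equivalent sketch that does not use Missings.jl): a comprehension that
checks for `missing` explicitly gives the same result:

```
[ismissing(x) ? missing : parse(Float64, x) for x in s]
```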
+ ### Exercise 6 -Example solution: +Print to the terminal all days in January 2023 that are Mondays. + +
+Solution + +Example: ``` julia> using Dates @@ -232,9 +203,18 @@ julia> for day in Date.(2023, 01, 1:31) 2023-01-30 ``` +
+
### Exercise 7

+Compute the dates that are one month later than January 15, 2020, February 15,
+2020, March 15, 2020, and April 15, 2020. How many days pass during this one
+month? Print the results to the screen.
+
+Solution + +Example: ``` julia> for day in Date.(2023, 1:4, 15) @@ -247,9 +227,24 @@ julia> for day in Date.(2023, 1:4, 15) 2023-04-15 + 1 month = 2023-05-15 (difference: 30 days) ``` +
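The varying differences simply reflect the lengths of the months that are crossed.
A small check (a sketch using the `daysinmonth` function from `Dates`):

```
using Dates
daysinmonth.(Date.(2023, 1:4, 15))  # [31, 28, 31, 30], matching the printed differences
```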
+ ### Exercise 8 -Solution: +Parse the following string as JSON: +``` +str = """ +[{"x":1,"y":1}, + {"x":2,"y":4}, + {"x":3,"y":9}, + {"x":4,"y":16}, + {"x":5,"y":25}] +""" +``` +into a `json` variable. + +
+Solution ``` julia> using JSON3 @@ -278,9 +273,16 @@ julia> json = JSON3.read(str) } ``` +
+ ### Exercise 9 -Solution: +Extract from the `json` variable from exercise 8 two vectors `x` and `y` +that correspond to the fields stored in the JSON structure. +Plot `y` as a function of `x`. + +
+Solution ``` using Plots @@ -289,9 +291,17 @@ y = [el.y for el in json] plot(x, y, xlabel="x", ylabel="y", legend=false) ``` +
+ ### Exercise 10 -Solution: +Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`. +Use linear interpolation for filling missing values. For the extreme values +use nearest available observation (you will need to consult Impute.jl +documentation to find all required functions). + +
+Solution ``` julia> using Impute diff --git a/exercises/exercises08.md b/exercises/exercises08.md index 72da36d..bfd49ec 100644 --- a/exercises/exercises08.md +++ b/exercises/exercises08.md @@ -11,63 +11,8 @@ Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame` called `df`. -### Exercise 2 - -Get number of rows, columns, column names and summary statistics of the -`df` data frame from exercise 1. - -### Exercise 3 - -Make a plot of `number` against `square` columns of `df` data frame. - -### Exercise 4 - -Add a column to `df` data frame with name `name string` containing string -representation of numbers in column `number`, i.e. -`["one", "two", "three", "four"]`. - -### Exercise 5 - -Check if `df` contains column `square2`. - -### Exercise 6 - -Extract column `number` from `df` and empty it (recall `empty!` function -discussed in chapter 4). - -### Exercise 7 - -In `Random` module the `randexp` function is defined that samples numbers -from exponential distribution with scale 1. -Draw two 100,000 element samples from this distribution store them -in `x` and `y` vectors. Plot histograms of maximum of pairs of sampled values -and sum of vector `x` and half of vector `y`. - -### Exercise 8 - -Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them, -and maximum of pairs of sampled values and sum of vector `x` and half of vector `y`. -Compute all standard descriptive statistics of columns of this data frame. - -### Exercise 9 - -Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file. -Compare the size of created files using the `filesize` function. - -### Exercise 10 - -Write the `df` data frame into SQLite database. Next find information about -tables in this database. Run a query against a table representing the `df` data -frame to calculate the mean of column `x`. Does it match the result we got in -exercise 8? - -# Solutions -
- -Show! - -### Exercise 1 +Solution CSV.jl supports reading gzip-compressed files so you can just do: @@ -106,9 +51,15 @@ julia> df = CSV.read(plain, DataFrame) 4 │ 4 16 ``` +
+ ### Exercise 2 -Solution: +Get number of rows, columns, column names and summary statistics of the +`df` data frame from exercise 1. + +
+Solution ``` julia> nrow(df) @@ -131,17 +82,30 @@ julia> describe(df) 2 │ square 7.75 2 6.5 16 0 Int64 ``` +
+ ### Exercise 3 -Solution: +Make a plot of `number` against `square` columns of `df` data frame. + +
+Solution + ``` using Plots plot(df.number, df.square, xlabel="number", ylabel="square", legend=false) ``` +
+ ### Exercise 4 -Solution: +Add a column to `df` data frame with name `name string` containing string +representation of numbers in column `number`, i.e. +`["one", "two", "three", "four"]`. + +
+Solution ``` julia> df."name string" = ["one", "two", "three", "four"] @@ -164,8 +128,15 @@ julia> df Note that we needed to use a string as we have space in column name. +
+ ### Exercise 5 +Check if `df` contains column `square2`. + +
+Solution + You can use either `hasproperty` or `columnindex`: ``` @@ -184,9 +155,15 @@ julia> df.square2 ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square ``` +
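*Extra* (an assumed equivalent check): you can also test membership in the list of
column names directly:

```
"square2" in names(df)  # false
```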
+ ### Exercise 6 -Solution: +Extract column `number` from `df` and empty it (recall `empty!` function +discussed in chapter 4). + +
+Solution ``` julia> empty!(df[:, :number]) @@ -198,9 +175,19 @@ as it would corrupt the `df` data frame (these operations do non-copying extraction of a column from a data frame as opposed to `df[:, :number]` which makes a copy). +
+
### Exercise 7

+In the `Random` module the `randexp` function is defined that samples numbers
+from the exponential distribution with scale 1.
+Draw two 100,000 element samples from this distribution and store them
+in `x` and `y` vectors. Plot histograms of the maximum of pairs of sampled values
+and of the sum of vector `x` and half of vector `y`.
+
+Solution + ``` using Random using Plots @@ -212,10 +199,19 @@ histogram!(max.(x, y), label="maximum") I have put both histograms on the same plot to show that they overlap. +
+
### Exercise 8

+Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them,
+the maximum of pairs of sampled values, and the sum of vector `x` and half of vector `y`.
+Compute all standard descriptive statistics of the columns of this data frame.
+
+Solution + +You might get slightly different results because we did not set +the seed of random number generator when creating `x` and `y` vectors: ``` julia> df = DataFrame(x=x, y=y); @@ -238,8 +234,16 @@ julia> describe(df, :all) We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary statistics except `first` and `last` as expected. +
+ ### Exercise 9 +Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file. +Compare the size of created files using the `filesize` function. + +
+Solution + ``` julia> using Arrow @@ -258,8 +262,18 @@ julia> filesize("df.arrow") In this case Apache Arrow file is smaller. +
+ ### Exercise 10 +Write the `df` data frame into SQLite database. Next find information about +tables in this database. Run a query against a table representing the `df` data +frame to calculate the mean of column `x`. Does it match the result we got in +exercise 8? + +
+Solution + ``` julia> using SQLite diff --git a/exercises/exercises09.md b/exercises/exercises09.md index acb6dd1..ec87bfc 100644 --- a/exercises/exercises09.md +++ b/exercises/exercises09.md @@ -22,69 +22,8 @@ Create `matein2` data frame that will have only puzzles that have `"mateIn2"` in the `Themes` column. Use the `contains` function (check its documentation first). -### Exercise 2 - -What is the fraction of puzzles that are mate in 2 in relation to all puzzles -in the `puzzles` data frame? - -### Exercise 3 - -Create `small` data frame that holds first 10 rows of `matein2` data frame -and columns `Rating`, `RatingDeviation`, and `NbPlays`. - -### Exercise 4 - -Iterate rows of `small` data frame and print the ratio of -`RatingDeviation` and `NbPlays` for each row. - -### Exercise 5 - -Get names of columns from the `matein2` data frame that end with `n` (ignore case). - -### Exercise 6 - -Write a function `collatz` that runs the following process. Start with a -positive number `n`. If it is even divide it by two. If it is odd multiply -it by 3 and add one. The function should return the number of steps needed to -reach 1. - -Create a `d` dictionary that maps number of steps needed to a list of numbers from -the range `1:100` that required this number of steps. - -### Exercise 7 - -Using the `d` dictionary make a scatter plot of number of steps required -vs average value of numbers that require this number of steps. - -### Exercise 8 - -Repeat the process from exercises 6 and 7, but this time use a data frame -and try to write an appropriate expression using the `combine` and `groupby` -functions (as it was explained in the last part of chapter 9). This time -perform computations for numbers ranging from one to one million. - -### Exercise 9 - -Set seed of random number generator to `1234`. Draw 100 random points -from the interval `[0, 1]`. Store this vector in a data frame as `x` column. -Now compute `y` column using a formula `4 * (x - 0.5) ^ 2`. -Add random noise to column `y` that has normal distribution with mean 0 and -standard deviation 0.25. Call this column `z`. -Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis. - -### Exercise 10 - -Add a line of LOESS regression of `x` explaining `z` plot to figure produced in exercise 10. - -# Solutions -
- -Show! - -### Exercise 1 - -Solution: +Solution ``` julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :] @@ -104,9 +43,17 @@ julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :] 1 column and 274127 rows omitted ``` +
+ ### Exercise 2 -Solution (two ways to do it): +What is the fraction of puzzles that are mate in 2 in relation to all puzzles +in the `puzzles` data frame? + +
+Solution + +Two ways to do it: ``` julia> using Statistics @@ -118,9 +65,15 @@ julia> mean(contains.(puzzles.Themes, "mateIn2")) 0.12852152542746353 ``` +
+ ### Exercise 3 -Solution: +Create `small` data frame that holds first 10 rows of `matein2` data frame +and columns `Rating`, `RatingDeviation`, and `NbPlays`. + +
+Solution ``` julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]] @@ -140,9 +93,15 @@ julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]] 10 │ 979 144 14 ``` +
+ ### Exercise 4 -Solution: +Iterate rows of `small` data frame and print the ratio of +`RatingDeviation` and `NbPlays` for each row. + +
+Solution ``` julia> for row in eachrow(small) @@ -160,9 +119,16 @@ julia> for row in eachrow(small) 10.285714285714286 ``` +
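*Extra* (a sketch): if you only need the ratios and not the printing, broadcasting
over whole columns gives the same numbers in a single expression:

```
small.RatingDeviation ./ small.NbPlays  # a 10-element vector of ratios
```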
+ ### Exercise 5 -Solution (several options): +Get names of columns from the `matein2` data frame that end with `n` (ignore case). + +
+Solution + +Several options: ``` julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N')) 2-element Vector{String}: @@ -180,9 +146,20 @@ julia> names(matein2, r"[nN]$") "RatingDeviation" ``` +
+ ### Exercise 6 -Solution: +Write a function `collatz` that runs the following process. Start with a +positive number `n`. If it is even divide it by two. If it is odd multiply +it by 3 and add one. The function should return the number of steps needed to +reach 1. + +Create a `d` dictionary that maps number of steps needed to a list of numbers from +the range `1:100` that required this number of steps. + +
+Solution ``` julia> function collatz(n) @@ -232,9 +209,15 @@ Dict{Int64, Vector{Int64}} with 45 entries: As we can see even for small `n` the number of steps required to reach `1` can get quite large. +
+ ### Exercise 7 -Solution: +Using the `d` dictionary make a scatter plot of number of steps required +vs average value of numbers that require this number of steps. + +
+Solution ``` using Plots @@ -247,9 +230,17 @@ scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=fal Note that we needed to use `collect` on `keys` as `scatter` expects an array not just an iterator. +
+ ### Exercise 8 -Solution: +Repeat the process from exercises 6 and 7, but this time use a data frame +and try to write an appropriate expression using the `combine` and `groupby` +functions (as it was explained in the last part of chapter 9). This time +perform computations for numbers ranging from one to one million. + +
+Solution ``` df = DataFrame(n=1:10^6); @@ -258,6 +249,8 @@ agg = combine(groupby(df, :collatz), :n => mean); scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false) ``` +
+ ### Exercise 9 Set seed of random number generator to `1234`. Draw 100 random points @@ -267,7 +260,8 @@ Add random noise to column `y` that has normal distribution with mean 0 and standard deviation 0.25. Call this column `z`. Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis. -Solution: +
+Solution ``` using Random @@ -278,9 +272,14 @@ df.z = df.y + randn(100) / 4 scatter(df.x, [df.y df.z], labels=["y" "z"]) ``` +
+
### Exercise 10

+Add a LOESS regression line of `z` explained by `x` to the figure produced in exercise 9.
+
+Solution ``` using Loess diff --git a/exercises/exercises10.md b/exercises/exercises10.md index 7c5c604..6449b43 100644 --- a/exercises/exercises10.md +++ b/exercises/exercises10.md @@ -13,89 +13,8 @@ independently and uniformly from the [0,1[ interval. Create a data frame using data from this matrix using auto-generated column names. -### Exercise 2 - -Now, using matrix `mat` create a data frame with randomly generated -column names. Use the `randstring` function from the `Random` module -to generate them. Store this data frame in `df` variable. - -### Exercise 3 - -Create a new data frame, taking `df` as a source that will have the same -columns but its column names will be `y1`, `y2`, `y3`, `y4`. - -### Exercise 4 - -Create a dictionary holding `column_name => column_vector` pairs -using data stored in data frame `df`. Save this dictionary in variable `d`. - -### Exercise 5 - -Create a data frame back from dictionary `d` from exercise 4. Compare it -with `df`. - -### Exercise 6 - -For data frame `df` compute the dot product between all pairs of its columns. -Use the `dot` function from the `LinearAlgebra` module. - -### Exercise 7 - -Given two data frames: - -``` -julia> df1 = DataFrame(a=1:2, b=11:12) -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 11 - 2 │ 2 12 - -julia> df2 = DataFrame(a=1:2, c=101:102) -2×2 DataFrame - Row │ a c - │ Int64 Int64 -─────┼────────────── - 1 │ 1 101 - 2 │ 2 102 -``` - -vertically concatenate them so that only columns that are present in both -data frames are kept. Check the documentation of `vcat` to see how to -do it. - -### Exercise 8 - -Now append to `df1` table `df2`, but add only the columns from `df2` that -are present in `df1`. Check the documentation of `append!` to see how to -do it. - -### Exercise 9 - -Create a `circle` data frame, using the `push!` function that will store -1000 samples of the following process: -* draw `x` and `y` uniformly and independently from the [-1,1[ interval; -* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1` - and is `false` otherwise. - -Compute summary statistics of this data frame. - -### Exercise 10 - -Create a scatterplot of `circle` data frame where its `x` and `y` axis -will be the plotted points and `inside` variable will determine the color -of the plotted point. - -# Solutions -
- -Show! - -### Exercise 1 - -Solution: +Solution ``` julia> using DataFrames @@ -120,9 +39,16 @@ julia> DataFrame(mat, :auto) 5 │ 0.714515 0.861872 0.971521 0.176768 ``` +
+ ### Exercise 2 -Solution: +Now, using matrix `mat` create a data frame with randomly generated +column names. Use the `randstring` function from the `Random` module +to generate them. Store this data frame in `df` variable. + +
+Solution ``` julia> using Random @@ -139,10 +65,16 @@ julia> df = DataFrame(mat, [randstring() for _ in 1:4]) 5 │ 0.714515 0.861872 0.971521 0.176768 ``` +
### Exercise 3 -Solution: +Create a new data frame, taking `df` as a source that will have the same +columns but its column names will be `y1`, `y2`, `y3`, `y4`. + +
+Solution + ``` julia> DataFrame(["y$i" => df[!, i] for i in 1:4]) 5×4 DataFrame @@ -170,9 +102,15 @@ julia> rename(df, string.("y", 1:4)) 5 │ 0.714515 0.861872 0.971521 0.176768 ``` +
+ ### Exercise 4 -Solution: +Create a dictionary holding `column_name => column_vector` pairs +using data stored in data frame `df`. Save this dictionary in variable `d`. + +
+Solution ``` julia> d = Dict([n => df[:, n] for n in names(df)]) @@ -194,9 +132,15 @@ Dict{Symbol, AbstractVector} with 4 entries: Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521] ``` +
+ ### Exercise 5 -Solution: +Create a data frame back from dictionary `d` from exercise 4. Compare it +with `df`. + +
+Solution ``` julia> DataFrame(d) @@ -215,9 +159,15 @@ Note that columns of a data frame are now sorted by their names. This is done for `Dict` objects because such dictionaries do not have a defined order of keys. +
+ ### Exercise 6 -Solution: +For data frame `df` compute the dot product between all pairs of its columns. +Use the `dot` function from the `LinearAlgebra` module. + +
+Solution ``` julia> using LinearAlgebra @@ -232,9 +182,36 @@ julia> pairwise(dot, eachcol(df)) 1.50558 1.18411 0.909744 1.47431 ``` +
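*Extra* (an assumed equivalent formulation): the same matrix can be built without
`pairwise` by using a two-dimensional comprehension over pairs of columns:

```
using LinearAlgebra
[dot(c1, c2) for c1 in eachcol(df), c2 in eachcol(df)]  # 4×4 Matrix{Float64}
```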
+ ### Exercise 7 -Solution: +Given two data frames: + +``` +julia> df1 = DataFrame(a=1:2, b=11:12) +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 11 + 2 │ 2 12 + +julia> df2 = DataFrame(a=1:2, c=101:102) +2×2 DataFrame + Row │ a c + │ Int64 Int64 +─────┼────────────── + 1 │ 1 101 + 2 │ 2 102 +``` + +vertically concatenate them so that only columns that are present in both +data frames are kept. Check the documentation of `vcat` to see how to +do it. + +
+Solution ``` julia> vcat(df1, df2, cols=:intersect) @@ -255,9 +232,16 @@ julia> vcat(df1, df2) ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2 ``` +
+ ### Exercise 8 -Solution: +Now append to `df1` table `df2`, but add only the columns from `df2` that +are present in `df1`. Check the documentation of `append!` to see how to +do it. + +
+Solution ``` julia> append!(df1, df2, cols=:subset) @@ -271,9 +255,20 @@ julia> append!(df1, df2, cols=:subset) 4 │ 2 missing ``` +
+ ### Exercise 9 -Solution +Create a `circle` data frame, using the `push!` function that will store +1000 samples of the following process: +* draw `x` and `y` uniformly and independently from the [-1,1[ interval; +* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1` + and is `false` otherwise. + +Compute summary statistics of this data frame. + +
+Solution ``` circle=DataFrame() @@ -287,9 +282,16 @@ describe(circle) We note that mean of variable `inside` is approximately π. +
+ ### Exercise 10 -Solution: +Create a scatterplot of `circle` data frame where its `x` and `y` axis +will be the plotted points and `inside` variable will determine the color +of the plotted point. + +
+Solution ``` using Plots diff --git a/exercises/exercises11.md b/exercises/exercises11.md index bb69101..1315250 100644 --- a/exercises/exercises11.md +++ b/exercises/exercises11.md @@ -13,83 +13,8 @@ sampled from uniform distribution on [0, 1[ interval. Serialize it to disk, and next deserialize. Check if the deserialized object is the same as the source data frame. -### Exercise 2 - -Add a column `n` to the `df` data frame that in each row will hold the -number of observations in column `x` that have distance less than `0.1` to -a value stored in a given row of `x`. - -### Exercise 3 - -Investigate visually how does `n` depend on `x` in data frame `df`. - -### Exercise 4 - -Someone has prepared the following test data for you: -``` -teststr = """ -"x","sinx" -0.139279,0.138829 -0.456779,0.441059 -0.344034,0.337287 -0.140253,0.139794 -0.848344,0.750186 -0.977512,0.829109 -0.032737,0.032731 -0.702750,0.646318 -0.422339,0.409895 -0.393878,0.383772 -""" -``` - -Load this data into `testdf` data frame. - -### Exercise 5 - -Check the accuracy of computations of sinus of `x` in `testdf`. -Print all rows for which the absolute difference is greater than `5e-7`. -In this case display `x`, `sinx`, the exact value of `sin(x)` and the absolute -difference. - -### Exercise 6 - -Group data in data frame `df` into buckets of 0.1 width and store the result in -`gdf` data frame (sort the groups). Use the `cut` function from -CategoricalArrays.jl to do it (check its documentation to learn how to do it). -Check the number of values in each group. - -### Exercise 7 - -Display the grouping keys in `gdf` grouped data frame. Show them as named tuples. -Check what would be the group order if you asked not to sort them. - -### Exercise 8 - -Compute average `n` for each group in `gdf`. - -### Exercise 9 - -Fit a linear model explaining `n` by `x` separately for each group in `gdf`. -Use the `\` operator to fit it (recall it from chapter 4). -For each group produce the result as named tuple having fields `α₀` and `αₓ`. - -### Exercise 10 - -Repeat exercise 9 but using the GLM.jl package. This time -extract the p-value for the slope of estimated coefficient for `x` variable. -Use the `coeftable` function from GLM.jl to get this information. -Check the documentation of this function to learn how to do it (it will be -easiest for you to first convert its result to a `DataFrame`). - -# Solutions -
- -Show! - -### Exercise 1 - -Solution: +Solution ``` julia> using DataFrames @@ -104,9 +29,16 @@ julia> deserialize("df.bin") == df true ``` +
+ ### Exercise 2 -Solution: +Add a column `n` to the `df` data frame that in each row will hold the +number of observations in column `x` that have distance less than `0.1` to +a value stored in a given row of `x`. + +
+Solution A simple approach is: ``` @@ -151,9 +83,14 @@ df.n = f2(df.x) In this solution the fact that we used function barrier is even more relevant as we explicitly use loops inside. +
+
### Exercise 3

+Investigate visually how `n` depends on `x` in the `df` data frame.
+
+Solution
+
```
using Plots
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
```

As expected, on the border of the domain the number of neighbors drops.
+
+ ### Exercise 4 -Solution: +Someone has prepared the following test data for you: +``` +teststr = """ +"x","sinx" +0.139279,0.138829 +0.456779,0.441059 +0.344034,0.337287 +0.140253,0.139794 +0.848344,0.750186 +0.977512,0.829109 +0.032737,0.032731 +0.702750,0.646318 +0.422339,0.409895 +0.393878,0.383772 +""" +``` + +Load this data into `testdf` data frame. + +
+Solution ``` julia> using CSV @@ -188,8 +147,18 @@ julia> testdf = CSV.read(IOBuffer(teststr), DataFrame) 10 │ 0.393878 0.383772 ``` +
+
### Exercise 5

+Check the accuracy of the computation of the sine of `x` in `testdf`.
+Print all rows for which the absolute difference is greater than `5e-7`.
+For these rows display `x`, `sinx`, the exact value of `sin(x)`, and the absolute
+difference.
+
+Solution + Since data frame is small we can use `eachrow`: ``` @@ -202,9 +171,18 @@ julia> for row in eachrow(testdf) (x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7) ``` +
+ ### Exercise 6 -Solution: +Group data in data frame `df` into buckets of 0.1 width and store the result in +`gdf` data frame (sort the groups). Use the `cut` function from +CategoricalArrays.jl to do it (check its documentation to learn how to do it). +Check the number of values in each group. + +
+Solution + ``` julia> using CategoricalArrays @@ -244,9 +222,15 @@ julia> combine(gdf, nrow) # alternative way to do it You might get a bit different numbers but all should be around 10,000. +
+ ### Exercise 7 -Solution: +Display the grouping keys in `gdf` grouped data frame. Show them as named tuples. +Check what would be the group order if you asked not to sort them. + +
+Solution ``` julia> NamedTuple.(keys(gdf)) @@ -282,9 +266,14 @@ the resulting group order could depend on the type of grouping column, so if you want to depend on the order of groups always spass `sort` keyword argument explicitly. +
+ ### Exercise 8 -Solution: +Compute average `n` for each group in `gdf`. + +
+Solution ``` julia> using Statistics @@ -319,9 +308,16 @@ julia> combine(gdf, :n => mean) # alternative way to do it 10 │ [0.9, 1.0) 14944.5 ``` +
+ ### Exercise 9 -Solution: +Fit a linear model explaining `n` by `x` separately for each group in `gdf`. +Use the `\` operator to fit it (recall it from chapter 4). +For each group produce the result as named tuple having fields `α₀` and `αₓ`. + +
+Solution ``` julia> function fitmodel(x, n) @@ -364,9 +360,18 @@ julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that y We note that indeed in the first and last group the regression has a significant slope. +
+ ### Exercise 10 -Solution: +Repeat exercise 9 but using the GLM.jl package. This time +extract the p-value for the slope of estimated coefficient for `x` variable. +Use the `coeftable` function from GLM.jl to get this information. +Check the documentation of this function to learn how to do it (it will be +easiest for you to first convert its result to a `DataFrame`). + +
+Solution ``` julia> using GLM diff --git a/exercises/exercises12.md b/exercises/exercises12.md index fd67ca0..8ae2a56 100644 --- a/exercises/exercises12.md +++ b/exercises/exercises12.md @@ -14,86 +14,8 @@ is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`. Download this file and check if it indeed has this checksum. You might need to read documentation of `string` and `join` functions. -### Exercise 2 - -Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip -that contains the ego-nets of Eastern European users collected from the music -streaming service Deezer in February 2020. Nodes are users and edges are mutual -follower relationships. - -From the file extract deezer_edges.json and deezer_target.csv files and -save them to disk. - -### Exercise 3 - -Load deezer_edges.json and deezer_target.csv files to Julia. -The JSON file should be loaded as JSON3.jl object `edges_json`. -The CSV file should be loaded into a data frame `target_df`. - -### Exercise 4 - -Check that keys in the `edges_json` are in the same order as `id` column -in `target_df`. - -### Exercise 5 - -From every value stored in `edges_json` create a graph representing -ego-net of the given node. Store these graphs in a vector that will make the -`egonet` column of in the `target_df` data frame. - -### Exercise 6 - -Ego-net in our data set is a subgraph of a full Deezer graph where for some -node all its neighbors are included, but also it contains all edges between the -neighbors. -Therefore we expect that diameter of every ego-net is at most 2 (as every -two nodes are either connected directly or by a common friend). -Check if this is indeed the case. Use the `diameter` function. - -### Exercise 7 - -For each ego-net find a central node that is connected to every other node -in this network. Use the `degree` and `findall` functions to achieve this. -Add `center` column with numbers of nodes that are connected to all other -nodes in the ego-net to `target_df` data frame. - -Next add a column `center_len` that gives the number of such nodes. - -Check how many times different numbers of center nodes are found. - -### Exercise 8 - -Add the following ego-net features to the `target_df` data frame: -* `size`: number of nodes in ego-net -* `mean_degree`: average node degree in ego-net - -Check mean values of these two columns by `target` column. - -### Exercise 9 - -Continuing to work with `target_df` data frame create a logistic regression -explaining `target` by `size` and `mean_degree`. - -### Exercise 10 - -Continuing to work with `target_df` create a scatterplot where `size` will be on -one axis and `mean_degree` rounded to nearest integer on the other axis. -Plot the mean of `target` for each point being a combination of `size` and -rounded `mean_degree`. - -Additionally fit a LOESS model explaining `target` by `size`. Make a prediction -for values in range from 5% to 95% quantile (to concentrate on typical values -of size). - -# Solutions -
-
-Show!
-
-### Exercise 1
-
-Solution:
+Solution

```
using Downloads
@@ -106,9 +28,20 @@ sha == shastr
The last line should produce `true`.

+
+
### Exercise 2

-Solution:
+Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
+that contains the ego-nets of Eastern European users collected from the music
+streaming service Deezer in February 2020. Nodes are users and edges are mutual
+follower relationships.
+
+From the file extract deezer_edges.json and deezer_target.csv files and
+save them to disk.
+
+
+Solution

```
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
@@ -125,9 +58,16 @@ end
close(archive)
```

+
+
### Exercise 3

-Solution:
+Load the deezer_edges.json and deezer_target.csv files into Julia.
+The JSON file should be loaded as a JSON3.jl object `edges_json`.
+The CSV file should be loaded into a data frame `target_df`.
+
+
+Solution

```
using CSV
@@ -137,17 +77,32 @@ edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)
```

+
+
### Exercise 4

-Solution (short, but you need to have a good understanding of Julia types
-and standar functions to properly write it):
+Check that keys in the `edges_json` are in the same order as `id` column
+in `target_df`.
+
+
+Solution
+
+This is short, but you need to have a good understanding of Julia types
+and standard functions to properly write it:

```
Symbol.(target_df.id) == keys(edges_json)
```

+
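If the `Symbol`-based one-liner feels terse, an equivalent but more explicit comparison (a sketch; it assumes the keys of `edges_json` are `Symbol`s and that `target_df.id` stores integers) is:

```
# Convert both sides to Vector{String} so that element types match,
# then == checks both the values and their order.
[String(k) for k in keys(edges_json)] == string.(target_df.id)
```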
+
### Exercise 5

-Solution:
+From every value stored in `edges_json` create a graph representing
+the ego-net of the given node. Store these graphs in a vector that will
+make the `egonet` column in the `target_df` data frame.
+
+
+Solution

```
using Graphs
@@ -163,9 +118,19 @@ end
target_df.egonet = edgelist2graph(values(edges_json))
```

+
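To see the same pattern in isolation, here is a tiny self-contained example (toy data, not the Deezer set) of turning an edge list into a `SimpleGraph` and querying it; `diameter`, used in the next exercise, works on such graphs directly:

```
using Graphs

g = SimpleGraph(4)                        # 4 nodes, no edges yet
for (a, b) in [(1, 2), (2, 3), (1, 3), (3, 4)]
    add_edge!(g, a, b)                    # add each edge from the list
end
nv(g), ne(g), diameter(g)                 # (4, 4, 2)
```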
+
### Exercise 6

-Solution:
+An ego-net in our data set is a subgraph of the full Deezer graph in which,
+for some node, all its neighbors are included together with all edges between
+those neighbors.
+Therefore we expect the diameter of every ego-net to be at most 2 (as every
+two nodes are either connected directly or through a common friend).
+Check if this is indeed the case. Use the `diameter` function.
+
+
+Solution

```
julia> extrema(diameter.(target_df.egonet))
@@ -174,9 +139,21 @@ julia> extrema(diameter.(target_df.egonet))
Indeed we see that for each ego-net diameter is 2.

+
+
### Exercise 7

-Solution:
+For each ego-net find a central node that is connected to every other node
+in this network. Use the `degree` and `findall` functions to achieve this.
+Add `center` column with numbers of nodes that are connected to all other
+nodes in the ego-net to `target_df` data frame.
+
+Next add a column `center_len` that gives the number of such nodes.
+
+Check how many times different numbers of center nodes are found.
+
+
+Solution

```
target_df.center = map(target_df.egonet) do g
@@ -192,9 +169,18 @@ the condition we want to check.
We notice that in some cases it is impossible to identify the center of the
ego-net uniquely.

+
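The condition tested inside `findall` can also be written as a small standalone helper (a sketch, not part of the original solution): in an ego-net a center is any node whose degree equals `nv(g) - 1`, that is, a node adjacent to every other node.

```
# Sketch: indices of all nodes connected to every other node of g.
center_nodes(g) = findall(==(nv(g) - 1), degree(g))
```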
+
### Exercise 8

-Solution:
+Add the following ego-net features to the `target_df` data frame:
+* `size`: number of nodes in ego-net
+* `mean_degree`: average node degree in ego-net
+
+Check mean values of these two columns by `target` column.
+
+
+Solution

```
using Statistics
@@ -206,9 +192,15 @@ combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
It seems that for target equal to `0` size and average degree in the network
are a bit larger.

+
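As a cross-check (a sketch, assuming the `egonet` column from exercise 5 and the `mean_degree` column created here), the average degree can also be computed directly from edge and vertex counts, because the degrees of a graph sum to twice the number of edges:

```
# Mean degree of each ego-net via the identity mean(degree) == 2 * ne / nv.
alt_mean_degree = [2 * ne(g) / nv(g) for g in target_df.egonet]
alt_mean_degree ≈ target_df.mean_degree   # should hold up to rounding
```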
+
### Exercise 9

-Solution:
+Continuing to work with `target_df` data frame create a logistic regression
+explaining `target` by `size` and `mean_degree`.
+
+
+Solution

```
using GLM
@@ -217,9 +209,21 @@ glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
We see that only `size` is statistically significant.

+
+
### Exercise 10

-Solution:
+Continuing to work with `target_df` create a scatterplot where `size` will be on
+one axis and `mean_degree` rounded to nearest integer on the other axis.
+Plot the mean of `target` for each point being a combination of `size` and
+rounded `mean_degree`.
+
+Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
+for values in range from 5% to 95% quantile (to concentrate on typical values
+of size).
+
+
+Solution

```
using Plots
@@ -242,6 +246,6 @@ plot(size_predict, target_predict;
     xlabel="size", ylabel="predicted target",
     legend=false)
```

-Between quantiles 5% and 95% we see a downward shaped relationship.
+Between quantiles 5% and 95% of `size` we see a downward-sloping relationship.

</details>
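The LOESS part of this solution falls outside the lines shown in the hunk. A hypothetical sketch of that step (assuming Loess.jl and the `size` and `target` columns; the real code may differ) is:

```
using Loess
using Statistics

q5, q95 = quantile(target_df.size, [0.05, 0.95])        # typical range of size
model = loess(Float64.(target_df.size), Float64.(target_df.target))
size_predict = collect(range(q5, q95; length=100))
target_predict = predict(model, size_predict)
```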
diff --git a/exercises/exercises13.md b/exercises/exercises13.md
index 7be60b4..d452814 100644
--- a/exercises/exercises13.md
+++ b/exercises/exercises13.md
@@ -13,12 +13,47 @@ https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.
archive and extract primary_data.csv and secondary_data.csv files from it.
Save the files to disk.

+
+Solution
+
+```
+using Downloads
+import ZipFile
+Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
+archive = ZipFile.Reader("MushroomDataset.zip")
+idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
+open("primary_data.csv", "w") do io
+    write(io, read(archive.files[idx]))
+end
+idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
+open("secondary_data.csv", "w") do io
+    write(io, read(archive.files[idx]))
+end
+close(archive)
+```
+
+
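The two extraction blocks above differ only in the file name. If you prefer to avoid the repetition, a small helper (hypothetical, not part of the original solution) does the same thing:

```
import ZipFile

# Extract a single file, identified by (part of) its name, from an open archive.
function extract_file(archive::ZipFile.Reader, name::AbstractString)
    idx = only(findall(f -> contains(f.name, name), archive.files))
    open(name, "w") do io
        write(io, read(archive.files[idx]))
    end
end

archive = ZipFile.Reader("MushroomDataset.zip")
extract_file(archive, "primary_data.csv")
extract_file(archive, "secondary_data.csv")
close(archive)
```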
+
### Exercise 2

Load primary_data.csv into the `primary` data frame.
Load secondary_data.csv into the `secondary` data frame.
Describe the contents of both data frames.

+
+Solution
+
+```
+using CSV
+using DataFrames
+primary = CSV.read("primary_data.csv", DataFrame; delim=';')
+secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
+describe(primary)
+describe(secondary)
+```
+
+
+
### Exercise 3

Start with `primary` data. Note that columns starting from column 4 have
@@ -32,6 +67,25 @@ three columns just after `class` column in the `parsed_primary` data frame.
Check `renamecols` keyword argument of `select` to avoid renaming of the
produced columns.

+
+Solution
+
+```
+parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
+parse_nominal(::Missing) = missing
+parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
+parse_numeric(::Missing) = missing
+idcols = ["family", "name", "class"]
+numericcols = ["cap-diameter", "stem-height", "stem-width"]
+parsed_primary = select(primary,
+    idcols,
+    numericcols .=> ByRow(parse_numeric),
+    Not([idcols; numericcols]) .=> ByRow(parse_nominal);
+    renamecols=false)
+```
+
+
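To see what the two parsers do, you can call them on literal strings of the bracketed form they expect (hypothetical inputs; expected results shown in the comments):

```
parse_numeric("[10.5, 20.0]")   # returns [10.5, 20.0] as a Vector{Float64}
parse_nominal("[a, b, c]")      # returns ["a", "b", "c"] (a vector of substrings)
parse_numeric(missing)          # propagates missing
```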
+
### Exercise 4

In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be
@@ -49,119 +103,8 @@ Use the following rules:

For each found pair print to the screen the row number, family, name,
and class.

-### Exercise 5
-
-Still using `parsed_primary` find what is the average probability of class being
-`p` by `family`. Additionally add number of observations in each group. Sort
-these results by the probability. Try using DataFramesMeta.jl to do this
-exercise (this requirement is optional).
-
-Store the result in `agg_primary` data frame.
-
-### Exercise 6
-
-Now using `agg_primary` data frame collapse it so that for each unique `pr_p`
-it gives us a total number of rows that had this probability and a tuple
-of mushroom family names.
-
-Optionally: try to display the produced table so that the tuple containing the
-list of families for each group is not cropped (this will require large
-terminal).
-
-### Exercise 7
-
-From our preliminary analysis of `primary` data we see that `missing` value in
-the primary data is non-informative, so in `secondary` data we should be
-cautious when building a model if we allowed for missing data (in practice
-if we were investigating some real mushroom we most likely would know its
-characteristics).
-
-Therefore as a first step drop in-place all columns in `secondary` data frame
-that have missing values.
-
-### Exercise 8
-
-Create a logistic regression predicting `class` based on all remaining features
-in the data frame. You might need to check the `Term` usage in StatsModels.jl
-documentation.
-
-You will notice that for `stem-color` and `habitat` columns you get strange
-estimation results (large absolute values of estimated parameters and even
-larger standard errors). Explain why this happens by analyzing frequency tables
-of these variables against `class` column.
-
-### Exercise 9
-
-Add `class_p` column to `secondary` as a second column that will contain
-predicted probability from the model created in exercise 8 of a given
-observation having class `p`.
-
-Print descriptive statistics of column `class_p` by `class`.
-
-### Exercise 10
-
-Plot FPR-TPR ROC curve for our model and compute associated AUC value.
-
-# Solutions
-<details>
-
-Show!
-
-### Exercise 1
-
-Solution:
-
-```
-using Downloads
-import ZipFile
-Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
-archive = ZipFile.Reader("MushroomDataset.zip")
-idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
-open("primary_data.csv", "w") do io
-    write(io, read(archive.files[idx]))
-end
-idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
-open("secondary_data.csv", "w") do io
-    write(io, read(archive.files[idx]))
-end
-close(archive)
-```
-
-### Exercise 2
-
-Solution:
-
-```
-using CSV
-using DataFrames
-primary = CSV.read("primary_data.csv", DataFrame; delim=';')
-secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
-describe(primary)
-describe(secondary)
-```
-
-### Exercise 3
-
-Solution:
-
-```
-parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
-parse_nominal(::Missing) = missing
-parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
-parse_numeric(::Missing) = missing
-idcols = ["family", "name", "class"]
-numericcols = ["cap-diameter", "stem-height", "stem-width"]
-parsed_primary = select(primary,
-    idcols,
-    numericcols .=> ByRow(parse_numeric),
-    Not([idcols; numericcols]) .=> ByRow(parse_nominal);
-    renamecols=false)
-```
-
-### Exercise 4
-
-Solution:
+Solution

```
function overlap_numeric(v1, v2)
@@ -200,9 +143,19 @@ end

Note that in this exercise using `eachrow` is not a problem (although it is
not type stable) because the data is small.

+
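The body of `overlap_numeric` is outside the lines shown in this hunk. One plausible sketch of such a test (hypothetical, assuming the numeric columns hold `[min, max]` vectors as produced in exercise 3, and ignoring `missing` handling) is:

```
# Sketch only: two ranges overlap if each one starts before the other ends.
overlap_numeric_sketch(v1, v2) = first(v1) <= last(v2) && first(v2) <= last(v1)
```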
+
### Exercise 5

-Solution:
+Still using `parsed_primary`, find the average probability of the class being
+`p` by `family`. Additionally add the number of observations in each group. Sort
+these results by the probability. Try using DataFramesMeta.jl to do this
+exercise (this requirement is optional).
+
+Store the result in the `agg_primary` data frame.
+
+
+Solution

```
using Statistics
@@ -214,17 +167,40 @@ agg_primary = @chain parsed_primary begin
end
```

+
+
### Exercise 6

-Solution:
+Now collapse the `agg_primary` data frame so that for each unique `pr_p`
+it gives the total number of rows that had this probability and a tuple
+of mushroom family names.
+
+Optionally: try to display the produced table so that the tuple containing the
+list of families for each group is not cropped (this will require a large
+terminal).
+
+
+Solution

```
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families), truncate=140)
```

+
+
### Exercise 7

-Solution:
+From our preliminary analysis of `primary` data we see that `missing` value in
+the primary data is non-informative, so in `secondary` data we should be
+cautious when building a model if we allowed for missing data (in practice
+if we were investigating some real mushroom we most likely would know its
+characteristics).
+
+Therefore as a first step drop in-place all columns in `secondary` data frame
+that have missing values.
+
+
+Solution

```
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
@@ -233,9 +209,21 @@ select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
Note that we select based on actual contents of the columns and not by their
element type (column could allow for missing values but not have them).

+
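An equivalent formulation of the same idea (a sketch): first build the list of names of the columns without `missing` values, then pass it to `select!`.

```
keep = [name for (name, col) in pairs(eachcol(secondary)) if !any(ismissing, col)]
select!(secondary, keep)
```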
+
### Exercise 8

-Solution:
+Create a logistic regression predicting `class` based on all remaining features
+in the data frame. You might need to check the `Term` usage in StatsModels.jl
+documentation.
+
+You will notice that for `stem-color` and `habitat` columns you get strange
+estimation results (large absolute values of estimated parameters and even
+larger standard errors). Explain why this happens by analyzing frequency tables
+of these variables against `class` column.
+
+
+Solution

```
using GLM
@@ -247,12 +235,21 @@ freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")
```

-We can see that for cetrain levels of `stem-color` and `habitat` variables
+We can see that for certain levels of `stem-color` and `habitat` variables
there is a perfect separation of classes.

+
+
### Exercise 9

-Solution:
+Add a `class_p` column to `secondary` as its second column. It should contain
+the probability, predicted by the model created in exercise 8, that a given
+observation has class `p`.
+
+Print descriptive statistics of the `class_p` column by `class`.
+
+
+Solution

```
insertcols!(secondary, 2, :class_p => predict(model))
@@ -264,9 +261,14 @@ end
We can see that the model has some discriminatory power, but there is still
a significant overlap between classes.

+
+
### Exercise 10

-Solution:
+Plot FPR-TPR ROC curve for our model and compute associated AUC value.
+
+
+Solution

```
using Plots
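# The remainder of this solution is outside the lines shown in the hunk.
# Below is a hypothetical sketch (not the original code) of one way to get the
# ROC curve and AUC, assuming the :class and :class_p columns from exercise 9
# and treating class "p" as the positive class.
using Statistics
is_p = secondary.class .== "p"
thresholds = range(1.0, 0.0; length=101)
tpr = [mean(secondary.class_p[is_p] .>= t) for t in thresholds]
fpr = [mean(secondary.class_p[.!is_p] .>= t) for t in thresholds]
plot(fpr, tpr; xlabel="FPR", ylabel="TPR", legend=false)
# AUC via the trapezoid rule over the (FPR, TPR) points:
auc = sum((fpr[2:end] .- fpr[1:end-1]) .* (tpr[2:end] .+ tpr[1:end-1]) ./ 2)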