diff --git a/README.md b/README.md index 8505d0b..aa2937b 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,11 @@ This repository contains source codes for the book that is written by Bogumił Kamiński and is planned to be published in 2022 by [Manning Publications Co.](https://www.manning.com/) +Extras: +* in the `/exercises` folder for each book chapter you can find 10 additional + exercises with solutions (they are meant for self study and are not discussed + in the book) + ## Setting up your environment In order to prepare the Julia environment before working with the materials @@ -71,9 +76,9 @@ The codes for each chapter are stored in files named *chXX.jl*, where *XX* is chapter number. The exceptions are * chapter 14, where additionally a separate *ch14_server.jl* is present along -with *ch14.jl* (the reason is that in this chapter we create a web service and -the *ch14_server.jl* contains the server-side code that should be run in a -separate Julia process); + with *ch14.jl* (the reason is that in this chapter we create a web service and + the *ch14_server.jl* contains the server-side code that should be run in a + separate Julia process); * appendix A, where the file name used is *appA.txt* because it also contains other instructions than only Julia code (in particular package manager mode instructions). @@ -113,8 +118,6 @@ There are the following videos that feature material related to this book: * [Analysis of GitHub developer graph](https://www.twitch.tv/videos/1527593261) (a shortened version of material covered in chapter 12) - - ## Data used in the book For your convenience I additionally stored data files that we use in this book. 
@@ -130,5 +133,3 @@ They are respectively:
   under GPL-3.0 License)
 * owensboro.zip (for chapter 13, available at The Stanford Open Policing
   Project under the Open Data Commons Attribution License)
-
-
diff --git a/exercises/README.md b/exercises/README.md
new file mode 100644
index 0000000..0839e01
--- /dev/null
+++ b/exercises/README.md
@@ -0,0 +1,33 @@
+# Julia for Data Analysis
+
+This folder contains additional exercises for the book
+["Julia for Data Analysis"](https://www.manning.com/books/julia-for-data-analysis?utm_source=bkamins&utm_medium=affiliate&utm_campaign=book_kaminski2_julia_3_17_22)
+that is written by Bogumił Kamiński and is planned to be published in 2022
+by [Manning Publications Co.](https://www.manning.com/).
+
+The exercises were prepared by [Bogumił Kamiński](https://github.com/bkamins) and [Daniel Kaszyński](https://www.linkedin.com/in/daniel-kaszy%C5%84ski-3a7807113/).
+
+For each book chapter you can find 10 additional exercises with solutions.
+The exercises are meant for self-study and are not discussed in the book.
+
+The exercises contained in this folder have varying levels of difficulty. The
+problems for the first few chapters are meant to be relatively easy; however,
+you might find the exercises for the last chapters more challenging (if you are
+learning Julia, I still encourage you to try these exercises and walk through
+the solutions, trying to understand them). In particular, in some exercises I
+purposely require using more functionality of the Julia ecosystem than is
+covered in the book. This is meant to teach you how to use help and
+documentation, as this is a very important skill to master.
+
+The files containing exercises follow the naming convention `exercisesDD.md`,
+where `DD` is the book chapter number for which the exercises were prepared.
+
+All the exercises should be possible to solve using the project environment
+that is used throughout the book (please check the global README.md file of this
+repository for details).
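As a quick sketch of what that setup could look like (my illustration only; the global README.md of the repository is the authoritative source, and this assumes you start Julia in the root folder of the repository):

```
using Pkg            # stdlib package manager
Pkg.activate(".")    # activate the project environment stored in this repository
Pkg.instantiate()    # install the package versions pinned in Manifest.toml
```

In package manager mode (press `]` in the REPL) the equivalent commands are `activate .` followed by `instantiate`.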
+
+---
+
+*Preparation of these exercises has been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*
+
+![SGH & NAWA](logo.png)
diff --git a/exercises/example8.csv.gz b/exercises/example8.csv.gz
new file mode 100644
index 0000000..31ee657
Binary files /dev/null and b/exercises/example8.csv.gz differ
diff --git a/exercises/exercises02.md b/exercises/exercises02.md
new file mode 100644
index 0000000..c2ac52d
--- /dev/null
+++ b/exercises/exercises02.md
@@ -0,0 +1,267 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 2
+
+# Problems
+
+### Exercise 1
+
+Consider the following code:
+```
+x = [1, 2]
+y = x
+y[1] = 10
+```
+What is the value of `x[1]` and why?
+
+### Exercise 2
+
+How can you type `⚡ = 1`? Check if this operation succeeds and what is its result.
+
+### Exercise 3
+
+What will be the value of variable `x` after running the following code and why?
+```
+x = 0.0
+for i in 1:7_000_000
+    global x += 1/7
+end
+x /= 1_000_000
+```
+
+### Exercise 4
+
+Express the type `Matrix{Bool}` using the `Array` type.
+
+### Exercise 5
+
+Let `x` be a vector. Write code that prints an error if `x` is empty
+(has zero elements).
+
+### Exercise 6
+
+Write a function called `exec` that takes two values `x` and `y` and a function
+accepting two arguments, call it `op`, and returns `op(x, y)`. Make `+`
+the default value of `op`.
+
+### Exercise 7
+
+Write a function that calculates the sum of the absolute values of the elements
+of a collection passed to it.
+
+### Exercise 8
+
+Write a function that swaps the first and last elements of an array in place.
+
+### Exercise 9
+
+Write a loop in global scope that calculates the sum of cubes of numbers from
+`1` to `10^6`. Next, use the `sum` function to perform the same computation.
+What is the difference in timing of these operations?
+ +### Exercise 10 + +Explain the value of the result of summation obtained in exercise 9. + +# Solutions + +
+
+Show!
+
+### Exercise 1
+
+`x[1]` will be `10` because `y = x` does not copy data but binds
+the same value to both variables `x` and `y`.
+
+### Exercise 2
+
+In help mode (activated by `?`) copy-paste `⚡` to get:
+```
+help?> ⚡
+"⚡" can be typed by \:zap:
+```
+After the `⚡ = 1` operation a new variable `⚡` is defined and it is bound
+to the value `1`.
+
+### Exercise 3
+
+`x` will have the value `0.9999999999242748`. This value is below `1.0` because
+the `Float64` representation of `1/7` is less than the rational number 1/7,
+and the error accumulates when we do the addition multiple times.
+
+*Extra*: You can check that the `Float64` representation is indeed a bit less
+than the rational 1/7 by increasing the precision of computations using the `big`
+function:
+```
+julia> big(1/7) # convert Float64 to high-precision float
+0.142857142857142849212692681248881854116916656494140625
+
+julia> 1/big(7) # construct high-precision float directly
+0.1428571428571428571428571428571428571428571428571428571428571428571428571428568
+```
+As you can see there is a difference at the 17th place after the decimal point,
+where we have `4` vs `5`.
+
+### Exercise 4
+
+It is `Array{Bool, 2}`. You immediately get this information in the REPL:
+```
+julia> Matrix{Bool}
+Matrix{Bool} (alias for Array{Bool, 2})
+```
+
+### Exercise 5
+
+You can do it like this:
+```
+length(x) == 0 && println("x is empty")
+```
+
+*Extra*: typically in such a case one would use the `isempty` function and throw
+an exception instead of just printing the information (here I assume that `x` was
+passed as an argument to the function):
+```
+isempty(x) && throw(ArgumentError("x is not allowed to be empty"))
+```
+
+### Exercise 6
+
+Here are two ways to define the `exec` function:
+```
+exec1(x, y, op=+) = op(x, y)
+exec2(x, y; op=+) = op(x, y)
+```
+The first of them uses positional arguments, and the second a keyword argument.
+Here is the difference in how they would be called:
+```
+julia> exec1(2, 3, *)
+6
+
+julia> exec2(2, 3; op=*)
+6
+```
+
+### Exercise 7
+
+Such a function can be written as:
+```
+sumabs(x) = sum(abs, x)
+```
+
+### Exercise 8
+
+This can be written for example as:
+```
+function swap!(x)
+    f = x[1]
+    x[1] = x[end]
+    x[end] = f
+    return x
+end
+```
+
+*Extra*: A more advanced way to write this function would be:
+```
+function swap!(x)
+    if length(x) > 1
+        x[begin], x[end] = x[end], x[begin]
+    end
+    return x
+end
+```
+Note the differences in the code:
+* we use `begin` instead of `1` to get the first element. This is a safer
+  practice since some collections in Julia do not use 1-based indexing (in
+  practice you are not likely to see them, so this comment is most relevant
+  for package developers)
+* if there are `0` or `1` elements in the collection the function does not do
+  anything (depending on the context we might want to throw an error instead)
+* in `x[begin], x[end] = x[end], x[begin]` we perform two assignments at the
+  same time to avoid having to use a temporary variable `f` (this operation
+  is technically called tuple destructuring; we discuss it in later chapters of
+  the book)
+
+### Exercise 9
+
+We used the `@time` macro in chapter 1.
+
+Version in global scope:
+```
+julia> s = 0
+0
+
+julia> @time for i in 1:10^6
+           global s += i^3
+       end
+  0.076299 seconds (2.00 M allocations: 30.517 MiB, 10.47% gc time)
+```
+
+Version with a function using the `sum` function:
+```
+julia> sum3(n) = sum(x -> x^3, 1:n)
+sum3 (generic function with 1 method)
+
+julia> @time sum3(10^6)
+  0.000012 seconds
+-8222430735553051648
+```
+
+Version with the `sum` function in global scope:
+```
+julia> @time sum(x -> x^3, 1:10^6)
+  0.027436 seconds (48.61 k allocations: 2.558 MiB, 99.75% compilation time)
+-8222430735553051648
+
+julia> @time sum(x -> x^3, 1:10^6)
+  0.025744 seconds (48.61 k allocations: 2.557 MiB, 99.76% compilation time)
+-8222430735553051648
+```
+
+As you can see, using a loop in global scope is inefficient. It leads to
+many allocations and slow execution.
+
+Using the `sum3` function leads to the fastest execution. You might ask why using
+`sum(x -> x^3, 1:10^6)` in global scope is slower. The reason is that the
+anonymous function `x -> x^3` is defined anew each time this operation is called,
+which forces compilation of the `sum` function (but it is still faster than
+the loop in global scope).
+
+For reference, check the function with a loop inside it:
+```
+julia> function sum3loop(n)
+           s = 0
+           for i in 1:n
+               s += i^3
+           end
+           return s
+       end
+sum3loop (generic function with 1 method)
+
+julia> @time sum3loop(10^6)
+  0.001378 seconds
+-8222430735553051648
+```
+This is also much faster than the loop in global scope.
+
+### Exercise 10
+
+In exercise 9 we note that the result is `-8222430735553051648`, which is a
+negative value, although we are adding cubes of positive values. The
+reason for the problem is that operations on integers overflow.
If you
+are working with numbers larger than can be stored in the `Int` type, whose
+maximum value is:
+```
+julia> typemax(Int)
+9223372036854775807
+```
+use `big` numbers that we discussed in *Exercise 3*:
+```
+julia> @time sum(x -> big(x)^3, 1:10^6)
+  0.833234 seconds (11.05 M allocations: 236.113 MiB, 23.77% gc time, 2.63% compilation time)
+250000500000250000000000
+```
+Now we get the correct result, at the cost of slower computation.
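As an extra note on the overflow in exercise 10: an alternative to `big` (my addition; it is not part of the original solution) is to widen the accumulation to `Int128`, which is wide enough for this sum and avoids heap-allocated `BigInt` values:

```
# The true result, 250000500000250000000000, is about 2.5e23,
# far below typemax(Int128) (about 1.7e38), so Int128 does not overflow here.
sum(x -> Int128(x)^3, 1:10^6)   # 250000500000250000000000
```

This gives the same result as the `big` version, typically much faster.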
diff --git a/exercises/exercises03.md b/exercises/exercises03.md
new file mode 100644
index 0000000..ce039f3
--- /dev/null
+++ b/exercises/exercises03.md
@@ -0,0 +1,345 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 3
+
+# Problems
+
+### Exercise 1
+
+Check what methods the `repeat` function has.
+Are they all covered in the help for this function?
+
+### Exercise 2
+
+Write a function `fun2` that takes any vector and returns the difference between
+the largest and the smallest element in this vector.
+
+### Exercise 3
+
+Generate a vector of one million random numbers from the `[0, 1]` interval.
+Check which is the faster way to get the maximum and minimum elements of it: one
+option is using the `maximum` and `minimum` functions, and the other is
+using the `extrema` function.
+
+### Exercise 4
+
+Assume you have accidentally typed `+x = 1` when wanting to assign `1` to
+variable `x`. What effects can this operation have?
+
+### Exercise 5
+
+What is the result of calling `subtypes` on `Union{Bool, Missing}` and why?
+
+### Exercise 6
+
+Define two identical anonymous functions `x -> x + 1` in global scope. Do they
+have the same type?
+
+### Exercise 7
+
+Define the `wrap` function taking one argument `i` and returning the anonymous
+function `x -> x + i`. Is the type of such an anonymous function the same across
+calls to the `wrap` function?
+
+### Exercise 8
+
+You want to write a function that accepts any `Integer` except `Bool` and returns
+the passed value. If `Bool` is passed, an error should be thrown.
+
+### Exercise 9
+
+The `@time` macro measures the time taken to run an expression and prints it,
+but returns the value of the expression.
+The `@elapsed` macro works differently: it does not print anything, but returns
+the time taken to evaluate an expression. Test the `@elapsed` macro by checking
+how long it takes to shuffle a vector of one million floats. Use the `shuffle`
+function from the `Random` module.
+
+### Exercise 10
+
+Using the `@btime` macro, benchmark the time of calculating the sum of one
+million random floats.
+
+# Solutions
+
+
+Show!
+
+### Exercise 1
+
+Write:
+```
+julia> methods(repeat)
+# 6 methods for generic function "repeat":
+[1] repeat(A::AbstractArray; inner, outer) in Base at abstractarraymath.jl:392
+[2] repeat(A::AbstractArray, counts...) in Base at abstractarraymath.jl:355
+[3] repeat(c::Char, r::Integer) in Base at strings/string.jl:336
+[4] repeat(c::AbstractChar, r::Integer) in Base at strings/string.jl:335
+[5] repeat(s::Union{SubString{String}, String}, r::Integer) in Base at strings/substring.jl:248
+[6] repeat(s::AbstractString, r::Integer) in Base at strings/basic.jl:715
+```
+
+Now write `?repeat` and you will see that there are four entries in the help.
+The reason is that for `Char` and `AbstractChar`, as well as for
+`AbstractString` and `Union{SubString{String}, String}`, there is one help entry.
+
+Why do these cases have two methods defined?
+The reason is performance. For example `repeat(c::AbstractChar, r::Integer)`
+is a generic function that accepts any character values
+and `repeat(c::Char, r::Integer)` is its faster version
+that accepts values of the `Char` type only (and it is invoked by Julia
+if a value of type `Char` is passed as an argument to `repeat`).
+
+### Exercise 2
+
+You can define it as follows:
+```
+fun2(x::AbstractVector) = maximum(x) - minimum(x)
+```
+or as follows:
+```
+function fun2(x::AbstractVector)
+    lo, hi = extrema(x)
+    return hi - lo
+end
+```
+Note that these two functions will work with vectors of any elements that
+are ordered and support subtraction (they do not have to be numbers).
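To illustrate this genericness, here is my own small example (using the `Dates` standard library, which the original solution does not require), where the elements are dates rather than numbers:

```
using Dates

fun2(x::AbstractVector) = maximum(x) - minimum(x)

# subtracting two Date values gives a Day period, so fun2 works on dates too
fun2([Date(2022, 3, 1), Date(2022, 1, 1), Date(2022, 2, 1)])   # 59 days
```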
+
+### Exercise 3
+
+Here is a way to compare the performance of both options:
+```
+julia> using BenchmarkTools
+
+julia> x = rand(10^6);
+
+julia> @btime minimum($x), maximum($x)
+  860.700 μs (0 allocations: 0 bytes)
+(1.489173560242918e-6, 0.9999984347293639)
+
+julia> @btime extrema($x)
+  2.185 ms (0 allocations: 0 bytes)
+(1.489173560242918e-6, 0.9999984347293639)
```
+
+As you can see, in this situation, although `extrema` does the operation
+in a single pass over `x`, it is slower than computing `minimum` and `maximum`
+in two passes.
+
+### Exercise 4
+
+If it is a fresh Julia session, you define a new function in `Main` for the `+` operator:
+
+```
+julia> +x=1
++ (generic function with 1 method)
+
+julia> methods(+)
+# 1 method for generic function "+":
+[1] +(x) in Main at REPL[1]:1
+
+julia> +(10)
+1
+```
+
+This will also break any further uses of `+` in your programs:
+
+```
+julia> 1 + 2
+ERROR: MethodError: no method matching +(::Int64, ::Int64)
+You may have intended to import Base.:+
+Closest candidates are:
+  +(::Any) at REPL[1]:1
+```
+
+If you used addition earlier in this Julia session, then the `+x = 1` operation
+will error. To check this, start a fresh Julia session:
+
+```
+julia> 1 + 2
+3
+
+julia> +x=1
+ERROR: error in method definition: function Base.+ must be explicitly imported to be extended
+```
+
+### Exercise 5
+
+You get an empty vector:
+```
+julia> subtypes(Union{Bool, Missing})
+Type[]
+```
+
+The reason is that the `subtypes` function returns subtypes of explicitly
+declared types that have names (the type of such types is `DataType` in Julia).
+
+*Extra*: for this reason `subtypes` has limited use. To check if one type
+is a subtype of some other type use the `<:` operator.
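For example (my illustration, not part of the original solution):

```
Bool <: Union{Bool, Missing}   # true: a type is a subtype of a Union containing it
Int  <: Union{Bool, Missing}   # false
Bool <: Integer                # true: Bool is a subtype of Integer
```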
+
+### Exercise 6
+
+No, each of them has a different type:
+```
+julia> f1 = x -> x + 1
+#1 (generic function with 1 method)
+
+julia> f2 = x -> x + 1
+#3 (generic function with 1 method)
+
+julia> typeof(f1)
+var"#1#2"
+
+julia> typeof(f2)
+var"#3#4"
+```
+
+This is the reason why a function call like `sum(x -> x^2, 1:10)` in global
+scope triggers compilation each time:
+
+```
+julia> @time sum(x -> x^2, 1:10)
+  0.070714 seconds (167.41 k allocations: 8.815 MiB, 14.29% gc time, 93.91% compilation time)
+385
+
+julia> @time sum(x -> x^2, 1:10)
+  0.020971 seconds (47.82 k allocations: 2.529 MiB, 99.75% compilation time)
+385
+
+julia> @time sum(x -> x^2, 1:10)
+  0.021184 seconds (47.81 k allocations: 2.529 MiB, 99.77% compilation time)
+385
+```
+
+### Exercise 7
+
+Yes, the type is the same:
+
+```
+julia> wrap(i) = x -> x + i
+wrap (generic function with 1 method)
+
+julia> typeof(wrap(1))
+var"#11#12"
+
+julia> typeof(wrap(2))
+var"#11#12"
+```
+
+Julia defines a new type for such an anonymous function only once.
+The consequence of this is that e.g. expressions inside a function, like
+`sum(x -> x ^ i, 1:10)` where `i` is an argument to the enclosing function,
+do not trigger compilation (as opposed to similar expressions in global scope,
+see exercise 6).
+
+```
+julia> sumi(i) = sum(x -> x^i, 1:10)
+sumi (generic function with 1 method)
+
+julia> @time sumi(1)
+  0.000004 seconds
+55
+
+julia> @time sumi(2)
+  0.000001 seconds
+385
+
+julia> @time sumi(3)
+  0.000003 seconds
+3025
+```
+
+### Exercise 8
+
+We check subtypes of `Integer`:
+
+```
+julia> subtypes(Integer)
+3-element Vector{Any}:
+ Bool
+ Signed
+ Unsigned
+```
+
+The first way to write such a function is then:
+```
+fun1(i::Union{Signed, Unsigned}) = i
+```
+and now we have:
+```
+julia> fun1(1)
+1
+
+julia> fun1(true)
+ERROR: MethodError: no method matching fun1(::Bool)
+```
+
+The second way is:
+```
+fun2(i::Integer) = i
+fun2(::Bool) = throw(ArgumentError("Bool is not supported"))
+```
+
+and now you have:
+```
+julia> fun2(1)
+1
+
+julia> fun2(true)
+ERROR: ArgumentError: Bool is not supported
+```
+
+### Exercise 9
+
+Here is the code that performs the task:
+```
+julia> using Random # needed to get access to shuffle
+
+julia> x = rand(10^6); # generate random floats
+
+julia> @elapsed shuffle(x)
+0.0518085
+
+julia> @elapsed shuffle(x)
+0.01257
+
+julia> @elapsed shuffle(x)
+0.012483
+```
+
+Note that the first time we run `shuffle` it takes longer due to compilation.
+
+### Exercise 10
+
+The code you can use is:
+
+```
+julia> using BenchmarkTools
+
+julia> @btime sum($(rand(10^6)))
+  155.300 μs (0 allocations: 0 bytes)
+500330.6375697419
+```
+
+Note that the following:
+```
+julia> @btime sum(rand(10^6))
+  1.644 ms (2 allocations: 7.63 MiB)
+500266.9457722128
+```
+would be an incorrect timing, as you would also measure the time of generating
+the vector.
+
+Alternatively you can e.g. write:
+```
+julia> x = rand(10^6);
+
+julia> @btime sum($x)
+  154.700 μs (0 allocations: 0 bytes)
+500151.95875364926
+```
+
diff --git a/exercises/exercises04.md b/exercises/exercises04.md
new file mode 100644
index 0000000..26ebcef
--- /dev/null
+++ b/exercises/exercises04.md
@@ -0,0 +1,536 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 4
+
+# Problems
+
+### Exercise 1
+
+Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix
+columnwise with consecutive numbers). Next calculate the sum, mean, and standard
+deviation of each row and each column of this matrix.
+
+### Exercise 2
+
+For each column of the matrix created in exercise 1 compute its range
+(i.e. the difference between the maximum and minimum elements stored in it).
+
+### Exercise 3
+
+This is data for car speed (mph) and distance taken to stop (ft)
+from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
+
+```
+speed dist
+    4    2
+    4   10
+    7    4
+    7   22
+    8   16
+    9   10
+   10   18
+   10   26
+   10   34
+   11   17
+   11   28
+   12   14
+   12   20
+   12   24
+   12   28
+   13   26
+   13   34
+   13   34
+   13   46
+   14   26
+   14   36
+   14   60
+   14   80
+   15   20
+   15   26
+   15   54
+   16   32
+   16   40
+   17   32
+   17   40
+   17   50
+   18   42
+   18   56
+   18   76
+   18   84
+   19   36
+   19   46
+   19   68
+   20   32
+   20   48
+   20   52
+   20   56
+   20   64
+   22   66
+   23   54
+   24   70
+   24   92
+   24   93
+   24  120
+   25   85
+```
+
+Load this data into Julia (this is part of the exercise) and fit a linear
+regression where speed is a feature and distance is the target variable.
+
+### Exercise 4
+
+Plot the data loaded in exercise 3. Additionally plot the fitted regression
+(you need to check the Plots.jl documentation to find a way to do this).
+
+### Exercise 5
+
+A simple code for calculation of Fibonacci numbers for positive
+arguments is as follows:
+
+```
+fib(n) = n < 3 ? 1 : fib(n-1) + fib(n-2)
+```
+
+Using the BenchmarkTools.jl package measure the runtime of this function for
+`n` ranging from `1` to `20`.
+
+### Exercise 6
+
+Improve the speed of the code from exercise 5 by using a dictionary where you
+store a mapping of `n` to `fib(n)`.
Measure the performance of this function +for the same range of values as in exercise 5. + +### Exercise 7 + +Create a vector containing named tuples representing elements of a 4x4 grid. +So the first element of this vector should be `(x=1, y=1)` and last should be +`(x=4, y=4)`. Store the vector in variable `v`. + +### Exercise 8 + +The `filter` function allows you to select some values of an input collection. +Check its documentation first. Next, use it to keep from the vector `v` from +exercise 7 only elements whose sum is even. + +### Exercise 9 + +Check the documentation of the `filter!` function. Perform the same operation +as asked in exercise 8 but using `filter!`. What is the difference? + +### Exercise 10 + +Write a function that takes a number `n`. Next it generates two independent +random vectors of length `n` and returns their correlation coefficient. +Run this function `10000` times for `n` equal to `10`, `100`, `1000`, +and `10000`. +Create a plot with four histograms of distribution of computed Pearson +correlation coefficient. Check in the Plots.jl package which function can be +used to plot histograms. + +# Solutions + +
+
+Show!
+
+### Exercise 1
+
+Write:
+```
+julia> using Statistics
+
+julia> mat = [1 3 5
+              2 4 6]
+2×3 Matrix{Int64}:
+ 1  3  5
+ 2  4  6
+
+julia> sum(mat, dims=1)
+1×3 Matrix{Int64}:
+ 3  7  11
+
+julia> sum(mat, dims=2)
+2×1 Matrix{Int64}:
+  9
+ 12
+
+julia> mean(mat, dims=1)
+1×3 Matrix{Float64}:
+ 1.5  3.5  5.5
+
+julia> mean(mat, dims=2)
+2×1 Matrix{Float64}:
+ 3.0
+ 4.0
+
+julia> std(mat, dims=1)
+1×3 Matrix{Float64}:
+ 0.707107  0.707107  0.707107
+
+julia> std(mat, dims=2)
+2×1 Matrix{Float64}:
+ 2.0
+ 2.0
+```
+
+Observe that the returned statistics are also stored in matrices.
+If we compute them for columns (`dims=1`) then the produced matrix has one row.
+If we compute them for rows (`dims=2`) then the produced matrix has one column.
+
+### Exercise 2
+
+Here are some ways you can do it:
+```
+julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
+3-element Vector{Int64}:
+ 1
+ 1
+ 1
+
+julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
+3-element Vector{Int64}:
+ 1
+ 1
+ 1
+```
+
+Observe that when we use `eachcol` the produced result is a vector (not a matrix
+like in exercise 1).
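A third option (my own variant using the `extrema` function from chapter 3; it is not shown in the original solution) is:

```
mat = [1 3 5
       2 4 6]

# extrema returns a (minimum, maximum) tuple per column,
# which we destructure in the comprehension
[hi - lo for (lo, hi) in extrema.(eachcol(mat))]   # returns [1, 1, 1]
```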
+
+### Exercise 3
+
+First create a matrix with the source data by copy-pasting it from the exercise
+like this:
+```
+data = [
+ 4 2
+ 4 10
+ 7 4
+ 7 22
+ 8 16
+ 9 10
+ 10 18
+ 10 26
+ 10 34
+ 11 17
+ 11 28
+ 12 14
+ 12 20
+ 12 24
+ 12 28
+ 13 26
+ 13 34
+ 13 34
+ 13 46
+ 14 26
+ 14 36
+ 14 60
+ 14 80
+ 15 20
+ 15 26
+ 15 54
+ 16 32
+ 16 40
+ 17 32
+ 17 40
+ 17 50
+ 18 42
+ 18 56
+ 18 76
+ 18 84
+ 19 36
+ 19 46
+ 19 68
+ 20 32
+ 20 48
+ 20 52
+ 20 56
+ 20 64
+ 22 66
+ 23 54
+ 24 70
+ 24 92
+ 24 93
+ 24 120
+ 25 85
+]
+```
+
+Now use the GLM.jl package to fit the model:
+
+```
+julia> using GLM
+
+julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
+StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
+
+distance ~ 1 + speed
+
+Coefficients:
+─────────────────────────────────────────────────────────────────────────
+                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
+─────────────────────────────────────────────────────────────────────────
+(Intercept)  -17.5791      6.75844   -2.60    0.0123   -31.1678   -3.99034
+speed          3.93241     0.415513   9.46    <1e-11     3.09696    4.76785
+─────────────────────────────────────────────────────────────────────────
+```
+
+You can get the same estimates using the `\` operator like this:
+```
+julia> [ones(50) data[:, 1]] \ data[:, 2]
+2-element Vector{Float64}:
+ -17.579094890510966
+   3.9324087591240877
+```
+
+### Exercise 4
+
+Run the following:
+```
+using Plots
+scatter(data[:, 1], data[:, 2];
+        xlab="speed", ylab="distance", legend=false, smooth=true)
+```
+
+The `smooth=true` keyword argument adds the linear regression line to the plot.
+
+### Exercise 5
+
+Use the following code:
+```
+julia> using BenchmarkTools
+
+julia> for i in 1:20
+           print(i, " ")
+           @btime fib($i)
+       end
+1 2.500 ns (0 allocations: 0 bytes)
+2 2.700 ns (0 allocations: 0 bytes)
+3 4.800 ns (0 allocations: 0 bytes)
+4 7.500 ns (0 allocations: 0 bytes)
+5 12.112 ns (0 allocations: 0 bytes)
+6 19.980 ns (0 allocations: 0 bytes)
+7 32.125 ns (0 allocations: 0 bytes)
+8 52.696 ns (0 allocations: 0 bytes)
+9 85.010 ns (0 allocations: 0 bytes)
+10 140.311 ns (0 allocations: 0 bytes)
+11 222.177 ns (0 allocations: 0 bytes)
+12 359.903 ns (0 allocations: 0 bytes)
+13 582.123 ns (0 allocations: 0 bytes)
+14 1.000 μs (0 allocations: 0 bytes)
+15 1.560 μs (0 allocations: 0 bytes)
+16 2.522 μs (0 allocations: 0 bytes)
+17 4.000 μs (0 allocations: 0 bytes)
+18 6.600 μs (0 allocations: 0 bytes)
+19 11.400 μs (0 allocations: 0 bytes)
+20 18.100 μs (0 allocations: 0 bytes)
+```
+
+Notice that the execution time for number `n` is roughly the sum of the execution
+times for numbers `n-1` and `n-2`.
+
+### Exercise 6
+
+Use the following code:
+
+```
+julia> fib_dict = Dict{Int, Int}()
+Dict{Int64, Int64}()
+
+julia> function fib2(n)
+           haskey(fib_dict, n) && return fib_dict[n]
+           fib_n = n < 3 ?
1 : fib2(n-1) + fib2(n-2)
+           fib_dict[n] = fib_n
+           return fib_n
+       end
+fib2 (generic function with 1 method)
+
+julia> for i in 1:20
+           print(i, " ")
+           @btime fib2($i)
+       end
+1 40.808 ns (0 allocations: 0 bytes)
+2 40.101 ns (0 allocations: 0 bytes)
+3 40.101 ns (0 allocations: 0 bytes)
+4 40.707 ns (0 allocations: 0 bytes)
+5 42.727 ns (0 allocations: 0 bytes)
+6 40.909 ns (0 allocations: 0 bytes)
+7 40.404 ns (0 allocations: 0 bytes)
+8 40.707 ns (0 allocations: 0 bytes)
+9 40.808 ns (0 allocations: 0 bytes)
+10 39.798 ns (0 allocations: 0 bytes)
+11 40.909 ns (0 allocations: 0 bytes)
+12 40.404 ns (0 allocations: 0 bytes)
+13 42.872 ns (0 allocations: 0 bytes)
+14 42.626 ns (0 allocations: 0 bytes)
+15 47.972 ns (1 allocation: 16 bytes)
+16 46.505 ns (1 allocation: 16 bytes)
+17 46.302 ns (1 allocation: 16 bytes)
+18 45.390 ns (1 allocation: 16 bytes)
+19 47.160 ns (1 allocation: 16 bytes)
+20 46.201 ns (1 allocation: 16 bytes)
+```
+
+Note that this benchmarking essentially gives us the time of a dictionary lookup.
+The reason is that `@btime` executes the same expression many times, so
+for the fastest execution time the value for each `n` is already stored in
+`fib_dict`.
+
+It would be more interesting to see the runtime of `fib2` for some large value
+of `n` executed once:
+
+```
+julia> @time fib2(100)
+  0.000018 seconds (107 allocations: 1.672 KiB)
+3736710778780434371
+
+julia> @time fib2(200)
+  0.000025 seconds (204 allocations: 20.453 KiB)
+-1123705814761610347
+```
+
+As you can see things are indeed fast. Note that for `n=200` we get a negative
+value because of integer overflow.
+
+As a more advanced topic (not covered in the book) it is worth commenting that
+`fib2` is not type stable. If we wanted to make it type stable we would need to
+declare the `fib_dict` dictionary as `const`.
Here is the code and benchmarks +(you need to restart Julia to run this test): + +``` +julia> const fib_dict = Dict{Int, Int}() +Dict{Int64, Int64}() + +julia> function fib2(n) + haskey(fib_dict, n) && return fib_dict[n] + fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2) + fib_dict[n] = fib_n + return fib_n + end +fib2 (generic function with 1 method) + +julia> @time fib2(100) + 0.000014 seconds (6 allocations: 5.828 KiB) +3736710778780434371 + +julia> @time fib2(200) + 0.000011 seconds (3 allocations: 17.312 KiB) +-1123705814761610347 +``` + +As you can see the code does less allocations and is faster now. + +### Exercise 7 + +Since we are asked to create a vector we can write: + +``` +julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4] +16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}: + (x = 1, y = 1) + (x = 1, y = 2) + (x = 1, y = 3) + (x = 1, y = 4) + (x = 2, y = 1) + (x = 2, y = 2) + (x = 2, y = 3) + (x = 2, y = 4) + (x = 3, y = 1) + (x = 3, y = 2) + (x = 3, y = 3) + (x = 3, y = 4) + (x = 4, y = 1) + (x = 4, y = 2) + (x = 4, y = 3) + (x = 4, y = 4) +``` + +Note (not covered in the book) that you could create a matrix by changing +the syntax a bit: + +``` +julia> [(x=x, y=y) for x in 1:4, y in 1:4] +4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}: + (x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4) + (x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4) + (x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4) + (x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4) +``` + +Finally, we can use a bit shorter syntax (covered in chapter 14 of the book): + +``` +julia> [(; x, y) for x in 1:4, y in 1:4] +4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}: + (x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4) + (x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4) + (x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4) + (x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4) +``` + +### Exercise 8 + +To 
get help on the `filter` function write `?filter`. Next run: + +``` +julia> filter(e -> iseven(e.x + e.y), v) +8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}: + (x = 1, y = 1) + (x = 1, y = 3) + (x = 2, y = 2) + (x = 2, y = 4) + (x = 3, y = 1) + (x = 3, y = 3) + (x = 4, y = 2) + (x = 4, y = 4) +``` + +### Exercise 9 + +To get help on the `filter!` function write `?filter!`. Next run: + +``` +julia> filter!(e -> iseven(e.x + e.y), v) +8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}: + (x = 1, y = 1) + (x = 1, y = 3) + (x = 2, y = 2) + (x = 2, y = 4) + (x = 3, y = 1) + (x = 3, y = 3) + (x = 4, y = 2) + (x = 4, y = 4) + +julia> v +8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}: + (x = 1, y = 1) + (x = 1, y = 3) + (x = 2, y = 2) + (x = 2, y = 4) + (x = 3, y = 1) + (x = 3, y = 3) + (x = 4, y = 2) + (x = 4, y = 4) +``` + +Notice that `filter` allocated a new vector, while `filter!` updated the `v` +vector in place. + +### Exercise 10 + +You can use for example the following code: + +``` +using Statistics +using Plots +rand_cor(n) = cor(rand(n), rand(n)) +plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false) + for n in [10, 100, 1000, 10000]]...) +``` + +Observe that as you increase `n` the dispersion of the correlation coefficient +decreases. + +
diff --git a/exercises/exercises05.md b/exercises/exercises05.md
new file mode 100644
index 0000000..f91a442
--- /dev/null
+++ b/exercises/exercises05.md
@@ -0,0 +1,287 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 5
+
+# Problems
+
+### Exercise 1
+
+Create a matrix containing the truth table for the `&&` and `||` operations.
+
+### Exercise 2
+
+The `issubset` function checks if one collection is a subset of another
+collection.
+
+Now take the range `4:6` and check if it is a subset of the ranges `4-k:4+k` for
+`k` varying from `1` to `3`. Store the result in a vector.
+
+### Exercise 3
+
+Write a function that accepts two vectors and returns `true` if they have equal
+length and otherwise returns `false`.
+
+### Exercise 4
+
+Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
+`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
+Calculate their Adjusted Mutual Information using scikit-learn.
+
+### Exercise 5
+
+Using the Adjusted Mutual Information function from exercise 4, generate
+a pair of random vectors of length 100 containing integer numbers from the
+range `1:5`. Repeat this exercise 1000 times and plot a histogram of the AMI.
+Check in the documentation of the `rand` function how you can draw a sample
+from a collection of values.
+
+### Exercise 6
+
+Adjust the code from exercise 5, but replace the first 50 elements of each
+vector with zero. Repeat the experiment.
+
+### Exercise 7
+
+Write a function that takes a vector of integer values and returns a dictionary
+giving information on how many times each integer was present in the passed
+vector.
+
+Test this function on the vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
+and `v3 = 3:5`.
+
+### Exercise 8
+
+Write code that creates a `Bool` diagonal matrix of size 5x5.
+
+### Exercise 9
+
+Write code comparing the performance of calculating the sum of logarithms of
+the elements of `1:100` using broadcasting combined with the `sum` function
+vs using only the `sum` function with a function passed as its first argument.
+
+### Exercise 10
+
+Create a dictionary in which for each number from `1` to `10` you will store
+a vector of its positive divisors. You can check the remainder of division
+of two values using the `rem` function.
+
+Additionally (not covered in the book), you can drop elements
+from a comprehension if you add an `if` clause after the `for` clause, for
+example to keep only odd numbers from range `1:10` do:
+
+```
+julia> [i for i in 1:10 if isodd(i)]
+5-element Vector{Int64}:
+ 1
+ 3
+ 5
+ 7
+ 9
+```
+
+You can populate a dictionary by passing a vector of pairs to it (not covered in
+the book), for example:
+
+```
+julia> Dict(["a" => 1, "b" => 2])
+Dict{String, Int64} with 2 entries:
+  "b" => 2
+  "a" => 1
+```
+
+# Solutions
+
+
+Show!
+
+### Exercise 1
+
+You can do it as follows:
+```
+julia> [true, false] .&& [true false]
+2×2 BitMatrix:
+ 1  0
+ 0  0
+
+julia> [true, false] .|| [true false]
+2×2 BitMatrix:
+ 1  1
+ 1  0
+```
+
+Note that the first array is a vector, while the second array is a 1-row matrix.
+
+### Exercise 2
+
+You can do it like this using broadcasting:
+```
+julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
+3-element BitVector:
+ 0
+ 1
+ 1
+```
+Note that you need to use `Ref` to protect `4:6` from being broadcasted over.
+
+### Exercise 3
+
+This function can be written as follows:
+
+```
+equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
+```
+
+### Exercise 4
+
+You can do this exercise as follows:
+```
+julia> using PyCall
+
+julia> metrics = pyimport("sklearn.metrics");
+
+julia> metrics.adjusted_mutual_info_score(x, y)
+-0.11111111111111087
+
+julia> metrics.adjusted_mutual_info_score(x, z)
+0.7276079390930807
+
+julia> metrics.adjusted_mutual_info_score(y, z)
+-0.21267989848846763
+```
+
+### Exercise 5
+
+You can create such a plot using the following commands:
+
+```
+using Plots
+histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100))
+           for i in 1:1000], label="AMI")
+```
+
+You can check that AMI oscillates around 0.
+
+### Exercise 6
+
+This time it is convenient to write a helper function. Note that we use
+broadcasting to update values in the vectors.
+
+```
+function exampleAMI()
+    x = rand(1:5, 100)
+    y = rand(1:5, 100)
+    x[1:50] .= 0
+    y[1:50] .= 0
+    return metrics.adjusted_mutual_info_score(x, y)
+end
+histogram([exampleAMI() for i in 1:1000], label="AMI")
+```
+
+Note that this time AMI is a bit below 0.5, which shows a better match between
+vectors.
+ +### Exercise 7 + +``` +julia> function counter(v::AbstractVector{<:Integer}) + d = Dict{eltype(v), Int}() + for x in v + if haskey(d, x) + d[x] += 1 + else + d[x] = 1 + end + end + return d + end +counter (generic function with 1 method) + +julia> counter(v1) +Dict{Int64, Int64} with 3 entries: + 2 => 2 + 3 => 3 + 1 => 1 + +julia> counter(v2) +Dict{Bool, Int64} with 2 entries: + 0 => 1 + 1 => 1 + +julia> counter(v3) +Dict{Int64, Int64} with 3 entries: + 5 => 1 + 4 => 1 + 3 => 1 +``` + +Note that we used the `eltype` function to set a proper key type for +dictionary `d`. + +### Exercise 8 + +This is a way to do it: +``` +julia> 1:5 .== (1:5)' +5×5 BitMatrix: + 1 0 0 0 0 + 0 1 0 0 0 + 0 0 1 0 0 + 0 0 0 1 0 + 0 0 0 0 1 +``` + +Using the `LinearAlgebra` module you could also write: + +``` +julia> using LinearAlgebra + +julia> I(5) +5×5 Diagonal{Bool, Vector{Bool}}: + 1 ⋅ ⋅ ⋅ ⋅ + ⋅ 1 ⋅ ⋅ ⋅ + ⋅ ⋅ 1 ⋅ ⋅ + ⋅ ⋅ ⋅ 1 ⋅ + ⋅ ⋅ ⋅ ⋅ 1 +``` + +### Exercise 9 + +Here is how you can do it: + +``` +julia> using BenchmarkTools + +julia> @btime sum(log.(1:100)) + 1.620 μs (1 allocation: 896 bytes) +363.7393755555635 + +julia> @btime sum(log, 1:100) + 1.570 μs (0 allocations: 0 bytes) +363.7393755555636 +``` + +As you can see using the `sum` function with `log` as its first argument +is a bit faster as it is not allocating. + +### Exercise 10 + +Here is how you can do it: + +``` +julia> Dict([i => [j for j in 1:i if rem(i, j) == 0] for i in 1:10]) +Dict{Int64, Vector{Int64}} with 10 entries: + 5 => [1, 5] + 4 => [1, 2, 4] + 6 => [1, 2, 3, 6] + 7 => [1, 7] + 2 => [1, 2] + 10 => [1, 2, 5, 10] + 9 => [1, 3, 9] + 8 => [1, 2, 4, 8] + 3 => [1, 3] + 1 => [1] +``` + +
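+
+Equivalently (a sketch not covered in the book; the `divisors` helper name is
+ours), the inner comprehension can be extracted into a small function built on
+`filter`, which reads closer to the problem statement:
+
+```
+divisors(n) = filter(j -> rem(n, j) == 0, 1:n)
+Dict(i => divisors(i) for i in 1:10)
+```
+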
diff --git a/exercises/exercises06.md b/exercises/exercises06.md new file mode 100644 index 0000000..76d55eb --- /dev/null +++ b/exercises/exercises06.md @@ -0,0 +1,271 @@ +# Julia for Data Analysis + +## Bogumił Kamiński, Daniel Kaszyński + +# Chapter 6 + +# Problems + +### Exercise 1 + +Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"` +(replace `3` by a proper interpolation expression) and replace `USD` by `$`. + +### Exercise 2 + +Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data` +as `iris.csv` to your local folder. + +### Exercise 3 + +Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"` +in two lines so that it takes less horizontal space. + +### Exercise 4 + +Load data stored in `iris.csv` file into a `data` vector where each element +should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if +the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed +as floats). + +### Exercise 5 + +The `data` structure is a vector of named tuples, change it to a named tuple +of vectors (with the same field names) and call it `data2`. + +### Exercise 6 + +Calculate the frequency of each type of Iris type (`c` field in `data2`). + +### Exercise 7 + +Create a vector `c2` that is derived from `c` in `data2` but holds inline strings, +vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s. +Compare sizes of the three objects. + +### Exercise 8 + +You know that `refs` field of `PooledArray` stores an integer index of a given +value in it. Using this information make a scatter plot of `pl` vs `pw` vectors +in `data2`, but for each Iris type give a different point color (check the +`color` keyword argument meaning in the Plots.jl manual; you can use the +`plot_color` function). 
+
+### Exercise 9
+
+Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
+the `str` variable (do not copy-paste the string; type it yourself).
+
+### Exercise 10
+
+In the `str` string from exercise 9 find all matches of a pattern where `a`
+is followed by `b`, possibly with some characters between them.
+
+# Solutions
+
+ +Show! + +### Exercise 1 + +Solution: +``` +julia> "I have apples worth $(1+2)\$" +"I have apples worth 3\$" +``` + +### Exercise 2 + +Solution: +``` +import Downloads +Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", + "iris.csv") +``` + +### Exercise 3 + +Solution: +``` +"https://archive.ics.uci.edu/ml/\ + machine-learning-databases/iris/iris.data" +``` + +### Exercise 4 + +Solution: +``` +julia> function line_parser(line) + elements = split(line, ",") + @assert length(elements) == 5 + return (sl=parse(Float64, elements[1]), + sw=parse(Float64, elements[2]), + pl=parse(Float64, elements[3]), + pw=parse(Float64, elements[4]), + c=elements[5]) + end +line_parser (generic function with 1 method) + +julia> data = line_parser.(readlines("iris.csv")[1:end-1]) +150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}: + (sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa") + (sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa") + (sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa") + (sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa") + (sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa") + ⋮ + (sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica") + (sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica") + (sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica") + (sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica") +``` + +Note that we used `1:end-1` selector to drop last element from the read lines +since it is empty. This is the reason why adding the +`@assert length(elements) == 5` check in the `line_parser` function is useful. + +### Exercise 5 + +Later in the book you will learn more advanced ways to do it. 
Here let us +use a most basic approach: + +``` +data2 = (sl=[d.sl for d in data], + sw=[d.sw for d in data], + pl=[d.pl for d in data], + pw=[d.pw for d in data], + c=[d.c for d in data]) +``` + +### Exercise 6 + +Solution: +``` +julia> using FreqTables + +julia> freqtable(data2.c) +3-element Named Vector{Int64} +Dim1 │ +──────────────────┼─── +"Iris-setosa" │ 50 +"Iris-versicolor" │ 50 +"Iris-virginica" │ 50 +``` + +### Exercise 7 + +Solution: +``` +julia> using InlineStrings + +julia> c2 = inlinestrings(data2.c) +150-element Vector{String15}: + "Iris-setosa" + "Iris-setosa" + "Iris-setosa" + "Iris-setosa" + "Iris-setosa" + ⋮ + "Iris-virginica" + "Iris-virginica" + "Iris-virginica" + "Iris-virginica" + +julia> using PooledArrays + +julia> c3 = PooledArray(data2.c) +150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}: + "Iris-setosa" + "Iris-setosa" + "Iris-setosa" + "Iris-setosa" + "Iris-setosa" + ⋮ + "Iris-virginica" + "Iris-virginica" + "Iris-virginica" + "Iris-virginica" + +julia> c4 = Symbol.(data2.c) +150-element Vector{Symbol}: + Symbol("Iris-setosa") + Symbol("Iris-setosa") + Symbol("Iris-setosa") + Symbol("Iris-setosa") + Symbol("Iris-setosa") + ⋮ + Symbol("Iris-virginica") + Symbol("Iris-virginica") + Symbol("Iris-virginica") + Symbol("Iris-virginica") + +julia> Base.summarysize(data2.c) +12840 + +julia> Base.summarysize(c2) +2440 + +julia> Base.summarysize(c3) +1696 + +julia> Base.summarysize(c4) +1240 +``` + +### Exercise 8 + +Solution: +``` +using Plots +scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false) +``` + +### Exercise 9 + +The hard part is typing `²`, `⟺` and `∨`. You can check how to do it using help: +``` +help?> ² +"²" can be typed by \^2 + +help?> ⟺ +"⟺" can be typed by \iff + +help?> ∨ +"∨" can be typed by \vee +``` + +Save the string in the `str` variable as we will use it in the next exercise. + +### Exercise 10 + +The exercise does not specify how the matching should be done. 
If we
+want it to be greedy (match as much as possible), we write:
+
+```
+julia> m = match(r"a.*b", str)
+RegexMatch("a²=b² ⟺ a=b ∨ a=-b")
+```
+
+As you can see, we have matched the whole string.
+
+If we want it to be lazy (match as little as possible), we write:
+
+```
+julia> m = match(r"a.*?b", str)
+RegexMatch("a²=b")
+```
+
+This finds the first such match.
+
+If we want to find all lazy matches we can write (not covered in the book):
+
+```
+julia> collect(eachmatch(r"a.*?b", str))
+3-element Vector{RegexMatch}:
+ RegexMatch("a²=b")
+ RegexMatch("a=b")
+ RegexMatch("a=-b")
+```
+
diff --git a/exercises/exercises07.md b/exercises/exercises07.md
new file mode 100644
index 0000000..8dd7cb8
--- /dev/null
+++ b/exercises/exercises07.md
@@ -0,0 +1,314 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 7
+
+# Problems
+
+### Exercise 1
+
+Random.org provides a service that returns random numbers. One of the ways
+you can use it is by sending HTTP GET requests. Here is an example request:
+
+> https://www.random.org/integers/?num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new
+
+If you want to understand all the parameters please check their meaning
+[here](https://www.random.org/clients/http/).
+
+For us it is enough that this request generates 10 random integers in the range
+from 1 to 6. Run this query in Julia and parse the result.
+
+### Exercise 2
+
+Write a function that tries to parse a string as an integer.
+If it succeeds it should return the integer, otherwise it should return `0`
+but print an error message.
+
+### Exercise 3
+
+Create a matrix containing the truth table for the `&&` operation including
+`missing`. If some operation errors store `"error"` in the table. As an extra
+feature (this is harder so you can skip it) in each cell store both inputs and
+the output to make reading the table easier.
+
+### Exercise 4
+
+Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
+missing values in it by the mean of the non-missing values.
+
+### Exercise 5
+
+Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
+strings stored in it as `Float64`, while keeping `missing` values unchanged.
+
+### Exercise 6
+
+Print to the terminal all days in January 2023 that are Mondays.
+
+### Exercise 7
+
+Compute the dates that are one month later than January 15, 2023, February 15,
+2023, March 15, 2023, and April 15, 2023. How many days pass during this one
+month? Print the results to the screen.
+
+### Exercise 8
+
+Parse the following string as JSON:
+```
+str = """
+[{"x":1,"y":1},
+ {"x":2,"y":4},
+ {"x":3,"y":9},
+ {"x":4,"y":16},
+ {"x":5,"y":25}]
+"""
+```
+into a `json` variable.
+
+### Exercise 9
+
+Extract from the `json` variable from exercise 8 two vectors `x` and `y`
+that correspond to the fields stored in the JSON structure.
+Plot `y` as a function of `x`.
+
+### Exercise 10
+
+You are given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`.
+Use linear interpolation to fill the interior missing values. For the extreme
+values use the nearest available observation (you will need to consult the
+Impute.jl documentation to find all required functions).
+
+# Solutions
+
+ +Show! + +### Exercise 1 + +Solution (example run): + +``` +julia> using HTTP + +julia> response = HTTP.get("https://www.random.org/integers/?\ + num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new"); + +julia> parse.(Int, split(String(response.body))) +10-element Vector{Int64}: + 6 + 2 + 6 + 3 + 4 + 2 + 5 + 2 + 3 + 6 +``` + +### Exercise 2 + +Example function: + +``` +function str2int(s::AbstractString) + try + return parse(Int, s) + catch e + println(e) + end + return 0 +end +``` + +Let us check it: + +``` +julia> str2int("10") +10 + +julia> str2int(" -1 ") +-1 + +julia> str2int("12345678901234567890") +OverflowError("overflow parsing \"12345678901234567890\"") +0 + +julia> str2int("1.3") +ArgumentError("invalid base 10 digit '.' in \"1.3\"") +0 + +julia> str2int("a") +ArgumentError("invalid base 10 digit 'a' in \"a\"") +0 +``` + +An alternative solution would use `tryparse` (not covered in the book): + +``` +function str2int(s::AbstractString) + v = tryparse(Int, s) + if isnothing(v) + println("error while parsing") + return 0 + end + return v +end +``` +But this time we do not see the cause of the error. 
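+
+A third variant (an illustrative sketch, not from the book) keeps the fast
+`tryparse` path but, on failure, re-runs `parse` inside `try`/`catch` so that
+the cause of the error is still printed:
+
+```
+function str2int(s::AbstractString)
+    v = tryparse(Int, s)
+    isnothing(v) || return v
+    # parsing failed; rerun with parse only to recover and print the exception
+    try
+        parse(Int, s)
+    catch e
+        println(e)
+    end
+    return 0
+end
+```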
+ +### Exercise 3 + +Solution: + +``` +julia> function apply_and(x, y) + try + return "$x && $y = $(x && y)" + catch e + return "$x && $y = error" + end + end +apply_and (generic function with 2 methods) + +julia> apply_and.([true, false, missing], [true false missing]) +3×3 Matrix{String}: + "true && true = true" "true && false = false" "true && missing = missing" + "false && true = false" "false && false = false" "false && missing = false" + "missing && true = error" "missing && false = error" "missing && missing = error" +``` + +### Exercise 4 + +Solution: + +``` +julia> using Statistics + +julia> coalesce.(v, mean(skipmissing(v))) +6-element Vector{Float64}: + 1.5 + 2.5 + 3.5 + 4.5 + 5.5 + 3.5 +``` + +### Exercise 5 + +Solution: + +``` +julia> using Missings + +julia> passmissing(parse).(Float64, s) +6-element Vector{Union{Missing, Float64}}: + 1.5 + 2.5 + missing + 4.5 + 5.5 + missing +``` + +### Exercise 6 + +Example solution: + +``` +julia> using Dates + +julia> for day in Date.(2023, 01, 1:31) + dayofweek(day) == 1 && println(day) + end +2023-01-02 +2023-01-09 +2023-01-16 +2023-01-23 +2023-01-30 +``` + +### Exercise 7 + +Example solution: + +``` +julia> for day in Date.(2023, 1:4, 15) + day_next = day + Month(1) + println("$day + 1 month = $day_next (difference: $(day_next - day))") + end +2023-01-15 + 1 month = 2023-02-15 (difference: 31 days) +2023-02-15 + 1 month = 2023-03-15 (difference: 28 days) +2023-03-15 + 1 month = 2023-04-15 (difference: 31 days) +2023-04-15 + 1 month = 2023-05-15 (difference: 30 days) +``` + +### Exercise 8 + +Solution: + +``` +julia> using JSON3 + +julia> json = JSON3.read(str) +5-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, Vector{UInt64}}: + { + "x": 1, + "y": 1 +} + { + "x": 2, + "y": 4 +} + { + "x": 3, + "y": 9 +} + { + "x": 4, + "y": 16 +} + { + "x": 5, + "y": 25 +} +``` + +### Exercise 9 + +Solution: + +``` +using Plots +x = [el.x for el in json] +y = [el.y for el in json] +plot(x, y, xlabel="x", 
ylabel="y", legend=false)
+```
+
+### Exercise 10
+
+Solution:
+
+```
+julia> using Impute
+
+julia> Impute.nocb!(Impute.locf!(Impute.interp(m)))
+8-element Vector{Union{Missing, Int64}}:
+ 1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 6
+```
+
+Note that we use the `locf!` and `nocb!` functions (with `!`) to perform the
+operation in place (a new vector was already allocated by `Impute.interp`).
+
diff --git a/exercises/exercises08.md b/exercises/exercises08.md
new file mode 100644
index 0000000..72da36d
--- /dev/null
+++ b/exercises/exercises08.md
@@ -0,0 +1,294 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 8
+
+# Problems
+
+### Exercise 1
+
+Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
+called `df`.
+
+### Exercise 2
+
+Get the number of rows, columns, column names and summary statistics of the
+`df` data frame from exercise 1.
+
+### Exercise 3
+
+Make a plot of the `number` against `square` columns of the `df` data frame.
+
+### Exercise 4
+
+Add a column to the `df` data frame with name `name string` containing a string
+representation of the numbers in column `number`, i.e.
+`["one", "two", "three", "four"]`.
+
+### Exercise 5
+
+Check if `df` contains column `square2`.
+
+### Exercise 6
+
+Extract column `number` from `df` and empty it (recall the `empty!` function
+discussed in chapter 4).
+
+### Exercise 7
+
+The `Random` module defines the `randexp` function that samples numbers
+from the exponential distribution with scale 1.
+Draw two 100,000-element samples from this distribution and store them
+in vectors `x` and `y`. Plot histograms of the elementwise maximum of the
+sampled pairs and of the sum of vector `x` and half of vector `y`.
+
+### Exercise 8
+
+Using vectors `x` and `y` from exercise 7 create a `df` data frame storing
+them, along with the elementwise maximum of the sampled pairs and the sum of
+vector `x` and half of vector `y`.
+Compute all standard descriptive statistics of the columns of this data frame.
+
+### Exercise 9
+
+Store the `df` data frame from exercise 8 in an Apache Arrow file and a CSV
+file. Compare the sizes of the created files using the `filesize` function.
+
+### Exercise 10
+
+Write the `df` data frame into an SQLite database. Next find information about
+the tables in this database. Run a query against a table representing the `df` data
+frame to calculate the mean of column `x`. 
Does it match the result we got in +exercise 8? + +# Solutions + +
+ +Show! + +### Exercise 1 + +CSV.jl supports reading gzip-compressed files so you can just do: + +``` +julia> using CSV + +julia> using DataFrames + +julia> df = CSV.read("example8.csv.gz", DataFrame) +4×2 DataFrame + Row │ number square + │ Int64 Int64 +─────┼──────────────── + 1 │ 1 2 + 2 │ 2 4 + 3 │ 3 9 + 4 │ 4 16 +``` + +You can also do it manually: +``` +julia> using CodecZlib # you might need to install this package + +julia> compressed = read("example8.csv.gz"); + +julia> plain = transcode(GzipDecompressor, compressed); + +julia> df = CSV.read(plain, DataFrame) +4×2 DataFrame + Row │ number square + │ Int64 Int64 +─────┼──────────────── + 1 │ 1 2 + 2 │ 2 4 + 3 │ 3 9 + 4 │ 4 16 +``` + +### Exercise 2 + +Solution: + +``` +julia> nrow(df) +4 + +julia> ncol(df) +2 + +julia> names(df) +2-element Vector{String}: + "number" + "square" + +julia> describe(df) +2×7 DataFrame + Row │ variable mean min median max nmissing eltype + │ Symbol Float64 Int64 Float64 Int64 Int64 DataType +─────┼────────────────────────────────────────────────────────────── + 1 │ number 2.5 1 2.5 4 0 Int64 + 2 │ square 7.75 2 6.5 16 0 Int64 +``` + +### Exercise 3 + +Solution: +``` +using Plots +plot(df.number, df.square, xlabel="number", ylabel="square", legend=false) +``` + +### Exercise 4 + +Solution: + +``` +julia> df."name string" = ["one", "two", "three", "four"] +4-element Vector{String}: + "one" + "two" + "three" + "four" + +julia> df +4×3 DataFrame + Row │ number square name string + │ Int64 Int64 String +─────┼───────────────────────────── + 1 │ 1 2 one + 2 │ 2 4 two + 3 │ 3 9 three + 4 │ 4 16 four +``` + +Note that we needed to use a string as we have space in column name. 
+ +### Exercise 5 + +You can use either `hasproperty` or `columnindex`: + +``` +julia> hasproperty(df, :square2) +false + +julia> columnindex(df, :square2) +0 +``` + +Note that if you try to access this column you will get a hint what was the +mistake you most likely made: + +``` +julia> df.square2 +ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square +``` + +### Exercise 6 + +Solution: + +``` +julia> empty!(df[:, :number]) +Int64[] +``` + +Note that you must not do `empty!(df[!, :number])` nor `empty!(df.number)` +as it would corrupt the `df` data frame (these operations do non-copying +extraction of a column from a data frame as opposed to `df[:, :number]` +which makes a copy). + +### Exercise 7 + +Solution: +``` +using Random +using Plots +x = randexp(100_000); +y = randexp(100_000); +histogram(x + y / 2, label="mean") +histogram!(max.(x, y), label="maximum") +``` + +I have put both histograms on the same plot to show that they overlap. 
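+
+The overlap is not accidental (a side observation, not covered in the book):
+for independent `Exp(1)` draws the minimum of a pair is exponential with rate
+2, i.e. distributed like `y / 2`, and by memorylessness the excess of the
+maximum over the minimum is an independent `Exp(1)` draw, so `max.(x, y)` and
+`x + y / 2` have the same distribution. A quick numerical check (assuming the
+`x` and `y` vectors defined above):
+
+```
+using Statistics
+# empirical quantiles of both samples should be close
+for p in (0.25, 0.5, 0.75, 0.9, 0.99)
+    println(p, ": ", round(quantile(x + y / 2, p), digits=3),
+            " vs ", round(quantile(max.(x, y), p), digits=3))
+end
+```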
+ +### Exercise 8 + +Solution (you might get slightly different results because we did not set +the seed of random number generator when creating `x` and `y` vectors): + +``` +julia> df = DataFrame(x=x, y=y); + +julia> df."x+y/2" = x + y / 2; + +julia> df."max.(x,y)" = max.(x, y); + +julia> describe(df, :all) +4×13 DataFrame + Row │ variable mean std min q25 median q75 max nunique nmissing first last eltype + │ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64 Nothing Int64 Float64 Float64 DataType +─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── + 1 │ x 0.997023 0.999119 3.01389e-6 0.285129 0.68856 1.38414 12.1556 0 0.250502 0.077737 Float64 + 2 │ y 1.00109 0.995904 2.78828e-6 0.289371 0.6957 1.38491 12.0445 0 0.689659 0.486246 Float64 + 3 │ x+y/2 1.49757 1.11676 0.00217486 0.688598 1.2235 2.0113 14.2046 0 0.595331 0.32086 Float64 + 4 │ max.(x,y) 1.49872 1.11295 0.00187844 0.691588 1.22466 2.01257 12.1556 0 0.689659 0.486246 Float64 +``` + +We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary +statistics except `first` and `last` as expected. + +### Exercise 9 + +``` +julia> using Arrow + +julia> CSV.write("df.csv", df) +"df.csv" + +julia> Arrow.write("df.arrow", df) +"df.arrow" + +julia> filesize("df.csv") +7587820 + +julia> filesize("df.arrow") +3200874 +``` + +In this case Apache Arrow file is smaller. 
+ +### Exercise 10 + +``` +julia> using SQLite + +julia> db = SQLite.DB("df.db") +SQLite.DB("df.db") + +julia> SQLite.load!(df, db, "df") +"df" + +julia> SQLite.tables(db) +1-element Vector{SQLite.DBTable}: + SQLite.DBTable("df", Tables.Schema: + :x Union{Missing, Float64} + :y Union{Missing, Float64} + Symbol("x+y/2") Union{Missing, Float64} + Symbol("max.(x,y)") Union{Missing, Float64}) + +julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df"); + +julia> DataFrame(query) +1×1 DataFrame + Row │ AVG(x) + │ Float64 +─────┼────────── + 1 │ 0.997023 + +julia> close(db) +``` + +The computed mean of column `x` is the same as we got in exercise 8. + +
diff --git a/exercises/exercises09.md b/exercises/exercises09.md
new file mode 100644
index 0000000..acb6dd1
--- /dev/null
+++ b/exercises/exercises09.md
@@ -0,0 +1,293 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 9
+
+# Problems
+
+In this problem set we will use the `puzzles.csv` file that was
+created in chapter 8. Please first load it into your Julia
+session using the commands:
+
+```
+using CSV
+using DataFrames
+puzzles = CSV.read("puzzles.csv", DataFrame);
+```
+
+### Exercise 1
+
+Create a `matein2` data frame that will have only puzzles that have `"mateIn2"`
+in the `Themes` column.
+Use the `contains` function (check its documentation first).
+
+### Exercise 2
+
+What is the fraction of puzzles that are mate in 2 in relation to all puzzles
+in the `puzzles` data frame?
+
+### Exercise 3
+
+Create a `small` data frame that holds the first 10 rows of the `matein2` data
+frame and columns `Rating`, `RatingDeviation`, and `NbPlays`.
+
+### Exercise 4
+
+Iterate over the rows of the `small` data frame and print the ratio of
+`RatingDeviation` and `NbPlays` for each row.
+
+### Exercise 5
+
+Get the names of columns from the `matein2` data frame that end with `n`
+(ignore case).
+
+### Exercise 6
+
+Write a function `collatz` that runs the following process. Start with a
+positive number `n`. If it is even, divide it by two. If it is odd, multiply
+it by 3 and add one. The function should return the number of steps needed to
+reach 1.
+
+Create a `d` dictionary that maps the number of steps needed to a list of
+numbers from the range `1:100` that required this number of steps.
+
+### Exercise 7
+
+Using the `d` dictionary make a scatter plot of the number of steps required
+vs the average value of numbers that require this number of steps.
+
+### Exercise 8
+
+Repeat the process from exercises 6 and 7, but this time use a data frame
+and try to write an appropriate expression using the `combine` and `groupby`
+functions (as explained in the last part of chapter 9). This time
+perform the computations for numbers ranging from one to one million.
+
+### Exercise 9
+
+Set the seed of the random number generator to `1234`. Draw 100 random points
+from the interval `[0, 1]`. Store this vector in a data frame as the `x` column.
+Now compute the `y` column using the formula `4 * (x - 0.5) ^ 2`.
+Add random noise to column `y` that has normal distribution with mean 0 and
+standard deviation 0.25. Call this column `z`.
+Make a scatter plot with `x` on the x-axis and `y` and `z` on the y-axis.
+
+### Exercise 10
+
+Add a LOESS regression line of `z` explained by `x` to the figure produced in
+exercise 9.
+
+# Solutions
+
+ +Show! + +### Exercise 1 + +Solution: + +``` +julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :] +274135×9 DataFrame + Row │ PuzzleId FEN Moves Rating RatingDeviation Popularity NbPlays Themes GameUrl ⋯ + │ String7 String String Int64 Int64 Int64 Int64 String String ⋯ +────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── + 1 │ 000hf r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1… e8f7 e2e6 f7f8 e6f7 1560 76 88 441 mate mateIn2 middlegame short https://li ⋯ + 2 │ 001Wz 4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P… e8e5 d1d8 e5e8 d8e8 1128 81 87 54 backRankMate endgame mate mateIn… https://li + 3 │ 001om 5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1… g4h4 c5f2 g2g3 f2g3 991 78 89 215 mate mateIn2 middlegame short https://li + 4 │ 003Tx 2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P… e1e4 f3d2 b1a1 c8c1 1716 77 87 476 backRankMate endgame fork mate m… https://li + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ + 274132 │ zzxQS 2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn… c8f8 d6d1 e3e1 d1e1 1149 75 96 1722 mate mateIn2 middlegame short https://li ⋯ + 274133 │ zzxvB 5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3… f6g4 c7f7 f8f7 a7a8 1695 74 95 4857 endgame mate mateIn2 pin sacrifi… https://li + 274134 │ zzzRN 4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3… g1h1 e3e1 f7f1 e1f1 830 108 67 31 endgame mate mateIn2 short https://li + 274135 │ zzzco 5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P… f7f2 b2c2 c1b1 e2d1 1783 75 90 763 endgame mate mateIn2 queensideAt… https://li + 1 column and 274127 rows omitted +``` + +### Exercise 2 + +Solution (two ways to do it): + +``` +julia> using Statistics + +julia> nrow(matein2) / nrow(puzzles) +0.12852152542746353 + +julia> mean(contains.(puzzles.Themes, "mateIn2")) +0.12852152542746353 +``` + +### Exercise 3 + +Solution: + +``` +julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]] +10×3 DataFrame + Row │ Rating RatingDeviation NbPlays + │ Int64 Int64 Int64 +─────┼────────────────────────────────── + 1 │ 1560 76 
441 + 2 │ 1128 81 54 + 3 │ 991 78 215 + 4 │ 1716 77 476 + 5 │ 711 81 111 + 6 │ 723 86 806 + 7 │ 754 92 248 + 8 │ 1177 76 827 + 9 │ 994 81 71 + 10 │ 979 144 14 +``` + +### Exercise 4 + +Solution: + +``` +julia> for row in eachrow(small) + println(row.RatingDeviation / row.NbPlays) + end +0.17233560090702948 +1.5 +0.3627906976744186 +0.16176470588235295 +0.7297297297297297 +0.10669975186104218 +0.3709677419354839 +0.09189842805320435 +1.1408450704225352 +10.285714285714286 +``` + +### Exercise 5 + +Solution (several options): +``` +julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N')) +2-element Vector{String}: + "FEN" + "RatingDeviation" + +julia> names(matein2, Cols(col -> endswith(uppercase(col), "N"))) +2-element Vector{String}: + "FEN" + "RatingDeviation" + +julia> names(matein2, r"[nN]$") +2-element Vector{String}: + "FEN" + "RatingDeviation" +``` + +### Exercise 6 + +Solution: + +``` +julia> function collatz(n) + i = 0 + while n != 1 + i += 1 + n = iseven(n) ? div(n, 2) : 3 * n + 1 + end + return i + end +collatz (generic function with 1 method) + +julia> d = Dict{Int, Vector{Int}}() +Dict{Int64, Vector{Int64}}() + +julia> for n in 1:100 + i = collatz(n) + if haskey(d, i) + push!(d[i], n) + else + d[i] = [n] + end + end + +julia> d +Dict{Int64, Vector{Int64}} with 45 entries: + 5 => [5, 32] + 35 => [78, 79] + 110 => [82, 83] + 30 => [86, 87, 89] + 32 => [57, 59] + 6 => [10, 64] + 115 => [73] + 112 => [54, 55] + 4 => [16] + 13 => [34, 35] + 104 => [47] + 12 => [17, 96] + 23 => [25] + 111 => [27] + 92 => [91] + 11 => [48, 52, 53] + 118 => [97] + ⋮ => ⋮ +``` + +As we can see even for small `n` the number of steps required to reach `1` +can get quite large. 
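+
+As a side note (not required by the exercise; `collatz_memo` and
+`collatz_cache` are our own names), a dictionary cache speeds up repeated
+trajectory computations considerably, which becomes relevant for the
+one-million run in exercise 8:
+
+```
+const collatz_cache = Dict{Int,Int}(1 => 0)
+
+function collatz_memo(n)
+    haskey(collatz_cache, n) && return collatz_cache[n]
+    next = iseven(n) ? div(n, 2) : 3 * n + 1
+    # cache the result so shared trajectory suffixes are computed only once
+    return collatz_cache[n] = 1 + collatz_memo(next)
+end
+```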
+
+### Exercise 7
+
+Solution:
+
+```
+using Plots
+using Statistics
+steps = collect(keys(d))
+mean_number = mean.(values(d))
+scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
+```
+
+Note that we needed to use `collect` on `keys` as `scatter` expects an array,
+not just an iterator.
+
+### Exercise 8
+
+Solution:
+
+```
+df = DataFrame(n=1:10^6);
+df.collatz = collatz.(df.n);
+agg = combine(groupby(df, :collatz), :n => mean);
+scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
+```
+
+### Exercise 9
+
+Solution:
+
+```
+using Random
+Random.seed!(1234)
+df = DataFrame(x=rand(100))
+df.y = 4 .* (df.x .- 0.5) .^ 2
+df.z = df.y + randn(100) / 4
+scatter(df.x, [df.y df.z], labels=["y" "z"])
+```
+
+### Exercise 10
+
+Solution:
+
+```
+using Loess
+model = loess(df.x, df.z);
+x_predict = sort(df.x)
+z_predict = predict(model, x_predict)
+plot!(x_predict, z_predict; label="z predicted")
+```
+
diff --git a/exercises/exercises10.md b/exercises/exercises10.md
new file mode 100644
index 0000000..7c5c604
--- /dev/null
+++ b/exercises/exercises10.md
@@ -0,0 +1,303 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 10
+
+# Problems
+
+### Exercise 1
+
+Generate a random matrix `mat` of size 5x4 with all elements drawn
+independently and uniformly from the [0, 1) interval.
+Create a data frame from this matrix using auto-generated
+column names.
+
+### Exercise 2
+
+Now, using matrix `mat` create a data frame with randomly generated
+column names. Use the `randstring` function from the `Random` module
+to generate them. Store this data frame in the `df` variable.
+
+### Exercise 3
+
+Create a new data frame, taking `df` as a source, that has the same
+columns but whose column names are `y1`, `y2`, `y3`, `y4`.
+
+### Exercise 4
+
+Create a dictionary holding `column_name => column_vector` pairs
+using data stored in data frame `df`. Save this dictionary in variable `d`.
+
+### Exercise 5
+
+Create a data frame back from dictionary `d` from exercise 4. Compare it
+with `df`.
+
+### Exercise 6
+
+For data frame `df` compute the dot product between all pairs of its columns.
+Use the `dot` function from the `LinearAlgebra` module.
+
+### Exercise 7
+
+Given two data frames:
+
+```
+julia> df1 = DataFrame(a=1:2, b=11:12)
+2×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1     11
+   2 │     2     12
+
+julia> df2 = DataFrame(a=1:2, c=101:102)
+2×2 DataFrame
+ Row │ a      c
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1    101
+   2 │     2    102
+```
+
+vertically concatenate them so that only columns that are present in both
+data frames are kept. Check the documentation of `vcat` to see how to
+do it.
+
+### Exercise 8
+
+Now append table `df2` to `df1`, but add only the columns from `df2` that
+are present in `df1`. Check the documentation of `append!` to see how to
+do it.
+
+### Exercise 9
+
+Create a `circle` data frame using the `push!` function, storing
+1000 samples of the following process:
+* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
+* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
+  and is `false` otherwise.
+
+Compute summary statistics of this data frame.
+
+### Exercise 10
+
+Create a scatterplot of the `circle` data frame where columns `x` and `y`
+give the coordinates of the plotted points and the `inside` variable
+determines the color of each point.
+
+# Solutions
+
+
+Show!
+
+### Exercise 1
+
+Solution:
+
+```
+julia> using DataFrames
+
+julia> mat = rand(5, 4)
+5×4 Matrix{Float64}:
+ 0.8386    0.83612   0.0353994  0.15547
+ 0.590172  0.611815  0.0691152  0.915788
+ 0.879395  0.07271   0.980079   0.655158
+ 0.340435  0.756196  0.0697535  0.388578
+ 0.714515  0.861872  0.971521   0.176768
+
+julia> DataFrame(mat, :auto)
+5×4 DataFrame
+ Row │ x1        x2        x3         x4
+     │ Float64   Float64   Float64    Float64
+─────┼─────────────────────────────────────────
+   1 │ 0.8386    0.83612   0.0353994  0.15547
+   2 │ 0.590172  0.611815  0.0691152  0.915788
+   3 │ 0.879395  0.07271   0.980079   0.655158
+   4 │ 0.340435  0.756196  0.0697535  0.388578
+   5 │ 0.714515  0.861872  0.971521   0.176768
+```
+
+### Exercise 2
+
+Solution:
+
+```
+julia> using Random
+
+julia> df = DataFrame(mat, [randstring() for _ in 1:4])
+5×4 DataFrame
+ Row │ 6mTK5evn  K8Inf7ER  5Caz55k0   SRiGemsa
+     │ Float64   Float64   Float64    Float64
+─────┼─────────────────────────────────────────
+   1 │ 0.8386    0.83612   0.0353994  0.15547
+   2 │ 0.590172  0.611815  0.0691152  0.915788
+   3 │ 0.879395  0.07271   0.980079   0.655158
+   4 │ 0.340435  0.756196  0.0697535  0.388578
+   5 │ 0.714515  0.861872  0.971521   0.176768
+```
+
+### Exercise 3
+
+Solution:
+
+```
+julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
+5×4 DataFrame
+ Row │ y1        y2        y3         y4
+     │ Float64   Float64   Float64    Float64
+─────┼─────────────────────────────────────────
+   1 │ 0.8386    0.83612   0.0353994  0.15547
+   2 │ 0.590172  0.611815  0.0691152  0.915788
+   3 │ 0.879395  0.07271   0.980079   0.655158
+   4 │ 0.340435  0.756196  0.0697535  0.388578
+   5 │ 0.714515  0.861872  0.971521   0.176768
+```
+
+You could also use the `rename` function:
+
+```
+julia> rename(df, string.("y", 1:4))
+5×4 DataFrame
+ Row │ y1        y2        y3         y4
+     │ Float64   Float64   Float64    Float64
+─────┼─────────────────────────────────────────
+   1 │ 0.8386    0.83612   0.0353994  0.15547
+   2 │ 0.590172  0.611815  0.0691152  0.915788
+   3 │ 0.879395  0.07271   0.980079   0.655158
+   4 │ 0.340435  0.756196  0.0697535  0.388578
+   5 │ 0.714515  0.861872  0.971521   0.176768
+```
+
+### Exercise 4
+
+Solution:
+
+```
+julia> d = Dict([n => df[:, n] for n in names(df)])
+Dict{String, Vector{Float64}} with 4 entries:
+  "6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
+  "5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
+  "K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
+  "SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
+```
+
+or (using the `pairs` function; note that this time column names are `Symbol`):
+
+```
+julia> Dict(pairs(eachcol(df)))
+Dict{Symbol, AbstractVector} with 4 entries:
+  Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
+  :SRiGemsa          => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
+  :K8Inf7ER          => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
+  Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
+```
+
+### Exercise 5
+
+Solution:
+
+```
+julia> DataFrame(d)
+5×4 DataFrame
+ Row │ 5Caz55k0   6mTK5evn  K8Inf7ER  SRiGemsa
+     │ Float64    Float64   Float64   Float64
+─────┼─────────────────────────────────────────
+   1 │ 0.0353994  0.8386    0.83612   0.15547
+   2 │ 0.0691152  0.590172  0.611815  0.915788
+   3 │ 0.980079   0.879395  0.07271   0.655158
+   4 │ 0.0697535  0.340435  0.756196  0.388578
+   5 │ 0.971521   0.714515  0.861872  0.176768
+```
+
+Note that columns of the data frame are now sorted by their names.
+This is done for `Dict` objects because such dictionaries do not have
+a defined order of keys.
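+
+A small, self-contained illustration of this sorting behavior (a sketch using
+a hypothetical two-column dictionary):
+
+```
+using DataFrames
+d_example = Dict("b" => [1, 2], "a" => [3, 4])
+# regardless of the insertion order, the columns come out sorted by name
+names(DataFrame(d_example))
+```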
+
+### Exercise 6
+
+Solution:
+
+```
+julia> using LinearAlgebra
+
+julia> using StatsBase
+
+julia> pairwise(dot, eachcol(df))
+4×4 Matrix{Float64}:
+ 2.45132  1.99944  1.65026   1.50558
+ 1.99944  2.39336  1.03322   1.18411
+ 1.65026  1.03322  1.9153    0.909744
+ 1.50558  1.18411  0.909744  1.47431
+```
+
+### Exercise 7
+
+Solution:
+
+```
+julia> vcat(df1, df2, cols=:intersect)
+4×1 DataFrame
+ Row │ a
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │     2
+   3 │     1
+   4 │     2
+```
+
+By default you will get an error:
+
+```
+julia> vcat(df1, df2)
+ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
+```
+
+### Exercise 8
+
+Solution:
+
+```
+julia> append!(df1, df2, cols=:subset)
+4×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64?
+─────┼────────────────
+   1 │     1       11
+   2 │     2       12
+   3 │     1  missing
+   4 │     2  missing
+```
+
+### Exercise 9
+
+Solution:
+
+```
+circle = DataFrame()
+for _ in 1:1000
+    x, y = 2rand()-1, 2rand()-1
+    inside = x^2 + y^2 < 1
+    push!(circle, (x=x, y=y, inside=inside))
+end
+describe(circle)
+```
+
+We note that the mean of the variable `inside` is approximately π/4, as it
+estimates the ratio of the area of the unit circle to the area of the
+[-1,1[ × [-1,1[ square.
+
+### Exercise 10
+
+Solution:
+
+```
+using Plots
+scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
+scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)
+```
+
+The solution shows two ways to plot the data while ensuring that the ratio
+between the x and y axes is 1. Note the differences in the produced output
+between the two methods.
+
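+As a quick numerical check of the Exercise 9 estimate (a sketch; the fraction
+of points of the [-1,1[ × [-1,1[ square that fall inside the unit circle
+equals the area ratio π/4 ≈ 0.785):
+
+```
+using Statistics
+inside = [(2rand()-1)^2 + (2rand()-1)^2 < 1 for _ in 1:10^6]
+mean(inside)
+```
+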
diff --git a/exercises/exercises11.md b/exercises/exercises11.md
new file mode 100644
index 0000000..bb69101
--- /dev/null
+++ b/exercises/exercises11.md
@@ -0,0 +1,479 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 11
+
+# Problems
+
+### Exercise 1
+
+Generate a data frame `df` having one column `x` consisting of 100,000 values
+sampled from the uniform distribution on the [0, 1[ interval.
+Serialize it to disk, and then deserialize it. Check if the deserialized
+object is the same as the source data frame.
+
+### Exercise 2
+
+Add a column `n` to the `df` data frame that in each row will hold the
+number of observations in column `x` that have distance less than `0.1` to
+the value stored in the given row of `x`.
+
+### Exercise 3
+
+Investigate visually how `n` depends on `x` in the data frame `df`.
+
+### Exercise 4
+
+Someone has prepared the following test data for you:
+```
+teststr = """
+"x","sinx"
+0.139279,0.138829
+0.456779,0.441059
+0.344034,0.337287
+0.140253,0.139794
+0.848344,0.750186
+0.977512,0.829109
+0.032737,0.032731
+0.702750,0.646318
+0.422339,0.409895
+0.393878,0.383772
+"""
+```
+
+Load this data into the `testdf` data frame.
+
+### Exercise 5
+
+Check the accuracy of the computation of the sine of `x` in `testdf`.
+Print all rows for which the absolute difference is greater than `5e-7`.
+In this case display `x`, `sinx`, the exact value of `sin(x)` and the absolute
+difference.
+
+### Exercise 6
+
+Group the data in data frame `df` into buckets of 0.1 width and store the
+result in the `gdf` grouped data frame (sort the groups). Use the `cut`
+function from CategoricalArrays.jl to do it (check its documentation to learn
+how to do it). Check the number of values in each group.
+
+### Exercise 7
+
+Display the grouping keys in the `gdf` grouped data frame. Show them as named
+tuples. Check what the group order would be if you asked not to sort them.
+
+### Exercise 8
+
+Compute the average `n` for each group in `gdf`.
+
+### Exercise 9
+
+Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
+Use the `\` operator to fit it (recall it from chapter 4).
+For each group produce the result as a named tuple having fields `α₀` and `αₓ`.
+
+### Exercise 10
+
+Repeat exercise 9, this time using the GLM.jl package, and
+extract the p-value of the estimated slope coefficient for the `x` variable.
+Use the `coeftable` function from GLM.jl to get this information.
+Check the documentation of this function to learn how to do it (it will be
+easiest for you to first convert its result to a `DataFrame`).
+
+# Solutions
+
+
+Show!
+
+### Exercise 1
+
+Solution:
+
+```
+julia> using DataFrames
+
+julia> df = DataFrame(x=rand(100_000));
+
+julia> using Serialization
+
+julia> serialize("df.bin", df)
+
+julia> deserialize("df.bin") == df
+true
+```
+
+### Exercise 2
+
+Solution:
+
+A simple approach is:
+```
+df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
+```
+
+A more sophisticated approach (faster and allocating less memory) would be:
+```
+df.n = map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)
+```
+
+An even faster solution that is type stable would use a function barrier:
+```
+f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
+df.n = f(df.x)
+```
+
+Finally, you can work on sorted data to get much better performance. Here is an
+example (it is a bit more advanced):
+```
+function f2(x)
+    p = sortperm(x)
+    n = zeros(Int, length(x))
+    start = 1
+    stop = 1
+    idx = 0
+    while idx < length(x) # you could add @inbounds here but I typically avoid it
+        idx += 1
+        while x[p[idx]] - x[p[start]] >= 0.1
+            start += 1
+        end
+        while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
+            stop += 1
+        end
+        n[p[idx]] = stop - start
+    end
+    return n
+end
+df.n = f2(df.x)
+```
+
+In this solution the fact that we used a function barrier is even more relevant
+as we explicitly use loops inside.
+
+### Exercise 3
+
+Solution:
+
+```
+using Plots
+scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
+```
+
+As expected, the number of neighbors drops near the border of the domain.
+ +### Exercise 4 + +Solution: + +``` +julia> using CSV + +julia> using DataFrames + +julia> testdf = CSV.read(IOBuffer(teststr), DataFrame) +10×2 DataFrame + Row │ x sinx + │ Float64 Float64 +─────┼──────────────────── + 1 │ 0.139279 0.138829 + 2 │ 0.456779 0.441059 + 3 │ 0.344034 0.337287 + 4 │ 0.140253 0.139794 + 5 │ 0.848344 0.750186 + 6 │ 0.977512 0.829109 + 7 │ 0.032737 0.032731 + 8 │ 0.70275 0.646318 + 9 │ 0.422339 0.409895 + 10 │ 0.393878 0.383772 +``` + +### Exercise 5 + +Since data frame is small we can use `eachrow`: + +``` +julia> for row in eachrow(testdf) + sinx = sin(row.x) + dev = abs(sinx - row.sinx) + dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev)) + end +(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7) +(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7) +``` + +### Exercise 6 + +Solution: +``` +julia> using CategoricalArrays + +julia> df.xbins = cut(df.x, 0.0:0.1:1.0); + +julia> gdf = groupby(df, :xbins; sort=true); + +julia> [nrow(group) for group in gdf] +10-element Vector{Int64}: + 9872 + 9976 + 9968 + 9943 + 10063 + 10173 + 9977 + 10076 + 9908 + 10044 + +julia> combine(gdf, nrow) # alternative way to do it +10×2 DataFrame + Row │ xbins nrow + │ Cat… Int64 +─────┼─────────────────── + 1 │ [0.0, 0.1) 9872 + 2 │ [0.1, 0.2) 9976 + 3 │ [0.2, 0.3) 9968 + 4 │ [0.3, 0.4) 9943 + 5 │ [0.4, 0.5) 10063 + 6 │ [0.5, 0.6) 10173 + 7 │ [0.6, 0.7) 9977 + 8 │ [0.7, 0.8) 10076 + 9 │ [0.8, 0.9) 9908 + 10 │ [0.9, 1.0) 10044 +``` + +You might get a bit different numbers but all should be around 10,000. 
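+
+The behavior of `cut` used above can also be seen on a tiny, hypothetical
+vector (a sketch):
+
+```
+using CategoricalArrays
+# each value is mapped to the label of the interval that contains it,
+# e.g. 0.05 -> "[0.0, 0.1)" and 0.95 -> "[0.9, 1.0)"
+cut([0.05, 0.55, 0.95], 0.0:0.1:1.0)
+```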
+
+### Exercise 7
+
+Solution:
+
+```
+julia> NamedTuple.(keys(gdf))
+10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
+ (xbins = "[0.0, 0.1)",)
+ (xbins = "[0.1, 0.2)",)
+ (xbins = "[0.2, 0.3)",)
+ (xbins = "[0.3, 0.4)",)
+ (xbins = "[0.4, 0.5)",)
+ (xbins = "[0.5, 0.6)",)
+ (xbins = "[0.6, 0.7)",)
+ (xbins = "[0.7, 0.8)",)
+ (xbins = "[0.8, 0.9)",)
+ (xbins = "[0.9, 1.0)",)
+
+julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
+10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
+ (xbins = "[0.4, 0.5)",)
+ (xbins = "[0.9, 1.0)",)
+ (xbins = "[0.8, 0.9)",)
+ (xbins = "[0.0, 0.1)",)
+ (xbins = "[0.2, 0.3)",)
+ (xbins = "[0.5, 0.6)",)
+ (xbins = "[0.7, 0.8)",)
+ (xbins = "[0.3, 0.4)",)
+ (xbins = "[0.1, 0.2)",)
+ (xbins = "[0.6, 0.7)",)
+```
+
+If you pass `sort=false` instead of `sort=true` you get the groups in their
+order of appearance in `df`. If you skip specifying the `sort` keyword
+argument the resulting group order could depend on the type of the grouping
+column, so if you rely on the order of groups always pass the `sort` keyword
+argument explicitly.
+ +### Exercise 8 + +Solution: + +``` +julia> using Statistics + +julia> [mean(group.n) for group in gdf] +10-element Vector{Float64}: + 14845.847751215559 + 19835.367882919007 + 19919.195826645264 + 19993.023936437694 + 20105.506111497565 + 20222.35761329008 + 20151.794727874112 + 20022.69610956729 + 19909.331550262414 + 14944.511449621665 + +julia> combine(gdf, :n => mean) # alternative way to do it +10×2 DataFrame + Row │ xbins n_mean + │ Cat… Float64 +─────┼───────────────────── + 1 │ [0.0, 0.1) 14845.8 + 2 │ [0.1, 0.2) 19835.4 + 3 │ [0.2, 0.3) 19919.2 + 4 │ [0.3, 0.4) 19993.0 + 5 │ [0.4, 0.5) 20105.5 + 6 │ [0.5, 0.6) 20222.4 + 7 │ [0.6, 0.7) 20151.8 + 8 │ [0.7, 0.8) 20022.7 + 9 │ [0.8, 0.9) 19909.3 + 10 │ [0.9, 1.0) 14944.5 +``` + +### Exercise 9 + +Solution: + +``` +julia> function fitmodel(x, n) + X = [ones(length(x)) x] + α₀, αₓ = X \ n + return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14 + end +fitmodel (generic function with 1 method) + +julia> [fitmodel(group.x, group.n) for group in gdf] +10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}: + (α₀ = 9900.190310776916, αₓ = 99131.14394200995) + (α₀ = 19823.115188829383, αₓ = 81.66979172871368) + (α₀ = 19812.9822724435, αₓ = 424.00895772216785) + (α₀ = 19810.726510910834, αₓ = 520.6763238983195) + (α₀ = 19437.772385484135, αₓ = 1483.333906139938) + (α₀ = 20187.521449870146, αₓ = 63.30709585406235) + (α₀ = 20424.362332155855, αₓ = -419.42268710601405) + (α₀ = 20789.70660364678, αₓ = -1022.9778397184706) + (α₀ = 20013.690535193662, αₓ = -122.80055110522495) + (α₀ = 109320.55276082881, αₓ = -99305.18846102979) + +julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14 +10×3 DataFrame + Row │ xbins α₀ αₓ + │ Cat… Float64 Float64 +─────┼──────────────────────────────────────── + 1 │ [0.0, 0.1) 9900.19 99131.1 + 2 │ [0.1, 0.2) 19823.1 81.6698 + 3 │ [0.2, 0.3) 19813.0 424.009 + 4 │ [0.3, 0.4) 19810.7 520.676 + 5 │ [0.4, 0.5) 
19437.8 1483.33 + 6 │ [0.5, 0.6) 20187.5 63.3071 + 7 │ [0.6, 0.7) 20424.4 -419.423 + 8 │ [0.7, 0.8) 20789.7 -1022.98 + 9 │ [0.8, 0.9) 20013.7 -122.801 + 10 │ [0.9, 1.0) 1.09321e5 -99305.2 +``` + +We note that indeed in the first and last group the regression has a significant +slope. + +### Exercise 10 + +Solution: + +``` +julia> using GLM + +julia> function fitlmmodel(group; info=false) + model = lm(@formula(n~x), group) + coefdf = DataFrame(coeftable(model)) + info && @show coefdf # to see how the data frame looks like + α₀, αₓ = coefdf[:, "Coef."] + return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14 + end +fitlmmodel (generic function with 1 method) + +julia> [fitlmmodel(group; info = true) for group in gdf] +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼──────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 9900.19 0.388607 25476.1 0.0 9899.43 9900.95 + 2 │ x 99131.1 6.75846 14667.7 0.0 99117.9 99144.4 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼─────────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 19823.1 2.52926 7837.5 0.0 19818.2 19828.1 + 2 │ x 81.6698 16.5512 4.93436 8.17139e-7 49.226 114.114 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼─────────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 19813.0 2.8427 6969.79 0.0 19807.4 19818.6 + 2 │ x 424.009 11.2737 37.6106 1.32368e-289 401.91 446.108 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. 
Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼─────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 19810.7 3.98478 4971.59 0.0 19802.9 19818.5 + 2 │ x 520.676 11.3429 45.9033 0.0 498.442 542.911 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼───────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 19437.8 6.07925 3197.4 0.0 19425.9 19449.7 + 2 │ x 1483.33 13.4768 110.065 0.0 1456.92 1509.75 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼───────────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 20187.5 9.72795 2075.21 0.0 20168.5 20206.6 + 2 │ x 63.3071 17.6538 3.58603 0.000337323 28.7022 97.912 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼────────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 20424.4 10.2201 1998.45 0.0 20404.3 20444.4 + 2 │ x -419.423 15.7112 -26.6958 1.0356e-151 -450.22 -388.626 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 20789.7 9.56063 2174.51 0.0 20771.0 20808.4 + 2 │ x -1022.98 12.7417 -80.2856 0.0 -1047.95 -998.001 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. 
Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼───────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 20013.7 8.86033 2258.8 0.0 19996.3 20031.1 + 2 │ x -122.801 10.4201 -11.785 7.60822e-32 -143.226 -102.375 +coefdf = 2×7 DataFrame + Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% + │ String Float64 Float64 Float64 Float64 Float64 Float64 +─────┼───────────────────────────────────────────────────────────────────────────────────────────── + 1 │ (Intercept) 1.09321e5 5.78343 18902.4 0.0 1.09309e5 1.09332e5 + 2 │ x -99305.2 6.08269 -16325.9 0.0 -99317.1 -99293.3 +10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}: + (α₀ = 9900.190310776927, αₓ = 99131.1439420097) + (α₀ = 19823.115188829663, αₓ = 81.66979172690417) + (α₀ = 19812.98227244386, αₓ = 424.00895772074136) + (α₀ = 19810.726510911398, αₓ = 520.6763238966264) + (α₀ = 19437.772385487086, αₓ = 1483.3339061333743) + (α₀ = 20187.521449871012, αₓ = 63.307095852511125) + (α₀ = 20424.36233216108, αₓ = -419.4226871140539) + (α₀ = 20789.706603652226, αₓ = -1022.9778397257375) + (α₀ = 20013.69053519897, αₓ = -122.80055111148658) + (α₀ = 109320.55276074051, αₓ = -99305.18846093686) + +julia> combine(gdf, fitlmmodel) +10×3 DataFrame + Row │ xbins α₀ αₓ + │ Cat… Float64 Float64 +─────┼──────────────────────────────────────── + 1 │ [0.0, 0.1) 9900.19 99131.1 + 2 │ [0.1, 0.2) 19823.1 81.6698 + 3 │ [0.2, 0.3) 19813.0 424.009 + 4 │ [0.3, 0.4) 19810.7 520.676 + 5 │ [0.4, 0.5) 19437.8 1483.33 + 6 │ [0.5, 0.6) 20187.5 63.3071 + 7 │ [0.6, 0.7) 20424.4 -419.423 + 8 │ [0.7, 0.8) 20789.7 -1022.98 + 9 │ [0.8, 0.9) 20013.7 -122.801 + 10 │ [0.9, 1.0) 1.09321e5 -99305.2 +``` + +We got the same results. The `combine(gdf, fitlmmodel)` style of using +the `combine` function is a bit more advanced and is not covered in the book. 
+It is used in cases like the one we have here, when you want to pass
+a whole group to the function used in `combine`. Check the DataFrames.jl
+documentation for more detailed explanations.
+
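+As a minimal illustration of this style (a sketch on a small, hypothetical
+data frame):
+
+```
+using DataFrames
+using Statistics
+df_small = DataFrame(g=[1, 1, 2], x=[1.0, 3.0, 5.0])
+# the anonymous function receives each group as a whole (a SubDataFrame)
+combine(groupby(df_small, :g), sdf -> (xmean=mean(sdf.x),))
+```
+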
diff --git a/exercises/exercises12.md b/exercises/exercises12.md
new file mode 100644
index 0000000..fd67ca0
--- /dev/null
+++ b/exercises/exercises12.md
@@ -0,0 +1,247 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 12
+
+# Problems
+
+### Exercise 1
+
+The https://go.dev/dl/go1.19.2.src.tar.gz link contains the source code of the
+Go language version 1.19.2. As you can check on its website, its SHA-256
+checksum is
+`sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`.
+Download this file and check if it indeed has this checksum.
+You might need to read the documentation of the `string` and `join` functions.
+
+### Exercise 2
+
+Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
+that contains the ego-nets of Eastern European users collected from the music
+streaming service Deezer in February 2020. Nodes are users and edges are mutual
+follower relationships.
+
+From the file extract the deezer_edges.json and deezer_target.csv files and
+save them to disk.
+
+### Exercise 3
+
+Load the deezer_edges.json and deezer_target.csv files into Julia.
+The JSON file should be loaded as the JSON3.jl object `edges_json`.
+The CSV file should be loaded into a data frame `target_df`.
+
+### Exercise 4
+
+Check that the keys in `edges_json` are in the same order as the `id` column
+in `target_df`.
+
+### Exercise 5
+
+From every value stored in `edges_json` create a graph representing the
+ego-net of the given node. Store these graphs in a vector that will become
+the `egonet` column of the `target_df` data frame.
+
+### Exercise 6
+
+An ego-net in our data set is a subgraph of the full Deezer graph where for
+some node all its neighbors are included, together with all edges between
+these neighbors.
+Therefore we expect that the diameter of every ego-net is at most 2 (as every
+two nodes are either connected directly or via a common friend).
+Check if this is indeed the case. Use the `diameter` function.
+ +### Exercise 7 + +For each ego-net find a central node that is connected to every other node +in this network. Use the `degree` and `findall` functions to achieve this. +Add `center` column with numbers of nodes that are connected to all other +nodes in the ego-net to `target_df` data frame. + +Next add a column `center_len` that gives the number of such nodes. + +Check how many times different numbers of center nodes are found. + +### Exercise 8 + +Add the following ego-net features to the `target_df` data frame: +* `size`: number of nodes in ego-net +* `mean_degree`: average node degree in ego-net + +Check mean values of these two columns by `target` column. + +### Exercise 9 + +Continuing to work with `target_df` data frame create a logistic regression +explaining `target` by `size` and `mean_degree`. + +### Exercise 10 + +Continuing to work with `target_df` create a scatterplot where `size` will be on +one axis and `mean_degree` rounded to nearest integer on the other axis. +Plot the mean of `target` for each point being a combination of `size` and +rounded `mean_degree`. + +Additionally fit a LOESS model explaining `target` by `size`. Make a prediction +for values in range from 5% to 95% quantile (to concentrate on typical values +of size). + +# Solutions + +
+
+Show!
+
+### Exercise 1
+
+Solution:
+
+```
+using Downloads
+using SHA
+Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz")
+shavec = open(sha256, "go.tar.gz")
+shastr = join(string.(shavec; base=16, pad=2))
+sha == shastr
+```
+
+The last line should produce `true`.
+
+### Exercise 2
+
+Solution:
+
+```
+Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
+import ZipFile
+archive = ZipFile.Reader("ego.zip")
+idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files))
+open("deezer_edges.json", "w") do io
+    write(io, read(archive.files[idx]))
+end
+idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files))
+open("deezer_target.csv", "w") do io
+    write(io, read(archive.files[idx]))
+end
+close(archive)
+```
+
+### Exercise 3
+
+Solution:
+
+```
+using CSV
+using DataFrames
+using JSON3
+edges_json = JSON3.read(read("deezer_edges.json"))
+target_df = CSV.read("deezer_target.csv", DataFrame)
+```
+
+### Exercise 4
+
+Solution (short, but you need to have a good understanding of Julia types
+and standard functions to properly write it):
+```
+Symbol.(target_df.id) == keys(edges_json)
+```
+
+### Exercise 5
+
+Solution:
+
+```
+using Graphs
+function edgelist2graph(edgelist)
+    nodes = sort!(unique(reduce(vcat, edgelist)))
+    @assert 0:length(nodes)-1 == nodes
+    g = SimpleGraph(length(nodes))
+    for (a, b) in edgelist
+        add_edge!(g, a+1, b+1)
+    end
+    return g
+end
+target_df.egonet = [edgelist2graph(el) for el in values(edges_json)]
+```
+
+### Exercise 6
+
+Solution:
+
+```
+julia> extrema(diameter.(target_df.egonet))
+(2, 2)
+```
+
+Indeed, we see that the diameter of each ego-net is 2.
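+
+The `diameter` computation can be illustrated on a tiny, hypothetical graph
+(a sketch):
+
+```
+using Graphs
+# in the path 1-2-3 the longest shortest path (from node 1 to node 3)
+# has length 2
+g = path_graph(3)
+diameter(g)
+```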
+
+### Exercise 7
+
+Solution:
+
+```
+target_df.center = map(target_df.egonet) do g
+    findall(==(nv(g) - 1), degree(g))
+end
+target_df.center_len = length.(target_df.center)
+combine(groupby(target_df, :center_len, sort=true), nrow)
+```
+
+Note that we used `map` since in this case it gives a convenient way to express
+the condition we want to check.
+
+We notice that in some cases it is impossible to identify the center of the
+ego-net uniquely.
+
+### Exercise 8
+
+Solution:
+
+```
+using Statistics
+target_df.size = nv.(target_df.egonet)
+target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size
+combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
+```
+
+It seems that for target equal to `0`, size and average degree in the network
+are a bit larger.
+
+### Exercise 9
+
+Solution:
+
+```
+using GLM
+glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
+```
+
+We see that only `size` is statistically significant.
+
+### Exercise 10
+
+Solution:
+
+```
+using Plots
+target_df.round_degree = round.(Int, target_df.mean_degree)
+agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean)
+scatter(agg_df.size, agg_df.round_degree;
+        zcolor=agg_df.target_mean,
+        xlabel="size", ylabel="rounded degree",
+        label="mean target", xaxis=:log)
+```
+
+It is hard to visually see any strong relationship in the data.
+
+```
+import Loess
+model = Loess.loess(target_df.size, target_df.target)
+size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95)
+target_predict = Loess.predict(model, size_predict)
+plot(size_predict, target_predict;
+     xlabel="size", ylabel="predicted target", legend=false)
+```
+
+Between quantiles 5% and 95% we see a downward-sloping relationship.
+
diff --git a/exercises/exercises13.md b/exercises/exercises13.md new file mode 100644 index 0000000..7be60b4 --- /dev/null +++ b/exercises/exercises13.md @@ -0,0 +1,280 @@ +# Julia for Data Analysis + +## Bogumił Kamiński, Daniel Kaszyński + +# Chapter 13 + +# Problems + +### Exercise 1 + +Download +https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip +archive and extract primary_data.csv and secondary_data.csv files from it. +Save the files to disk. + +### Exercise 2 + +Load primary_data.csv into the `primary` data frame. +Load secondary_data.csv into the `secondary` data frame. +Describe the contents of both data frames. + +### Exercise 3 + +Start with `primary` data. Note that columns starting from column 4 have +their data encoded using vector notation, but they have been read-in as strings. +Convert these columns to hold proper vectors. Note that some columns have +`missing` values. Most of the columns hold nominal data, but three columns, +i.e. `cap-diameter`, `stem-height`, and `stem-width` have numeric data. These +should be parsed as vectors storing numeric values. After parsing, put these +three columns just after `class` column in the `parsed_primary` data frame. + +Check `renamecols` keyword argument of `select` to +avoid renaming of the produced columns. + +### Exercise 4 + +In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be +confused and have a different class, using the information about them we have +(so all information except their family, name, and class). 
+ +Use the following rules: +* if for some pair of mushrooms the data in some column for either of them is + `missing` then skip matching on this column; for numeric columns if there + is only one value in a vector then treat it as `missing`; +* otherwise: + - for numeric columns check if there is an overlap in the interval + specified by the min and max values for the range passed; + - for nominal columns check if the intersection of nominal values is nonempty. + +For each found pair print to the screen the row number, family, name, and class. + +### Exercise 5 + +Still using `parsed_primary` find what is the average probability of class being +`p` by `family`. Additionally add number of observations in each group. Sort +these results by the probability. Try using DataFramesMeta.jl to do this +exercise (this requirement is optional). + +Store the result in `agg_primary` data frame. + +### Exercise 6 + +Now using `agg_primary` data frame collapse it so that for each unique `pr_p` +it gives us a total number of rows that had this probability and a tuple +of mushroom family names. + +Optionally: try to display the produced table so that the tuple containing the +list of families for each group is not cropped (this will require large +terminal). + +### Exercise 7 + +From our preliminary analysis of `primary` data we see that `missing` value in +the primary data is non-informative, so in `secondary` data we should be +cautious when building a model if we allowed for missing data (in practice +if we were investigating some real mushroom we most likely would know its +characteristics). + +Therefore as a first step drop in-place all columns in `secondary` data frame +that have missing values. + +### Exercise 8 + +Create a logistic regression predicting `class` based on all remaining features +in the data frame. You might need to check the `Term` usage in StatsModels.jl +documentation. 
+ +You will notice that for `stem-color` and `habitat` columns you get strange +estimation results (large absolute values of estimated parameters and even +larger standard errors). Explain why this happens by analyzing frequency tables +of these variables against `class` column. + +### Exercise 9 + +Add `class_p` column to `secondary` as a second column that will contain +predicted probability from the model created in exercise 8 of a given +observation having class `p`. + +Print descriptive statistics of column `class_p` by `class`. + +### Exercise 10 + +Plot FPR-TPR ROC curve for our model and compute associated AUC value. + +# Solutions + +
+ +Show! + +### Exercise 1 + +Solution: + +``` +using Downloads +import ZipFile +Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip") +archive = ZipFile.Reader("MushroomDataset.zip") +idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files)) +open("primary_data.csv", "w") do io + write(io, read(archive.files[idx])) +end +idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files)) +open("secondary_data.csv", "w") do io + write(io, read(archive.files[idx])) +end +close(archive) +``` + +### Exercise 2 + +Solution: + +``` +using CSV +using DataFrames +primary = CSV.read("primary_data.csv", DataFrame; delim=';') +secondary = CSV.read("secondary_data.csv", DataFrame; delim=';') +describe(primary) +describe(secondary) +``` + +### Exercise 3 + +Solution: + +``` +parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ") +parse_nominal(::Missing) = missing +parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", ")) +parse_numeric(::Missing) = missing +idcols = ["family", "name", "class"] +numericcols = ["cap-diameter", "stem-height", "stem-width"] +parsed_primary = select(primary, + idcols, + numericcols .=> ByRow(parse_numeric), + Not([idcols; numericcols]) .=> ByRow(parse_nominal); + renamecols=false) +``` + +### Exercise 4 + +Solution: + +``` +function overlap_numeric(v1, v2) + # there are no missing values in numeric columns + if length(v1) == 1 || length(v2) == 1 + return true + else + return max(v1[1], v2[1]) <= min(v1[2], v2[2]) + end +end + +function overlap_nominal(v1, v2) + if ismissing(v1) || ismissing(v2) + return true + else + return !isempty(intersect(v1, v2)) + end +end + +function rows_overlap(row1, row2) + # note that in parsed_primary numeric columns have indices 4 to 6 + # and nominal columns have indices 7 to 23 + return all(i -> overlap_numeric(row1[i], row2[i]), 4:6) && + all(i -> 
overlap_nominal(row1[i], row2[i]), 7:23)
+end
+
+for i in 1:nrow(parsed_primary), j in i+1:nrow(parsed_primary)
+    row1 = parsed_primary[i, :]
+    row2 = parsed_primary[j, :]
+    if rows_overlap(row1, row2) && row1.class != row2.class
+        println((i, Tuple(row1[1:3]), j, Tuple(row2[1:3])))
+    end
+end
+```
+
+Note that in this exercise using `eachrow` is not a problem
+(although it is not type stable) because the data is small.
+
+### Exercise 5
+
+Solution:
+
+```
+using Statistics
+using DataFramesMeta
+agg_primary = @chain parsed_primary begin
+    groupby(:family)
+    @combine(:pr_p = mean(:class .== "p"), $nrow)
+    sort(:pr_p)
+end
+```
+
+### Exercise 6
+
+Solution:
+
+```
+show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families), truncate=140)
+```
+
+### Exercise 7
+
+Solution:
+
+```
+select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
+```
+
+Note that we select based on the actual contents of the columns and not on their
+element type (a column could allow for missing values but not contain any).
+
+### Exercise 8
+
+Solution:
+
+```
+using GLM
+using FreqTables
+secondary.class = secondary.class .== "p"
+model = glm(Term(:class)~sum(Term.(Symbol.(names(secondary, Not(:class))))),
+            secondary, Binomial(), LogitLink())
+freqtable(secondary, "stem-color", "class")
+freqtable(secondary, "habitat", "class")
+```
+
+We can see that for certain levels of the `stem-color` and `habitat` variables
+there is a perfect separation of classes.
+
+### Exercise 9
+
+Solution:
+
+```
+insertcols!(secondary, 2, :class_p => predict(model))
+combine(groupby(secondary, :class)) do sdf
+    return describe(sdf, :detailed, cols=:class_p)
+end
+```
+
+We can see that the model has some discriminatory power, but there
+is still a significant overlap between classes.
+ +### Exercise 10 + +Solution: + +``` +using Plots +using ROCAnalysis +roc_data = roc(secondary; score=:class_p, target=:class) +plot(roc_data.pfa, 1 .- roc_data.pmiss; + title="AUC=$(round(100*(1 - auc(roc_data)), digits=2))%", + xlabel="FPR", ylabel="TPR", legend=false) +``` + +
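For intuition about what `roc` and `auc` compute, the same bookkeeping can be done by hand: sweep a threshold over the scores, record an (FPR, TPR) point at each threshold, and integrate the curve with the trapezoidal rule. Below is a minimal self-contained sketch on made-up data (the `scores` and `labels` vectors are hypothetical, not the mushroom data):

```julia
# Hand-rolled ROC/AUC: sweep thresholds over made-up scores, collect
# (FPR, TPR) points, then integrate with the trapezoidal rule.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [false, true, false, true]

P = count(labels)           # number of positives
N = length(labels) - P      # number of negatives

tpr = Float64[]
fpr = Float64[]
for t in sort(unique(scores), rev=true)
    pred = scores .>= t     # classify as positive at threshold t
    push!(tpr, count(pred .& labels) / P)
    push!(fpr, count(pred .& .!labels) / N)
end

# close the curve at (0, 0) and (1, 1)
x = [0.0; fpr; 1.0]
y = [0.0; tpr; 1.0]
auc = sum((x[i+1] - x[i]) * (y[i+1] + y[i]) / 2 for i in 1:length(x)-1)
```

For this made-up data every positive scores above every negative, so the curve passes through the (0, 1) corner and the area is exactly 1; on the mushroom data the ROCAnalysis.jl calls above perform the equivalent computation.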
diff --git a/exercises/exercises14.md b/exercises/exercises14.md new file mode 100644 index 0000000..fa4dfd8 --- /dev/null +++ b/exercises/exercises14.md @@ -0,0 +1,384 @@
+# Julia for Data Analysis
+
+## Bogumił Kamiński, Daniel Kaszyński
+
+# Chapter 14
+
+# Problems
+
+### Exercise 1
+
+Write a simulator that takes one parameter `n`, which defines a set of `n`
+items. Next assume we draw `n` times from this set with replacement. The
+simulator should return the fraction of items from this set that were drawn
+at least once.
+
+Call the function running this simulation `boot`.
+
+### Exercise 2
+
+Now write a function `simboot` that takes parameters `k` and `n` and runs
+the simulation defined in the `boot` function `k` times. It should return
+a named tuple storing `k`, `n`, the mean of the produced values, and the
+endpoints of an approximate 95% confidence interval of the result.
+
+Make this function single-threaded. Check how long
+this function runs for `k=1000` and `n=1_000_000`.
+
+### Exercise 3
+
+Now rewrite this simulator to be multi-threaded. Use 4 threads for benchmarking.
+Call the function `simbootT`. Check how long this function runs for `k=1000` and
+`n=1_000_000`.
+
+### Exercise 4
+
+Now rewrite `boot` and `simbootT` to perform fewer allocations. Achieve this by
+making sure that all allocated objects are passed to the `boot!` function (so
+that it does not do any allocations internally). Call these new functions
+`boot!` and `simbootT2`. You might need to use the `Threads.threadid` and
+`Threads.nthreads` functions.
+
+### Exercise 5
+
+Use either of the solutions we have developed in the previous exercises to
+create a web service taking `k` and `n` parameters and returning the values
+produced by the simulation and the time it took to run. You might want to
+use the `@timed` macro in your solution.
+
+Start the server.
+
+### Exercise 6
+
+Query the server started in exercise 5 with
+the following parameters:
+* `k=1000` and `n=1000`
+* `k=1.5` and `n=1000`
+
+### Exercise 7
+
+Collect the data generated by the web service into the `df` data frame for
+`k = [10^i for i in 3:6]` and `n = [10^i for i in 1:3]`.
+
+### Exercise 8
+
+Replace the `value` column in the `df` data frame by columns storing its
+contents, modifying `df` in place.
+
+### Exercise 9
+
+Check that the execution time scales roughly proportionally to the product
+of `k` and `n`.
+
+### Exercise 10
+
+Plot the expected fraction of seen elements in the set as a function of
+`n` for each value of `k`, along with a 95% confidence interval around
+these values.
+
+# Solutions
+
+ +Show! + +### Exercise 1 + +Solution (there are many other approaches you could use): + +``` +using Statistics + +function boot(n::Integer) + table = falses(n) + for _ in 1:n + table[rand(1:n)] = true + end + return mean(table) +end +``` + +### Exercise 2 + +Solution: + +``` +function simboot(k::Integer, n::Integer) + result = [boot(n) for _ in 1:k] + mv = mean(result) + sdv = std(result) + lo95 = mv - 1.96 * sdv / sqrt(k) + hi95 = mv + 1.96 * sdv / sqrt(k) + return (; k, n, mv, lo95, hi95) +end +``` + +We run it twice to make sure everything is compiled: +``` +julia> @time simboot(1000, 1_000_000) + 7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time) +(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944) + +julia> @time simboot(1000, 1_000_000) + 7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time) +(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754) +``` + +We see that on my computer the run time is around 7 seconds. + +### Exercise 3 + +Solution: + +``` +using ThreadsX + +function simbootT(k::Integer, n::Integer) + result = ThreadsX.map(i -> boot(n), 1:k) + mv = mean(result) + sdv = std(result) + lo95 = mv - 1.96 * sdv / sqrt(k) + hi95 = mv + 1.96 * sdv / sqrt(k) + return (; k, n, mv, lo95, hi95) +end +``` + +Here is the timing for four threads: +``` +julia> @time simbootT(1000, 1_000_000) + 2.390795 seconds (3.37 k allocations: 119.434 MiB) +(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484) + +julia> @time simbootT(1000, 1_000_000) + 2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time) +(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554) +``` + +Indeed we see a significant performance improvement. 
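One practical note that the code above does not show: the four-thread timings assume Julia itself was started with multiple threads, which must be requested at startup, e.g. with the `-t`/`--threads` command-line option:

```shell
# start Julia with 4 threads; Threads.nthreads() then reports 4
julia -t 4 -e 'println(Threads.nthreads())'
```

Without this option Julia starts with a single thread and `simbootT` falls back to sequential execution.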
+
+### Exercise 4
+
+Solution:
+
+```
+function boot!(n::Integer, pool)
+    table = pool[Threads.threadid()]
+    fill!(table, false)
+    for _ in 1:n
+        table[rand(1:n)] = true
+    end
+    return mean(table)
+end
+
+function simbootT2(k::Integer, n::Integer)
+    pool = [falses(n) for _ in 1:Threads.nthreads()]
+    result = ThreadsX.map(i -> boot!(n, pool), 1:k)
+    mv = mean(result)
+    sdv = std(result)
+    lo95 = mv - 1.96 * sdv / sqrt(k)
+    hi95 = mv + 1.96 * sdv / sqrt(k)
+    return (; k, n, mv, lo95, hi95)
+end
+```
+
+In this solution the `pool` vector keeps a separate `table` vector
+for each thread. Let us test the timing:
+
+```
+julia> @time simbootT2(1000, 1_000_000)
+  2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
+(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)
+
+julia> @time simbootT2(1000, 1_000_000)
+  2.340694 seconds (391 allocations: 586.453 KiB)
+(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)
+```
+
+Indeed, we see that the number of allocations was decreased, which should lower
+GC usage. However, the runtime of the simulation is similar since in this task
+memory allocation does not account for a significant portion of the runtime.
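One caveat about this pattern: indexing `pool` with `Threads.threadid()` assumes a task never changes threads while it runs, but since Julia 1.7 tasks may migrate between threads. A sketch of an alternative that sidesteps the issue (the names `boot2!` and `simboot_chunks` are ours, not from the book) splits the `k` replications into chunks and gives each spawned task its own buffer:

```julia
using Statistics

# Same simulation as boot!, but the caller passes the buffer explicitly.
function boot2!(n::Integer, table::BitVector)
    fill!(table, false)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

# One chunk of replications per thread; every spawned task owns its own
# buffer, so no buffer is ever shared between tasks.
function simboot_chunks(k::Integer, n::Integer)
    chunks = collect(Iterators.partition(1:k, cld(k, Threads.nthreads())))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            table = falses(n)
            [boot2!(n, table) for _ in chunk]
        end
    end
    result = reduce(vcat, fetch.(tasks))
    mv = mean(result)
    sdv = std(result)
    return (; k, n, mv,
            lo95 = mv - 1.96 * sdv / sqrt(k),
            hi95 = mv + 1.96 * sdv / sqrt(k))
end
```

The timing characteristics should be comparable to `simbootT2`, and the allocation count stays low because only one buffer per task is created.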
+ +### Exercise 5 + +Solution (I used the simplest single-threaded code here; this is a complete +code of the web service): + +``` +using Genie +using Statistics + +function boot(n::Integer) + table = falses(n) + for _ in 1:n + table[rand(1:n)] = true + end + return mean(table) +end + +function simboot(k::Integer, n::Integer) + result = [boot(n) for _ in 1:k] + mv = mean(result) + sdv = std(result) + lo95 = mv - 1.96 * sdv / sqrt(k) + hi95 = mv + 1.96 * sdv / sqrt(k) + return (; k, n, mv, lo95, hi95) +end + +Genie.config.run_as_server = true + +Genie.Router.route("/", method=POST) do + message = Genie.Requests.jsonpayload() + return try + k = message["k"] + n = message["n"] + value, time = @timed simboot(k, n) + Genie.Renderer.Json.json((status="OK", time=time, value=value)) + catch + Genie.Renderer.Json.json((status="ERROR", time="", value="")) + end +end + +Genie.Server.up() +``` + +### Exercise 6 + +Solution: + +``` +julia> using HTTP + +julia> using JSON3 + +julia> HTTP.post("http://127.0.0.1:8000", + ["Content-Type" => "application/json"], + JSON3.write((k=1000, n=1000))) +HTTP.Messages.Response: +""" +HTTP/1.1 200 OK +Content-Type: application/json; charset=utf-8 +Server: Genie/Julia/1.8.2 +Transfer-Encoding: chunked + +{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}""" + +julia> HTTP.post("http://127.0.0.1:8000", + ["Content-Type" => "application/json"], + JSON3.write((k=1.5, n=1000))) +HTTP.Messages.Response: +""" +HTTP/1.1 200 OK +Content-Type: application/json; charset=utf-8 +Server: Genie/Julia/1.8.2 +Transfer-Encoding: chunked + +{"status":"ERROR","time":"","value":""}""" +``` + +As expected we got a positive answer the first time and an error on the second call. 
+ +### Exercise 7 + +Solution: + +``` +using DataFrames + +df = DataFrame() +for k in [10^i for i in 3:6], n in [10^i for i in 1:3] + @show k, n + req = HTTP.post("http://127.0.0.1:8000", + ["Content-Type" => "application/json"], + JSON3.write((; k, n))) + push!(df, NamedTuple(JSON3.read(req.body))) +end +``` + +Note that I convert `JSON3.Object` into a `NamedTuple` to easily `push!` +it into the `df` data frame. + +Let us have a look at the produced data frame: + +``` +julia> df +12×3 DataFrame + Row │ status time value + │ String Float64 Object… +─────┼────────────────────────────────────────────────────── + 1 │ OK 0.0006784 {\n "k": 1000,\n "n": … + 2 │ OK 0.0038374 {\n "k": 1000,\n "n": … + 3 │ OK 0.0150844 {\n "k": 1000,\n "n": … + 4 │ OK 0.0014071 {\n "k": 10000,\n "n":… + 5 │ OK 0.008443 {\n "k": 10000,\n "n":… + 6 │ OK 0.0700319 {\n "k": 10000,\n "n":… + 7 │ OK 0.0253826 {\n "k": 100000,\n "n"… + 8 │ OK 0.0795937 {\n "k": 100000,\n "n"… + 9 │ OK 0.708287 {\n "k": 100000,\n "n"… + 10 │ OK 0.160286 {\n "k": 1000000,\n "n… + 11 │ OK 0.803433 {\n "k": 1000000,\n "n… + 12 │ OK 7.23958 {\n "k": 1000000,\n "n… +``` + +### Exercise 8 + +Solution: + +``` +julia> select!(df, :status, :time, :value => AsTable) +12×7 DataFrame + Row │ status time k n mv lo95 hi95 + │ String Float64 Int64 Int64 Float64 Float64 Float64 +─────┼───────────────────────────────────────────────────────────────── + 1 │ OK 0.0006784 1000 10 0.6469 0.640745 0.653055 + 2 │ OK 0.0038374 1000 100 0.63508 0.633035 0.637125 + 3 │ OK 0.0150844 1000 1000 0.632178 0.631581 0.632775 + 4 │ OK 0.0014071 10000 10 0.65239 0.650425 0.654355 + 5 │ OK 0.008443 10000 100 0.634456 0.633845 0.635067 + 6 │ OK 0.0700319 10000 1000 0.63207 0.631878 0.632262 + 7 │ OK 0.0253826 100000 10 0.651411 0.650793 0.652029 + 8 │ OK 0.0795937 100000 100 0.634 0.633807 0.634193 + 9 │ OK 0.708287 100000 1000 0.63224 0.632179 0.632302 + 10 │ OK 0.160286 1000000 10 0.65129 0.651095 0.651486 + 11 │ OK 0.803433 1000000 100 0.633995 
0.633934  0.634056
+  12 │ OK      7.23958    1000000   1000  0.63232   0.632301  0.63234
+```
+
+### Exercise 9
+
+Solution:
+
+```
+julia> using DataFramesMeta
+
+julia> @chain df begin
+           @rselect(:k, :n, :avg_time = :time / (:k * :n))
+           unstack(:k, :n, :avg_time)
+       end
+4×4 DataFrame
+ Row │ k        10          100         1000
+     │ Int64    Float64?    Float64?    Float64?
+─────┼─────────────────────────────────────────────
+   1 │    1000  6.784e-8    3.8374e-8   1.50844e-8
+   2 │   10000  1.4071e-8   8.443e-9    7.00319e-9
+   3 │  100000  2.53826e-8  7.95937e-9  7.08287e-9
+   4 │ 1000000  1.60286e-8  8.03433e-9  7.23958e-9
+```
+
+We see that indeed this is the case. For large `k` and `n` the average time per
+single sample stabilizes (for small values the runtime is short, so the timing
+is more affected by external noise and by the fixed overhead of the other
+operations the functions perform).
+
+### Exercise 10
+
+Solution:
+```
+using Plots
+gdf = groupby(df, :k, sort=true)
+plot([bar(string.(g.n), g.mv;
+          ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
+          legend=false, title=first(g.k),
+          yerror=(g.mv - g.lo95, g.hi95 - g.mv)) for g in gdf]...)
+```
+
+As expected, the error bands get narrower as `k` increases.
+Note that as `n` increases the estimated value tends to `1-exp(-1)`.
+
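The last observation can be checked directly: in `n` draws with replacement a given item is missed with probability `(1 - 1/n)^n`, which tends to `exp(-1)` as `n` grows, so the expected fraction of items seen at least once tends to `1 - exp(-1) ≈ 0.6321`, matching the `mv` values for `n=1000` in the tables above. A quick numerical check:

```julia
# expected fraction of distinct items seen after n draws with replacement:
# each item is missed with probability (1 - 1/n)^n, which tends to exp(-1)
expected_seen(n) = 1 - (1 - 1/n)^n

for n in (10, 100, 1000, 10^6)
    println(n, " => ", expected_seen(n))
end
println("limit: ", 1 - exp(-1))
```

Already for `n = 1000` the expected fraction agrees with the limit to about four decimal places.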
diff --git a/exercises/logo.png b/exercises/logo.png new file mode 100644 index 0000000..3d71441 Binary files /dev/null and b/exercises/logo.png differ