add exercises
This commit is contained in:
15
README.md
15
README.md
@@ -5,6 +5,11 @@ This repository contains source codes for the
|
||||
book that is written by Bogumił Kamiński and is planned to be published in 2022
|
||||
by [Manning Publications Co.](https://www.manning.com/)
|
||||
|
||||
Extras:
|
||||
* in the `/exercises` folder for each book chapter you can find 10 additional
|
||||
exercises with solutions (they are meant for self study and are not discussed
|
||||
in the book)
|
||||
|
||||
## Setting up your environment
|
||||
|
||||
In order to prepare the Julia environment before working with the materials
|
||||
@@ -71,9 +76,9 @@ The codes for each chapter are stored in files named *chXX.jl*, where *XX* is
|
||||
chapter number. The exceptions are
|
||||
|
||||
* chapter 14, where additionally a separate *ch14_server.jl* is present along
|
||||
with *ch14.jl* (the reason is that in this chapter we create a web service and
|
||||
the *ch14_server.jl* contains the server-side code that should be run in a
|
||||
separate Julia process);
|
||||
with *ch14.jl* (the reason is that in this chapter we create a web service and
|
||||
the *ch14_server.jl* contains the server-side code that should be run in a
|
||||
separate Julia process);
|
||||
* appendix A, where the file name used is *appA.txt* because it also
|
||||
contains other instructions than only Julia code (in particular package
|
||||
manager mode instructions).
|
||||
@@ -113,8 +118,6 @@ There are the following videos that feature material related to this book:
|
||||
* [Analysis of GitHub developer graph](https://www.twitch.tv/videos/1527593261)
|
||||
(a shortened version of material covered in chapter 12)
|
||||
|
||||
|
||||
|
||||
## Data used in the book
|
||||
|
||||
For your convenience I additionally stored data files that we use in this book.
|
||||
@@ -130,5 +133,3 @@ They are respectively:
|
||||
<https://snap.stanford.edu/data/github-social.html> under GPL-3.0 License)
|
||||
* owensboro.zip (for chapter 13, available at The Stanford Open Policing Project
|
||||
under the Open Data Commons Attribution License)
|
||||
|
||||
<!-- markdownlint-disable-file MD033 -->
|
||||
|
||||
33
exercises/README.md
Normal file
33
exercises/README.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
This folder contains additional exercises to the book
|
||||
["Julia for Data Analysis"](https://www.manning.com/books/julia-for-data-analysis?utm_source=bkamins&utm_medium=affiliate&utm_campaign=book_kaminski2_julia_3_17_22)
|
||||
that is written by Bogumił Kamiński and is planned to be published in 2022
|
||||
by [Manning Publications Co.](https://www.manning.com/).
|
||||
|
||||
The exercises were prepared by [Bogumił Kamiński](https://github.com/bkamins) and [Daniel Kaszyński](https://www.linkedin.com/in/daniel-kaszy%C5%84ski-3a7807113/).
|
||||
|
||||
For each book chapter you can find 10 additional exercises with solutions.
|
||||
The exercises are meant for self study and are not discussed in the book.
|
||||
|
||||
The exercises contained in this folder have varying level of difficulty. The
|
||||
problems for the first few chapters are meant to be relatively easy, however you
|
||||
might find exercises for the last chapters more challenging (if you are learning
|
||||
Julia I still encourage you to try these exercises and walk through the
|
||||
solutions trying to understand them). In particular, in some exercises I, on
|
||||
purpose, require using more functionalities of Julia ecosystem that is covered
|
||||
in the book. This is meant to teach you how to use help and documentation,
|
||||
as this is a very important skill to master.
|
||||
|
||||
The files containing exercises have a naming convention `execricesDD.md`, where
|
||||
`DD` is book chapter number for which the exercises were prepared.
|
||||
|
||||
All the exercises should be possible to solve using project environment setting
|
||||
that is used in the whole book (please check the global README.md file of this
|
||||
repository for details).
|
||||
|
||||
---
|
||||
|
||||
*Preparation of these exercises have been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*
|
||||
|
||||

|
||||
BIN
exercises/example8.csv.gz
Normal file
BIN
exercises/example8.csv.gz
Normal file
Binary file not shown.
267
exercises/exercises02.md
Normal file
267
exercises/exercises02.md
Normal file
@@ -0,0 +1,267 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 2
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Consider the following code:
|
||||
```
|
||||
x = [1, 2]
|
||||
y = x
|
||||
y[1] = 10
|
||||
```
|
||||
What is the value of `x[1]` and why?
|
||||
|
||||
### Exercise 2
|
||||
|
||||
How can you type `⚡ = 1`. Check if this operation succeeds and what is its result.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
What will be the value of variable `x` after running of the following code and why?
|
||||
```
|
||||
x = 0.0
|
||||
for i in 1:7_000_000
|
||||
global x += 1/7
|
||||
end
|
||||
x /= 1_000_000
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Express the type `Matrix{Bool}` using `Array` type.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Let `x` be a vector. Write code that prints an error if `x` is empty
|
||||
(has zero elements)
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Write a function called `exec` that takes two values `x` and `y` and a function
|
||||
accepting two arguments, call it `op` and returns `op(x, y)`. Make `+` to be
|
||||
the default value of `op`.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Write a function that calculates a sum of absolute values of values stored in
|
||||
a collection passed to it.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Write a function that swaps first and last element in an array in place.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Write a loop in global scope that calculates the sum of cubes of numbers from
|
||||
`1` to `10^6`. Next use the `sum` function to perform the same computation.
|
||||
What is the difference in timing of these operations?
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Explain the value of the result of summation obtained in exercise 9.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
`x[1]` will be `10` because `y = x` is not copying data but it binds
|
||||
the same value both to variable `x` and `y`.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
In help mode (activated by `?`) copy-paste `⚡` to get:
|
||||
```
|
||||
help?> ⚡
|
||||
"⚡" can be typed by \:zap:<tab>
|
||||
```
|
||||
After the `⚡ = 1` operation a new variable `⚡` is defined and it is bound
|
||||
to value `1`.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
`x` will have value `0.9999999999242748`. This value is below `1.0` because
|
||||
representation of `1/7` using `Float64` type is less than rational number 1/7,
|
||||
and the error accumulates when we do addition multiple times.
|
||||
|
||||
*Extra*: You can check that indeed that `Float64` representation is a bit less
|
||||
than rational 1/7 by increasing the precision of computations using the `big`
|
||||
function:
|
||||
```
|
||||
julia> big(1/7) # convert Floa64 to high-precision float
|
||||
0.142857142857142849212692681248881854116916656494140625
|
||||
|
||||
julia> 1/big(7) # construct high-precision float directly
|
||||
0.1428571428571428571428571428571428571428571428571428571428571428571428571428568
|
||||
```
|
||||
As you can see there is a difference at 17th place after decimal dot where we
|
||||
have `4` vs `5`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
It is `Array{Bool, 2}`. You immediately get this information in REPL:
|
||||
```
|
||||
julia> Matrix{Bool}
|
||||
Matrix{Bool} (alias for Array{Bool, 2})
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
You can do it like this:
|
||||
```
|
||||
length(x) == 0 && println("x is empty")
|
||||
```
|
||||
|
||||
*Extra*: typically in such case one would use the `isempty` function and throw
|
||||
an exception instead of just printing information (here I assume that `x` was
|
||||
passed as an argument to the function):
|
||||
```
|
||||
isempty(x) && throw(ArgumentError("x is not allowed to be empty"))
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Here are two ways to define the `exec` function:
|
||||
```
|
||||
exec1(x, y, op=+) = op(x, y)
|
||||
exec2(x, y; op=+) = op(x, y)
|
||||
```
|
||||
The first of them uses positional arguments, and the second a keyword argument.
|
||||
Here is a difference in how they would be called:
|
||||
```
|
||||
julia> exec1(2, 3, *)
|
||||
6
|
||||
|
||||
julia> exec2(2, 3; op=*)
|
||||
6
|
||||
```
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Such a function can be written as:
|
||||
```
|
||||
sumabs(x) = sum(abs, x)
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
This can be written for example as:
|
||||
```
|
||||
function swap!(x)
|
||||
f = x[1]
|
||||
x[1] = x[end]
|
||||
x[end] = f
|
||||
return x
|
||||
end
|
||||
```
|
||||
|
||||
*Extra* A more advanced way to write this function would be:
|
||||
```
|
||||
function swap!(x)
|
||||
if length(x) > 1
|
||||
x[begin], x[end] = x[end], x[begin]
|
||||
end
|
||||
return x
|
||||
end
|
||||
```
|
||||
Note the differences in the code:
|
||||
* we use `begin` instead of `1` to get the first element. This is a safer
|
||||
practice since some collections in Julia do not use 1-based indexing (in
|
||||
practice you are not likely to see them, so this comment is most relevant
|
||||
for package developers)
|
||||
* if there are `0` or `1` element in the collection the function does not do
|
||||
anything (depending on the context we might want to throw an error instead)
|
||||
* in `x[begin], x[end] = x[end], x[begin]` we perform two assignments at the
|
||||
same time to avoid having to use a temporaty variable `f` (this operation
|
||||
is technically called tuple destructuring; we discuss it in later chapters of
|
||||
the book)
|
||||
|
||||
### Exercise 9
|
||||
|
||||
We used `@time` macro in chapter 1.
|
||||
|
||||
Version in global scope:
|
||||
```
|
||||
julia> s = 0
|
||||
0
|
||||
|
||||
julia> @time for i in 1:10^6
|
||||
global s += i^3
|
||||
end
|
||||
0.076299 seconds (2.00 M allocations: 30.517 MiB, 10.47% gc time)
|
||||
```
|
||||
|
||||
Version with a function using a `sum` function:
|
||||
```
|
||||
julia> sum3(n) = sum(x -> x^3, 1:n)
|
||||
sum3 (generic function with 1 method)
|
||||
|
||||
julia> @time sum3(10^6)
|
||||
0.000012 seconds
|
||||
-8222430735553051648
|
||||
```
|
||||
|
||||
Version with `sum` function in global scope:
|
||||
```
|
||||
julia> @time sum(x -> x^3, 1:10^6)
|
||||
0.027436 seconds (48.61 k allocations: 2.558 MiB, 99.75% compilation time)
|
||||
-8222430735553051648
|
||||
|
||||
julia> @time sum(x -> x^3, 1:10^6)
|
||||
0.025744 seconds (48.61 k allocations: 2.557 MiB, 99.76% compilation time)
|
||||
-8222430735553051648
|
||||
```
|
||||
|
||||
As you can see using a loop in global scope is inefficient. It leads to
|
||||
many allocations and slow execution.
|
||||
|
||||
Using a `sum3` function leads to fastest execution. You might ask why using
|
||||
`sum(x -> x^3, 1:10^6)` in global scope is slower. The reason is that an
|
||||
anonymous function `x -> x^3` is defined anew each time this operation is called
|
||||
which forces compilation of the `sum` function (but it is still faster than
|
||||
the loop in global scope).
|
||||
|
||||
For a reference check the function with a loop inside it:
|
||||
```
|
||||
julia> function sum3loop(n)
|
||||
s = 0
|
||||
for i in 1:n
|
||||
s += i^3
|
||||
end
|
||||
return s
|
||||
end
|
||||
sum3loop (generic function with 1 method)
|
||||
|
||||
julia> @time sum3loop(10^6)
|
||||
0.001378 seconds
|
||||
-8222430735553051648
|
||||
```
|
||||
This is also much faster than a loop in global scope.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
In exercise 9 we note that the result is `-8222430735553051648` which is a
|
||||
negative value, although we are adding cubes of positive values. The
|
||||
reason of the problem is that operations on integers overflow. If you
|
||||
are working with numbers larger that can be stored in `Int` type, which is:
|
||||
```
|
||||
julia> typemax(Int)
|
||||
9223372036854775807
|
||||
```
|
||||
use `big` numbers that we discussed in *Exercise 3*:
|
||||
```
|
||||
julia> @time sum(x -> big(x)^3, 1:10^6)
|
||||
0.833234 seconds (11.05 M allocations: 236.113 MiB, 23.77% gc time, 2.63% compilation time)
|
||||
250000500000250000000000
|
||||
```
|
||||
Now we get a correct result, at the cost of slower computation.
|
||||
|
||||
</details>
|
||||
345
exercises/exercises03.md
Normal file
345
exercises/exercises03.md
Normal file
@@ -0,0 +1,345 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 3
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Check what methods does the `repeat` function have.
|
||||
Are they all covered in help for this function?
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Write a function `fun2` that takes any vector and returns the difference between
|
||||
the largest and the smallest element in this vector.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Generate a vector of one million random numbers from `[0, 1]` interval.
|
||||
Check what is a faster way to get a maximum and minimum element in it. One
|
||||
option is by using the `maximum` and `minimum` functions and the other is by
|
||||
using the `extrema` function.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Assume you have accidentally typed `+x = 1` when wanting to assign `1` to
|
||||
variable `x`. What effects can this operation have?
|
||||
|
||||
### Exercise 5
|
||||
|
||||
What is the result of calling the `subtypes` on `Union{Bool, Missing}` and why?
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Define two identical anonymous functions `x -> x + 1` in global scope? Do they
|
||||
have the same type?
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Define the `wrap` function taking one argument `i` and returning the anonymous
|
||||
function `x -> x + i`. Is the type of such anonymous function the same across
|
||||
calls to `wrap` function?
|
||||
|
||||
### Exercise 8
|
||||
|
||||
You want to write a function that accepts any `Integer` except `Bool` and returns
|
||||
the passed value. If `Bool` is passed an error should be thrown.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
The `@time` macro measures time taken by an expression run and prints it,
|
||||
but returns the value of the expression.
|
||||
The `@elapsed` macro works differently - it does not print anything, but returns
|
||||
time taken to evaluate an expression. Test the `@elapsed` macro by to see how
|
||||
long it takes to shuffle a vector of one million floats. Use the `shuffle` function
|
||||
from `Random` module.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Using the `@btime` macro benchmark the time of calculating the sum of one million
|
||||
random floats.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Write:
|
||||
```
|
||||
julia> methods(repeat)
|
||||
# 6 methods for generic function "repeat":
|
||||
[1] repeat(A::AbstractArray; inner, outer) in Base at abstractarraymath.jl:392
|
||||
[2] repeat(A::AbstractArray, counts...) in Base at abstractarraymath.jl:355
|
||||
[3] repeat(c::Char, r::Integer) in Base at strings/string.jl:336
|
||||
[4] repeat(c::AbstractChar, r::Integer) in Base at strings/string.jl:335
|
||||
[5] repeat(s::Union{SubString{String}, String}, r::Integer) in Base at strings/substring.jl:248
|
||||
[6] repeat(s::AbstractString, r::Integer) in Base at strings/basic.jl:715
|
||||
```
|
||||
|
||||
Now write `?repeat` and you will see that there are four entries in help.
|
||||
The reason is that for `Char` and `AbstractChar` as well as for
|
||||
`AbstractString` and `Union{SubString{String}, String}` there is one help entry.
|
||||
|
||||
Why do these cases have two methods defined?
|
||||
The reason is performance. For example `repeat(c::AbstractChar, r::Integer)`
|
||||
is a generic function that accept any character values
|
||||
and `repeat(c::Char, r::Integer)` is its faster version
|
||||
that accepts values that have `Char` type only (and it is invoked by Julia
|
||||
if value of type `Char` is passed as an argument to `repeat`).
|
||||
|
||||
### Exercise 2
|
||||
|
||||
You can define is as follows:
|
||||
```
|
||||
fun2(x::AbstractVector) = maximum(x) - minimum(x)
|
||||
```
|
||||
or as follows:
|
||||
```
|
||||
function fun2(x::AbstractVector)
|
||||
lo, hi = extrema(x)
|
||||
return hi - lo
|
||||
end
|
||||
```
|
||||
Note that these two functions will work with vectors of any elements that
|
||||
are ordered and support subtraction (they do not have to be numbers).
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Here is a way to compare the performance of both options:
|
||||
```
|
||||
julia> using BenchmarkTools
|
||||
|
||||
julia> x = rand(10^6);
|
||||
|
||||
julia> @btime minimum($x), maximum($x)
|
||||
860.700 μs (0 allocations: 0 bytes)
|
||||
(1.489173560242918e-6, 0.9999984347293639)
|
||||
|
||||
julia> @btime extrema($x)
|
||||
2.185 ms (0 allocations: 0 bytes)
|
||||
(1.489173560242918e-6, 0.9999984347293639)
|
||||
```
|
||||
|
||||
As you can see in this situation, although `extrema` does the operation
|
||||
in a single pass over `x` it is slower than computing `minimum` and `maximum`
|
||||
in two passes.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
If it is a fresh Julia session you define a new function in `Main` for `+` operator:
|
||||
|
||||
```
|
||||
julia> +x=1
|
||||
+ (generic function with 1 method)
|
||||
|
||||
julia> methods(+)
|
||||
# 1 method for generic function "+":
|
||||
[1] +(x) in Main at REPL[1]:1
|
||||
|
||||
julia> +(10)
|
||||
1
|
||||
```
|
||||
|
||||
This will also break any further uses of `+` in your programs:
|
||||
|
||||
```
|
||||
julia> 1 + 2
|
||||
ERROR: MethodError: no method matching +(::Int64, ::Int64)
|
||||
You may have intended to import Base.:+
|
||||
Closest candidates are:
|
||||
+(::Any) at REPL[1]:1
|
||||
```
|
||||
|
||||
If you earlier used addition in this Julia session then the operation will error.
|
||||
Start a fresh Julia session:
|
||||
|
||||
```
|
||||
julia> 1 + 2
|
||||
3
|
||||
|
||||
julia> +x=1
|
||||
ERROR: error in method definition: function Base.+ must be explicitly imported to be extended
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
You get an empty vector:
|
||||
```
|
||||
julia> subtypes(Union{Float64, Missing})
|
||||
Type[]
|
||||
```
|
||||
|
||||
The reason is that the `subtypes` function returns subtypes of explicitly
|
||||
declared types that have names (type of such types is `DataType` in Julia).
|
||||
|
||||
*Extra* for this reason `subtypes` has a limited use. To check if one type
|
||||
is a subtype of some other type use the `<:` operator.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
No, each of them has a different type:
|
||||
```
|
||||
julia> f1 = x -> x + 1
|
||||
#1 (generic function with 1 method)
|
||||
|
||||
julia> f2 = x -> x + 1
|
||||
#3 (generic function with 1 method)
|
||||
|
||||
julia> typeof(f1)
|
||||
var"#1#2"
|
||||
|
||||
julia> typeof(f2)
|
||||
var"#3#4"
|
||||
```
|
||||
|
||||
This is the reason why function call like `sum(x -> x^2, 1:10)` in global
|
||||
scope triggers compilation each time:
|
||||
|
||||
```
|
||||
julia> @time sum(x -> x^2, 1:10)
|
||||
0.070714 seconds (167.41 k allocations: 8.815 MiB, 14.29% gc time, 93.91% compilation time)
|
||||
385
|
||||
|
||||
julia> @time sum(x -> x^2, 1:10)
|
||||
0.020971 seconds (47.82 k allocations: 2.529 MiB, 99.75% compilation time)
|
||||
385
|
||||
|
||||
julia> @time sum(x -> x^2, 1:10)
|
||||
0.021184 seconds (47.81 k allocations: 2.529 MiB, 99.77% compilation time)
|
||||
385
|
||||
```
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Yes, the type is the same:
|
||||
|
||||
```
|
||||
julia> wrap(i) = x -> x + i
|
||||
wrap (generic function with 1 method)
|
||||
|
||||
julia> typeof(wrap(1))
|
||||
var"#11#12"
|
||||
|
||||
julia> typeof(wrap(2))
|
||||
var"#11#12"
|
||||
```
|
||||
|
||||
Julia defines a new type for such an anonymous function only once
|
||||
The consequence of this is that e.g. expressions inside a function like
|
||||
`sum(x -> x ^ i, 1:10)` where `i` is an argument to a function do not trigger
|
||||
compilation (as opposed to similar expressions in global scope, see exercise 6).
|
||||
|
||||
```
|
||||
julia> sumi(i) = sum(x -> x^i, 1:10)
|
||||
sumi (generic function with 1 method)
|
||||
|
||||
julia> @time sumi(1)
|
||||
0.000004 seconds
|
||||
55
|
||||
|
||||
julia> @time sumi(2)
|
||||
0.000001 seconds
|
||||
385
|
||||
|
||||
julia> @time sumi(3)
|
||||
0.000003 seconds
|
||||
3025
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
We check subtypes of `Integer`:
|
||||
|
||||
```
|
||||
julia> subtypes(Integer)
|
||||
3-element Vector{Any}:
|
||||
Bool
|
||||
Signed
|
||||
Unsigned
|
||||
```
|
||||
|
||||
The first way to write such a function is then:
|
||||
```
|
||||
fun1(i::Union{Signed, Unsigned}) = i
|
||||
```
|
||||
and now we have:
|
||||
```
|
||||
julia> fun1(1)
|
||||
1
|
||||
|
||||
julia> fun1(true)
|
||||
ERROR: MethodError: no method matching fun1(::Bool)
|
||||
```
|
||||
|
||||
The second way is:
|
||||
```
|
||||
fun2(i::Integer) = i
|
||||
fun2(::Bool) = throw(ArgumentError("Bool is not supported"))
|
||||
```
|
||||
|
||||
and now you have:
|
||||
```
|
||||
julia> fun2(1)
|
||||
1
|
||||
|
||||
julia> fun2(true)
|
||||
ERROR: ArgumentError: Bool is not supported
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Here is the code that performs the task:
|
||||
```
|
||||
julia> using Random # needed to get access to shuffle
|
||||
|
||||
julia> x = rand(10^6); # generate random floats
|
||||
|
||||
julia> @elapsed shuffle(x)
|
||||
0.0518085
|
||||
|
||||
julia> @elapsed shuffle(x)
|
||||
0.01257
|
||||
|
||||
julia> @elapsed shuffle(x)
|
||||
0.012483
|
||||
```
|
||||
|
||||
Note that the first time we run `shuffle` it takes longer due to compilation.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
The code you can use is:
|
||||
|
||||
```
|
||||
julia> using BenchmarkTools
|
||||
|
||||
julia> @btime sum($(rand(10^6)))
|
||||
155.300 μs (0 allocations: 0 bytes)
|
||||
500330.6375697419
|
||||
```
|
||||
|
||||
Note that the following:
|
||||
```
|
||||
julia> @btime sum(rand(10^6))
|
||||
1.644 ms (2 allocations: 7.63 MiB)
|
||||
500266.9457722128
|
||||
```
|
||||
would be an incorrect timing as you would also measure the time of generating
|
||||
of the vector.
|
||||
|
||||
Alternatively you can e.g. write:
|
||||
```
|
||||
julia> x = rand(10^6);
|
||||
|
||||
julia> @btime sum($x)
|
||||
154.700 μs (0 allocations: 0 bytes)
|
||||
500151.95875364926
|
||||
```
|
||||
|
||||
</details>
|
||||
536
exercises/exercises04.md
Normal file
536
exercises/exercises04.md
Normal file
@@ -0,0 +1,536 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 4
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Create a matrix of shape 2x3 containing numbers from 1 to 6 (fill the matrix
|
||||
columnwise with consecutive numbers). Next calculate sum, mean and standard
|
||||
deviation of each row and each column of this matrix.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
For each column of the matrix created in exercise 1 compute its range
|
||||
(i.e. the difference between maximum and minimum element stored in it).
|
||||
|
||||
### Exercise 3
|
||||
|
||||
This is data for car speed (mph) and distance taken to stop (ft)
|
||||
from Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
|
||||
|
||||
```
|
||||
speed dist
|
||||
4 2
|
||||
4 10
|
||||
7 4
|
||||
7 22
|
||||
8 16
|
||||
9 10
|
||||
10 18
|
||||
10 26
|
||||
10 34
|
||||
11 17
|
||||
11 28
|
||||
12 14
|
||||
12 20
|
||||
12 24
|
||||
12 28
|
||||
13 26
|
||||
13 34
|
||||
13 34
|
||||
13 46
|
||||
14 26
|
||||
14 36
|
||||
14 60
|
||||
14 80
|
||||
15 20
|
||||
15 26
|
||||
15 54
|
||||
16 32
|
||||
16 40
|
||||
17 32
|
||||
17 40
|
||||
17 50
|
||||
18 42
|
||||
18 56
|
||||
18 76
|
||||
18 84
|
||||
19 36
|
||||
19 46
|
||||
19 68
|
||||
20 32
|
||||
20 48
|
||||
20 52
|
||||
20 56
|
||||
20 64
|
||||
22 66
|
||||
23 54
|
||||
24 70
|
||||
24 92
|
||||
24 93
|
||||
24 120
|
||||
25 85
|
||||
```
|
||||
|
||||
Load this data into Julia (this is part of the exercise) and fit a linear
|
||||
regression where speed is a feature and distance is target variable.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Plot the data loaded in exercise 4. Additionally plot the fitted regression
|
||||
(you need to check Plots.jl documentation to find a way to do this).
|
||||
|
||||
### Exercise 5
|
||||
|
||||
A simple code for calculation of Fibonacci numbers for positive
|
||||
arguments is as follows:
|
||||
|
||||
```
|
||||
fib(n) =n < 3 ? 1 : fib(n-1) + fib(n-2)
|
||||
```
|
||||
|
||||
Using the BenchmarkTools.jl package measure runtime of this function for
|
||||
`n` ranging from `1` to `20`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Improve the speed of code from exercise 5 by using a dictionary where you
|
||||
store a mapping of `n` to `fib(n)`. Measure the performance of this function
|
||||
for the same range of values as in exercise 5.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Create a vector containing named tuples representing elements of a 4x4 grid.
|
||||
So the first element of this vector should be `(x=1, y=1)` and last should be
|
||||
`(x=4, y=4)`. Store the vector in variable `v`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
The `filter` function allows you to select some values of an input collection.
|
||||
Check its documentation first. Next, use it to keep from the vector `v` from
|
||||
exercise 7 only elements whose sum is even.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Check the documentation of the `filter!` function. Perform the same operation
|
||||
as asked in exercise 8 but using `filter!`. What is the difference?
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Write a function that takes a number `n`. Next it generates two independent
|
||||
random vectors of length `n` and returns their correlation coefficient.
|
||||
Run this function `10000` times for `n` equal to `10`, `100`, `1000`,
|
||||
and `10000`.
|
||||
Create a plot with four histograms of distribution of computed Pearson
|
||||
correlation coefficient. Check in the Plots.jl package which function can be
|
||||
used to plot histograms.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Write:
|
||||
```
|
||||
julia> using Statistics
|
||||
|
||||
julia> mat = [1 3 5
|
||||
2 4 6]
|
||||
2×3 Matrix{Int64}:
|
||||
1 3 5
|
||||
2 4 6
|
||||
|
||||
julia> sum(mat, dims=1)
|
||||
1×3 Matrix{Int64}:
|
||||
3 7 11
|
||||
|
||||
julia> sum(mat, dims=2)
|
||||
2×1 Matrix{Int64}:
|
||||
9
|
||||
12
|
||||
|
||||
julia> mean(mat, dims=1)
|
||||
1×3 Matrix{Float64}:
|
||||
1.5 3.5 5.5
|
||||
|
||||
julia> mean(mat, dims=2)
|
||||
2×1 Matrix{Float64}:
|
||||
3.0
|
||||
4.0
|
||||
|
||||
julia> std(mat, dims=1)
|
||||
1×3 Matrix{Float64}:
|
||||
0.707107 0.707107 0.707107
|
||||
|
||||
julia> std(mat, dims=2)
|
||||
2×1 Matrix{Float64}:
|
||||
2.0
|
||||
2.0
|
||||
```
|
||||
|
||||
Observe that the returned statistics are also stored in matrices.
|
||||
If we compute them for columns (`dims=1`) then the produced matrix has one row.
|
||||
If we compute them for rows (`dims=2`) then the produced matrix has one column.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Here are some ways you can do it:
|
||||
```
|
||||
julia> [maximum(x) - minimum(x) for x in eachcol(mat)]
|
||||
3-element Vector{Int64}:
|
||||
1
|
||||
1
|
||||
1
|
||||
|
||||
julia> map(x -> maximum(x) - minimum(x), eachcol(mat))
|
||||
3-element Vector{Int64}:
|
||||
1
|
||||
1
|
||||
1
|
||||
```
|
||||
|
||||
Observe that if we used `eachcol` the produced result is a vector (not a matrix
|
||||
like in exercise 1).
|
||||
|
||||
### Exercise 3
|
||||
|
||||
First create a matrix with source data by copy pasting it from the exercise
|
||||
like this:
|
||||
```
|
||||
data = [
|
||||
4 2
|
||||
4 10
|
||||
7 4
|
||||
7 22
|
||||
8 16
|
||||
9 10
|
||||
10 18
|
||||
10 26
|
||||
10 34
|
||||
11 17
|
||||
11 28
|
||||
12 14
|
||||
12 20
|
||||
12 24
|
||||
12 28
|
||||
13 26
|
||||
13 34
|
||||
13 34
|
||||
13 46
|
||||
14 26
|
||||
14 36
|
||||
14 60
|
||||
14 80
|
||||
15 20
|
||||
15 26
|
||||
15 54
|
||||
16 32
|
||||
16 40
|
||||
17 32
|
||||
17 40
|
||||
17 50
|
||||
18 42
|
||||
18 56
|
||||
18 76
|
||||
18 84
|
||||
19 36
|
||||
19 46
|
||||
19 68
|
||||
20 32
|
||||
20 48
|
||||
20 52
|
||||
20 56
|
||||
20 64
|
||||
22 66
|
||||
23 54
|
||||
24 70
|
||||
24 92
|
||||
24 93
|
||||
24 120
|
||||
25 85
|
||||
]
|
||||
```
|
||||
|
||||
Now use the GLM.jl package to fit the model:
|
||||
|
||||
```
|
||||
julia> using GLM
|
||||
|
||||
julia> lm(@formula(distance~speed), (distance=data[:, 2], speed=data[:, 1]))
|
||||
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64, Matrix{Float64}}
|
||||
|
||||
distance ~ 1 + speed
|
||||
|
||||
Coefficients:
|
||||
─────────────────────────────────────────────────────────────────────────
|
||||
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
─────────────────────────────────────────────────────────────────────────
|
||||
(Intercept) -17.5791 6.75844 -2.60 0.0123 -31.1678 -3.99034
|
||||
speed 3.93241 0.415513 9.46 <1e-11 3.09696 4.76785
|
||||
─────────────────────────────────────────────────────────────────────────
|
||||
```
|
||||
|
||||
You can get the same estimates using the `\` operator like this:
|
||||
```
|
||||
julia> [ones(50) data[:, 1]] \ data[:, 2]
|
||||
2-element Vector{Float64}:
|
||||
-17.579094890510966
|
||||
3.9324087591240877
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Run the following:
|
||||
```
|
||||
using Plots
|
||||
scatter(data[:, 1], data[:, 2];
|
||||
xlab="speed", ylab="distance", legend=false, smooth=true)
|
||||
```
|
||||
|
||||
The `smooth=true` keyword argument adds the linear regression line to the plot.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Use the following code:
|
||||
```
|
||||
julia> using BenchmarkTools
|
||||
|
||||
julia> for i in 1:40
|
||||
print(i, " ")
|
||||
@btime fib($i)
|
||||
end
|
||||
1 2.500 ns (0 allocations: 0 bytes)
|
||||
2 2.700 ns (0 allocations: 0 bytes)
|
||||
3 4.800 ns (0 allocations: 0 bytes)
|
||||
4 7.500 ns (0 allocations: 0 bytes)
|
||||
5 12.112 ns (0 allocations: 0 bytes)
|
||||
6 19.980 ns (0 allocations: 0 bytes)
|
||||
7 32.125 ns (0 allocations: 0 bytes)
|
||||
8 52.696 ns (0 allocations: 0 bytes)
|
||||
9 85.010 ns (0 allocations: 0 bytes)
|
||||
10 140.311 ns (0 allocations: 0 bytes)
|
||||
11 222.177 ns (0 allocations: 0 bytes)
|
||||
12 359.903 ns (0 allocations: 0 bytes)
|
||||
13 582.123 ns (0 allocations: 0 bytes)
|
||||
14 1.000 μs (0 allocations: 0 bytes)
|
||||
15 1.560 μs (0 allocations: 0 bytes)
|
||||
16 2.522 μs (0 allocations: 0 bytes)
|
||||
17 4.000 μs (0 allocations: 0 bytes)
|
||||
18 6.600 μs (0 allocations: 0 bytes)
|
||||
19 11.400 μs (0 allocations: 0 bytes)
|
||||
20 18.100 μs (0 allocations: 0 bytes)
|
||||
```
|
||||
|
||||
Notice that execution time for number `n` is roughly sum of ececution times
|
||||
for numbers `n-1` and `n-2`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Use the following code:
|
||||
|
||||
```
|
||||
julia> fib_dict = Dict{Int, Int}()
|
||||
Dict{Int64, Int64}()
|
||||
|
||||
julia> function fib2(n)
|
||||
haskey(fib_dict, n) && return fib_dict[n]
|
||||
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
|
||||
fib_dict[n] = fib_n
|
||||
return fib_n
|
||||
end
|
||||
fib2 (generic function with 1 method)
|
||||
|
||||
julia> for i in 1:20
|
||||
print(i, " ")
|
||||
@btime fib2($i)
|
||||
end
|
||||
1 40.808 ns (0 allocations: 0 bytes)
|
||||
2 40.101 ns (0 allocations: 0 bytes)
|
||||
3 40.101 ns (0 allocations: 0 bytes)
|
||||
4 40.707 ns (0 allocations: 0 bytes)
|
||||
5 42.727 ns (0 allocations: 0 bytes)
|
||||
6 40.909 ns (0 allocations: 0 bytes)
|
||||
7 40.404 ns (0 allocations: 0 bytes)
|
||||
8 40.707 ns (0 allocations: 0 bytes)
|
||||
9 40.808 ns (0 allocations: 0 bytes)
|
||||
10 39.798 ns (0 allocations: 0 bytes)
|
||||
11 40.909 ns (0 allocations: 0 bytes)
|
||||
12 40.404 ns (0 allocations: 0 bytes)
|
||||
13 42.872 ns (0 allocations: 0 bytes)
|
||||
14 42.626 ns (0 allocations: 0 bytes)
|
||||
15 47.972 ns (1 allocation: 16 bytes)
|
||||
16 46.505 ns (1 allocation: 16 bytes)
|
||||
17 46.302 ns (1 allocation: 16 bytes)
|
||||
18 45.390 ns (1 allocation: 16 bytes)
|
||||
19 47.160 ns (1 allocation: 16 bytes)
|
||||
20 46.201 ns (1 allocation: 16 bytes)
|
||||
```
|
||||
|
||||
Note that benchmarking essentially gives us a time of dictionary lookup.
|
||||
The reason is that `@btime` executes the same expression many times, so
|
||||
for the fastest execution time the value for each `n` is already stored in
|
||||
`fib_dict`.
|
||||
|
||||
It would be more interesting to see the runtime of `fib2` for some large value
|
||||
of `n` executed once:
|
||||
|
||||
```
|
||||
julia> @time fib2(100)
|
||||
0.000018 seconds (107 allocations: 1.672 KiB)
|
||||
3736710778780434371
|
||||
|
||||
julia> @time fib2(200)
|
||||
0.000025 seconds (204 allocations: 20.453 KiB)
|
||||
-1123705814761610347
|
||||
```
|
||||
|
||||
As you can see things are indeed fast. Note that for `n=200` we get a negative
|
||||
values because of integer overflow.
|
||||
|
||||
As a more advanced topic (not covered in the book) it is worth to comment that
|
||||
`fib2` is not type stable. If we wanted to make it type stable we need to
|
||||
declare `fib_dict` dictionary as `const`. Here is the code and benchmarks
|
||||
(you need to restart Julia to run this test):
|
||||
|
||||
```
|
||||
julia> const fib_dict = Dict{Int, Int}()
|
||||
Dict{Int64, Int64}()
|
||||
|
||||
julia> function fib2(n)
|
||||
haskey(fib_dict, n) && return fib_dict[n]
|
||||
fib_n = n < 3 ? 1 : fib2(n-1) + fib2(n-2)
|
||||
fib_dict[n] = fib_n
|
||||
return fib_n
|
||||
end
|
||||
fib2 (generic function with 1 method)
|
||||
|
||||
julia> @time fib2(100)
|
||||
0.000014 seconds (6 allocations: 5.828 KiB)
|
||||
3736710778780434371
|
||||
|
||||
julia> @time fib2(200)
|
||||
0.000011 seconds (3 allocations: 17.312 KiB)
|
||||
-1123705814761610347
|
||||
```
|
||||
|
||||
As you can see the code does less allocations and is faster now.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Since we are asked to create a vector we can write:
|
||||
|
||||
```
|
||||
julia> v = [(x=x, y=y) for x in 1:4 for y in 1:4]
|
||||
16-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
|
||||
(x = 1, y = 1)
|
||||
(x = 1, y = 2)
|
||||
(x = 1, y = 3)
|
||||
(x = 1, y = 4)
|
||||
(x = 2, y = 1)
|
||||
(x = 2, y = 2)
|
||||
(x = 2, y = 3)
|
||||
(x = 2, y = 4)
|
||||
(x = 3, y = 1)
|
||||
(x = 3, y = 2)
|
||||
(x = 3, y = 3)
|
||||
(x = 3, y = 4)
|
||||
(x = 4, y = 1)
|
||||
(x = 4, y = 2)
|
||||
(x = 4, y = 3)
|
||||
(x = 4, y = 4)
|
||||
```
|
||||
|
||||
Note (not covered in the book) that you could create a matrix by changing
|
||||
the syntax a bit:
|
||||
|
||||
```
|
||||
julia> [(x=x, y=y) for x in 1:4, y in 1:4]
|
||||
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
|
||||
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
|
||||
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
|
||||
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
|
||||
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
|
||||
```
|
||||
|
||||
Finally, we can use a bit shorter syntax (covered in chapter 14 of the book):
|
||||
|
||||
```
|
||||
julia> [(; x, y) for x in 1:4, y in 1:4]
|
||||
4×4 Matrix{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
|
||||
(x = 1, y = 1) (x = 1, y = 2) (x = 1, y = 3) (x = 1, y = 4)
|
||||
(x = 2, y = 1) (x = 2, y = 2) (x = 2, y = 3) (x = 2, y = 4)
|
||||
(x = 3, y = 1) (x = 3, y = 2) (x = 3, y = 3) (x = 3, y = 4)
|
||||
(x = 4, y = 1) (x = 4, y = 2) (x = 4, y = 3) (x = 4, y = 4)
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
To get help on the `filter` function write `?filter`. Next run:
|
||||
|
||||
```
|
||||
julia> filter(e -> iseven(e.x + e.y), v)
|
||||
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
|
||||
(x = 1, y = 1)
|
||||
(x = 1, y = 3)
|
||||
(x = 2, y = 2)
|
||||
(x = 2, y = 4)
|
||||
(x = 3, y = 1)
|
||||
(x = 3, y = 3)
|
||||
(x = 4, y = 2)
|
||||
(x = 4, y = 4)
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
To get help on the `filter!` function write `?filter!`. Next run:
|
||||
|
||||
```
|
||||
julia> filter!(e -> iseven(e.x + e.y), v)
|
||||
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
|
||||
(x = 1, y = 1)
|
||||
(x = 1, y = 3)
|
||||
(x = 2, y = 2)
|
||||
(x = 2, y = 4)
|
||||
(x = 3, y = 1)
|
||||
(x = 3, y = 3)
|
||||
(x = 4, y = 2)
|
||||
(x = 4, y = 4)
|
||||
|
||||
julia> v
|
||||
8-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
|
||||
(x = 1, y = 1)
|
||||
(x = 1, y = 3)
|
||||
(x = 2, y = 2)
|
||||
(x = 2, y = 4)
|
||||
(x = 3, y = 1)
|
||||
(x = 3, y = 3)
|
||||
(x = 4, y = 2)
|
||||
(x = 4, y = 4)
|
||||
```
|
||||
|
||||
Notice that `filter` allocated a new vector, while `filter!` updated the `v`
|
||||
vector in place.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
You can use for example the following code:
|
||||
|
||||
```
|
||||
using Statistics
|
||||
using Plots
|
||||
rand_cor(n) = cor(rand(n), rand(n))
|
||||
plot([histogram([rand_cor(n) for i in 1:10000], title="n=$n", legend=false)
|
||||
for n in [10, 100, 1000, 10000]]...)
|
||||
```
|
||||
|
||||
Observe that as you increase `n` the dispersion of the correlation coefficient
|
||||
decreases.
|
||||
|
||||
</details>
|
||||
287
exercises/exercises05.md
Normal file
287
exercises/exercises05.md
Normal file
@@ -0,0 +1,287 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 5
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Create a matrix containing truth table for `&&` and `||` operations.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
The `issubset` function checks if one collection is a subset of other
|
||||
collection.
|
||||
|
||||
Now take a range `4:6` and check if it is a subset of ranges `4+k:4-k` for
|
||||
`k` varying from `1` to `3`. Store the result in a vector.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Write a function that accepts two vectors and returns `true` if they have equal
|
||||
length and otherwise returns `false`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Consider the vectors `x = [1, 2, 1, 2, 1, 2]`,
|
||||
`y = ["a", "a", "b", "b", "b", "a"]`, and `z = [1, 2, 1, 2, 1, 3]`.
|
||||
Calculate their Adjusted Mutual Information using scikit-learn.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Using Adjusted Mutual Information function from exercise 4 generate
|
||||
a pair of random vectors of length 100 containing integer numbers from the
|
||||
range `1:5`. Repeat this exercise 1000 times and plot a histogram of AMI.
|
||||
Check in the documentation of the `rand` function how you can draw a sample
|
||||
from a collection of values.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Adjust the code from exercise 5 but replace first 50 elements of each vector
|
||||
with zero. Repeat the experiment.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Write a function that takes a vector of integer values and returns a dictionary
|
||||
giving information how many times each integer was present in the passed vector.
|
||||
|
||||
Test this function on vectors `v1 = [1, 2, 3, 2, 3, 3]`, `v2 = [true, false]`,
|
||||
and `v3 = 3:5`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Write code that creates a `Bool` diagonal matrix of size 5x5.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Write a code comparing performance of calculation of sum of logarithms of
|
||||
elements of a vector `1:100` using broadcasting and the `sum` function vs only
|
||||
the `sum` function taking a function as a first argument.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Create a dictionary in which for each number from `1` to `10` you will store
|
||||
a vector of its positive divisors. You can check the reminder of division
|
||||
of two values using the `rem` function.
|
||||
|
||||
Additionally (not covered in the book), you can drop elements
|
||||
from a comprehension if you add an `if` clause after the `for` clause, for
|
||||
example to keep only odd numbers from range `1:10` do:
|
||||
|
||||
```
|
||||
julia> [i for i in 1:10 if isodd(i)]
|
||||
5-element Vector{Int64}:
|
||||
1
|
||||
3
|
||||
5
|
||||
7
|
||||
9
|
||||
```
|
||||
|
||||
You can populate a dictionary by passing a vector of pairs to it (not covered in
|
||||
the book), for example:
|
||||
|
||||
```
|
||||
julia> Dict(["a" => 1, "b" => 2])
|
||||
Dict{String, Int64} with 2 entries:
|
||||
"b" => 2
|
||||
"a" => 1
|
||||
```
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
You can do it as follows:
|
||||
```
|
||||
julia> [true, false] .&& [true false]
|
||||
2×2 BitMatrix:
|
||||
1 0
|
||||
0 0
|
||||
|
||||
julia> [true, false] .|| [true false]
|
||||
2×2 BitMatrix:
|
||||
1 1
|
||||
1 0
|
||||
```
|
||||
|
||||
Note that the first array is a vector, while the second array is a 1-row matrix.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
You can do it like this using broadcasting:
|
||||
```
|
||||
julia> issubset.(Ref(4:6), [4-k:4+k for k in 1:3])
|
||||
3-element BitVector:
|
||||
0
|
||||
1
|
||||
1
|
||||
```
|
||||
Note that you need to use `Ref` to protect `4:6` from being broadcasted over.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
This function can be written as follows:
|
||||
|
||||
```
|
||||
function equallength(x::AbstractVector, y::AbstractVector) = length(x) == length(y)
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
You can do this exercise as follows:
|
||||
```
|
||||
julia> using PyCall
|
||||
|
||||
julia> metrics = pyimport("sklearn.metrics");
|
||||
|
||||
julia> metrics.adjusted_mutual_info_score(x, y)
|
||||
-0.11111111111111087
|
||||
|
||||
julia> metrics.adjusted_mutual_info_score(x, z)
|
||||
0.7276079390930807
|
||||
|
||||
julia> metrics.adjusted_mutual_info_score(y, z)
|
||||
-0.21267989848846763
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
You can create such a plot using the following commands:
|
||||
|
||||
```
|
||||
using Plots
|
||||
histogram([metrics.adjusted_mutual_info_score(rand(1:5, 100), rand(1:5, 100))
|
||||
for i in 1:1000], label="AMI")
|
||||
```
|
||||
|
||||
You can check that AMI oscillates around 0.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
This time it is convenient to write a helper function. Note that we use
|
||||
broadcasting to update values in the vectors.
|
||||
|
||||
```
|
||||
function exampleAMI()
|
||||
x = rand(1:5, 100)
|
||||
y = rand(1:5, 100)
|
||||
x[1:50] .= 0
|
||||
y[1:50] .= 0
|
||||
return metrics.adjusted_mutual_info_score(x, y)
|
||||
end
|
||||
histogram([exampleAMI() for i in 1:1000], label="AMI")
|
||||
```
|
||||
|
||||
Note that this time AMI is a bit below 0.5, which shows a better match between
|
||||
vectors.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
```
|
||||
julia> function counter(v::AbstractVector{<:Integer})
|
||||
d = Dict{eltype(v), Int}()
|
||||
for x in v
|
||||
if haskey(d, x)
|
||||
d[x] += 1
|
||||
else
|
||||
d[x] = 1
|
||||
end
|
||||
end
|
||||
return d
|
||||
end
|
||||
counter (generic function with 1 method)
|
||||
|
||||
julia> counter(v1)
|
||||
Dict{Int64, Int64} with 3 entries:
|
||||
2 => 2
|
||||
3 => 3
|
||||
1 => 1
|
||||
|
||||
julia> counter(v2)
|
||||
Dict{Bool, Int64} with 2 entries:
|
||||
0 => 1
|
||||
1 => 1
|
||||
|
||||
julia> counter(v3)
|
||||
Dict{Int64, Int64} with 3 entries:
|
||||
5 => 1
|
||||
4 => 1
|
||||
3 => 1
|
||||
```
|
||||
|
||||
Note that we used the `eltype` function to set a proper key type for
|
||||
dictionary `d`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
This is a way to do it:
|
||||
```
|
||||
julia> 1:5 .== (1:5)'
|
||||
5×5 BitMatrix:
|
||||
1 0 0 0 0
|
||||
0 1 0 0 0
|
||||
0 0 1 0 0
|
||||
0 0 0 1 0
|
||||
0 0 0 0 1
|
||||
```
|
||||
|
||||
Using the `LinearAlgebra` module you could also write:
|
||||
|
||||
```
|
||||
julia> using LinearAlgebra
|
||||
|
||||
julia> I(5)
|
||||
5×5 Diagonal{Bool, Vector{Bool}}:
|
||||
1 ⋅ ⋅ ⋅ ⋅
|
||||
⋅ 1 ⋅ ⋅ ⋅
|
||||
⋅ ⋅ 1 ⋅ ⋅
|
||||
⋅ ⋅ ⋅ 1 ⋅
|
||||
⋅ ⋅ ⋅ ⋅ 1
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Here is how you can do it:
|
||||
|
||||
```
|
||||
julia> using BenchmarkTools
|
||||
|
||||
julia> @btime sum(log.(1:100))
|
||||
1.620 μs (1 allocation: 896 bytes)
|
||||
363.7393755555635
|
||||
|
||||
julia> @btime sum(log, 1:100)
|
||||
1.570 μs (0 allocations: 0 bytes)
|
||||
363.7393755555636
|
||||
```
|
||||
|
||||
As you can see using the `sum` function with `log` as its first argument
|
||||
is a bit faster as it is not allocating.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Here is how you can do it:
|
||||
|
||||
```
|
||||
julia> Dict([i => [j for j in 1:i if rem(i, j) == 0] for i in 1:10])
|
||||
Dict{Int64, Vector{Int64}} with 10 entries:
|
||||
5 => [1, 5]
|
||||
4 => [1, 2, 4]
|
||||
6 => [1, 2, 3, 6]
|
||||
7 => [1, 7]
|
||||
2 => [1, 2]
|
||||
10 => [1, 2, 5, 10]
|
||||
9 => [1, 3, 9]
|
||||
8 => [1, 2, 4, 8]
|
||||
3 => [1, 3]
|
||||
1 => [1]
|
||||
```
|
||||
|
||||
</details>
|
||||
271
exercises/exercises06.md
Normal file
271
exercises/exercises06.md
Normal file
@@ -0,0 +1,271 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 6
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"`
|
||||
(replace `3` by a proper interpolation expression) and replace `USD` by `$`.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
|
||||
as `iris.csv` to your local folder.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
|
||||
in two lines so that it takes less horizontal space.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Load data stored in `iris.csv` file into a `data` vector where each element
|
||||
should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
|
||||
the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed
|
||||
as floats).
|
||||
|
||||
### Exercise 5
|
||||
|
||||
The `data` structure is a vector of named tuples, change it to a named tuple
|
||||
of vectors (with the same field names) and call it `data2`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Calculate the frequency of each type of Iris type (`c` field in `data2`).
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
|
||||
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
|
||||
Compare sizes of the three objects.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
You know that `refs` field of `PooledArray` stores an integer index of a given
|
||||
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
|
||||
in `data2`, but for each Iris type give a different point color (check the
|
||||
`color` keyword argument meaning in the Plots.jl manual; you can use the
|
||||
`plot_color` function).
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
|
||||
`str` variable (do not copy paste the string, but type it).
|
||||
|
||||
### Exercise 10
|
||||
|
||||
In the `str` string from exercise 9 find all matches of a pattern where `a`
|
||||
is followed by `b` but there can be some characters between them.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> "I have apples worth $(1+2)\$"
|
||||
"I have apples worth 3\$"
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
```
|
||||
import Downloads
|
||||
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
|
||||
"iris.csv")
|
||||
```
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
```
|
||||
"https://archive.ics.uci.edu/ml/\
|
||||
machine-learning-databases/iris/iris.data"
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> function line_parser(line)
|
||||
elements = split(line, ",")
|
||||
@assert length(elements) == 5
|
||||
return (sl=parse(Float64, elements[1]),
|
||||
sw=parse(Float64, elements[2]),
|
||||
pl=parse(Float64, elements[3]),
|
||||
pw=parse(Float64, elements[4]),
|
||||
c=elements[5])
|
||||
end
|
||||
line_parser (generic function with 1 method)
|
||||
|
||||
julia> data = line_parser.(readlines("iris.csv")[1:end-1])
|
||||
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
|
||||
(sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
|
||||
(sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
|
||||
(sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
|
||||
(sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
|
||||
(sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
|
||||
⋮
|
||||
(sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
|
||||
(sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
|
||||
(sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
|
||||
(sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")
|
||||
```
|
||||
|
||||
Note that we used `1:end-1` selector to drop last element from the read lines
|
||||
since it is empty. This is the reason why adding the
|
||||
`@assert length(elements) == 5` check in the `line_parser` function is useful.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Later in the book you will learn more advanced ways to do it. Here let us
|
||||
use a most basic approach:
|
||||
|
||||
```
|
||||
data2 = (sl=[d.sl for d in data],
|
||||
sw=[d.sw for d in data],
|
||||
pl=[d.pl for d in data],
|
||||
pw=[d.pw for d in data],
|
||||
c=[d.c for d in data])
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> using FreqTables
|
||||
|
||||
julia> freqtable(data2.c)
|
||||
3-element Named Vector{Int64}
|
||||
Dim1 │
|
||||
──────────────────┼───
|
||||
"Iris-setosa" │ 50
|
||||
"Iris-versicolor" │ 50
|
||||
"Iris-virginica" │ 50
|
||||
```
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> using InlineStrings
|
||||
|
||||
julia> c2 = inlinestrings(data2.c)
|
||||
150-element Vector{String15}:
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
⋮
|
||||
"Iris-virginica"
|
||||
"Iris-virginica"
|
||||
"Iris-virginica"
|
||||
"Iris-virginica"
|
||||
|
||||
julia> using PooledArrays
|
||||
|
||||
julia> c3 = PooledArray(data2.c)
|
||||
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
"Iris-setosa"
|
||||
⋮
|
||||
"Iris-virginica"
|
||||
"Iris-virginica"
|
||||
"Iris-virginica"
|
||||
"Iris-virginica"
|
||||
|
||||
julia> c4 = Symbol.(data2.c)
|
||||
150-element Vector{Symbol}:
|
||||
Symbol("Iris-setosa")
|
||||
Symbol("Iris-setosa")
|
||||
Symbol("Iris-setosa")
|
||||
Symbol("Iris-setosa")
|
||||
Symbol("Iris-setosa")
|
||||
⋮
|
||||
Symbol("Iris-virginica")
|
||||
Symbol("Iris-virginica")
|
||||
Symbol("Iris-virginica")
|
||||
Symbol("Iris-virginica")
|
||||
|
||||
julia> Base.summarysize(data2.c)
|
||||
12840
|
||||
|
||||
julia> Base.summarysize(c2)
|
||||
2440
|
||||
|
||||
julia> Base.summarysize(c3)
|
||||
1696
|
||||
|
||||
julia> Base.summarysize(c4)
|
||||
1240
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
```
|
||||
using Plots
|
||||
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
The hard part is typing `²`, `⟺` and `∨`. You can check how to do it using help:
|
||||
```
|
||||
help?> ²
|
||||
"²" can be typed by \^2<tab>
|
||||
|
||||
help?> ⟺
|
||||
"⟺" can be typed by \iff<tab>
|
||||
|
||||
help?> ∨
|
||||
"∨" can be typed by \vee<tab>
|
||||
```
|
||||
|
||||
Save the string in the `str` variable as we will use it in the next exercise.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
The exercise does not specify how the matching should be done. If we
|
||||
want it to be eager (match as much as possible), we write:
|
||||
|
||||
```
|
||||
julia> m = match(r"a.*b", str)
|
||||
RegexMatch("a²=b² ⟺ a=b ∨ a=-b")
|
||||
```
|
||||
|
||||
As you can see we have matched whole string.
|
||||
|
||||
If we want it to be lazy (match as little as possible) we write:
|
||||
|
||||
```
|
||||
julia> m = match(r"a.*?b", str)
|
||||
RegexMatch("a²=b")
|
||||
```
|
||||
|
||||
This finds us the first such match.
|
||||
|
||||
If we want to find all lazy matches we can write (not covered in the book):
|
||||
|
||||
```
|
||||
julia> collect(eachmatch(r"a.*?b", str))
|
||||
3-element Vector{RegexMatch}:
|
||||
RegexMatch("a²=b")
|
||||
RegexMatch("a=b")
|
||||
RegexMatch("a=-b")
|
||||
```
|
||||
|
||||
</details>
|
||||
314
exercises/exercises07.md
Normal file
314
exercises/exercises07.md
Normal file
@@ -0,0 +1,314 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 7
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Random.org provides a service that returns random numbers. One of the ways
|
||||
how you can use it is by sending HTTP GET reguests. Here is an example request:
|
||||
|
||||
> https://www.random.org/integers/?num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new
|
||||
|
||||
If you want to understand all the parameters plese check their meaning
|
||||
[here](https://www.random.org/clients/http/).
|
||||
|
||||
For us it is enough that this request generates 10 random integers in the range
|
||||
from 1 to 6. Run this query in Julia and parse the result.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Write a function that tries to parse a string as an integer.
|
||||
If it succeeds it should return the integer, otherwise it should return `0`
|
||||
but print error message.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Create a matrix containing truth table for `&&` operation including `missing`.
|
||||
If some operation errors store `"error"` in the table. As an extra feature (this
|
||||
is harder so you can skip it) in each cell store both inputs and output to make
|
||||
reading the table easier.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Take a vector `v = [1.5, 2.5, missing, 4.5, 5.5, missing]` and replace all
|
||||
missing values in it by the mean of the non-missing values.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Take a vector `s = ["1.5", "2.5", missing, "4.5", "5.5", missing]` and parse
|
||||
strings stored in it as `Float64`, while keeping `missing` values unchanged.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Print to the terminal all days in January 2023 that are Mondays.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Compute the dates that are one month later than January 15, 2020, February 15
|
||||
2020, March 15, 2020, and April 15, 2020. How many days pass during this one
|
||||
month. Print the results to the screen?
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Parse the following string as JSON:
|
||||
```
|
||||
str = """
|
||||
[{"x":1,"y":1},
|
||||
{"x":2,"y":4},
|
||||
{"x":3,"y":9},
|
||||
{"x":4,"y":16},
|
||||
{"x":5,"y":25}]
|
||||
"""
|
||||
```
|
||||
into a `json` variable.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Extract from the `json` variable from exercise 8 two vectors `x` and `y`
|
||||
that correspond to the fields stored in the JSON structure.
|
||||
Plot `y` as a function of `x`.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Given a vector `m = [missing, 1, missing, 3, missing, missing, 6, missing]`.
|
||||
Use linear interpolation for filling missing values. For the extreme values
|
||||
use nearest available observation (you will need to consult Impute.jl
|
||||
documentation to find all required functions).
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution (example run):
|
||||
|
||||
```
|
||||
julia> using HTTP
|
||||
|
||||
julia> response = HTTP.get("https://www.random.org/integers/?\
|
||||
num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new");
|
||||
|
||||
julia> parse.(Int, split(String(response.body)))
|
||||
10-element Vector{Int64}:
|
||||
6
|
||||
2
|
||||
6
|
||||
3
|
||||
4
|
||||
2
|
||||
5
|
||||
2
|
||||
3
|
||||
6
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Example function:
|
||||
|
||||
```
|
||||
function str2int(s::AbstractString)
|
||||
try
|
||||
return parse(Int, s)
|
||||
catch e
|
||||
println(e)
|
||||
end
|
||||
return 0
|
||||
end
|
||||
```
|
||||
|
||||
Let us check it:
|
||||
|
||||
```
|
||||
julia> str2int("10")
|
||||
10
|
||||
|
||||
julia> str2int(" -1 ")
|
||||
-1
|
||||
|
||||
julia> str2int("12345678901234567890")
|
||||
OverflowError("overflow parsing \"12345678901234567890\"")
|
||||
0
|
||||
|
||||
julia> str2int("1.3")
|
||||
ArgumentError("invalid base 10 digit '.' in \"1.3\"")
|
||||
0
|
||||
|
||||
julia> str2int("a")
|
||||
ArgumentError("invalid base 10 digit 'a' in \"a\"")
|
||||
0
|
||||
```
|
||||
|
||||
An alternative solution would use `tryparse` (not covered in the book):
|
||||
|
||||
```
|
||||
function str2int(s::AbstractString)
|
||||
v = tryparse(Int, s)
|
||||
if isnothing(v)
|
||||
println("error while parsing")
|
||||
return 0
|
||||
end
|
||||
return v
|
||||
end
|
||||
```
|
||||
But this time we do not see the cause of the error.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> function apply_and(x, y)
|
||||
try
|
||||
return "$x && $y = $(x && y)"
|
||||
catch e
|
||||
return "$x && $y = error"
|
||||
end
|
||||
end
|
||||
apply_and (generic function with 2 methods)
|
||||
|
||||
julia> apply_and.([true, false, missing], [true false missing])
|
||||
3×3 Matrix{String}:
|
||||
"true && true = true" "true && false = false" "true && missing = missing"
|
||||
"false && true = false" "false && false = false" "false && missing = false"
|
||||
"missing && true = error" "missing && false = error" "missing && missing = error"
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using Statistics
|
||||
|
||||
julia> coalesce.(v, mean(skipmissing(v)))
|
||||
6-element Vector{Float64}:
|
||||
1.5
|
||||
2.5
|
||||
3.5
|
||||
4.5
|
||||
5.5
|
||||
3.5
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using Missings
|
||||
|
||||
julia> passmissing(parse).(Float64, s)
|
||||
6-element Vector{Union{Missing, Float64}}:
|
||||
1.5
|
||||
2.5
|
||||
missing
|
||||
4.5
|
||||
5.5
|
||||
missing
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Example solution:
|
||||
|
||||
```
|
||||
julia> using Dates
|
||||
|
||||
julia> for day in Date.(2023, 01, 1:31)
|
||||
dayofweek(day) == 1 && println(day)
|
||||
end
|
||||
2023-01-02
|
||||
2023-01-09
|
||||
2023-01-16
|
||||
2023-01-23
|
||||
2023-01-30
|
||||
```
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Example solution:
|
||||
|
||||
```
|
||||
julia> for day in Date.(2023, 1:4, 15)
|
||||
day_next = day + Month(1)
|
||||
println("$day + 1 month = $day_next (difference: $(day_next - day))")
|
||||
end
|
||||
2023-01-15 + 1 month = 2023-02-15 (difference: 31 days)
|
||||
2023-02-15 + 1 month = 2023-03-15 (difference: 28 days)
|
||||
2023-03-15 + 1 month = 2023-04-15 (difference: 31 days)
|
||||
2023-04-15 + 1 month = 2023-05-15 (difference: 30 days)
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using JSON3
|
||||
|
||||
julia> json = JSON3.read(str)
|
||||
5-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, Vector{UInt64}}:
|
||||
{
|
||||
"x": 1,
|
||||
"y": 1
|
||||
}
|
||||
{
|
||||
"x": 2,
|
||||
"y": 4
|
||||
}
|
||||
{
|
||||
"x": 3,
|
||||
"y": 9
|
||||
}
|
||||
{
|
||||
"x": 4,
|
||||
"y": 16
|
||||
}
|
||||
{
|
||||
"x": 5,
|
||||
"y": 25
|
||||
}
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Plots
|
||||
x = [el.x for el in json]
|
||||
y = [el.y for el in json]
|
||||
plot(x, y, xlabel="x", ylabel="y", legend=false)
|
||||
```
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using Impute
|
||||
|
||||
julia> Impute.nocb!(Impute.locf!(Impute.interp(m)))
|
||||
8-element Vector{Union{Missing, Int64}}:
|
||||
1
|
||||
1
|
||||
2
|
||||
3
|
||||
4
|
||||
5
|
||||
6
|
||||
6
|
||||
```
|
||||
|
||||
Note that we use the `locf!` and `nocb!` functions (with `!`) to perform
|
||||
operation in place (a new vector was already allocated by `Impute.interp`).
|
||||
|
||||
</details>
|
||||
294
exercises/exercises08.md
Normal file
294
exercises/exercises08.md
Normal file
@@ -0,0 +1,294 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 8
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
|
||||
called `df`.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Get number of rows, columns, column names and summary statistics of the
|
||||
`df` data frame from exercise 1.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Make a plot of `number` against `square` columns of `df` data frame.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Add a column to `df` data frame with name `name string` containing string
|
||||
representation of numbers in column `number`, i.e.
|
||||
`["one", "two", "three", "four"]`.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Check if `df` contains column `square2`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Extract column `number` from `df` and empty it (recall `empty!` function
|
||||
discussed in chapter 4).
|
||||
|
||||
### Exercise 7
|
||||
|
||||
In `Random` module the `randexp` function is defined that samples numbers
|
||||
from exponential distribution with scale 1.
|
||||
Draw two 100,000 element samples from this distribution store them
|
||||
in `x` and `y` vectors. Plot histograms of maximum of pairs of sampled values
|
||||
and sum of vector `x` and half of vector `y`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Using vectors `x` and `y` from exercise 7 create the `df` data frame storing them,
|
||||
and maximum of pairs of sampled values and sum of vector `x` and half of vector `y`.
|
||||
Compute all standard descriptive statistics of columns of this data frame.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Store the `df` data frame from exercise 8 in Apache Arrow file and CSV file.
|
||||
Compare the size of created files using the `filesize` function.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Write the `df` data frame into SQLite database. Next find information about
|
||||
tables in this database. Run a query against a table representing the `df` data
|
||||
frame to calculate the mean of column `x`. Does it match the result we got in
|
||||
exercise 8?
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
CSV.jl supports reading gzip-compressed files so you can just do:
|
||||
|
||||
```
|
||||
julia> using CSV
|
||||
|
||||
julia> using DataFrames
|
||||
|
||||
julia> df = CSV.read("example8.csv.gz", DataFrame)
|
||||
4×2 DataFrame
|
||||
Row │ number square
|
||||
│ Int64 Int64
|
||||
─────┼────────────────
|
||||
1 │ 1 2
|
||||
2 │ 2 4
|
||||
3 │ 3 9
|
||||
4 │ 4 16
|
||||
```
|
||||
|
||||
You can also do it manually:
|
||||
```
|
||||
julia> using CodecZlib # you might need to install this package
|
||||
|
||||
julia> compressed = read("example8.csv.gz");
|
||||
|
||||
julia> plain = transcode(GzipDecompressor, compressed);
|
||||
|
||||
julia> df = CSV.read(plain, DataFrame)
|
||||
4×2 DataFrame
|
||||
Row │ number square
|
||||
│ Int64 Int64
|
||||
─────┼────────────────
|
||||
1 │ 1 2
|
||||
2 │ 2 4
|
||||
3 │ 3 9
|
||||
4 │ 4 16
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> nrow(df)
|
||||
4
|
||||
|
||||
julia> ncol(df)
|
||||
2
|
||||
|
||||
julia> names(df)
|
||||
2-element Vector{String}:
|
||||
"number"
|
||||
"square"
|
||||
|
||||
julia> describe(df)
|
||||
2×7 DataFrame
|
||||
Row │ variable mean min median max nmissing eltype
|
||||
│ Symbol Float64 Int64 Float64 Int64 Int64 DataType
|
||||
─────┼──────────────────────────────────────────────────────────────
|
||||
1 │ number 2.5 1 2.5 4 0 Int64
|
||||
2 │ square 7.75 2 6.5 16 0 Int64
|
||||
```
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
```
|
||||
using Plots
|
||||
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> df."name string" = ["one", "two", "three", "four"]
|
||||
4-element Vector{String}:
|
||||
"one"
|
||||
"two"
|
||||
"three"
|
||||
"four"
|
||||
|
||||
julia> df
|
||||
4×3 DataFrame
|
||||
Row │ number square name string
|
||||
│ Int64 Int64 String
|
||||
─────┼─────────────────────────────
|
||||
1 │ 1 2 one
|
||||
2 │ 2 4 two
|
||||
3 │ 3 9 three
|
||||
4 │ 4 16 four
|
||||
```
|
||||
|
||||
Note that we needed to use a string as we have space in column name.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
You can use either `hasproperty` or `columnindex`:
|
||||
|
||||
```
|
||||
julia> hasproperty(df, :square2)
|
||||
false
|
||||
|
||||
julia> columnindex(df, :square2)
|
||||
0
|
||||
```
|
||||
|
||||
Note that if you try to access this column you will get a hint what was the
|
||||
mistake you most likely made:
|
||||
|
||||
```
|
||||
julia> df.square2
|
||||
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> empty!(df[:, :number])
|
||||
Int64[]
|
||||
```
|
||||
|
||||
Note that you must not do `empty!(df[!, :number])` nor `empty!(df.number)`
|
||||
as it would corrupt the `df` data frame (these operations do non-copying
|
||||
extraction of a column from a data frame as opposed to `df[:, :number]`
|
||||
which makes a copy).
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
```
|
||||
using Random
|
||||
using Plots
|
||||
x = randexp(100_000);
|
||||
y = randexp(100_000);
|
||||
histogram(x + y / 2, label="mean")
|
||||
histogram!(max.(x, y), label="maximum")
|
||||
```
|
||||
|
||||
I have put both histograms on the same plot to show that they overlap.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution (you might get slightly different results because we did not set
|
||||
the seed of random number generator when creating `x` and `y` vectors):
|
||||
|
||||
```
|
||||
julia> df = DataFrame(x=x, y=y);
|
||||
|
||||
julia> df."x+y/2" = x + y / 2;
|
||||
|
||||
julia> df."max.(x,y)" = max.(x, y);
|
||||
|
||||
julia> describe(df, :all)
|
||||
4×13 DataFrame
|
||||
Row │ variable mean std min q25 median q75 max nunique nmissing first last eltype
|
||||
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64 Nothing Int64 Float64 Float64 DataType
|
||||
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ x 0.997023 0.999119 3.01389e-6 0.285129 0.68856 1.38414 12.1556 0 0.250502 0.077737 Float64
|
||||
2 │ y 1.00109 0.995904 2.78828e-6 0.289371 0.6957 1.38491 12.0445 0 0.689659 0.486246 Float64
|
||||
3 │ x+y/2 1.49757 1.11676 0.00217486 0.688598 1.2235 2.0113 14.2046 0 0.595331 0.32086 Float64
|
||||
4 │ max.(x,y) 1.49872 1.11295 0.00187844 0.691588 1.22466 2.01257 12.1556 0 0.689659 0.486246 Float64
|
||||
```
|
||||
|
||||
We indeed see that `x+y/2` and `max.(x,y)` columns have very similar summary
|
||||
statistics except `first` and `last` as expected.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
```
|
||||
julia> using Arrow
|
||||
|
||||
julia> CSV.write("df.csv", df)
|
||||
"df.csv"
|
||||
|
||||
julia> Arrow.write("df.arrow", df)
|
||||
"df.arrow"
|
||||
|
||||
julia> filesize("df.csv")
|
||||
7587820
|
||||
|
||||
julia> filesize("df.arrow")
|
||||
3200874
|
||||
```
|
||||
|
||||
In this case Apache Arrow file is smaller.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
```
|
||||
julia> using SQLite
|
||||
|
||||
julia> db = SQLite.DB("df.db")
|
||||
SQLite.DB("df.db")
|
||||
|
||||
julia> SQLite.load!(df, db, "df")
|
||||
"df"
|
||||
|
||||
julia> SQLite.tables(db)
|
||||
1-element Vector{SQLite.DBTable}:
|
||||
SQLite.DBTable("df", Tables.Schema:
|
||||
:x Union{Missing, Float64}
|
||||
:y Union{Missing, Float64}
|
||||
Symbol("x+y/2") Union{Missing, Float64}
|
||||
Symbol("max.(x,y)") Union{Missing, Float64})
|
||||
|
||||
julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");
|
||||
|
||||
julia> DataFrame(query)
|
||||
1×1 DataFrame
|
||||
Row │ AVG(x)
|
||||
│ Float64
|
||||
─────┼──────────
|
||||
1 │ 0.997023
|
||||
|
||||
julia> close(db)
|
||||
```
|
||||
|
||||
The computed mean of column `x` is the same as we got in exercise 8.
|
||||
|
||||
</details>
|
||||
293
exercises/exercises09.md
Normal file
293
exercises/exercises09.md
Normal file
@@ -0,0 +1,293 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 9
|
||||
|
||||
# Problems
|
||||
|
||||
In this problemset we will use the `puzzles.csv` file that was
|
||||
created in chapter 8. Please first load it into your Julia
|
||||
session using the commands:
|
||||
|
||||
```
|
||||
using CSV
|
||||
using DataFrames
|
||||
puzzles = CSV.read("puzzles.csv", DataFrame);
|
||||
```
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Create `matein2` data frame that will have only puzzles that have `"mateIn2"`
|
||||
in the `Themes` column.
|
||||
Use the `contains` function (check its documentation first).
|
||||
|
||||
### Exercise 2
|
||||
|
||||
What is the fraction of puzzles that are mate in 2 in relation to all puzzles
|
||||
in the `puzzles` data frame?
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Create `small` data frame that holds first 10 rows of `matein2` data frame
|
||||
and columns `Rating`, `RatingDeviation`, and `NbPlays`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Iterate rows of `small` data frame and print the ratio of
|
||||
`RatingDeviation` and `NbPlays` for each row.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Get names of columns from the `matein2` data frame that end with `n` (ignore case).
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Write a function `collatz` that runs the following process. Start with a
|
||||
positive number `n`. If it is even divide it by two. If it is odd multiply
|
||||
it by 3 and add one. The function should return the number of steps needed to
|
||||
reach 1.
|
||||
|
||||
Create a `d` dictionary that maps number of steps needed to a list of numbers from
|
||||
the range `1:100` that required this number of steps.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Using the `d` dictionary make a scatter plot of number of steps required
|
||||
vs average value of numbers that require this number of steps.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Repeat the process from exercises 6 and 7, but this time use a data frame
|
||||
and try to write an appropriate expression using the `combine` and `groupby`
|
||||
functions (as it was explained in the last part of chapter 9). This time
|
||||
perform computations for numbers ranging from one to one million.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Set seed of random number generator to `1234`. Draw 100 random points
|
||||
from the interval `[0, 1]`. Store this vector in a data frame as `x` column.
|
||||
Now compute `y` column using a formula `4 * (x - 0.5) ^ 2`.
|
||||
Add random noise to column `y` that has normal distribution with mean 0 and
|
||||
standard deviation 0.25. Call this column `z`.
|
||||
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Add a line of LOESS regression of `x` explaining `z` plot to figure produced in exercise 10.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> matein2 = puzzles[contains.(puzzles.Themes, "mateIn2"), :]
|
||||
274135×9 DataFrame
|
||||
Row │ PuzzleId FEN Moves Rating RatingDeviation Popularity NbPlays Themes GameUrl ⋯
|
||||
│ String7 String String Int64 Int64 Int64 Int64 String String ⋯
|
||||
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ 000hf r1bqk2r/pp1nbNp1/2p1p2p/8/2BP4/1… e8f7 e2e6 f7f8 e6f7 1560 76 88 441 mate mateIn2 middlegame short https://li ⋯
|
||||
2 │ 001Wz 4r1k1/5ppp/r1p5/p1n1RP2/8/2P2N1P… e8e5 d1d8 e5e8 d8e8 1128 81 87 54 backRankMate endgame mate mateIn… https://li
|
||||
3 │ 001om 5r1k/pp4pp/5p2/1BbQp1r1/6K1/7P/1… g4h4 c5f2 g2g3 f2g3 991 78 89 215 mate mateIn2 middlegame short https://li
|
||||
4 │ 003Tx 2r5/pR5p/5p1k/4p3/4r3/B4nPP/PP3P… e1e4 f3d2 b1a1 c8c1 1716 77 87 476 backRankMate endgame fork mate m… https://li
|
||||
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
|
||||
274132 │ zzxQS 2R2q2/3nk1r1/p1Br1p2/1p2p3/1P3Pn… c8f8 d6d1 e3e1 d1e1 1149 75 96 1722 mate mateIn2 middlegame short https://li ⋯
|
||||
274133 │ zzxvB 5rk1/R1Q2ppp/5n2/4p3/1pB5/7q/1P3… f6g4 c7f7 f8f7 a7a8 1695 74 95 4857 endgame mate mateIn2 pin sacrifi… https://li
|
||||
274134 │ zzzRN 4r2k/1NR2Q1p/4P1n1/pp1p4/3P4/4q3… g1h1 e3e1 f7f1 e1f1 830 108 67 31 endgame mate mateIn2 short https://li
|
||||
274135 │ zzzco 5Q2/pp3R1P/1kpp4/4p3/2P1P3/3PP2P… f7f2 b2c2 c1b1 e2d1 1783 75 90 763 endgame mate mateIn2 queensideAt… https://li
|
||||
1 column and 274127 rows omitted
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution (two ways to do it):
|
||||
|
||||
```
|
||||
julia> using Statistics
|
||||
|
||||
julia> nrow(matein2) / nrow(puzzles)
|
||||
0.12852152542746353
|
||||
|
||||
julia> mean(contains.(puzzles.Themes, "mateIn2"))
|
||||
0.12852152542746353
|
||||
```
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> small = matein2[1:10, ["Rating", "RatingDeviation", "NbPlays"]]
|
||||
10×3 DataFrame
|
||||
Row │ Rating RatingDeviation NbPlays
|
||||
│ Int64 Int64 Int64
|
||||
─────┼──────────────────────────────────
|
||||
1 │ 1560 76 441
|
||||
2 │ 1128 81 54
|
||||
3 │ 991 78 215
|
||||
4 │ 1716 77 476
|
||||
5 │ 711 81 111
|
||||
6 │ 723 86 806
|
||||
7 │ 754 92 248
|
||||
8 │ 1177 76 827
|
||||
9 │ 994 81 71
|
||||
10 │ 979 144 14
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> for row in eachrow(small)
|
||||
println(row.RatingDeviation / row.NbPlays)
|
||||
end
|
||||
0.17233560090702948
|
||||
1.5
|
||||
0.3627906976744186
|
||||
0.16176470588235295
|
||||
0.7297297297297297
|
||||
0.10669975186104218
|
||||
0.3709677419354839
|
||||
0.09189842805320435
|
||||
1.1408450704225352
|
||||
10.285714285714286
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution (several options):
|
||||
```
|
||||
julia> names(matein2, Cols(col -> uppercase(col[end]) == 'N'))
|
||||
2-element Vector{String}:
|
||||
"FEN"
|
||||
"RatingDeviation"
|
||||
|
||||
julia> names(matein2, Cols(col -> endswith(uppercase(col), "N")))
|
||||
2-element Vector{String}:
|
||||
"FEN"
|
||||
"RatingDeviation"
|
||||
|
||||
julia> names(matein2, r"[nN]$")
|
||||
2-element Vector{String}:
|
||||
"FEN"
|
||||
"RatingDeviation"
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> function collatz(n)
|
||||
i = 0
|
||||
while n != 1
|
||||
i += 1
|
||||
n = iseven(n) ? div(n, 2) : 3 * n + 1
|
||||
end
|
||||
return i
|
||||
end
|
||||
collatz (generic function with 1 method)
|
||||
|
||||
julia> d = Dict{Int, Vector{Int}}()
|
||||
Dict{Int64, Vector{Int64}}()
|
||||
|
||||
julia> for n in 1:100
|
||||
i = collatz(n)
|
||||
if haskey(d, i)
|
||||
push!(d[i], n)
|
||||
else
|
||||
d[i] = [n]
|
||||
end
|
||||
end
|
||||
|
||||
julia> d
|
||||
Dict{Int64, Vector{Int64}} with 45 entries:
|
||||
5 => [5, 32]
|
||||
35 => [78, 79]
|
||||
110 => [82, 83]
|
||||
30 => [86, 87, 89]
|
||||
32 => [57, 59]
|
||||
6 => [10, 64]
|
||||
115 => [73]
|
||||
112 => [54, 55]
|
||||
4 => [16]
|
||||
13 => [34, 35]
|
||||
104 => [47]
|
||||
12 => [17, 96]
|
||||
23 => [25]
|
||||
111 => [27]
|
||||
92 => [91]
|
||||
11 => [48, 52, 53]
|
||||
118 => [97]
|
||||
⋮ => ⋮
|
||||
```
|
||||
|
||||
As we can see even for small `n` the number of steps required to reach `1`
|
||||
can get quite large.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Plots
|
||||
using Statistics
|
||||
steps = collect(keys(d))
|
||||
mean_number = mean.(values(d))
|
||||
scatter(steps, mean_number, xlabel="steps", ylabel="mean of numbers", legend=false)
|
||||
```
|
||||
|
||||
Note that we needed to use `collect` on `keys` as `scatter` expects an array
|
||||
not just an iterator.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
df = DataFrame(n=1:10^6);
|
||||
df.collatz = collatz.(df.n);
|
||||
agg = combine(groupby(df, :collatz), :n => mean);
|
||||
scatter(agg.collatz, agg.n_mean, xlabel="steps", ylabel="mean of numbers", legend=false)
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Set seed of random number generator to `1234`. Draw 100 random points
|
||||
from the interval `[0, 1]`. Store this vector in a data frame as `x` column.
|
||||
Now compute `y` column using a formula `4 * (x - 0.5) ^ 2`.
|
||||
Add random noise to column `y` that has normal distribution with mean 0 and
|
||||
standard deviation 0.25. Call this column `z`.
|
||||
Make a scatter plot with `x` on x-axis and `y` and `z` on y-axis.
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Random
|
||||
Random.seed!(1234)
|
||||
df = DataFrame(x=rand(100))
|
||||
df.y = 4 .* (df.x .- 0.5) .^ 2
|
||||
df.z = df.y + randn(100) / 4
|
||||
scatter(df.x, [df.y df.z], labels=["y" "z"])
|
||||
```
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Loess
|
||||
model = loess(df.x, df.z);
|
||||
x_predict = sort(df.x)
|
||||
z_predict = predict(model, x_predict)
|
||||
plot!(x_predict, z_predict; label="z predicted")
|
||||
```
|
||||
|
||||
</details>
|
||||
303
exercises/exercises10.md
Normal file
303
exercises/exercises10.md
Normal file
@@ -0,0 +1,303 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 10
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Generate a random matrix `mat` having size 5x4 and all elements drawn
|
||||
independently and uniformly from the [0,1[ interval.
|
||||
Create a data frame using data from this matrix using auto-generated
|
||||
column names.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Now, using matrix `mat` create a data frame with randomly generated
|
||||
column names. Use the `randstring` function from the `Random` module
|
||||
to generate them. Store this data frame in `df` variable.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Create a new data frame, taking `df` as a source that will have the same
|
||||
columns but its column names will be `y1`, `y2`, `y3`, `y4`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Create a dictionary holding `column_name => column_vector` pairs
|
||||
using data stored in data frame `df`. Save this dictionary in variable `d`.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Create a data frame back from dictionary `d` from exercise 4. Compare it
|
||||
with `df`.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
For data frame `df` compute the dot product between all pairs of its columns.
|
||||
Use the `dot` function from the `LinearAlgebra` module.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Given two data frames:
|
||||
|
||||
```
|
||||
julia> df1 = DataFrame(a=1:2, b=11:12)
|
||||
2×2 DataFrame
|
||||
Row │ a b
|
||||
│ Int64 Int64
|
||||
─────┼──────────────
|
||||
1 │ 1 11
|
||||
2 │ 2 12
|
||||
|
||||
julia> df2 = DataFrame(a=1:2, c=101:102)
|
||||
2×2 DataFrame
|
||||
Row │ a c
|
||||
│ Int64 Int64
|
||||
─────┼──────────────
|
||||
1 │ 1 101
|
||||
2 │ 2 102
|
||||
```
|
||||
|
||||
vertically concatenate them so that only columns that are present in both
|
||||
data frames are kept. Check the documentation of `vcat` to see how to
|
||||
do it.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Now append to `df1` table `df2`, but add only the columns from `df2` that
|
||||
are present in `df1`. Check the documentation of `append!` to see how to
|
||||
do it.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Create a `circle` data frame, using the `push!` function that will store
|
||||
1000 samples of the following process:
|
||||
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
|
||||
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
|
||||
and is `false` otherwise.
|
||||
|
||||
Compute summary statistics of this data frame.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Create a scatterplot of `circle` data frame where its `x` and `y` axis
|
||||
will be the plotted points and `inside` variable will determine the color
|
||||
of the plotted point.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using DataFrames
|
||||
|
||||
julia> mat = rand(5, 4)
|
||||
5×4 Matrix{Float64}:
|
||||
0.8386 0.83612 0.0353994 0.15547
|
||||
0.590172 0.611815 0.0691152 0.915788
|
||||
0.879395 0.07271 0.980079 0.655158
|
||||
0.340435 0.756196 0.0697535 0.388578
|
||||
0.714515 0.861872 0.971521 0.176768
|
||||
|
||||
julia> DataFrame(mat, :auto)
|
||||
5×4 DataFrame
|
||||
Row │ x1 x2 x3 x4
|
||||
│ Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────
|
||||
1 │ 0.8386 0.83612 0.0353994 0.15547
|
||||
2 │ 0.590172 0.611815 0.0691152 0.915788
|
||||
3 │ 0.879395 0.07271 0.980079 0.655158
|
||||
4 │ 0.340435 0.756196 0.0697535 0.388578
|
||||
5 │ 0.714515 0.861872 0.971521 0.176768
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using Random
|
||||
|
||||
julia> df = DataFrame(mat, [randstring() for _ in 1:4])
|
||||
5×4 DataFrame
|
||||
Row │ 6mTK5evn K8Inf7ER 5Caz55k0 SRiGemsa
|
||||
│ Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────
|
||||
1 │ 0.8386 0.83612 0.0353994 0.15547
|
||||
2 │ 0.590172 0.611815 0.0691152 0.915788
|
||||
3 │ 0.879395 0.07271 0.980079 0.655158
|
||||
4 │ 0.340435 0.756196 0.0697535 0.388578
|
||||
5 │ 0.714515 0.861872 0.971521 0.176768
|
||||
```
|
||||
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
|
||||
5×4 DataFrame
|
||||
Row │ y1 y2 y3 y4
|
||||
│ Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────
|
||||
1 │ 0.8386 0.83612 0.0353994 0.15547
|
||||
2 │ 0.590172 0.611815 0.0691152 0.915788
|
||||
3 │ 0.879395 0.07271 0.980079 0.655158
|
||||
4 │ 0.340435 0.756196 0.0697535 0.388578
|
||||
5 │ 0.714515 0.861872 0.971521 0.176768
|
||||
```
|
||||
|
||||
You could also use the `raname` function:
|
||||
```
|
||||
julia> rename(df, string.("y", 1:4))
|
||||
5×4 DataFrame
|
||||
Row │ y1 y2 y3 y4
|
||||
│ Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────
|
||||
1 │ 0.8386 0.83612 0.0353994 0.15547
|
||||
2 │ 0.590172 0.611815 0.0691152 0.915788
|
||||
3 │ 0.879395 0.07271 0.980079 0.655158
|
||||
4 │ 0.340435 0.756196 0.0697535 0.388578
|
||||
5 │ 0.714515 0.861872 0.971521 0.176768
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> d = Dict([n => df[:, n] for n in names(df)])
|
||||
Dict{String, Vector{Float64}} with 4 entries:
|
||||
"6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
|
||||
"5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
|
||||
"K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
|
||||
"SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
|
||||
```
|
||||
|
||||
or (using the `pairs` function; note that this time column names are `Symbol`):
|
||||
|
||||
```
|
||||
julia> Dict(pairs(eachcol(df)))
|
||||
Dict{Symbol, AbstractVector} with 4 entries:
|
||||
Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
|
||||
:SRiGemsa => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
|
||||
:K8Inf7ER => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
|
||||
Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> DataFrame(d)
|
||||
5×4 DataFrame
|
||||
Row │ 5Caz55k0 6mTK5evn K8Inf7ER SRiGemsa
|
||||
│ Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────
|
||||
1 │ 0.0353994 0.8386 0.83612 0.15547
|
||||
2 │ 0.0691152 0.590172 0.611815 0.915788
|
||||
3 │ 0.980079 0.879395 0.07271 0.655158
|
||||
4 │ 0.0697535 0.340435 0.756196 0.388578
|
||||
5 │ 0.971521 0.714515 0.861872 0.176768
|
||||
```
|
||||
|
||||
Note that columns of a data frame are now sorted by their names.
|
||||
This is done for `Dict` objects because such dictionaries do not have
|
||||
a defined order of keys.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using LinearAlgebra
|
||||
|
||||
julia> using StatsBase
|
||||
|
||||
julia> pairwise(dot, eachcol(df))
|
||||
4×4 Matrix{Float64}:
|
||||
2.45132 1.99944 1.65026 1.50558
|
||||
1.99944 2.39336 1.03322 1.18411
|
||||
1.65026 1.03322 1.9153 0.909744
|
||||
1.50558 1.18411 0.909744 1.47431
|
||||
```
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> vcat(df1, df2, cols=:intersect)
|
||||
4×1 DataFrame
|
||||
Row │ a
|
||||
│ Int64
|
||||
─────┼───────
|
||||
1 │ 1
|
||||
2 │ 2
|
||||
3 │ 1
|
||||
4 │ 2
|
||||
```
|
||||
|
||||
By default you will get an error:
|
||||
|
||||
```
|
||||
julia> vcat(df1, df2)
|
||||
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> append!(df1, df2, cols=:subset)
|
||||
4×2 DataFrame
|
||||
Row │ a b
|
||||
│ Int64 Int64?
|
||||
─────┼────────────────
|
||||
1 │ 1 11
|
||||
2 │ 2 12
|
||||
3 │ 1 missing
|
||||
4 │ 2 missing
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution
|
||||
|
||||
```
|
||||
circle=DataFrame()
|
||||
for _ in 1:1000
|
||||
x, y = 2rand()-1, 2rand()-1
|
||||
inside = x^2 + y^2 < 1
|
||||
push!(circle, (x=x, y=y, inside=inside))
|
||||
end
|
||||
describe(circle)
|
||||
```
|
||||
|
||||
We note that mean of variable `inside` is approximately π.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Plots
|
||||
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
|
||||
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)
|
||||
```
|
||||
|
||||
In the solution two ways to plot ensuring the ratio between x and y axis is 1
|
||||
are shown. Note the differences in the produced output between the two methods.
|
||||
|
||||
</details>
|
||||
479
exercises/exercises11.md
Normal file
479
exercises/exercises11.md
Normal file
@@ -0,0 +1,479 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 11
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Generate a data frame `df` having one column `x` consisting of 100,000 values
|
||||
sampled from uniform distribution on [0, 1[ interval.
|
||||
Serialize it to disk, and next deserialize. Check if the deserialized
|
||||
object is the same as the source data frame.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Add a column `n` to the `df` data frame that in each row will hold the
|
||||
number of observations in column `x` that have distance less than `0.1` to
|
||||
a value stored in a given row of `x`.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Investigate visually how does `n` depend on `x` in data frame `df`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Someone has prepared the following test data for you:
|
||||
```
|
||||
teststr = """
|
||||
"x","sinx"
|
||||
0.139279,0.138829
|
||||
0.456779,0.441059
|
||||
0.344034,0.337287
|
||||
0.140253,0.139794
|
||||
0.848344,0.750186
|
||||
0.977512,0.829109
|
||||
0.032737,0.032731
|
||||
0.702750,0.646318
|
||||
0.422339,0.409895
|
||||
0.393878,0.383772
|
||||
"""
|
||||
```
|
||||
|
||||
Load this data into `testdf` data frame.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Check the accuracy of computations of sinus of `x` in `testdf`.
|
||||
Print all rows for which the absolute difference is greater than `5e-7`.
|
||||
In this case display `x`, `sinx`, the exact value of `sin(x)` and the absolute
|
||||
difference.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Group data in data frame `df` into buckets of 0.1 width and store the result in
|
||||
`gdf` data frame (sort the groups). Use the `cut` function from
|
||||
CategoricalArrays.jl to do it (check its documentation to learn how to do it).
|
||||
Check the number of values in each group.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Display the grouping keys in `gdf` grouped data frame. Show them as named tuples.
|
||||
Check what would be the group order if you asked not to sort them.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Compute average `n` for each group in `gdf`.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Fit a linear model explaining `n` by `x` separately for each group in `gdf`.
|
||||
Use the `\` operator to fit it (recall it from chapter 4).
|
||||
For each group produce the result as named tuple having fields `α₀` and `αₓ`.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Repeat exercise 9 but using the GLM.jl package. This time
|
||||
extract the p-value for the slope of estimated coefficient for `x` variable.
|
||||
Use the `coeftable` function from GLM.jl to get this information.
|
||||
Check the documentation of this function to learn how to do it (it will be
|
||||
easiest for you to first convert its result to a `DataFrame`).
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using DataFrames
|
||||
|
||||
julia> df = DataFrame(x=rand(100_000));
|
||||
|
||||
julia> using Serialization
|
||||
|
||||
julia> serialize("df.bin", df)
|
||||
|
||||
julia> deserialize("df.bin") == df
|
||||
true
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
|
||||
A simple approach is:
|
||||
```
|
||||
df.n = map(v -> count(abs.(df.x .- v) .< 0.1), df.x)
|
||||
```
|
||||
|
||||
A more sophisticated approach (faster and allocating less memory) would be:
|
||||
```
|
||||
df.n = `map(v -> count(w -> abs(w-v) < 0.1, df.x), df.x)`
|
||||
```
|
||||
|
||||
An even faster solution that is type stable would use function barrier:
|
||||
```
|
||||
f(x) = map(v -> count(w -> abs(w-v) < 0.1, x), x)
|
||||
df.n = f(df.x)
|
||||
```
|
||||
|
||||
Finally you can work on sorted data to get a much better performance. Here is an
|
||||
example (it is a bit more advanced):
|
||||
```
|
||||
function f2(x)
|
||||
p = sortperm(x)
|
||||
n = zeros(Int, length(x))
|
||||
start = 1
|
||||
stop = 1
|
||||
idx = 0
|
||||
while idx < length(x) # you could add @inbounds here but I typically avoid it
|
||||
idx += 1
|
||||
while x[p[idx]] - x[p[start]] >= 0.1
|
||||
start += 1
|
||||
end
|
||||
while stop <= length(x) && x[p[stop]] - x[p[idx]] < 0.1
|
||||
stop += 1
|
||||
end
|
||||
n[p[idx]] = stop - start
|
||||
end
|
||||
return n
|
||||
end
|
||||
df.n = f2(df.x)
|
||||
```
|
||||
|
||||
In this solution the fact that we used function barrier is even more relevant
|
||||
as we explicitly use loops inside.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Plots
|
||||
scatter(df.x, df.n, xlabel="x", ylabel="neighbors", legend=false)
|
||||
```
|
||||
|
||||
As expected on the border of the domain number of neighbors drops.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using CSV
|
||||
|
||||
julia> using DataFrames
|
||||
|
||||
julia> testdf = CSV.read(IOBuffer(teststr), DataFrame)
|
||||
10×2 DataFrame
|
||||
Row │ x sinx
|
||||
│ Float64 Float64
|
||||
─────┼────────────────────
|
||||
1 │ 0.139279 0.138829
|
||||
2 │ 0.456779 0.441059
|
||||
3 │ 0.344034 0.337287
|
||||
4 │ 0.140253 0.139794
|
||||
5 │ 0.848344 0.750186
|
||||
6 │ 0.977512 0.829109
|
||||
7 │ 0.032737 0.032731
|
||||
8 │ 0.70275 0.646318
|
||||
9 │ 0.422339 0.409895
|
||||
10 │ 0.393878 0.383772
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Since data frame is small we can use `eachrow`:
|
||||
|
||||
```
|
||||
julia> for row in eachrow(testdf)
|
||||
sinx = sin(row.x)
|
||||
dev = abs(sinx - row.sinx)
|
||||
dev > 5e-7 && println((x=row.x, computed=sinx, data=row.sinx, dev=dev))
|
||||
end
|
||||
(x = 0.456779, computed = 0.44105962391808606, data = 0.441059, dev = 6.239180860845295e-7)
|
||||
(x = 0.70275, computed = 0.6463185646550751, data = 0.646318, dev = 5.646550751414736e-7)
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
```
|
||||
julia> using CategoricalArrays
|
||||
|
||||
julia> df.xbins = cut(df.x, 0.0:0.1:1.0);
|
||||
|
||||
julia> gdf = groupby(df, :xbins; sort=true);
|
||||
|
||||
julia> [nrow(group) for group in gdf]
|
||||
10-element Vector{Int64}:
|
||||
9872
|
||||
9976
|
||||
9968
|
||||
9943
|
||||
10063
|
||||
10173
|
||||
9977
|
||||
10076
|
||||
9908
|
||||
10044
|
||||
|
||||
julia> combine(gdf, nrow) # alternative way to do it
|
||||
10×2 DataFrame
|
||||
Row │ xbins nrow
|
||||
│ Cat… Int64
|
||||
─────┼───────────────────
|
||||
1 │ [0.0, 0.1) 9872
|
||||
2 │ [0.1, 0.2) 9976
|
||||
3 │ [0.2, 0.3) 9968
|
||||
4 │ [0.3, 0.4) 9943
|
||||
5 │ [0.4, 0.5) 10063
|
||||
6 │ [0.5, 0.6) 10173
|
||||
7 │ [0.6, 0.7) 9977
|
||||
8 │ [0.7, 0.8) 10076
|
||||
9 │ [0.8, 0.9) 9908
|
||||
10 │ [0.9, 1.0) 10044
|
||||
```
|
||||
|
||||
You might get a bit different numbers but all should be around 10,000.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> NamedTuple.(keys(gdf))
|
||||
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
|
||||
(xbins = "[0.0, 0.1)",)
|
||||
(xbins = "[0.1, 0.2)",)
|
||||
(xbins = "[0.2, 0.3)",)
|
||||
(xbins = "[0.3, 0.4)",)
|
||||
(xbins = "[0.4, 0.5)",)
|
||||
(xbins = "[0.5, 0.6)",)
|
||||
(xbins = "[0.6, 0.7)",)
|
||||
(xbins = "[0.7, 0.8)",)
|
||||
(xbins = "[0.8, 0.9)",)
|
||||
(xbins = "[0.9, 1.0)",)
|
||||
|
||||
julia> NamedTuple.(keys(groupby(df, :xbins; sort=false)))
|
||||
10-element Vector{NamedTuple{(:xbins,), Tuple{CategoricalValue{String, UInt32}}}}:
|
||||
(xbins = "[0.4, 0.5)",)
|
||||
(xbins = "[0.9, 1.0)",)
|
||||
(xbins = "[0.8, 0.9)",)
|
||||
(xbins = "[0.0, 0.1)",)
|
||||
(xbins = "[0.2, 0.3)",)
|
||||
(xbins = "[0.5, 0.6)",)
|
||||
(xbins = "[0.7, 0.8)",)
|
||||
(xbins = "[0.3, 0.4)",)
|
||||
(xbins = "[0.1, 0.2)",)
|
||||
(xbins = "[0.6, 0.7)",)
|
||||
```
|
||||
|
||||
If you pass `sort=false` instead of `sort=true` you get groups in their order
|
||||
of appearance in `df`. If you skipped specifying `sort` keyword argument
|
||||
the resulting group order could depend on the type of grouping column, so if
|
||||
you want to depend on the order of groups always spass `sort` keyword argument
|
||||
explicitly.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using Statistics
|
||||
|
||||
julia> [mean(group.n) for group in gdf]
|
||||
10-element Vector{Float64}:
|
||||
14845.847751215559
|
||||
19835.367882919007
|
||||
19919.195826645264
|
||||
19993.023936437694
|
||||
20105.506111497565
|
||||
20222.35761329008
|
||||
20151.794727874112
|
||||
20022.69610956729
|
||||
19909.331550262414
|
||||
14944.511449621665
|
||||
|
||||
julia> combine(gdf, :n => mean) # alternative way to do it
|
||||
10×2 DataFrame
|
||||
Row │ xbins n_mean
|
||||
│ Cat… Float64
|
||||
─────┼─────────────────────
|
||||
1 │ [0.0, 0.1) 14845.8
|
||||
2 │ [0.1, 0.2) 19835.4
|
||||
3 │ [0.2, 0.3) 19919.2
|
||||
4 │ [0.3, 0.4) 19993.0
|
||||
5 │ [0.4, 0.5) 20105.5
|
||||
6 │ [0.5, 0.6) 20222.4
|
||||
7 │ [0.6, 0.7) 20151.8
|
||||
8 │ [0.7, 0.8) 20022.7
|
||||
9 │ [0.8, 0.9) 19909.3
|
||||
10 │ [0.9, 1.0) 14944.5
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> function fitmodel(x, n)
|
||||
X = [ones(length(x)) x]
|
||||
α₀, αₓ = X \ n
|
||||
return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
|
||||
end
|
||||
fitmodel (generic function with 1 method)
|
||||
|
||||
julia> [fitmodel(group.x, group.n) for group in gdf]
|
||||
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
|
||||
(α₀ = 9900.190310776916, αₓ = 99131.14394200995)
|
||||
(α₀ = 19823.115188829383, αₓ = 81.66979172871368)
|
||||
(α₀ = 19812.9822724435, αₓ = 424.00895772216785)
|
||||
(α₀ = 19810.726510910834, αₓ = 520.6763238983195)
|
||||
(α₀ = 19437.772385484135, αₓ = 1483.333906139938)
|
||||
(α₀ = 20187.521449870146, αₓ = 63.30709585406235)
|
||||
(α₀ = 20424.362332155855, αₓ = -419.42268710601405)
|
||||
(α₀ = 20789.70660364678, αₓ = -1022.9778397184706)
|
||||
(α₀ = 20013.690535193662, αₓ = -122.80055110522495)
|
||||
(α₀ = 109320.55276082881, αₓ = -99305.18846102979)
|
||||
|
||||
julia> combine(gdf, [:x, :n] => fitmodel => AsTable) # alternative syntax that you will learn in chapter 14
|
||||
10×3 DataFrame
|
||||
Row │ xbins α₀ αₓ
|
||||
│ Cat… Float64 Float64
|
||||
─────┼────────────────────────────────────────
|
||||
1 │ [0.0, 0.1) 9900.19 99131.1
|
||||
2 │ [0.1, 0.2) 19823.1 81.6698
|
||||
3 │ [0.2, 0.3) 19813.0 424.009
|
||||
4 │ [0.3, 0.4) 19810.7 520.676
|
||||
5 │ [0.4, 0.5) 19437.8 1483.33
|
||||
6 │ [0.5, 0.6) 20187.5 63.3071
|
||||
7 │ [0.6, 0.7) 20424.4 -419.423
|
||||
8 │ [0.7, 0.8) 20789.7 -1022.98
|
||||
9 │ [0.8, 0.9) 20013.7 -122.801
|
||||
10 │ [0.9, 1.0) 1.09321e5 -99305.2
|
||||
```
|
||||
|
||||
We note that indeed in the first and last group the regression has a significant
|
||||
slope.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using GLM
|
||||
|
||||
julia> function fitlmmodel(group; info=false)
|
||||
model = lm(@formula(n~x), group)
|
||||
coefdf = DataFrame(coeftable(model))
|
||||
info && @show coefdf # to see how the data frame looks like
|
||||
α₀, αₓ = coefdf[:, "Coef."]
|
||||
return (α₀=α₀, αₓ=αₓ) # or (;α₀, αₓ) as will be discussed in chapter 14
|
||||
end
|
||||
fitlmmodel (generic function with 1 method)
|
||||
|
||||
julia> [fitlmmodel(group; info = true) for group in gdf]
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 9900.19 0.388607 25476.1 0.0 9899.43 9900.95
|
||||
2 │ x 99131.1 6.75846 14667.7 0.0 99117.9 99144.4
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼───────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 19823.1 2.52926 7837.5 0.0 19818.2 19828.1
|
||||
2 │ x 81.6698 16.5512 4.93436 8.17139e-7 49.226 114.114
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼───────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 19813.0 2.8427 6969.79 0.0 19807.4 19818.6
|
||||
2 │ x 424.009 11.2737 37.6106 1.32368e-289 401.91 446.108
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼───────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 19810.7 3.98478 4971.59 0.0 19802.9 19818.5
|
||||
2 │ x 520.676 11.3429 45.9033 0.0 498.442 542.911
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 19437.8 6.07925 3197.4 0.0 19425.9 19449.7
|
||||
2 │ x 1483.33 13.4768 110.065 0.0 1456.92 1509.75
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 20187.5 9.72795 2075.21 0.0 20168.5 20206.6
|
||||
2 │ x 63.3071 17.6538 3.58603 0.000337323 28.7022 97.912
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼──────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 20424.4 10.2201 1998.45 0.0 20404.3 20444.4
|
||||
2 │ x -419.423 15.7112 -26.6958 1.0356e-151 -450.22 -388.626
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼──────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 20789.7 9.56063 2174.51 0.0 20771.0 20808.4
|
||||
2 │ x -1022.98 12.7417 -80.2856 0.0 -1047.95 -998.001
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 20013.7 8.86033 2258.8 0.0 19996.3 20031.1
|
||||
2 │ x -122.801 10.4201 -11.785 7.60822e-32 -143.226 -102.375
|
||||
coefdf = 2×7 DataFrame
|
||||
Row │ Name Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
|
||||
│ String Float64 Float64 Float64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
|
||||
1 │ (Intercept) 1.09321e5 5.78343 18902.4 0.0 1.09309e5 1.09332e5
|
||||
2 │ x -99305.2 6.08269 -16325.9 0.0 -99317.1 -99293.3
|
||||
10-element Vector{NamedTuple{(:α₀, :αₓ), Tuple{Float64, Float64}}}:
|
||||
(α₀ = 9900.190310776927, αₓ = 99131.1439420097)
|
||||
(α₀ = 19823.115188829663, αₓ = 81.66979172690417)
|
||||
(α₀ = 19812.98227244386, αₓ = 424.00895772074136)
|
||||
(α₀ = 19810.726510911398, αₓ = 520.6763238966264)
|
||||
(α₀ = 19437.772385487086, αₓ = 1483.3339061333743)
|
||||
(α₀ = 20187.521449871012, αₓ = 63.307095852511125)
|
||||
(α₀ = 20424.36233216108, αₓ = -419.4226871140539)
|
||||
(α₀ = 20789.706603652226, αₓ = -1022.9778397257375)
|
||||
(α₀ = 20013.69053519897, αₓ = -122.80055111148658)
|
||||
(α₀ = 109320.55276074051, αₓ = -99305.18846093686)
|
||||
|
||||
julia> combine(gdf, fitlmmodel)
|
||||
10×3 DataFrame
|
||||
Row │ xbins α₀ αₓ
|
||||
│ Cat… Float64 Float64
|
||||
─────┼────────────────────────────────────────
|
||||
1 │ [0.0, 0.1) 9900.19 99131.1
|
||||
2 │ [0.1, 0.2) 19823.1 81.6698
|
||||
3 │ [0.2, 0.3) 19813.0 424.009
|
||||
4 │ [0.3, 0.4) 19810.7 520.676
|
||||
5 │ [0.4, 0.5) 19437.8 1483.33
|
||||
6 │ [0.5, 0.6) 20187.5 63.3071
|
||||
7 │ [0.6, 0.7) 20424.4 -419.423
|
||||
8 │ [0.7, 0.8) 20789.7 -1022.98
|
||||
9 │ [0.8, 0.9) 20013.7 -122.801
|
||||
10 │ [0.9, 1.0) 1.09321e5 -99305.2
|
||||
```
|
||||
|
||||
We got the same results. The `combine(gdf, fitlmmodel)` style of using
|
||||
the `combine` function is a bit more advanced and is not covered in the book.
|
||||
It is used in the cases, like the one we have here, when you want to pass
|
||||
a whole group to the function in `combine`. Check DataFrames.jl documentation
|
||||
for more detailed explanations.
|
||||
|
||||
</details>
|
||||
247
exercises/exercises12.md
Normal file
247
exercises/exercises12.md
Normal file
@@ -0,0 +1,247 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 12
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
The https://go.dev/dl/go1.19.2.src.tar.gz link contains source code of
|
||||
Go language version 19.2. As you can check on its website its SHA-256
|
||||
is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`.
|
||||
Download this file and check if it indeed has this checksum.
|
||||
You might need to read documentation of `string` and `join` functions.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip
|
||||
that contains the ego-nets of Eastern European users collected from the music
|
||||
streaming service Deezer in February 2020. Nodes are users and edges are mutual
|
||||
follower relationships.
|
||||
|
||||
From the file extract deezer_edges.json and deezer_target.csv files and
|
||||
save them to disk.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Load deezer_edges.json and deezer_target.csv files to Julia.
|
||||
The JSON file should be loaded as JSON3.jl object `edges_json`.
|
||||
The CSV file should be loaded into a data frame `target_df`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Check that keys in the `edges_json` are in the same order as `id` column
|
||||
in `target_df`.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
From every value stored in `edges_json` create a graph representing
|
||||
ego-net of the given node. Store these graphs in a vector that will make the
|
||||
`egonet` column of in the `target_df` data frame.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Ego-net in our data set is a subgraph of a full Deezer graph where for some
|
||||
node all its neighbors are included, but also it contains all edges between the
|
||||
neighbors.
|
||||
Therefore we expect that diameter of every ego-net is at most 2 (as every
|
||||
two nodes are either connected directly or by a common friend).
|
||||
Check if this is indeed the case. Use the `diameter` function.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
For each ego-net find a central node that is connected to every other node
|
||||
in this network. Use the `degree` and `findall` functions to achieve this.
|
||||
Add `center` column with numbers of nodes that are connected to all other
|
||||
nodes in the ego-net to `target_df` data frame.
|
||||
|
||||
Next add a column `center_len` that gives the number of such nodes.
|
||||
|
||||
Check how many times different numbers of center nodes are found.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Add the following ego-net features to the `target_df` data frame:
|
||||
* `size`: number of nodes in ego-net
|
||||
* `mean_degree`: average node degree in ego-net
|
||||
|
||||
Check mean values of these two columns by `target` column.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Continuing to work with `target_df` data frame create a logistic regression
|
||||
explaining `target` by `size` and `mean_degree`.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Continuing to work with `target_df` create a scatterplot where `size` will be on
|
||||
one axis and `mean_degree` rounded to nearest integer on the other axis.
|
||||
Plot the mean of `target` for each point being a combination of `size` and
|
||||
rounded `mean_degree`.
|
||||
|
||||
Additionally fit a LOESS model explaining `target` by `size`. Make a prediction
|
||||
for values in range from 5% to 95% quantile (to concentrate on typical values
|
||||
of size).
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Downloads
|
||||
using SHA
|
||||
Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz")
|
||||
shavec = open(sha256, "go.tar.gz")
|
||||
shastr = join(string.(s; base=16, pad=2))
|
||||
sha == shastr
|
||||
```
|
||||
|
||||
The last line should produce `true`.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
|
||||
import ZipFile
|
||||
archive = ZipFile.Reader("ego.zip")
|
||||
idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files))
|
||||
open("deezer_edges.json", "w") do io
|
||||
write(io, read(archive.files[idx]))
|
||||
end
|
||||
idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files))
|
||||
open("deezer_target.csv", "w") do io
|
||||
write(io, read(archive.files[idx]))
|
||||
end
|
||||
close(archive)
|
||||
```
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using CSV
|
||||
using DataFrames
|
||||
using JSON3
|
||||
edges_json = JSON3.read(read("deezer_edges.json"))
|
||||
target_df = CSV.read("deezer_target.csv", DataFrame)
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution (short, but you need to have a good understanding of Julia types
|
||||
and standar functions to properly write it):
|
||||
```
|
||||
Symbol.(target_df.id) == keys(edges_json)
|
||||
```
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Graphs
|
||||
function edgelist2graph(edgelist)
|
||||
nodes = sort!(unique(reduce(vcat, edgelist)))
|
||||
@assert 0:length(nodes)-1 == nodes
|
||||
g = SimpleGraph(length(nodes))
|
||||
for (a, b) in edgelist
|
||||
add_edge!(g, a+1, b+1)
|
||||
end
|
||||
return g
|
||||
end
|
||||
target_df.egonet = edgelist2graph(values(edges_json))
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> extrema(diameter.(target_df.egonet))
|
||||
(2, 2)
|
||||
```
|
||||
|
||||
Indeed we see that for each ego-net diameter is 2.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
target_df.center = map(target_df.egonet) do g
|
||||
findall(==(nv(g) - 1), degree(g))
|
||||
end
|
||||
target_df.center_len = length.(target_df.center)
|
||||
combine(groupby(target_df, :center_len, sort=true), nrow)
|
||||
```
|
||||
|
||||
Note that we used `map` since in this case it gives a convenient way to express
|
||||
the condition we want to check.
|
||||
|
||||
We notice that in some cases it is impossible to identify the center of the
|
||||
ego-net uniquely.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Statistics
|
||||
target_df.size = nv.(target_df.egonet)
|
||||
target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size
|
||||
combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
|
||||
```
|
||||
|
||||
It seems that for target equal to `0` size and average degree in the network are
|
||||
a bit larger.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using GLM
|
||||
glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
|
||||
```
|
||||
|
||||
We see that only `size` is statistically significant.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Plots
|
||||
target_df.round_degree = round.(Int, target_df.mean_degree)
|
||||
agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean)
|
||||
scatter(agg_df.size, agg_df.round_degree;
|
||||
zcolor=agg_df.target_mean,
|
||||
xlabel="size", ylabel="rounded degree",
|
||||
label="mean target", xaxis=:log)
|
||||
```
|
||||
|
||||
It is hard to visually see any strong relationship in the data.
|
||||
|
||||
```
|
||||
import Loess
|
||||
model = Loess.loess(target_df.size, target_df.target)
|
||||
size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95)
|
||||
target_predict = Loess.predict(model, size_predict)
|
||||
plot(size_predict, target_predict;
|
||||
xlabel="size", ylabel="predicted target", legend=false)
|
||||
```
|
||||
|
||||
Between quantiles 5% and 95% we see a downward shaped relationship.
|
||||
|
||||
</details>
|
||||
280
exercises/exercises13.md
Normal file
280
exercises/exercises13.md
Normal file
@@ -0,0 +1,280 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 13
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Download
|
||||
https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip
|
||||
archive and extract primary_data.csv and secondary_data.csv files from it.
|
||||
Save the files to disk.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Load primary_data.csv into the `primary` data frame.
|
||||
Load secondary_data.csv into the `secondary` data frame.
|
||||
Describe the contents of both data frames.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Start with `primary` data. Note that columns starting from column 4 have
|
||||
their data encoded using vector notation, but they have been read-in as strings.
|
||||
Convert these columns to hold proper vectors. Note that some columns have
|
||||
`missing` values. Most of the columns hold nominal data, but three columns,
|
||||
i.e. `cap-diameter`, `stem-height`, and `stem-width` have numeric data. These
|
||||
should be parsed as vectors storing numeric values. After parsing, put these
|
||||
three columns just after `class` column in the `parsed_primary` data frame.
|
||||
|
||||
Check `renamecols` keyword argument of `select` to
|
||||
avoid renaming of the produced columns.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
In `parsed_primary` data frame find all pairs of mushrooms (rows) that might be
|
||||
confused and have a different class, using the information about them we have
|
||||
(so all information except their family, name, and class).
|
||||
|
||||
Use the following rules:
|
||||
* if for some pair of mushrooms the data in some column for either of them is
|
||||
`missing` then skip matching on this column; for numeric columns if there
|
||||
is only one value in a vector then treat it as `missing`;
|
||||
* otherwise:
|
||||
- for numeric columns check if there is an overlap in the interval
|
||||
specified by the min and max values for the range passed;
|
||||
- for nominal columns check if the intersection of nominal values is nonempty.
|
||||
|
||||
For each found pair print to the screen the row number, family, name, and class.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Still using `parsed_primary` find what is the average probability of class being
|
||||
`p` by `family`. Additionally add number of observations in each group. Sort
|
||||
these results by the probability. Try using DataFramesMeta.jl to do this
|
||||
exercise (this requirement is optional).
|
||||
|
||||
Store the result in `agg_primary` data frame.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Now using `agg_primary` data frame collapse it so that for each unique `pr_p`
|
||||
it gives us a total number of rows that had this probability and a tuple
|
||||
of mushroom family names.
|
||||
|
||||
Optionally: try to display the produced table so that the tuple containing the
|
||||
list of families for each group is not cropped (this will require large
|
||||
terminal).
|
||||
|
||||
### Exercise 7
|
||||
|
||||
From our preliminary analysis of `primary` data we see that `missing` value in
|
||||
the primary data is non-informative, so in `secondary` data we should be
|
||||
cautious when building a model if we allowed for missing data (in practice
|
||||
if we were investigating some real mushroom we most likely would know its
|
||||
characteristics).
|
||||
|
||||
Therefore as a first step drop in-place all columns in `secondary` data frame
|
||||
that have missing values.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Create a logistic regression predicting `class` based on all remaining features
|
||||
in the data frame. You might need to check the `Term` usage in StatsModels.jl
|
||||
documentation.
|
||||
|
||||
You will notice that for `stem-color` and `habitat` columns you get strange
|
||||
estimation results (large absolute values of estimated parameters and even
|
||||
larger standard errors). Explain why this happens by analyzing frequency tables
|
||||
of these variables against `class` column.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Add `class_p` column to `secondary` as a second column that will contain
|
||||
predicted probability from the model created in exercise 8 of a given
|
||||
observation having class `p`.
|
||||
|
||||
Print descriptive statistics of column `class_p` by `class`.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Plot FPR-TPR ROC curve for our model and compute associated AUC value.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Downloads
|
||||
import ZipFile
|
||||
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
|
||||
archive = ZipFile.Reader("MushroomDataset.zip")
|
||||
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
|
||||
open("primary_data.csv", "w") do io
|
||||
write(io, read(archive.files[idx]))
|
||||
end
|
||||
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
|
||||
open("secondary_data.csv", "w") do io
|
||||
write(io, read(archive.files[idx]))
|
||||
end
|
||||
close(archive)
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using CSV
|
||||
using DataFrames
|
||||
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
|
||||
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
|
||||
describe(primary)
|
||||
describe(secondary)
|
||||
```
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
|
||||
parse_nominal(::Missing) = missing
|
||||
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
|
||||
parse_numeric(::Missing) = missing
|
||||
idcols = ["family", "name", "class"]
|
||||
numericcols = ["cap-diameter", "stem-height", "stem-width"]
|
||||
parsed_primary = select(primary,
|
||||
idcols,
|
||||
numericcols .=> ByRow(parse_numeric),
|
||||
Not([idcols; numericcols]) .=> ByRow(parse_nominal);
|
||||
renamecols=false)
|
||||
```
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
function overlap_numeric(v1, v2)
|
||||
# there are no missing values in numeric columns
|
||||
if length(v1) == 1 || length(v2) == 1
|
||||
return true
|
||||
else
|
||||
return max(v1[1], v2[1]) <= min(v1[2], v2[2])
|
||||
end
|
||||
end
|
||||
|
||||
function overlap_nominal(v1, v2)
|
||||
if ismissing(v1) || ismissing(v2)
|
||||
return true
|
||||
else
|
||||
return !isempty(intersect(v1, v2))
|
||||
end
|
||||
end
|
||||
|
||||
function rows_overlap(row1, row2)
|
||||
# note that in parsed_primary numeric columns have indices 4 to 6
|
||||
# and nominal columns have indices 7 to 23
|
||||
return all(i -> overlap_numeric(row1[i], row2[i]), 4:6) &&
|
||||
all(i -> overlap_nominal(row1[i], row2[i]), 7:23)
|
||||
end
|
||||
|
||||
for i in 1:nrow(parsed_primary), j in i+1:nrow(parsed_primary)
|
||||
row1 = parsed_primary[i, :]
|
||||
row2 = parsed_primary[j, :]
|
||||
if rows_overlap(row1, row2) && row1.class != row2.class
|
||||
println((i, Tuple(row1[1:3]), j, Tuple(row2[1:3])))
|
||||
end
|
||||
end
|
||||
```
|
||||
|
||||
Note that in this exercise using `eachrow` is not a problem
|
||||
(although it is not type stable) because the data is small.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Statistics
|
||||
using DataFramesMeta
|
||||
agg_primary = @chain parsed_primary begin
|
||||
groupby(:family)
|
||||
@combine(:pr_p = mean(:class .== "p"), $nrow)
|
||||
sort(:pr_p)
|
||||
end
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families), truncate=140)
|
||||
```
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
|
||||
```
|
||||
|
||||
Note that we select based on actual contents of the columns and not by their
|
||||
element type (column could allow for missing values but not have them).
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using GLM
|
||||
using FreqTables
|
||||
secondary.class = secondary.class .== "p"
|
||||
model = glm(Term(:class)~sum(Term.(Symbol.(names(secondary, Not(:class))))),
|
||||
secondary, Binomial(), LogitLink())
|
||||
freqtable(secondary, "stem-color", "class")
|
||||
freqtable(secondary, "habitat", "class")
|
||||
```
|
||||
|
||||
We can see that for cetrain levels of `stem-color` and `habitat` variables
|
||||
there is a perfect separation of classes.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
insertcols!(secondary, 2, :class_p => predict(model))
|
||||
combine(groupby(secondary, :class)) do sdf
|
||||
return describe(sdf, :detailed, cols=:class_p)
|
||||
end
|
||||
```
|
||||
|
||||
We can see that the model has some discriminatory power, but there
|
||||
is still a significant overlap between classes.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using Plots
|
||||
using ROCAnalysis
|
||||
roc_data = roc(secondary; score=:class_p, target=:class)
|
||||
plot(roc_data.pfa, 1 .- roc_data.pmiss;
|
||||
title="AUC=$(round(100*(1 - auc(roc_data)), digits=2))%",
|
||||
xlabel="FPR", ylabel="TPR", legend=false)
|
||||
```
|
||||
|
||||
</details>
|
||||
384
exercises/exercises14.md
Normal file
384
exercises/exercises14.md
Normal file
@@ -0,0 +1,384 @@
|
||||
# Julia for Data Analysis
|
||||
|
||||
## Bogumił Kamiński, Daniel Kaszyński
|
||||
|
||||
# Chapter 14
|
||||
|
||||
# Problems
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Write a simulator that takes one parameter `n`. Next we assume we draw
|
||||
`n` times from this set with replacement. The simulator should return the
|
||||
average number of items from this set that were drawn at least once.
|
||||
|
||||
Call the function running this simulation `boot`.
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Now write a function `simboot` that takes parameters `n` and `k` and runs
|
||||
the simulation defined in the `boot` function `k` times. It should return
|
||||
a named tuple storing `k`, `n`, mean of produced values, and ends of
|
||||
approximated 95% confidence interval of the result.
|
||||
|
||||
Make this function single threaded. Check how long
|
||||
this function runs for `n=1000` and `k=1_000_000`.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Now rewrite this simulator to be multi threaded. Use 4 cores for benchmarking.
|
||||
Call the function `simbootT`. Check how long this function runs for `n=1000` and
|
||||
`k=1_000_000`.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Now rewrite `boot` and `simbootT` to perform less allocations. Achieve this by
|
||||
making sure that all allocated objects are passed to `boot` function (so that it
|
||||
does not do any allocations internally). Call these new functions `boot!` and
|
||||
`simbootT2`. You might need to use the `Threads.threadid` and `Threads.nthreads`
|
||||
functions.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Use either of the solutions we have developed in the previous exercises to
|
||||
create a web service taking `k` and `n` parameters and returning the values
|
||||
produced by `boot` functions and time to run the simulation. You might want to
|
||||
use the `@timed` macro in your solution.
|
||||
|
||||
Start the server.
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Query the server started in the exercise 5 with
|
||||
the following parameters:
|
||||
* `k=1000` and `n=1000`
|
||||
* `k=1.5` and `n=1000`
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Collect the data generated by a web service into the `df` data frame for
|
||||
`k = [10^i for i in 3:6]` and `n = [10^i for i in 1:3]`.
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Replace the `value` column in the `df` data frame by its contents in-place.
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Checks that execution time roughly scales proportionally to the product
|
||||
of `k` times `n`.
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Plot the expected fraction of seen elements in the set as a function of
|
||||
`n` by `k` along with 95% confidence interval around these values.
|
||||
|
||||
# Solutions
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Show!</summary>
|
||||
|
||||
### Exercise 1
|
||||
|
||||
Solution (there are many other approaches you could use):
|
||||
|
||||
```
|
||||
using Statistics
|
||||
|
||||
function boot(n::Integer)
|
||||
table = falses(n)
|
||||
for _ in 1:n
|
||||
table[rand(1:n)] = true
|
||||
end
|
||||
return mean(table)
|
||||
end
|
||||
```
|
||||
|
||||
### Exercise 2
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
function simboot(k::Integer, n::Integer)
|
||||
result = [boot(n) for _ in 1:k]
|
||||
mv = mean(result)
|
||||
sdv = std(result)
|
||||
lo95 = mv - 1.96 * sdv / sqrt(k)
|
||||
hi95 = mv + 1.96 * sdv / sqrt(k)
|
||||
return (; k, n, mv, lo95, hi95)
|
||||
end
|
||||
```
|
||||
|
||||
We run it twice to make sure everything is compiled:
|
||||
```
|
||||
julia> @time simboot(1000, 1_000_000)
|
||||
7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time)
|
||||
(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944)
|
||||
|
||||
julia> @time simboot(1000, 1_000_000)
|
||||
7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time)
|
||||
(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754)
|
||||
```
|
||||
|
||||
We see that on my computer the run time is around 7 seconds.
|
||||
|
||||
### Exercise 3
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using ThreadsX
|
||||
|
||||
function simbootT(k::Integer, n::Integer)
|
||||
result = ThreadsX.map(i -> boot(n), 1:k)
|
||||
mv = mean(result)
|
||||
sdv = std(result)
|
||||
lo95 = mv - 1.96 * sdv / sqrt(k)
|
||||
hi95 = mv + 1.96 * sdv / sqrt(k)
|
||||
return (; k, n, mv, lo95, hi95)
|
||||
end
|
||||
```
|
||||
|
||||
Here is the timing for four threads:
|
||||
```
|
||||
julia> @time simbootT(1000, 1_000_000)
|
||||
2.390795 seconds (3.37 k allocations: 119.434 MiB)
|
||||
(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484)
|
||||
|
||||
julia> @time simbootT(1000, 1_000_000)
|
||||
2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time)
|
||||
(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554)
|
||||
```
|
||||
|
||||
Indeed we see a significant performance improvement.
|
||||
|
||||
### Exercise 4
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
function boot!(n::Integer, pool)
|
||||
table = pool[Threads.threadid()]
|
||||
fill!(table, false)
|
||||
for _ in 1:n
|
||||
table[rand(1:n)] = true
|
||||
end
|
||||
return mean(table)
|
||||
end
|
||||
|
||||
function simbootT2(k::Integer, n::Integer)
|
||||
pool = [falses(n) for _ in 1:Threads.nthreads()]
|
||||
result = ThreadsX.map(i -> boot!(n, pool), 1:k)
|
||||
mv = mean(result)
|
||||
sdv = std(result)
|
||||
lo95 = mv - 1.96 * sdv / sqrt(k)
|
||||
hi95 = mv + 1.96 * sdv / sqrt(k)
|
||||
return (; k, n, mv, lo95, hi95)
|
||||
end
|
||||
```
|
||||
|
||||
In the solution the `pool` vector keeps `table` vector
|
||||
individually for each thread. Let us test the timing:
|
||||
|
||||
```
|
||||
julia> @time simbootT2(1000, 1_000_000)
|
||||
2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
|
||||
(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)
|
||||
|
||||
julia> @time simbootT2(1000, 1_000_000)
|
||||
2.340694 seconds (391 allocations: 586.453 KiB)
|
||||
(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)
|
||||
```
|
||||
|
||||
Indeed, we see that the number of allocations was decreased, which should lower
|
||||
GC usage. However, the runtime of the simulation is similar since in this task
|
||||
memory allocation does not account for a significant portion of the runtime.
|
||||
|
||||
### Exercise 5
|
||||
|
||||
Solution (I used the simplest single-threaded code here; this is a complete
|
||||
code of the web service):
|
||||
|
||||
```
|
||||
using Genie
|
||||
using Statistics
|
||||
|
||||
function boot(n::Integer)
|
||||
table = falses(n)
|
||||
for _ in 1:n
|
||||
table[rand(1:n)] = true
|
||||
end
|
||||
return mean(table)
|
||||
end
|
||||
|
||||
function simboot(k::Integer, n::Integer)
|
||||
result = [boot(n) for _ in 1:k]
|
||||
mv = mean(result)
|
||||
sdv = std(result)
|
||||
lo95 = mv - 1.96 * sdv / sqrt(k)
|
||||
hi95 = mv + 1.96 * sdv / sqrt(k)
|
||||
return (; k, n, mv, lo95, hi95)
|
||||
end
|
||||
|
||||
Genie.config.run_as_server = true
|
||||
|
||||
Genie.Router.route("/", method=POST) do
|
||||
message = Genie.Requests.jsonpayload()
|
||||
return try
|
||||
k = message["k"]
|
||||
n = message["n"]
|
||||
value, time = @timed simboot(k, n)
|
||||
Genie.Renderer.Json.json((status="OK", time=time, value=value))
|
||||
catch
|
||||
Genie.Renderer.Json.json((status="ERROR", time="", value=""))
|
||||
end
|
||||
end
|
||||
|
||||
Genie.Server.up()
|
||||
```
|
||||
|
||||
### Exercise 6
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using HTTP
|
||||
|
||||
julia> using JSON3
|
||||
|
||||
julia> HTTP.post("http://127.0.0.1:8000",
|
||||
["Content-Type" => "application/json"],
|
||||
JSON3.write((k=1000, n=1000)))
|
||||
HTTP.Messages.Response:
|
||||
"""
|
||||
HTTP/1.1 200 OK
|
||||
Content-Type: application/json; charset=utf-8
|
||||
Server: Genie/Julia/1.8.2
|
||||
Transfer-Encoding: chunked
|
||||
|
||||
{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}"""
|
||||
|
||||
julia> HTTP.post("http://127.0.0.1:8000",
|
||||
["Content-Type" => "application/json"],
|
||||
JSON3.write((k=1.5, n=1000)))
|
||||
HTTP.Messages.Response:
|
||||
"""
|
||||
HTTP/1.1 200 OK
|
||||
Content-Type: application/json; charset=utf-8
|
||||
Server: Genie/Julia/1.8.2
|
||||
Transfer-Encoding: chunked
|
||||
|
||||
{"status":"ERROR","time":"","value":""}"""
|
||||
```
|
||||
|
||||
As expected we got a positive answer the first time and an error on the second call.
|
||||
|
||||
### Exercise 7
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
using DataFrames
|
||||
|
||||
df = DataFrame()
|
||||
for k in [10^i for i in 3:6], n in [10^i for i in 1:3]
|
||||
@show k, n
|
||||
req = HTTP.post("http://127.0.0.1:8000",
|
||||
["Content-Type" => "application/json"],
|
||||
JSON3.write((; k, n)))
|
||||
push!(df, NamedTuple(JSON3.read(req.body)))
|
||||
end
|
||||
```
|
||||
|
||||
Note that I convert `JSON3.Object` into a `NamedTuple` to easily `push!`
|
||||
it into the `df` data frame.
|
||||
|
||||
Let us have a look at the produced data frame:
|
||||
|
||||
```
|
||||
julia> df
|
||||
12×3 DataFrame
|
||||
Row │ status time value
|
||||
│ String Float64 Object…
|
||||
─────┼──────────────────────────────────────────────────────
|
||||
1 │ OK 0.0006784 {\n "k": 1000,\n "n": …
|
||||
2 │ OK 0.0038374 {\n "k": 1000,\n "n": …
|
||||
3 │ OK 0.0150844 {\n "k": 1000,\n "n": …
|
||||
4 │ OK 0.0014071 {\n "k": 10000,\n "n":…
|
||||
5 │ OK 0.008443 {\n "k": 10000,\n "n":…
|
||||
6 │ OK 0.0700319 {\n "k": 10000,\n "n":…
|
||||
7 │ OK 0.0253826 {\n "k": 100000,\n "n"…
|
||||
8 │ OK 0.0795937 {\n "k": 100000,\n "n"…
|
||||
9 │ OK 0.708287 {\n "k": 100000,\n "n"…
|
||||
10 │ OK 0.160286 {\n "k": 1000000,\n "n…
|
||||
11 │ OK 0.803433 {\n "k": 1000000,\n "n…
|
||||
12 │ OK 7.23958 {\n "k": 1000000,\n "n…
|
||||
```
|
||||
|
||||
### Exercise 8
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> select!(df, :status, :time, :value => AsTable)
|
||||
12×7 DataFrame
|
||||
Row │ status time k n mv lo95 hi95
|
||||
│ String Float64 Int64 Int64 Float64 Float64 Float64
|
||||
─────┼─────────────────────────────────────────────────────────────────
|
||||
1 │ OK 0.0006784 1000 10 0.6469 0.640745 0.653055
|
||||
2 │ OK 0.0038374 1000 100 0.63508 0.633035 0.637125
|
||||
3 │ OK 0.0150844 1000 1000 0.632178 0.631581 0.632775
|
||||
4 │ OK 0.0014071 10000 10 0.65239 0.650425 0.654355
|
||||
5 │ OK 0.008443 10000 100 0.634456 0.633845 0.635067
|
||||
6 │ OK 0.0700319 10000 1000 0.63207 0.631878 0.632262
|
||||
7 │ OK 0.0253826 100000 10 0.651411 0.650793 0.652029
|
||||
8 │ OK 0.0795937 100000 100 0.634 0.633807 0.634193
|
||||
9 │ OK 0.708287 100000 1000 0.63224 0.632179 0.632302
|
||||
10 │ OK 0.160286 1000000 10 0.65129 0.651095 0.651486
|
||||
11 │ OK 0.803433 1000000 100 0.633995 0.633934 0.634056
|
||||
12 │ OK 7.23958 1000000 1000 0.63232 0.632301 0.63234
|
||||
```
|
||||
|
||||
### Exercise 9
|
||||
|
||||
Solution:
|
||||
|
||||
```
|
||||
julia> using DataFramesMeta
|
||||
|
||||
julia> @chain df begin
|
||||
@rselect(:k, :n, :avg_time = :time / (:k * :n))
|
||||
unstack(:k, :n, :avg_time)
|
||||
end
|
||||
4×4 DataFrame
|
||||
Row │ k 10 100 1000
|
||||
│ Int64 Float64? Float64? Float64?
|
||||
─────┼─────────────────────────────────────────────
|
||||
1 │ 1000 6.784e-8 3.8374e-8 1.50844e-8
|
||||
2 │ 10000 1.4071e-8 8.443e-9 7.00319e-9
|
||||
3 │ 100000 2.53826e-8 7.95937e-9 7.08287e-9
|
||||
4 │ 1000000 1.60286e-8 8.03433e-9 7.23958e-9
|
||||
```
|
||||
|
||||
We see that indeed this is the case. For large `k` and `n` the average time per
|
||||
single sample stabilizes (for small values the runtime is low so the timing is
|
||||
more affected by external noise and the other operations that the functions do
|
||||
affect the results more).
|
||||
|
||||
### Exercise 10
|
||||
|
||||
Solution:
|
||||
```
|
||||
using Plots
|
||||
gdf = groupby(df, :k, sort=true)
|
||||
plot([bar(string.(g.n), g.mv;
|
||||
ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
|
||||
legend=false, title=first(g.k),
|
||||
yerror=(g.mv - g.lo95, g.hi95-g.mv)) for g in gdf]...)
|
||||
```
|
||||
|
||||
As expected error bandwidth gets smaller as `k` encreases.
|
||||
Note that as `n` increases the estimated value tends to `1-exp(-1)`.
|
||||
|
||||
</details>
|
||||
BIN
exercises/logo.png
Normal file
BIN
exercises/logo.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 12 KiB |
Reference in New Issue
Block a user