JuliaForDataAnalysis/exercises/exercises06.md
Bogumił Kamiński 3b8ffa5d40 add exercises
2022-10-14 12:27:04 +02:00

272 lines
6.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 6
# Problems
### Exercise 1
Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"`
(replace `3` by a proper interpolation expression) and replace `USD` by `$`.
### Exercise 2
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
as `iris.csv` to your local folder.
### Exercise 3
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
in two lines so that it takes less horizontal space.
### Exercise 4
Load data stored in `iris.csv` file into a `data` vector where each element
should be a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
the source line had data `1.0,2.0,3.0,4.0,x` (note that first four elements are parsed
as floats).
### Exercise 5
The `data` structure is a vector of named tuples, change it to a named tuple
of vectors (with the same field names) and call it `data2`.
### Exercise 6
Calculate the frequency of each type of Iris type (`c` field in `data2`).
### Exercise 7
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
vector `c3` that is a `PooledVector`, and vector `c4` that holds `Symbol`s.
Compare sizes of the three objects.
### Exercise 8
You know that `refs` field of `PooledArray` stores an integer index of a given
value in it. Using this information make a scatter plot of `pl` vs `pw` vectors
in `data2`, but for each Iris type give a different point color (check the
`color` keyword argument meaning in the Plots.jl manual; you can use the
`plot_color` function).
### Exercise 9
Type the following string `"a²=b² ⟺ a=b a=-b"` in your terminal and bind it to
`str` variable (do not copy paste the string, but type it).
### Exercise 10
In the `str` string from exercise 9 find all matches of a pattern where `a`
is followed by `b` but there can be some characters between them.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
```
### Exercise 2
Solution:
```
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
"iris.csv")
```
### Exercise 3
Solution:
```
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
```
### Exercise 4
Solution:
```
julia> function line_parser(line)
elements = split(line, ",")
@assert length(elements) == 5
return (sl=parse(Float64, elements[1]),
sw=parse(Float64, elements[2]),
pl=parse(Float64, elements[3]),
pw=parse(Float64, elements[4]),
c=elements[5])
end
line_parser (generic function with 1 method)
julia> data = line_parser.(readlines("iris.csv")[1:end-1])
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
(sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
(sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
(sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
(sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
(sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
(sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")
```
Note that we used `1:end-1` selector to drop last element from the read lines
since it is empty. This is the reason why adding the
`@assert length(elements) == 5` check in the `line_parser` function is useful.
### Exercise 5
Later in the book you will learn more advanced ways to do it. Here let us
use a most basic approach:
```
data2 = (sl=[d.sl for d in data],
sw=[d.sw for d in data],
pl=[d.pl for d in data],
pw=[d.pw for d in data],
c=[d.c for d in data])
```
### Exercise 6
Solution:
```
julia> using FreqTables
julia> freqtable(data2.c)
3-element Named Vector{Int64}
Dim1 │
──────────────────┼───
"Iris-setosa" │ 50
"Iris-versicolor" │ 50
"Iris-virginica" │ 50
```
### Exercise 7
Solution:
```
julia> using InlineStrings
julia> c2 = inlinestrings(data2.c)
150-element Vector{String15}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> using PooledArrays
julia> c3 = PooledArray(data2.c)
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> c4 = Symbol.(data2.c)
150-element Vector{Symbol}:
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
julia> Base.summarysize(data2.c)
12840
julia> Base.summarysize(c2)
2440
julia> Base.summarysize(c3)
1696
julia> Base.summarysize(c4)
1240
```
### Exercise 8
Solution:
```
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
```
### Exercise 9
The hard part is typing `²`, `⟺` and ``. You can check how to do it using help:
```
help?> ²
"²" can be typed by \^2<tab>
help?> ⟺
"⟺" can be typed by \iff<tab>
help?>
"" can be typed by \vee<tab>
```
Save the string in the `str` variable as we will use it in the next exercise.
### Exercise 10
The exercise does not specify how the matching should be done. If we
want it to be eager (match as much as possible), we write:
```
julia> m = match(r"a.*b", str)
RegexMatch("a²=b² ⟺ a=b a=-b")
```
As you can see we have matched whole string.
If we want it to be lazy (match as little as possible) we write:
```
julia> m = match(r"a.*?b", str)
RegexMatch("a²=b")
```
This finds us the first such match.
If we want to find all lazy matches we can write (not covered in the book):
```
julia> collect(eachmatch(r"a.*?b", str))
3-element Vector{RegexMatch}:
RegexMatch("a²=b")
RegexMatch("a=b")
RegexMatch("a=-b")
```
</details>