2022-10-14 12:27:04 +02:00
# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 6
# Problems
### Exercise 1
Interpolate the expression `1 + 2` into a string `"I have apples worth 3USD"`
(replace `3` by a proper interpolation expression) and replace `USD` by `$`.
<details>
<summary>Solution</summary>
```
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
```
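
As a quick cross-check of the escaping rule, the sketch below builds the same string with the `string` function instead of interpolation. The variable names `s1` and `s2` are only illustrative.

```julia
# `\$` inside a string literal is an escaped, literal dollar sign,
# while `$(...)` interpolates the value of the expression.
s1 = "I have apples worth $(1 + 2)\$"

# The same result without interpolation: concatenate the pieces explicitly.
s2 = string("I have apples worth ", 1 + 2, '$')

@assert s1 == s2
println(s1)
```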
</details>
### Exercise 2
Download the file `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
as `iris.csv` to your local folder.
<details>
<summary>Solution</summary>
```
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   "iris.csv")
```
</details>
### Exercise 3
Write the string `"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"`
in two lines so that it takes less horizontal space.
<details>
<summary>Solution</summary>
```
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
```
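
To confirm that the line break does not end up in the string, the following sketch compares the two-line literal with the one-line version. It assumes Julia 1.7 or later, where a backslash at the end of a line inside a string literal joins the lines and drops the leading whitespace of the next one. The names `url` and `expected` are illustrative.

```julia
# The trailing backslash continues the literal on the next line;
# the newline and the indentation that follows it are not part of the string.
url = "https://archive.ics.uci.edu/ml/\
       machine-learning-databases/iris/iris.data"

expected = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

@assert url == expected
@assert !occursin('\n', url)
```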
</details>
### Exercise 4
Load the data stored in the `iris.csv` file into a `data` vector where each element
is a named tuple of the form `(sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x")` if
the source line had data `1.0,2.0,3.0,4.0,x` (note that the first four elements are parsed
as floats).
<details>
<summary>Solution</summary>
```
julia> function line_parser(line)
           elements = split(line, ",")
           @assert length(elements) == 5
           return (sl=parse(Float64, elements[1]),
                   sw=parse(Float64, elements[2]),
                   pl=parse(Float64, elements[3]),
                   pw=parse(Float64, elements[4]),
                   c=elements[5])
       end
line_parser (generic function with 1 method)
julia> data = line_parser.(readlines("iris.csv")[1:end-1])
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
 (sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
 (sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
 (sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
 (sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
 (sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
 ⋮
 (sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
 (sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
 (sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
 (sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")
```
Note that we used the `1:end-1` selector to drop the last element of the read lines,
since it is empty. This is why adding the
`@assert length(elements) == 5` check to the `line_parser` function is useful:
it flags a malformed line, such as this empty one, right away.
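
If you would rather not rely on the empty line being the last one, a possible alternative is to filter out empty lines before parsing. The sketch below uses an in-memory `IOBuffer` with two made-up rows and a trailing empty line as a stand-in for `iris.csv`, so it runs without the file; the `io` and `lines` names are illustrative.

```julia
# Same parser as in the solution above.
function line_parser(line)
    elements = split(line, ",")
    @assert length(elements) == 5
    return (sl=parse(Float64, elements[1]),
            sw=parse(Float64, elements[2]),
            pl=parse(Float64, elements[3]),
            pw=parse(Float64, elements[4]),
            c=elements[5])
end

# Stand-in for the file: two data rows followed by an empty line.
io = IOBuffer("5.1,3.5,1.4,0.2,Iris-setosa\n6.3,2.5,5.0,1.9,Iris-virginica\n\n")
lines = readlines(io)

# Drop every empty line, wherever it appears, and parse the rest.
data = line_parser.(filter(!isempty, lines))
```

With the real file you would replace `readlines(io)` with `readlines("iris.csv")`.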
</details>
### Exercise 5
The `data` structure is a vector of named tuples. Change it to a named tuple
of vectors (with the same field names) and call it `data2`.
<details>
<summary>Solution</summary>
Later in the book you will learn more advanced ways to do it. Here, let us
use the most basic approach:
```
data2 = (sl=[d.sl for d in data],
         sw=[d.sw for d in data],
         pl=[d.pl for d in data],
         pw=[d.pw for d in data],
         c=[d.c for d in data])
```
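
The same transformation can also be written without repeating every field name. The sketch below is one possible generalization using only Base Julia, not necessarily the approach the book introduces later; it uses a two-row stand-in for `data`, and the `fields` name is illustrative.

```julia
# Two-row stand-in for the `data` vector from Exercise 4.
data = [(sl=5.1, sw=3.5, pl=1.4, pw=0.2, c="Iris-setosa"),
        (sl=6.3, sw=2.5, pl=5.0, pw=1.9, c="Iris-virginica")]

# Field names are taken from the first element.
fields = keys(first(data))

# For each field name build a vector of its values, then splat the
# name => vector pairs into a named tuple.
data2 = (; (k => getproperty.(data, k) for k in fields)...)
```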
</details>
### Exercise 6
Calculate the frequency of each Iris type (`c` field in `data2`).
<details>
<summary>Solution</summary>
```
julia> using FreqTables
julia> freqtable(data2.c)
3-element Named Vector{Int64}
Dim1              │
──────────────────┼───
"Iris-setosa"     │ 50
"Iris-versicolor" │ 50
"Iris-virginica"  │ 50
```
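
If FreqTables.jl is not installed, the same counts can be obtained with a plain dictionary. This is a minimal sketch using only Base Julia; `count_values` is a hypothetical helper name, and the short `labels` vector stands in for `data2.c`.

```julia
# Count how many times each distinct value occurs in `v`.
function count_values(v)
    counts = Dict{eltype(v), Int}()
    for x in v
        # Start from 0 for a value seen for the first time.
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

# Stand-in for `data2.c`.
labels = ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]
counts = count_values(labels)
```

Unlike `freqtable`, a `Dict` does not keep its keys in sorted order.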
</details>
### Exercise 7
Create a vector `c2` that is derived from `c` in `data2` but holds inline strings,
a vector `c3` that is a `PooledVector`, and a vector `c4` that holds `Symbol`s.
Compare the sizes of the three objects.
<details>
<summary>Solution</summary>
```
julia> using InlineStrings
julia> c2 = inlinestrings(data2.c)
150-element Vector{String15}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
julia> using PooledArrays
julia> c3 = PooledArray(data2.c)
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
julia> c4 = Symbol.(data2.c)
150-element Vector{Symbol}:
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 ⋮
 Symbol("Iris-virginica")
 Symbol("Iris-virginica")
 Symbol("Iris-virginica")
 Symbol("Iris-virginica")
julia> Base.summarysize(data2.c)
12840
julia> Base.summarysize(c2)
2440
julia> Base.summarysize(c3)
1696
julia> Base.summarysize(c4)
1240
```
</details>
### Exercise 8
You know that the `refs` field of a `PooledArray` stores, for each element, the
integer index of its value in the pool. Using this information, make a scatter
plot of the `pl` vs `pw` vectors in `data2`, giving each Iris type a different
point color (check the meaning of the `color` keyword argument in the Plots.jl
manual; you can use the `plot_color` function).
<details>
<summary>Solution</summary>
```
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
```
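
To see why integer references are enough to pick a color per point, the sketch below imitates the pool and references idea with Base Julia only. It is an illustration of the concept, not the actual PooledArrays.jl implementation; `labels`, `pool`, `refs`, and `palette` are hypothetical names, and the palette colors are arbitrary.

```julia
# Stand-in for the `c` field: one label per observation.
labels = ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]

# The pool holds each distinct value once, in order of first appearance.
pool = unique(labels)

# Each reference is the position of the element's value in the pool.
refs = [findfirst(==(x), pool) for x in labels]

# The original vector can be reconstructed from the pool and the references.
@assert pool[refs] == labels

# Indexing a palette with the references yields one color per observation.
palette = [:red, :green, :blue]
colors = palette[refs]
```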
</details>
### Exercise 9
Type the following string `"a²=b² ⟺ a=b ∨ a=-b"` in your terminal and bind it to
the `str` variable (do not copy-paste the string; type it).
<details>
<summary>Solution</summary>
The hard part is typing `²`, `⟺` and `∨`. You can check how to do it using help:
```
help?> ²
"²" can be typed by \^2<tab>
help?> ⟺
"⟺" can be typed by \iff<tab>
help?> ∨
"∨" can be typed by \vee<tab>
```
Save the string in the `str` variable as we will use it in the next exercise.
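
To check that what you typed is the intended string, one option is to compare it against a version written with Unicode escape sequences. The sketch below assumes the code points U+00B2 for `²`, U+27FA for `⟺`, and U+2228 for `∨`; the `expected` name is illustrative, and `str` is defined here only so that the snippet runs on its own.

```julia
# The string as typed in the terminal.
str = "a²=b² ⟺ a=b ∨ a=-b"

# The same string spelled with escape sequences for the non-ASCII characters.
expected = "a\u00b2=b\u00b2 \u27fa a=b \u2228 a=-b"

@assert str == expected

# The number of characters differs from the number of bytes (code units),
# because the non-ASCII characters take more than one byte in UTF-8.
println(length(str))
println(ncodeunits(str))
```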
</details>
### Exercise 10
In the `str` string from exercise 9 find all matches of a pattern where `a`
is followed by `b` but there can be some characters between them.
<details>
<summary>Solution</summary>
The exercise does not specify how the matching should be done. If we
want it to be greedy (match as much as possible), we write:
```
julia> m = match(r"a.*b", str)
RegexMatch("a²=b² ⟺ a=b ∨ a=-b")
```
As you can see, we have matched the whole string.
If we want it to be lazy (match as little as possible), we write:
```
julia> m = match(r"a.*?b", str)
RegexMatch("a²=b")
```
This finds the first such match.
If we want to find all lazy matches, we can write (not covered in the book):
```
julia> collect(eachmatch(r"a.*?b", str))
3-element Vector{RegexMatch}:
RegexMatch("a²=b")
RegexMatch("a=b")
RegexMatch("a=-b")
```
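
If you need the matched text itself rather than `RegexMatch` objects, you can read the `match` field of each result. The sketch below redefines `str` so that it runs on its own; the `greedy` and `lazy_matches` names are illustrative.

```julia
str = "a²=b² ⟺ a=b ∨ a=-b"

# Greedy: `.*` consumes as much as possible, so the match spans the whole string.
greedy = match(r"a.*b", str).match

# Lazy: `.*?` consumes as little as possible; collect the text of every match.
lazy_matches = [m.match for m in eachmatch(r"a.*?b", str)]

@assert greedy == str
@assert lazy_matches == ["a²=b", "a=b", "a=-b"]
```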
</details>