JuliaForDataAnalysis/exercises/exercises06.md
2022-10-14 13:43:12 +02:00

Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 6

Problems

Exercise 1

Interpolate the expression 1 + 2 into a string "I have apples worth 3USD" (replace 3 by a proper interpolation expression) and replace USD by $.

Solution
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"

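As a small sketch of the same idea, the interpolated value can also come from a variable, and the literal dollar sign can be written either with a backslash escape or inside a raw string. The variable names below are illustrative assumptions, not part of the exercise.

```julia
# Illustrative names; only the final string comes from the exercise.
price = 1 + 2

# Interpolate with $(...) and escape the literal dollar sign with a backslash.
escaped = "I have apples worth $(price)\$"

# Alternative: build the string with `string` and a raw string literal,
# in which `$` has no special meaning.
joined = string("I have apples worth ", price, raw"$")

@assert escaped == joined
println(escaped)
```
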
Exercise 2

Download the file https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data as iris.csv to your local folder.

Solution
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   "iris.csv")

Exercise 3

Write the string "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" in two lines so that it takes less horizontal space.

Solution
"https://archive.ics.uci.edu/ml/\
 machine-learning-databases/iris/iris.data"
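
To check that the line continuation does not change the string, it can be compared with the one-line version. This is a minimal sketch that assumes Julia 1.7 or later, where a backslash before a newline in a string literal drops the newline and the leading whitespace of the next line. The concatenation with `*` is shown only as an alternative and is not the approach used in the solution.

```julia
url_single = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Backslash-newline continuation: the newline and the indentation are dropped.
url_split = "https://archive.ics.uci.edu/ml/\
             machine-learning-databases/iris/iris.data"

# Alternative: concatenate two shorter literals with `*`.
url_concat = "https://archive.ics.uci.edu/ml/" *
             "machine-learning-databases/iris/iris.data"

@assert url_split == url_single
@assert url_concat == url_single
```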

Exercise 4

Load the data stored in the iris.csv file into a vector named data, where each element is a named tuple of the form (sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x") if the source line contained 1.0,2.0,3.0,4.0,x (note that the first four elements are parsed as floats).

Solution
julia> function line_parser(line)
           elements = split(line, ",")
           @assert length(elements) == 5
           return (sl=parse(Float64, elements[1]),
                   sw=parse(Float64, elements[2]),
                   pl=parse(Float64, elements[3]),
                   pw=parse(Float64, elements[4]),
                   c=elements[5])
       end
line_parser (generic function with 1 method)

julia> data = line_parser.(readlines("iris.csv")[1:end-1])
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
 (sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
 (sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
 (sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
 (sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
 (sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
 ⋮
 (sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
 (sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
 (sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
 (sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")

Note that we used the 1:end-1 selector to drop the last element of the lines read, since it is empty. This is why the @assert length(elements) == 5 check in the line_parser function is useful: it flags any line that does not have exactly five fields, such as the empty one.
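
An alternative to the 1:end-1 selector is to skip empty lines explicitly, which does not depend on where they appear. The sketch below repeats line_parser so that it is self-contained and, as an assumption made only for illustration, reads a hypothetical two-row sample from an in-memory buffer instead of iris.csv.

```julia
function line_parser(line)
    elements = split(line, ",")
    @assert length(elements) == 5
    return (sl=parse(Float64, elements[1]),
            sw=parse(Float64, elements[2]),
            pl=parse(Float64, elements[3]),
            pw=parse(Float64, elements[4]),
            c=elements[5])
end

# Hypothetical sample that mimics the file, including the trailing empty line.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n"
lines = readlines(IOBuffer(sample))

# Parse only the non-empty lines.
rows = [line_parser(line) for line in lines if !isempty(line)]
println(length(lines), " lines read, ", length(rows), " rows parsed")
```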

Exercise 5

The data structure is a vector of named tuples. Change it to a named tuple of vectors (with the same field names) and call it data2.

Solution

Later in the book you will learn more advanced ways to do it. Here let us use the most basic approach:

data2 = (sl=[d.sl for d in data],
         sw=[d.sw for d in data],
         pl=[d.pl for d in data],
         pw=[d.pw for d in data],
         c=[d.c for d in data])
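
The repeated comprehensions can also be generated from the field names. The following is only a sketch of one possible approach, not the method used later in the book; it assumes that every row has the same fields, and it uses a hypothetical two-row sample in place of the full data vector.

```julia
# Hypothetical two-row sample standing in for the `data` vector.
data = [(sl=5.1, sw=3.5, pl=1.4, pw=0.2, c="Iris-setosa"),
        (sl=6.3, sw=2.5, pl=5.0, pw=1.9, c="Iris-virginica")]

# Field names taken from the first row; all rows are assumed to share them.
fields = keys(first(data))

# Build one vector per field, then wrap them in a named tuple with the same names.
columns = Tuple([getproperty(d, f) for d in data] for f in fields)
data2 = NamedTuple{fields}(columns)

println(data2.pl)
```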

Exercise 6

Calculate the frequency of each Iris type (the c field in data2).

Solution
julia> using FreqTables

julia> freqtable(data2.c)
3-element Named Vector{Int64}
Dim1              │
──────────────────┼───
"Iris-setosa"     │ 50
"Iris-versicolor" │ 50
"Iris-virginica"  │ 50

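If FreqTables.jl is not available, the same counts can be obtained with a dictionary. This is a minimal sketch using only Base; the function name count_values and the three-element sample are assumptions made for illustration.

```julia
# Count how many times each distinct value occurs in `items`.
function count_values(items)
    counts = Dict{eltype(items), Int}()
    for v in items
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

# Hypothetical sample standing in for data2.c.
species = ["Iris-setosa", "Iris-virginica", "Iris-setosa"]
counts = count_values(species)
println(counts)
```
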
Exercise 7

Create a vector c2 that is derived from c in data2 but holds inline strings, a vector c3 that is a PooledVector, and a vector c4 that holds Symbols. Compare the sizes of the three objects.

Solution
julia> using InlineStrings

julia> c2 = inlinestrings(data2.c)
150-element Vector{String15}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"

julia> using PooledArrays

julia> c3 = PooledArray(data2.c)
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"

julia> c4 = Symbol.(data2.c)
150-element Vector{Symbol}:
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 Symbol("Iris-setosa")
 ⋮
 Symbol("Iris-virginica")
 Symbol("Iris-virginica")
 Symbol("Iris-virginica")
 Symbol("Iris-virginica")

julia> Base.summarysize(data2.c)
12840

julia> Base.summarysize(c2)
2440

julia> Base.summarysize(c3)
1696

julia> Base.summarysize(c4)
1240

Exercise 8

You know that the refs field of a PooledArray stores, for each element, the integer index of its value in the pool. Using this information, make a scatter plot of the pl vs pw vectors in data2, giving each Iris type a different point color (check the meaning of the color keyword argument in the Plots.jl manual; you can use the plot_color function).

Solution
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
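
To make the role of refs concrete, the idea of pooling can be imitated by hand: keep each distinct value once and store an integer index per element. The function pool_encode below is a hypothetical illustration written with Base only; it is not how PooledArrays.jl is implemented.

```julia
# Hand-rolled illustration of pooling (not the PooledArrays.jl implementation).
function pool_encode(items)
    pool = unique(items)                                # each distinct value once
    index = Dict(v => i for (i, v) in enumerate(pool))  # value => position in pool
    refs = [UInt32(index[v]) for v in items]            # one integer per element
    return pool, refs
end

# Hypothetical sample standing in for data2.c.
species = ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"]
pool, refs = pool_encode(species)

# Indexing the pool with refs recovers the original vector.
@assert pool[refs] == species
println(refs)
```

Because refs contains one small integer per observation, it can serve as a per-point group code, which is how the solution above uses c3.refs.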

Exercise 9

Type the following string "a²=b² ⟺ a=b ∨ a=-b" in your terminal and bind it to the str variable (do not copy and paste the string; type it).

Solution

The hard part is typing ², ⟺ and ∨. You can check how to do it using help:

help?> ²
"²" can be typed by \^2<tab>

help?> ⟺
"⟺" can be typed by \iff<tab>

help?> ∨
"∨" can be typed by \vee<tab>

Save the string in the str variable as we will use it in the next exercise.
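
One way to check that the string was typed correctly is to compare it with a version written using Unicode escapes. This sketch assumes that the three special characters are ² (U+00B2), ⟺ (U+27FA) and ∨ (U+2228), which are the characters produced by the tab completions shown above; the name expected is illustrative.

```julia
# The string as it should look once typed.
str = "a²=b² ⟺ a=b ∨ a=-b"

# The same string written with Unicode escapes: ² is U+00B2, ⟺ is U+27FA, ∨ is U+2228.
expected = "a\u00b2=b\u00b2 \u27fa a=b \u2228 a=-b"

@assert str == expected
println(length(str))      # number of characters
println(ncodeunits(str))  # number of bytes in the UTF-8 encoding
```

The two printed numbers differ because the non-ASCII characters take more than one byte each.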

Exercise 10

In the str string from exercise 9, find all matches of a pattern where a is followed by b, possibly with some characters between them.

Solution

The exercise does not specify how the matching should be done. If we want it to be greedy (match as much as possible), we write:

julia> m = match(r"a.*b", str)
RegexMatch("a²=b² ⟺ a=b ∨ a=-b")

As you can see, we have matched the whole string.

If we want it to be lazy (match as little as possible) we write:

julia> m = match(r"a.*?b", str)
RegexMatch("a²=b")

This finds the first such match.

If we want to find all lazy matches we can write (not covered in the book):

julia> collect(eachmatch(r"a.*?b", str))
3-element Vector{RegexMatch}:
 RegexMatch("a²=b")
 RegexMatch("a=b")
 RegexMatch("a=-b")
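
The greedy and lazy variants can be compared side by side as plain strings by reading the match field of each RegexMatch. The sketch below redefines str so that it runs on its own; the names greedy and lazy_matches are illustrative.

```julia
str = "a²=b² ⟺ a=b ∨ a=-b"

# Greedy: `.*` takes as many characters as possible before the final `b`.
greedy = match(r"a.*b", str).match

# Lazy: `.*?` takes as few characters as possible, so each match stops at the nearest `b`.
lazy_matches = [m.match for m in eachmatch(r"a.*?b", str)]

@assert greedy == str
println(lazy_matches)
```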