Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 6
Problems
Exercise 1
Interpolate the expression 1 + 2 into a string "I have apples worth 3USD" (replace 3 by a proper interpolation expression) and replace USD by $.
Solution
julia> "I have apples worth $(1+2)\$"
"I have apples worth 3\$"
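The backslash in the REPL output above is only part of how Julia displays the string; the string itself holds a plain dollar sign. A minimal sketch that checks this (the variable name s is only illustrative):

```julia
# Interpolate 1 + 2 and escape the dollar sign so it is not read as interpolation.
s = "I have apples worth $(1+2)\$"

# print writes the content of the string, without the escaping used by the REPL display.
print(s)

# The last character of the string is a plain dollar sign.
last(s) == '$'
```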
Exercise 2
Download the file https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data as iris.csv to your local folder.
Solution
import Downloads
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
"iris.csv")
Exercise 3
Write the string
"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
in two lines so that it takes less horizontal space.
Solution
"https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"
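To confirm that the line break does not become part of the string, the two-line literal can be compared with the one-line version. This is a small sketch that assumes Julia 1.7 or later, where a backslash at the end of a line inside a string literal joins it with the next line; the variable names are illustrative.

```julia
# Two-line literal: the trailing backslash removes the newline that follows it.
url_two_lines = "https://archive.ics.uci.edu/ml/\
machine-learning-databases/iris/iris.data"

# The same URL written on one line.
url_one_line = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Both literals produce the same string, with no newline inside.
url_two_lines == url_one_line
```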
Exercise 4
Load the data stored in the iris.csv file into a vector named data, where each element is a named tuple of the form (sl=1.0, sw=2.0, pl=3.0, pw=4.0, c="x") if the source line contained 1.0,2.0,3.0,4.0,x (note that the first four elements are parsed as floats).
Solution
julia> function line_parser(line)
           elements = split(line, ",")
           @assert length(elements) == 5
           return (sl=parse(Float64, elements[1]),
                   sw=parse(Float64, elements[2]),
                   pl=parse(Float64, elements[3]),
                   pw=parse(Float64, elements[4]),
                   c=elements[5])
       end
line_parser (generic function with 1 method)
julia> data = line_parser.(readlines("iris.csv")[1:end-1])
150-element Vector{NamedTuple{(:sl, :sw, :pl, :pw, :c), Tuple{Float64, Float64, Float64, Float64, SubString{String}}}}:
(sl = 5.1, sw = 3.5, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.9, sw = 3.0, pl = 1.4, pw = 0.2, c = "Iris-setosa")
(sl = 4.7, sw = 3.2, pl = 1.3, pw = 0.2, c = "Iris-setosa")
(sl = 4.6, sw = 3.1, pl = 1.5, pw = 0.2, c = "Iris-setosa")
(sl = 5.0, sw = 3.6, pl = 1.4, pw = 0.2, c = "Iris-setosa")
⋮
(sl = 6.3, sw = 2.5, pl = 5.0, pw = 1.9, c = "Iris-virginica")
(sl = 6.5, sw = 3.0, pl = 5.2, pw = 2.0, c = "Iris-virginica")
(sl = 6.2, sw = 3.4, pl = 5.4, pw = 2.3, c = "Iris-virginica")
(sl = 5.9, sw = 3.0, pl = 5.1, pw = 1.8, c = "Iris-virginica")
Note that we used the 1:end-1 selector to drop the last element of the read lines, since it is empty. This is the reason why adding the @assert length(elements) == 5 check in the line_parser function is useful: without the selector, the empty line would fail the check instead of being parsed silently.
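An alternative to the positional selector is to drop empty lines by content, which also works if a file has no trailing empty line or has several of them. The sketch below is not the book's solution; it uses a hypothetical in-memory stand-in for iris.csv (two data lines followed by an empty line) so that it runs without downloading anything.

```julia
# Same parser as in the solution: split on commas and parse the first four fields.
function line_parser(line)
    elements = split(line, ",")
    @assert length(elements) == 5
    return (sl=parse(Float64, elements[1]),
            sw=parse(Float64, elements[2]),
            pl=parse(Float64, elements[3]),
            pw=parse(Float64, elements[4]),
            c=elements[5])
end

# Hypothetical stand-in for the file contents, ending with an empty line.
io = IOBuffer("5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n")
lines = readlines(io)

# Keep only non-empty lines, then parse each of them.
data = line_parser.(filter(!isempty, lines))
```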
Exercise 5
The data structure is a vector of named tuples; change it to a named tuple of vectors (with the same field names) and call it data2.
Solution
Later in the book you will learn more advanced ways to do it. Here let us use the most basic approach:
data2 = (sl=[d.sl for d in data],
         sw=[d.sw for d in data],
         pl=[d.pl for d in data],
         pw=[d.pw for d in data],
         c=[d.c for d in data])
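The same conversion can be written without repeating each field name by hand. The following is only a sketch using Base functions, with a hypothetical two-row data vector and an illustrative field_names variable; it is not necessarily the approach the book introduces later.

```julia
# Hypothetical small input: a vector of named tuples with identical fields.
data = [(sl=5.1, sw=3.5, c="Iris-setosa"),
        (sl=4.9, sw=3.0, c="Iris-setosa")]

# Field names taken from the first element, e.g. (:sl, :sw, :c).
field_names = keys(first(data))

# Build one vector per field and wrap them in a named tuple with the same names.
data2 = NamedTuple{field_names}(Tuple([d[n] for d in data] for n in field_names))
```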
Exercise 6
Calculate the frequency of each Iris type (the c field in data2).
Solution
julia> using FreqTables
julia> freqtable(data2.c)
3-element Named Vector{Int64}
Dim1 │
──────────────────┼───
"Iris-setosa" │ 50
"Iris-versicolor" │ 50
"Iris-virginica" │ 50
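If FreqTables.jl is not available, the same counts can be collected with a dictionary from Base. This is a minimal sketch on a hypothetical short vector c standing in for data2.c; the name counts is illustrative.

```julia
# Hypothetical stand-in for data2.c.
c = ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]

# Map each distinct value to the number of times it occurs.
counts = Dict{String,Int}()
for x in c
    counts[x] = get(counts, x, 0) + 1
end

counts
```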
Exercise 7
Create a vector c2 that is derived from c in data2 but holds inline strings, a vector c3 that is a PooledVector, and a vector c4 that holds Symbols. Compare the sizes of the three objects.
Solution
julia> using InlineStrings
julia> c2 = inlinestrings(data2.c)
150-element Vector{String15}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
⋮
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> using PooledArrays
julia> c3 = PooledArray(data2.c)
150-element PooledVector{SubString{String}, UInt32, Vector{UInt32}}:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
⋮
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
"Iris-virginica"
julia> c4 = Symbol.(data2.c)
150-element Vector{Symbol}:
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
Symbol("Iris-setosa")
⋮
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
Symbol("Iris-virginica")
julia> Base.summarysize(data2.c)
12840
julia> Base.summarysize(c2)
2440
julia> Base.summarysize(c3)
1696
julia> Base.summarysize(c4)
1240
Exercise 8
You know that the refs field of a PooledArray stores, for each element, an integer index of its value in the pool. Using this information, make a scatter plot of the pl vs pw vectors in data2, but give each Iris type a different point color (check the meaning of the color keyword argument in the Plots.jl manual; you can use the plot_color function).
Solution
using Plots
scatter(data2.pl, data2.pw, color=plot_color(c3.refs), legend=false)
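To see what kind of information the integer codes carry, they can be emulated with Base functions only. The sketch below does not use PooledArrays.jl and says nothing about its internals; the vector c and the names levels and codes are hypothetical.

```julia
# Hypothetical stand-in for data2.c.
c = ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Distinct values in order of first appearance.
levels = unique(c)

# For each element, the position of its value among the distinct values.
codes = [findfirst(==(x), levels) for x in c]

# Looking the codes up in levels recovers the original vector.
levels[codes] == c
```

Because elements with the same value share the same integer code, mapping codes to colors gives every point of one Iris type the same color.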
Exercise 9
Type the following string "a²=b² ⟺ a=b ∨ a=-b" in your terminal and bind it to the str variable (do not copy-paste the string, but type it).
Solution
The hard part is typing ², ⟺ and ∨. You can check how to do it using help:
help?> ²
"²" can be typed by \^2<tab>
help?> ⟺
"⟺" can be typed by \iff<tab>
help?> ∨
"∨" can be typed by \vee<tab>
Save the string in the str variable as we will use it in the next exercise.
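If tab completion is not available, for example in a script, the same characters can be written with Unicode escapes. This is a sketch rather than part of the exercise; it assumes the three symbols are U+00B2, U+27FA and U+2228, and the name str_escaped is illustrative.

```julia
# The string built from Unicode escapes instead of typed symbols.
str_escaped = "a\u00b2=b\u00b2 \u27fa a=b \u2228 a=-b"

# The string as typed in the terminal.
str = "a²=b² ⟺ a=b ∨ a=-b"

# Both literals describe the same string.
str_escaped == str

# Non-ASCII characters take more than one byte, so these two numbers differ.
length(str), ncodeunits(str)
```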
Exercise 10
In the str string from exercise 9 find all matches of a pattern where a is followed by b but there can be some characters between them.
Solution
The exercise does not specify how the matching should be done. If we want it to be greedy (match as much as possible), we write:
julia> m = match(r"a.*b", str)
RegexMatch("a²=b² ⟺ a=b ∨ a=-b")
As you can see, we have matched the whole string.
If we want it to be lazy (match as little as possible), we write:
julia> m = match(r"a.*?b", str)
RegexMatch("a²=b")
This finds the first such match.
If we want to find all lazy matches we can write (not covered in the book):
julia> collect(eachmatch(r"a.*?b", str))
3-element Vector{RegexMatch}:
RegexMatch("a²=b")
RegexMatch("a=b")
RegexMatch("a=-b")
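Each RegexMatch object exposes the matched text in its match field, so the matches can be reduced to plain strings. A short sketch along these lines (the variable name matches is illustrative):

```julia
str = "a²=b² ⟺ a=b ∨ a=-b"

# Collect only the matched substrings from all lazy matches.
matches = [m.match for m in eachmatch(r"a.*?b", str)]

# Compare with the greedy pattern, which yields a single match spanning the whole string.
greedy = match(r"a.*b", str).match
```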