Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 10
Problems
Exercise 1
Generate a random matrix mat having size 5x4 and all
elements drawn independently and uniformly from the [0,1[ interval.
Create a data frame using data from this matrix using auto-generated
column names.
Solution
julia> using DataFrames
julia> mat = rand(5, 4)
5×4 Matrix{Float64}:
0.8386 0.83612 0.0353994 0.15547
0.590172 0.611815 0.0691152 0.915788
0.879395 0.07271 0.980079 0.655158
0.340435 0.756196 0.0697535 0.388578
0.714515 0.861872 0.971521 0.176768
julia> DataFrame(mat, :auto)
5×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
Exercise 2
Now, using the matrix mat, create a data frame with randomly
generated column names. Use the randstring function from
the Random module to generate them. Store this data frame
in the df variable.
Solution
julia> using Random
julia> df = DataFrame(mat, [randstring() for _ in 1:4])
5×4 DataFrame
Row │ 6mTK5evn K8Inf7ER 5Caz55k0 SRiGemsa
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
Exercise 3
Create a new data frame from df that has the same columns
but with the column names y1, y2, y3, y4.
Solution
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
Row │ y1 y2 y3 y4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
You could also use the rename function:
julia> rename(df, string.("y", 1:4))
5×4 DataFrame
Row │ y1 y2 y3 y4
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.8386 0.83612 0.0353994 0.15547
2 │ 0.590172 0.611815 0.0691152 0.915788
3 │ 0.879395 0.07271 0.980079 0.655158
4 │ 0.340435 0.756196 0.0697535 0.388578
5 │ 0.714515 0.861872 0.971521 0.176768
Exercise 4
Create a dictionary holding
column_name => column_vector pairs using data stored in
data frame df. Save this dictionary in variable
d.
Solution
julia> d = Dict([n => df[:, n] for n in names(df)])
Dict{String, Vector{Float64}} with 4 entries:
"6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
"5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
"K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
"SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
Alternatively, use the pairs function; note that this time the
column names are Symbol values:
julia> Dict(pairs(eachcol(df)))
Dict{Symbol, AbstractVector} with 4 entries:
Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
:SRiGemsa => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
:K8Inf7ER => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
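The two solutions also differ in whether the column vectors are copied, which matters if d is mutated later. A minimal sketch, using a small hypothetical data frame tmp that is not part of the exercise:

```julia
using DataFrames

# Hypothetical two-column data frame used only for illustration.
tmp = DataFrame(a=[1, 2, 3], b=[4.0, 5.0, 6.0])

# tmp[:, n] copies each column, so the dictionary owns fresh vectors.
d_copy = Dict([n => tmp[:, n] for n in names(tmp)])

# eachcol iterates over the stored columns without copying them.
d_alias = Dict(pairs(eachcol(tmp)))

d_copy["a"] == tmp.a     # equal contents
d_copy["a"] === tmp.a    # but a different vector object
d_alias[:a] === tmp.a    # the very same vector that tmp stores
```

If the dictionary should be independent of the data frame, the copying variant is the safer choice.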
Exercise 5
Create a data frame back from dictionary d from exercise
4. Compare it with df.
Solution
julia> DataFrame(d)
5×4 DataFrame
Row │ 5Caz55k0 6mTK5evn K8Inf7ER SRiGemsa
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 0.0353994 0.8386 0.83612 0.15547
2 │ 0.0691152 0.590172 0.611815 0.915788
3 │ 0.980079 0.879395 0.07271 0.655158
4 │ 0.0697535 0.340435 0.756196 0.388578
5 │ 0.971521 0.714515 0.861872 0.176768
Note that the columns of the data frame are now sorted by name.
DataFrames.jl does this for Dict objects because such dictionaries
do not have a defined order of keys.
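The comparison with the source data frame can also be done programmatically. A small sketch, assuming a hypothetical data frame src whose columns are deliberately not in alphabetical order:

```julia
using DataFrames

# Hypothetical data frame; column "b" comes before column "a".
src = DataFrame("b" => [1, 2], "a" => [3, 4])
d_src = Dict([n => src[:, n] for n in names(src)])
back = DataFrame(d_src)

names(back)                              # column names come back sorted
back == src                              # false: the column order differs
back == select(src, sort(names(src)))    # true once src is reordered
```

Equality of data frames takes column order into account, so the source has to be reordered before the two compare as equal.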
Exercise 6
For data frame df compute the dot product between all
pairs of its columns. Use the dot function from the
LinearAlgebra module.
Solution
julia> using LinearAlgebra
julia> using StatsBase
julia> pairwise(dot, eachcol(df))
4×4 Matrix{Float64}:
2.45132 1.99944 1.65026 1.50558
1.99944 2.39336 1.03322 1.18411
1.65026 1.03322 1.9153 0.909744
1.50558 1.18411 0.909744 1.47431
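Because df was built from a matrix, the same table of dot products can also be obtained as a matrix product. A minimal sketch, assuming a fresh random matrix m in place of mat; for the data frame, Matrix(df) would play the role of m:

```julia
using LinearAlgebra

# Hypothetical 5×4 matrix standing in for mat from Exercise 1.
m = rand(5, 4)

# Entry (i, j) of m' * m is the dot product of columns i and j.
gram = m' * m

# The same values computed one pair at a time with dot.
by_pairs = [dot(m[:, i], m[:, j]) for i in 1:4, j in 1:4]

gram ≈ by_pairs    # compare up to floating-point rounding
```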
Exercise 7
Given two data frames:
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 101
2 │ 2 102
vertically concatenate them so that only columns that are present in
both data frames are kept. Check the documentation of vcat
to see how to do it.
Solution
julia> vcat(df1, df2, cols=:intersect)
4×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 1
4 │ 2
By default you will get an error:
julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
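For contrast, the cols keyword argument also accepts :union, which keeps every column instead of only the shared ones. A short sketch on the same two data frames:

```julia
using DataFrames

df1 = DataFrame(a=1:2, b=11:12)
df2 = DataFrame(a=1:2, c=101:102)

# Keep every column found in any argument; absent cells become missing.
all_cols = vcat(df1, df2, cols=:union)

size(all_cols)     # 4 rows and 3 columns
names(all_cols)    # a, b, c
```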
Exercise 8
Now append df2 to df1, but add only
the columns from df2 that are present in df1.
Check the documentation of append! to see how to do it.
Solution
julia> append!(df1, df2, cols=:subset)
4×2 DataFrame
Row │ a b
│ Int64 Int64?
─────┼────────────────
1 │ 1 11
2 │ 2 12
3 │ 1 missing
4 │ 2 missing
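Two details of this result are easy to miss: append! changes df1 in place, and column b is widened so that it can hold missing. A brief sketch that checks both, recreating the two data frames from Exercise 7:

```julia
using DataFrames

df1 = DataFrame(a=1:2, b=11:12)
df2 = DataFrame(a=1:2, c=101:102)

eltype(df1.b)                      # Int64 before appending

append!(df1, df2, cols=:subset)    # mutates df1 in place

nrow(df1)                          # df1 itself now has 4 rows
eltype(df1.b)                      # widened so that it can hold missing
names(df1)                         # column c from df2 was not added
```

This is the main practical difference from vcat, which returns a new data frame and leaves its arguments untouched.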
Exercise 9
Create a circle data frame, using the push! function, that will
store 1000 samples of the following process:
* draw x and y uniformly and independently from the [-1,1[ interval;
* compute a binary variable inside that is true if x^2+y^2 < 1 and
  false otherwise.
Compute summary statistics of this data frame.
Solution
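circle = DataFrame()
for _ in 1:1000
    # rand() is uniform on [0,1[, so 2rand()-1 is uniform on [-1,1[
    x, y = 2rand()-1, 2rand()-1
    inside = x^2 + y^2 < 1
    # pushing a NamedTuple adds one row; the first push! creates the columns
    push!(circle, (x=x, y=y, inside=inside))
end
describe(circle)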
Note that the mean of the variable inside is approximately
π/4 (about 0.785), so four times this mean approximates π.
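The fraction of points inside the unit circle should approach the ratio of the circle's area to the area of the enclosing square, π/4 ≈ 0.785. A minimal sketch that checks this without a data frame; the seed and the sample size n are arbitrary choices, not part of the exercise:

```julia
using Random
using Statistics

Random.seed!(1234)    # arbitrary seed so that repeated runs agree

n = 100_000           # more samples than in the exercise, to reduce noise
inside = [(2rand() - 1)^2 + (2rand() - 1)^2 < 1 for _ in 1:n]

share = mean(inside)  # fraction of points inside the unit circle
estimate = 4 * share  # should be close to π
```

With the circle data frame from the solution, the corresponding quantity is 4 * mean(circle.inside), although with only 1000 samples the estimate is noticeably noisier.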
Exercise 10
Create a scatterplot of the circle data frame in which the
x and y columns give the coordinates of the plotted points and
the inside variable determines the color of each plotted
point.
Solution
using Plots
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)
The solution shows two ways to make the ratio between the x and y axes equal to 1: setting a square figure size with size=(400, 400) and passing aspect_ratio=:equal. Note the differences in the output produced by the two methods.