JuliaForDataAnalysis/exercises/exercises10.md
Bogumił Kamiński 3b8ffa5d40 add exercises
2022-10-14 12:27:04 +02:00


Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 10

Problems

Exercise 1

Generate a random matrix mat of size 5x4 with all elements drawn independently and uniformly from the [0,1[ interval. Create a data frame from this matrix with auto-generated column names.

Exercise 2

Now, using matrix mat, create a data frame with randomly generated column names. Use the randstring function from the Random module to generate them. Store this data frame in the df variable.

Exercise 3

Create a new data frame from df that has the same columns, but with the column names y1, y2, y3, y4.

Exercise 4

Create a dictionary holding column_name => column_vector pairs using data stored in data frame df. Save this dictionary in variable d.

Exercise 5

Create a data frame back from dictionary d from exercise 4. Compare it with df.

Exercise 6

For data frame df compute the dot product between all pairs of its columns. Use the dot function from the LinearAlgebra module.

Exercise 7

Given two data frames:

julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12

julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1    101
   2 │     2    102

vertically concatenate them so that only columns that are present in both data frames are kept. Check the documentation of vcat to see how to do it.

Exercise 8

Now append df2 to df1, but add only the columns from df2 that are present in df1. Check the documentation of append! to see how to do it.

Exercise 9

Create a circle data frame, using the push! function, that will store 1000 samples of the following process:

* draw x and y uniformly and independently from the [-1,1[ interval;
* compute a binary variable inside that is true if x^2+y^2 < 1 and is false otherwise.

Compute summary statistics of this data frame.

Exercise 10

Create a scatterplot of the circle data frame in which the x and y columns give the coordinates of the plotted points and the inside variable determines the color of each point.

Solutions


Exercise 1

Solution:

julia> using DataFrames

julia> mat = rand(5, 4)
5×4 Matrix{Float64}:
 0.8386    0.83612   0.0353994  0.15547
 0.590172  0.611815  0.0691152  0.915788
 0.879395  0.07271   0.980079   0.655158
 0.340435  0.756196  0.0697535  0.388578
 0.714515  0.861872  0.971521   0.176768

julia> DataFrame(mat, :auto)
5×4 DataFrame
 Row │ x1        x2        x3         x4
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768

Exercise 2

Solution:

julia> using Random

julia> df = DataFrame(mat, [randstring() for _ in 1:4])
5×4 DataFrame
 Row │ 6mTK5evn  K8Inf7ER  5Caz55k0   SRiGemsa
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768

Exercise 3

Solution:

julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
 Row │ y1        y2        y3         y4
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768

You could also use the rename function:

julia> rename(df, string.("y", 1:4))
5×4 DataFrame
 Row │ y1        y2        y3         y4
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768

Exercise 4

Solution:

julia> d = Dict([n => df[:, n] for n in names(df)])
Dict{String, Vector{Float64}} with 4 entries:
  "6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
  "5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
  "K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
  "SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]

or (using the pairs function; note that this time column names are Symbol):

julia> Dict(pairs(eachcol(df)))
Dict{Symbol, AbstractVector} with 4 entries:
  Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
  :SRiGemsa          => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
  :K8Inf7ER          => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
  Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
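The two approaches also differ in whether the dictionary shares memory with the data frame. The following minimal sketch illustrates this on a small stand-in data frame (the names a and b and the values are assumptions made for illustration, not the random data above): df[:, n] copies a column, while eachcol iterates over the stored columns without copying.

```julia
using DataFrames

# Hypothetical small data frame used only for this illustration
df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# df[:, n] copies each column, so the dictionary owns fresh vectors
d_copy = Dict([n => df[:, n] for n in names(df)])

# eachcol yields the stored columns, so the dictionary shares them with df
d_shared = Dict(pairs(eachcol(df)))

copied = d_copy["a"] !== df.a   # true: a separate vector with equal contents
shared = d_shared[:a] === df.a  # true: the very same vector object
```

If the dictionary should be safe to mutate independently of df, the copying variant is the more cautious choice.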

Exercise 5

Solution:

julia> DataFrame(d)
5×4 DataFrame
 Row │ 5Caz55k0   6mTK5evn  K8Inf7ER  SRiGemsa
     │ Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────
   1 │ 0.0353994  0.8386    0.83612   0.15547
   2 │ 0.0691152  0.590172  0.611815  0.915788
   3 │ 0.980079   0.879395  0.07271   0.655158
   4 │ 0.0697535  0.340435  0.756196  0.388578
   5 │ 0.971521   0.714515  0.861872  0.176768

Note that the columns of the data frame are now sorted by name. This is done for Dict objects because such dictionaries do not have a defined order of keys.
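The comparison requested in the exercise can be sketched as follows. To keep the result reproducible, this example assumes a small stand-in data frame with the fixed column names b and a (chosen so that the sorted order differs from the original one) instead of the random names used above.

```julia
using DataFrames

# Stand-in for df with fixed, deliberately unsorted column names (hypothetical data)
df = DataFrame("b" => [1.0, 2.0], "a" => [3.0, 4.0])
d = Dict([n => df[:, n] for n in names(df)])
df_back = DataFrame(d)

# == on data frames also checks column order, so this comparison fails
same_as_is = df_back == df

# After restoring the original column order the data frames are equal
same_reordered = df_back[:, names(df)] == df
```

With random column names the first comparison can occasionally succeed, namely when the generated names happen to be in sorted order already.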

Exercise 6

Solution:

julia> using LinearAlgebra

julia> using StatsBase

julia> pairwise(dot, eachcol(df))
4×4 Matrix{Float64}:
 2.45132  1.99944  1.65026   1.50558
 1.99944  2.39336  1.03322   1.18411
 1.65026  1.03322  1.9153    0.909744
 1.50558  1.18411  0.909744  1.47431
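If you prefer not to load StatsBase, the same matrix of dot products can be built with a two-dimensional comprehension. The sketch below assumes a small stand-in data frame (columns a and b are made up for illustration) and cross-checks the result against the matrix product of the transposed data matrix with itself.

```julia
using DataFrames
using LinearAlgebra

# Hypothetical small data frame used only for this illustration
df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# Entry (i, j) is the dot product of column i and column j
gram = [dot(x, y) for x in eachcol(df), y in eachcol(df)]

# The same values written as a matrix product
m = Matrix(df)
matches = gram ≈ m' * m
```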

Exercise 7

Solution:

julia> vcat(df1, df2, cols=:intersect)
4×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     1
   4 │     2

By default you will get an error:

julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
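For contrast, the cols keyword argument of vcat also accepts :union, which keeps every column found in any of the data frames and fills the gaps with missing. A minimal sketch, using the df1 and df2 from the exercise:

```julia
using DataFrames

df1 = DataFrame(a=1:2, b=11:12)
df2 = DataFrame(a=1:2, c=101:102)

# Keep only the shared column a
common = vcat(df1, df2, cols=:intersect)

# Keep all columns; values absent from one source become missing
everything = vcat(df1, df2, cols=:union)
```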

Exercise 8

Solution:

julia> append!(df1, df2, cols=:subset)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1       11
   2 │     2       12
   3 │     1  missing
   4 │     2  missing

Exercise 9

Solution:

circle=DataFrame()
for _ in 1:1000
    x, y = 2rand()-1, 2rand()-1
    inside = x^2 + y^2 < 1
    push!(circle, (x=x, y=y, inside=inside))
end
describe(circle)

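We note that the mean of variable inside is approximately π/4 (about 0.785). This is the ratio of the area of the unit disk, π, to the area of the square [-1,1[ x [-1,1[, which is 4.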
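Since the unit disk has area π and the sampled square has area 4, the fraction of points falling inside the circle estimates π/4. The sketch below turns this into a rough Monte Carlo estimate of π. The seed value 1234 and the variable name pi_estimate are arbitrary choices made here for reproducibility and are not part of the original solution.

```julia
using DataFrames
using Random
using Statistics

Random.seed!(1234)  # arbitrary seed, only to make the run repeatable

circle = DataFrame()
for _ in 1:1000
    x, y = 2rand() - 1, 2rand() - 1
    push!(circle, (x=x, y=y, inside=x^2 + y^2 < 1))
end

# The mean of a Bool column is the fraction of true values
pi_estimate = 4 * mean(circle.inside)
```

With only 1000 samples the estimate is coarse; its standard error is roughly 0.05.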

Exercise 10

Solution:

using Plots
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)

The solution shows two ways of plotting so that the ratio between the x and y axes is 1: setting a square plot size with size=(400, 400) and passing aspect_ratio=:equal. Note the differences in the output produced by the two methods.