JuliaForDataAnalysis/exercises/exercises10.md
Bogumił Kamiński 3b8ffa5d40 add exercises
2022-10-14 12:27:04 +02:00

# Julia for Data Analysis
## Bogumił Kamiński, Daniel Kaszyński
# Chapter 10
# Problems
### Exercise 1
Generate a random matrix `mat` of size 5x4 with all elements drawn
independently and uniformly from the [0,1[ interval.
Create a data frame from this matrix using auto-generated
column names.
### Exercise 2
Now, using the matrix `mat`, create a data frame with randomly generated
column names. Use the `randstring` function from the `Random` module
to generate them. Store this data frame in the `df` variable.
### Exercise 3
Create a new data frame from `df` that has the same columns,
but with the column names `y1`, `y2`, `y3`, `y4`.
### Exercise 4
Create a dictionary holding `column_name => column_vector` pairs
using data stored in data frame `df`. Save this dictionary in variable `d`.
### Exercise 5
Create a data frame back from dictionary `d` from exercise 4. Compare it
with `df`.
### Exercise 6
For data frame `df` compute the dot product between all pairs of its columns.
Use the `dot` function from the `LinearAlgebra` module.
### Exercise 7
Given two data frames:
```
julia> df1 = DataFrame(a=1:2, b=11:12)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12

julia> df2 = DataFrame(a=1:2, c=101:102)
2×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1    101
   2 │     2    102
```
vertically concatenate them so that only columns that are present in both
data frames are kept. Check the documentation of `vcat` to see how to
do it.
### Exercise 8
Now append `df2` to `df1`, adding only the columns from `df2` that
are present in `df1`. Check the documentation of `append!` to see how to
do it.
### Exercise 9
Using the `push!` function, create a `circle` data frame that stores
1000 samples of the following process:
* draw `x` and `y` uniformly and independently from the [-1,1[ interval;
* compute a binary variable `inside` that is `true` if `x^2+y^2 < 1`
and is `false` otherwise.
Compute summary statistics of this data frame.
### Exercise 10
Create a scatterplot of the `circle` data frame in which the `x` and `y`
columns give the coordinates of the plotted points and the `inside` variable
determines the color of each point.
# Solutions
<details>
<summary>Show!</summary>
### Exercise 1
Solution:
```
julia> using DataFrames

julia> mat = rand(5, 4)
5×4 Matrix{Float64}:
 0.8386    0.83612   0.0353994  0.15547
 0.590172  0.611815  0.0691152  0.915788
 0.879395  0.07271   0.980079   0.655158
 0.340435  0.756196  0.0697535  0.388578
 0.714515  0.861872  0.971521   0.176768

julia> DataFrame(mat, :auto)
5×4 DataFrame
 Row │ x1        x2        x3         x4
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768
```
### Exercise 2
Solution:
```
julia> using Random

julia> df = DataFrame(mat, [randstring() for _ in 1:4])
5×4 DataFrame
 Row │ 6mTK5evn  K8Inf7ER  5Caz55k0   SRiGemsa
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768
```
### Exercise 3
Solution:
```
julia> DataFrame(["y$i" => df[!, i] for i in 1:4])
5×4 DataFrame
 Row │ y1        y2        y3         y4
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768
```
You could also use the `rename` function:
```
julia> rename(df, string.("y", 1:4))
5×4 DataFrame
 Row │ y1        y2        y3         y4
     │ Float64   Float64   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ 0.8386    0.83612   0.0353994  0.15547
   2 │ 0.590172  0.611815  0.0691152  0.915788
   3 │ 0.879395  0.07271   0.980079   0.655158
   4 │ 0.340435  0.756196  0.0697535  0.388578
   5 │ 0.714515  0.861872  0.971521   0.176768
```
### Exercise 4
Solution:
```
julia> d = Dict([n => df[:, n] for n in names(df)])
Dict{String, Vector{Float64}} with 4 entries:
  "6mTK5evn" => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
  "5Caz55k0" => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
  "K8Inf7ER" => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
  "SRiGemsa" => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
```
or, using the `pairs` function (note that this time the column names are `Symbol`s):
```
julia> Dict(pairs(eachcol(df)))
Dict{Symbol, AbstractVector} with 4 entries:
  Symbol("6mTK5evn") => [0.8386, 0.590172, 0.879395, 0.340435, 0.714515]
  :SRiGemsa          => [0.15547, 0.915788, 0.655158, 0.388578, 0.176768]
  :K8Inf7ER          => [0.83612, 0.611815, 0.07271, 0.756196, 0.861872]
  Symbol("5Caz55k0") => [0.0353994, 0.0691152, 0.980079, 0.0697535, 0.971521]
```
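The two approaches also differ in whether the column data is copied. The following is a minimal sketch, assuming a hypothetical data frame `small` that is not part of the exercise: `df[:, n]` copies each column, while `eachcol` yields the vectors stored in the data frame itself.

```julia
using DataFrames

# hypothetical two-column data frame used only for illustration
small = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# small[:, n] makes a copy of each column
d_copy = Dict([n => small[:, n] for n in names(small)])

# eachcol iterates over the stored columns without copying them
d_alias = Dict(pairs(eachcol(small)))

d_copy["a"] === small.a   # false: an independent vector with equal contents
d_alias[:a] === small.a   # true: the same vector that is stored in `small`
```

If the dictionary should be independent of the data frame, the copying variant is the safer choice.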
### Exercise 5
Solution:
```
julia> DataFrame(d)
5×4 DataFrame
 Row │ 5Caz55k0   6mTK5evn  K8Inf7ER  SRiGemsa
     │ Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────
   1 │ 0.0353994  0.8386    0.83612   0.15547
   2 │ 0.0691152  0.590172  0.611815  0.915788
   3 │ 0.980079   0.879395  0.07271   0.655158
   4 │ 0.0697535  0.340435  0.756196  0.388578
   5 │ 0.971521   0.714515  0.861872  0.176768
```
Note that the columns of the data frame are now sorted by name.
This is done for `Dict` objects because such dictionaries do not have
a defined order of keys.
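The comparison itself can be sketched as follows. This is a small illustration on a hypothetical data frame `small` (not the `df` from the exercise), whose column names are deliberately not in sorted order.

```julia
using DataFrames

# hypothetical data frame whose column names are not in sorted order
small = DataFrame("b" => [1.0, 2.0], "a" => [3.0, 4.0])
d_small = Dict([n => small[:, n] for n in names(small)])
small_back = DataFrame(d_small)

names(small_back)                          # ["a", "b"]: sorted by name
small_back == small                        # false: the column order differs
select(small_back, names(small)) == small  # true after restoring the order
```

The data survives the round trip; only the column order is lost.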
### Exercise 6
Solution:
```
julia> using LinearAlgebra

julia> using StatsBase

julia> pairwise(dot, eachcol(df))
4×4 Matrix{Float64}:
 2.45132  1.99944  1.65026   1.50558
 1.99944  2.39336  1.03322   1.18411
 1.65026  1.03322  1.9153    0.909744
 1.50558  1.18411  0.909744  1.47431
```
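If you prefer not to load StatsBase, the same matrix of dot products can be obtained with `LinearAlgebra` alone. The following is a minimal sketch on a hypothetical data frame `small` (an assumption for illustration, not part of the exercise), showing a comprehension over all column pairs and the equivalent matrix product.

```julia
using DataFrames, LinearAlgebra

# hypothetical small data frame used only for illustration
small = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# dot product for every pair of columns, collected into a matrix
G1 = [dot(x, y) for x in eachcol(small), y in eachcol(small)]

# the same result as a matrix product of the data with its transpose
M = Matrix(small)
G2 = M' * M

G1 ≈ G2   # true
```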
### Exercise 7
Solution:
```
julia> vcat(df1, df2, cols=:intersect)
4×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     1
   4 │     2
```
By default you will get an error:
```
julia> vcat(df1, df2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) b are missing from argument(s) 2
```
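For contrast, here is a short sketch of the opposite choice, `cols=:union`, which keeps every column found in any of the data frames and fills the gaps with `missing`. It is shown only to illustrate the `cols` keyword and is not part of the exercise; the name `df_union` is introduced here for illustration.

```julia
using DataFrames

df1 = DataFrame(a=1:2, b=11:12)
df2 = DataFrame(a=1:2, c=101:102)

# keep every column that appears in any of the data frames
df_union = vcat(df1, df2, cols=:union)

names(df_union)            # ["a", "b", "c"]
ismissing(df_union.b[3])   # true: df2 has no column b
ismissing(df_union.c[1])   # true: df1 has no column c
```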
### Exercise 8
Solution:
```
julia> append!(df1, df2, cols=:subset)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1       11
   2 │     2       12
   3 │     1  missing
   4 │     2  missing
```
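Unlike `vcat`, `append!` modifies `df1` in place. The sketch below, which assumes a DataFrames.jl 1.x release where column promotion is enabled by default for `cols=:subset`, makes the two side effects visible: the row count of `df1` changes, and the element type of column `b` is widened so that it can hold `missing`.

```julia
using DataFrames

df1 = DataFrame(a=1:2, b=11:12)
df2 = DataFrame(a=1:2, c=101:102)

eltype(df1.b)    # Int64 before appending

append!(df1, df2, cols=:subset)

nrow(df1)        # 4: df1 itself was modified
names(df1)       # ["a", "b"]: column c from df2 was not added
eltype(df1.b)    # Union{Missing, Int64} after promotion
```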
### Exercise 9
Solution:
```
circle = DataFrame()
for _ in 1:1000
    # draw a point uniformly from the [-1,1[ × [-1,1[ square
    x, y = 2rand() - 1, 2rand() - 1
    inside = x^2 + y^2 < 1
    push!(circle, (x=x, y=y, inside=inside))
end
describe(circle)
```
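Note that the mean of the `inside` variable is approximately π/4 (about 0.785):
it estimates the ratio of the area of the unit circle, π, to the area of the
sampled square, 4.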
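Since the share of points that fall inside the unit circle estimates the ratio of the circle's area to the square's area, π/4, multiplying that share by 4 gives a Monte Carlo estimate of π. A minimal sketch follows; the seed value and the name `pi_estimate` are arbitrary choices made for illustration.

```julia
using DataFrames, Random, Statistics

Random.seed!(1234)  # arbitrary seed, only to make the run reproducible

circle = DataFrame()
for _ in 1:1000
    x, y = 2rand() - 1, 2rand() - 1
    push!(circle, (x=x, y=y, inside=x^2 + y^2 < 1))
end

# share of points inside the circle, scaled by the area of the square
pi_estimate = 4 * mean(circle.inside)
```

With 1000 samples the estimate is rough; its standard error is about 0.05.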
### Exercise 10
Solution:
```
using Plots
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, size=(400, 400))
scatter(circle.x, circle.y, color=[i ? "black" : "red" for i in circle.inside], xlabel="x", ylabel="y", legend=false, aspect_ratio=:equal)
```
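The solution shows two ways of getting a ratio of 1 between the x and y axes.
The first sets `size=(400, 400)`, which makes the whole figure square, so the
ratio is only approximate because labels and margins take up some of the space.
The second sets `aspect_ratio=:equal`, which forces one unit on the x axis to
have the same length as one unit on the y axis. Note the differences in the
output produced by the two methods.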
</details>