# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 8

# Problems

### Exercise 1

Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame`
called `df`.

<details>
<summary>Solution</summary>

CSV.jl supports reading gzip-compressed files, so you can just do:

```
julia> using CSV

julia> using DataFrames

julia> df = CSV.read("example8.csv.gz", DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
```

You can also do it manually:

```
julia> using CodecZlib # you might need to install this package

julia> compressed = read("example8.csv.gz");

julia> plain = transcode(GzipDecompressor, compressed);

julia> df = CSV.read(plain, DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
```
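
To see what `transcode` does in isolation, the following sketch compresses a
small in-memory CSV text and decompresses it again. The sample text and the
variable names are made up for illustration (they only mirror the columns of
`example8.csv.gz`), and CodecZlib.jl is assumed to be installed.

```julia
using CodecZlib  # provides GzipCompressor and GzipDecompressor

# Hypothetical CSV text, used only for illustration.
csv_text = "number,square\n1,2\n2,4\n"

# Compress the raw bytes, then reverse the operation.
compressed = transcode(GzipCompressor, Vector{UInt8}(csv_text))
plain = transcode(GzipDecompressor, compressed)

# Copy before converting, because String(...) takes ownership of the byte vector.
roundtrip_text = String(copy(plain))
println(roundtrip_text == csv_text)
```

The `plain` byte vector is the same kind of object that is passed to `CSV.read`
in the manual solution.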

</details>

### Exercise 2

Get the number of rows and columns, the column names, and summary statistics of
the `df` data frame from exercise 1.

<details>
<summary>Solution</summary>

```
julia> nrow(df)
4

julia> ncol(df)
2

julia> names(df)
2-element Vector{String}:
 "number"
 "square"

julia> describe(df)
2×7 DataFrame
 Row │ variable  mean     min    median   max    nmissing  eltype
     │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
─────┼──────────────────────────────────────────────────────────────
   1 │ number       2.5       1      2.5      4         0  Int64
   2 │ square       7.75      2      6.5     16         0  Int64
```
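
As a complement, `size` returns both dimensions at once. A minimal sketch,
assuming a small data frame built by hand (the name `df_demo` is hypothetical
and its values copy the output of exercise 1 instead of reading the file):

```julia
using DataFrames

# Hand-built stand-in for the data frame read in exercise 1.
df_demo = DataFrame(number=[1, 2, 3, 4], square=[2, 4, 9, 16])

# size returns a (rows, columns) tuple; a second argument selects one dimension.
dims = size(df_demo)
n_rows = size(df_demo, 1)
n_cols = size(df_demo, 2)

println(dims == (nrow(df_demo), ncol(df_demo)))
```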

</details>

### Exercise 3

Make a plot of the `number` column against the `square` column of the `df` data
frame.

<details>
<summary>Solution</summary>

```
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
```

</details>

### Exercise 4

Add a column named `name string` to the `df` data frame, containing the string
representations of the numbers in column `number`, i.e.
`["one", "two", "three", "four"]`.

<details>
<summary>Solution</summary>

```
julia> df."name string" = ["one", "two", "three", "four"]
4-element Vector{String}:
 "one"
 "two"
 "three"
 "four"

julia> df
4×3 DataFrame
 Row │ number  square  name string
     │ Int64   Int64   String
─────┼─────────────────────────────
   1 │      1       2  one
   2 │      2       4  two
   3 │      3       9  three
   4 │      4      16  four
```

Note that we needed to use a string because the column name contains a space.
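
A column whose name contains a space can be reached in a few other ways as
well. The sketch below is illustrative only: `df_demo` is a hypothetical
stand-in for `df`, and the `var"..."` syntax assumes Julia 1.3 or later.

```julia
using DataFrames

df_demo = DataFrame(number=[1, 2, 3, 4])
df_demo."name string" = ["one", "two", "three", "four"]

# Indexing with a string works for any column name.
by_string = df_demo[!, "name string"]

# A Symbol with a space cannot be written as a :literal, but Symbol("...") works.
by_symbol = df_demo[!, Symbol("name string")]

# var"..." lets property access use a non-standard identifier.
by_var = df_demo.var"name string"

# All three expressions return the very same stored vector.
println(by_string === by_symbol === by_var)
```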

</details>

### Exercise 5

Check whether `df` contains a column named `square2`.

<details>
<summary>Solution</summary>

You can use either `hasproperty` or `columnindex` (which returns `0` when the
column is not found):

```
julia> hasproperty(df, :square2)
false

julia> columnindex(df, :square2)
0
```

Note that if you try to access this column, the error message hints at the
mistake you most likely made:

```
julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
```
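
Another option is to test membership in the vector returned by `names`. A
minimal sketch, assuming a hand-built data frame (`df_demo` is a hypothetical
stand-in for `df`), which also shows what `columnindex` returns for a column
that does exist:

```julia
using DataFrames

df_demo = DataFrame(number=[1, 2, 3, 4], square=[2, 4, 9, 16])

# names returns the column names as strings, so membership can be tested with `in`.
has_square2 = "square2" in names(df_demo)
has_square = "square" in names(df_demo)

# columnindex returns the position of an existing column and 0 otherwise.
idx_square = columnindex(df_demo, :square)
idx_square2 = columnindex(df_demo, :square2)

println((has_square2, has_square, idx_square, idx_square2))
```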

</details>

### Exercise 6

Extract column `number` from `df` and empty it (recall the `empty!` function
discussed in chapter 4).

<details>
<summary>Solution</summary>

```
julia> empty!(df[:, :number])
Int64[]
```

Note that you must not do `empty!(df[!, :number])` nor `empty!(df.number)`,
as either would corrupt the `df` data frame: these operations extract the
column from the data frame without copying it, as opposed to `df[:, :number]`,
which makes a copy.
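
The difference between copying and non-copying extraction can be checked with
`===`, which tests whether two names refer to the same object. This is a sketch
on a hand-built data frame (`df_demo` is a hypothetical stand-in for `df`):

```julia
using DataFrames

df_demo = DataFrame(number=[1, 2, 3, 4], square=[2, 4, 9, 16])

# `!` and property access return the stored vector itself.
same_object = df_demo[!, :number] === df_demo.number

# `:` allocates a fresh copy on every call.
col_copy = df_demo[:, :number]
is_copy = col_copy !== df_demo.number

# Emptying the copy leaves the data frame intact.
empty!(col_copy)
println((same_object, is_copy, nrow(df_demo)))
```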

</details>

### Exercise 7

The `Random` module defines the `randexp` function, which samples numbers
from the exponential distribution with scale 1.
Draw two 100,000-element samples from this distribution and store them
in vectors `x` and `y`. Plot histograms of the maximum of pairs of sampled
values and of the sum of vector `x` and half of vector `y`.

<details>
<summary>Solution</summary>

```
using Random
using Plots
x = randexp(100_000);
y = randexp(100_000);
# draw the first histogram, then overlay the second one with histogram!
histogram(x + y / 2, label="x + y / 2")
histogram!(max.(x, y), label="maximum")
```

I have put both histograms on the same plot to show that they overlap.
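
The overlap is not a coincidence. For two independent exponential samples with
scale 1, the smaller value is exponential with scale 1/2 (it is distributed
like `y / 2`), and by memorylessness the gap between the larger and the smaller
value is again exponential with scale 1 (distributed like `x`). Hence
`max.(x, y)` and `x + y / 2` follow the same distribution, with mean 1.5. The
sketch below checks this numerically using only the standard library; the seed
1234 and the names `combined` and `maximum_of_pairs` are arbitrary choices made
for illustration.

```julia
using Random
using Statistics

Random.seed!(1234)  # arbitrary seed, chosen only to make the run repeatable
x = randexp(100_000)
y = randexp(100_000)

combined = x + y / 2
maximum_of_pairs = max.(x, y)

# Both sample means should be close to the theoretical mean of 1.5.
println((mean(combined), mean(maximum_of_pairs)))

# The quartiles of both samples should nearly coincide as well.
probs = [0.25, 0.5, 0.75]
println(quantile(combined, probs))
println(quantile(maximum_of_pairs, probs))
```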

</details>

### Exercise 8

Using vectors `x` and `y` from exercise 7, create the `df` data frame storing
them together with the maximum of pairs of sampled values and the sum of vector
`x` and half of vector `y`.
Compute all standard descriptive statistics of the columns of this data frame.

<details>
<summary>Solution</summary>

You might get slightly different results because we did not set
the seed of the random number generator when creating the `x` and `y` vectors:

```
julia> df = DataFrame(x=x, y=y);

julia> df."x+y/2" = x + y / 2;

julia> df."max.(x,y)" = max.(x, y);

julia> describe(df, :all)
4×13 DataFrame
 Row │ variable   mean      std       min         q25       median   q75      max      nunique  nmissing  first     last      eltype
     │ Symbol     Float64   Float64   Float64     Float64   Float64  Float64  Float64  Nothing  Int64     Float64   Float64   DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ x          0.997023  0.999119  3.01389e-6  0.285129  0.68856  1.38414  12.1556                  0  0.250502  0.077737  Float64
   2 │ y          1.00109   0.995904  2.78828e-6  0.289371  0.6957   1.38491  12.0445                  0  0.689659  0.486246  Float64
   3 │ x+y/2      1.49757   1.11676   0.00217486  0.688598  1.2235   2.0113   14.2046                  0  0.595331  0.32086   Float64
   4 │ max.(x,y)  1.49872   1.11295   0.00187844  0.691588  1.22466  2.01257  12.1556                  0  0.689659  0.486246  Float64
```

We indeed see that the `x+y/2` and `max.(x,y)` columns have very similar
summary statistics, except for `first` and `last`, as expected.
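
When only a few statistics matter, `describe` also accepts the names of the
statistics to compute. The sketch below draws fresh samples with an arbitrary
seed, so its numbers will not match the table above; `df_demo` and `stats` are
hypothetical names. For reference, a sum of independent exponentials with
scales 1 and 1/2 has variance 1 + 1/4, so the theoretical standard deviation of
both derived columns is `sqrt(1.25)`, roughly 1.118.

```julia
using DataFrames
using Random

Random.seed!(1234)  # arbitrary seed for repeatability
x = randexp(100_000)
y = randexp(100_000)
df_demo = DataFrame("x+y/2" => x + y / 2, "max.(x,y)" => max.(x, y))

# Request only the mean and standard deviation instead of :all.
stats = describe(df_demo, :mean, :std)
println(stats)

# Theoretical values for both columns: mean 1.5, std sqrt(1 + 1/4).
println((1.5, sqrt(1.25)))
```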

</details>

### Exercise 9

Store the `df` data frame from exercise 8 in an Apache Arrow file and a CSV file.
Compare the sizes of the created files using the `filesize` function.

<details>
<summary>Solution</summary>

```
julia> using Arrow

julia> CSV.write("df.csv", df)
"df.csv"

julia> Arrow.write("df.arrow", df)
"df.arrow"

julia> filesize("df.csv")
7587820

julia> filesize("df.arrow")
3200874
```

In this case the Apache Arrow file is smaller: about 3.2 MB, compared with
about 7.6 MB for the CSV file.
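
A quick back-of-the-envelope check helps to interpret these two numbers. The
sketch assumes that the Arrow file stores every `Float64` value as 8
uncompressed bytes; this is an assumption about the default settings, although
the reported size is consistent with it. The variable names are made up for
illustration.

```julia
# Sizes reported by filesize in the session above.
csv_size = 7_587_820
arrow_size = 3_200_874

n_values = 4 * 100_000                  # 4 columns × 100,000 rows
raw_bytes = n_values * sizeof(Float64)  # 8 bytes per Float64

# Bytes in the Arrow file beyond the raw column data.
arrow_overhead = arrow_size - raw_bytes

# Average number of bytes the CSV file spends per value (digits plus separators).
csv_bytes_per_value = csv_size / n_values

println((raw_bytes, arrow_overhead, round(csv_bytes_per_value, digits=2)))
```

Under that assumption the Arrow file is almost entirely raw column data, while
the text representation in the CSV file needs close to 19 bytes per value.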

</details>

### Exercise 10

Write the `df` data frame into an SQLite database. Next, find information about
the tables in this database. Run a query against the table representing the `df`
data frame to calculate the mean of column `x`. Does it match the result we got
in exercise 8?

<details>
<summary>Solution</summary>

```
julia> using SQLite

julia> db = SQLite.DB("df.db")
SQLite.DB("df.db")

julia> SQLite.load!(df, db, "df")
"df"

julia> SQLite.tables(db)
1-element Vector{SQLite.DBTable}:
 SQLite.DBTable("df", Tables.Schema:
 :x                   Union{Missing, Float64}
 :y                   Union{Missing, Float64}
 Symbol("x+y/2")      Union{Missing, Float64}
 Symbol("max.(x,y)")  Union{Missing, Float64})

julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");

julia> DataFrame(query)
1×1 DataFrame
 Row │ AVG(x)
     │ Float64
─────┼──────────
   1 │ 0.997023

julia> close(db)
```

The computed mean of column `x` is the same as the one we got in exercise 8.

</details>