JuliaForDataAnalysis/exercises/exercises08.md
2022-10-14 13:43:12 +02:00

Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 8

Problems

Exercise 1

Read data stored in a gzip-compressed file example8.csv.gz into a DataFrame called df.

Solution

CSV.jl supports reading gzip-compressed files so you can just do:

julia> using CSV

julia> using DataFrames

julia> df = CSV.read("example8.csv.gz", DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16

You can also decompress the file manually and pass the resulting bytes to CSV.read:

julia> using CodecZlib # you might need to install this package

julia> compressed = read("example8.csv.gz");

julia> plain = transcode(GzipDecompressor, compressed);

julia> df = CSV.read(plain, DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
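
The manual route can also be tried without any file on disk. The following is a minimal sketch, assuming a hypothetical in-memory string `csv_text` that stands in for the contents of example8.csv.gz; it compresses the text with `GzipCompressor` and then reverses the step exactly as in the solution above.

```julia
using CSV, DataFrames, CodecZlib

# Hypothetical CSV text standing in for the contents of example8.csv.gz
csv_text = "number,square\n1,2\n2,4\n3,9\n4,16\n"

# Compress the raw bytes in memory, then decompress them again
compressed = transcode(GzipCompressor, Vector{UInt8}(csv_text))
plain = transcode(GzipDecompressor, compressed)

# CSV.read accepts a byte vector as its source, just like a file name
df_demo = CSV.read(plain, DataFrame)
```

Here `df_demo` is an illustrative name chosen so that the `df` object from the exercise is not overwritten.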

Exercise 2

Get the number of rows, the number of columns, the column names, and summary statistics of the df data frame from exercise 1.

Solution
julia> nrow(df)
4

julia> ncol(df)
2

julia> names(df)
2-element Vector{String}:
 "number"
 "square"

julia> describe(df)
2×7 DataFrame
 Row │ variable  mean     min    median   max    nmissing  eltype
     │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
─────┼──────────────────────────────────────────────────────────────
   1 │ number       2.5       1      2.5      4         0  Int64
   2 │ square       7.75      2      6.5     16         0  Int64
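
Two related calls may be handy here. The sketch below assumes a hypothetical stand-in data frame `demo` with the same contents as in exercise 1; it gets both dimensions at once with `size` and restricts `describe` to selected statistics.

```julia
using DataFrames

# Hypothetical stand-in for the data frame read in exercise 1
demo = DataFrame(number=1:4, square=[2, 4, 9, 16])

# size returns a tuple (number of rows, number of columns)
dims = size(demo)

# Passing statistic names to describe keeps only those statistics
stats = describe(demo, :mean, :max)
```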

Exercise 3

Plot the square column against the number column of the df data frame.

Solution
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)

Exercise 4

Add a column named "name string" to the df data frame. It should contain the string representation of the numbers in column number, i.e. ["one", "two", "three", "four"].

Solution
julia> df."name string" = ["one", "two", "three", "four"]
4-element Vector{String}:
 "one"
 "two"
 "three"
 "four"

julia> df
4×3 DataFrame
 Row │ number  square  name string
     │ Int64   Int64   String
─────┼─────────────────────────────
   1 │      1       2  one
   2 │      2       4  two
   3 │      3       9  three
   4 │      4      16  four

Note that we had to use a string literal because the column name contains a space.
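
The same column can be reached in a few equivalent ways. This is a small sketch on a hypothetical data frame `demo`, showing that the string form and the `Symbol("...")` form refer to the same stored vector.

```julia
using DataFrames

# Hypothetical data frame used only for illustration
demo = DataFrame(number=1:4)
demo."name string" = ["one", "two", "three", "four"]

# Three ways to get the column without copying it
a = demo."name string"
b = demo[!, "name string"]
c = demo[!, Symbol("name string")]

# All three refer to the very same vector
same = a === b === c
```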

Exercise 5

Check if df contains column square2.

Solution

You can use either hasproperty or columnindex:

julia> hasproperty(df, :square2)
false

julia> columnindex(df, :square2)
0

Note that if you try to access this column, the error message hints at the mistake you most likely made:

julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
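
To avoid the error altogether, the check can guard the access. The following is a minimal sketch on a hypothetical data frame `demo`; the fallback to the square column is chosen only for illustration.

```julia
using DataFrames

# Hypothetical stand-in for the df data frame
demo = DataFrame(number=1:4, square=[2, 4, 9, 16])

# names returns strings, so membership can be tested with in
has_square2 = "square2" in names(demo)

# columnindex returns the position of an existing column and 0 otherwise
square_pos = columnindex(demo, :square)

# Access square2 only if it exists; otherwise fall back to square
chosen = hasproperty(demo, :square2) ? demo.square2 : demo.square
```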

Exercise 6

Extract column number from df and empty it (recall the empty! function discussed in chapter 4).

Solution
julia> empty!(df[:, :number])
Int64[]

Note that you must not call empty!(df[!, :number]) or empty!(df.number), as either would corrupt the df data frame: both perform non-copying extraction of a column, whereas df[:, :number] makes a copy.
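
The difference between copying and non-copying extraction can be checked with `===`. The sketch below uses a hypothetical data frame `demo` so that `df` itself is left untouched.

```julia
using DataFrames

# Hypothetical stand-in for the df data frame
demo = DataFrame(number=1:4, square=[2, 4, 9, 16])

# demo[!, :number] and demo.number both return the stored column itself
same_vector = demo[!, :number] === demo.number

# demo[:, :number] allocates a fresh vector on every call
is_copy = demo[:, :number] !== demo.number

# Emptying the copy therefore leaves the data frame intact
empty!(demo[:, :number])
rows_after = nrow(demo)
```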

Exercise 7

The Random module defines the randexp function, which samples numbers from the exponential distribution with scale 1. Draw two samples of 100,000 elements each from this distribution and store them in vectors x and y. Plot histograms of the maximum of the pairs of sampled values and of the sum of vector x and half of vector y.

Solution
using Random
using Plots
x = randexp(100_000);
y = randexp(100_000);
histogram(x + y / 2, label="x + y / 2")
histogram!(max.(x, y), label="maximum")

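I have put both histograms on the same plot to show that they overlap. This is expected: for two independent exponential variables with scale 1, the maximum has the same distribution as x + y / 2.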
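
The overlap can also be checked numerically, without a plot. The following is a rough sketch; the seed value, the names `xs`, `ys`, `s`, `m`, and the choice of quartiles are arbitrary and serve only as an illustration.

```julia
using Random, Statistics

Random.seed!(1234)  # arbitrary seed, set only to make the run repeatable
xs = randexp(100_000)
ys = randexp(100_000)

s = xs + ys / 2     # sum of xs and half of ys
m = max.(xs, ys)    # elementwise maximum of the pairs

# Gap between the sample means of the two quantities
mean_gap = abs(mean(s) - mean(m))

# Largest gap between the sample quartiles
probs = [0.25, 0.5, 0.75]
quantile_gap = maximum(abs.(quantile(s, probs) .- quantile(m, probs)))
```

With samples of this size both gaps should be small compared with the spread of the data, which is consistent with the two histograms overlapping.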

Exercise 8

Using vectors x and y from exercise 7, create the df data frame that stores them together with the maximum of the pairs of sampled values and the sum of vector x and half of vector y. Compute all standard descriptive statistics of the columns of this data frame.

Solution

You might get slightly different results because we did not set the seed of the random number generator when creating the x and y vectors:

julia> df = DataFrame(x=x, y=y);

julia> df."x+y/2" = x + y / 2;

julia> df."max.(x,y)" = max.(x, y);

julia> describe(df, :all)
4×13 DataFrame
 Row │ variable   mean      std       min         q25       median   q75      max      nunique  nmissing  first     last      eltype
     │ Symbol     Float64   Float64   Float64     Float64   Float64  Float64  Float64  Nothing  Int64     Float64   Float64   DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ x          0.997023  0.999119  3.01389e-6  0.285129  0.68856  1.38414  12.1556                  0  0.250502  0.077737  Float64
   2 │ y          1.00109   0.995904  2.78828e-6  0.289371  0.6957   1.38491  12.0445                  0  0.689659  0.486246  Float64
   3 │ x+y/2      1.49757   1.11676   0.00217486  0.688598  1.2235   2.0113   14.2046                  0  0.595331  0.32086   Float64
   4 │ max.(x,y)  1.49872   1.11295   0.00187844  0.691588  1.22466  2.01257  12.1556                  0  0.689659  0.486246  Float64

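As expected, the x+y/2 and max.(x,y) columns have very similar summary statistics. The exceptions are first and last, which are individual observations, and to a lesser extent min and max, which depend on the most extreme values in the sample.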

Exercise 9

Store the df data frame from exercise 8 in an Apache Arrow file and in a CSV file. Compare the sizes of the created files using the filesize function.

Solution
julia> using Arrow

julia> CSV.write("df.csv", df)
"df.csv"

julia> Arrow.write("df.arrow", df)
"df.arrow"

julia> filesize("df.csv")
7587820

julia> filesize("df.arrow")
3200874

In this case the Apache Arrow file is smaller, at less than half the size of the CSV file.
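
A quick back-of-the-envelope calculation is consistent with these sizes. The sketch below assumes that the Arrow file stores each Float64 value as 8 uncompressed bytes (which I believe is the default of Arrow.write); the two file sizes are copied from the output above.

```julia
n_rows, n_cols = 100_000, 4

# Raw binary size of the data: 8 bytes per Float64 value
raw_bytes = n_rows * n_cols * sizeof(Float64)

# File sizes reported by filesize above
arrow_size = 3_200_874
csv_size = 7_587_820

# What remains in the Arrow file after the raw values (metadata, padding)
arrow_overhead = arrow_size - raw_bytes

# Average number of bytes the CSV file spends per value,
# including delimiters and newlines
csv_bytes_per_value = csv_size / (n_rows * n_cols)
```

Under this assumption the Arrow file is almost entirely raw data, while the CSV file spends roughly 19 bytes on the text representation of each number.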

Exercise 10

Write the df data frame into an SQLite database. Next, find information about the tables in this database. Run a query against the table representing the df data frame to calculate the mean of column x. Does it match the result we got in exercise 8?

Solution
julia> using SQLite

julia> db = SQLite.DB("df.db")
SQLite.DB("df.db")

julia> SQLite.load!(df, db, "df")
"df"

julia> SQLite.tables(db)
1-element Vector{SQLite.DBTable}:
 SQLite.DBTable("df", Tables.Schema:
 :x                   Union{Missing, Float64}
 :y                   Union{Missing, Float64}
 Symbol("x+y/2")      Union{Missing, Float64}
 Symbol("max.(x,y)")  Union{Missing, Float64})

julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");

julia> DataFrame(query)
1×1 DataFrame
 Row │ AVG(x)
     │ Float64
─────┼──────────
   1 │ 0.997023

julia> close(db)

The computed mean of column x is the same as we got in exercise 8.
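
The comparison can also be done programmatically instead of by eye. This is a minimal sketch using an in-memory database and a small hypothetical data frame `demo`; the table name `demo` and the alias `mean_x` are chosen only for illustration.

```julia
using SQLite, DataFrames, Statistics

# Small hypothetical data frame standing in for df
demo = DataFrame(x=[0.5, 1.5, 2.5, 3.5])

# SQLite.DB() without a file name creates an in-memory database
db = SQLite.DB()
SQLite.load!(demo, db, "demo")

# Compute the mean in SQL, giving the result column a convenient name
result = DataFrame(DBInterface.execute(db, "SELECT AVG(x) AS mean_x FROM demo"))
sql_mean = result.mean_x[1]

# Compute the same mean in Julia and compare up to floating-point error
julia_mean = mean(demo.x)
means_match = isapprox(sql_mean, julia_mean)

close(db)
```

Using `isapprox` rather than `==` is a deliberate choice here, since the two computations may sum the values in a different order.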