# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 8

# Problems

### Exercise 1

Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame` called `df`.
Solution

CSV.jl supports reading gzip-compressed files, so you can just do:

```
julia> using CSV

julia> using DataFrames

julia> df = CSV.read("example8.csv.gz", DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
```

You can also decompress the file manually:

```
julia> using CodecZlib # you might need to install this package

julia> compressed = read("example8.csv.gz");

julia> plain = transcode(GzipDecompressor, compressed);

julia> df = CSV.read(plain, DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
```
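The round trip can also be tried without the book's data file. The sketch below assumes that CSV.jl, DataFrames.jl, and CodecZlib.jl are installed; the `toy` data frame and the temporary file name are made up for illustration. It writes a gzip-compressed CSV file using the `compress=true` keyword argument of `CSV.write`, then reads it back both directly and through a `GzipDecompressorStream`.

```julia
using CSV, DataFrames, CodecZlib

# hypothetical example data, not the contents of example8.csv.gz
toy = DataFrame(number=1:3, square=[1, 4, 9])
path = joinpath(mktempdir(), "toy.csv.gz")

# write a gzip-compressed CSV file
CSV.write(path, toy; compress=true)

# CSV.read detects the gzip compression on its own
direct = CSV.read(path, DataFrame)

# alternatively, wrap the file in a decompressing stream
streamed = open(path) do io
    CSV.read(GzipDecompressorStream(io), DataFrame)
end
```

Both `direct` and `streamed` should compare equal to `toy`.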
### Exercise 2

Get the number of rows, the number of columns, the column names, and the summary statistics of the `df` data frame from exercise 1.
Solution

```
julia> nrow(df)
4

julia> ncol(df)
2

julia> names(df)
2-element Vector{String}:
 "number"
 "square"

julia> describe(df)
2×7 DataFrame
 Row │ variable  mean     min    median   max    nmissing  eltype
     │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
─────┼──────────────────────────────────────────────────────────────
   1 │ number       2.5       1      2.5      4         0  Int64
   2 │ square       7.75      2      6.5     16         0  Int64
```
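A few related queries, shown here as a minimal sketch on a made-up `toy` data frame (an assumption for illustration, not the data from exercise 1): `size` returns both dimensions at once, and `propertynames` returns the column names as `Symbol`s instead of strings.

```julia
using DataFrames

# hypothetical example data
toy = DataFrame(number=1:4, square=[1, 4, 9, 16])

dims = size(toy)            # (number of rows, number of columns)
rows = size(toy, 1)         # same value as nrow(toy)
syms = propertynames(toy)   # column names as a Vector{Symbol}
```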
### Exercise 3

Make a plot of `number` against `square` columns of the `df` data frame.
Solution

```
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
```
### Exercise 4

Add a column named `name string` to the `df` data frame, containing the string representation of the numbers in the `number` column, i.e. `["one", "two", "three", "four"]`.
Solution

```
julia> df."name string" = ["one", "two", "three", "four"]
4-element Vector{String}:
 "one"
 "two"
 "three"
 "four"

julia> df
4×3 DataFrame
 Row │ number  square  name string
     │ Int64   Int64   String
─────┼─────────────────────────────
   1 │      1       2  one
   2 │      2       4  two
   3 │      3       9  three
   4 │      4      16  four
```

Note that we needed to use a string because the column name contains a space.
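A column whose name contains a space can be reached in several equivalent ways. The following minimal sketch uses a made-up `toy` data frame (an assumption for illustration) and shows property access with a string, indexing with a string, and indexing with a `Symbol` built from a string.

```julia
using DataFrames

# hypothetical example data
toy = DataFrame(number=1:2)
toy."name string" = ["one", "two"]

a = toy."name string"               # property access with a string, no copy
b = toy[!, "name string"]           # indexing with a string, no copy
c = toy[:, Symbol("name string")]   # indexing with a Symbol, makes a copy
```

Here `a` and `b` are the very same vector that is stored in `toy`, while `c` is an independent copy with equal contents.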
### Exercise 5

Check if `df` contains a column named `square2`.
Solution

You can use either `hasproperty` or `columnindex`:

```
julia> hasproperty(df, :square2)
false

julia> columnindex(df, :square2)
0
```

Note that if you try to access this column, the error message hints at the mistake you most likely made:

```
julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
```
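Such a check is typically used to guard an access that would otherwise throw. The sketch below works on a made-up `toy` data frame (an assumption for illustration); it also shows that a plain membership test on `names` gives the same answer as `hasproperty`.

```julia
using DataFrames

# hypothetical example data
toy = DataFrame(number=1:4, square=[1, 4, 9, 16])

# membership test on the vector of column names
in_names = "square2" in names(toy)

# guard the access, falling back to `missing` when the column is absent
guarded = hasproperty(toy, :square2) ? toy.square2 : missing

# columnindex returns the position of an existing column
pos = columnindex(toy, :square)
```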
### Exercise 6

Extract the `number` column from `df` and empty it (recall the `empty!` function discussed in chapter 4).
Solution

```
julia> empty!(df[:, :number])
Int64[]
```

Note that you must not do `empty!(df[!, :number])` nor `empty!(df.number)`, as either would corrupt the `df` data frame (these operations extract a column from a data frame without copying, as opposed to `df[:, :number]`, which makes a copy).
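The difference between copying and non-copying extraction can be checked with the identity operator `===`. The sketch below uses a made-up `toy` data frame (an assumption for illustration) and only empties the copy, so the data frame stays intact.

```julia
using DataFrames

# hypothetical example data
toy = DataFrame(number=1:4, square=[1, 4, 9, 16])

copied = toy[:, :number]   # a fresh vector, independent of toy
alias = toy[!, :number]    # the very vector stored inside toy

is_copy = copied !== toy.number
is_alias = alias === toy.number

# emptying the copy is safe: toy still has all four rows
empty!(copied)
rows_after = nrow(toy)
```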
### Exercise 7

The `Random` module defines the `randexp` function, which samples numbers from the exponential distribution with scale 1. Draw two 100,000-element samples from this distribution and store them in the `x` and `y` vectors. Plot histograms of the maximum of pairs of sampled values and of the sum of vector `x` and half of vector `y`.
Solution

```
using Random
using Plots
x = randexp(100_000);
y = randexp(100_000);
# label the first histogram by the plotted expression (it is not a mean)
histogram(x + y / 2, label="x + y / 2")
histogram!(max.(x, y), label="maximum")
```

I have put both histograms on the same plot to show that they overlap.
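The overlap can also be checked numerically, without a plot. For two independent exponential samples with scale 1, the maximum of a pair and the sum of one value and half of the other follow the same distribution, so their empirical quantiles should be close. The sketch below uses only the standard library; the seed value and the tolerance mentioned afterwards are arbitrary choices made for illustration.

```julia
using Random, Statistics

Random.seed!(1234)   # arbitrary seed, used only to make the run repeatable
x = randexp(100_000)
y = randexp(100_000)

# compare the deciles of both derived samples
probs = 0.1:0.1:0.9
q_sum = quantile(x + y / 2, probs)
q_max = quantile(max.(x, y), probs)

# largest absolute difference between corresponding deciles
max_gap = maximum(abs.(q_sum .- q_max))
```

With samples of this size, `max_gap` should be small compared to the spread of the data (well below 0.1).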
### Exercise 8

Using the vectors `x` and `y` from exercise 7, create the `df` data frame storing them, together with the maximum of pairs of sampled values and the sum of vector `x` and half of vector `y`. Compute all standard descriptive statistics of the columns of this data frame.
Solution

You might get slightly different results because we did not set the seed of the random number generator when creating the `x` and `y` vectors:

```
julia> df = DataFrame(x=x, y=y);

julia> df."x+y/2" = x + y / 2;

julia> df."max.(x,y)" = max.(x, y);

julia> describe(df, :all)
4×13 DataFrame
 Row │ variable   mean      std       min         q25       median   q75      max      nunique  nmissing  first     last      eltype
     │ Symbol     Float64   Float64   Float64     Float64   Float64  Float64  Float64  Nothing  Int64     Float64   Float64   DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ x          0.997023  0.999119  3.01389e-6  0.285129  0.68856  1.38414  12.1556                  0  0.250502  0.077737  Float64
   2 │ y          1.00109   0.995904  2.78828e-6  0.289371  0.6957   1.38491  12.0445                  0  0.689659  0.486246  Float64
   3 │ x+y/2      1.49757   1.11676   0.00217486  0.688598  1.2235   2.0113   14.2046                  0  0.595331  0.32086   Float64
   4 │ max.(x,y)  1.49872   1.11295   0.00187844  0.691588  1.22466  2.01257  12.1556                  0  0.689659  0.486246  Float64
```

We indeed see that the `x+y/2` and `max.(x,y)` columns have very similar summary statistics, except for `first` and `last`, as expected.
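To get repeatable numbers, the random number generator can be seeded before sampling. The sketch below is a minimal illustration: the seed value `1234` and the small sample size are arbitrary assumptions, and the generated values can still differ between Julia versions. It also shows that `describe` accepts a selection of statistics instead of `:all`.

```julia
using Random, DataFrames

# reseeding with the same value reproduces the same sample
Random.seed!(1234)
x1 = randexp(5)
Random.seed!(1234)
x2 = randexp(5)
same_sample = x1 == x2

# request only chosen statistics instead of :all
small = DataFrame(x=x1)
stats = describe(small, :mean, :std)
```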
### Exercise 9

Store the `df` data frame from exercise 8 in an Apache Arrow file and in a CSV file. Compare the sizes of the created files using the `filesize` function.
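Solution

```
julia> using Arrow

julia> CSV.write("df.csv", df)
"df.csv"

julia> Arrow.write("df.arrow", df)
"df.arrow"

julia> filesize("df.csv")
7587820

julia> filesize("df.arrow")
3200874
```

In this case, the Apache Arrow file is smaller. It stores each `Float64` value in 8 bytes (4 columns × 100,000 rows × 8 bytes = 3,200,000 bytes, plus metadata), whereas the CSV file stores each value as text.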
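Both files can be read back to confirm that no data was lost. The sketch below assumes that Arrow.jl, CSV.jl, and DataFrames.jl are installed; the `toy` data frame and the temporary paths are made up for illustration. For a table this small the relative file sizes need not match the comparison above, so the sketch only records them.

```julia
using Arrow, CSV, DataFrames

# hypothetical example data
toy = DataFrame(x=[0.25, 0.5], y=[1.5, 2.0])
dir = mktempdir()
csv_path = joinpath(dir, "toy.csv")
arrow_path = joinpath(dir, "toy.arrow")

CSV.write(csv_path, toy)
Arrow.write(arrow_path, toy)

# read both files back into data frames
from_csv = CSV.read(csv_path, DataFrame)
from_arrow = DataFrame(Arrow.Table(arrow_path))

sizes = (csv=filesize(csv_path), arrow=filesize(arrow_path))
```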
### Exercise 10

Write the `df` data frame into an SQLite database. Next, find information about the tables in this database. Run a query against the table representing the `df` data frame to calculate the mean of column `x`. Does it match the result we got in exercise 8?
Solution

```
julia> using SQLite

julia> db = SQLite.DB("df.db")
SQLite.DB("df.db")

julia> SQLite.load!(df, db, "df")
"df"

julia> SQLite.tables(db)
1-element Vector{SQLite.DBTable}:
 SQLite.DBTable("df", Tables.Schema:
 :x                   Union{Missing, Float64}
 :y                   Union{Missing, Float64}
 Symbol("x+y/2")      Union{Missing, Float64}
 Symbol("max.(x,y)")  Union{Missing, Float64})

julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");

julia> DataFrame(query)
1×1 DataFrame
 Row │ AVG(x)
     │ Float64
─────┼──────────
   1 │ 0.997023

julia> close(db)
```

The computed mean of column `x` is the same as we got in exercise 8.
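The comparison with the in-memory result can also be made programmatically. The sketch below assumes that SQLite.jl and DataFrames.jl are installed and uses a made-up `toy` data frame and table name; calling `SQLite.DB()` without a file name creates an in-memory database, so nothing is written to disk.

```julia
using SQLite, DataFrames, Statistics

# hypothetical example data
toy = DataFrame(x=[0.5, 1.5, 2.5])

db = SQLite.DB()               # in-memory database
SQLite.load!(toy, db, "toy")

# name the aggregate so that the result column is easy to access
res = DataFrame(DBInterface.execute(db, "SELECT AVG(x) AS mean_x FROM toy"))
sql_mean = res.mean_x[1]
close(db)

# compare with the mean computed directly in Julia
julia_mean = mean(toy.x)
```

Using `≈` rather than `==` for the final comparison guards against tiny floating-point differences between the two computations.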