# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 8

# Problems

### Exercise 1

Read data stored in a gzip-compressed file `example8.csv.gz` into a `DataFrame` called `df`.

### Exercise 2

Get the number of rows, the number of columns, the column names, and summary statistics of the `df` data frame from exercise 1.

### Exercise 3

Make a plot of `number` against `square` columns of the `df` data frame.

### Exercise 4

Add a column named `name string` to the `df` data frame. It should contain the string representation of the numbers in column `number`, i.e. `["one", "two", "three", "four"]`.

### Exercise 5

Check if `df` contains column `square2`.

### Exercise 6

Extract column `number` from `df` and empty it (recall the `empty!` function discussed in chapter 4).

### Exercise 7

The `Random` module defines the `randexp` function, which samples numbers from the exponential distribution with scale 1. Draw two 100,000-element samples from this distribution and store them in the `x` and `y` vectors. Plot histograms of the maximum of pairs of sampled values and of the sum of vector `x` and half of vector `y`.

### Exercise 8

Using vectors `x` and `y` from exercise 7, create the `df` data frame storing them, together with the maximum of pairs of sampled values and the sum of vector `x` and half of vector `y`. Compute all standard descriptive statistics of the columns of this data frame.

### Exercise 9

Store the `df` data frame from exercise 8 in an Apache Arrow file and in a CSV file. Compare the sizes of the created files using the `filesize` function.

### Exercise 10

Write the `df` data frame into an SQLite database. Next, find information about the tables in this database. Run a query against the table representing the `df` data frame to calculate the mean of column `x`. Does it match the result we got in exercise 8?

# Solutions

### Exercise 1

CSV.jl supports reading gzip-compressed files, so you can just do:

```
julia> using CSV

julia> using DataFrames

julia> df = CSV.read("example8.csv.gz", DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
```

You can also do it manually:

```
julia> using CodecZlib # you might need to install this package

julia> compressed = read("example8.csv.gz");

julia> plain = transcode(GzipDecompressor, compressed);

julia> df = CSV.read(plain, DataFrame)
4×2 DataFrame
 Row │ number  square
     │ Int64   Int64
─────┼────────────────
   1 │      1       2
   2 │      2       4
   3 │      3       9
   4 │      4      16
```

### Exercise 2

Solution:

```
julia> nrow(df)
4

julia> ncol(df)
2

julia> names(df)
2-element Vector{String}:
 "number"
 "square"

julia> describe(df)
2×7 DataFrame
 Row │ variable  mean     min    median   max    nmissing  eltype
     │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
─────┼──────────────────────────────────────────────────────────────
   1 │ number       2.5       1      2.5      4         0  Int64
   2 │ square       7.75      2      6.5     16         0  Int64
```

### Exercise 3

Solution:

```
using Plots
plot(df.number, df.square, xlabel="number", ylabel="square", legend=false)
```

### Exercise 4

Solution:

```
julia> df."name string" = ["one", "two", "three", "four"]
4-element Vector{String}:
 "one"
 "two"
 "three"
 "four"

julia> df
4×3 DataFrame
 Row │ number  square  name string
     │ Int64   Int64   String
─────┼─────────────────────────────
   1 │      1       2  one
   2 │      2       4  two
   3 │      3       9  three
   4 │      4      16  four
```

Note that we had to write the column name as a string because it contains a space.
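
For comparison, here is a minimal sketch of the other ways to refer to a column whose name contains a space. It rebuilds the four-row table by hand, which is an assumption made only so that the snippet does not depend on `example8.csv.gz`.

```julia
using DataFrames

# Hand-built stand-in for the data frame from exercise 1 (assumed values,
# copied from the output shown in the solution)
df = DataFrame(number=1:4, square=[2, 4, 9, 16])
df."name string" = ["one", "two", "three", "four"]

# Three equivalent, non-copying ways to get the same column
a = df."name string"
b = df[!, "name string"]
c = df[!, Symbol("name string")]

# The plain `df.name` form is not available here, because `name string`
# is not a valid Julia identifier
println(a === b === c)
```

The string forms are usually the most readable. The `Symbol("...")` form is useful in code that already passes column names around as symbols.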

### Exercise 5

You can use either `hasproperty` or `columnindex`:

```
julia> hasproperty(df, :square2)
false

julia> columnindex(df, :square2)
0
```

Note that if you try to access this column, the error message hints at the mistake you most likely made:

```
julia> df.square2
ERROR: ArgumentError: column name :square2 not found in the data frame; existing most similar names are: :square
```

### Exercise 6

Solution:

```
julia> empty!(df[:, :number])
Int64[]
```

Note that you must not call `empty!(df[!, :number])` or `empty!(df.number)`, as either would corrupt the `df` data frame: these operations extract the column without copying it, whereas `df[:, :number]` makes a copy.

### Exercise 7

Solution:

```
using Random
using Plots
x = randexp(100_000);
y = randexp(100_000);
histogram(x + y / 2, label="x + y / 2")
histogram!(max.(x, y), label="maximum")
```

I have put both histograms on the same plot to show that they overlap.

### Exercise 8

Solution (you might get slightly different results because we did not set the seed of the random number generator when creating the `x` and `y` vectors):

```
julia> df = DataFrame(x=x, y=y);

julia> df."x+y/2" = x + y / 2;

julia> df."max.(x,y)" = max.(x, y);

julia> describe(df, :all)
4×13 DataFrame
 Row │ variable   mean      std       min         q25       median   q75      max      nunique  nmissing  first     last      eltype
     │ Symbol     Float64   Float64   Float64     Float64   Float64  Float64  Float64  Nothing  Int64     Float64   Float64   DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ x          0.997023  0.999119  3.01389e-6  0.285129  0.68856  1.38414  12.1556                  0  0.250502  0.077737  Float64
   2 │ y          1.00109   0.995904  2.78828e-6  0.289371  0.6957   1.38491  12.0445                  0  0.689659  0.486246  Float64
   3 │ x+y/2      1.49757   1.11676   0.00217486  0.688598  1.2235   2.0113   14.2046                  0  0.595331  0.32086   Float64
   4 │ max.(x,y)  1.49872   1.11295   0.00187844  0.691588  1.22466  2.01257  12.1556                  0  0.689659  0.486246  Float64
```

We indeed see that `x+y/2`
and `max.(x,y)` columns have very similar summary statistics, except for `first` and `last`, as expected.

### Exercise 9

```
julia> using Arrow

julia> CSV.write("df.csv", df)
"df.csv"

julia> Arrow.write("df.arrow", df)
"df.arrow"

julia> filesize("df.csv")
7587820

julia> filesize("df.arrow")
3200874
```

In this case the Apache Arrow file is smaller.

### Exercise 10

```
julia> using SQLite

julia> db = SQLite.DB("df.db")
SQLite.DB("df.db")

julia> SQLite.load!(df, db, "df")
"df"

julia> SQLite.tables(db)
1-element Vector{SQLite.DBTable}:
 SQLite.DBTable("df", Tables.Schema:
 :x                   Union{Missing, Float64}
 :y                   Union{Missing, Float64}
 Symbol("x+y/2")      Union{Missing, Float64}
 Symbol("max.(x,y)")  Union{Missing, Float64})

julia> query = DBInterface.execute(db, "SELECT AVG(x) FROM df");

julia> DataFrame(query)
1×1 DataFrame
 Row │ AVG(x)
     │ Float64
─────┼──────────
   1 │ 0.997023

julia> close(db)
```

The computed mean of column `x` is the same as the one we got in exercise 8.
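
Comparing printed values by eye only checks the displayed digits. The sketch below shows one way to do the comparison in code. It makes several assumptions for illustration: it uses a small hand-made `df`, an in-memory database created with `SQLite.DB()`, and a column alias `mean_x` that does not appear in the solution above.

```julia
using DataFrames
using SQLite
using Statistics

# Small stand-in data frame (assumed values, not the 100,000-row sample)
df = DataFrame(x=[0.5, 1.5, 2.5], y=[1.0, 2.0, 3.0])

# In-memory database, so no file is left on disk
db = SQLite.DB()
SQLite.load!(df, db, "df")

# `mean_x` is a hypothetical alias that makes the result column easy to access
res = DataFrame(DBInterface.execute(db, "SELECT AVG(x) AS mean_x FROM df"))
sql_mean = res.mean_x[1]

# Use approximate comparison, since the two means may be accumulated differently
println(sql_mean ≈ mean(df.x))

close(db)
```

Using `≈` rather than `==` is a deliberate choice here: SQLite and Julia may sum floating-point values in a different order, so the last bits of the two results are not guaranteed to agree.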