# Julia for Data Analysis ## Bogumił Kamiński, Daniel Kaszyński # Chapter 12 # Problems ### Exercise 1 The https://go.dev/dl/go1.19.2.src.tar.gz link contains source code of Go language version 19.2. As you can check on its website its SHA-256 is `sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"`. Download this file and check if it indeed has this checksum. You might need to read documentation of `string` and `join` functions. ### Exercise 2 Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip that contains the ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. From the file extract deezer_edges.json and deezer_target.csv files and save them to disk. ### Exercise 3 Load deezer_edges.json and deezer_target.csv files to Julia. The JSON file should be loaded as JSON3.jl object `edges_json`. The CSV file should be loaded into a data frame `target_df`. ### Exercise 4 Check that keys in the `edges_json` are in the same order as `id` column in `target_df`. ### Exercise 5 From every value stored in `edges_json` create a graph representing ego-net of the given node. Store these graphs in a vector that will make the `egonet` column of in the `target_df` data frame. ### Exercise 6 Ego-net in our data set is a subgraph of a full Deezer graph where for some node all its neighbors are included, but also it contains all edges between the neighbors. Therefore we expect that diameter of every ego-net is at most 2 (as every two nodes are either connected directly or by a common friend). Check if this is indeed the case. Use the `diameter` function. ### Exercise 7 For each ego-net find a central node that is connected to every other node in this network. Use the `degree` and `findall` functions to achieve this. Add `center` column with numbers of nodes that are connected to all other nodes in the ego-net to `target_df` data frame. Next add a column `center_len` that gives the number of such nodes. Check how many times different numbers of center nodes are found. ### Exercise 8 Add the following ego-net features to the `target_df` data frame: * `size`: number of nodes in ego-net * `mean_degree`: average node degree in ego-net Check mean values of these two columns by `target` column. ### Exercise 9 Continuing to work with `target_df` data frame create a logistic regression explaining `target` by `size` and `mean_degree`. ### Exercise 10 Continuing to work with `target_df` create a scatterplot where `size` will be on one axis and `mean_degree` rounded to nearest integer on the other axis. Plot the mean of `target` for each point being a combination of `size` and rounded `mean_degree`. Additionally fit a LOESS model explaining `target` by `size`. Make a prediction for values in range from 5% to 95% quantile (to concentrate on typical values of size). # Solutions
Show! ### Exercise 1 Solution: ``` using Downloads using SHA Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz") shavec = open(sha256, "go.tar.gz") shastr = join(string.(s; base=16, pad=2)) sha == shastr ``` The last line should produce `true`. ### Exercise 2 Solution: ``` Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip") import ZipFile archive = ZipFile.Reader("ego.zip") idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files)) open("deezer_edges.json", "w") do io write(io, read(archive.files[idx])) end idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files)) open("deezer_target.csv", "w") do io write(io, read(archive.files[idx])) end close(archive) ``` ### Exercise 3 Solution: ``` using CSV using DataFrames using JSON3 edges_json = JSON3.read(read("deezer_edges.json")) target_df = CSV.read("deezer_target.csv", DataFrame) ``` ### Exercise 4 Solution (short, but you need to have a good understanding of Julia types and standar functions to properly write it): ``` Symbol.(target_df.id) == keys(edges_json) ``` ### Exercise 5 Solution: ``` using Graphs function edgelist2graph(edgelist) nodes = sort!(unique(reduce(vcat, edgelist))) @assert 0:length(nodes)-1 == nodes g = SimpleGraph(length(nodes)) for (a, b) in edgelist add_edge!(g, a+1, b+1) end return g end target_df.egonet = edgelist2graph(values(edges_json)) ``` ### Exercise 6 Solution: ``` julia> extrema(diameter.(target_df.egonet)) (2, 2) ``` Indeed we see that for each ego-net diameter is 2. ### Exercise 7 Solution: ``` target_df.center = map(target_df.egonet) do g findall(==(nv(g) - 1), degree(g)) end target_df.center_len = length.(target_df.center) combine(groupby(target_df, :center_len, sort=true), nrow) ``` Note that we used `map` since in this case it gives a convenient way to express the condition we want to check. We notice that in some cases it is impossible to identify the center of the ego-net uniquely. ### Exercise 8 Solution: ``` using Statistics target_df.size = nv.(target_df.egonet) target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean) ``` It seems that for target equal to `0` size and average degree in the network are a bit larger. ### Exercise 9 Solution: ``` using GLM glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink()) ``` We see that only `size` is statistically significant. ### Exercise 10 Solution: ``` using Plots target_df.round_degree = round.(Int, target_df.mean_degree) agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean) scatter(agg_df.size, agg_df.round_degree; zcolor=agg_df.target_mean, xlabel="size", ylabel="rounded degree", label="mean target", xaxis=:log) ``` It is hard to visually see any strong relationship in the data. ``` import Loess model = Loess.loess(target_df.size, target_df.target) size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95) target_predict = Loess.predict(model, size_predict) plot(size_predict, target_predict; xlabel="size", ylabel="predicted target", legend=false) ``` Between quantiles 5% and 95% we see a downward shaped relationship.