JuliaForDataAnalysis/exercises/exercises12.md
2023-05-06 21:45:45 +02:00

Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 12

Problems

Exercise 1

The https://go.dev/dl/go1.19.2.src.tar.gz link contains the source code of the Go language, version 1.19.2. As you can check on its website, its SHA-256 is sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b". Download this file and check whether it indeed has this checksum. You might need to read the documentation of the string and join functions.

Solution
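using Downloads
using SHA
# expected checksum, as given in the exercise statement
sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b"
Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz")
# sha256 applied to the opened file returns the digest as a vector of bytes
shavec = open(sha256, "go.tar.gz")
# convert each byte to two hexadecimal digits and concatenate them
shastr = join(string.(shavec; base=16, pad=2))
sha == shastr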

The last line should produce true.
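
The role of the base and pad keyword arguments is easier to see on a small input. The sketch below hashes the string "abc" (chosen only for illustration; it is not part of the exercise) and compares the manual conversion with bytes2hex from Base, which should give the same result.

```julia
using SHA

# a single byte smaller than 0x10 shows why pad=2 is needed
unpadded = string(0x0a; base=16)        # one hexadecimal digit
padded = string(0x0a; base=16, pad=2)   # two hexadecimal digits

# hash of an illustrative input, not of the file from the exercise
bytes = sha256("abc")
hex_manual = join(string.(bytes; base=16, pad=2))
hex_builtin = bytes2hex(bytes)
```

Without pad=2, every byte below 0x10 would contribute a single digit, and the resulting string would be shorter than the expected 64 characters.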

Exercise 2

Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip that contains the ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships.

From the file extract deezer_edges.json and deezer_target.csv files and save them to disk.

Solution
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
import ZipFile
archive = ZipFile.Reader("ego.zip")
idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files))
open("deezer_edges.json", "w") do io
    write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files))
open("deezer_target.csv", "w") do io
    write(io, read(archive.files[idx]))
end
close(archive)
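
The only(findall(...)) pattern used in this solution both locates the file and guards against surprises, because only throws an error unless the collection has exactly one element. A minimal sketch on a plain vector of strings (the entry names below are hypothetical stand-ins, not a listing of the real archive):

```julia
# hypothetical entry names standing in for `x.name` of the archive entries
entry_names = ["deezer_ego_nets/deezer_edges.json",
               "deezer_ego_nets/deezer_target.csv",
               "deezer_ego_nets/README.txt"]

# exactly one match: `only` unwraps the one-element vector of indices
idx = only(findall(x -> contains(x, "deezer_edges.json"), entry_names))

# no match: `only` throws instead of silently returning a wrong index
throws_on_no_match = try
    only(findall(x -> contains(x, "no_such_file"), entry_names))
    false
catch err
    err isa ArgumentError
end
```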

Exercise 3

Load the deezer_edges.json and deezer_target.csv files into Julia. The JSON file should be loaded as a JSON3.jl object named edges_json. The CSV file should be loaded into a data frame named target_df.

Solution
using CSV
using DataFrames
using JSON3
edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)

Exercise 4

Check that the keys in edges_json are in the same order as the id column in target_df.

Solution

This is short, but you need a good understanding of Julia types and standard functions to write it properly:

Symbol.(target_df.id) == collect(keys(edges_json))
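
The conversion is needed because the keys of the JSON object are Symbol values, while the id column holds integers. A small sketch with made-up ids (both vectors below are illustrative and not taken from the data):

```julia
ids = [0, 1, 2]                                      # stands in for target_df.id
key_syms = [Symbol("0"), Symbol("1"), Symbol("2")]   # stands in for collect(keys(edges_json))

# element-wise conversion makes the two vectors comparable
same_order = Symbol.(ids) == key_syms

# the same elements in a different order do not compare equal
reversed_order = Symbol.(reverse(ids)) == key_syms

# without conversion the comparison is false: an Int is never equal to a Symbol
no_conversion = ids == key_syms
```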

Exercise 5

From every value stored in edges_json create a graph representing the ego-net of the given node. Store these graphs in a vector that will become the egonet column of the target_df data frame.

Solution
using Graphs
function edgelist2graph(edgelist)
    nodes = sort!(unique(reduce(vcat, edgelist)))
    @assert 0:length(nodes)-1 == nodes
    g = SimpleGraph(length(nodes))
    for (a, b) in edgelist
        add_edge!(g, a+1, b+1)
    end
    return g
end
target_df.egonet = edgelist2graph.(values(edges_json))
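
The first two lines of edgelist2graph verify that the node ids form the contiguous range 0, 1, ..., n-1, which is what justifies the a+1, b+1 shift to 1-based vertex numbering. The check can be tried in isolation on a toy edge list (made up for illustration), using only Base functions:

```julia
# toy edge list with 0-based node ids, made up for illustration
edgelist = [[0, 1], [1, 2], [0, 2]]

# flatten all endpoints, drop duplicates, and sort
nodes = sort!(unique(reduce(vcat, edgelist)))

# a range compares equal to a vector holding the same elements
is_contiguous = 0:length(nodes)-1 == nodes

# shift to 1-based ids, as done before calling add_edge!
shifted = [(a + 1, b + 1) for (a, b) in edgelist]

# an edge list that skips id 1 fails the check
gap_nodes = sort!(unique(reduce(vcat, [[0, 2], [2, 3]])))
has_gap = 0:length(gap_nodes)-1 != gap_nodes
```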

Exercise 6

An ego-net in our data set is a subgraph of the full Deezer graph that contains some node (the ego), all of its neighbors, and all edges between these neighbors. Therefore, we expect the diameter of every ego-net to be at most 2 (as every two nodes are either connected directly or through the ego as a common friend). Check if this is indeed the case. Use the diameter function.

Solution
julia> extrema(diameter.(target_df.egonet))
(2, 2)

Indeed, we see that the diameter of every ego-net is exactly 2.

Exercise 7

For each ego-net find a central node, that is, a node connected to every other node in this network. Use the degree and findall functions to achieve this. Add a center column to the target_df data frame that stores the numbers of the nodes connected to all other nodes in the ego-net.

Next add a column center_len that gives the number of such nodes.

Check how many times different numbers of center nodes are found.

Solution
target_df.center = map(target_df.egonet) do g
    findall(==(nv(g) - 1), degree(g))
end
target_df.center_len = length.(target_df.center)
combine(groupby(target_df, :center_len, sort=true), nrow)

Note that we used map since in this case it gives a convenient way to express the condition we want to check.

We notice that in some cases it is impossible to identify the center of the ego-net uniquely.
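
The non-uniqueness is easy to reproduce: whenever the ego has a neighbor that is itself connected to everyone else, both nodes have degree n - 1. The sketch below uses plain Boolean adjacency matrices instead of Graphs.jl objects (an assumption made only to keep the example dependency-free); the centers helper is hypothetical and mirrors the findall call from the solution.

```julia
# hypothetical helper: indices of nodes adjacent to all other nodes
function centers(adj::AbstractMatrix{Bool})
    n = size(adj, 1)
    deg = vec(sum(adj; dims=2))   # node degrees as row sums
    return findall(==(n - 1), deg)
end

# star on 4 nodes: only node 1 is connected to everyone
star = Bool[0 1 1 1;
            1 0 0 0;
            1 0 0 0;
            1 0 0 0]

# nodes 1 and 2 are both connected to everyone; nodes 3 and 4 are not adjacent
two_hubs = Bool[0 1 1 1;
                1 0 1 1;
                1 1 0 0;
                1 1 0 0]

star_centers = centers(star)
two_hubs_centers = centers(two_hubs)
```

Both toy graphs have diameter 2, yet the second one has two candidate centers.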

Exercise 8

Add the following ego-net features to the target_df data frame:

* size: number of nodes in ego-net
* mean_degree: average node degree in ego-net

Check the mean values of these two columns grouped by the target column.

Solution
using Statistics
target_df.size = nv.(target_df.egonet)
target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size
combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)

It seems that for target equal to 0, the size and the average degree of the network are a bit larger.
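
The expression used for mean_degree follows from the fact that every edge contributes to the degree of exactly two nodes, so the degrees sum to twice the number of edges. A quick check on a toy graph (a star with three edges, made up for illustration):

```julia
using Statistics

n = 4                                  # number of nodes
edge_list = [(1, 2), (1, 3), (1, 4)]   # toy star graph

# count how many edges touch each node
deg = zeros(Int, n)
for (a, b) in edge_list
    deg[a] += 1
    deg[b] += 1
end

mean_from_degrees = mean(deg)
mean_from_formula = 2 * length(edge_list) / n
```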

Exercise 9

Continuing to work with the target_df data frame, create a logistic regression model explaining target by size and mean_degree.

Solution
using GLM
glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())

We see that only size is statistically significant.
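
When reading the fitted coefficients, it may help to recall what the logit link means: the model assumes that the probability of target being 1 is the logistic function of a linear combination of the predictors. A minimal sketch, with coefficient values that are invented for illustration and are not the estimates produced by glm:

```julia
# logistic (inverse logit) function
logistic(x) = 1 / (1 + exp(-x))

# hypothetical coefficients, NOT the fitted ones
b0, b1, b2 = 0.5, -0.02, 0.01

# probability that target == 1 under these made-up coefficients
predict_prob(size, mean_degree) = logistic(b0 + b1 * size + b2 * mean_degree)

p_small = predict_prob(10, 3.0)   # smaller ego-net
p_large = predict_prob(30, 3.0)   # larger ego-net, same mean degree
```

With a negative coefficient on size, as assumed here, the predicted probability decreases as the ego-net grows while mean_degree is held fixed.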

Exercise 10

Continuing to work with target_df, create a scatterplot with size on one axis and mean_degree rounded to the nearest integer on the other axis. Plot the mean of target for each point, that is, for each combination of size and rounded mean_degree.

Additionally, fit a LOESS model explaining target by size. Make a prediction for values of size ranging from its 5% quantile to its 95% quantile (to concentrate on typical values of size).

Solution
using Plots
target_df.round_degree = round.(Int, target_df.mean_degree)
agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean)
scatter(agg_df.size, agg_df.round_degree;
        zcolor=agg_df.target_mean,
        xlabel="size", ylabel="rounded degree",
        label="mean target", xaxis=:log)

It is hard to visually see any strong relationship in the data.
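
One detail of the binning step is worth keeping in mind: round in Julia uses the RoundNearest mode by default, which sends values lying exactly halfway between two integers to the even neighbor. A short illustration on made-up values:

```julia
halves = [0.5, 1.5, 2.5, 3.5]

# default mode: ties go to the nearest even integer
rounded = round.(Int, halves)

# RoundNearestTiesAway rounds ties away from zero instead
rounded_away = [round(Int, x, RoundNearestTiesAway) for x in halves]
```

For the scatterplot this matters only for ego-nets whose mean degree is exactly a half-integer, but it explains why, for example, 2.5 lands in the bin labeled 2.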

import Loess
model = Loess.loess(target_df.size, target_df.target)
size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95)
target_predict = Loess.predict(model, size_predict)
plot(size_predict, target_predict;
     xlabel="size", ylabel="predicted target", legend=false)

Between the 5% and 95% quantiles of size we see a downward-sloping relationship.
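
The prediction grid used in this solution is an ordinary floating-point range whose endpoints are empirical quantiles. The sketch below builds such a grid for a synthetic vector (made up only to show the mechanics; it does not resemble the real size column):

```julia
using Statistics

sizes = collect(1:100)   # synthetic stand-in for target_df.size

lo = quantile(sizes, 0.05)
hi = quantile(sizes, 0.95)

# grid with unit step; it starts at `lo` and its last point never exceeds `hi`
grid = lo:1.0:hi
```

Restricting the grid to this range keeps the predictions inside the region where most observations lie.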