Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 12
Problems
Exercise 1
The https://go.dev/dl/go1.19.2.src.tar.gz link contains the source code
of the Go language, version 1.19.2. As you can check on its website, its SHA-256
checksum is sha = "2ce930d70a931de660fdaf271d70192793b1b240272645bf0275779f6704df6b".
Download this file and check if it indeed has this checksum. You might
need to read the documentation of the string and join functions.
Solution
using Downloads
using SHA
Downloads.download("https://go.dev/dl/go1.19.2.src.tar.gz", "go.tar.gz")
shavec = open(sha256, "go.tar.gz")
shastr = join(string.(shavec; base=16, pad=2))
sha == shastr
The last line should produce true.
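To see why the solution passes pad=2 to string, it may help to run the same conversion on a small input. The sketch below hashes the string "abc" (an illustrative input, not part of the exercise) and compares the manual conversion with bytes2hex from Base, which is an alternative to the approach used in the solution.

```julia
using SHA

# Hash a tiny in-memory input instead of the downloaded file (illustrative only).
bytes = sha256("abc")                      # Vector{UInt8} with 32 elements

# Each byte must become exactly two hexadecimal digits, hence pad=2.
hex_manual = join(string.(bytes; base=16, pad=2))

# Base also provides bytes2hex, which yields the same lowercase hex string.
hex_builtin = bytes2hex(bytes)

# Without padding, a byte below 0x10 loses its leading zero.
unpadded = string(0x0a; base=16)           # "a"
padded = string(0x0a; base=16, pad=2)      # "0a"

println(hex_manual == hex_builtin)
```

If the padding were omitted, the joined string could come out shorter than 64 characters and the comparison with sha would fail even for a correct download.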
Exercise 2
Download the file http://snap.stanford.edu/data/deezer_ego_nets.zip that contains the ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships.
From the file extract deezer_edges.json and deezer_target.csv files and save them to disk.
Solution
Downloads.download("http://snap.stanford.edu/data/deezer_ego_nets.zip", "ego.zip")
import ZipFile
archive = ZipFile.Reader("ego.zip")
idx = only(findall(x -> contains(x.name, "deezer_edges.json"), archive.files))
open("deezer_edges.json", "w") do io
    write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "deezer_target.csv"), archive.files))
open("deezer_target.csv", "w") do io
    write(io, read(archive.files[idx]))
end
close(archive)
Exercise 3
Load the deezer_edges.json and deezer_target.csv files into Julia. The JSON
file should be loaded as a JSON3.jl object edges_json. The
CSV file should be loaded into a data frame target_df.
Solution
using CSV
using DataFrames
using JSON3
edges_json = JSON3.read(read("deezer_edges.json"))
target_df = CSV.read("deezer_target.csv", DataFrame)
Exercise 4
Check that the keys in edges_json are in the same order
as the id column in target_df.
Solution
This is short, but you need a good understanding of Julia types and standard functions to write it properly:
Symbol.(target_df.id) == collect(keys(edges_json))
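The comparison works because both sides are first brought to the same element type. Below is a small sketch with stand-in data: the vectors ids and json_keys are hypothetical and only mimic target_df.id and collect(keys(edges_json)), assuming the keys are Symbols as the solution implies.

```julia
# Hypothetical stand-ins for target_df.id and collect(keys(edges_json)).
ids = [0, 1, 2]
json_keys = [Symbol("0"), Symbol("1"), Symbol("2")]

# Comparing the raw vectors is always false: an Int never equals a Symbol.
naive = ids == json_keys

# Converting the integers to Symbols makes the element types match.
same_order = Symbol.(ids) == json_keys

# The conversion can also go the other way, from Symbols to integers.
same_order_rev = parse.(Int, String.(json_keys)) == ids

# Vector equality is order sensitive, so a permutation is detected.
permuted = Symbol.([0, 2, 1]) == json_keys

println((naive, same_order, same_order_rev, permuted))
```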
Exercise 5
From every value stored in edges_json create a graph
representing the ego-net of the given node. Store these graphs in a vector
that will make up the egonet column of the target_df data frame.
Solution
using Graphs
function edgelist2graph(edgelist)
    nodes = sort!(unique(reduce(vcat, edgelist)))
    @assert 0:length(nodes)-1 == nodes
    g = SimpleGraph(length(nodes))
    for (a, b) in edgelist
        add_edge!(g, a+1, b+1)
    end
    return g
end
target_df.egonet = edgelist2graph.(values(edges_json))
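The helper relies on the node labels in each edge list being exactly 0, 1, ..., n-1, so that adding 1 maps them onto the 1-based vertex numbers used by SimpleGraph. The sketch below replays that check on made-up edge lists (toy_edges and gap_edges are hypothetical) using only Base functions, without building a graph.

```julia
# Hypothetical edge list with 0-based node labels, shaped like one JSON value.
toy_edges = [[0, 1], [0, 2], [1, 2], [0, 3]]

# Collect all labels that occur, remove duplicates, and sort them.
nodes = sort!(unique(reduce(vcat, toy_edges)))

# The labels must be exactly 0, 1, ..., n-1 for the +1 shift to be valid.
labels_ok = 0:length(nodes)-1 == nodes

# Shift every edge to 1-based vertex numbers.
shifted = [(a + 1, b + 1) for (a, b) in toy_edges]

# An edge list that skips label 1 fails the same check.
gap_edges = [[0, 2], [2, 3]]
gap_nodes = sort!(unique(reduce(vcat, gap_edges)))
gap_ok = 0:length(gap_nodes)-1 == gap_nodes

println((labels_ok, shifted, gap_ok))
```

Without the assertion, a gap in the labels would make add_edge! refer to a vertex that does not exist in the graph.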
Exercise 6
An ego-net in our data set is a subgraph of the full Deezer graph that
contains a given node, all of its neighbors, and all edges between those
neighbors. Therefore we expect that the diameter of every
ego-net is at most 2 (as every two nodes are either connected directly
or by a common friend). Check if this is indeed the case. Use the
diameter function.
Solution
julia> extrema(diameter.(target_df.egonet))
(2, 2)
Indeed, we see that the diameter of every ego-net is 2.
Exercise 7
For each ego-net find a central node that is connected to every other
node in this network. Use the degree and findall functions to achieve this.
Add a center column to the target_df data frame that holds the numbers of
the nodes that are connected to all other nodes in the ego-net.
Next add a column center_len that gives the number of such nodes.
Check how many times different numbers of center nodes are found.
Solution
target_df.center = map(target_df.egonet) do g
    findall(==(nv(g) - 1), degree(g))
end
target_df.center_len = length.(target_df.center)
combine(groupby(target_df, :center_len, sort=true), nrow)
Note that we used map since in this case it gives a
convenient way to express the condition we want to check.
We notice that in some cases it is impossible to identify the center of the ego-net uniquely.
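The condition used above is that a center has degree equal to the number of nodes minus one. The following sketch applies the same findall test to plain edge lists instead of Graphs.jl objects; the helpers node_degrees and centers and both toy graphs are hypothetical and exist only for illustration.

```julia
# Count how many edges touch each of the n nodes (nodes are numbered 1..n).
function node_degrees(n, edges)
    deg = zeros(Int, n)
    for (a, b) in edges
        deg[a] += 1
        deg[b] += 1
    end
    return deg
end

# A node is a center when it is adjacent to all n - 1 other nodes.
centers(n, edges) = findall(==(n - 1), node_degrees(n, edges))

# Node 1 is linked to everyone; nodes 2 and 3 are also linked to each other.
one_center = centers(4, [(1, 2), (1, 3), (1, 4), (2, 3)])

# In a triangle every node is linked to every other node.
three_centers = centers(3, [(1, 2), (1, 3), (2, 3)])

println((one_center, three_centers))
```

The triangle shows how an ego-net can have several candidate centers: whenever a neighbor of the ego is itself connected to all other nodes, the degree test cannot tell the two apart.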
Exercise 8
Add the following ego-net features to the target_df data frame:
* size: number of nodes in ego-net
* mean_degree: average node degree in ego-net
Check the mean values of these two columns by the target column.
Solution
using Statistics
target_df.size = nv.(target_df.egonet)
target_df.mean_degree = 2 .* ne.(target_df.egonet) ./ target_df.size
combine(groupby(target_df, :target, sort=true), [:size, :mean_degree] .=> mean)
It seems that for target equal to 0 the size and the average
degree in the network are a bit larger.
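The mean_degree column is computed as twice the number of edges divided by the number of nodes, which works because every edge contributes to the degree of both of its endpoints. A minimal sketch on a made-up edge list (toy_edges is hypothetical) checks this identity directly:

```julia
using Statistics

# Hypothetical graph with 4 nodes (numbered 1..4) and 4 edges.
n = 4
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3)]

# Count node degrees by visiting both endpoints of every edge.
deg = zeros(Int, n)
for (a, b) in toy_edges
    deg[a] += 1
    deg[b] += 1
end

# Average degree computed from the degrees and from the edge count.
mean_from_degrees = mean(deg)
mean_from_edges = 2 * length(toy_edges) / n

println((mean_from_degrees, mean_from_edges))
```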
Exercise 9
Continuing to work with the target_df data frame, create a
logistic regression explaining target by size and mean_degree.
Solution
using GLM
glm(@formula(target~size+mean_degree), target_df, Binomial(), LogitLink())
We see that only size is statistically significant.
Exercise 10
Continuing to work with target_df, create a scatterplot
where size will be on one axis and mean_degree
rounded to the nearest integer on the other axis. Plot the mean of
target for each point being a combination of
size and rounded mean_degree.
Additionally fit a LOESS model explaining target by size.
Make a prediction for values in the range from the 5% to the 95%
quantile (to concentrate on typical values of size).
Solution
using Plots
target_df.round_degree = round.(Int, target_df.mean_degree)
agg_df = combine(groupby(target_df, [:size, :round_degree]), :target => mean)
scatter(agg_df.size, agg_df.round_degree;
        zcolor=agg_df.target_mean,
        xlabel="size", ylabel="rounded degree",
        label="mean target", xaxis=:log)
It is hard to visually see any strong relationship in the data.
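One detail of the rounding step used for the aggregation is worth keeping in mind: round in Julia uses the RoundNearest mode by default, which sends exact ties to the nearest even integer. A quick sketch with made-up values (the vals vector is hypothetical):

```julia
# Hypothetical mean degrees, including two exact ties (2.5 and 3.5).
vals = [2.4, 2.5, 2.6, 3.5]

# Broadcasting round with Int as the target type, as in the solution.
rounded = round.(Int, vals)

println(rounded)   # [2, 2, 3, 4]: 2.5 goes down to 2, 3.5 goes up to 4
```

For averages of node degrees exact ties are possible, so two ego-nets with mean degrees 2.5 and 3.5 land in groups 2 and 4 rather than 3 and 4.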
import Loess
model = Loess.loess(target_df.size, target_df.target)
size_predict = quantile(target_df.size, 0.05):1.0:quantile(target_df.size, 0.95)
target_predict = Loess.predict(model, size_predict)
plot(size_predict, target_predict;
     xlabel="size", ylabel="predicted target", legend=false)
Between the 5% and 95% quantiles of size we see a downward-sloping
relationship.
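The prediction grid in the solution is a floating-point range that starts at the 5% quantile and advances in steps of 1.0, so its points need not be integers and its last point may fall short of the 95% quantile. A sketch with made-up data (the sizes vector is hypothetical and stands in for target_df.size):

```julia
using Statistics

# Hypothetical stand-in for target_df.size.
sizes = collect(1:100)

# quantile interpolates between order statistics, so the bounds are floats.
lo = quantile(sizes, 0.05)    # approximately 5.95
hi = quantile(sizes, 0.95)    # approximately 95.05

# Same construction as size_predict in the solution.
grid = lo:1.0:hi

println((first(grid), last(grid), length(grid)))
```

Restricting the grid to this interval also keeps every prediction point inside the range of the data used to fit the model.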