11 KiB
Julia for Data Analysis
Bogumił Kamiński, Daniel Kaszyński
Chapter 14
Problems
Exercise 1
Write a simulator that takes one parameter n
. Next we
assume we draw n
times from this set with replacement. The
simulator should return the average number of items from this set that
were drawn at least once.
Call the function running this simulation boot
.
Exercise 2
Now write a function simboot
that takes parameters
n
and k
and runs the simulation defined in the
boot
function k
times. It should return a
named tuple storing k
, n
, mean of produced
values, and ends of approximated 95% confidence interval of the
result.
Make this function single threaded. Check how long this function runs
for n=1000
and k=1_000_000
.
Exercise 3
Now rewrite this simulator to be multi threaded. Use 4 cores for
benchmarking. Call the function simbootT
. Check how long
this function runs for n=1000
and
k=1_000_000
.
Exercise 4
Now rewrite boot
and simbootT
to perform
less allocations. Achieve this by making sure that all allocated objects
are passed to boot
function (so that it does not do any
allocations internally). Call these new functions boot!
and
simbootT2
. You might need to use the
Threads.threadid
and Threads.nthreads
functions.
Exercise 5
Use either of the solutions we have developed in the previous
exercises to create a web service taking k
and
n
parameters and returning the values produced by
boot
functions and time to run the simulation. You might
want to use the @timed
macro in your solution.
Start the server.
Exercise 6
Query the server started in the exercise 5 with the following
parameters: * k=1000
and n=1000
*
k=1.5
and n=1000
Exercise 7
Collect the data generated by a web service into the df
data frame for k = [10^i for i in 3:6]
and
n = [10^i for i in 1:3]
.
Exercise 8
Replace the value
column in the df
data
frame by its contents in-place.
Exercise 9
Checks that execution time roughly scales proportionally to the
product of k
times n
.
Exercise 10
Plot the expected fraction of seen elements in the set as a function
of n
by k
along with 95% confidence interval
around these values.
Solutions
Show!
Exercise 1
Solution (there are many other approaches you could use):
using Statistics
function boot(n::Integer)
table = falses(n)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
Exercise 2
Solution:
function simboot(k::Integer, n::Integer)
result = [boot(n) for _ in 1:k]
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
We run it twice to make sure everything is compiled:
julia> @time simboot(1000, 1_000_000)
7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time)
(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944)
julia> @time simboot(1000, 1_000_000)
7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time)
(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754)
We see that on my computer the run time is around 7 seconds.
Exercise 3
Solution:
using ThreadsX
function simbootT(k::Integer, n::Integer)
result = ThreadsX.map(i -> boot(n), 1:k)
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
Here is the timing for four threads:
julia> @time simbootT(1000, 1_000_000)
2.390795 seconds (3.37 k allocations: 119.434 MiB)
(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484)
julia> @time simbootT(1000, 1_000_000)
2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time)
(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554)
Indeed we see a significant performance improvement.
Exercise 4
Solution:
function boot!(n::Integer, pool)
table = pool[Threads.threadid()]
fill!(table, false)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
function simbootT2(k::Integer, n::Integer)
pool = [falses(n) for _ in 1:Threads.nthreads()]
result = ThreadsX.map(i -> boot!(n, pool), 1:k)
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
In the solution the pool
vector keeps table
vector individually for each thread. Let us test the timing:
julia> @time simbootT2(1000, 1_000_000)
2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)
julia> @time simbootT2(1000, 1_000_000)
2.340694 seconds (391 allocations: 586.453 KiB)
(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)
Indeed, we see that the number of allocations was decreased, which should lower GC usage. However, the runtime of the simulation is similar since in this task memory allocation does not account for a significant portion of the runtime.
Exercise 5
Solution (I used the simplest single-threaded code here; this is a complete code of the web service):
using Genie
using Statistics
function boot(n::Integer)
table = falses(n)
for _ in 1:n
table[rand(1:n)] = true
end
return mean(table)
end
function simboot(k::Integer, n::Integer)
result = [boot(n) for _ in 1:k]
mv = mean(result)
sdv = std(result)
lo95 = mv - 1.96 * sdv / sqrt(k)
hi95 = mv + 1.96 * sdv / sqrt(k)
return (; k, n, mv, lo95, hi95)
end
Genie.config.run_as_server = true
Genie.Router.route("/", method=POST) do
message = Genie.Requests.jsonpayload()
return try
k = message["k"]
n = message["n"]
value, time = @timed simboot(k, n)
Genie.Renderer.Json.json((status="OK", time=time, value=value))
catch
Genie.Renderer.Json.json((status="ERROR", time="", value=""))
end
end
Genie.Server.up()
Exercise 6
Solution:
julia> using HTTP
julia> using JSON3
julia> HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((k=1000, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked
{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}"""
julia> HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((k=1.5, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked
{"status":"ERROR","time":"","value":""}"""
As expected we got a positive answer the first time and an error on the second call.
Exercise 7
Solution:
using DataFrames
df = DataFrame()
for k in [10^i for i in 3:6], n in [10^i for i in 1:3]
@show k, n
req = HTTP.post("http://127.0.0.1:8000",
["Content-Type" => "application/json"],
JSON3.write((; k, n)))
push!(df, NamedTuple(JSON3.read(req.body)))
end
Note that I convert JSON3.Object
into a
NamedTuple
to easily push!
it into the
df
data frame.
Let us have a look at the produced data frame:
julia> df
12×3 DataFrame
Row │ status time value
│ String Float64 Object…
─────┼──────────────────────────────────────────────────────
1 │ OK 0.0006784 {\n "k": 1000,\n "n": …
2 │ OK 0.0038374 {\n "k": 1000,\n "n": …
3 │ OK 0.0150844 {\n "k": 1000,\n "n": …
4 │ OK 0.0014071 {\n "k": 10000,\n "n":…
5 │ OK 0.008443 {\n "k": 10000,\n "n":…
6 │ OK 0.0700319 {\n "k": 10000,\n "n":…
7 │ OK 0.0253826 {\n "k": 100000,\n "n"…
8 │ OK 0.0795937 {\n "k": 100000,\n "n"…
9 │ OK 0.708287 {\n "k": 100000,\n "n"…
10 │ OK 0.160286 {\n "k": 1000000,\n "n…
11 │ OK 0.803433 {\n "k": 1000000,\n "n…
12 │ OK 7.23958 {\n "k": 1000000,\n "n…
Exercise 8
Solution:
julia> select!(df, :status, :time, :value => AsTable)
12×7 DataFrame
Row │ status time k n mv lo95 hi95
│ String Float64 Int64 Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ OK 0.0006784 1000 10 0.6469 0.640745 0.653055
2 │ OK 0.0038374 1000 100 0.63508 0.633035 0.637125
3 │ OK 0.0150844 1000 1000 0.632178 0.631581 0.632775
4 │ OK 0.0014071 10000 10 0.65239 0.650425 0.654355
5 │ OK 0.008443 10000 100 0.634456 0.633845 0.635067
6 │ OK 0.0700319 10000 1000 0.63207 0.631878 0.632262
7 │ OK 0.0253826 100000 10 0.651411 0.650793 0.652029
8 │ OK 0.0795937 100000 100 0.634 0.633807 0.634193
9 │ OK 0.708287 100000 1000 0.63224 0.632179 0.632302
10 │ OK 0.160286 1000000 10 0.65129 0.651095 0.651486
11 │ OK 0.803433 1000000 100 0.633995 0.633934 0.634056
12 │ OK 7.23958 1000000 1000 0.63232 0.632301 0.63234
Exercise 9
Solution:
julia> using DataFramesMeta
julia> @chain df begin
@rselect(:k, :n, :avg_time = :time / (:k * :n))
unstack(:k, :n, :avg_time)
end
4×4 DataFrame
Row │ k 10 100 1000
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────
1 │ 1000 6.784e-8 3.8374e-8 1.50844e-8
2 │ 10000 1.4071e-8 8.443e-9 7.00319e-9
3 │ 100000 2.53826e-8 7.95937e-9 7.08287e-9
4 │ 1000000 1.60286e-8 8.03433e-9 7.23958e-9
We see that indeed this is the case. For large k
and
n
the average time per single sample stabilizes (for small
values the runtime is low so the timing is more affected by external
noise and the other operations that the functions do affect the results
more).
Exercise 10
Solution:
using Plots
gdf = groupby(df, :k, sort=true)
plot([bar(string.(g.n), g.mv;
ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
legend=false, title=first(g.k),
yerror=(g.mv - g.lo95, g.hi95-g.mv)) for g in gdf]...)
As expected error bandwidth gets smaller as k
encreases.
Note that as n
increases the estimated value tends to
1-exp(-1)
.