JuliaForDataAnalysis/exercises/exercises14.md
Bogumił Kamiński 3b8ffa5d40 add exercises
2022-10-14 12:27:04 +02:00


Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 14

Problems

Exercise 1

Write a simulator that takes one parameter n. Assume that we have a set of n items and that we draw from this set n times with replacement. The simulator should return the fraction of items from the set that were drawn at least once.

Call the function running this simulation boot.

Exercise 2

Now write a function simboot that takes parameters k and n and runs the simulation defined in the boot function k times. It should return a named tuple storing k, n, the mean of the produced values, and the endpoints of an approximate 95% confidence interval for the result.

Make this function single-threaded. Check how long this function runs for k=1000 and n=1_000_000.

Exercise 3

Now rewrite this simulator to be multithreaded; call the function simbootT. Use four threads for benchmarking. Check how long this function runs for k=1000 and n=1_000_000.

Exercise 4

Now rewrite boot and simbootT to perform fewer allocations. Achieve this by making sure that all allocated objects are passed to the boot function (so that it does not allocate anything internally). Call these new functions boot! and simbootT2. You might need to use the Threads.threadid and Threads.nthreads functions.

Exercise 5

Use either of the solutions developed in the previous exercises to create a web service that takes k and n parameters and returns the values produced by the simulation along with the time it took to run. You might want to use the @timed macro in your solution.

Start the server.

Exercise 6

Query the server started in exercise 5 with the following parameters:

* k=1000 and n=1000
* k=1.5 and n=1000

Exercise 7

Collect the data generated by the web service into the df data frame for k = [10^i for i in 3:6] and n = [10^i for i in 1:3].

Exercise 8

Expand the contents of the value column of the df data frame into separate columns, replacing the original column in place.

Exercise 9

Check that execution time scales roughly proportionally to the product of k and n.

Exercise 10

Plot the expected fraction of seen elements in the set as a function of n, for each value of k, along with a 95% confidence interval around these values.

Solutions


Exercise 1

Solution (there are many other approaches you could use):

using Statistics

function boot(n::Integer)
    table = falses(n)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end
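As a quick sanity check of the simulator (the seed and replication count below are my choices, not part of the original solution), averaging many runs for n=1000 gives a value close to 0.632, the number that reappears throughout the later exercises:

```julia
using Statistics, Random

# Simulator from the solution above
function boot(n::Integer)
    table = falses(n)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

Random.seed!(42)  # fix the RNG so the check is reproducible
estimate = mean(boot(1000) for _ in 1:10_000)  # ≈ 0.632
```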

Exercise 2

Solution:

function simboot(k::Integer, n::Integer)
    result = [boot(n) for _ in 1:k]
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end

We run it twice so that the second timing is not affected by compilation:

julia> @time simboot(1000, 1_000_000)
  7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time)
(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944)

julia> @time simboot(1000, 1_000_000)
  7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time)
(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754)

We see that on my computer the run time is around 7 seconds.

Exercise 3

Solution:

using ThreadsX

function simbootT(k::Integer, n::Integer)
    result = ThreadsX.map(i -> boot(n), 1:k)
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end

Here is the timing for four threads:

julia> @time simbootT(1000, 1_000_000)
  2.390795 seconds (3.37 k allocations: 119.434 MiB)
(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484)

julia> @time simbootT(1000, 1_000_000)
  2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time)
(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554)

Indeed we see a significant performance improvement.
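Note that ThreadsX only helps if Julia itself was started with multiple threads (for example with `julia -t 4`). If you want to avoid the extra dependency, the same idea can be sketched with Base's Threads.@threads alone; the function name simboot_threads below is mine:

```julia
using Base.Threads, Statistics

# Simulator from exercise 1
function boot(n::Integer)
    table = falses(n)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

# Dependency-free alternative to ThreadsX.map: preallocate the result vector
# and let @threads split the k iterations across the available threads
function simboot_threads(k::Integer, n::Integer)
    result = Vector{Float64}(undef, k)
    @threads for i in 1:k
        result[i] = boot(n)
    end
    mv = mean(result)
    half = 1.96 * std(result) / sqrt(k)
    return (; k, n, mv, lo95=mv - half, hi95=mv + half)
end

res = simboot_threads(200, 1000)
```

With a single thread this falls back to a plain sequential loop, so the returned statistics are the same either way; only the wall-clock time differs.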

Exercise 4

Solution:

function boot!(n::Integer, pool)
    table = pool[Threads.threadid()]
    fill!(table, false)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

function simbootT2(k::Integer, n::Integer)
    pool = [falses(n) for _ in 1:Threads.nthreads()]
    result = ThreadsX.map(i -> boot!(n, pool), 1:k)
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end

In this solution the pool vector stores a separate table vector for each thread, so boot! can reuse a preallocated buffer instead of allocating a fresh one on every call. Let us test the timing:

julia> @time simbootT2(1000, 1_000_000)
  2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)

julia> @time simbootT2(1000, 1_000_000)
  2.340694 seconds (391 allocations: 586.453 KiB)
(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)

Indeed, we see that the number of allocations decreased, which should lower GC pressure. However, the runtime of the simulation is similar, since in this task memory allocation does not account for a significant portion of the runtime.
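You can also inspect the allocation behavior of the hot loop directly with the @allocated macro. Below is a simplified single-buffer variant (my own sketch, named boot2! to avoid clashing with the pool-based boot! above: it takes the buffer directly instead of indexing into a per-thread pool), which after a warm-up call should allocate essentially nothing:

```julia
using Statistics

# Simplified variant of boot! that reuses one caller-provided buffer
function boot2!(table::AbstractVector{Bool}, n::Integer)
    fill!(table, false)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

table = falses(1000)
boot2!(table, 1000)                     # warm up so compilation is not measured
bytes = @allocated boot2!(table, 1000)  # should be zero or just a few bytes
```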

Exercise 5

Solution (I used the simplest single-threaded code here; below is the complete code of the web service):

using Genie
using Statistics

function boot(n::Integer)
    table = falses(n)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

function simboot(k::Integer, n::Integer)
    result = [boot(n) for _ in 1:k]
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end

Genie.config.run_as_server = true

Genie.Router.route("/", method=POST) do
  message = Genie.Requests.jsonpayload()
  return try
      k = message["k"]
      n = message["n"]
      value, time = @timed simboot(k, n)
      Genie.Renderer.Json.json((status="OK", time=time, value=value))
  catch
      Genie.Renderer.Json.json((status="ERROR", time="", value=""))
  end
end

Genie.Server.up()
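The @timed macro used in the route handler is part of Base: it returns a named tuple whose first two elements are the value of the expression and the elapsed time in seconds, which is why the handler can destructure it as `value, time = @timed simboot(k, n)`. A standalone illustration:

```julia
# @timed returns (value, time, bytes, gctime, ...); iterating it yields
# the value first and the elapsed time second
stats = @timed sum(1:10_000)
stats.value  # 50005000, the result of the timed expression
stats.time   # elapsed wall-clock time in seconds
```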

Exercise 6

Solution:

julia> using HTTP

julia> using JSON3

julia> HTTP.post("http://127.0.0.1:8000",
                 ["Content-Type" => "application/json"],
                 JSON3.write((k=1000, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked

{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}"""

julia> HTTP.post("http://127.0.0.1:8000",
                 ["Content-Type" => "application/json"],
                 JSON3.write((k=1.5, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked

{"status":"ERROR","time":"","value":""}"""

As expected, we got a positive answer the first time and an error on the second call: k=1.5 is not an integer, so simboot throws a MethodError, which the try/catch block turns into the ERROR response.

Exercise 7

Solution:

using DataFrames

df = DataFrame()
for k in [10^i for i in 3:6], n in [10^i for i in 1:3]
    @show k, n
    req = HTTP.post("http://127.0.0.1:8000",
                    ["Content-Type" => "application/json"],
                    JSON3.write((; k, n)))
    push!(df, NamedTuple(JSON3.read(req.body)))
end

Note that I convert JSON3.Object into a NamedTuple to easily push! it into the df data frame.

Let us have a look at the produced data frame:

julia> df
12×3 DataFrame
 Row │ status  time       value
     │ String  Float64    Object…
─────┼──────────────────────────────────────────────────────
   1 │ OK      0.0006784  {\n      "k": 1000,\n      "n": …
   2 │ OK      0.0038374  {\n      "k": 1000,\n      "n": …
   3 │ OK      0.0150844  {\n      "k": 1000,\n      "n": …
   4 │ OK      0.0014071  {\n      "k": 10000,\n      "n":…
   5 │ OK      0.008443   {\n      "k": 10000,\n      "n":…
   6 │ OK      0.0700319  {\n      "k": 10000,\n      "n":…
   7 │ OK      0.0253826  {\n      "k": 100000,\n      "n"…
   8 │ OK      0.0795937  {\n      "k": 100000,\n      "n"…
   9 │ OK      0.708287   {\n      "k": 100000,\n      "n"…
  10 │ OK      0.160286   {\n      "k": 1000000,\n      "n…
  11 │ OK      0.803433   {\n      "k": 1000000,\n      "n…
  12 │ OK      7.23958    {\n      "k": 1000000,\n      "n…

Exercise 8

Solution:

julia> select!(df, :status, :time, :value => AsTable)
12×7 DataFrame
 Row │ status  time       k        n      mv        lo95      hi95
     │ String  Float64    Int64    Int64  Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ OK      0.0006784     1000     10  0.6469    0.640745  0.653055
   2 │ OK      0.0038374     1000    100  0.63508   0.633035  0.637125
   3 │ OK      0.0150844     1000   1000  0.632178  0.631581  0.632775
   4 │ OK      0.0014071    10000     10  0.65239   0.650425  0.654355
   5 │ OK      0.008443     10000    100  0.634456  0.633845  0.635067
   6 │ OK      0.0700319    10000   1000  0.63207   0.631878  0.632262
   7 │ OK      0.0253826   100000     10  0.651411  0.650793  0.652029
   8 │ OK      0.0795937   100000    100  0.634     0.633807  0.634193
   9 │ OK      0.708287    100000   1000  0.63224   0.632179  0.632302
  10 │ OK      0.160286   1000000     10  0.65129   0.651095  0.651486
  11 │ OK      0.803433   1000000    100  0.633995  0.633934  0.634056
  12 │ OK      7.23958    1000000   1000  0.63232   0.632301  0.63234

Exercise 9

Solution:

julia> using DataFramesMeta

julia> @chain df begin
           @rselect(:k, :n, :avg_time = :time / (:k * :n))
           unstack(:k, :n, :avg_time)
       end
4×4 DataFrame
 Row │ k        10          100         1000
     │ Int64    Float64?    Float64?    Float64?
─────┼─────────────────────────────────────────────
   1 │    1000  6.784e-8    3.8374e-8   1.50844e-8
   2 │   10000  1.4071e-8   8.443e-9    7.00319e-9
   3 │  100000  2.53826e-8  7.95937e-9  7.08287e-9
   4 │ 1000000  1.60286e-8  8.03433e-9  7.23958e-9

We see that this is indeed the case: for large k and n the average time per single sample stabilizes. (For small values the total runtime is short, so the timing is more affected by external noise and by the fixed overhead of the other operations the functions perform.)

Exercise 10

Solution:

using Plots
gdf = groupby(df, :k, sort=true)
plot([bar(string.(g.n), g.mv;
          ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
          legend=false, title=first(g.k),
          yerror=(g.mv - g.lo95, g.hi95-g.mv)) for g in gdf]...)

As expected, the error bars get narrower as k increases. Note that as n increases the estimated value tends to 1-exp(-1).
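The limit mentioned above has a short derivation: a given item is missed in all n draws with probability (1 - 1/n)^n, so the expected fraction of seen items is 1 - (1 - 1/n)^n, which tends to 1 - exp(-1) ≈ 0.6321 as n grows. Evaluating this for the values of n used here reproduces the levels of the bars (the helper name seen is mine):

```julia
# Expected fraction of items drawn at least once after n draws with replacement
seen(n) = 1 - (1 - 1 / n)^n

[seen(n) for n in (10, 100, 1000)]  # ≈ [0.6513, 0.6340, 0.6323]
1 - exp(-1)                         # ≈ 0.6321, the n → ∞ limit
```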