# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 14

# Problems

### Exercise 1

Write a simulator that takes one parameter `n`. Next we assume we draw
`n` times from this set with replacement. The simulator should return the
average number of items from this set that were drawn at least once.

Call the function running this simulation `boot`.

<details>
<summary>Solution</summary>

There are many other approaches you could use:

```
using Statistics

function boot(n::Integer)
    table = falses(n)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end
```

</details>

### Exercise 2

Now write a function `simboot` that takes parameters `n` and `k` and runs
the simulation defined in the `boot` function `k` times. It should return
a named tuple storing `k`, `n`, mean of produced values, and ends of
approximated 95% confidence interval of the result.

Make this function single threaded. Check how long
this function runs for `n=1000` and `k=1_000_000`.

<details>
<summary>Solution</summary>

```
function simboot(k::Integer, n::Integer)
    result = [boot(n) for _ in 1:k]
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end
```

We run it twice to make sure everything is compiled:
```
julia> @time simboot(1000, 1_000_000)
  7.113436 seconds (3.00 k allocations: 119.347 MiB, 0.24% gc time)
(k = 1000, n = 1000000, mv = 0.632128799, lo95 = 0.6321282057815055, hi95 = 0.6321293922184944)

julia> @time simboot(1000, 1_000_000)
  7.058031 seconds (3.00 k allocations: 119.347 MiB, 0.19% gc time)
(k = 1000, n = 1000000, mv = 0.632112942, lo95 = 0.6321123461087246, hi95 = 0.6321135378912754)
```

We see that on my computer the run time is around 7 seconds.

</details>

### Exercise 3

Now rewrite this simulator to be multi threaded. Use 4 cores for benchmarking.
Call the function `simbootT`. Check how long this function runs for `n=1000` and
`k=1_000_000`.

<details>
<summary>Solution</summary>

```
using ThreadsX

function simbootT(k::Integer, n::Integer)
    result = ThreadsX.map(i -> boot(n), 1:k)
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end
```

Here is the timing for four threads:
```
julia> @time simbootT(1000, 1_000_000)
  2.390795 seconds (3.37 k allocations: 119.434 MiB)
(k = 1000, n = 1000000, mv = 0.632117067, lo95 = 0.6321164425245517, hi95 = 0.6321176914754484)

julia> @time simbootT(1000, 1_000_000)
  2.435889 seconds (3.38 k allocations: 119.434 MiB, 1.13% gc time)
(k = 1000, n = 1000000, mv = 0.6321205520000001, lo95 = 0.6321199284351448, hi95 = 0.6321211755648554)
```

Indeed we see a significant performance improvement.

</details>

### Exercise 4

Now rewrite `boot` and `simbootT` to perform less allocations. Achieve this by
making sure that all allocated objects are passed to `boot` function (so that it
does not do any allocations internally). Call these new functions `boot!` and
`simbootT2`. You might need to use the `Threads.threadid` and `Threads.nthreads`
functions.

<details>
<summary>Solution</summary>

```
function boot!(n::Integer, pool)
    table = pool[Threads.threadid()]
    fill!(table, false)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

function simbootT2(k::Integer, n::Integer)
    pool = [falses(n) for _ in 1:Threads.nthreads()]
    result = ThreadsX.map(i -> boot!(n, pool), 1:k)
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end
```

In the solution the `pool` vector keeps `table` vector
individually for each thread. Let us test the timing:

```
julia> @time simbootT2(1000, 1_000_000)
  2.424664 seconds (3.69 k allocations: 746.042 KiB, 1.75% compilation time: 5% of which was recompilation)
(k = 1000, n = 1000000, mv = 0.632119321, lo95 = 0.6321186866457794, hi95 = 0.6321199553542206)

julia> @time simbootT2(1000, 1_000_000)
  2.340694 seconds (391 allocations: 586.453 KiB)
(k = 1000, n = 1000000, mv = 0.6321318470000001, lo95 = 0.6321312368042945, hi95 = 0.6321324571957058)
```

Indeed, we see that the number of allocations was decreased, which should lower
GC usage. However, the runtime of the simulation is similar since in this task
memory allocation does not account for a significant portion of the runtime.

</details>

### Exercise 5

Use either of the solutions we have developed in the previous exercises to
create a web service taking `k` and `n` parameters and returning the values
produced by `boot` functions and time to run the simulation. You might want to
use the `@timed` macro in your solution.

Start the server.

<details>
<summary>Solution</summary>

I used the simplest single-threaded code here; this is a complete
code of the web service:

```
using Genie
using Statistics

function boot(n::Integer)
    table = falses(n)
    for _ in 1:n
        table[rand(1:n)] = true
    end
    return mean(table)
end

function simboot(k::Integer, n::Integer)
    result = [boot(n) for _ in 1:k]
    mv = mean(result)
    sdv = std(result)
    lo95 = mv - 1.96 * sdv / sqrt(k)
    hi95 = mv + 1.96 * sdv / sqrt(k)
    return (; k, n, mv, lo95, hi95)
end

Genie.config.run_as_server = true

Genie.Router.route("/", method=POST) do
  message = Genie.Requests.jsonpayload()
  return try
      k = message["k"]
      n = message["n"]
      value, time = @timed simboot(k, n)
      Genie.Renderer.Json.json((status="OK", time=time, value=value))
  catch
      Genie.Renderer.Json.json((status="ERROR", time="", value=""))
  end
end

Genie.Server.up()
```

</details>

### Exercise 6

Query the server started in the exercise 5 with
the following parameters:
* `k=1000` and `n=1000`
* `k=1.5` and `n=1000`

<details>
<summary>Solution</summary>

```
julia> using HTTP

julia> using JSON3

julia> HTTP.post("http://127.0.0.1:8000",
                 ["Content-Type" => "application/json"],
                 JSON3.write((k=1000, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked

{"status":"OK","time":0.2385469,"value":{"k":1000,"n":1000,"mv":0.6323970000000001,"lo95":0.6317754483212517,"hi95":0.6330185516787485}}"""

julia> HTTP.post("http://127.0.0.1:8000",
                 ["Content-Type" => "application/json"],
                 JSON3.write((k=1.5, n=1000)))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Genie/Julia/1.8.2
Transfer-Encoding: chunked

{"status":"ERROR","time":"","value":""}"""
```

As expected we got a positive answer the first time and an error on the second call.

</details>

### Exercise 7

Collect the data generated by a web service into the `df` data frame for
`k = [10^i for i in 3:6]` and `n = [10^i for i in 1:3]`.

<details>
<summary>Solution</summary>

```
using DataFrames

df = DataFrame()
for k in [10^i for i in 3:6], n in [10^i for i in 1:3]
    @show k, n
    req = HTTP.post("http://127.0.0.1:8000",
                    ["Content-Type" => "application/json"],
                    JSON3.write((; k, n)))
    push!(df, NamedTuple(JSON3.read(req.body)))
end
```

Note that I convert `JSON3.Object` into a `NamedTuple` to easily `push!`
it into the `df` data frame.

Let us have a look at the produced data frame:

```
julia> df
12×3 DataFrame
 Row │ status  time       value
     │ String  Float64    Object…
─────┼──────────────────────────────────────────────────────
   1 │ OK      0.0006784  {\n      "k": 1000,\n      "n": …
   2 │ OK      0.0038374  {\n      "k": 1000,\n      "n": …
   3 │ OK      0.0150844  {\n      "k": 1000,\n      "n": …
   4 │ OK      0.0014071  {\n      "k": 10000,\n      "n":…
   5 │ OK      0.008443   {\n      "k": 10000,\n      "n":…
   6 │ OK      0.0700319  {\n      "k": 10000,\n      "n":…
   7 │ OK      0.0253826  {\n      "k": 100000,\n      "n"…
   8 │ OK      0.0795937  {\n      "k": 100000,\n      "n"…
   9 │ OK      0.708287   {\n      "k": 100000,\n      "n"…
  10 │ OK      0.160286   {\n      "k": 1000000,\n      "n…
  11 │ OK      0.803433   {\n      "k": 1000000,\n      "n…
  12 │ OK      7.23958    {\n      "k": 1000000,\n      "n…
```

</details>

### Exercise 8

Replace the `value` column in the `df` data frame by its contents in-place.

<details>
<summary>Solution</summary>

```
julia> select!(df, :status, :time, :value => AsTable)
12×7 DataFrame
 Row │ status  time       k        n      mv        lo95      hi95
     │ String  Float64    Int64    Int64  Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ OK      0.0006784     1000     10  0.6469    0.640745  0.653055
   2 │ OK      0.0038374     1000    100  0.63508   0.633035  0.637125
   3 │ OK      0.0150844     1000   1000  0.632178  0.631581  0.632775
   4 │ OK      0.0014071    10000     10  0.65239   0.650425  0.654355
   5 │ OK      0.008443     10000    100  0.634456  0.633845  0.635067
   6 │ OK      0.0700319    10000   1000  0.63207   0.631878  0.632262
   7 │ OK      0.0253826   100000     10  0.651411  0.650793  0.652029
   8 │ OK      0.0795937   100000    100  0.634     0.633807  0.634193
   9 │ OK      0.708287    100000   1000  0.63224   0.632179  0.632302
  10 │ OK      0.160286   1000000     10  0.65129   0.651095  0.651486
  11 │ OK      0.803433   1000000    100  0.633995  0.633934  0.634056
  12 │ OK      7.23958    1000000   1000  0.63232   0.632301  0.63234
```

</details>

### Exercise 9

Checks that execution time roughly scales proportionally to the product
of `k` times `n`.

<details>
<summary>Solution</summary>

```
julia> using DataFramesMeta

julia> @chain df begin
           @rselect(:k, :n, :avg_time = :time / (:k * :n))
           unstack(:k, :n, :avg_time)
       end
4×4 DataFrame
 Row │ k        10          100         1000
     │ Int64    Float64?    Float64?    Float64?
─────┼─────────────────────────────────────────────
   1 │    1000  6.784e-8    3.8374e-8   1.50844e-8
   2 │   10000  1.4071e-8   8.443e-9    7.00319e-9
   3 │  100000  2.53826e-8  7.95937e-9  7.08287e-9
   4 │ 1000000  1.60286e-8  8.03433e-9  7.23958e-9
```

We see that indeed this is the case. For large `k` and `n` the average time per
single sample stabilizes (for small values the runtime is low so the timing is
more affected by external noise and the other operations that the functions do
affect the results more).

</details>

### Exercise 10

Plot the expected fraction of seen elements in the set as a function of
`n` by `k` along with 95% confidence interval around these values.

<details>
<summary>Solution</summary>

```
using Plots
gdf = groupby(df, :k, sort=true)
plot([bar(string.(g.n), g.mv;
          ylim=(0.62, 0.66), xlabel="n", ylabel="estimate",
          legend=false, title=first(g.k),
          yerror=(g.mv - g.lo95, g.hi95-g.mv)) for g in gdf]...)
```

As expected error bandwidth gets smaller as `k` increases.
Note that as `n` increases the estimated value tends to `1-exp(-1)`.

</details>