SoL updates

This commit is contained in:
NT 2021-01-26 11:20:00 +08:00
parent f39cc81873
commit b8f381b14a
3 changed files with 75 additions and 50 deletions


@ -3,7 +3,7 @@
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "SoL-karman2d.ipynb",
"name": "diffphys-code-sol.ipynb",
"provenance": [],
"collapsed_sections": []
},
@ -22,9 +22,11 @@
"# Reducing Numerical Errors with Deep Learning\n",
"\n",
"Next, we'll target numerical errors that arise in the discretization of a continuous PDE $\\mathcal P^*$, i.e. when we formulate $\\mathcal P$. This approach will demonstrate that, despite the lack of closed-form descriptions, discretization errors often are functions with regular and repeating structures and, thus, can be learned by a neural network. Once the network is trained, it can be evaluated locally to improve the solution of a PDE-solver, i.e., to reduce its numerical error. The resulting method is a hybrid one: it will always run (a coarse) PDE solver, and then improve it at runtime with corrections inferred by an NN.\n",
"\n",
" \n",
"Pretty much all numerical methods contain some form of iterative process: repeated updates over time for explicit solvers, or within a single update step for implicit solvers. Below we'll target iterations over time; an example of the second case can be found [here](https://github.com/tum-pbs/CG-Solver-in-the-Loop).\n",
"\n",
"## Problem Formulation\n",
"\n",
"In the context of reducing errors, it's crucial to have a _differentiable physics solver_, so that the learning process can take the reaction of the solver into account. This interaction is not possible with supervised learning or PINN training. Even small inference errors of a supervised NN can accumulate over time, and lead to a data distribution that differs from the distribution of the pre-computed data. This distribution shift can lead to sub-optimal results, or even cause blow-ups of the solver.\n",
"\n",
"In order to learn the error function, we'll consider two different discretizations of the same PDE $\\mathcal P^*$: \n",
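The hybrid rollout $(\mathcal P_s \mathcal C)^n$ applied to a downsampled reference state can be sketched with a toy problem. The following is a minimal illustration, not the notebook's actual solver or network: `coarse_step` is a hypothetical stand-in for the source solver $\mathcal P_s$ with a systematic error, and `correction` stands in for the learned operator $\mathcal C$.

```python
import numpy as np

def coarse_step(v):
    # hypothetical stand-in for the source solver P_s, with a systematic error term
    return 0.9 * v + 0.01

def correction(v, theta):
    # hypothetical stand-in for the learned correction C(v; theta)
    return v * (1.0 + theta)

def rollout(v0, theta, n):
    # apply (P_s C)^n: solver step, then correction, n times in a row
    v = v0
    for _ in range(n):
        v = correction(coarse_step(v), theta)
    return v

# evolve a downsampled initial reference state for n steps
state = rollout(np.ones(4), theta=0.05, n=10)
```

Training then amounts to choosing `theta` so that the result of `rollout` stays close to the projected reference trajectory.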
@ -36,7 +38,7 @@
"\n",
"```{figure} resources/diffphys-sol-manifolds.jpeg\n",
"---\n",
"height: 280px\n",
"height: 150px\n",
"name: diffphys-sol-manifolds\n",
"---\n",
"Visual overview of coarse and reference manifolds\n",
@ -88,9 +90,7 @@
"The overall learning goal now becomes\n",
"\n",
"$\n",
"\\text{argmin}_\\theta | \n",
"( \\pdec \\corr )^n ( \\project \\vr{t} )\n",
"- \\project \\vr{t}|^2\n",
"\\text{argmin}_\\theta | ( \\pdec \\corr )^n ( \\project \\vr{t} ) - \\project \\vr{t}|^2\n",
"$\n",
"\n",
"A crucial bit here that's easy to overlook is that the correction depends on the modified states, i.e.\n",
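This dependence on the modified states is exactly why the solver must be differentiable: the loss of an $n$-step rollout depends on $\theta$ through every intermediate, already-corrected state. A tiny numerical illustration (a hypothetical toy problem, using finite differences instead of backpropagation) shows that the end-to-end chain of solver and correction has a well-defined gradient:

```python
import numpy as np

def coarse_step(v):
    # toy stand-in for the source solver
    return 0.9 * v + 0.01

def unrolled_loss(theta, v0, target, n=5):
    # the correction always acts on states already modified in earlier steps
    v = v0
    for _ in range(n):
        v = coarse_step(v) * (1.0 + theta)
    return float(np.sum((v - target) ** 2))

v0, target = np.ones(4), 0.5 * np.ones(4)
eps = 1e-6
# central finite difference of the full rollout loss w.r.t. theta
grad = (unrolled_loss(0.05 + eps, v0, target)
        - unrolled_loss(0.05 - eps, v0, target)) / (2 * eps)
```

In the notebook this gradient is instead obtained by backpropagating through the differentiable physics solver, which scales to high-dimensional $\theta$.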
@ -102,10 +102,12 @@
"**TL;DR**:\n",
"We'll train a network $\\mathcal{C}$ to reduce the numerical errors of a simulator with a more accurate reference. Here it's crucial to have the _source_ solver realized as a differential physics operator, such that it can give gradients for an improved training of $\\mathcal{C}$.\n",
"\n",
"\\\\\n",
"<br>\n",
"\n",
"---\n",
"\n",
"## Getting started with the Implementation\n",
"\n",
"First, let's download the prepared data set (for details on generation & loading cf. https://github.com/tum-pbs/Solver-in-the-Loop), and let's get the data handling out of the way, so that we can focus on the _interesting_ parts..."
]
},
@ -130,7 +132,7 @@
"with open('data-karman2d-train.pickle', 'rb') as f: dataPreloaded = pickle.load(f)\n",
"print(\"Loaded data, {} training sims\".format(len(dataPreloaded)) )\n"
],
"execution_count": 1,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@ -174,7 +176,7 @@
"np.random.seed(42)\n",
"tf.compat.v1.set_random_seed(42)\n"
],
"execution_count": 2,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@ -212,6 +214,8 @@
"id": "OhnzPdoww11P"
},
"source": [
"## Simulation Setup\n",
"\n",
"Now we can set up the _source_ simulation $\\newcommand{\\pdec}{\\pde_{s}} \\pdec$. \n",
"Note that we won't deal with \n",
"$\\newcommand{\\pder}{\\pde_{r}} \\pder$\n",
@ -259,7 +263,7 @@
"\n",
" return super().step(fluid=fluid, dt=dt, obstacles=[self.obst], gravity=gravity, density_effects=[self.infl], velocity_effects=())\n"
],
"execution_count": 3,
"execution_count": null,
"outputs": []
},
{
@ -268,6 +272,8 @@
"id": "RYFUGICgxk0K"
},
"source": [
"## Network Architecture\n",
"\n",
"We'll also define two alternative neural networks to represent \n",
"$\\newcommand{\\vcN}{\\mathbf{s}} \\newcommand{\\corr}{\\mathcal{C}} \\corr$: \n",
"\n",
@ -296,7 +302,7 @@
" keras.layers.Conv2D(filters=2, kernel_size=5, padding='same', activation=None), # u, v\n",
" ])\n"
],
"execution_count": 4,
"execution_count": null,
"outputs": []
},
{
@ -352,7 +358,7 @@
" l_output = keras.layers.Conv2D(filters=2, kernel_size=5, padding='same')(block_5)\n",
" return keras.models.Model(inputs=l_input, outputs=l_output)\n"
],
"execution_count": 5,
"execution_count": null,
"outputs": []
},
{
@ -387,7 +393,7 @@
"def to_staggered(tensor_cen, box):\n",
" return StaggeredGrid(math.pad(tensor_cen, ((0,0), (0,1), (0,1), (0,0))), box=box)\n"
],
"execution_count": 12,
"execution_count": null,
"outputs": []
},
{
@ -398,6 +404,8 @@
"source": [
"---\n",
"\n",
"## Data Handling\n",
"\n",
"So far so good - we also need to take care of a few more mundane tasks, e.g., some data handling and randomization. Below we define a `Dataset` class that stores all \"ground truth\" reference data (already downsampled).\n",
"\n",
"We actually have a lot of data dimensions: multiple simulations, with many time steps, each with different fields. This makes the code below a bit more difficult to read.\n",
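A stripped-down sketch of this indexing logic may help to read the real class below. Everything here is hypothetical (names, shapes, the random data); it only mimics the structure of simulations × time steps × fields and the `nextStep`-style time advance:

```python
import numpy as np

class ToyDataset:
    # hypothetical mini version: num_sims simulations, num_steps frames each,
    # served as (current frame, next frame) pairs while the step index advances
    def __init__(self, num_sims=2, num_steps=6, shape=(8, 8, 2)):
        rng = np.random.default_rng(42)
        self.data = rng.standard_normal((num_sims, num_steps) + shape)
        self.step_idx = 0

    def get_data(self, sim):
        # current frame as network input, next frame as ground truth
        return self.data[sim, self.step_idx], self.data[sim, self.step_idx + 1]

    def next_step(self):
        self.step_idx += 1

ds = ToyDataset()
cur, nxt = ds.get_data(0)
ds.next_step()
```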
@ -477,7 +485,7 @@
" def nextStep(self):\n",
" self.stepIdx += 1\n"
],
"execution_count": 7,
"execution_count": null,
"outputs": []
},
{
@ -528,7 +536,7 @@
" ]\n",
" return [marker_dens, velocity, ext]\n"
],
"execution_count": 8,
"execution_count": null,
"outputs": []
},
{
@ -560,7 +568,7 @@
"#print(format(getData(dataset,1)))\n",
"#print(format(dataset.getData(1)))\n"
],
"execution_count": 9,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@ -624,7 +632,7 @@
"network.summary() \n",
"\n"
],
"execution_count": 10,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@ -665,6 +673,8 @@
"id": "AbpNPzplQZMF"
},
"source": [
"## Interleaving Simulation and Network\n",
"\n",
"Now comes the **most crucial** step in the whole setup: we define the chain of simulation steps and network evaluations to be used at training time. After all the work defining helper functions, it's actually pretty simple: we loop over `msteps`, call the simulator via `KarmanFlow.step` for an input state, and afterwards evaluate the correction via `network(to_keras())`. The correction is then added to the last simulation state in the `prediction` list (we're actually simply overwriting the last simulated step `prediction[-1]` with `velocity + correction[-1]`).\n",
"\n",
"One other important thing that's happening here is normalization: the inputs to the network are divided by the standard deviations in `dataset.dataStats`. This is slightly complicated, as we have to append the scaling for the Reynolds numbers to the normalization for the velocity. After evaluating the `network`, we only have a velocity left, so we can simply multiply by the standard deviation again (`* dataset.dataStats['std'][1]`)."
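The interleaving of solver, normalization, and network can be sketched as follows. This is a toy stand-in, not the notebook's implementation: `solver_step` and `network` are hypothetical placeholders, and the `STD_*` constants mimic the role of `dataset.dataStats`:

```python
import numpy as np

STD_VEL = 2.0    # stand-in for dataset.dataStats['std'][1]
STD_RE = 100.0   # assumed scale for the appended Reynolds-number channel

def solver_step(v):
    # hypothetical stand-in for KarmanFlow.step
    return 0.95 * v

def network(x):
    # hypothetical stand-in NN: consumes (u, v, Re) channels, returns a velocity correction
    return 0.01 * x[..., :2]

def hybrid_rollout(v0, reynolds, msteps=4):
    prediction = [v0]
    for _ in range(msteps):
        v = solver_step(prediction[-1])
        # normalize inputs: velocity by its std, Re appended as an extra channel
        re_chan = np.full(v.shape[:-1] + (1,), reynolds / STD_RE)
        inp = np.concatenate([v / STD_VEL, re_chan], axis=-1)
        corr = network(inp) * STD_VEL  # de-normalize the NN output
        prediction.append(v + corr)    # corrected state replaces the plain solver step
    return prediction

states = hybrid_rollout(np.ones((8, 8, 2)), reynolds=1e4)
```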
@ -702,7 +712,7 @@
"\n",
" prediction[-1] = prediction[-1].copied_with(velocity=prediction[-1].velocity + correction[-1])\n"
],
"execution_count": 13,
"execution_count": null,
"outputs": []
},
{
@ -729,7 +739,7 @@
"]\n",
"loss = tf.reduce_sum(loss_steps)/msteps\n"
],
"execution_count": 14,
"execution_count": null,
"outputs": []
},
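The loss structure above (per-step L2 differences, summed and averaged over `msteps`) can be written out as a small numpy sketch; the notebook computes the same quantity on TensorFlow tensors via `tf.reduce_sum`:

```python
import numpy as np

def multi_step_loss(predictions, targets):
    # one L2 term per unrolled step, averaged over the number of steps
    msteps = len(predictions)
    loss_steps = [np.sum((p - t) ** 2) for p, t in zip(predictions, targets)]
    return sum(loss_steps) / msteps

# toy example with 3 unrolled steps of 4 values each
preds = [np.ones(4) * k for k in range(3)]
targs = [np.zeros(4)] * 3
loss = multi_step_loss(preds, targs)
```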
{
@ -738,17 +748,15 @@
"id": "E6Vly1_0QhZ1"
},
"source": [
"## Training\n",
"\n",
"For the training, we use a standard Adam optimizer, and only run 4 epochs by default. This could (should) be increased for the larger network or to obtain more accurate results."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PuljFamYQksW",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e71bcaae-187c-4c10-cee8-f03bb8964af0"
"id": "PuljFamYQksW"
},
"source": [
"lr = 1e-4\n",
@ -771,19 +779,8 @@
" ld_network = keras.models.load_model(output_dir+'/nn_epoch{:04d}.h5'.format(resume))\n",
" network.set_weights(ld_network.get_weights())\n"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": [
"WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/phi/tf/session.py:28: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.\n",
"\n",
"WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/phi/tf/session.py:29: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.\n",
"\n"
],
"name": "stdout"
}
]
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
@ -809,7 +806,7 @@
" elif epoch == 10: lr *= 1e-1\n",
" return lr\n"
],
"execution_count": 16,
"execution_count": null,
"outputs": []
},
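Only the tail of the learning-rate function is visible in the hunk above. A hedged reconstruction of such a step-decay schedule could look like this; the epoch-10 branch matches the visible fragment, while the epoch-5 threshold is purely an assumption for illustration:

```python
def decay_lr(epoch, lr):
    # drop the learning rate by 10x at selected epochs
    # (epoch 5 is assumed; only the epoch == 10 branch is visible in the diff)
    if epoch == 5:
        lr *= 1e-1
    elif epoch == 10:
        lr *= 1e-1
    return lr
```

With the default of 4 epochs this schedule never triggers; it only matters when the training duration is increased as suggested.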
{
@ -830,7 +827,7 @@
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "3bea702a-14d0-43a7-ebc5-25289e27c5a5"
"outputId": "148d951b-7070-4a95-c6d7-0fd91d29606e"
},
"source": [
"current_lr = lr\n",
@ -855,7 +852,7 @@
" _, l2 = sess.run([train_step, loss], my_feed_dict)\n",
" steps += 1\n",
"\n",
" if (j==0 and i<3) or (ib==0 and i%10==0):\n",
" if (j==0 and i<3) or (j==0 and ib==0 and i%31==0) or (ib==0 and i%124==0):\n",
" print('epoch {:03d}/{:03d}, batch {:03d}/{:03d}, step {:04d}/{:04d}: loss={}'.format( j+1, epochs, ib+1, dataset.numBatches, i+1, dataset.numSteps, l2 ))\n",
" dataset.nextStep()\n",
"\n",
@ -863,7 +860,7 @@
"\n",
" if j%10==9: network.save(output_dir+'/nn_epoch{:04d}.h5'.format(j+1))\n",
"\n",
"#tf_writer_tr.close()\n",
"# all done! save final version\n",
"network.save(output_dir+'/final.h5')\n"
],
"execution_count": null,
@ -871,11 +868,39 @@
{
"output_type": "stream",
"text": [
"epoch 001/004, batch 001/002, step 0001/0496: loss=6816.912109375\n",
"epoch 001/004, batch 001/002, step 0002/0496: loss=4036.171875\n",
"epoch 001/004, batch 001/002, step 0003/0496: loss=1627.9716796875\n",
"epoch 001/004, batch 001/002, step 0011/0496: loss=1403.9822998046875\n",
"epoch 001/004, batch 001/002, step 0021/0496: loss=841.949951171875\n"
"epoch 001/004, batch 001/002, step 0001/0496: loss=8114.626953125\n",
"epoch 001/004, batch 001/002, step 0002/0496: loss=3371.28125\n",
"epoch 001/004, batch 001/002, step 0003/0496: loss=1594.294189453125\n",
"epoch 001/004, batch 001/002, step 0032/0496: loss=261.2645263671875\n",
"epoch 001/004, batch 001/002, step 0063/0496: loss=124.70037078857422\n",
"epoch 001/004, batch 001/002, step 0094/0496: loss=86.60037231445312\n",
"epoch 001/004, batch 001/002, step 0125/0496: loss=93.21685028076172\n",
"epoch 001/004, batch 001/002, step 0156/0496: loss=64.77877807617188\n",
"epoch 001/004, batch 001/002, step 0187/0496: loss=58.933082580566406\n",
"epoch 001/004, batch 001/002, step 0218/0496: loss=51.40797805786133\n",
"epoch 001/004, batch 001/002, step 0249/0496: loss=42.819091796875\n",
"epoch 001/004, batch 001/002, step 0280/0496: loss=46.30024719238281\n",
"epoch 001/004, batch 001/002, step 0311/0496: loss=41.07358932495117\n",
"epoch 001/004, batch 001/002, step 0342/0496: loss=40.12362289428711\n",
"epoch 001/004, batch 001/002, step 0373/0496: loss=41.094932556152344\n",
"epoch 001/004, batch 001/002, step 0404/0496: loss=36.17275619506836\n",
"epoch 001/004, batch 001/002, step 0435/0496: loss=37.64105987548828\n",
"epoch 001/004, batch 001/002, step 0466/0496: loss=33.44026184082031\n",
"epoch 001/004, batch 002/002, step 0001/0496: loss=36.6204719543457\n",
"epoch 001/004, batch 002/002, step 0002/0496: loss=29.037982940673828\n",
"epoch 001/004, batch 002/002, step 0003/0496: loss=27.977163314819336\n",
"epoch 002/004, batch 001/002, step 0001/0496: loss=13.540712356567383\n",
"epoch 002/004, batch 001/002, step 0125/0496: loss=12.313040733337402\n",
"epoch 002/004, batch 001/002, step 0249/0496: loss=11.129035949707031\n",
"epoch 002/004, batch 001/002, step 0373/0496: loss=11.969249725341797\n",
"epoch 003/004, batch 001/002, step 0001/0496: loss=8.394614219665527\n",
"epoch 003/004, batch 001/002, step 0125/0496: loss=7.2177557945251465\n",
"epoch 003/004, batch 001/002, step 0249/0496: loss=8.274188041687012\n",
"epoch 003/004, batch 001/002, step 0373/0496: loss=9.177286148071289\n",
"epoch 004/004, batch 001/002, step 0001/0496: loss=6.306344985961914\n",
"epoch 004/004, batch 001/002, step 0125/0496: loss=4.158570289611816\n",
"epoch 004/004, batch 001/002, step 0249/0496: loss=4.282064437866211\n",
"epoch 004/004, batch 001/002, step 0373/0496: loss=5.2111334800720215\n"
],
"name": "stdout"
}
@ -887,7 +912,7 @@
"id": "swG7GeDpWT_Z"
},
"source": [
"The loss should go down from ca. 1000 initially to around 1. This is a good sign, but of course it's even more important to see how the resulting solver fares on new inputs.\n",
"The loss should go down from above 1000 initially to below 10. This is a good sign, but of course it's even more important to see how the resulting solver fares on new inputs.\n",
"\n",
"Note that after training we've realized a hybrid solver, consisting of a regular _source_ simulator, and a network that was trained to specifically interact with this simulator for a chosen domain of simulation cases.\n",
"\n",
@ -897,7 +922,7 @@
"\n",
"## Next steps\n",
"\n",
"* Modify the training to further reduce the training error\n",
"* Modify the training to further reduce the training error. With the medium network you should be able to get the loss down to around 1.\n",
"\n",
"* Export the network to the external github code, and run it on new wake flow cases. You'll see that a reduced training error does not always directly correlate with improved test performance.\n",
"\n",

Binary file not shown. (After: Size 17 KiB)


@ -2,8 +2,8 @@ Supervised Training
=======================
_Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of
deep learning (DL), of course, so it's still fairly new. Also, "old fashioned" of course also doesn't always mean bad
- it's just that we'll be able to do better than simple supervised training later on.
deep learning (DL), of course, so it's still fairly new. Also, "old fashioned" of course also doesn't
always mean bad - it's just that we'll be able to do better than simple supervised training later on.
In a way, the viewpoint of "supervised training" is a starting point for all projects one would encounter in the context of DL, and
hence is worth studying. And although it typically yields inferior results to approaches that more tightly