From 98df3e43fcda540a888a45ee98a1e70a44d28a4b Mon Sep 17 00:00:00 2001
From: NT
Date: Tue, 20 Jul 2021 21:39:43 +0200
Subject: [PATCH] removed model as name for NNs

---
 bayesian-code.ipynb       | 16 ++++++++--------
 intro-teaser.ipynb        | 28 ++++++++++++++--------------
 overview-equations.md     |  2 +-
 reinflearn-code.ipynb     |  6 +++---
 supervised-airfoils.ipynb | 30 +++++++++++++++---------------
 5 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/bayesian-code.ipynb b/bayesian-code.ipynb
index 171a016..b409236 100644
--- a/bayesian-code.ipynb
+++ b/bayesian-code.ipynb
@@ -184,10 +184,10 @@
 "id": "C2gdKINAG_Qs"
 },
 "source": [
- "### Model Definition\n",
+ "### Neural Network Definition\n",
 "Now let's look at how we can implement BNNs. Instead of PyTorch, we will use TensorFlow, in particular the extension TensorFlow Probability, which has easy-to-implement probabilistic layers. Like in the other notebook, we use a U-Net structure consisting of Convolutional blocks with skip-layer connections. For now, we only want to set up the decoder, i.e. second part of the U-Net as bayesian. For this, we will take advantage of TensorFlows _flipout_ layers (in particular, the convolutional implementation). \n",
 "\n",
- "In a forward pass, those layers automatically sample from the current posterior distribution and store the KL-divergence between prior and posterior in _model.losses_. One can specify the desired divergence measure (typically KL-divergence) and modify the prior and approximate posterior distributions, if other than normal distributions are desired. Other than that, the flipout layers can be used just like regular layers in sequential models. The code below implements a single convolutional block of the U-Net:"
+ "In a forward pass, those layers automatically sample from the current posterior distribution and store the KL-divergence between prior and posterior in _model.losses_. One can specify the desired divergence measure (typically KL-divergence) and modify the prior and approximate posterior distributions, if distributions other than normal ones are desired. Other than that, the flipout layers can be used just like regular layers in sequential architectures. The code below implements a single convolutional block of the U-Net:"
 ]
 },
 {
@@ -242,7 +242,7 @@
 "source": [
 "Next we define the full network with these blocks - the structure is almost identical to the previous notebook. We manually define the kernel-divergence function as `kdf` and rescale it with a factor called `kl_scaling`. There are two reasons for this: \n",
 "\n",
- "First, we should only apply the kl-divergence once per epoch if we want to use the correct loss (like introduced in {doc}`bayesian-intro`). Since we will use batch-wise training, we need to rescale the Kl-divergence by the number of batches, such that in every parameter update only _kdf / num_batches_ is added to the loss. During one epoch, _num_batches_ parameter updates are performed and the 'full' KL-divergence is used. This batch scaling is computed and passed to the network initialization via `kl_scaling` when instantiating the `Bayes_DfpNet` model later on.\n",
+ "First, we should only apply the KL-divergence once per epoch if we want to use the correct loss (like introduced in {doc}`bayesian-intro`). Since we will use batch-wise training, we need to rescale the KL-divergence by the number of batches, such that in every parameter update only _kdf / num_batches_ is added to the loss. During one epoch, _num_batches_ parameter updates are performed and the 'full' KL-divergence is used. This batch scaling is computed and passed to the network initialization via `kl_scaling` when instantiating the `Bayes_DfpNet` NN later on.\n",
 "\n",
 "Second, by scaling the KL-divergence part of the loss up or down, we have a way of tuning how much randomness we want to allow in the network: If we neglect the KL-divergence completely, we would just minimize the regular loss (e.g. MSE or MAE), like in a conventional neural network. If we instead neglect the negative-log-likelihood, we would optimize the network such that we obtain random draws from the prior distribution. Balancing those extremes can be done by fine-tuning the scaling of the KL-divergence and is hard in practice. "
 ]
 },
 {
@@ -547,7 +547,7 @@
 "id": "7aM5Ra2C7k1v"
 },
 "source": [
- "The model is trained! Let's look at the loss. Since the loss consists of two separate parts, it is helpful to monitor both parts (MAE and KL)."
+ "The BNN is trained! Let's look at the loss. Since the loss consists of two separate parts, it is helpful to monitor both parts (MAE and KL)."
 ]
 },
 {
@@ -836,9 +836,9 @@
 "source": [
 "## Test evaluation\n",
 "\n",
- "Like in the case for a conventional neural network, let's now look at **proper** test samples, i.e. OOD samples, for which in this case we'll use new airfoil shapes. These are shapes that the network never saw in any training samples, and hence it tells us a bit about how well the model generalizes to new shapes.\n",
+ "Like in the case for a conventional neural network, let's now look at **proper** test samples, i.e. OOD samples, for which in this case we'll use new airfoil shapes. These are shapes that the network never saw in any training samples, and hence it tells us a bit about how well the network generalizes to new shapes.\n",
 "\n",
- "As these samples are at least slightly OOD, we can draw conclusions about how well the model generalizes, which the validation data would not really tell us. In particular, we would like to investigate if the model is more uncertain when handling OOD data. Like before, we first download the test samples ..."
+ "As these samples are at least slightly OOD, we can draw conclusions about how well the network generalizes, which the validation data would not really tell us. In particular, we would like to investigate if the NN is more uncertain when handling OOD data. Like before, we first download the test samples ..."
 ]
 },
 {
@@ -1132,9 +1132,9 @@
 "\n",
 "But now it's time to experiment with BNNs yourself. \n",
 "\n",
- "* One interesting thing to look at is how the behavior of our model changes, if we adjust the KL-prefactor. In the training loop above we set it to 5000 without further justification. You can check out what happens, if you use a value of 1, as it is suggested by the theory, instead of 5000. According to our implementation, this should make the model 'more bayesian', since we assign larger importance to the KL-divergence than before. \n",
+ "* One interesting thing to look at is how the behavior of our BNN changes if we adjust the KL-prefactor. In the training loop above we set it to 5000 without further justification. You can check out what happens if you use a value of 1, as suggested by the theory, instead of 5000. According to our implementation, this should make the network 'more bayesian', since we assign larger importance to the KL-divergence than before. \n",
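To make the KL-prefactor experiment above concrete, a minimal sketch of such a rescaled kernel-divergence function wired into a flipout layer could look as follows (assuming TensorFlow Probability; `num_batches` and `kl_pref` are illustrative placeholders, not the notebook's exact variables):

```python
import tensorflow_probability as tfp

num_batches = 10  # placeholder: parameter updates per epoch
kl_pref = 5000    # the KL-prefactor discussed above; 1 = 'fully bayesian'

# kernel-divergence function: dividing by num_batches spreads the 'full'
# KL term over one epoch, while kl_pref scales down its overall importance
def kdf(q, p, _):
    return tfp.distributions.kl_divergence(q, p) / (num_batches * kl_pref)

# a flipout convolution using kdf, as in the decoder blocks of Bayes_DfpNet
layer = tfp.layers.Convolution2DFlipout(
    filters=8, kernel_size=3, padding="same",
    kernel_divergence_fn=kdf, activation="relu")
```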
 "\n",
- "* So far, we have only worked with variational BNNs, implemented via TensorFlows probabilistic layers. Recall that there is a simpler way of getting uncertainty estimates: Using dropout not only at training, but also at inference time. You can check out how the outputs change for that case. In order to do so, you can, for instance, just pass a non-zero dropout rate to the model specification and change the prediction phase in the above implementation from _model.predict(...)_ to _model(..., training=True)_. Setting the _training=True_ flag will tell TensorFlow to forward the input as if it were training data and hence, it will apply dropout. Please note that the _training=True_ flag can also affect other features of the network. Batch normalization, for instance, works differently in training and prediction mode. As long as we don't deal with overly different data and use sufficiently large batch-sizes, this should not introduce large errors, though. Sensible dropout rates to start experimenting with are e.g. around 0.1."
+ "* So far, we have only worked with variational BNNs, implemented via TensorFlow's probabilistic layers. Recall that there is a simpler way of getting uncertainty estimates: Using dropout not only at training, but also at inference time. You can check out how the outputs change for that case. In order to do so, you can, for instance, just pass a non-zero dropout rate to the network specification and change the prediction phase in the above implementation from _model.predict(...)_ to _model(..., training=True)_. Setting the _training=True_ flag will tell TensorFlow to forward the input as if it were training data and hence, it will apply dropout. Please note that the _training=True_ flag can also affect other features of the network. Batch normalization, for instance, works differently in training and prediction mode. As long as we don't deal with overly different data and use sufficiently large batch-sizes, this should not introduce large errors, though. Sensible dropout rates to start experimenting with are e.g. around 0.1."
 ]
 }
]

diff --git a/intro-teaser.ipynb b/intro-teaser.ipynb
index c97acde..6df92b9 100644
--- a/intro-teaser.ipynb
+++ b/intro-teaser.ipynb
@@ -121,7 +121,7 @@
 "id": "stone-science",
 "metadata": {},
 "source": [
- "Now we can define a network, the loss, and the training configuration. We'll use a simple `keras` model with three hidden layers, ReLU activations."
+ "Now we can define a network, the loss, and the training configuration. We'll use a simple `keras` architecture with two hidden layers and ReLU activations."
 ]
 },
 {
@@ -133,7 +133,7 @@
 "source": [
 "# Neural network\n",
 "act = tf.keras.layers.ReLU()\n",
- "model_sv = tf.keras.models.Sequential([\n",
+ "nn_sv = tf.keras.models.Sequential([\n",
 " tf.keras.layers.Dense(10, activation=act),\n",
 " tf.keras.layers.Dense(10, activation=act),\n",
 " tf.keras.layers.Dense(1,activation='linear')])"
 ]
 },
 {
@@ -174,10 +174,10 @@
 "# Loss function\n",
 "loss_sv = tf.keras.losses.MeanSquaredError()\n",
 "optimizer_sv = tf.keras.optimizers.Adam(lr=0.001)\n",
- "model_sv.compile(optimizer=optimizer_sv, loss=loss_sv)\n",
+ "nn_sv.compile(optimizer=optimizer_sv, loss=loss_sv)\n",
 "\n",
 "# Training\n",
- "results_sv = model_sv.fit(X, Y, epochs=5, batch_size= 5, verbose=1)"
+ "results_sv = nn_sv.fit(X, Y, epochs=5, batch_size=5, verbose=1)"
 ]
 },
 {
@@ -185,7 +185,7 @@
 "id": "governmental-mixture",
 "metadata": {},
 "source": [
- "As both model and data set are very small, the training converges very quickly. However, if we inspect the predictions of the network, we can see that it is nowhere near the solution we were hoping to find: it averages between the data points on both sides of the x-axis and therefore fails to find satisfying solutions to the problem above.\n",
+ "As both the NN and the data set are very small, the training converges very quickly. However, if we inspect the predictions of the network, we can see that it is nowhere near the solution we were hoping to find: it averages between the data points on both sides of the x-axis and therefore fails to find satisfying solutions to the problem above.\n",
 "\n",
 "The following plot nicely highlights this: it shows the data in light gray, and the supervised solution in red. "
 ]
 },
 {
@@ -212,7 +212,7 @@
 "source": [
 "# Results\n",
 "plt.plot(X,Y,'.',label='Data points', color=\"lightgray\")\n",
- "plt.plot(X,model_sv.predict(X),'.',label='Supervised', color=\"red\")\n",
+ "plt.plot(X,nn_sv.predict(X),'.',label='Supervised', color=\"red\")\n",
 "plt.xlabel('y')\n",
 "plt.ylabel('x')\n",
 "plt.title('Standard approach')\n",
 ]
 },
 {
@@ -248,7 +248,7 @@
 "source": [
 "Now let's apply a differentiable physics approach to find $f$: we'll directly include our discretized model $\mathcal P$ in the training. \n",
 "\n",
- "There is no real data generation step; we only need to sample from the $[0,1]$ interval. We'll simply keep the same $x$ locations used in the previous case, and a new instance of a model with the same architecture as before `model_dp`:"
+ "There is no real data generation step; we only need to sample from the $[0,1]$ interval. We'll simply keep the same $x$ locations used in the previous case, and a new instance of an NN with the same architecture as before, `nn_dp`:"
 ]
 },
 {
@@ -263,7 +263,7 @@
 "source": [
 "# X-Data\n",
 "# Y is evaluated on the fly\n",
 "\n",
 "# Model\n",
- "model_dp = tf.keras.models.Sequential([\n",
+ "nn_dp = tf.keras.models.Sequential([\n",
 " tf.keras.layers.Dense(10, activation=act),\n",
 " tf.keras.layers.Dense(10, activation=act),\n",
 " tf.keras.layers.Dense(1, activation='linear')])"
 ]
 },
 {
@@ -292,7 +292,7 @@
 "source": [
 "mse = tf.keras.losses.MeanSquaredError()\n",
 "def loss_dp(y_true, y_pred):\n",
 " return mse(y_true,y_pred**2)\n",
 "\n",
 "optimizer_dp = tf.keras.optimizers.Adam(lr=0.001)\n",
- "model_dp.compile(optimizer=optimizer_dp, loss=loss_dp)"
+ "nn_dp.compile(optimizer=optimizer_dp, loss=loss_dp)"
 ]
 },
 {
@@ -320,7 +320,7 @@
 "source": [
 "#Training\n",
- "results_dp = model_dp.fit(X, X, epochs=5, batch_size=5, verbose=1)"
+ "results_dp = nn_dp.fit(X, X, epochs=5, batch_size=5, verbose=1)"
 ]
 },
 {
@@ -353,8 +353,8 @@
 "source": [
 "# Results\n",
 "plt.plot(X,Y,'.',label='Datapoints', color=\"lightgray\")\n",
- "#plt.plot(X,model_sv.predict(X),'.',label='Supervised', color=\"red\") # optional for comparison\n",
- "plt.plot(X,model_dp.predict(X),'.',label='Diff. Phys.', color=\"green\") \n",
+ "#plt.plot(X,nn_sv.predict(X),'.',label='Supervised', color=\"red\") # optional for comparison\n",
+ "plt.plot(X,nn_dp.predict(X),'.',label='Diff. Phys.', color=\"green\") \n",
 "plt.xlabel('x')\n",
 "plt.ylabel('y')\n",
 "plt.title('Differentiable physics approach')\n",
@@ -373,9 +373,9 @@
 "\n",
 "- We've prevented an undesired averaging of multiple modes in the solution by evaluating our discrete model w.r.t. current prediction of the network, rather than using a pre-computed solution. This lets us find the best single mode near the network prediction, and prevents an averaging of the modes that exist in the solution manifold.\n",
 "\n",
- "- We're still only getting one side of the curve! This is to be expected, because we're representing the solutions with a deterministic function. Hence, we can only represent a single mode. Interestingly, whether it's the top or bottom mode is determined by the random initialization of the weights in $f$ - run the example a couple of time to see this effect in action. To capture multiple modes we'd need to extend the model to capture the full distribution of the outputs and parametrize it with additional dimensions.\n",
+ "- We're still only getting one side of the curve! This is to be expected, because we're representing the solutions with a deterministic function. Hence, we can only represent a single mode. Interestingly, whether it's the top or bottom mode is determined by the random initialization of the weights in $f$ - run the example a couple of times to see this effect in action. To capture multiple modes we'd need to extend the NN to represent the full distribution of the outputs and parametrize it with additional dimensions.\n",
 "\n",
- "- The region with $x$ near zero is typically still off in this example. The model essentially learns a linear approximation of one half of the parabola here. This is partially caused by the weak neural network: it is very small and shallow. In addition, the evenly spread of sample points along the x axis bias the model towards the larger y values. These contribute more to the loss, and hence the network invests most of its resources to reduce the error in this region.\n"
+ "- The region with $x$ near zero is typically still off in this example. The network essentially learns a linear approximation of one half of the parabola here. This is partially caused by the weak neural network: it is very small and shallow. In addition, the even spread of sample points along the x axis biases the NN towards the larger y values. These contribute more to the loss, and hence the network invests most of its resources to reduce the error in this region.\n"
 ]
 },
 {

diff --git a/overview-equations.md b/overview-equations.md
index 4b42582..1428de0 100644
--- a/overview-equations.md
+++ b/overview-equations.md
@@ -2,7 +2,7 @@ Models and Equations
 ============================
 Below we'll give a brief (really _very_ brief!) intro to deep learning, primarily to introduce the notation.
-In addition we'll discuss some _model equations_ below. Note that we won't use _model_ to denote trained neural networks, in contrast to some other texts. These will only be called "NNs" or "networks". A "model" will always denote a set of model equations for a physical effect, typically PDEs.
+In addition we'll discuss some _model equations_ below. Note that we'll avoid using _model_ to denote trained neural networks, in contrast to some other texts and APIs. These will be called "NNs" or "networks". A "model" will typically denote a set of model equations for a physical effect, usually PDEs.
 
 ## Deep learning and neural networks
 
diff --git a/reinflearn-code.ipynb b/reinflearn-code.ipynb
index 80326ed..cc0eae1 100644
--- a/reinflearn-code.ipynb
+++ b/reinflearn-code.ipynb
@@ -22,11 +22,11 @@
 "\n",
 "## Overview\n",
 "\n",
- "Reinforcement learning describes an agent perceiving an environment and taking actions inside it. It aims at maximizing an accumulated sum of rewards, which it receives for those actions by the environment. Thus, the agent learns empirically which actions to take in different situations. _Proximal policy optimization_ [PPO](https://arxiv.org/abs/1707.06347v2) is a widely used reinforcement learning algorithm describing two neural networks: a policy model selecting actions for given observations and a value estimator network rating the reward potential of those states. These value estimates form the loss of the policy model, given by the change in reward potential by the chosen action.\n",
+ "Reinforcement learning describes an agent perceiving an environment and taking actions inside it. It aims at maximizing an accumulated sum of rewards, which it receives for those actions by the environment. Thus, the agent learns empirically which actions to take in different situations. _Proximal policy optimization_ [PPO](https://arxiv.org/abs/1707.06347v2) is a widely used reinforcement learning algorithm involving two neural networks: a policy NN selecting actions for given observations and a value estimator network rating the reward potential of those states. These value estimates form the loss of the policy network, given by the change in reward potential by the chosen action.\n",
 "\n",
 "This notebook illustrates how PPO reinforcement learning can be applied to the described control problem of Burgers' equation. In comparison to the DP approach, the RL method does not have access to a differentiable physics solver, it is _model-free_. \n",
 "\n",
- "However, the goal of the value estimator model is to compensate for this lack of a solver, as it tries to capture the long term effect of individual actions. Thus, an interesting question the following code example should answer is: can the model-free PPO reinforcement learning match the performance of the model-based DP training. We will compare this in terms of learning speed and the amount of required forces.\n"
+ "However, the goal of the value estimator NN is to compensate for this lack of a solver, as it tries to capture the long-term effect of individual actions. Thus, an interesting question the following code example should answer is: can the model-free PPO reinforcement learning match the performance of the model-based DP training? We will compare this in terms of learning speed and the amount of required forces.\n"
 ]
 },
 {
@@ -51,7 +51,7 @@
 },
 "outputs": [],
 "source": [
- "!pip install stable-baselines3 phiflow==1.5.1\n",
+ "!pip install stable-baselines3==1.0 phiflow==1.5.1\n",
 "!git clone https://github.com/Sh0cktr4p/PDE-Control-RL.git\n",
 "!git clone https://github.com/holl-/PDE-Control.git"
 ]
 },
 {

diff --git a/supervised-airfoils.ipynb b/supervised-airfoils.ipynb
index ad89d62..476e283 100644
--- a/supervised-airfoils.ipynb
+++ b/supervised-airfoils.ipynb
@@ -373,7 +373,7 @@
 "source": [
 "Next, we can initialize an instance of the `DfpNet`.\n",
 "\n",
- "Below, the `EXPO` parameter here controls the exponent for the feature maps of our Unet: this directly scales the network size (3 gives a model with ca. 150k parameters). This is relatively small for a generative model with $3 \times 128^2 = \text{ca. }49k$ outputs, but yields fast training times and prevents overfitting given the relatively small data set we're using here. Hence it's a good starting point."
+ "Below, the `EXPO` parameter controls the exponent for the feature maps of our Unet: this directly scales the network size (3 gives a network with ca. 150k parameters). This is relatively small for a generative NN with $3 \times 128^2 = \text{ca. }49k$ outputs, but yields fast training times and prevents overfitting given the relatively small data set we're using here. Hence it's a good starting point."
 ]
 },
 {
@@ -403,8 +403,8 @@
 "net = DfpNet(channelExponent=EXPO)\n",
 "#print(net) # to double check the details...\n",
 "\n",
- "model_parameters = filter(lambda p: p.requires_grad, net.parameters())\n",
- "params = sum([np.prod(p.size()) for p in model_parameters])\n",
+ "nn_parameters = filter(lambda p: p.requires_grad, net.parameters())\n",
+ "params = sum([np.prod(p.size()) for p in nn_parameters])\n",
 "\n",
 "# crucial parameter to keep in view: how many parameters do we have?\n",
 "print(\"Trainable params: {} -> crucial! always keep in view... \".format(params)) \n",
 ]
 },
 {
@@ -435,7 +435,7 @@
 "source": [
 "## Training\n",
 "\n",
- "Finally, we can train the model. This step can take a while, as the training runs over all 320 samples 100 times, and continually evaluates the validation samples to keep track of how well the current state of the NN is doing."
+ "Finally, we can train the NN. This step can take a while, as the training runs over all 320 samples 100 times, and continually evaluates the validation samples to keep track of how well the current state of the NN is doing."
 ]
 },
 {
@@ -475,7 +475,7 @@
 "Epoch: 40, L1 train: 0.03682, L1 vali: 0.03222\n",
 "Epoch: 60, L1 train: 0.03176, L1 vali: 0.02710\n",
 "Epoch: 80, L1 train: 0.02772, L1 vali: 0.02522\n",
- "Training done, saved model\n"
+ "Training done, saved network\n"
 ]
 }
 ],
@@ -483,9 +483,9 @@
 "history_L1 = []\n",
 "history_L1val = []\n",
 "\n",
- "if os.path.isfile(\"model\"):\n",
- " print(\"Found existing model, loading & skipping training\")\n",
- " net.load_state_dict(torch.load(\"model\")) # optionally, load existing model\n",
+ "if os.path.isfile(\"network\"):\n",
+ " print(\"Found existing network, loading & skipping training\")\n",
+ " net.load_state_dict(torch.load(\"network\")) # optionally, load existing network\n",
 "\n",
 "else:\n",
 " print(\"Training from scratch\")\n",
@@ -526,8 +526,8 @@
 " if epoch<3 or epoch%20==0:\n",
 " print( \"Epoch: {}, L1 train: {:7.5f}, L1 vali: {:7.5f}\".format(epoch, history_L1[-1], history_L1val[-1]) )\n",
 "\n",
- " torch.save(net.state_dict(), \"model\" )\n",
- " print(\"Training done, saved model\")\n"
+ " torch.save(net.state_dict(), \"network\" )\n",
+ " print(\"Training done, saved network\")\n"
 ]
 },
 {
@@ -536,7 +536,7 @@
 "id": "4KuUpJsSL3Jv"
 },
 "source": [
- "The model is finally trained! The losses should have nicely gone down in terms of absolute values: With the standard settings from an initial value of around 0.2 for the validation loss, to ca. 0.02 after 100 epochs. \n",
+ "The NN is finally trained! The losses should have nicely gone down in terms of absolute values: with the standard settings, from an initial value of around 0.2 for the validation loss to ca. 0.02 after 100 epochs. \n",
 "\n",
 "Let's look at the graphs to get some intuition for how the training progressed over time. This is typically important to identify longer-term trends in the training. In practice it's tricky to spot whether the overall trend of 100 or so noisy numbers in a command line log is going slightly up or down - this is much easier to spot in a visualization."
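A minimal sketch of such a visualization, using the two lists recorded in the training loop above (the notebook's actual plotting cell may differ in styling):

```python
import matplotlib.pyplot as plt

# history_L1 / history_L1val are appended to once per epoch in the loop above
plt.plot(history_L1, label="L1 train")
plt.plot(history_L1val, label="L1 validation")
plt.xlabel("epoch")
plt.ylabel("L1 loss")
plt.legend()
plt.show()
```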
 ]
 },
 {
@@ -591,7 +591,7 @@
 "\n",
 "If you look closely at this graph, you should spot something peculiar:\n",
 "_Why is the validation loss lower than the training loss_?\n",
- "The data is similar to the training data of course, but in a way it's slightly \"tougher\", because the model certainly never received any validation samples during training. It is natural that the validation loss slightly deviates from the training loss, but how can the L1 loss be _lower_ for these inputs?\n",
+ "The data is similar to the training data of course, but in a way it's slightly \"tougher\", because the network certainly never received any validation samples during training. It is natural that the validation loss slightly deviates from the training loss, but how can the L1 loss be _lower_ for these inputs?\n",
 "\n",
 "This is a subtlety of the training loop above: it runs a training step first, and the loss for each point in the graph is measured with the evolving state of the network in an epoch. The network is updated, and afterwards runs through the validation samples. Thus all validation samples are using a state that is definitely different (and hopefully a bit better) than the initial states of the epoch. Hence, the validation loss can be slightly lower.\n",
 "\n",
@@ -663,7 +663,7 @@
 "source": [
 "## Test evaluation\n",
 "\n",
- "Now let's look at actual test samples: In this case we'll use new airfoil shapes as out-of-distribution (OOD) data. These are shapes that the network never saw in any training samples, and hence it tells us a bit about how well the model generalizes to unseen inputs (the validation data wouldn't suffice to draw conclusions about generalization).\n",
+ "Now let's look at actual test samples: In this case we'll use new airfoil shapes as out-of-distribution (OOD) data. These are shapes that the network never saw in any training samples, and hence it tells us a bit about how well the NN generalizes to unseen inputs (the validation data wouldn't suffice to draw conclusions about generalization).\n",
 "\n",
 "We'll use the same visualization as before, and as indicated by the Bernoulli equation, especially the _pressure_ in the first column is a challenging quantity for the network. Due to it's cubic scaling w.r.t. the input freestream velocity and localized peaks, it is the toughest quantity to infer for the network.\n",
 "\n",
@@ -787,7 +787,7 @@
 "\n",
 "Looking at the visualizations, you'll notice that especially high-pressure peaks and pockets of larger y-velocities are missing in the outputs. This is primarily caused by the small network, which does not have enough resources to reconstruct details.\n",
 "\n",
- "Nonetheless, we have successfully replaced a fairly sophisticated RANS solver with a very small and fast neural network model. It has GPU support \"out-of-the-box\" (via pytorch), is differentiable, and introduces an error of only a few per-cent.\n",
+ "Nonetheless, we have successfully replaced a fairly sophisticated RANS solver with a very small and fast neural network architecture. It has GPU support \"out-of-the-box\" (via pytorch), is differentiable, and introduces an error of only a few per-cent.\n",
 "\n",
 "---\n",
 "\n",
@@ -804,7 +804,7 @@
 "\n",
 "There are many obvious things to try here (see the suggestions below), e.g. longer training, larger data sets, larger networks etc. \n",
 "\n",
- "* Experiment with learning rate, dropout, and model size to reduce the error on the test set. How small can you make it with the given training data?\n",
+ "* Experiment with learning rate, dropout, and network size to reduce the error on the test set. How small can you make it with the given training data?\n",
 "\n",
 "* The setup above uses normalized data. Instead you can recover [the original fields by undoing the normalization](https://github.com/thunil/Deep-Flow-Prediction) to check how well the network does w.r.t. the original quantities.\n",
 "\n",
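As a starting point for the normalization experiment, undoing the scaling could be sketched like this (the `denormalize` helper and its scale factors are hypothetical placeholders; the actual factors come from the data generation of the linked repository):

```python
import numpy as np

# hypothetical helper: rescale one (3,128,128) output sample back to
# physical units; scale_p / scale_u / scale_v are placeholder factors
def denormalize(sample, scale_p, scale_u, scale_v):
    out = np.copy(sample)
    out[0] *= scale_p  # pressure channel
    out[1] *= scale_u  # x-velocity channel
    out[2] *= scale_v  # y-velocity channel
    return out
```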