corrections Maximilian for BNNs
This commit is contained in:
parent dec213d4b0
commit 40ec002745
@@ -65,14 +65,14 @@
 "import os.path, random\n",
 "\n",
 "# as before, either download or use gdrive\n",
-"if False:\n",
+"if True:\n",
 "  if not os.path.isfile('data-airfoils.npz'):\n",
 "    import urllib.request\n",
 "    url=\"https://ge.in.tum.de/download/2019-deepFlowPred/data.npz\"\n",
 "    print(\"Downloading training data (300MB), this can take a few minutes the first time...\")\n",
 "    urllib.request.urlretrieve(url, 'data-airfoils.npz')\n",
 "  npfile=np.load('data-airfoils.npz')\n",
-"else:\n",
+"else: # cf. supervised airfoil code:\n",
 "  from google.colab import drive\n",
 "  drive.mount('/content/gdrive')\n",
 "  npfile=np.load('gdrive/My Drive/data-airfoils.npz')\n",
@@ -185,7 +185,7 @@
 },
 "source": [
 "### Model Definition\n",
-"Now let's look at how we can implement BNNs. Instead of PyTorch, we will use TensorFlow now, in particular the extension TensorFlow Probability, which has easy-to-implement probabilistic layers. Like in the other notebook, we use a U-Net structure consisting of Convolutional blocks with skip-layer connections. For now, we only want to set up the decoder, i.e. second part of the U-Net as bayesian. For this, we will take advantage of TensorFlows _flipout_ layers (in particular, the convolutional implementation). \n",
+"Now let's look at how we can implement BNNs. Instead of PyTorch, we will use TensorFlow, in particular its extension TensorFlow Probability, which provides easy-to-use probabilistic layers. Like in the other notebook, we use a U-Net structure consisting of convolutional blocks with skip-layer connections. For now, we only want to set up the decoder, i.e. the second part of the U-Net, as Bayesian. For this, we will take advantage of TensorFlow's _flipout_ layers (in particular, the convolutional implementation).\n",
 "\n",
 "In a forward pass, those layers automatically sample from the current posterior distribution and store the KL-divergence between prior and posterior in _model.losses_. One can specify the desired divergence measure (typically KL-divergence) and modify the prior and approximate posterior distributions, if other than normal distributions are desired. Other than that, the flipout layers can be used just like regular layers in sequential models. The code below implements a single convolutional block of the U-Net:"
 ]
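
As an illustration of such a block, here is a minimal sketch using `tfp.layers.Convolution2DFlipout`; the actual helper in the notebook may differ, and the name `bayes_block` as well as the layer parameters are only illustrative:

```python
import tensorflow as tf
import tensorflow_probability as tfp

def bayes_block(filters, kernel_size=4, strides=2):
    """Illustrative 'Bayesian' block: flipout convolution + normalization + activation."""
    block = tf.keras.Sequential()
    # Convolution2DFlipout samples its kernel from the approximate posterior in every
    # forward pass and records the prior/posterior KL term in `model.losses`.
    block.add(tfp.layers.Convolution2DFlipout(
        filters, kernel_size=kernel_size, strides=strides, padding='same'))
    block.add(tf.keras.layers.BatchNormalization())
    block.add(tf.keras.layers.LeakyReLU(0.2))
    return block
```
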
@@ -242,7 +242,7 @@
 "source": [
 "Next we define the full network with these blocks - the structure is almost identical to the previous notebook. We manually define the kernel-divergence function as `kdf` and rescale it with a factor called `kl_scaling`. There are two reasons for this: \n",
 "\n",
-"First, we should only apply the kl-divergence once per epoch if we want to use the correct loss (like introduced on the top of this notebook). Since we will use a batch-wise training, we need to rescale the Kl-divergence by the number of batches, such that in every parameter update only _kdf / num_batches_ is added to the loss. During one epoch, _num_batches_ parameter updates were performed and the 'full' KL-divergence was used. This batch scaling computed and passed to the network initialization via `kl_scaling` when instantiating the `Bayes_DfpNet` model later on.\n",
+"First, we should only apply the KL-divergence once per epoch if we want to use the correct loss (as introduced in {doc}`bayesian-intro`). Since we will use batch-wise training, we need to rescale the KL-divergence by the number of batches, such that in every parameter update only _kdf / num_batches_ is added to the loss. Over the course of one epoch, _num_batches_ parameter updates are performed and the 'full' KL-divergence is accounted for. This batch scaling is computed and passed to the network initialization via `kl_scaling` when instantiating the `Bayes_DfpNet` model later on.\n",
 "\n",
 "Second, by scaling the KL-divergence part of the loss up or down, we have a way of tuning how much randomness we want to allow in the network: If we neglect the KL-divergence completely, we would just minimize the regular loss (e.g. MSE or MAE), like in a conventional neural network. If we instead neglect the negative-log-likelihood, we would optimize the network such that we obtain random draws from the prior distribution. Balancing those extremes can be done by fine-tuning the scaling of the KL-divergence and is hard in practice. "
 ]
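
To make the batch-wise rescaling concrete, below is a sketch of how a scaled kernel-divergence function could be defined and attached to a flipout layer; the value of `num_batches` and the layer parameters are placeholders, and in the notebook the scaling factor is handed to the network via the `kl_scaling` constructor argument of `Bayes_DfpNet`:

```python
import tensorflow as tf
import tensorflow_probability as tfp

num_batches = 10          # placeholder: number of mini-batches per epoch
kl_scaling = num_batches

# Scaled kernel divergence: each parameter update only adds KL/num_batches to the loss,
# so the full KL-divergence is accounted for once per epoch.
kdf = lambda q, p, _: tfp.distributions.kl_divergence(q, p) / tf.cast(kl_scaling, tf.float32)

# placeholder layer showing where the scaled divergence function is plugged in
bayes_conv = tfp.layers.Convolution2DFlipout(
    64, kernel_size=4, strides=2, padding='same', kernel_divergence_fn=kdf)
```
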
@@ -355,7 +355,7 @@
 "id": "3lUU7A0o1PzV"
 },
 "source": [
-"We can visualize the learning rate decay: We start off with a constant rate and after half of the EPOCHS we start to decay it exponentially, until arriving at half of the original learning rate."
+"We can visualize the learning rate decay: We start off with a constant rate, and after half of the `EPOCHS` we decay it exponentially until arriving at half of the original learning rate."
 ]
 },
 {
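
For illustration, the schedule described above can be written as a simple function of the epoch index; the values of `EPOCHS` and the initial learning rate below are placeholders, the notebook defines its own:

```python
EPOCHS = 120      # placeholder, the notebook sets its own value
LR = 0.001        # placeholder initial learning rate

def compute_lr(epoch, epochs=EPOCHS, base_lr=LR):
    """Constant rate for the first half of training, then exponential decay to base_lr/2."""
    half = epochs // 2
    if epoch < half:
        return base_lr
    progress = (epoch - half) / max(epochs - half, 1)   # goes from 0 to ~1 over the second half
    return base_lr * 0.5 ** progress

learning_rates = [compute_lr(e) for e in range(EPOCHS)]  # e.g. for plotting the decay curve
```
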
@@ -1008,7 +1008,7 @@
 "id": "icgfvAIqoMpE"
 },
 "source": [
-"This is reassuring: The error on the OOD test set with new shapes is higher than on the validation set. However, also the uncertainty is larger.\n",
+"This is reassuring: The uncertainties on the OOD test set with new shapes are higher than on the validation set. The mean error is also larger for most cases.\n",
 "In general it is hard to obtain a calibrated uncertainty estimate, but since we are dealing with a fairly simple problem here, it seems that the BNN is able to estimate it reasonably well."
 ]
 },
@@ -1138,4 +1138,4 @@
 ]
 }
 ]
-}
+}
@@ -1,13 +1,13 @@
 Introduction to Posterior Inference
 =======================
 
-We should keep in mind that for all measurements, models, and discretizations we have uncertainties. For the former, this typically appears in the form of measurements errors, while model equations usually encompass only parts of a system we're interested in, and for numerical simulations we inherently introduce discretization errors. So a very important question to ask here is how sure we can be sure that an answer we obtain is the correct one. From a statistics viewpoint, we'd like to know the probability distribution for the posterior, i.e., the different outcomes that are possible.
+We should keep in mind that for all measurements, models, and discretizations we have uncertainties. For measurements and observations, this typically appears in the form of measurement errors. Model equations, on the other hand, usually encompass only parts of a system we're interested in (leaving the remainder as an uncertainty), while for numerical simulations we inherently introduce discretization errors. So a very important question to ask here is how we can be sure that an answer we obtain is the correct one. From a statistician's viewpoint, we'd like to know the posterior probability distribution, a distribution that captures the possible uncertainties we have about our model or data.
 
 ## Uncertainty
 
 This admittedly becomes even more difficult in the context of machine learning:
 we're typically facing the task of approximating complex and unknown functions.
-From a probabilistic perspective, the standard process of training an NN here
+From a probabilistic perspective, the standard process of training a NN here
 yields a _maximum likelihood estimation_ (MLE) for the parameters of the network.
 However, this MLE viewpoint does not take any of the uncertainties mentioned above into account:
 for DL training, we likewise have a numerical optimization, and hence an inherent
@@ -18,8 +18,8 @@ _posterior inference_ possible,
 i.e. learn to produce the full output distribution. However, this turns out to be an
 extremely difficult task.
 
-This where so called _Bayesian neural network_ (BNN) approaches come into play. They
-make a form of posterior inference possible by making assumptions about the probability
+This is where so-called _Bayesian neural network_ (BNN) approaches come into play. They
+allow for a form of posterior inference by making assumptions about the probability
 distributions of individual parameters of the network. This gives a distribution for the
 parameters, with which we can evaluate the network multiple times to obtain different versions
 of the output, and in this way sample the distribution of the output.
@@ -48,24 +48,26 @@ However, as a word of caution: if they appear together, the different kinds of u
 
 ## Introduction to Bayesian Neural Networks
 
-In order to combine posterior inference with Neural Networks, we can use standard techniques from Bayesian Modeling and combine them with the Deep Learning machinery. In Bayesian Modeling, we aim at learning _distributions_ over the model parameters instead of these fixed point estimates. In the case of NNs, the model parameters are the weights and biases of the network, summarized as $\theta$. Our goal is therefore to learn a so-called _posterior distribution_ $p({\theta}|{D})$ from the data $D$, which captures the uncertainty we have about the networks weights and biases _after_ observing the data $D$. This posterior distribution is the central quantity of interest here: if we can estimate it reasonably well, we can use it to make good predictions, but also assess uncertainties related to those predictions. For both objectives it is necessary to _marginalize_ over the posterior distribution, i.e. integrate it out. A single prediction for input $x_{i}$ can for example be obtained via
+In order to combine posterior inference with Neural Networks, we can use standard techniques from Bayesian Modeling and combine them with the Deep Learning machinery. In Bayesian Modeling, we aim at learning _distributions_ over the model parameters instead of fixed point estimates. In the case of NNs, the model parameters are the weights and biases, summarized by $\theta$, for a neural network $f$. Our goal is therefore to learn a so-called _posterior distribution_ $p({\theta}|{D})$ from the data $D$, which captures the uncertainty we have about the network's weights and biases _after_ observing the data $D$. This posterior distribution is the central quantity of interest here: if we can estimate it reasonably well, we can use it to make good predictions, but also assess uncertainties related to those predictions. For both objectives it is necessary to _marginalize_ over the posterior distribution, i.e. integrate it out. A single prediction for input $x_{i}$ can for example be obtained via
 
 $$
-\hat{y_{i}}=\int f_{\theta}(x_{i}) ~ p(\theta|D) ~ d\theta
+\hat{y_{i}}=\int f(x_{i}; \theta) ~ p(\theta|D) ~ d\theta
 $$
 
 Similarly, one can for instance compute the standard deviation in order to assess a measure of uncertainty over the prediction $\hat{y_{i}}$.
 
 ## Prior distributions
 
-In order to obtain the required posterior distribution, in Bayesian modeling one first has to define a _prior distribution_ $p({\theta})$ over the network parameters. This prior distribution should encompass the knowledge we have about the network weights _before_ training the model. We for instance know that the weights of neural networks are usually rather small and can be both positive and negative. Centered normal distributions with some small variance parameter are therefore a standard choice. For computational simplicity, they are typically also assumed to be independent from another. When observing data ${D}$, the prior distribution over the weights is updated to the so-called posterior according to Bayes rule:
+In order to obtain the required posterior distribution, in Bayesian modeling one first has to define a _prior distribution_ $p({\theta})$ over the network parameters. This prior distribution should encompass the knowledge we have about the network weights _before_ training the model. We for instance know that the weights of neural networks are usually rather small and can be both positive and negative. Centered normal distributions with some small variance parameter are therefore a standard choice. For computational simplicity, they are typically also assumed to be independent of one another. When observing data ${D}$, the prior distribution over the weights is updated to the posterior according to Bayes' rule:
 
 $$
 p({\theta}|{D}) = \frac{p({D}|{\theta})p({\theta})}{p({D})}
 \text{ . }
 $$
 
-This is, we update our a-priori knowledge after observing data, i.e. we _learn_ from data. The computation required for the Bayesian update is usually intractable, especially when dealing with non-trivial network architectures. Therefore, the posterior $p({\theta}|{D})$ is approximated with an easy-to-evaluate variational distribution $q_{\phi}(\theta)$, parametrized by $\phi$. Again, independent Normal distributions are typically used for each weight. Hence, the parameters $\phi$ contain all mean and variance parameters $\mu, \sigma$ of those normal distributions. The optimization goal is then to find a distribution $q_{\phi}(\theta)$ that is close to the true posterior. One can measure the similarity of two distributions via the KL-divergence.
+That is, we update our a-priori knowledge after observing data, i.e. we _learn_ from data. The computation required for the Bayesian update is usually intractable, especially when dealing with non-trivial network architectures. Therefore, the posterior $p({\theta}|{D})$ is approximated with an easy-to-evaluate variational distribution $q_{\phi}(\theta)$, parametrized by $\phi$. Again, independent Normal distributions are typically used for each weight. Hence, the parameters $\phi$ contain all mean and variance parameters $\mu, \sigma$ of those normal distributions.
+The optimization goal is then to find a distribution $q_{\phi}(\theta)$ that is close to the true posterior.
+One way of assessing this closeness is the KL-divergence, a method used widely in practice for measuring the similarity of two distributions.
 
 ## Evidence lower bound
 
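
As an illustration of the marginalization integral in the hunk above: in practice it is approximated by Monte Carlo sampling, i.e. by averaging several stochastic forward passes. A minimal sketch, assuming `model` is a probabilistic network (such as the flipout-based U-Net of the accompanying notebook) whose forward passes draw fresh weight samples:

```python
import numpy as np

def predict_with_uncertainty(model, x, num_samples=20):
    """Monte Carlo estimate of the predictive mean and standard deviation."""
    # every forward pass of a BNN draws a new set of weights from q_phi(theta),
    # so repeated evaluations sample the output distribution
    samples = np.stack([np.asarray(model(x)) for _ in range(num_samples)])
    return samples.mean(axis=0), samples.std(axis=0)
```
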
@@ -94,9 +96,9 @@ If $S>1$, steps 2 and 3 have to be repeated $S$ times in order to compute the EL
 ## Dropout as alternative
 
 % todo, cite [Gal et al](https://arxiv.org/abs/1506.02142)
-Previous work has also shown that using dropout is mathematically equivalent to an approximation to the probabilistic deep Gaussian process. In other words, if you train a network with dropout (and L2 regularization), this is equivalent to a variational Bayesian network with a specific prior (satisfying the so-called KL-condition) and specific approximate posterior choice (a product of Bernoulli distributions).
+Previous work has shown that using dropout is mathematically equivalent to an approximation of a probabilistic deep Gaussian process. Furthermore, for a specific prior (satisfying the so-called KL-condition) and a specific approximate posterior choice (a product of Bernoulli distributions), training a neural network with dropout and L2 regularization is equivalent to training a variational BNN. In other words, dropout neural networks are a form of Bayesian neural networks.
 
-Obtaining uncertainty estimates is then as easy as training a conventional neural network with dropout and extending dropout, which traditionally has only been used during the training phase, to the prediction phase. This will lead to the predictions being random (because dropout will randomly drop some of the weights), which allows us to compute average and standard deviation statistics for single samples from the dataset, just like for the variational BNN case.
+Obtaining uncertainty estimates is then as easy as training a conventional neural network with dropout and extending dropout, which traditionally has only been used during the training phase, to the prediction phase. This will lead to the predictions being random (because dropout will randomly drop some of the activations), which allows us to compute average and standard deviation statistics for single samples from the dataset, just like for the variational BNN case.
 It is an ongoing discussion in the field whether variational or dropout-based methods are preferable.
 
 ## A practical example
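
To illustrate this 'Monte Carlo dropout' procedure, below is a minimal sketch with a small placeholder Keras model; the essential part is passing `training=True` at prediction time so that dropout stays active:

```python
import numpy as np
import tensorflow as tf

# placeholder model: any network containing dropout layers works the same way
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])

x = tf.random.normal((8, 16))                        # dummy batch of inputs
# training=True keeps dropout active, so every call yields a different prediction
samples = np.stack([model(x, training=True).numpy() for _ in range(20)])
mean, std = samples.mean(axis=0), samples.std(axis=0)
```
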