# Introduction to Posterior Inference

We should keep in mind that for all measurements, models, and discretizations we have uncertainties. For the former, these typically appear in the form of measurement errors, while model equations usually encompass only parts of the system we're interested in, and for numerical simulations we inherently introduce discretization errors. So a very important question to ask here is how sure we can be that an answer we obtain is the correct one. From a statistics viewpoint, we'd like to know the full posterior distribution, i.e., the probabilities of the different outcomes that are possible.

### Uncertainty

This admittedly becomes even more difficult in the context of machine learning:
we're typically facing the task of approximating complex and unknown functions.
From a probabilistic perspective, the standard process of training an NN here
yields a _maximum likelihood estimation_ (MLE) for the parameters of the network.
However, this MLE viewpoint does not take any of the uncertainties mentioned above into account:
for DL training, we likewise have a numerical optimization, and hence an inherent
approximation error and uncertainty regarding the learned representation.
Ideally, we should reformulate the learning process such that it takes
its own uncertainties into account and makes
_posterior inference_ possible,
i.e., learns to produce the full output distribution. However, this turns out to be an
extremely difficult task.

This is where so-called _Bayesian neural network_ (BNN) approaches come into play. They
make a form of posterior inference possible by making assumptions about the probability
distributions of individual parameters of the network. This gives a distribution for the
parameters, with which we can evaluate the network multiple times to obtain different versions
of the output, and in this way sample the distribution of the output.

Nonetheless, the task
remains very challenging. Training a BNN is typically significantly more difficult
than training a regular NN. This should come as no surprise, as we're trying to
learn something fundamentally different here: a full probability distribution
instead of a point estimate. (All previous chapters "just" dealt with
learning such point estimates, and the tasks were still far from trivial.)



## Introduction to Bayesian Neural Networks

In order to combine posterior inference with Neural Networks, we can use standard techniques from Bayesian Modeling and combine them with the Deep Learning machinery. In Bayesian Modeling, we aim at learning _distributions_ over the model parameters instead of fixed point estimates. In the case of NNs, the model parameters are the weights and biases of the network, summarized as $\theta$. Our goal is therefore to learn a so-called _posterior distribution_ $p({\theta}|{D})$ from the data $D$, which captures the uncertainty we have about the network's weights and biases _after_ observing the data $D$. This posterior distribution is the central quantity of interest here: if we can estimate it reasonably well, we can use it to make good predictions, but also assess uncertainties related to those predictions. For both objectives it is necessary to _marginalize_ over the posterior distribution, i.e. integrate it out. A single prediction for input $x_{i}$ can for example be obtained via

$$
\hat{y_{i}}=\int f_{\theta}(x_{i}) ~ p(\theta|D) ~ d\theta
$$

Similarly, one can for instance compute the standard deviation in order to assess a measure of uncertainty over the prediction $\hat{y_{i}}$.
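
In practice, this marginalization is approximated by sampling: we draw several parameter sets from the posterior (or its approximation), evaluate the network for each, and aggregate the outputs. The following is a minimal sketch of this idea; the helper `sample_network`, which is assumed to return a network with freshly sampled parameters, is hypothetical and not part of the code used later in this chapter.

```python
import torch

def predict_with_uncertainty(sample_network, x, num_samples=20):
    """Monte-Carlo approximation of the predictive mean and standard deviation.

    sample_network() is assumed to return a model whose weights are drawn
    from (an approximation of) the posterior p(theta|D)."""
    with torch.no_grad():
        outputs = torch.stack([sample_network()(x) for _ in range(num_samples)])
    return outputs.mean(dim=0), outputs.std(dim=0)
```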

## Prior distributions

In order to obtain the required posterior distribution, in Bayesian modeling one first has to define a _prior distribution_ $p({\theta})$ over the network parameters. This prior distribution should encompass the knowledge we have about the network weights _before_ training the model. We for instance know that the weights of neural networks are usually rather small and can be both positive and negative. Centered normal distributions with some small variance parameter are therefore a standard choice. For computational simplicity, they are typically also assumed to be independent of one another. When observing data ${D}$, the prior distribution over the weights is updated to the so-called posterior according to Bayes' rule:

$$
p({\theta}|{D}) = \frac{p({D}|{\theta})p({\theta})}{p({D})}
\text{ . }
$$

That is, we update our a-priori knowledge after observing data, i.e. we _learn_ from data. The computation required for the Bayesian update is usually intractable, especially when dealing with non-trivial network architectures. Therefore, the posterior $p({\theta}|{D})$ is approximated with an easy-to-evaluate variational distribution $q_{\phi}(\theta)$, parametrized by $\phi$. Again, independent Normal distributions are typically used for each weight. Hence, the parameters $\phi$ contain all mean and variance parameters $\mu, \sigma$ of those normal distributions. The optimization goal is then to find a distribution $q_{\phi}(\theta)$ that is close to the true posterior. One can measure the similarity of two distributions via the KL-divergence.
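
To make this concrete, here is a minimal sketch of what such a variational layer could look like in PyTorch: every weight and bias gets a mean $\mu$ and a standard deviation $\sigma$ (parametrized via a softplus to keep it positive), and a fresh set of weights is sampled in each forward pass. This is an illustrative assumption for how $q_{\phi}(\theta)$ can be represented, not the exact implementation used in the notebook below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    """Linear layer with an independent normal distribution per weight and bias."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu  = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -4.0))
        self.b_mu  = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -4.0))

    def forward(self, x):
        # reparameterization: sample w = mu + sigma * eps with eps ~ N(0, 1)
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)
```

Here $\phi$ corresponds to the collection of all `*_mu` and `*_rho` tensors.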

## Evidence lower bound

We cannot directly minimize the KL-divergence between the approximate and the true posterior, $KL(q_{\phi}({\theta})||p({\theta}|D))$, because we do not have access to the true posterior distribution. It is however possible to show that one can equivalently maximize the so-called evidence lower bound (ELBO), a quantity well known from variational inference:

$$
\mathcal{L}(\phi)= E_{q_{\phi}}[\log(p(D|{\theta}))] - KL(q_{\phi}({\theta})||p({\theta}))
\text{ , }
$$

The ELBO (or negative ELBO, if one prefers to minimize instead of maximize) is the optimization objective for BNNs. The first term is an expected log-likelihood of the data. Maximizing it means explaining the data as well as possible. In practice, the log-likelihood is often a conventional loss function like the mean squared error (note that the MSE can be seen as a negative log-likelihood for normal noise with unit variance). The second term is the negative KL-divergence between the approximate posterior and the prior. For suitable prior and approximate posterior choices (like the ones above), this term can be computed analytically. Maximizing it means encouraging the approximate network weight distributions to stay close to the prior distribution. In that sense, the two terms of the ELBO have opposite goals: The first term encourages the model to explain the data as well as possible, whereas the second term encourages the model to stay close to the (random) prior distributions, which implies randomness and regularization.
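
For the choices above (diagonal Gaussians for both prior and approximate posterior), the KL term indeed has a closed form. As a small sketch under these assumptions (the tensors `mu` and `sigma` are assumed to hold the stacked means and standard deviations of all weights, and the MSE stands in for the negative log-likelihood):

```python
import torch

def neg_elbo(pred, target, mu, sigma, prior_sigma=0.1):
    """Negative ELBO for q = N(mu, sigma^2) per weight and prior N(0, prior_sigma^2)."""
    # expected log-likelihood term, approximated by the MSE of one weight sample
    nll = torch.mean((pred - target) ** 2)
    # analytic KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over all weights
    kl = torch.sum(
        torch.log(prior_sigma / sigma)
        + (sigma ** 2 + mu ** 2) / (2.0 * prior_sigma ** 2)
        - 0.5
    )
    return nll + kl
```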

The expectation of the log-likelihood is typically not available in analytical form, but can be approximated in several ways. One can, for instance, use Monte-Carlo sampling and draw $S$ samples from $q_{\phi}({\theta})$. The expectation is then approximated via $\frac{1}{S}\sum_{s=1}^{S}\log(p(D|{\theta_{s}}))$. In practice, even a single sample, i.e. $S=1$, can be enough. Furthermore, the expectation of the log-likelihood is typically not evaluated on the whole dataset, but approximated by a batch of data $D_{batch}$, which enables the use of batch-wise stochastic gradient descent.
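
A small sketch of this estimator (the arguments are assumed helpers: `sample_network` draws a parameter sample $\theta_{s}$ and returns the corresponding network, and `log_likelihood` evaluates $\log(p(D_{batch}|\theta_{s}))$ for a batch):

```python
def mc_log_likelihood(sample_network, log_likelihood, x_batch, y_batch, S=1):
    """Monte-Carlo estimate of the expected log-likelihood over S weight samples."""
    total = 0.0
    for _ in range(S):
        net = sample_network()  # draw theta_s from q_phi(theta)
        total = total + log_likelihood(net(x_batch), y_batch)
    return total / S
```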

## The BNN training loop

For $S=1$, one iteration of the training loop then boils down to the following steps, sketched in code below:
1. sampling a batch of data $D_{batch}$
2. sampling network weights and biases from $q_{\phi}({\theta})$
3. forwarding the batch of data through the network according to the sampled weights
4. evaluating the ELBO
5. backpropagating the ELBO through the network and updating $\phi$
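
As a rough sketch of these steps (assuming a model built from variational layers like the one sketched above, and an assumed helper `model.kl_divergence()` that returns the analytic KL term; this is an illustration, not the notebook implementation):

```python
import torch

def train_epoch(model, loader, optimizer):
    for x, y in loader:                      # 1. sample a batch of data
        optimizer.zero_grad()
        y_pred = model(x)                    # 2. + 3. sample weights, forward the batch
        nll = torch.mean((y_pred - y) ** 2)  # log-likelihood term (here: MSE)
        loss = nll + model.kl_divergence()   # 4. evaluate the negative ELBO
        loss.backward()                      # 5. backpropagate and update phi
        optimizer.step()
```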

If $S>1$, steps 2 and 3 have to be repeated $S$ times in order to compute the ELBO. In that sense, training a variational BNN is fairly similar to training a conventional NN: we still use SGD and forward-backward passes to optimize our loss function; the only difference is that we are now optimizing over distributions instead of single values. If you are curious about how one can backpropagate through distributions, you can, e.g., read about it [here](https://arxiv.org/abs/1505.05424), and a more detailed introduction to Bayesian Neural Networks is available in Y. Gal's [thesis](https://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf), in chapters 2 & 3.

## Dropout as alternative

% todo, cite [Gal et al](https://arxiv.org/abs/1506.02142)
Previous work has also shown that using dropout is mathematically equivalent to an approximation of a probabilistic deep Gaussian process. In other words, if you train a network with dropout (and L2 regularization), this is equivalent to a variational Bayesian network with a specific prior (satisfying the so-called KL-condition) and a specific approximate posterior choice (a product of Bernoulli distributions).

Obtaining uncertainty estimates is then as easy as training a conventional neural network with dropout and extending dropout, which traditionally has only been used during the training phase, to the prediction phase. This will lead to the predictions being random (because dropout will randomly drop some of the weights), which allows us to compute average and standard deviation statistics for single samples from the dataset, just like for the variational BNN case.
It is an ongoing discussion in the field whether variational or dropout-based methods are preferable.
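
A minimal sketch of this prediction-phase use of dropout (the helper name is hypothetical; it assumes `model` contains `torch.nn.Dropout` layers):

```python
import torch

def mc_dropout_predict(model, x, num_samples=20):
    """Monte-Carlo dropout: keep dropout active at prediction time and
    aggregate several stochastic forward passes."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # re-enable only the dropout layers
    with torch.no_grad():
        outputs = torch.stack([model(x) for _ in range(num_samples)])
    return outputs.mean(dim=0), outputs.std(dim=0)
```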

## A practical example

As a first real example for posterior inference with variational BNNs, let's revisit the
case of turbulent flows around airfoils, from {doc}`supervised-airfoils`.
However, in contrast to the point estimate learned in that section, we'll now aim for
learning the full posterior.

"from torch.utils.data import Dataset\n",
"print(\"Torch version {}\".format(torch.__version__))\n",
"\n",
"# get training data\n",
|
||||
"if True:\n",
|
||||
" # download\n",
|
||||
" if not os.path.isfile('data-airfoils.npz'):\n",
|
||||
" import urllib.request\n",
|
||||
" url=\"https://ge.in.tum.de/download/2019-deepFlowPred/data.npz\"\n",
|
||||
" print(\"Downloading training data (300MB), this can take a few minutes the first time...\")\n",
|
||||
" urllib.request.urlretrieve(url, 'data-airfoils.npz')\n",
|
||||
" npfile=np.load('data-airfoils.npz')\n",
|
||||
"\n",
|
||||
"else:\n",
|
||||
" # alternative: load from google drive (upload there beforehand):\n",
|
||||
" from google.colab import drive\n",
|
||||
" drive.mount('/content/gdrive')\n",
|
||||
" npfile=np.load('gdrive/My Drive/data-airfoils.npz')\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"npfile=np.load(\"data-airfoils.npz\")\n",
"print(\"Loaded data, {} training, {} validation samples\".format(len(npfile[\"inputs\"]),len(npfile[\"vinputs\"])))\n",
"\n",
"print(\"Size of the inputs array: \"+format(npfile[\"inputs\"].shape))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you run this notebook in colab, the `else` statement above (which is deactivated by default) might be interesting for you: instead of downloading the training data anew every time, you can manually download it once and store it in your google drive. We assume it's stored in the root directory as `data-airfoils.npz`. Afterwards, you can use the code above to load the file from your google drive, which is typically much faster. This is highly recommended if you want to experiment more extensively via colab."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RY1F4kdWPLNG"
},
"source": [
"## RANS training data\n",
"\n",
"Now we have some training data. In general it's very important to understand the data we're working with as much as possible (for any ML task the _garbage-in-garbage-out_ principle definitely holds). We should at least understand the data in terms of dimensions and rough statistics, but ideally also in terms of content. Otherwise we'll have a very hard time interpreting the results of a training run. And despite all the DL magic: if you can't make out any patterns in your data, NNs surely won't find any useful ones.\n",
"\n",
"Hence, let's look at one of the training samples... The following is just some helper code to show images side by side."
" self.layer1 = nn.Sequential()\n",
" self.layer1.add_module('layer1', nn.Conv2d(3, channels, 4, 2, 1, bias=True))\n",
"\n",
" self.layer2 = blockUNet(channels  , channels*2, 'enc_layer2', transposed=False, bn=True, relu=False, dropout=dropout )\n",
" self.layer3 = blockUNet(channels*2, channels*2, 'enc_layer3', transposed=False, bn=True, relu=False, dropout=dropout )\n",
" self.layer4 = blockUNet(channels*2, channels*4, 'enc_layer4', transposed=False, bn=True, relu=False, dropout=dropout )\n",
" self.layer5 = blockUNet(channels*4, channels*8, 'enc_layer5', transposed=False, bn=True, relu=False, dropout=dropout )\n",
" self.layer6 = blockUNet(channels*8, channels*8, 'enc_layer6', transposed=False, bn=True, relu=False, dropout=dropout , size=2,pad=0)\n",
" self.layer7 = blockUNet(channels*8, channels*8, 'enc_layer7', transposed=False, bn=True, relu=False, dropout=dropout , size=2,pad=0)\n",
"\n",
" # note, kernel size is internally reduced by one for the decoder part\n",
" self.dlayer7 = blockUNet(channels*8, channels*8, 'dec_layer7', transposed=True, bn=True, relu=True, dropout=dropout , size=2,pad=0)\n",