unified writing of data set
parent 6e36804033
commit 0e3cd32658
@@ -80,7 +80,7 @@ $$
The ELBO (or negative ELBO, if one prefers to minimize instead of maximize) is the optimization objective for BNNs. The first term is an expected log-likelihood of the data. Maximizing it means explaining the data as well as possible. In practice, the log-likelihood is often a conventional loss function like mean squared error (note that MSE can be seen as negative log-likelihood for normal noise with unit variance). The second term is the negative KL-divergence between the approximate posterior and the prior. For suitable prior and approximate posterior choices (like the ones above), this term can be computed analytically. Maximizing it means encouraging the approximate network weight distributions to stay close to the prior distribution. In that sense, the two terms of the ELBO have opposing goals: The first term encourages the model to explain the data as well as possible, whereas the second term encourages the model to stay close to the (random) prior distributions, which implies randomness and regularization.
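For a concrete view of the second term: assuming, for example, an independent Gaussian $\mathcal{N}(\mu_i,\sigma_i^2)$ for each weight in the approximate posterior and a zero-mean Gaussian prior $\mathcal{N}(0,\sigma_p^2)$ (a common suitable choice), the KL-divergence has the closed form

$$
\text{KL}\big(q_{\phi}({\theta}) \,\|\, p({\theta})\big) = \sum_{i} \left( \log\frac{\sigma_p}{\sigma_i} + \frac{\sigma_i^2 + \mu_i^2}{2\sigma_p^2} - \frac{1}{2} \right) ,
$$

which can be evaluated and differentiated directly with respect to the variational parameters $\phi=\{\mu_i,\sigma_i\}$.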
- The expectation of the log-likelihood is typically not available in analytical form, but can be approximated in several ways. One can, for instance, use Monte-Carlo sampling and draw $S$ samples from $q_{\phi}({\theta})$. The expectation is then approximated via $\frac{1}{S}\sum_{s=1}^{S}\log(p(D|{\theta_{s}}))$. In practice, even a single sample, i.e. $S=1$, can be enough. Furthermore, the expectation of the log-likelihood is typically not evaluated on the whole dataset, but approximated by a batch of data $D_{batch}$, which enables the use of batch-wise stochastic gradient descent.
+ The expectation of the log-likelihood is typically not available in analytical form, but can be approximated in several ways. One can, for instance, use Monte-Carlo sampling and draw $S$ samples from $q_{\phi}({\theta})$. The expectation is then approximated via $\frac{1}{S}\sum_{s=1}^{S}\log(p(D|{\theta_{s}}))$. In practice, even a single sample, i.e. $S=1$, can be enough. Furthermore, the expectation of the log-likelihood is typically not evaluated on the whole data set, but approximated by a batch of data $D_{batch}$, which enables the use of batch-wise stochastic gradient descent.
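To connect these pieces, here is a minimal, self-contained sketch (PyTorch is used purely for illustration; names and hyper-parameters are placeholders, and this is not the code of the practical example below): a mean-field variational layer with the reparameterization trick, the closed-form Gaussian KL from above, and a single-sample ($S=1$) mini-batch estimate of the negative ELBO with MSE as the negative log-likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Minimal mean-field variational linear layer (illustrative only)."""
    def __init__(self, n_in, n_out, prior_std=1.0):
        super().__init__()
        self.w_mu  = nn.Parameter(torch.zeros(n_out, n_in))
        self.w_rho = nn.Parameter(torch.full((n_out, n_in), -5.0))  # sigma = softplus(rho)
        self.bias  = nn.Parameter(torch.zeros(n_out))
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        # Reparameterization trick: a fresh weight sample from q_phi(theta) on every call.
        w = self.w_mu + sigma * torch.randn_like(sigma)
        return F.linear(x, w, self.bias)

    def kl(self):
        # Closed-form KL( N(mu, sigma^2) || N(0, prior_std^2) ), summed over all weights.
        sigma = F.softplus(self.w_rho)
        return (torch.log(self.prior_std / sigma)
                + (sigma ** 2 + self.w_mu ** 2) / (2.0 * self.prior_std ** 2) - 0.5).sum()

def neg_elbo(layer, x_batch, y_batch, n_total):
    """Single-sample (S=1) mini-batch estimate of the negative ELBO."""
    y_pred = layer(x_batch)                                   # one Monte-Carlo draw of theta
    nll = 0.5 * ((y_pred - y_batch) ** 2).sum(dim=-1).mean()  # MSE = Gaussian NLL, unit variance
    return nll + layer.kl() / n_total                         # KL re-weighted for mini-batching

# Example usage: one gradient step on a random batch of 8 out of n_total=1000 samples.
layer = BayesLinear(3, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
loss = neg_elbo(layer, torch.randn(8, 3), torch.randn(8, 1), n_total=1000)
loss.backward(); opt.step(); opt.zero_grad()
```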
## The BNN training loop
@@ -98,7 +98,7 @@ If $S>1$, steps 2 and 3 have to be repeated $S$ times in order to compute the ELBO
% todo, cite [Gal et al](https://arxiv.org/abs/1506.02142)
Previous work has shown that using dropout is mathematically equivalent to an approximation of a probabilistic deep Gaussian process. Furthermore, for a specific prior (satisfying the so-called KL-condition) and a specific approximate posterior choice (a product of Bernoulli distributions), training a neural network with dropout and L2 regularization and training a variational BNN result in an equivalent optimization procedure. In other words, dropout neural networks are a form of Bayesian neural networks.
- Obtaining uncertainty estimates is then as easy as training a conventional neural network with dropout and extending the use of dropout, which is traditionally restricted to the training phase, to the prediction phase. This will lead to the predictions being random (because dropout will randomly drop some of the activations), which allows us to compute average and standard deviation statistics for single samples from the dataset, just like for the variational BNN case.
+ Obtaining uncertainty estimates is then as easy as training a conventional neural network with dropout and extending the use of dropout, which is traditionally restricted to the training phase, to the prediction phase. This will lead to the predictions being random (because dropout will randomly drop some of the activations), which allows us to compute average and standard deviation statistics for single samples from the data set, just like for the variational BNN case.
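As a sketch of how this can look in code (again PyTorch purely for illustration; `model` stands in for any trained network that contains dropout layers):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    """Monte-Carlo dropout: keep dropout active at prediction time and
    aggregate repeated stochastic forward passes into mean and std.
    `model` is a placeholder for any trained network with nn.Dropout layers."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):   # re-enable only the dropout layers
            m.train()
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

# Example usage with a small placeholder network:
net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
mean, std = mc_dropout_predict(net, torch.randn(8, 3))
```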
It is an ongoing discussion in the field whether variational or dropout-based methods are preferable.
## A practical example
@@ -3,7 +3,7 @@ Discussion of Supervised Approaches
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
- collecting a large enough dataset of examples. Once that exists, we can
+ collecting a large enough data set of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
very quickly. There are a few important points to keep in mind when
@@ -58,7 +58,7 @@ approach for how to best achieve this, we can strongly recommend to track
a broad range of statistics of your data set. Good starting points are
the per-quantity mean, standard deviation, min and max values.
If some of these contain unusual values, this is a first indicator of bad
- samples in the dataset.
+ samples in the data set.
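A small sketch of such a check (the quantity names and array shapes are placeholders for whatever the data set actually contains):

```python
import numpy as np

def summarize(name, values):
    """Print mean, standard deviation, min and max for one quantity of the data set."""
    v = np.asarray(values, dtype=np.float64).ravel()
    print(f"{name:12s} mean={v.mean(): .4g}  std={v.std(): .4g}  "
          f"min={v.min(): .4g}  max={v.max(): .4g}")

# Placeholder example: per-quantity arrays with one entry per sample and grid cell.
rng = np.random.default_rng(0)
for name, arr in {"pressure": rng.normal(0, 1, (100, 32, 32)),
                  "velocity_x": rng.normal(0, 2, (100, 32, 32))}.items():
    summarize(name, arr)
```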
These values can
also be easily visualized as histograms, to track down