Consistent caps in sub-headings

NT 2021-08-20 16:51:41 +02:00
parent f6c7622664
commit 479aa10e11
5 changed files with 18 additions and 18 deletions

View File

@@ -1112,7 +1112,7 @@
"id": "h_z2_i_VA1HP"
},
"source": [
-"## Next Steps \n",
+"## Next steps \n",
"\n",
"But now it's time to experiment with BNNs yourself. \n",
"\n",

View File

@@ -19,7 +19,7 @@ as the inference of super-resolution solutions where the range of possible
results can be highly ambiguous.
```
-## Maximum Likelihood Estimation
+## Maximum likelihood estimation
To train a GAN we have to briefly turn to classification problems.
For these, the learning objective takes a slightly different form than the
@@ -50,7 +50,7 @@ The takeaway message here is that the widespread training via cross entropy
is effectively a maximum likelihood estimation for probabilities over the inputs,
as defined in equation {eq}`mle-prob`.
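For illustration, this correspondence can be spelled out in a few lines of plain numpy; the `logits` and `labels` arrays below are made-up placeholders, not data from this chapter:
```python
import numpy as np

# hypothetical mini-batch: 4 samples, 3 classes
logits = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.2,  0.3],
                   [ 1.5, 1.4,  0.2],
                   [-0.5, 0.0,  2.2]])
labels = np.array([0, 1, 1, 2])  # observed classes y_i

# softmax turns the network outputs into probabilities p(y_i|x_i)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# cross entropy = mean negative log-likelihood of the observed labels,
# so minimizing it maximizes the likelihood of the training labels
cross_entropy = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(cross_entropy)
```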
-## Adversarial Training
+## Adversarial training
MLE is a crucial component for GANs: here we have a _generator_ that is typically
similar to a decoder network, e.g., the second half of an autoencoder from {doc}`others-timeseries`.

View File

@@ -28,7 +28,7 @@ the time evolution with $f_t$, and then decode the full spatial information with
```
-## Reduced Order Models
+## Reduced order models
Reducing the dimension and complexity of computational models, often called _reduced order modeling_ (ROM) or _model reduction_, is a classic topic in the computational field. Traditional approaches often employ techniques such as principal component analysis to arrive at a basis for a chosen space of solutions. However, being linear by construction, these approaches have inherent limitations when representing complex, non-linear solution manifolds. In practice, all "interesting" solutions are highly non-linear, and hence DL has received a substantial amount of interest as a way to learn non-linear representations. Due to this non-linearity, DL representations can potentially yield high accuracy with fewer degrees of freedom in the reduced model compared to classic approaches.
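As a small, self-contained sketch of the classic linear variant (the snapshot matrix below is random placeholder data, not a simulation from this book):
```python
import numpy as np

# hypothetical snapshot matrix: each column is one flattened solution state
n_dof, n_snapshots = 1000, 50
snapshots = np.random.rand(n_dof, n_snapshots)

# principal components via SVD yield a linear reduced basis
U, S, Vt = np.linalg.svd(snapshots, full_matrices=False)
k = 8                 # dimension of the reduced space
basis = U[:, :k]      # (n_dof, k)

# encode and decode a state with the linear basis
x = snapshots[:, 0]
code  = basis.T @ x   # reduced representation with k degrees of freedom
x_rec = basis @ code  # linear reconstruction; strongly non-linear solution
                      # manifolds are where this approximation degrades
```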

View File

@@ -128,7 +128,7 @@
"id": "mCUbc-sovPME"
},
"source": [
-"## Data Generation"
+"## Data generation"
]
},
{
@@ -226,7 +226,7 @@
"id": "plZUZD_av3YH"
},
"source": [
-"## Reinforcement Learning Training\n",
+"## Training via reinforcement learning\n",
"\n",
"Next we set up the RL environment. The PPO approach uses a dedicated value estimator network (the \"critic\") to predict the sum of rewards generated from a certain state. These predicted rewards are then used to update a policy network (the \"actor\") which, analogously to the CFE network of {doc}`diffphys-control`, predicts the forces to control the simulation."
]
@@ -267,7 +267,7 @@
"id": "vX0BsYq5ZVad"
},
"source": [
-"### Gym Environment \n",
+"### Gym environment \n",
"\n",
"The gym environment specification provides an interface leveraging the interaction with the agent. Environments implementing it must specify observation and action spaces, which represent the in- and output spaces of the agent. Further, they have to define a set of methods, the most important ones being `reset`, `step`, and `render`. \n",
"\n",
@@ -277,7 +277,7 @@
"\n",
"`stable-baselines3` expands on the default gym environment by providing an interface for vectorized environments. This makes it possible to compute the forward pass for multiple trajectories simultaneously, which can in turn increase time efficiency through better resource utilization. In practice, this means that the methods now work on vectors of observations, actions, rewards, terminal state flags and info dictionaries. The step method is split into `step_async` and `step_wait`, making it possible to run individual instances of the environment on different threads.\n",
"\n",
-"### Physics Simulation \n",
+"### Physics simulation \n",
"\n",
"The environment for Burgers' equation contains a `Burgers` physics object provided by `phiflow`. The states are internally stored as `BurgersVelocity` objects. To create the initial states, the environment generates batches of random fields in the same fashion as in the data set generation process shown above. The observation space consists of the velocity fields of the current and target states stacked in the channel dimension with another channel specifying the current time step. Actions are taken in the form of a one dimensional array covering every velocity value. The `step` method calls the physics object to advance the internal state by one time step, also applying the actions as a `FieldEffect`.\n",
"\n",
@@ -290,7 +290,7 @@
"id": "1fjV4HOSGGim"
},
"source": [
-"### Neural Network\n",
+"### Neural network setup\n",
"\n",
"We use two different neural network architectures for the actor and critic respectively. The former uses the U-Net variant from {cite}`holl2019pdecontrol`, while the latter consists of a series of 1D convolutional and pooling layers reducing the feature map size to one. The final operation is a convolution with kernel size one to combine the feature maps and retain one output value. The `CustomActorCriticPolicy` class then makes it possible to use these two separate network architectures for the reinforcement learning agent.\n",
"\n",
@@ -425,7 +425,7 @@
"id": "7WlqEvsOL7Rt"
},
"source": [
-"## RL Evaluation\n",
+"## RL evaluation\n",
"\n",
"Now that we have a trained model, let's take a look at the results. The leftmost plot shows the results of the reinforcement learning agent. As reference, next to it are shown the ground truth, i.e. the trajectory the agent should reconstruct, and the uncontrolled simulation where the system follows its natural evolution."
]
@@ -495,7 +495,7 @@
"id": "2E_sFqgo2SiU"
},
"source": [
-"## Differentiable Physics Training\n",
+"## Differentiable physics training\n",
"\n",
"To classify the results of the reinforcement learning method, we now compare them to an approach using differentiable physics training. In contrast to the full approach from {doc}`diffphys-control` which includes a second _OP_ network, we aim for a direct control here. The OP network represents a separate \"physics-predictor\", which is omitted here for fairness when comparing with the RL version.\n",
"\n",
@@ -905,7 +905,7 @@
"id": "Yh-xD2cG7d9A"
},
"source": [
-"### Trajectory Comparison\n",
+"### Trajectory comparison\n",
"\n",
"To compare the resulting trajectories, we generate trajectories from the test set with either method. Also, we collect the ground truth simulations and the natural evolution of the test set fields.\n",
"\n"
@@ -1005,7 +1005,7 @@
"id": "ZsksKs4e4QJA"
},
"source": [
-"### Comparison of Exerted Forces\n",
+"### Comparison of exerted forces\n",
"\n",
"Next, we compute the forces the approaches have generated and applied for the test set trajectories."
]
@@ -1223,7 +1223,7 @@
"id": "DGBlUpQ271Ww"
},
"source": [
-"## Training Progress Comparison\n",
+"## Training progress comparison\n",
"\n",
"Although the quality of the control in terms of force magnitudes is the primary goal of the setup above, there are interesting differences in terms of how both methods behave at training time. The main difference of the physics-unaware RL training and the DP approach with its tightly coupled solver is that the latter results in a significantly faster convergence. I.e., the gradients provided by the numerical solver give a much better learning signal than the undirected exploration of the RL process. The behavior of the RL training, on the other hand, can in part be ascribed to the on-policy nature of training data collection and to the \"brute-force\" exploration of the reinforcement learning technique.\n",
"\n",

View File

@@ -36,7 +36,7 @@ Value-based methods, such as _Q-Learning_, on the other hand, work by optimizing
In addition, _actor-critic_ methods combine elements from both approaches. Here, the actions generated by a policy network are rated based on a corresponding change in state potential. These values are given by another neural network and approximate the expected cumulative reward from the given state. _Proximal policy optimization_ (PPO) {cite}`schulman2017proximal` is one example from this class of algorithms and is our choice for the example task of this chapter, which is controlling Burgers' equation as a physical environment.
-## Proximal Policy Optimization
+## Proximal policy optimization
As PPO methods are an actor-critic approach, we need to train two interdependent networks: the actor and the critic.
The objective of the actor inherently depends on the output of the critic network (it provides feedback on which actions are worth performing), and likewise the critic depends on the actions generated by the actor network (this determines which states to explore).
@@ -47,7 +47,7 @@
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working on continuous action spaces. This can be attributed to the fact that it tends to achieve good results with a stable learning progress, while still being comparatively easy to implement.
-### PPO-Clip
+### PPO-clip
More specifically, we will use the algorithm _PPO-clip_ {cite}`schulman2017proximal`. This PPO variant sets a hard limit for the change in behavior caused by singular update steps. As such, the algorithm uses a previous network state (denoted by a subscript $_p$ below) to limit the change per step of the learning process.
In the following, we will denote the network parameters of the actor network as $\theta$ and those of the critic as $\phi$.
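In code, the resulting clipped surrogate objective for the actor takes roughly the following form; the probability ratios and advantages below are placeholder values, and `eps` denotes the clipping limit:
```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO-clip (to be maximized).

    ratio:     pi_theta(a|s) / pi_theta_p(a|s), current vs. previous policy
    advantage: advantage estimates A for the sampled state-action pairs
    eps:       hard limit on how far the ratio may shift the objective
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# hypothetical mini-batch of three samples
print(ppo_clip_objective(np.array([0.8, 1.05, 1.6]),
                         np.array([1.0, -0.5, 2.0])))
```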
@@ -79,7 +79,7 @@ As the actor network is trained to provide the expected value, at training time
an additional standard deviation is used to sample values from a Gaussian distribution around this mean.
It is decreased over the course of the training, and at inference time we only evaluate the mean (i.e. a distribution with variance 0).
-### Critic and Advantage
+### Critic and advantage
The critic is represented by a value function $V(s; \phi)$ that predicts the expected cumulative reward to be received from state $s$.
Its objective is to minimize the squared advantage $A$:
@@ -102,7 +102,7 @@ the GAE can be understood as a discounted cumulative sum of these estimates, from
---
-## Application to Inverse Problems
+## Application to inverse problems
Reinforcement learning is widely used for trajectory optimization with multiple decision problems building upon one another. However, in the context of physical systems and PDEs, reinforcement learning algorithms are likewise attractive. In this setting, they can operate in a fashion that's similar to supervised single shooting approaches by generating full trajectories and learning by comparing the final approximation to the target.