Consistent caps in sub-headings

This commit is contained in:
NT
2021-08-20 16:51:41 +02:00
parent f6c7622664
commit 479aa10e11
5 changed files with 18 additions and 18 deletions


@@ -36,7 +36,7 @@ Value-based methods, such as _Q-Learning_, on the other hand work by optimizing
In addition, _actor-critic_ methods combine elements from both approaches. Here, the actions generated by a policy network are rated based on a corresponding change in state potential. These values are given by another neural network and approximate the expected cumulative reward from the given state. _Proximal policy optimization_ (PPO) {cite}`schulman2017proximal` is one example from this class of algorithms and is our choice for the example task of this chapter, which is controlling Burgers' equation as a physical environment.
-## Proximal Policy Optimization
+## Proximal policy optimization
As PPO is an actor-critic method, we need to train two interdependent networks: the actor and the critic.
The objective of the actor inherently depends on the output of the critic network (it provides feedback on which actions are worth performing), and likewise the critic depends on the actions generated by the actor network (they determine which states to explore).
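To make this two-network structure concrete, here is a minimal sketch; the layer sizes, the plain-numpy MLPs, and the random initialization are illustrative stand-ins, not the implementation used in this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only
state_dim, action_dim, hidden = 32, 32, 64

# One hidden layer each; a real implementation would use a DL framework.
actor_params = {"W1": rng.normal(0, 0.1, (state_dim, hidden)),
                "W2": rng.normal(0, 0.1, (hidden, action_dim))}
critic_params = {"W1": rng.normal(0, 0.1, (state_dim, hidden)),
                 "W2": rng.normal(0, 0.1, (hidden, 1))}

def actor(s, p):
    """Mean action proposed by the policy network."""
    return np.tanh(s @ p["W1"]) @ p["W2"]

def critic(s, p):
    """Scalar value estimate V(s) used to rate the actor's actions."""
    return (np.tanh(s @ p["W1"]) @ p["W2"]).squeeze(-1)

s = rng.normal(size=(4, state_dim))   # a batch of states
mu = actor(s, actor_params)           # per-state action means, shape (4, action_dim)
v = critic(s, critic_params)          # per-state value estimates, shape (4,)
```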
@@ -47,7 +47,7 @@
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working on continuous action spaces. This can be attributed to the fact that it tends to achieve good results with a stable learning progress, while still being comparatively easy to implement.
-### PPO-Clip
+### PPO-clip
More specifically, we will use the algorithm _PPO-clip_ {cite}`schulman2017proximal`. This PPO variant sets a hard limit for the change in behavior caused by individual update steps. As such, the algorithm uses a previous network state (denoted by a subscript $_p$ below) to limit the change per step of the learning process.
In the following, we will denote the network parameters of the actor network as $\theta$ and those of the critic as $\phi$.
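The resulting clipped surrogate objective can be sketched as follows; the function name and the default clip range `eps=0.2` are assumptions for illustration:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective of PPO-clip (to be maximized).

    logp_new : log pi_theta(a|s) under the current actor
    logp_old : log pi_theta_p(a|s) under the previous network state
    advantage: advantage estimate A for the taken action
    eps      : clip range that hard-limits the per-step change in behavior
    """
    ratio = np.exp(logp_new - logp_old)          # probability ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum makes the clipped term a pessimistic lower bound.
    return np.minimum(ratio * advantage, clipped * advantage).mean()
```

When the current and previous policies agree, the ratio is 1 and the objective reduces to the mean advantage; large ratios are capped at `1 + eps`, so a single advantage estimate cannot push the actor arbitrarily far.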
@@ -79,7 +79,7 @@ As the actor network is trained to provide the expected value, at training time
additional standard deviation is used to sample values from a Gaussian distribution around this mean.
It is decreased over the course of the training, and at inference time we only evaluate the mean (i.e. a distribution with variance 0).
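This sampling scheme could be sketched as follows; the linear decay schedule and its start and end values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(mean, std, training=True):
    """Sample from N(mean, std^2) during training; return the mean at inference."""
    if training:
        return mean + std * rng.normal(size=np.shape(mean))
    return mean  # a distribution with variance 0

def std_schedule(step, total, std0=0.5, std1=0.01):
    """Standard deviation decreased over the course of training (linear decay)."""
    frac = min(step / total, 1.0)
    return std0 + frac * (std1 - std0)
```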
-### Critic and Advantage
+### Critic and advantage
The critic is represented by a value function $V(s; \phi)$ that predicts the expected cumulative reward to be received from state $s$.
Its objective is to minimize the squared advantage $A$:
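A minimal sketch of this objective, together with a generalized advantage estimate built from one-step TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; the function names and the values of `gamma` and `lam` are illustrative assumptions:

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimate: a discounted cumulative sum of the
    one-step TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    v = np.append(values, last_value)
    deltas = rewards + gamma * v[1:] - v[:-1]
    adv = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):  # accumulate from the end of the trajectory
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv

def critic_loss(advantages):
    """Squared advantage minimized by the critic."""
    return np.mean(advantages ** 2)
```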
@@ -102,7 +102,7 @@ the GAE can be understood as a discounted cumulative sum of these estimates, fro
---
-## Application to Inverse Problems
+## Application to inverse problems
Reinforcement learning is widely used for trajectory optimization with multiple decision problems building upon one another. However, in the context of physical systems and PDEs, reinforcement learning is likewise attractive. In this setting, it can operate in a fashion similar to supervised single shooting approaches, generating full trajectories and learning by comparing the final approximation to the target.
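Such a single-shooting-style rollout can be sketched as follows; the toy `physics_step` is a hypothetical stand-in for an actual Burgers' solver, and the linear policy is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def physics_step(state, action):
    """Toy stand-in for one step of a discretized PDE (not a real solver)."""
    return state + 0.1 * action

def rollout(policy, state, target, steps=16):
    """Generate a full trajectory and reward only the final match to the
    target, analogous to a single shooting approach."""
    for _ in range(steps):
        state = physics_step(state, policy(state))
    return -np.sum((state - target) ** 2)  # reward from the final state only

# A hypothetical linear policy for illustration
K = rng.normal(0, 0.1, (8, 8))
reward = rollout(lambda s: s @ K, rng.normal(size=8), np.zeros(8))
```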