first round of updates for RL chapter

This commit is contained in:
NT 2021-07-22 16:19:19 +02:00
parent b09f34935a
commit 6b938e735f
2 changed files with 1577 additions and 886 deletions


@@ -13,7 +13,7 @@ In vanilla policy gradient methods, the trained neural networks directly select
Value-based methods, such as _Q-Learning_, on the other hand, work by optimizing a state-action value function, the so-called _Q-Function_. The network in this case receives state $s$ and action $a$ and predicts the average cumulative reward resulting from this input for the remainder of the trajectory, i.e. $Q(s,a)$. Actions are chosen to maximize $Q$ given the state.
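For a trained Q-function, this greedy action selection can be written as:

$$
a^*(s) = \text{arg max}_{a} \, Q(s,a) .
$$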
In addition, _actor-critic_ methods combine elements from both approaches. Here, the actions generated by a policy network are rated based on a corresponding change in state potential. These values are given by another neural network and approximate the expected cumulative reward from the given state. _Proximal policy optimization_ (PPO) {cite}`schulman2017proximal` is one example from this class of algorithms and is our choice for the example task of this chapter, which is controlling Burgers' equation as a physical environment.
### Proximal Policy Optimization
@@ -22,38 +22,48 @@ The objective of the actor in actor-critic approaches depends on the output of t
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working with continuous action spaces and has been used extensively in research, for example by [OpenAI](https://openai.com/). This can be attributed to the fact that it tends to achieve good results with stable learning progress while still being comparatively easy to implement. Because of these factors, we will use PPO as well.
More specifically, we will use the algorithm _PPO-clip_. This PPO variant sets a hard limit for the change in behavior caused by singular update steps. We'll denote a policy realized as a neural network with weights $\theta$ as $\pi(a; s,\theta)$, i.e., $\pi$ is conditioned on both a state of the environment $s$, and the weights of the NN.
In addition, $\theta_p$ will denote the weights from a previous state of the policy network. Correspondingly,
$\pi(a; s,\theta_p)$ or in short $\pi(\theta_p)$ denotes a policy evaluation with the previous state of the network.
Formally, this gives the following objective for the actor:
$$\begin{aligned}
\text{arg max}_{\theta} \
\mathbb{E}_{a,s \sim \pi_{\theta_p}}
\Big[ & \text{min} \Big(
\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)} A^{\pi(\theta_p)}(s, a),
\\ &
\ \text{clip}(\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)}, 1-\epsilon, 1+\epsilon)
A^{\pi(\theta_p)}(s, a) \Big) \Big]
\end{aligned}$$
Here, $\epsilon$ defines the bound for the deviation from the previous policy, which the $\text{clip}()$ operation in the equation above enforces. $A^{\pi(\theta)}(s, a)$ denotes the estimate for the advantage $A$ of performing action $a$ in a state $s$, i.e. how well the agent performs compared to a random agent. Hence, the objective above maximizes the expected advantage of the actor for the chosen actions.
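To make the clipped objective a bit more concrete, below is a minimal NumPy sketch of the surrogate term for a batch of samples. The function name and its arguments are purely illustrative for this example; in practice, a library such as stable-baselines3 computes this loss internally.

```python
import numpy as np

def ppo_clip_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """Clipped surrogate objective, averaged over a batch of (s, a) samples."""
    ratio = np.exp(logp_new - logp_old)               # pi(a;s,theta) / pi(a;s,theta_p)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # element-wise minimum of the un-clipped and clipped terms, as in the objective above
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))
```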
Here, the value of $A$ depends on the critic's output, for which we use the so-called _Generalized Advantage Estimation_ (GAE) {cite}`schulman2015high`:
$$\begin{aligned}
A^{\pi(\theta)} &= \sum\limits_{i=0}^{n-t-1}(\gamma\lambda)^i\delta_{t+i} \\
\delta_t &= r_t + \gamma V(s_{t+1} ; \theta_c) - V(s_t ; \theta_c)
\end{aligned}$$
Here, $\theta_c$ represents the parameters of a critic network $V$, $r_t$ describes the reward obtained in time step $t$, and $n$ denotes the total length of the trajectory. $\gamma$ and $\lambda$ are two hyperparameters that control how strongly rewards and state value predictions from the distant future influence the advantage calculation.
They are typically set to values smaller than one.
The $\delta_t$ in the formulation above represents a biased approximation of the advantage $A(s, a) = Q(s, a) - V(s)$, where $Q$ is the state-action value function as defined above and $V$ is the state value as estimated by the critic network. The GAE can then be understood as a discounted cumulative sum of these estimates, from the current timestep until the end of the trajectory.
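As a small illustration, here is a sketch of how the GAE could be computed for a single trajectory with NumPy. The function and its argument conventions are assumptions made for this example, not part of the training code discussed below.

```python
import numpy as np

def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory: rewards r_0..r_{n-1}, critic values V(s_0)..V(s_n)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t from the equation above
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):                # discounted cumulative sum
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```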
The critic $V(s; \theta_c)$ is represented by a second neural network that maps observations to state value estimations. As such, the objective of the critic is to maximize the estimation of the advantage based on the observations it is given:
$$
\text{arg max}_{\theta_c} \mathbb{E}_{s, a \sim \pi_{\theta_p}}[A^{\pi_{\theta_p}}(s, a)]
$$
**TODO, finalize**
### Application to Inverse Problems
Reinforcement learning is widely used for trajectory optimization with multiple decision problems building upon one another. However, in the context of physical systems and PDEs, reinforcement learning algorithms are likewise attractive. In this setting, they can operate in a fashion that's similar to supervised single shooting approaches by generating full trajectories and learning by comparing the final approximation to the target.
However, the approaches differ in terms of how this optimization is performed. For example, reinforcement learning algorithms like PPO try to explore the action space during training by adding a random offset to the actions selected by the actor. This way, the algorithm can discover new behavioral patterns that are more refined than the previous ones.
How long-term effects of generated forces are taken into account can also differ for physical systems. In a control force estimator setup with a differentiable physics (DP) loss, as discussed e.g. in {doc}`diffphys-code-burgers`, these dependencies are handled by passing the loss gradient through the simulation step back into previous time steps. Contrary to that, reinforcement learning usually treats the environment as a black box without gradient information. When using PPO, the value estimator network is instead used to track the long-term dependencies by predicting the influence any action has on the future system evolution.
@@ -63,6 +73,7 @@ $$
\mathbf{u}_{t+1}=\mathcal{P}(\mathbf{u}_t+\pi(\mathbf{u}_t, \mathbf{u}^*, t; \theta)\Delta t)
$$
The $*$ superscript (as usual) denotes a reference or target quantity, and hence here $\mathbf{u}^*$ denotes a velocity target.
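Expressed as code, a single environment step roughly follows the update above. The names `policy` and `physics_step` are placeholders here, standing in for the trained actor and one solver step of the Burgers simulation:

```python
def environment_step(u, u_target, t, dt, policy, physics_step):
    force = policy(u, u_target, t)        # pi(u_t, u*, t; theta)
    return physics_step(u + force * dt)   # u_{t+1} = P(u_t + pi(...) * dt)
```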
The reward is calculated in a similar fashion to the loss in the DP approach: it consists of two parts, one of which amounts to the negative square norm of the applied forces and is given at every time step. The other part adds a punishment proportional to the $L^2$ distance between the final approximation and the target state at the end of each trajectory.
$$\begin{aligned}
@@ -81,7 +92,7 @@ In the following, we'll describe a way to implement a PPO-based RL training for
To train a reinforcement learning agent to control a PDE-governed system, the physical model has to be formalized as an RL environment. The [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) framework, which we use in the following to implement a PPO training, uses a vectorized version of the gym environment. This way, rollout collection can be performed on multiple trajectories in parallel for better resource utilization and wall time efficiency.
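Once such an environment is available, the high-level training call with stable-baselines3 is brief. The snippet below is a minimal sketch, assuming `env` is an instance of the vectorized Burgers environment outlined in the rest of this section, with all hyperparameters left at their library defaults:

```python
from stable_baselines3 import PPO

# train a PPO agent on the (vectorized) Burgers control environment
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```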
Vectorized environments require a definition of observation and action spaces, meaning the in- and output spaces of the agent policy. In our case, the former consists of the current physical states and the goal states, e.g., velocity fields, stacked along their channel dimension. Another channel is added for the elapsed time since the start of the simulation divided by the total trajectory length. The action space (the output) encompasses one force value for each cell of the velocity field.
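For a 1D Burgers setup with `N` grid cells, these spaces could look as follows. The concrete shapes and the unbounded limits are illustrative assumptions, not the exact definitions used later:

```python
import numpy as np
from gym import spaces

N = 32  # number of grid cells (illustrative)
# observation channels: current velocity, goal velocity, normalized elapsed time
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(N, 3), dtype=np.float32)
# action: one force value per cell of the velocity field
action_space = spaces.Box(low=-np.inf, high=np.inf, shape=(N,), dtype=np.float32)
```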
The most relevant methods of vectorized environments are `reset`, `step_async`, `step_wait` and `render`. The first of these is used to start a new trajectory by computing initial and goal states and returning the first observation for each vectorized instance. As these instances in other applications are not bound to finish trajectories synchronously, `reset` has to be called from within the environment itself when entering a terminal state. `step_async` and `step_wait` are the two main parts of the `step` method, which takes actions, applies them to the velocity fields and performs one iteration of the physics models. The split into async and wait enables supporting vectorized environments that run each instance on separate threads. However, this is not required in our approach, as phiflow handles the simulation of batches internally. The `render` method is called to display training results, showing reconstructed trajectories in real time or rendering them to files.
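Putting these pieces together, the following is a schematic, NumPy-only outline of how such a vectorized environment could be organized. It is a sketch under simplified assumptions (the physics update is replaced by a placeholder, and the additional methods required by the stable-baselines3 `VecEnv` interface are omitted), not the implementation used in the accompanying notebook:

```python
import numpy as np

class BurgersVecEnvSketch:
    """Schematic batch environment: all trajectories advance synchronously."""

    def __init__(self, num_envs, num_cells, episode_length):
        self.num_envs, self.num_cells, self.episode_length = num_envs, num_cells, episode_length

    def reset(self):
        # sample new initial and goal states for every trajectory in the batch
        self.step_count = 0
        self.velocity = np.random.randn(self.num_envs, self.num_cells).astype(np.float32)
        self.goal = np.random.randn(self.num_envs, self.num_cells).astype(np.float32)
        return self._observation()

    def step_async(self, actions):
        # store the batch of force fields; the actual work happens in step_wait
        self._actions = np.asarray(actions, dtype=np.float32)

    def step_wait(self):
        # apply the forces and advance the physics for the whole batch at once
        self.velocity += self._actions                   # placeholder for the solver step
        self.step_count += 1
        rewards = -np.sum(self._actions ** 2, axis=-1)   # force penalty at every step
        done = self.step_count >= self.episode_length
        if done:
            # punish the remaining distance to the target and start new trajectories
            rewards -= np.sum((self.velocity - self.goal) ** 2, axis=-1)
            obs = self.reset()
        else:
            obs = self._observation()
        return obs, rewards, np.full(self.num_envs, done), [{} for _ in range(self.num_envs)]

    def _observation(self):
        time_channel = np.full((self.num_envs, self.num_cells),
                               self.step_count / self.episode_length, dtype=np.float32)
        return np.stack([self.velocity, self.goal, time_channel], axis=-1)
```

In the actual setup, the placeholder update of `self.velocity` is replaced by the phiflow Burgers step, and the batch dimension maps directly onto phiflow's internal batch handling mentioned above.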