The setup for RL generally consists of two parts: the environment and the agent. The environment receives actions $a$ from the agent while supplying it with observations in the form of states $s$, and rewards $r$. The observations represent the fraction of the information from the respective environment state that the agent is able to perceive. The rewards are given by a predefined function, usually tailored to the environment, and might contain, e.g., a game score, a penalty for wrong actions, or a bounty for successfully finished tasks.

```{figure} resources/rl-overview.jpg
---
height: 200px
name: rl-overview
---
Reinforcement learning is formulated in terms of an environment that gives observations in the form of states and rewards to an agent. The agent interacts with the environment by performing actions.
```
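To make this interaction cycle concrete, here is a minimal Python sketch of the loop from the figure, assuming a hypothetical gym-style environment interface (`reset`, `step`) and a placeholder agent that simply acts at random; all names are illustrative rather than part of a specific library.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for an environment with a gym-style interface."""
    def reset(self):
        self.state = np.zeros(2)
        return self.state
    def step(self, action):
        # environment dynamics: the action nudges the state
        self.state = self.state + action
        reward = -np.sum(self.state ** 2)        # predefined reward function
        done = np.sum(self.state ** 2) < 1e-3    # task finished successfully
        return self.state, reward, done

env, rng = ToyEnv(), np.random.default_rng(0)
state, total_reward = env.reset(), 0.0
for t in range(100):
    action = rng.normal(size=2) * 0.1            # placeholder agent: random actions
    state, reward, done = env.step(action)       # environment returns state and reward
    total_reward += reward
    if done:
        break
```

An RL algorithm replaces the random action choice with a learned policy and uses the collected rewards to improve it.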
In its simplest form, the learning goal for reinforcement learning tasks can be formulated as

$$
\text{arg max}_{\theta} \mathbb{E} \sum_{t=0}^T r_t ,
$$
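As a tiny numerical illustration of this objective, the return of one sampled trajectory is just the sum of its per-step rewards (the reward values below are made up):

```python
import numpy as np

rewards = np.array([0.1, -0.2, 0.5, 1.0])  # r_0 ... r_T from one trajectory
total_return = rewards.sum()               # the quantity whose expectation is maximized w.r.t. theta
print(total_return)                        # 1.4
```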
## Proximal Policy Optimization
As PPO methods follow an actor-critic approach, we need to train two interdependent networks: the actor and the critic.
The objective of the actor inherently depends on the output of the critic network (it provides feedback on which actions are worth performing), and likewise the critic depends on the actions generated by the actor network (these determine which states are explored).

This interdependence can promote instabilities, e.g., as strongly over- or underestimated state values can give wrong impulses during learning. Actions yielding higher rewards often also contribute to reaching states with higher informational value. As a consequence, when the (possibly incorrect) value estimates of individual samples are allowed to affect the agent's behavior without restriction, the learning progress can collapse.

PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working with continuous action spaces. This can be attributed to the fact that it tends to achieve good results with stable learning progress, while still being comparatively easy to implement.
### PPO-Clip
More specifically, we will use the algorithm _PPO-clip_ {cite}`schulman2017proximal`. This PPO variant sets a hard limit for the change in behavior caused by singular update steps. As such, the algorithm uses a previous network state (denoted by a subscript $_p$ below) to limit the change per update step relative to that previous state.
In the following, we will denote the network parameters of the actor network as $\theta$ and those of the critic as $\phi$.
### Actor
The actor computes a policy function returning the probability distribution for the actions, conditioned on the current network parameters $\theta$ and a state $s$.
In the following, we'll denote the probability of choosing a specific action $a$ from this distribution with $\pi(a; s,\theta)$.
As mentioned above, the training procedure computes a certain number of weight updates using policy evaluations with a fixed previous network state $\pi(a;s, \theta_p)$, and at intervals re-initializes the previous weights $\theta_p$ from $\theta$.
To limit the changes, the objective function makes use of a $\text{clip}(a,b,c)$ function, which simply returns $a$ clamped to the interval $[b,c]$.

$\epsilon$ defines the bound for the deviation from the previous policy.
In combination, the objective for the actor is given by the following expression:

$$\begin{aligned}
\text{arg max}_{\theta} \mathbb{E}_{s, a \sim \pi(;s,\theta_p)} \Big[ \text{min} \big(
\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)}
A(s, a; \phi),
\text{clip}(\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)}, 1-\epsilon, 1+\epsilon)
A(s, a; \phi)
\big) \Big]
\end{aligned}$$
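The clipped objective translates almost directly into code. Below is a NumPy sketch that evaluates it for a batch of samples, given the log-probabilities of the chosen actions under the current and previous policy together with their advantage estimates; the function and argument names are illustrative.

```python
import numpy as np

def ppo_clip_objective(logp, logp_prev, advantage, epsilon=0.2):
    # probability ratio pi(a;s,theta) / pi(a;s,theta_p), computed via log-probabilities
    ratio = np.exp(logp - logp_prev)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # element-wise minimum, averaged over the batch; this value is maximized w.r.t. theta
    return np.mean(np.minimum(unclipped, clipped))
```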
As the actor network is trained to provide the expected value of the action distribution, at training time an additional standard deviation is used to sample actions from a Gaussian distribution around this mean.
The standard deviation is decreased over the course of the training, and at inference time we only evaluate the mean (i.e., a distribution with variance 0).
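A minimal sketch of this sampling strategy, assuming the actor outputs the mean of the action distribution and that `std` follows some externally chosen decay schedule (names are illustrative):

```python
import numpy as np

def select_action(mean, std, training=True, rng=np.random.default_rng()):
    if training:
        # explore: sample around the actor's mean with the current standard deviation
        return rng.normal(loc=mean, scale=std)
    # inference: act deterministically with the mean (variance 0)
    return mean
```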
### Critic and Advantage
The critic is represented by a value function $V(s; \phi)$ that predicts the expected cumulative reward to be received from state $s$.
Its objective is to minimize the squared advantage $A$:

% \phi_{k+1} =
$$\begin{aligned}
\text{arg min}_{\phi}\mathbb{E}_{s, a \sim \pi(;s,\theta_p)}[A(s, a; \phi)^2] \ ,
\end{aligned}$$
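Given advantage estimates for a batch of samples, the corresponding critic loss is simply their mean squared value; as a one-line NumPy sketch:

```python
import numpy as np

def critic_loss(advantages):
    # mean squared advantage, to be minimized w.r.t. the critic parameters phi
    return np.mean(advantages ** 2)
```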
where the advantage function $A(s, a; \phi)$ builds upon $V$: its goal is to evaluate the deviation from an average cumulative reward. That is, we're interested in estimating how much the decision made via $\pi(;s,\theta_p)$ improves upon making random decisions (again, evaluated via the unchanging, previous network state $\theta_p$). We use the so-called Generalized Advantage Estimation (GAE) {cite}`schulman2015high` to compute $A$ as:

% A^{\pi_{\theta_p}}_t
$$\begin{aligned}
A(s_t, a_t; \phi) &= \sum\limits_{i=0}^{n-t-1}(\gamma\lambda)^i\delta_{t+i} \\
\delta_t &= r_t+\gamma V(s_{t+1} ; \phi) - V(s_t ; \phi)
\end{aligned}$$
Here $r_t$ describes the reward obtained in time step $t$, while $n$ denotes the total length of the trajectory. $\gamma$ and $\lambda$ are two hyperparameters which control the influence that rewards and state value predictions from the distant future have on the advantage calculation. They are typically set to values smaller than one.
For training stability, this advantage function makes use of the policy $\pi(;s,\theta_p)$, i.e., the policy evaluated with the previous weights $\theta_p$.

The $\delta_t$ in the formulation above represent a biased approximation of the true advantage. Hence
the GAE can be understood as a discounted cumulative sum of these estimates, from the current time step until the end of the trajectory.
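This backwards-accumulated sum is straightforward to implement. The following NumPy sketch assumes `values` holds $V(s_0),\dots,V(s_n)$ and `rewards` holds $r_0,\dots,r_{n-1}$ for a single trajectory:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # TD residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    acc = 0.0
    # discounted cumulative sum of the deltas, from the end of the trajectory backwards
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```

Computing the sum backwards reuses the recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$ and thus avoids evaluating the nested sum explicitly.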
Working with Burgers' equation as the physical environment, the trajectory generation process can be summarized as follows. It shows how the simulation steps of the environment and the neural network evaluations of the agent are interleaved:

$$
\mathbf{u}_{t+1}=\mathcal{P}(\mathbf{u}_t+\pi(\mathbf{u}_t; \mathbf{u}^*, t, \theta)\Delta t)
$$
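In code, this interleaving could look as follows; `physics_step` stands in for the simulator $\mathcal{P}$ and `actor` for the policy $\pi$, both assumed to exist (placeholder names, not a specific API):

```python
def generate_trajectory(u0, u_target, actor, physics_step, n_steps, dt):
    # interleave policy evaluations (forces) with simulation steps of the environment
    u, states, forces = u0, [u0], []
    for t in range(n_steps):
        force = actor(u, u_target, t)       # continuous action: a force field
        u = physics_step(u + force * dt)    # u_{t+1} = P(u_t + pi(...) * dt)
        forces.append(force)
        states.append(u)
    return states, forces
```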
The $*$ superscript (as usual) denotes a reference or target quantity, and hence here $\mathbf{u}^*$ denotes a velocity target. For the continuous action space of the PDE, $\pi$ directly computes an action in terms of a force, rather than probabilities for a discrete set of different actions.

The reward is calculated in a similar fashion to the loss in the DP approach: it consists of two parts, one of which amounts to the negative square norm of the applied forces and is given at every time step. The other part adds a penalty proportional to the $L^2$ distance between the final approximation and the target state at the end of each trajectory.
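A sketch of such a two-part reward; the helper names and the weighting factor of the final term are assumptions for illustration:

```python
import numpy as np

def step_reward(force):
    # per-step contribution: negative squared norm of the applied force
    return -np.sum(force ** 2)

def final_reward(u_final, u_target, weight=1.0):
    # end-of-trajectory contribution: penalty proportional to the L2 distance to the target
    return -weight * np.linalg.norm(u_final - u_target)
```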
$$\begin{aligned}