Deep reinforcement learning, referred to as just _reinforcement learning_ (RL) from now on, is a class of methods in the larger field of deep learning that lets an artificial intelligence agent explore the interactions with a surrounding environment. While doing this, the agent receives reward signals for its actions and tries to discern which actions contribute to higher rewards, to adapt its behavior accordingly. RL has been very successful at playing games such as Go {cite}`silver2017mastering`, and it bears promise for engineering applications such as robotics.
The setup for RL generally consists of two parts: the environment and the agent. The environment receives actions from the agent while supplying it with observations and rewards. The observations represent the fraction of the information from the respective environment state that the agent is able to perceive. The rewards are given by a predefined function, usually tailored to the environment and might contain, e.g., a game score, a penalty for wrong actions or a bounty for successfully finished tasks.
The agent on the other hand encompasses a neural network which decides on actions given observations. In the learning process it uses the combined information of state, action and corresponding rewards to increase the cumulative intensity of reward signals it receives over each trajectory. To achieve this goal, multiple algorithms have been proposed, which can be roughly divided into two classes: _policy gradient_ and _value-based_ methods {cite}`sutton2018rl`.
## Algorithms
In vanilla policy gradient methods, the trained neural networks directly select actions $a$ from environment observations. In the learning process, probabilities for actions leading to higher rewards in the rest of the respective trajectories are increased, while actions with smaller return are made less likely.
Value-based methods, such as _Q-Learning_, on the other hand work by optimizing a state-action value function, the so-called _Q-Function_. The network in this case receives state $s$ and action $a$ to predict the average cumulative reward resulting from this input for the remainder of the trajectory, i.e. $Q(s,a)$. Actions are chosen to maximize $Q$ given the state.
In addition, _actor-critic_ methods combine elements from both approaches. Here, the actions generated by a policy network are rated based on a corresponding change in state potential. These values are given by another neural network and approximate the expected cumulative reward from the given state. _Proximal policy optimization_ (PPO) {cite}`schulman2017proximal` is one example from this class of algorithms and is our choice for the example task of this chapter, which is controlling Burgers' equation as a physical environment.
The objective of the actor in actor-critic approaches depends on the output of the critic network. This fact can promote instabilities, for example as strongly over- or underestimated state values would give wrong impulses during learning. Actions yielding higher rewards often also contribute to reaching states with higher informational value. As a consequence, when the - possibly incorrect - value estimate of individual samples are allowed to affect the agent's behavior unrestrictedly, the learning progress can collapse.
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence individual state value estimates can have on the change of the actors behavior during learning. PPO is a popular choice especially when working on continuous action spaces and has been extensively used in research, for example by [OpenAI](https://openai.com/). This can be attributed to the fact that it tends to achieve good results with a stable learning progress while still being comparatively easy to implement. Because of these factors we will use PPO as well.
More specifically, we will use the algorithm _PPO-clip_. This PPO variant sets a hard limit for the change in behavior caused by singular update steps. We'll denote a policy realized as a neural network with weights $\theta$ as $\pi(a; s,\theta)$, i.e., $\pi$ is conditioned on both a state of the environment $s$, and the weights of the NN.
Here, $\epsilon$ defines the bound for the deviation from the previous policy, which $\text{clip}()$ enforces as a bound in the equation above. $A^{\pi(\theta)}(s, a)$ denotes the estimate for the advantage $A$ of performing action $a$ in a state $s$, i.e. how well the agent performs compared to a random agent. Hence, the objective above maximizes the expectation of the advantage of the actor for a chosen action.
Here, the value of $A$ depends on the critic's output, for which we use the so-called _Generalized Advantage Estimation_ (GAE) {cite}`schulman2015high`:
Here, $\theta_c$ represents the parameters of a critic network $V$, $r_t$ describes the reward obtained in time step t, while n denotes the total length of the trajectory. $\gamma$ and $\lambda$ are two hyperparameters
The $\delta_t$ in the formulation above represents a biased approximation of the advantage $A(s, a) = Q(s, a) - V(s)$, where $Q$ is the state-action value function as defined above and $V$ is the state value as estimated by the critic network. The GAE can then be understood as a discounted cumulative sum of these estimates, from the current timestep until the end of the trajectory.
The critic $V(s; \theta_C)$ is represented by a second neural network that maps observations to state value estimations. As such, the objective of the critic is to maximize the estimation of the advantage based on the observations it is given:
$$
\text{arg max}_{\theta_c} \mathbb{E}_{s, a \sim \pi_{\theta_p}}[A^{\pi_{\theta_p}}(s, a)]
Reinforcement learning is widely used for trajectory optimization with multiple decision problems building upon one another. However, in the context of physical systems and PDEs, reinforcement learning algorithms are likewise attractive. In this setting, they can operate in a fashion that's similar to supervised single shooting approaches by generating full trajectories and learning by comparing the final approximation to the target.
However, the approaches differ in terms of how this optimization is performed. For example, reinforcement learning algorithms like PPO try to explore the action space during training by adding a random offset to the actions selected by the actor. This way, the algorithm can discover new behavioral patterns that are more refined than the previous ones.
The way how long term effects of generated forces are taken into account can also differ for physical systems. In a control force estimator setup with differentiable physics (DP) loss, as discussed e.g. in {doc}`diffphys-code-burgers`, these dependencies are handled by passing the loss gradient through the simulation step back into previous time steps. Contrary to that, reinforcement learning usually treats the environment as a black box without gradient information. When using PPO, the value estimator network is instead used to track the long term dependencies by predicting the influence any action has for the future system evolution.
Working with Burgers' equation as physical environment, the trajectory generation process can be summarized as follows. It shows how the simulation steps of the environment and the neural network evaluations of the agent are interleaved:
The reward is calculated in a similar fashion as the Loss in the DP approach: it consists of two parts, one of which amounts to the negative square norm of the applied forces and is given at every time step. The other part adds a punishment proportional to the $L^2$ distance between the final approximation and the target state at the end of each trajectory.
In the following, we'll describe a way to implement a PPO-based RL training for physical systems. This implementation is also the basis for the notebook of the next section, i.e., {doc}`reinflearn-code`. While this notebook provides a practical example, and an evaluation in comparison to DP training, we'll first give a more generic overview below.
To train a reinforcement learning agent to control a PDE-governed system, the physical model has to be formalized as an RL environment. The [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) framework, which we use in the following to implement a PPO training, uses a vectorized version of the gym environment. This way, rollout collection can be performed on multiple trajectories in parallel for better resource utilization and wall time efficiency.
Vectorized environments require a definition of observation and action spaces, meaning the in- and output spaces of the agent policy. In our case, the former consists of the current physical states and the goal states, e.g., velocity fields, stacked along their channel dimension. Another channel is added for the elapsed time since the start of the simulation divided by the total trajectory length. The action space (the output) encompasses one force value for each cell of the velocity field.
The most relevant methods of vectorized environments are `reset`, `step_async`, `step_wait` and `render`. The first of these is used to start a new trajectory by computing initial and goal states and returning the first observation for each vectorized instance. As these instances in other applications are not bound to finish trajectories synchronously, `reset` has to be called from within the environment itself when entering a terminal state. `step_async` and `step_wait` are the two main parts of the `step` method, which takes actions, applies them to the velocity fields and performs one iteration of the physics models. The split into async and wait enables supporting vectorized environments that run each instance on separate threads. However, this is not required in our approach, as phiflow handles the simulation of batches internally. The `render` method is called to display training results, showing reconstructed trajectories in real time or rendering them to files.
Because of the strongly differing output spaces of the actor and critic networks, we use different architectures for each of them. The network yielding the actions uses the variant of the U-Net architecture proposed by Holl et al. {cite}`holl2019pdecontrol`, in line with the $CFE$ function performing actions there. The other network consists of a series of convolutions with kernel size 3, each followed by a max pooling layer with kernel size 2 and stride 2. After the feature maps have been downsampled to one value in this way, a final fully connected layer merges all channels, generating the predicted state value.
In the example implementation of the next chapter, the `BurgersTraining` class manages all aspects of this training internally, including the setup of agent and environment and storing trained models and monitor logs to disk. It also includes a variant of the Burgers' equation environment described above, which, instead of computing random trajectories, uses data from a predefined set. During training, the agent is evaluated in this environment in regular intervals to be able to compare the training progress to the DP method more accurately.
The next chapter will use this `BurgersTraining` class to run a full PPO scenario, evaluate its performance, and compare it to an approach that uses more domain knowledge of the physical system, i.e., the gradient-based control training with a DP approach.