Deep reinforcement learning, referred to simply as _reinforcement learning_ (RL) from now on, is a class of methods in the larger field of deep learning that lets an artificial intelligence agent explore its interactions with a surrounding environment. While doing so, the agent receives reward signals for its actions and tries to discern which actions contribute to higher rewards, in order to adapt its behavior accordingly. RL has been very successful at playing games such as Go {cite}`silver2017mastering`, and it holds promise for engineering applications such as robotics.
The setup for RL generally consists of two parts: the environment and the agent. The environment receives actions from the agent while supplying it with observations and rewards. The observations represent the fraction of the information from the respective environment state that the agent is able to perceive. The rewards are given by a predefined function, usually tailored to the environment, and might include, e.g., a game score, a penalty for wrong actions, or a bounty for successfully completed tasks.
The agent, on the other hand, contains a neural network that decides on actions given observations. In the learning process, it uses the combined information of states, actions and corresponding rewards to maximize the cumulative reward it receives over each trajectory. To achieve this goal, multiple algorithms have been proposed, which can be roughly divided into two classes: policy gradient and value-based methods {cite}`sutton2018rl`.
In vanilla policy gradient methods, the trained neural networks directly select actions $a$ from environment observations. In the learning process, the probabilities of actions that lead to higher rewards over the remainder of the respective trajectory are increased, while actions with smaller returns are made less likely.
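To make this concrete, here is a minimal sketch of the quantity that drives such an update: the discounted return collected after each step. The array names and values below are placeholders for illustration, not taken from any particular implementation.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards for each step of one trajectory."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# REINFORCE-style surrogate loss: decreasing it raises the probability of
# actions that were followed by high returns. `logp_actions` holds the
# per-step log-probabilities of the chosen actions (dummy values here).
rewards = np.array([0.0, 0.0, 1.0])
logp_actions = np.array([-0.7, -1.1, -0.4])
loss = -np.mean(logp_actions * returns_to_go(rewards))
```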
Value-based methods, such as _Q-Learning_, on the other hand work by optimizing a state-action value function, the so-called Q-function. In this case, the network receives state $s$ and action $a$ and predicts the average cumulative reward resulting from that pair for the remainder of the trajectory. Actions are then chosen to maximize the Q-function for the given state.
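As a small sketch of this idea, assuming a learned callable `q_net(state, action)` that returns the scalar Q-value (an assumption for illustration, not a specific library API), greedy action selection and the Bellman target used to train such a network look as follows:

```python
def greedy_action(q_net, state, actions):
    # Pick the action with the highest predicted cumulative reward.
    return max(actions, key=lambda a: q_net(state, a))

def q_learning_target(q_net, reward, next_state, actions, gamma=0.99):
    # Bellman target: immediate reward plus the discounted best value
    # achievable from the next state; the Q-network is regressed towards it.
    return reward + gamma * max(q_net(next_state, a) for a in actions)
```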
Actor-critic methods combine elements from both approaches. Here, the actions generated by a policy network are rated based on the corresponding change in state value. These values are provided by a second neural network, the critic, and approximate the expected cumulative reward from the given state.
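A sketch of this rating, assuming a critic callable `value_net` that returns the estimated value $V(s)$ of a state (the name is an assumption for illustration):

```python
def td_advantage(value_net, state, reward, next_state, gamma=0.99):
    # One-step estimate of how much better than expected the chosen action was:
    # reward plus discounted value of the resulting state, minus the value of
    # the state the agent started from (the "change in state value").
    return reward + gamma * value_net(next_state) - value_net(state)
```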
However, the fact that the actor's loss function depends on another neural network can create instabilities in the learning process. _PPO-Clip_ is an algorithm that counteracts this problem by setting a bound on the change in action probability that individual samples can motivate. This way, outliers have a limited effect on the agent's behavior [\[Sch+17\]](https://arxiv.org/abs/1707.06347v2). Concretely, the actor parameters $\theta$ are updated by maximizing the expected value of the clipped objective

$$
L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}\, A^{\pi_{\theta_k}}(s, a),\ \operatorname{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_k}}(s, a) \right)
$$

over samples collected with the previous policy $\pi_{\theta_k}$. Here, $\epsilon$ defines the bound for the deviation from the previous policy. $A^{\pi_{\theta_k}}(s, a)$ denotes the estimate for the advantage of $a$ in state $s$, i.e. how much better taking $a$ is expected to be than the average behavior of the current policy. Its value depends on the critic's output [\[Ach18\]](https://spinningup.openai.com/en/latest/):

$$
A^{\pi_{\theta_k}}(s, a) = Q^{\pi_{\theta_k}}(s, a) - V^{\pi_{\theta_k}}(s),
$$

where the state value $V^{\pi_{\theta_k}}$ is predicted by the critic, while the state-action value $Q^{\pi_{\theta_k}}$ is estimated from the rewards sampled along the trajectory.
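The following minimal sketch (a plain NumPy illustration, not a reference implementation; the inputs are assumed to be arrays collected from a batch of sampled transitions) computes this clipped surrogate:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO-Clip (sketch). `logp_new`/`logp_old`
    are log-probabilities of the sampled actions under the updated and the
    previous policy; `advantages` are the corresponding advantage estimates."""
    ratio = np.exp(logp_new - logp_old)             # pi_theta(a|s) / pi_theta_k(a|s)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # bound the per-sample influence
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

Training then maximizes this quantity (in practice by minimizing its negative) with respect to the actor's parameters $\theta$.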
Overall, reinforcement learning is primarily used for trajectory optimization problems in which multiple decisions build upon one another. In the context of PDE control, reinforcement learning algorithms can operate in a fashion quite similar to supervised single-shooting approaches by generating full trajectories and learning by comparing the final approximation to the target.
However, the approaches differ in how this optimization is actually performed. For example, reinforcement learning algorithms like PPO explore the action space during training by adding a random offset to the actions selected by the actor. This way, the algorithm may discover new behavioral patterns that are more refined than the previous ones [\[Ach18\]](https://spinningup.openai.com/en/latest/).
The way long-term effects of the generated forces are taken into account also differs. In a control force estimator setup with a differentiable physics loss, these dependencies are handled by passing the loss gradient through the simulation steps back into earlier time steps. In contrast, reinforcement learning usually treats the environment as a black box that provides no gradient information. When using PPO, the value estimator network is instead used to track the long-term dependencies by predicting the influence any action has on the future evolution of the system.
Working with Burgers' equation, the trajectory generation process interleaves the simulation steps of the environment with the neural network evaluations of the agent: the agent receives an observation of the current state, selects a control force, and the environment advances the PDE by one step with this force applied.
The reward is calculated in a similar fashion as the loss in the DP approach: it consists of two parts, one of which amounts to the negative square norm of the applied forces and is given at every time step. The other part adds a penalty proportional to the L2 distance between the final approximation and the target state at the end of each trajectory.
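A minimal sketch of one such trajectory is given below. `policy` and `sim_step` are assumed stand-ins for the trained actor and the Burgers' solver step, and the observation layout as well as the reward weighting are illustrative assumptions rather than the actual setup.

```python
import numpy as np

def rollout(policy, sim_step, u0, u_target, n_steps):
    """Interleave agent evaluations and PDE steps for one trajectory and
    accumulate the reward described above (sketch with placeholder callables)."""
    u, total_reward = u0.copy(), 0.0
    for _ in range(n_steps):
        obs = np.concatenate([u, u_target])    # observation (assumed: field + target)
        force = policy(obs)                    # network evaluation -> control force
        u = sim_step(u, force)                 # environment: one simulation step
        total_reward -= np.sum(force**2)       # per-step penalty on applied forces
    total_reward -= np.sum((u - u_target)**2)  # terminal penalty: (squared) distance to target
    return total_reward

# Dummy usage with stand-ins (no real physics):
n = 32
ret = rollout(policy=lambda obs: np.zeros(n),
              sim_step=lambda u, f: u + f,
              u0=np.zeros(n), u_target=np.ones(n), n_steps=10)
```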
<!-- While supervised learning in principle relies on predefined training, validation, and test data sets, in reinforcement learning samples are generated during the training process. With Q-Learning methods, this can be done in off-policy fashion, making it possible to reuse samples from earlier points in the training. However, with policy gradient and many actor-critic methods such as PPO this does not work; in these cases only recently obtained samples can be used for updating the policy. This makes stability of the algorithms crucial, as better policies tend to also yield better training samples.
While reinforcement learning in itself represents a learning scheme based on trial and error, the value estimation in actor-critic methods adds a possibility for tracking long-term dependencies and makes it easier for the algorithm to focus on long-term goals. -->