update RL chapter

This commit is contained in:
NT 2021-06-15 13:54:23 +02:00
parent aa4e3e194f
commit 4c1a8cd7cf
4 changed files with 258 additions and 217 deletions

View File

@ -65,17 +65,6 @@
- part: End Matter
chapters:
- file: outlook.md
# - file: old-phiflow1.md
# sections:
# - file: overview-burgers-forw-v1.ipynb
# - file: overview-ns-forw-v1.ipynb
# - file: physicalloss-code-v2.ipynb
# - file: diffphys-code-gradient-v1.ipynb
# - file: diffphys-code-tf.ipynb
# - file: diffphys-code-ns-v1.ipynb
# - file: diffphys-code-ns-v2a.ipynb
- file: references.md
- file: notation.md

View File

@ -1,4 +0,0 @@
# Old Phiflow1 Examples
Remove sometime...

View File

@ -6,7 +6,11 @@
"id": "Aml7ksJPtCmf"
},
"source": [
"# Controlling Burgers' Equation with Reinforcement Learning"
"# Controlling Burgers' Equation with Reinforcement Learning\n",
"\n",
"In the following, we will target inverse problems with Burgers equation as a testbed for reinforcement learning (RL). The setup is similar to the inverse problems previously targeted with differentiable physics (DP) training (cf. {doc}`diffphys-control`), and hence we'll also directly compare to these approaches below. Similar to before, Burgers equation is simple but non-linear with interesting dynamics, and hence a good starting point for RL experiments. In the following, the goal is to train a control force estimator network that should predict the forces needed to generate smooth transitions between two given states. \n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/reinflearn-code.ipynb)\n",
"\n"
]
},
{
@ -15,13 +19,12 @@
"id": "B87Sa-fMYcOx"
},
"source": [
"In the following, we will target inverse problems with Burgers equation as a testbed for reinforcement learning (RL). The setup is similar to the inverse problems previously targeted with differentiable physics (DP) training (cf. {doc}`diffphys-control`), and hence we'll also directly compare to these approaches below. Similar to before, Burgers equation is simple but non-linear with interesting dynamics, and hence a good starting point for RL experiments. In the following, the goal is to train a control force estimator network that should predict the forces needed to generate smooth transitions between two given states. \n",
"\n",
"## Overview\n",
"\n",
"Reinforcement learning describes an agent perceiving an environment and taking actions inside it. It aims at maximizing an accumulated sum of rewards, which it receives for those actions by the environment. Thus, the agent learns empirically which actions to take in different situations. _Proximal policy optimization_ [PPO](https://arxiv.org/abs/1707.06347v2) is a widely used reinforcement learning algorithm describing two neural networks: a policy model selecting actions for given observations and a value estimator network rating the reward potential of those states. These value estimates form the loss of the policy model, given by the change in reward potential by the chosen action.\n",
"\n",
"This notebook illustrates how PPO reinforcement learning can be applied to the described control problem of Burgers' equation. In comparison to the DP approach, the RL method does not have access to a differentiable physics solver, it is _model-free_. For the RL setup, this effectively means that we're able to pass gradients through the environment function, which is not always a given. \n",
"This notebook illustrates how PPO reinforcement learning can be applied to the described control problem of Burgers' equation. In comparison to the DP approach, the RL method does not have access to a differentiable physics solver, it is _model-free_. \n",
"\n",
"However, the goal of the value estimator model is to compensate for this lack of a solver, as it tries to capture the long term effect of individual actions. Thus, an interesting question the following code example should answer is: can the model-free PPO reinforcement learning match the performance of the model-based DP training. We will compare this in terms of learning speed and the amount of required forces.\n"
]
@ -34,9 +37,10 @@
"source": [
"## Software installation\n",
"\n",
"This example uses the reinforcement learning framework [stable_baselines3](https://github.com/DLR-RM/stable-baselines3) and version 1.5.1 of the differentiable PDE solver [Φ<sub>Flow</sub>](https://github.com/tum-pbs/PhiFlow). [PPO](https://arxiv.org/abs/1707.06347v2) was chosen as reinforcement learning algorithm.\n",
"This example uses the reinforcement learning framework [stable_baselines3](https://github.com/DLR-RM/stable-baselines3) with [PPO](https://arxiv.org/abs/1707.06347v2) as reinforcement learning algorithm.\n",
"For the simulation, version 1.5.1 of the differentiable PDE solver [Φ<sub>Flow</sub>](https://github.com/tum-pbs/PhiFlow) is used. \n",
"\n",
"Additionally, a supervised control force estimator is trained as a performance baseline. This method was introduced by Holl et al. [\\(2020\\)](https://ge.in.tum.de/publications/2020-iclr-holl/)."
"After the RL training is completed, we'll additionally compare to a differentiable physics approach using a \"control force estimator\" (CFE) network from {doc}`diffphys-control` (as introduced by {cite}`holl2019pdecontrol`)."
]
},
{
@ -93,7 +97,7 @@
"id": "wSELidjsvRyd"
},
"source": [
"At first we generate a dataset to train the CFE model on and to evaluate the performance of both approaches during and after training. The code below simulates 1000 cases (i.e. phiflow \"scenes\"), and keeps 100 of them as validation and test cases, respectively. The remaining 800 are used for training."
"At first we generate a dataset to train the differentiable physics model on and to evaluate the performance of both approaches during and after training. The code below simulates 1000 cases (i.e. phiflow \"scenes\"), and keeps 100 of them as validation and test cases, respectively. The remaining 800 are used for training."
]
},
{
@ -104,19 +108,19 @@
},
"outputs": [],
"source": [
"domain = Domain([32], box=box[0:1]) # Size and shape of the fields\n",
"viscosity = 0.003\n",
"step_count = 32 # Trajectory length\n",
"dt = 0.03\n",
"diffusion_substeps = 1\n",
"DOMAIN = Domain([32], box=box[0:1]) # Size and shape of the fields\n",
"VISCOSITY = 0.003\n",
"STEP_COUNT = 32 # Trajectory length\n",
"DT = 0.03\n",
"DIFFUSION_SUBSTEPS = 1\n",
"\n",
"data_path = 'forced-burgers-clash'\n",
"scene_count = 1000\n",
"batch_size = 100\n",
"DATA_PATH = 'forced-burgers-clash'\n",
"SCENE_COUNT = 1000\n",
"BATCH_SIZE = 100\n",
"\n",
"train_range = range(200, 1000)\n",
"val_range = range(100, 200)\n",
"test_range = range(0, 100)"
"TRAIN_RANGE = range(200, 1000)\n",
"VAL_RANGE = range(100, 200)\n",
"TEST_RANGE = range(0, 100)"
]
},
{
@ -127,22 +131,22 @@
},
"outputs": [],
"source": [
"for batch_index in range(scene_count // batch_size):\n",
" scene = Scene.create(data_path, count=batch_size)\n",
"for batch_index in range(SCENE_COUNT // BATCH_SIZE):\n",
" scene = Scene.create(DATA_PATH, count=BATCH_SIZE)\n",
" print(scene)\n",
" world = World()\n",
" u0 = BurgersVelocity(\n",
" domain, \n",
" velocity=GaussianClash(batch_size), \n",
" viscosity=viscosity, \n",
" batch_size=batch_size, \n",
" DOMAIN, \n",
" velocity=GaussianClash(BATCH_SIZE), \n",
" viscosity=VISCOSITY, \n",
" batch_size=BATCH_SIZE, \n",
" name='burgers'\n",
" )\n",
" u = world.add(u0, physics=Burgers(diffusion_substeps=diffusion_substeps))\n",
" force = world.add(FieldEffect(GaussianForce(batch_size), ['velocity']))\n",
" u = world.add(u0, physics=Burgers(diffusion_substeps=DIFFUSION_SUBSTEPS))\n",
" force = world.add(FieldEffect(GaussianForce(BATCH_SIZE), ['velocity']))\n",
" scene.write(world.state, frame=0)\n",
" for frame in range(1, step_count + 1):\n",
" world.step(dt=dt)\n",
" for frame in range(1, STEP_COUNT + 1):\n",
" world.step(dt=DT)\n",
" scene.write(world.state, frame=frame)"
]
},
@ -154,9 +158,7 @@
"source": [
"## Reinforcement Learning Training\n",
"\n",
"Next we set up the RL environment.\n",
"\n",
"The reinforcement learning approach uses a dedicated value estimator network (the \"critic\") to predict the sum of rewards generated from a certain state. These are then used to update a policy network (the \"actor\") which, analogously to the control force estimator network of {doc}`diffphys-control` and the next section below, predicts the forces to control the simulation. "
"Next we set up the RL environment. The PPO approach uses a dedicated value estimator network (the \"critic\") to predict the sum of rewards generated from a certain state. These predicted rewards are then used to update a policy network (the \"actor\") which, analogously to the CFE network of {doc}`diffphys-control`, predicts the forces to control the simulation."
]
},
{
@ -169,12 +171,12 @@
"source": [
"from experiment import BurgersTraining\n",
"\n",
"n_envs = 10 # On how many environments to train in parallel, load balancing\n",
"final_reward_factor = step_count # Penalty for not reaching the goal state\n",
"steps_per_rollout = step_count * 10 # How many steps to collect per environment between agent updates\n",
"n_epochs = 10 # How many epochs to perform during each agent update\n",
"rl_learning_rate = 1e-4 # Learning rate for agent updates\n",
"rl_batch_size = 128 # Batch size for agent updates"
"N_ENVS = 10 # On how many environments to train in parallel, load balancing\n",
"FINAL_REWARD_FACTOR = STEP_COUNT # Penalty for not reaching the goal state\n",
"STEPS_PER_ROLLOUT = STEP_COUNT * 10 # How many steps to collect per environment between agent updates\n",
"N_EPOCHS = 10 # How many epochs to perform during each agent update\n",
"RL_LEARNING_RATE = 1e-4 # Learning rate for agent updates\n",
"RL_BATCH_SIZE = 128 # Batch size for agent updates"
]
},
{
@ -183,11 +185,29 @@
"id": "U4FKqSjwv9jR"
},
"source": [
"To start training, we create a trainer object which manages the environment and the agent internally. Additionally, a directory for storing models, logs, and hyperparameters is created. This way, training can be continued at any later point using the same configuration. If the model folder specified in exp_name already exists, the agent within is loaded. Otherwise, a new agent is created.\n",
"To start training, we create a trainer object which manages the environment and the agent internally. Additionally, a directory for storing models, logs, and hyperparameters is created. This way, training can be continued at any later point using the same configuration. If the model folder specified in `exp_name` already exists, the agent within is loaded; otherwise, a new agent is created. For the PPO reinforcement learning algorithm, the implementation of `stable_baselines3` is used. The trainer class acts as a wrapper for this system. Under the hood, an instance of a `BurgersEnv` gym environment is created, which is loaded into the PPO algorithm. It generates random initial states and precomputes corresponding ground truth simulations and handles the system evolution influenced by the agent's actions. Furthermore, the trainer regularly evaluates the performance on the validation set by loading a different environment that uses the initial and target states of the validation set.\n",
"\n",
"By default, an agent is stored at `PDE-Control-RL/networks/rl-models/bench`, and loaded if it exists. To generate a new model, replace the specified path with another.\n",
"### Gym Environment \n",
"\n",
"**TODO, explain PPO setup and environment**\n"
"The gym environment specification provides an interface leveraging the interaction with the agent. Environments implementing it must specify observation and action spaces, which represent the in- and output spaces of the agent. Further, they have to define a set of methods, the most important ones being `reset`, `step`, and `render`. \n",
"\n",
"* `reset` is called after a trajectory has ended, to revert the environment to an initial state, and returns the corresponding observation. \n",
"* `step` takes an action given by the agent and iterates the environment to the next state. It returns the resulting observation, the received reward, a flag determining whether a terminal state has been reached and a dictionary for debugging and logging information. \n",
"* `render` is called to display the current environment state in a way the creator of the environment specifies. This function can be used to inspect the training results.\n",
"\n",
"`stable-baselines3` expands on the default gym environment by providing an interface for vectorized environments. This makes it possible to compute the forward pass for multiple trajectories simultaneously which can in turn increase time efficiency because of better resource utilization. In practice, this means that the methods now work on vectors of observations, actions, rewards, terminal state flags and info dictionarys. The step method is split into `step_async` and `step_wait`, making it possible to run individual instances of the environment on different threads.\n",
"\n",
"### Physics Simulation \n",
"\n",
"The environment for Burgers' equation contains a `Burgers` physics object provided by `phiflow`. The states are internally stored as `BurgersVelocity` objects. To create the initial states, the environment generates batches of random fields in the same fashion as in the data set generation process shown above. The observation space consists of the velocity fields of the current and target states stacked in the channel dimension with another channel specifying the current time step. Actions are taken in the form of a one dimensional array covering every velocity value. The `step` method calls the physics object to advance the internal state by one time step, also applying the actions as a `FieldEffect`.\n",
"\n",
"The rewards encompass a penalty equal to the square norm of the generated forces at every timestep. Additionally, the $L^2$ distance to the target field, scaled by a predefined factor (`FINAL_REWARD_FACTOR`) is subtracted at the end of each trajectory. The rewards are then normalized with a running estimate for the reward mean and standard deviation.\n",
"\n",
"### Neural Network\n",
"\n",
"We use two different neural network architectures for the actor and critic respectively. The former uses the U-Net variant from {cite}`holl2019pdecontrol`, while the latter consists of a series of 1D convolutional and pooling layers reducing the feature map size to one. The final operation is a convolution with kernel size one to combine the feature maps and retain one output value. The `CustomActorCriticPolicy` class then makes it possible to use these two separate network architectures for the reinforcement learning agent.\n",
"\n",
"By default, an agent is stored at `PDE-Control-RL/networks/rl-models/bench`, and loaded if it exists. (If necessary, replace the specified path with another to generate a new model.)"
]
},
{
@ -200,20 +220,20 @@
"source": [
"rl_trainer = BurgersTraining(\n",
" path='PDE-Control-RL/networks/rl-models/bench', # Replace this to train a new model\n",
" domain=domain,\n",
" viscosity=viscosity,\n",
" step_count=step_count,\n",
" dt=dt,\n",
" diffusion_substeps=diffusion_substeps,\n",
" n_envs=n_envs,\n",
" final_reward_factor=final_reward_factor,\n",
" steps_per_rollout=steps_per_rollout,\n",
" n_epochs=n_epochs,\n",
" learning_rate=rl_learning_rate,\n",
" batch_size=rl_batch_size,\n",
" data_path=data_path,\n",
" val_range=val_range,\n",
" test_range=test_range,\n",
" domain=DOMAIN,\n",
" viscosity=VISCOSITY,\n",
" step_count=STEP_COUNT,\n",
" dt=DT,\n",
" diffusion_substeps=DIFFUSION_SUBSTEPS,\n",
" n_envs=N_ENVS,\n",
" final_reward_factor=FINAL_REWARD_FACTOR,\n",
" steps_per_rollout=STEPS_PER_ROLLOUT,\n",
" n_epochs=N_EPOCHS,\n",
" learning_rate=RL_LEARNING_RATE,\n",
" batch_size=RL_BATCH_SIZE,\n",
" data_path=DATA_PATH,\n",
" val_range=VAL_RANGE,\n",
" test_range=TEST_RANGE,\n",
")"
]
},
@ -223,7 +243,9 @@
"id": "skE_zAdGwkM2"
},
"source": [
"The following cell opens tensorboard inside the notebook to display the progress of the training. If a new model was created at a different location, please change the path to the location at which you stored your model."
"The following cell is optional but very useful for debugging: it opens _tensorboard_ inside the notebook to display the progress of the training. If a new model was created at a different location, please change the path to the location at which you stored it. When resuming the learning process of a pre-trained agent, the new run is shown separately in tensorboard.\n",
"\n",
"The graph titled \"forces\" shows how the overall amount of forces generated by the network evolves during training. \"rew_unnormalized\" shows the raw reward values without the normalization step described above. The corresponding values with normalization are shown under \"rollout/ep_rew_mean\". \"val_set_forces\" outlines the performance of the agent on the validation set."
]
},
{
@ -266,9 +288,7 @@
"source": [
"## RL Evaluation\n",
"\n",
"Let us take a glance what the results look like. \n",
"\n",
"**TODO, explain: source and target, show unmodified evolution, in comparison to controlled one**\n"
"Now that we have a trainned model, let's take a look at the results. The leftmost plot shows the results of the reinforcement learning agent. As reference, next to it are shown the ground truth, i.e. the trajectory the agent should reconstruct, and the uncontrolled simulation where the system follows its natural evolution."
]
},
{
@ -279,23 +299,35 @@
},
"outputs": [],
"source": [
"rl_frames, _, _ = rl_trainer.infer_test_set_frames()\n",
"rl_frames, gt_frames, unc_frames = rl_trainer.infer_test_set_frames()\n",
"\n",
"index_in_set = 0 # Change this to display a reconstruction of another scene\n",
"\n",
"bplt.burgers_figure('Reinforcement Learning')\n",
"for frame in range(0, step_count + 1):\n",
" plt.plot(rl_frames[frame][index_in_set,:], color=bplt.gradient_color(frame, step_count+1), linewidth=0.8)"
"fig, axs = plt.subplots(1, 3, figsize=(18.9, 9.6))\n",
"\n",
"axs[0].set_title(\"Reinforcement Learning\")\n",
"axs[1].set_title(\"Ground Truth\")\n",
"axs[2].set_title(\"Uncontrolled\")\n",
"\n",
"for plot in axs:\n",
" plot.set_ylim(-2, 2)\n",
" plot.set_xlabel('x')\n",
" plot.set_ylabel('u(x)')\n",
"\n",
"for frame in range(0, STEP_COUNT + 1):\n",
" frame_color = bplt.gradient_color(frame, STEP_COUNT+1);\n",
" axs[0].plot(rl_frames[frame][index_in_set,:], color=frame_color, linewidth=0.8)\n",
" axs[1].plot(gt_frames[frame][index_in_set,:], color=frame_color, linewidth=0.8)\n",
" axs[2].plot(unc_frames[frame][index_in_set,:], color=frame_color, linewidth=0.8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "guKuBtm4xt-U"
},
"source": [
"\n",
"**TODO, what do we see? briefly discuss: seems to work quite well already**\n",
"\n",
"\n"
"As we can see, a trained reinforcement learning agent is able to reconstruct the trajectories fairly well. However, they still appear noticeably less smooth than the ground truth."
]
},
{
@ -306,9 +338,9 @@
"source": [
"## Differentiable Physics Training\n",
"\n",
"To classify the results of the reinforcement learning method, we now compare them to an approach using differentiable physics training. In contrast to the full approach from {doc}`diffphys-control` which includes a second _OP_ network, we aim for a direct control here. The OP network represents a separate \"physics-predictor\", which is omitted here for fairness with the RL version.\n",
"To classify the results of the reinforcement learning method, we now compare them to an approach using differentiable physics training. In contrast to the full approach from {doc}`diffphys-control` which includes a second _OP_ network, we aim for a direct control here. The OP network represents a separate \"physics-predictor\", which is omitted here for fairness when comparing with the RL version.\n",
"\n",
"The DP approach has access to the gradient data provided by the differentiable solver, making it possible to trace the loss over multiple timesteps and enabling the model to comprehend long term effects of generated forces better. The reinforcement learning algorithm, on the other hand, is not limited by training set size like the CFE approach, as new training samples are generated on policy. However, this also introduces additional simulation overhead during training, which can increase the time needed for convergence. "
"The DP approach has access to the gradient data provided by the differentiable solver, making it possible to trace the loss over multiple timesteps and enabling the model to comprehend long term effects of generated forces better. The reinforcement learning algorithm, on the other hand, is not limited by training set size like the DP algorithm, as new training samples are generated on policy. However, this also introduces additional simulation overhead during training, which can increase the time needed for convergence. "
]
},
{
@ -342,12 +374,12 @@
},
"outputs": [],
"source": [
"cfe_app = ControlTraining(\n",
" step_count,\n",
" BurgersPDE(domain, viscosity, dt),\n",
" datapath=data_path,\n",
" val_range=val_range,\n",
" train_range=train_range,\n",
"dp_app = ControlTraining(\n",
" STEP_COUNT,\n",
" BurgersPDE(DOMAIN, VISCOSITY, DT),\n",
" datapath=DATA_PATH,\n",
" val_range=VAL_RANGE,\n",
" train_range=TRAIN_RANGE,\n",
" trace_to_channel=lambda trace: 'burgers_velocity',\n",
" obs_loss_frames=[],\n",
" trainable_networks=['CFE'],\n",
@ -356,7 +388,7 @@
" view_size=20,\n",
" learning_rate=1e-3,\n",
" learning_rate_half_life=1000,\n",
" dt=dt\n",
" dt=DT\n",
").prepare()"
]
},
@ -366,7 +398,7 @@
"id": "3ReXUkzI1L3t"
},
"source": [
"Now we can execute the model training. This cell might take long to execute, depending on the number of iterations (ca. 1.8h for 1000 iterations)."
"Now we can execute the model training. This cell typically also takes a while to execute (ca. 1.8h for 1000 iterations)."
]
},
{
@ -377,19 +409,19 @@
},
"outputs": [],
"source": [
"cfe_training_eval_data = []\n",
"dp_training_eval_data = []\n",
"\n",
"start_time = time.time()\n",
"\n",
"cfe_training_iterations = 2000 # Change this to change training duration\n",
"dp_training_iterations = 2000 # Change this to change training duration\n",
"\n",
"for epoch in range(cfe_training_iterations):\n",
" cfe_app.progress()\n",
"for epoch in range(dp_training_iterations):\n",
" dp_app.progress()\n",
" # Evaluate validation set at regular intervals to track learning progress\n",
" # Size of intervals determined by RL epoch count per iteration for accurate comparison\n",
" if epoch % n_epochs == 0:\n",
" f = cfe_app.infer_scalars(val_range)['Total Force'] / dt\n",
" cfe_training_eval_data.append((time.time() - start_time, epoch, f))"
" if epoch % N_EPOCHS == 0:\n",
" f = dp_app.infer_scalars(VAL_RANGE)['Total Force'] / DT\n",
" dp_training_eval_data.append((time.time() - start_time, epoch, f))"
]
},
{
@ -398,7 +430,7 @@
"id": "31B72FBR1pXr"
},
"source": [
"We store the trained model and the validation performance with respect to iterations and wall time.\n",
"The trained model and the validation performance `val_forces.csv` with respect to iterations and wall time are saved on disk:\n",
"\n"
]
},
@ -410,19 +442,19 @@
},
"outputs": [],
"source": [
"cfe_store_path = 'networks/cfe-models/bench'\n",
"if not os.path.exists(cfe_store_path):\n",
" os.makedirs(cfe_store_path)\n",
"dp_store_path = 'networks/dp-models/bench'\n",
"if not os.path.exists(dp_store_path):\n",
" os.makedirs(dp_store_path)\n",
"\n",
"# store training progress information\n",
"with open(os.path.join(cfe_store_path, 'val_forces.csv'), 'at') as log_file:\n",
"with open(os.path.join(dp_store_path, 'val_forces.csv'), 'at') as log_file:\n",
" logger = csv.DictWriter(log_file, ('time', 'epoch', 'forces'))\n",
" logger.writeheader()\n",
" for (t, e, f) in cfe_training_eval_data:\n",
" for (t, e, f) in dp_training_eval_data:\n",
" logger.writerow({'time': t, 'epoch': e, 'forces': f})\n",
"\n",
"cfe_checkpoint = cfe_app.save_model()\n",
"shutil.move(cfe_checkpoint, cfe_store_path)"
"dp_checkpoint = dp_app.save_model()\n",
"shutil.move(dp_checkpoint, dp_store_path)"
]
},
{
@ -431,8 +463,7 @@
"id": "r4r6sOh87B-1"
},
"source": [
"Alternatively, run the cell below to load an existing network model.\n",
"\n"
"Alternatively, uncomment the code in the cell below to load an existing network model.\n"
]
},
{
@ -443,10 +474,10 @@
},
"outputs": [],
"source": [
"cfe_path = 'PDE-Control-RL/networks/cfe-models/bench/checkpoint_00020000/'\n",
"networks_to_load = ['OP2', 'OP4', 'OP8', 'OP16', 'OP32']\n",
"#dp_path = 'PDE-Control-RL/networks/dp-models/bench/checkpoint_00020000/'\n",
"#networks_to_load = ['OP2', 'OP4', 'OP8', 'OP16', 'OP32']\n",
"\n",
"cfe_app.load_checkpoints({net: cfe_path for net in networks_to_load})"
"#dp_app.load_checkpoints({net: dp_path for net in networks_to_load})"
]
},
{
@ -455,9 +486,7 @@
"id": "V8inOSSE0OMf"
},
"source": [
"The next cell plots an example to show visually how well the DP-based model does. With this, we have an RL and a DP version, which we can compare in more detail in the next section.\n",
"\n",
"**TODO, like for RL above , show unmodified and controlled**\n"
"Similar to the RL version, the next cell plots an example to visually show how well the DP-based model does. The leftmost plot again shows the learned results, this time of the DP-based model. Like above, the other two show the ground truth and the natural evolution. "
]
},
{
@ -468,13 +497,41 @@
},
"outputs": [],
"source": [
"cfe_frames = cfe_app.infer_all_frames(test_range)\n",
"dp_frames = dp_app.infer_all_frames(TEST_RANGE)\n",
"dp_frames = [s.burgers.velocity.data for s in dp_frames]\n",
"_, gt_frames, unc_frames = rl_trainer.infer_test_set_frames()\n",
"\n",
"index_in_set = 1 # Change this to display a reconstruction of another scene\n",
"index_in_set = 0 # Change this to display a reconstruction of another scene\n",
"\n",
"bplt.burgers_figure('Supervised Control Force Estimator')\n",
"for frame in range(0, step_count + 1):\n",
" plt.plot(cfe_frames[frame].burgers.velocity.data[index_in_set,:,0], color=bplt.gradient_color(frame, step_count+1), linewidth=0.8)"
"fig, axs = plt.subplots(1, 3, figsize=(18.9, 9.6))\n",
"\n",
"axs[0].set_title(\"Differentiable Physics\")\n",
"axs[1].set_title(\"Ground Truth\")\n",
"axs[2].set_title(\"Uncontrolled\")\n",
"\n",
"for plot in axs:\n",
" plot.set_ylim(-2, 2)\n",
" plot.set_xlabel('x')\n",
" plot.set_ylabel('u(x)')\n",
"\n",
"for frame in range(0, STEP_COUNT + 1):\n",
" frame_color = bplt.gradient_color(frame, STEP_COUNT+1)\n",
" axs[0].plot(dp_frames[frame][index_in_set,:], color=frame_color, linewidth=0.8)\n",
" axs[1].plot(gt_frames[frame][index_in_set,:], color=frame_color, linewidth=0.8)\n",
" axs[2].plot(unc_frames[frame][index_in_set,:], color=frame_color, linewidth=0.8)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HTNaM8_C2KPf"
},
"source": [
"The trained DP model also reconstructs the original trajectories closely. Furthermore, the generated results seem less noisy than using the RL agent.\n",
"\n",
"With this, we have an RL and a DP version, which we can compare in more detail in the next section.\n",
"\n",
"---\n"
]
},
{
@ -485,7 +542,7 @@
"source": [
"## Comparison between RL and DP\n",
"\n",
"Next, the results of both methods are compared in terms of visual quality of the resulting trajectories as well as quantitatively via the amounf of generated forces. The latter provides insights about the performance of either approaches as both methods aspire to minimize this metric during training, and the task is trivially solved with by applying a huge force. Rather, an ideal solution takes into account the dynamics of the PDE to apply as little forces as possible. Hence, this metric is a very good one to measure how well the network has learned about the underlying physical environment (Burgers equation in this example).\n",
"Next, the results of both methods are compared in terms of visual quality of the resulting trajectories as well as quantitatively via the amounf of generated forces. The latter provides insights about the performance of either approaches as both methods aspire to minimize this metric during training. This is also important as the task is trivially solved with by applying a huge force at the last time step. Rather, an ideal solution takes into account the dynamics of the PDE to apply as little forces as possible. Hence, this metric is a very good one to measure how well the network has learned about the underlying physical environment (Burgers equation in this example).\n",
"\n",
"\n"
]
@ -524,15 +581,15 @@
"source": [
"rl_frames, gt_frames, unc_frames = rl_trainer.infer_test_set_frames()\n",
"\n",
"cfe_frames = cfe_app.infer_all_frames(test_range)\n",
"cfe_frames = [s.burgers.velocity.data for s in cfe_frames]\n",
"dp_frames = dp_app.infer_all_frames(TEST_RANGE)\n",
"dp_frames = [s.burgers.velocity.data for s in dp_frames]\n",
"\n",
"frames = {\n",
" (0, 0): ('Ground Truth', gt_frames),\n",
" (0, 1): ('Uncontrolled', unc_frames),\n",
" (1, 0): ('Reinforcement Learning', rl_frames),\n",
" (1, 1): ('Supervised Control Force Estimator', cfe_frames),\n",
"}\n"
" (1, 1): ('Differentiable Physics', dp_frames),\n",
"}"
]
},
{
@ -552,21 +609,15 @@
" axs[xy].set_title(title)\n",
"\n",
" label = 'Initial state in dark red, final state in dark blue'\n",
"\n",
" for step_idx in range(0, step_count + 1):\n",
" color = bplt.gradient_color(step_idx, step_count+1)\n",
" for step_idx in range(0, STEP_COUNT + 1):\n",
" color = bplt.gradient_color(step_idx, STEP_COUNT+1)\n",
" axs[xy].plot(\n",
" field[step_idx][index_in_set].squeeze(), \n",
" color=color, \n",
" linewidth=0.8, \n",
" label=label\n",
" )\n",
" field[step_idx][index_in_set].squeeze(), color=color, linewidth=0.8, label=label)\n",
" label = None\n",
"\n",
" axs[xy].legend()\n",
"\n",
"fig, axs = plt.subplots(2, 2, figsize=(12.8, 9.6))\n",
"\n",
"for xy in frames:\n",
" plot(axs, xy, *frames[xy])\n",
" "
@ -574,9 +625,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "mMk9AvC5xt-X"
},
"source": [
"**TODO discuss, what do we see?**"
"This diagram connects the two plots shown above after each training. Here we again see that the differentiable physics approach seems to generate less noisy trajectories than the RL agent, while both manage to approximate the ground truth."
]
},
{
@ -585,9 +638,9 @@
"id": "ZsksKs4e4QJA"
},
"source": [
"### Forces Comparison\n",
"### Comparison of Exerted Forces\n",
"\n",
"Next, we compute the forces the approaches have generated for the test set trajectories.\n"
"Next, we compute the forces the approaches have generated and applied for the test set trajectories."
]
},
{
@ -599,22 +652,47 @@
"outputs": [],
"source": [
"gt_forces = utils.infer_forces_sum_from_frames(\n",
" gt_frames, domain, diffusion_substeps, viscosity, dt\n",
" gt_frames, DOMAIN, DIFFUSION_SUBSTEPS, VISCOSITY, DT\n",
")\n",
"cfe_forces = utils.infer_forces_sum_from_frames(\n",
" cfe_frames, domain, diffusion_substeps, viscosity, dt\n",
"dp_forces = utils.infer_forces_sum_from_frames(\n",
" dp_frames, DOMAIN, DIFFUSION_SUBSTEPS, VISCOSITY, DT\n",
")\n",
"rl_forces = rl_trainer.infer_test_set_forces()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FF7IwLYSxt-X"
},
"source": [
"At first, we will compare the total sum of the forces that are generated by the RL and DP approaches and compare them to the ground truth."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3HIoOZcUxt-Y"
},
"outputs": [],
"source": [
"plt.figure(figsize=(12.8, 9.6))\n",
"plt.bar(\n",
" [\"Reinforcement Learning\", \"Differentiable Physics\", \"Ground Truth\"], \n",
" [np.sum(rl_forces), np.sum(dp_forces), np.sum(gt_forces)], \n",
" color = [\"#0065bd\", \"#e37222\", \"#a2ad00\"],\n",
" align='center', label='Absolute forces comparison' )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B75xKFuw4414"
},
"source": [
"\n",
"**TODO compute the sum of forces for all test scenes for RL and DP, compare to ground truth values?**\n",
"As visualized in these bar plots, the DP approach learns to apply slightly lower forces than the RL model.\n",
"**TODO, also plot remaining L2 error**\n",
"\n",
"In the following, the forces generated by both methods are also visually compared to the ground truth of the respective sample. Dots placed above the blue line denote stronger forces in the analyzed deep learning approach than in the ground truth and vice versa.\n"
]
@ -628,15 +706,13 @@
"outputs": [],
"source": [
"plt.figure(figsize=(12.8, 9.6))\n",
"plt.scatter(gt_forces, cfe_forces, label='CFE')\n",
"plt.scatter(gt_forces, rl_forces, label='RL')\n",
"plt.plot([x * 100 for x in range(15)], [x * 100 for x in range(15)], label='Same forces as original')\n",
"plt.scatter(gt_forces, rl_forces, color=\"#0065bd\", label='RL')\n",
"plt.scatter(gt_forces, dp_forces, color=\"#e37222\", label='DP')\n",
"plt.plot([x * 100 for x in range(15)], [x * 100 for x in range(15)], color=\"#a2ad00\", label='Same forces as original')\n",
"plt.xlabel('ground truth')\n",
"plt.xlim(0, 1500)\n",
"plt.ylim(0, 1500)\n",
"plt.xlim(0, 1500); plt.ylim(0, 1500)\n",
"plt.ylabel('reconstruction')\n",
"plt.grid()\n",
"plt.legend()"
"plt.grid(); plt.legend()"
]
},
{
@ -657,13 +733,10 @@
"outputs": [],
"source": [
"plt.figure(figsize=(12.8, 9.6))\n",
"plt.scatter(rl_forces, cfe_forces)\n",
"plt.xlabel('Reinforcement Learning')\n",
"plt.ylabel('Control Force Estimator')\n",
"plt.plot([x * 100 for x in range(15)], [x * 100 for x in range(15)], label='Same forces cfe rl')\n",
"plt.xlim(0, 1500)\n",
"plt.ylim(0, 1500)\n",
"plt.grid()\n",
"plt.scatter(rl_forces, dp_forces, color=\"#0065bd\")\n",
"plt.xlabel('Reinforcement Learning'); plt.ylabel('Differentiable Physics')\n",
"plt.plot([x * 100 for x in range(15)], [x * 100 for x in range(15)], color=\"#e37222\", label='Same forces DP RL')\n",
"plt.xlim(0, 1500); plt.ylim(0, 1500); plt.grid()\n",
"plt.legend()"
]
},
@ -673,7 +746,7 @@
"id": "7JqW7Cca6HUJ"
},
"source": [
"The following plot displays the performance of all reinforcement learning, control force estimator and ground truth with respect to individual samples."
"The following plot displays the performance of all reinforcement learning, differentiable physics and ground truth with respect to individual samples."
]
},
{
@ -684,53 +757,25 @@
},
"outputs": [],
"source": [
"w=0.25\n",
"plot_count=20\n",
"w=0.25; plot_count=20 # How many scenes to show\n",
"plt.figure(figsize=(12.8, 9.6))\n",
"plt.bar(\n",
" [i - w for i in range(plot_count)], \n",
" rl_forces[:plot_count], \n",
" width=w, \n",
" align='center', \n",
" label='RL'\n",
")\n",
"plt.bar(\n",
" [i + w for i in range(plot_count)], \n",
" cfe_forces[:plot_count], \n",
" width=w, \n",
" align='center', \n",
" label='CFE'\n",
")\n",
"plt.bar(\n",
" [i for i in range(plot_count)], \n",
" gt_forces[:plot_count], \n",
" width=w, \n",
" align='center', \n",
" label='GT'\n",
")\n",
"plt.xlabel('Scenes')\n",
"plt.xticks(range(plot_count))\n",
"plt.ylabel('Forces')\n",
"plt.legend()\n",
"plt.bar( [i - w for i in range(plot_count)], rl_forces[:plot_count], color=\"#0065bd\", width=w, align='center', label='RL' )\n",
"plt.bar( [i for i in range(plot_count)], dp_forces[:plot_count], color=\"#e37222\", width=w, align='center', label='DP' )\n",
"plt.bar( [i + w for i in range(plot_count)], gt_forces[:plot_count], color=\"#a2ad00\", width=w, align='center', label='GT' )\n",
"plt.xlabel('Scenes'); plt.xticks(range(plot_count))\n",
"plt.ylabel('Forces'); plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Ee3Us_hD9nR"
},
"source": [
"## Training Progress Comparison"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DGBlUpQ271Ww"
},
"source": [
"Although the quality of the control in terms of force magnitudes is the primary goal of the setup above, there are interesting differences in terms of how both methods behave at training time. The main difference of the physics-unaware RL training and the DP approach with tightly coupled solver results in a significantly faster convergence for the latter. I.e., the gradients provided by the numerical solver give a much better learning signal than the undirected exploration of the RL process. The behavior of the RL training, on the other hand, can in part be ascribed to the on-policy nature of training data collection and to the more brute-force natured learning technique.\n",
"## Training Progress Comparison\n",
"\n",
"Although the quality of the control in terms of force magnitudes is the primary goal of the setup above, there are interesting differences in terms of how both methods behave at training time. The main difference of the physics-unaware RL training and the DP approach with its tightly coupled solver is that the latter results in a significantly faster convergence. I.e., the gradients provided by the numerical solver give a much better learning signal than the undirected exploration of the RL process. The behavior of the RL training, on the other hand, can in part be ascribed to the on-policy nature of training data collection and to the \"brute-force\" exploration of the reinforcement learning technique.\n",
"\n",
"The next cell visualizes the training progress of both methods with respect to iterations and wall time.\n",
"\n"
@ -744,26 +789,26 @@
},
"outputs": [],
"source": [
"def get_cfe_val_set_forces(experiment_path):\n",
"def get_dp_val_set_forces(experiment_path):\n",
" path = os.path.join(experiment_path, 'val_forces.csv')\n",
" table = pd.read_csv(path)\n",
" return list(table['time']), list(table['epoch']), list(table['forces'])\n",
"\n",
"rl_w_times, rl_step_nums, rl_val_forces = rl_trainer.get_val_set_forces_data()\n",
"cfe_w_times, cfe_epochs, cfe_val_forces = get_cfe_val_set_forces('PDE-Control-RL/networks/cfe-models/bench')\n",
"dp_w_times, dp_epochs, dp_val_forces = get_dp_val_set_forces('PDE-Control-RL/networks/dp-models/bench')\n",
"\n",
"fig, axs = plt.subplots(2, 1, figsize=(12.8, 9.6))\n",
"\n",
"axs[0].plot(np.array(rl_step_nums), rl_val_forces, label='RL')\n",
"axs[0].plot(np.array(cfe_epochs), cfe_val_forces, label='CFE')\n",
"axs[0].plot(np.array(rl_step_nums), rl_val_forces, color=\"#0065bd\", label='RL')\n",
"axs[0].plot(np.array(dp_epochs), dp_val_forces, color=\"#e37222\", label='DP')\n",
"axs[0].set_xlabel('Epochs')\n",
"axs[0].set_ylabel('Forces')\n",
"axs[0].set_ylim(0, 1500)\n",
"axs[0].grid()\n",
"axs[0].legend()\n",
"\n",
"axs[1].plot(np.array(rl_w_times) / 3600, rl_val_forces, label='RL')\n",
"axs[1].plot(np.array(cfe_w_times) / 3600, cfe_val_forces, label='CFE')\n",
"axs[1].plot(np.array(rl_w_times) / 3600, rl_val_forces, color=\"#0065bd\", label='RL')\n",
"axs[1].plot(np.array(dp_w_times) / 3600, dp_val_forces, color=\"#e37222\", label='DP')\n",
"axs[1].set_xlabel('Wall time (hours)')\n",
"axs[1].set_ylabel('Forces')\n",
"axs[1].set_ylim(0, 1500)\n",
@ -790,25 +835,22 @@
"source": [
"## Next steps\n",
"\n",
"**TODO, what would be interesting to try out for students / readers with the code above?**\n",
"- See how different values for hyperparameters, such as learning rate, influence the training process\n",
"\n",
"- Work with fields of different resolution and see how the two approaches then compare to each other. Larger resolutions make the physical dynamics more complex, and hence harder to control\n",
"\n",
"- Use trained models in settings with different environment parameters (e.g. viscosity, dt) and test how well they generalize \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "PDE-Control-RL.ipynb",
"provenance": []
"name": "PDE_Control_RL_may19.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",

View File

@ -21,30 +21,44 @@ The objective of the actor in actor-critic approaches depends on the output of t
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working on continuous action spaces, and has been extensively used in research, for example by [OpenAI](https://openai.com/). This can be attributed to the fact that it tends to achieve good results with a stable learning progress while still being comparatively easy to implement. Because of these factors we will use PPO as well.
More specifically, we will use the algorithm _PPO-clip_. This PPO variant sets a hard limit for the change in behavior caused by singular update steps. Formally, this generates the following update rule and objective for the actor:
More specifically, we will use the algorithm _PPO-clip_. This PPO variant sets a hard limit for the change in behavior caused by singular update steps. We'll denote a policy realized as a neural network with weights $\theta$ as $\pi(a; s,\theta)$, i.e., $\pi$ is conditioned on both a state of the environment $s$, and the weights of the NN.
In addition, $\theta_{\text{p}}$ will denote the weights from a previous state of the policy network. Correspondingly
$\pi(a; s,\theta_{\text{p}})$ or in short $\pi(\theta_{\text{p}})$ denotes a policy evaluation with the previous state of the network.
Formally, this gives the following objective for the actor:
$$\begin{aligned}
\text{arg max}_{\theta} \
\mathbb{E}_{a,s \sim \pi_{\theta_{\text{p}}}}
\Big[ & \text{min} \Big(
\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_{\text{p}} )} A^{\pi(\theta_{\text{p}})}(s, a),
\\ &
\ \text{clip}(\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_{\text{p}} )}, 1-\epsilon, 1+\epsilon)
A^{\pi(\theta_{\text{p}})}(s, a) \Big) \Big]
\end{aligned}$$
<!-- $$\begin{aligned}
\theta_{k+1} &= \text{arg max}_{\theta}L(\theta_k, \theta) \\
L(\theta_k, \theta) &=
\mathbb{E}_{s, a \sim \pi_{\theta_k}}[\text{min}(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s, a), \text{clip}(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon)A^{\pi_{\theta_k}}(s, a))]
\end{aligned}$$
\end{aligned}$$ -->
Here, $\epsilon$ defines the bound for the deviation from the previous policy. $A^{\pi_{\theta_k}}(s, a)$ denotes the estimate for the advantage of $a$ in a state $s$, i.e. how well the agent performs compared to a random agent. Its value depends on the critic's output:
Here, $\epsilon$ defines the bound for the deviation from the previous policy. $A^{\pi(\theta_{\text{p}})}(s, a)$ denotes the estimate for the advantage of performing action $a$ in a state $s$, i.e. how well the agent performs compared to a random agent. Its value depends on the critic's output:
$$\begin{aligned}
A^{\pi_{\theta_k}}_t &= \sum\limits_{i=0}^{n-t-1}(\gamma\lambda)^i\delta_{t+i} \\
A^{\pi(\theta_{\text{p}})}_t &= \sum\limits_{i=0}^{n-t-1}(\gamma\lambda)^i\delta_{t+i} \\
\delta_t &= r_t + \gamma V(s_{t+1} ; \phi) - V(s_t ; \phi)
\end{aligned}$$
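
As a brief illustration, this generalized advantage estimate can be accumulated backwards over a trajectory, e.g. as in the following numpy sketch (hypothetical helper; `values` is assumed to contain one additional entry for the value of the final state):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # generalized advantage estimation: discounted sum of TD residuals delta_t
    n = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(n)]
    advantages = np.zeros(n)
    acc = 0.0
    for t in reversed(range(n)):  # accumulate backwards through the trajectory
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```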
### Application to Inverse Problems
Reinforcement learning is widely used for trajectory optimization with multiple decision problems building upon one another. However, in the context of physical systems and PDEs, reinforcement learning algorithms are likewise attractive. In this setting, they can operate in a fashion that's similar to supervised single shooting approaches by generating full trajectories and learning by comparing the final approximation to the target.
However, the approaches differ in terms of how this optimization is performed. For example, reinforcement learning algorithms like PPO try to explore the action space during training by adding a random offset to the actions selected by the actor. This way the algorithm can discover new behavioral patterns that are more refined than the previous ones.
The way how long term effects of generated forces are taken into account can also differ for physical systems. In a control force estimator setup with differentiable physics (DP) loss, as discussed e.g. in {doc}`diffphys-code-burgers.ipynb`, these dependencies are handled by passing the loss gradient through the simulation step back into previous time steps. Contrary to that, reinforcement learning usually treats the environment as a black box without gradient information. When using PPO, the value estimator network is instead used to track the long term dependencies by predicting the influence any action has for the future system evolution.
The way how long term effects of generated forces are taken into account can also differ for physical systems. In a control force estimator setup with differentiable physics (DP) loss, as discussed e.g. in {doc}`diffphys-code-burgers`, these dependencies are handled by passing the loss gradient through the simulation step back into previous time steps. Contrary to that, reinforcement learning usually treats the environment as a black box without gradient information. When using PPO, the value estimator network is instead used to track the long term dependencies by predicting the influence any action has for the future system evolution.
Working on Burgers' equation, the trajectory generation process can be summarized as following, showing how the simulation steps of the environment and the neural network evaluations of the agent are interleaved.
Working with Burgers' equation as physical environment, the trajectory generation process can be summarized as follows. It shows how the simulation steps of the environment and the neural network evaluations of the agent are interleaved:
$$
\mathbf{u}_{t+1}=\mathcal{P}(\mathbf{u}_t+\pi(\mathbf{u}_t, \mathbf{u}^*, t; \theta)\Delta t)
@ -64,7 +78,7 @@ r_t^o &=
## Implementation
In the following, we'll describe a way to implement a PPO-based RL training for physical systems. This implementation is also the basis for the notebook of the next section, i.e., {doc}`reinflearn-code.ipynb`. While this notebook provides a practical example, and an evaluation in comparison to DP training, we'll first give a more generic overview below.
In the following, we'll describe a way to implement a PPO-based RL training for physical systems. This implementation is also the basis for the notebook of the next section, i.e., {doc}`reinflearn-code`. While this notebook provides a practical example, and an evaluation in comparison to DP training, we'll first give a more generic overview below.
To train a reinforcement learning agent to control a PDE-governed system, the physical model has to be formalized as an RL environment. The [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) framework, which we use in the following to implement a PPO training, uses a vectorized version of the gym environment. This way, rollout collection can be performed on multiple trajectories in parallel for better resource utilization and wall time efficiency.