tweaked expressions

NT 2021-08-17 21:10:55 +02:00
parent 2b08a15778
commit e731e11393
3 changed files with 11 additions and 9 deletions

View File

@ -14,14 +14,14 @@ additional properties, and summarize the pros and cons.
## Time steps and iterations
When using DP approaches for learning applications,
there is a lot of flexibility w.r.t. the combination of DP and NN building blocks.
As some of the differences are subtle, the following section will go into more detail.
We'll especially focus on solvers that repeat the PDE and NN evaluations multiple times,
e.g., to compute multiple states of the physical system over time.
To re-cap, this is the previous figure illustrating NNs with DP operators.
Here, these operators look like a loss term: they typically don't have weights,
and only provide a gradient that influences the optimization of the NN weights:
```{figure} resources/diffphys-shortened.jpg
@ -37,7 +37,7 @@ Similar to the previously described _physical losses_ (from {doc}`physicalloss`)
**Switching the Order**
However, with DP, there's no real reason to be limited to this setup. E.g., we could imagine a swap of the NN and DP components, giving the following structure:
```{figure} resources/diffphys-switched.jpg
---
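
The hunks above revolve around interleaving NN evaluations with differentiable solver steps, so here is a minimal, hypothetical sketch of that pattern. It is not taken from the repository; the `physics_step` function, the toy diffusion update, and the network sizes are illustrative assumptions, and PyTorch merely stands in for whichever autodiff framework is used. The point is that the physics operator has no trainable weights of its own, yet gradients flow through it across several unrolled time steps and drive the optimization of the NN weights.

```python
# Illustrative sketch (not the book's code): an NN correction interleaved with a
# differentiable "solver" step, unrolled over several time steps.
import torch

def physics_step(u, nu=0.05):
    # toy 1D periodic diffusion update standing in for a real PDE solver;
    # it has no trainable weights, but it is differentiable
    return u + nu * (torch.roll(u, 1, dims=-1) - 2.0 * u + torch.roll(u, -1, dims=-1))

net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

u0 = torch.rand(8, 32)        # batch of initial states
target = torch.zeros(8, 32)   # illustrative target state after the rollout

for it in range(100):
    u = u0
    for t in range(10):        # unroll multiple solver + NN evaluations over time
        u = physics_step(u)    # differentiable physics operator
        u = u + net(u)         # learned correction applied to the solver state
    loss = torch.nn.functional.mse_loss(u, target)
    opt.zero_grad()
    loss.backward()            # gradients flow back through all unrolled steps
    opt.step()
```

Swapping the order of the NN and DP components, as in the second hunk, simply amounts to moving the `net(u)` call relative to `physics_step(u)` inside the unrolled loop.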

View File

@ -60,7 +60,7 @@ of physical processes into the learning algorithms.
Over the course of the last decades,
highly specialized and accurate discretization schemes have
been developed to solve fundamental model equations such
as the Navier-Stokes, Maxwell's, or Schroedinger's equations.
Seemingly trivial changes to the discretization can determine
whether key phenomena are visible in the solutions or not.
Rather than discarding the powerful methods that have been

View File

@ -41,7 +41,9 @@ In addition, _actor-critic_ methods combine elements from both approaches. Here,
As PPO is an actor-critic approach, we need to train two interdependent networks: the actor and the critic.
The objective of the actor inherently depends on the output of the critic network (it provides feedback on which actions are worth performing), and likewise the critic depends on the actions generated by the actor network (this determines which states to explore).
This interdependence can promote instabilities, e.g., as strongly over- or underestimated state values can give wrong impulses during learning. Actions yielding higher rewards often also contribute to reaching states with higher informational value. As a consequence, when the (possibly incorrect) value estimates of individual samples are allowed to affect the agent's behavior without restriction, the learning progress can collapse.
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working with continuous action spaces. This can be attributed to the fact that it tends to achieve good results with stable learning progress, while still being comparatively easy to implement.
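
To make the clipping idea concrete, here is a minimal, hypothetical sketch of PPO's clipped surrogate loss for the actor; it is not the book's implementation, and the function name and arguments are illustrative. The clipping bound is what limits how strongly a single (possibly misjudged) advantage estimate can push the policy away from its previous behavior.

```python
# Illustrative sketch of PPO's clipped surrogate objective for the actor (not the book's code).
import torch

def ppo_actor_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    # probability ratio between the updated policy and the policy that collected the data
    ratio = torch.exp(new_logp - old_logp)
    # the clipped ratio caps how much any single advantage estimate can change the policy
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # take the more pessimistic of the two surrogate terms, then negate for gradient descent
    return -torch.mean(torch.min(ratio * advantage, clipped * advantage))
```

The critic is typically trained alongside with a simple regression loss on the estimated state values, and the advantages fed into this objective usually come from the critic's estimates, e.g., via generalized advantage estimation.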