tweaked expressions
parent 2b08a15778
commit e731e11393
@@ -14,14 +14,14 @@ additional properties, and summarize the pros and cons.
## Time steps and iterations

-When using DP approaches for learning application,
+When using DP approaches for learning applications,
there is a lot of flexibility w.r.t. the combination of DP and NN building blocks.
-As some of the differences are subtle, the following section will go into more detail
+As some of the differences are subtle, the following section will go into more detail.
We'll especially focus on solvers that repeat the PDE and NN evaluations multiple times,
e.g., to compute multiple states of the physical system over time.
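
To make this setup more concrete, below is a minimal, hypothetical sketch of such an unrolled chain: a toy differentiable update stands in for a real DP solver step, and all names and sizes are illustrative rather than taken from the accompanying code.

```python
import torch

def solver_step(u, dt=0.1):
    # toy, differentiable stand-in for a PDE time step (a DP operator without trainable weights)
    return u + dt * (torch.roll(u, shifts=1, dims=-1) - u)

# small NN that corrects the state after every solver step
net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

u0    = torch.randn(16, 32)   # batch of initial states
u_ref = torch.randn(16, 32)   # hypothetical reference states after the rollout

for it in range(100):
    opt.zero_grad()
    u = u0
    for _ in range(4):            # repeat PDE and NN evaluations over time
        u = solver_step(u)        # differentiable physics step
        u = u + net(u)            # learned correction
    loss = ((u - u_ref) ** 2).mean()
    loss.backward()               # gradients flow back through NN and solver steps alike
    opt.step()
```
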
**XXX**

-To re-cap, this is the previous figure illustrating NNs with DP operators.
-Here, these operators look like a loss term: they typically don't have weights,
+To re-cap, here's the previous figure about combining NNs and DP operators.
+In the figure these operators look like a loss term: they typically don't have weights,
and only provide a gradient that influences the optimization of the NN weights:
```{figure} resources/diffphys-shortened.jpg
@@ -37,7 +37,7 @@ Similar to the previously described _physical losses_ (from {doc}`physicalloss`)
**Switching the Order**

-However, with DP, there's no real reason to be limited to this setup. E.g., we could imagine to switch the NN and DP components, giving the following structure:
+However, with DP, there's no real reason to be limited to this setup. E.g., we could imagine a swap of the NN and DP components, giving the following structure:
```{figure} resources/diffphys-switched.jpg
---
@@ -60,7 +60,7 @@ of physical processes into the learning algorithms.
Over the course of the last decades,
highly specialized and accurate discretization schemes have
been developed to solve fundamental model equations such
-as the Navier-Stokes, Maxwell’s, or Schroedinger’s equations.
+as the Navier-Stokes, Maxwell's, or Schroedinger's equations.
Seemingly trivial changes to the discretization can determine
whether key phenomena are visible in the solutions or not.
Rather than discarding the powerful methods that have been
@@ -41,7 +41,9 @@ In addition, _actor-critic_ methods combine elements from both approaches. Here,
As PPO methods are an actor-critic approach, we need to train two interdependent networks: the actor, and the critic.
The objective of the actor inherently depends on the output of the critic network (it provides feedback on which actions are worth performing), and likewise the critic depends on the actions generated by the actor network (this determines which states to explore).
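
As a minimal sketch of what these two networks can look like (assuming a PyTorch implementation and a small continuous action space; the class names and layer sizes are illustrative, not taken from the text):

```python
import torch

class Actor(torch.nn.Module):
    # maps a state to a Gaussian distribution over actions
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.body = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.Tanh())
        self.mean = torch.nn.Linear(64, action_dim)
        self.log_std = torch.nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

class Critic(torch.nn.Module):
    # maps a state to a scalar value estimate, used as feedback for the actor
    def __init__(self, state_dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)
```
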
-This interdependence can promote instabilities, e.g., as strongly over- or underestimated state values can give wrong impulses during learning. Actions yielding higher rewards often also contribute to reaching states with higher informational value. As a consequence, when the - possibly incorrect - value estimate of individual samples are allowed to unrestrictedly affect the agent’s behavior, the learning progress can collapse.
+This interdependence can promote instabilities, e.g., as strongly over- or underestimated state values can give wrong impulses during learning. Actions yielding higher rewards often also contribute to reaching states with higher informational value. As a consequence, when the - possibly incorrect - value estimates of individual samples are allowed to unrestrictedly affect the agent's behavior, the learning progress can collapse.

PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working on continuous action spaces. This can be attributed to the fact that it tends to achieve good results with a stable learning progress, while still being comparatively easy to implement.
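
At the heart of this restriction is PPO's clipped surrogate objective: the probability ratio between the updated and the old policy is clipped to a small interval around 1, so that a single (possibly misleading) advantage estimate cannot push the actor arbitrarily far. A minimal sketch, assuming precomputed log-probabilities and advantage estimates (the function name and the value of `eps` are illustrative):

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    # probability ratio between updated and old policy for the sampled actions
    ratio = torch.exp(new_log_prob - old_log_prob)
    # clipping limits how much a single advantage estimate can change the actor
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # take the pessimistic (smaller) objective and negate it to obtain a loss
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```
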