smaller fixes
parent a5b2435918 · commit e636861373
@ -16,13 +16,16 @@ Reinforcement learning is formulated in terms of an environment that gives obser
In its simplest form, the learning goal for reinforcement learning tasks can be formulated as
$$
\text{arg max}_{\theta} \mathbb{E}_{a \sim \pi(;s,\theta_p)} \big[ \sum_t r_t \big],
$$ (learn-l2)

where the reward at time $t$ (denoted by $r_t$ above) is the result of an action $a$ performed by an agent.
The agents choose their actions with a neural network policy that decides based on a set of given observations.
The policy $\pi(a;s, \theta)$ returns the probability for the action, and is conditioned on the state $s$ of the environment and the weights $\theta$.

During the learning process, the central aim of RL is to use the combined information of state, action and corresponding rewards to increase the cumulative reward collected over each trajectory. To achieve this goal, multiple algorithms have been proposed, which can be roughly divided into two larger classes: _policy gradient_ and _value-based_ methods {cite}`sutton2018rl`.
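To make this objective concrete, here is a minimal sketch (not taken from the original text) that samples actions from a policy and accumulates the rewards of a single trajectory; the environment is assumed to expose a classic gym-style `reset`/`step` interface, and `policy` can be any network that maps an observation to action probabilities.

```python
import torch

def cumulative_reward(env, policy, max_steps=1000):
    """Run one trajectory and return the cumulative reward sum_t r_t."""
    obs = env.reset()                     # initial observation of state s_0
    total = 0.0
    for _ in range(max_steps):
        state = torch.as_tensor(obs, dtype=torch.float32)
        probs = policy(state)             # pi(a; s, theta): action probabilities
        action = torch.distributions.Categorical(probs).sample().item()
        obs, reward, done, _info = env.step(action)   # classic gym-style step
        total += reward
        if done:
            break
    return total
```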
## Algorithms
@ -44,14 +47,10 @@ PPO was introduced as a method to specifically counteract this problem. The idea
### PPO-Clip
More specifically, we will use the algorithm _PPO-clip_ {cite}`schulman2017proximal`. This PPO variant sets a hard limit for the change in behavior caused by individual update steps. As such, the algorithm uses a previous network state (denoted by a subscript $_p$ below) to limit the change per step of the learning process.
In the following, we will denote the network parameters of the actor network as $\theta$ and those of the critic as $\phi$.
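For orientation, a minimal sketch of such a setup might look as follows (PyTorch and the layer sizes are assumptions for illustration, not part of the original code): the actor's parameters play the role of $\theta$, the critic's the role of $\phi$, and a frozen copy of the actor serves as the previous state $\theta_p$.

```python
import copy
import torch

obs_dim, num_actions = 8, 4  # placeholder sizes for illustration

# Actor network: parameters correspond to theta, outputs pi(a; s, theta).
actor = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, num_actions), torch.nn.Softmax(dim=-1),
)

# Critic network: parameters correspond to phi, outputs V(s; phi).
critic = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

# Frozen snapshot of the actor, i.e. the previous state theta_p.
actor_prev = copy.deepcopy(actor).requires_grad_(False)
```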
<!-- simplified notation: \pi_{\theta_p} \pi(;\theta_p) , \pi_{\theta} \pi(;\theta) , A^{\pi_{\theta_p}} A -->
%Here, $\theta_p$ denotes a _previous_ state of weights, in comparison to which the performance is evaluated (only $\theta$ is updated while learning).
%$A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is the state-action value function as defined above and $V(s)$ is the state value.
@ -66,7 +65,7 @@ $\epsilon$ defines the bound for the deviation from the previous policy.
In combination, the objective for the actor is given by the following expression:
$$\begin{aligned}
\text{arg max}_{\theta} \mathbb{E}_{a \sim \pi(;s,\theta_p)} \Big[ \text{min} \big(
\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)}
A(s, a; \phi),
\text{clip}(\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)}, 1-\epsilon, 1+\epsilon)
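Translated into code, the clipped objective for the actor could be sketched roughly as follows (an assumption for illustration, not the original implementation); `logp_new` and `logp_old` are log-probabilities of the sampled actions under $\pi(;\theta)$ and $\pi(;\theta_p)$, `advantage` holds the estimates $A(s, a; \phi)$, and $\epsilon = 0.2$ is an illustrative clip bound.

```python
import torch

def ppo_clip_actor_loss(logp_new, logp_old, advantage, epsilon=0.2):
    """Negative clipped surrogate objective; minimizing it maximizes the expression above."""
    ratio = torch.exp(logp_new - logp_old)           # pi(a;s,theta) / pi(a;s,theta_p)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    return -surrogate.mean()                         # sample mean approximates the expectation
```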
@ -84,14 +83,14 @@ The critic is represented by a value function $V(s; \phi)$ that predicts the exp
Its objective is to minimize the squared advantage $A$:
$$\begin{aligned}
\text{arg min}_{\phi}\mathbb{E}_{a \sim \pi(;s,\theta_p)}[A(s, a; \phi)^2] \ ,
\end{aligned}$$
where the advantage function $A(s, a; \phi)$ builds upon $V$: its goal is to evaluate the deviation from an average cumulative reward. I.e., we're interested in estimating how much the decision made via $\pi(;s,\theta_p)$ improves upon making random decisions (again, evaluated via the unchanging, previous network state $\theta_p$). We use the so-called Generalized Advantage Estimation (GAE) {cite}`schulman2015high` to compute $A$ as:
$$\begin{aligned}
A(s_t, a_t; \phi) &= \sum\limits_{i=0}^{n-t-1}(\gamma\lambda)^i\delta_{t+i} \\
\delta_t &= r_t+\gamma \big( V(s_{t+1} ; \phi) - V(s_t ; \phi) \big)
\end{aligned}$$
Here $r_t$ describes the reward obtained in time step $t$, while $n$ denotes the total length of the trajectory. $\gamma$ and $\lambda$ are two hyperparameters which control how much influence rewards and state value predictions from the distant future have on the advantage calculation. They are typically set to values smaller than one.
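The two equations above amount to a short backward recursion. The following sketch (an assumption for illustration, not the original code) computes the advantages of one trajectory with the $\delta_t$ defined above and shows how the squared-advantage loss for the critic could be formed; the default values for $\gamma$ and $\lambda$ are merely examples.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory: rewards has shape [n], values holds V(s_t; phi)
    for t = 0..n (one bootstrap value for the state after the final step)."""
    n = rewards.shape[0]
    # delta_t = r_t + gamma * (V(s_{t+1}) - V(s_t)), matching the definition above
    deltas = rewards + gamma * (values[1:] - values[:-1])
    advantages = torch.zeros(n)
    running = 0.0
    for t in reversed(range(n)):
        running = deltas[t] + gamma * lam * running   # sum_i (gamma*lam)^i delta_{t+i}
        advantages[t] = running
    return advantages

# Critic update: minimize the squared advantage, e.g.
# critic_loss = gae_advantages(rewards, values).pow(2).mean()
```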