

PPO-penalty (PPO1)
$\begin{cases} &J^{\theta'}_{PPO}(\theta)=J^{\theta'}(\theta)-\beta KL(\theta,\theta'),\quad J^{\theta'}(\theta)=\mathbb E_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)\right]\\ &\mathcal{L}^{\text{PENALTY}}(\theta) = \mathbb{E}_t \left[ \hat{A}_t \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} - \beta D_{KL} \left( \pi_{\theta_{\text{old}}}(\cdot | s_t) \parallel \pi_\theta(\cdot | s_t) \right) \right] \end{cases}$
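A minimal NumPy sketch of the penalty objective for a discrete action space may help make the two terms concrete. It assumes the policies are given as per-state logits and that the KL term is taken in the direction $D_{KL}(\pi_{\theta_{\text{old}}} \parallel \pi_\theta)$, as in the second line above. The function name `ppo_penalty_objective` and its arguments are hypothetical, chosen only for illustration.

```python
import numpy as np

def ppo_penalty_objective(logits_new, logits_old, actions, advantages, beta):
    """Sample estimate of E_t[ ratio_t * A_t - beta * KL(pi_old || pi_new) ].

    logits_new, logits_old: arrays of shape (T, num_actions) (assumed layout)
    actions: integer array of shape (T,), the actions sampled from pi_old
    advantages: float array of shape (T,), advantage estimates under pi_old
    beta: KL penalty coefficient
    """
    def log_softmax(x):
        # Subtract the row maximum for numerical stability.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp_new = log_softmax(np.asarray(logits_new, dtype=float))
    logp_old = log_softmax(np.asarray(logits_old, dtype=float))

    idx = np.arange(len(actions))
    # Importance ratio pi_new(a_t|s_t) / pi_old(a_t|s_t) for the sampled actions.
    ratio = np.exp(logp_new[idx, actions] - logp_old[idx, actions])
    # Per-state KL(pi_old || pi_new), summed over all actions.
    kl = (np.exp(logp_old) * (logp_old - logp_new)).sum(axis=-1)

    return np.mean(ratio * np.asarray(advantages, dtype=float) - beta * kl)
```

When the two policies coincide, the ratio is 1 and the KL term is 0, so the value reduces to the mean advantage. A larger `beta` lowers the objective whenever the policies differ, which is how the penalty discourages large policy updates.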
PPO-clip (PPO2)
$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min \left( \frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)} A^{\theta^k}(s_t, a_t), \ \text{clip} \left( \frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon \right) A^{\theta^k}(s_t, a_t) \right)$
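The clipped objective can be sketched in a few lines of NumPy, assuming the probability ratios $p_\theta(a_t|s_t)/p_{\theta^k}(a_t|s_t)$ and the advantages $A^{\theta^k}(s_t,a_t)$ have already been computed per sample. The function name `ppo_clip_objective` and the default `epsilon=0.2` are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, epsilon=0.2):
    """Sum over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio: array of shape (T,), p_theta(a_t|s_t) / p_theta_k(a_t|s_t)
    advantages: array of shape (T,), advantage estimates under theta_k
    epsilon: clipping range (0.2 is an assumed, commonly used value)
    """
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The element-wise minimum keeps the more pessimistic of the two terms.
    return np.minimum(unclipped, clipped).sum()
```

With a positive advantage, the term stops growing once the ratio exceeds $1+\epsilon$. With a negative advantage, it stops improving once the ratio drops below $1-\epsilon$. In both cases the objective gives no incentive to move the ratio further outside the clipping range.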
