Verified · arXiv:1707.06347 · 22 min
Reinforcement learning · Policy gradients

Proximal Policy Optimization Algorithms

Clip one ratio, and the policy can't run away.

Standard policy gradients use a batch of experience once and throw it away, because reusing it makes the policy overshoot into something worse. PPO clips the update so you can squeeze many epochs out of each batch and still stay near the policy you started from.

Explaining the paper: Proximal Policy Optimization Algorithms · Schulman, Wolski, Dhariwal, Radford, Klimov · OpenAI · 2017 · arXiv:1707.06347

What if you could squeeze ten gradient steps out of one batch of experience, instead of one, without the policy moving somewhere much worse?

Reinforcement learning is expensive in a way ordinary supervised learning is not. A classifier has a fixed dataset sitting on disk. An agent has to generate its own data by acting in the world, and useful data only comes from a halfway-decent policy, which you only get by training on data. Every gradient step depends on fresh experience that costs real interaction to collect. So the currency that matters most is sample efficiency: how much the policy improves per unit of experience spent.

The classic policy-gradient methods spend that currency carelessly. They collect a batch of experience, take exactly one gradient step, and discard the batch. Take a second step on the same data and the update is no longer justified. It can push the policy somewhere much worse, sometimes badly enough that it never recovers. You pay full price for the data and use it once.

PPO, out of OpenAI in 2017, is the fix that stuck. It sits at the end of a line of work about a single question: how far is a policy allowed to move in one update? Conservative policy iteration asked it, then TRPO answered it with a real but heavy machine. PPO keeps the answer and throws away the weight. A change of a few lines to a vanilla policy gradient lets you reuse each batch for ten epochs of updates, held in check by a clip that stops rewarding the policy once it moves too far. It became the default reinforcement-learning algorithm, and the one behind the RL stage of RLHF that produced the first instruction-following language models.

To see why a clip is enough, we build a short tower. What a policy gradient is, and why advantage is the right thing to weight it by. Why reusing a batch is dangerous. What a trust region is. And finally the clipped objective that recreates the trust region with a clip and a minimum.

Learning from your own actions

Start with the object being trained: a policy, written $\pi_\theta$, a function with parameters $\theta$ that reads the current situation (the state $s$) and outputs a probability distribution over what to do (the action $a$). It is stochastic on purpose: the agent samples its action, which both explores and makes the math differentiable. The world responds with a reward and a new state, and the sum of (discounted) future reward from a state is its return. The whole goal is to adjust $\theta$ so the policy collects more return.

How do you nudge $\theta$ using only sampled experience? The answer is the policy gradient, and it is older than deep RL. It comes from one identity, the score-function trick:

$$\nabla_\theta\, \pi = \pi\,\nabla_\theta \log \pi$$

Plug it into the derivative of expected return and the gradient turns into something you can estimate from samples:

$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \,\right] \tag{1}$$

Read it as an instruction. For each action you actually took, take the gradient of its log-probability, the direction in parameter space that would make that action more likely, and scale it by a number $\hat{A}_t$ that says how good the action turned out to be. Sum over the batch. The hat on $\hat{\mathbb{E}}_t$ is a reminder that this is an empirical average over a finite batch of samples, not an exact expectation. This is REINFORCE (Williams, 1992), and nothing about (1) is new to PPO. Note the sign: $\hat{g}$ already points uphill. You add it to $\theta$ (gradient ascent), because the aim is to maximize return, not minimize a loss.

Everything now rides on that weight $\hat{A}_t$. The naive choice is the raw return: reward good outcomes, punish bad ones. It works, but it is needlessly noisy. If every action in a good state earns a return of around 100, the gradient mostly encodes "this state was good" rather than "this action was good," even for an average action. What you actually want to know is whether an action did better or worse than the state's own baseline. That is the advantage:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

Here $V^\pi(s)$ is the value of the state, the return you expect from $s$ if you just follow the policy, and $Q^\pi(s,a)$ is the value of taking action $a$ first. Their difference is "how much better than usual was this action." Subtracting a baseline that depends only on the state, not the action, is free in expectation: the baseline is the same number no matter which action was taken, and the gradient weights it by $\nabla_\theta \log \pi$, whose average over the policy's own action distribution is exactly zero, so the baseline terms sum to zero,

$$\sum_a b(s)\,\nabla_\theta \pi(a \mid s) = b(s)\,\nabla_\theta \sum_a \pi(a \mid s) = b(s)\,\nabla_\theta 1 = 0$$

and the subtraction leaves the gradient's average direction unchanged while cutting its variance. Recentering around zero only sharpens the signal: the gradient stops spending itself on the state and concentrates on the choice.
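A small numerical check can make the zero-mean argument concrete. The sketch below assumes a three-action softmax policy with made-up logits and action values (none of these numbers come from the paper). It computes the exact expected gradient with raw values and with advantages, and a second moment of the per-sample gradient as a rough proxy for its noise.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(probs, a):
    # For a softmax policy, d/d(logits) of log pi(a) is onehot(a) - probs.
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

def expected_gradient(probs, weights):
    # Exact expectation over actions: sum_a pi(a) * grad log pi(a) * weight(a).
    g = [0.0] * len(probs)
    for a, p in enumerate(probs):
        score = grad_log_pi(probs, a)
        for i, s in enumerate(score):
            g[i] += p * s * weights[a]
    return g

def second_moment(probs, weights):
    # E_pi[ ||grad log pi(a) * weight(a)||^2 ], a proxy for per-sample noise.
    total = 0.0
    for a, p in enumerate(probs):
        score = grad_log_pi(probs, a)
        total += p * sum((s * weights[a]) ** 2 for s in score)
    return total

# Illustrative numbers only (assumptions, not from the paper).
logits = [0.2, -0.5, 1.0]
q_values = [101.0, 99.0, 100.5]          # every action looks "good" in this state

probs = softmax(logits)
baseline = sum(p * q for p, q in zip(probs, q_values))   # V(s) = E_pi[Q(s, a)]
advantages = [q - baseline for q in q_values]

g_raw = expected_gradient(probs, q_values)     # weighted by raw values
g_adv = expected_gradient(probs, advantages)   # weighted by advantages

print("same direction:", all(abs(x - y) < 1e-9 for x, y in zip(g_raw, g_adv)))
print("noise proxy, raw vs advantage:",
      second_moment(probs, q_values), second_moment(probs, advantages))
```

The two expected gradients agree to floating-point precision, while the second moment drops sharply once the weights are recentered around zero.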

Below, the amber curve is the (unknown) true return over a one-dimensional action. The teal bell is the current policy. Each step it samples a handful of actions, scores each by its advantage (return minus the dashed baseline), then pushes the likely-to-help actions up and the likely-to-hurt actions down. The bell climbs:

Figure 1 · the policy gradient
The return R(a) over actions is fixed but unknown. The policy samples actions and weights each by its advantage: actions above the baseline V get an up arrow (make more likely), below it a down arrow. Averaged, that is exactly ∇logπ·A, and the policy climbs toward higher return.

In practice $\hat{A}_t$ is itself estimated, by generalized advantage estimation (GAE), from a learned value function $V$. GAE sums one-step surprises with a decay $\lambda$:

$$\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^{l}\, \delta_{t+l} \tag{11--12}$$

The term $\delta_t$ is a one-step advantage estimate (reward plus discounted next-value, minus current value), and $\gamma$ is the discount, $\lambda$ a knob trading variance for bias. The clean "baseline subtraction is unbiased" story is about the ideal advantage; GAE is a deliberately slightly biased estimate of it, a distinction that matters again later. The rest of PPO is about what you do with the gradient, not how you estimate the advantage.
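The GAE sum is usually computed with one backward pass over a finite rollout. The following is a minimal sketch under a few assumptions: the function name `gae` and its argument layout are illustrative, the sum is truncated at the end of the rollout, and `values` carries one extra entry used as the bootstrap value of the final state. The defaults 0.99 and 0.95 match the discount and GAE parameter the paper lists for its runs.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout of length T.

    rewards: list of T rewards.
    values:  list of T + 1 value predictions; values[T] bootstraps the tail.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step surprise: reward plus discounted next value, minus current value.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}, which unrolls to the decayed sum.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative rollout (made-up numbers).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 1.0, 0.0]
print(gae(rewards, values))
```

Setting `lam=0` reduces each estimate to the one-step surprise, and `lam=1` recovers the discounted return minus the value baseline, which is the variance-for-bias dial described above.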

The waste: one update per rollout

The inefficiency, stated plainly: Equation (1) is an average over data drawn from the current policy $\pi_\theta$. The instant you take a gradient step, $\theta$ changes, so the policy changes, so the batch you collected was drawn from a policy that no longer exists. Use it for a second step and you are averaging the gradient of the new policy against actions sampled from the old one. The estimate is biased, and the more steps you take, the more wrong it gets. Each step moves the policy a little further from the one that collected the batch, so the recorded actions become less and less representative of what the current policy would actually do. The importance ratios that correct for the mismatch, introduced next, only get noisier as the gap widens.

There is a standard fix for "I have samples from the wrong distribution": importance sampling. Reweight each sample by how much more or less likely the new policy is to have taken that action. The weight is a ratio, and it is the single most important symbol in PPO:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad r_t(\theta_\text{old}) = 1$$

At the start of an update the new policy equals the old one, so every $r_t = 1$. As you optimize, an action the new policy favors more gets $r_t > 1$, one it favors less gets $r_t < 1$. Weighting the advantage by this ratio gives a surrogate objective you can keep optimizing on the fixed batch:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\, \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t \,\right] = \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\, \hat{A}_t \,\right] \tag{6}$$

The superscript $CPI$ stands for conservative policy iteration (Kakade and Langford, 2002), where this objective was proposed. It is a good objective for a small step. Near $\theta_\text{old}$ it agrees with the true return to first order, which is the formal version of "if you barely move, the reweighted batch is a faithful guide." The trouble is the word barely.

Maximize $L^{CPI}$ without restraint and the optimizer notices something it can exploit. For an action with positive advantage, the objective $r_t \hat{A}_t$ just keeps growing as $r_t$ grows. So gradient ascent drives $r_t$ as high as it can, making a handful of actions far more likely, well outside the range where the old batch says anything reliable. The policy lurches, the next batch is collected by a worse policy, and the whole process can spiral. This is exactly the destructive update from before, now with a name: an unbounded importance ratio. We need a way to let the policy move a little and refuse to let it move a lot.
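In code the ratio is normally formed from stored log-probabilities, and the runaway behavior is easy to see by raising one log-probability by hand. This is a small sketch with hypothetical helper names (`ratios`, `surrogate_cpi`) and made-up numbers; the two samples are assumed to come from different states, so changing one log-probability leaves the other untouched.

```python
import math

def ratios(logp_new, logp_old):
    # r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed in log space.
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def surrogate_cpi(logp_new, logp_old, advantages):
    # Unclipped surrogate: the batch mean of r_t * A_t.
    r = ratios(logp_new, logp_old)
    return sum(ri * ai for ri, ai in zip(r, advantages)) / len(advantages)

# Illustrative batch of two samples (assumed values).
logp_old = [math.log(0.1), math.log(0.5)]
advantages = [1.0, -0.5]

# At theta = theta_old every ratio is 1 and the surrogate is the mean advantage.
print(ratios(logp_old, logp_old), surrogate_cpi(logp_old, logp_old, advantages))

# Make the positive-advantage action steadily more likely: the surrogate
# keeps rising, with nothing in the objective telling the optimizer to stop.
boosts = [0.0, 0.5, 1.0, 1.5, 2.0]
values = [surrogate_cpi([logp_old[0] + b, logp_old[1]], logp_old, advantages)
          for b in boosts]
print(values)
```

Working in log space avoids dividing two small probabilities, which is why the stored quantity is usually the old log-probability and not the probability itself.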

TRPO: move, but not too far

TRPO's answer is to make "don't move too far" a literal constraint. Measure the distance between the old policy and the new one with the KL divergence, a standard way to score how different two probability distributions are (zero when identical, growing as they diverge), and forbid any update that moves more than a small budget $\delta$:

$$\max_\theta\ \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\, \hat{A}_t \,\right] \quad\text{subject to}\quad \hat{\mathbb{E}}_t\!\left[\, \mathrm{KL}\big[\pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big] \,\right] \le \delta \tag{3--4}$$

The set of new policies that satisfy this KL budget is the policy's trust region, the neighborhood of $\theta_\text{old}$ within which the surrogate stays a faithful stand-in for the return. This is not arbitrary caution. There is a theorem underneath: a surrogate like (6), minus a penalty proportional to the KL, is a genuine lower bound on the policy's true return. Improve the bounded quantity and you are guaranteed to improve the real thing, which gives monotonic improvement, a sequence of policies that never gets worse. The guarantee, though, needs the maximum KL over all states, while the constraint above (and every implementation) uses the mean KL, a deliberate relaxation that trades the proof for tractability. And the KL is written $\mathrm{KL}[\pi_\text{old}, \pi_\text{new}]$, old policy first, which is not symmetric: the average is taken over the actions the old policy would take, so it asks how surprised the new policy is by them, and it penalizes the new policy most for draining probability from actions the old one favored. The guarantee is real in theory and approximate in practice: TRPO keeps the structure of the bound and relaxes the part that made it provable.
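Both the asymmetry of the KL and the gap between mean and max can be checked on tiny discrete distributions. The sketch below uses made-up two-action policies over two states, and the budget of 0.2 is purely illustrative (far larger than the roughly 0.01 used in practice).

```python
import math

def kl(p, q):
    """KL[p, q] = sum_a p(a) * log(p(a) / q(a)) for discrete distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

old = [0.9, 0.1]
new = [0.5, 0.5]
forward = kl(old, new)   # old policy first: the direction used in the constraint
reverse = kl(new, old)   # swapping the arguments gives a different number
print(forward, reverse)

# Two states: the policy moved a lot in the first and not at all in the second.
old_by_state = [[0.9, 0.1], [0.5, 0.5]]
new_by_state = [[0.5, 0.5], [0.5, 0.5]]
per_state = [kl(o, n) for o, n in zip(old_by_state, new_by_state)]
mean_kl = sum(per_state) / len(per_state)
max_kl = max(per_state)

delta = 0.2  # illustrative budget
print("mean KL within budget:", mean_kl <= delta)
print("max KL within budget: ", max_kl <= delta)
```

Here the mean-KL check accepts the update while the max-KL check would reject it, which is the relaxation the text describes: the averaged constraint can hide a large move in a single state.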

Below, the amber bell is the old policy, fixed. Drag the new policy's mean and watch the KL grow. Inside the shaded band the step is small enough to trust; push past $\delta$ and you are in the region where the surrogate stops being a reliable stand-in for the return:

Figure 2 · the trust region
The old policy is fixed; shifting the new policy grows the KL divergence between them. The shaded band is the trust region, the set of new policies within KL ≤ δ. TRPO permits any step inside it and rejects the rest. (δ here is enlarged for legibility; real runs keep KL near 0.01.)

It works, and it is a pain. The constraint turns each update into a constrained optimization, solved with conjugate gradients, a linear approximation to the objective, and a quadratic approximation to the KL built from the Fisher information matrix. That is second-order machinery. It does not play nicely with the things modern networks do all the time: parameter sharing between the policy and the value function, dropout, any architecture with noise. PPO's whole ambition is to get TRPO's "stay close" behavior with nothing but first-order gradient steps.

PPO's move: clip the ratio

Look again at what went wrong with $L^{CPI}$. The objective $r_t \hat{A}_t$ rewards the optimizer for pushing $r_t$ ever further in the helpful direction, with no diminishing returns. PPO's fix is direct: cap the reward. Past a small band around 1, stop counting any further change in the ratio.

Concretely, clip the ratio to $[1-\epsilon,\, 1+\epsilon]$ (with $\epsilon$ a small number like 0.2), and take the minimum of the clipped and unclipped objectives:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\, \min\!\Big( r_t(\theta)\, \hat{A}_t,\ \ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \,\right] \tag{7}$$

That $\min$ is doing more than it looks. Take it apart by the sign of the advantage, because the two cases behave differently, and the difference is the entire idea.

Good action ($\hat{A}_t > 0$). You want to make it more likely, so you push $r_t$ up. The objective rises along $r_t \hat{A}_t$ until $r_t$ hits $1+\epsilon$, then goes flat: the clip caps the reward at $(1+\epsilon)\hat{A}_t$, so beyond that point there is nothing to gain and the gradient for this sample is zero. The optimizer stops pushing. The lower clip $1-\epsilon$ never enters; for a good action it is inert.

Bad action ($\hat{A}_t < 0$). Now you want it less likely, so you push $r_t$ down. By symmetry the binding clip is the lower one: the objective is flat at $(1-\epsilon)\hat{A}_t$ for $r_t < 1-\epsilon$, and only the upper region is sloped. So the clip that actually bites is $1+\epsilon$ for good actions and $1-\epsilon$ for bad ones. The other one sits idle in each case.
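The two cases can be verified for a single sample with a few lines of arithmetic. This is a sketch, not the paper's code: `l_clip` evaluates the inner term of (7) for one timestep, and `slope` is a hypothetical helper that estimates the derivative with respect to the ratio by a central finite difference.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def l_clip(r, adv, eps=0.2):
    # Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)

def slope(r, adv, eps=0.2, h=1e-6):
    # Central finite difference of the objective with respect to the ratio r.
    return (l_clip(r + h, adv, eps) - l_clip(r - h, adv, eps)) / (2.0 * h)

for adv in (1.0, -1.0):
    for r in (0.5, 1.0, 1.5):
        print(f"A={adv:+.0f}  r={r:.1f}  L={l_clip(r, adv):+.2f}  dL/dr={slope(r, adv):+.2f}")
```

For a positive advantage the slope vanishes only above `1 + eps`; for a negative advantage it vanishes only below `1 - eps`. On the opposite side of the band the slope stays equal to the advantage, so a sample that has drifted the wrong way still gets corrected.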

Drag the ratio and flip the sign of the advantage. The thick teal curve is $L^{CLIP}$; the thin amber line is the unclipped $r_t \hat{A}_t$ it would have followed. The red dot at $r_t = 1$ is where every update starts:

Figure 3 · the clipped objective
One timestep's objective versus the ratio r. The clipped objective follows the unclipped r·A until r leaves [1−ε, 1+ε], then flattens, so the gradient dies. The binding clip is the upper 1+ε when A>0 and the lower 1−ε when A<0.

This is the trust region of TRPO, achieved with arithmetic instead of a constrained solve. Keeping each update near the old policy is what proximal means, the P in PPO. There is no Fisher matrix, no conjugate gradient, no line search. There is a $\min$, a $\mathrm{clip}$, and a hyperparameter $\epsilon$. And because the clip only flattens the objective, it composes with anything: shared value heads, dropout, recurrent policies, all fine.

It is worth being precise about why this stands in for a trust region. TRPO enforces a hard limit: it measures the KL distance from the old policy to the new one and refuses any step that exceeds a budget $\delta$, so each update stays inside the region where the local surrogate is still a trustworthy guide to the true return. PPO's clip is the soft, cheap version of that idea. Once a sample's ratio $r_t$ leaves $[1-\epsilon,\, 1+\epsilon]$ the objective goes flat, its gradient dies, and the optimizer stops pushing that sample any further. Nothing computes a KL or solves a constraint, yet the new policy is held near the old one because the only samples still contributing gradient are the ones that have not yet moved far. This approximates the trust region, it does not reproduce it: the clip caps each sample one at a time and never bounds the KL the way TRPO's constraint does, so it is a stand-in for TRPO's stay-close behavior, not for its monotonic-improvement proof.

Why the minimum makes it a floor

One detail in (7) is easy to skim past and is the reason the whole thing is safe: it is a $\min$, not just a clip. Why take the smaller of the clipped and unclipped values, rather than always using the clipped one?

The paper's own phrasing is the clearest: you ignore the change in the ratio only when it would make the objective improve, and you keep the unclipped ratio when it makes the objective worse. Walk the good-action case. For $r_t > 1+\epsilon$, the unclipped $r_t \hat{A}_t$ is larger than the clipped value, so the $\min$ discards it: you decline the extra reward, the gradient dies, the policy is not encouraged to push further. Good. But suppose an optimization step has overshot the wrong way and made a good action less likely, dropping $r_t$ below $1-\epsilon$. There the unclipped value is smaller, so the $\min$ keeps it, the gradient stays alive, and it pulls $r_t$ back up where it belongs. The $\min$ removes the reward for over-improving, and keeps the gradient that fixes an over-correction.

The compact way to say it: $L^{CLIP}$ is a pessimistic lower bound on the unclipped surrogate $L^{CPI}$. It never reports a rosier number than $L^{CPI}$ would, so optimizing it cannot be fooled into a large step by a runaway ratio. A crucial qualifier, and the most common misconception about PPO: this is a lower bound on the surrogate, not on the true return. The lower-bound-on-return story is TRPO's, through its max-KL penalty. PPO keeps the stay-close behavior and drops the proof along with the second-order machinery.
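The lower-bound claim is about the surrogate, and it holds sample by sample, which makes it easy to test numerically. The sketch below draws arbitrary ratios and advantages from a seeded generator (the ranges are assumptions chosen only to cover both sides of the band) and checks that the clipped value never exceeds the unclipped one.

```python
import random

def l_cpi(r, adv):
    # Unclipped per-sample surrogate.
    return r * adv

def l_clip(r, adv, eps=0.2):
    # Clipped per-sample surrogate from (7).
    clipped_r = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * adv, clipped_r * adv)

rng = random.Random(0)
samples = [(rng.uniform(0.0, 3.0), rng.uniform(-2.0, 2.0)) for _ in range(1000)]

# Gap between the unclipped and clipped objectives for every sample.
gaps = [l_cpi(r, a) - l_clip(r, a) for r, a in samples]
print("smallest gap:", min(gaps))   # never negative: L_CLIP <= L_CPI

# At r = 1, where every update starts, the two objectives coincide.
print("gap at r = 1:", l_cpi(1.0, 0.7) - l_clip(1.0, 0.7))
```

Because the bound holds for each sample, it also holds for any batch average. It says nothing about the true return, which is the distinction drawn above.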

You can watch the bound do its job. As the policy moves further from $\theta_\text{old}$ (optimizing the one batch harder), the unclipped $L^{CPI}$ keeps climbing, the way that tempted the optimizer into trouble. The clipped $L^{CLIP}$ rises, peaks while the KL is still small, then falls: a built-in penalty for moving too far. The maximum sits right where you want the update to stop:

Figure 4 · a lower bound with a penalty
Optimizing one batch harder moves the policy rightward. The unclipped surrogate L^CPI rises without bound; L^CLIP rises, peaks near a small KL (about 0.02), then declines, so its maximum is a natural stopping point. KL (violet) climbs throughout. A faithful reconstruction of the paper's Figure 2, computed on a synthetic minibatch.

With ordinary gradient ascent on the unclipped surrogate you would keep climbing the amber curve straight off the edge; the clip turns the objective into one that wants you to stop near the old policy, so plain SGD, run for several epochs, lands in roughly the right place on its own.

The other knob: a KL penalty

Clipping is not the only way to discourage large steps, and the paper is careful to test the obvious alternative: put the KL back in the objective as a penalty, the way the TRPO theory suggests, and adapt its strength so each update lands near a target KL $d_\text{targ}$:

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\big[\pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big] \,\right] \tag{8}$$

After each update you measure the KL you actually got. If it came in well under target (below $d_\text{targ}/1.5$ in the paper), the penalty was too strong, so you halve $\beta$. If it overshot (above $1.5\, d_\text{targ}$), you double it. The coefficient chases the right value on its own, which sidesteps TRPO's complaint that no single fixed $\beta$ works across problems or even across one run. It is a clean idea and it works, just worse than clipping in the paper's experiments: its best setting reaches an average normalized score of 0.74, against 0.82 for the clip, so the penalty version is kept as a baseline and the clip is the recommendation. The rest of the algorithm is packaging.
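The adaptive rule is small enough to write out. This sketch follows the halve-or-double scheme with a tolerance factor of 1.5 on either side of the target, as in the paper; the function name `update_beta`, the starting coefficient, and the KL values fed to it are illustrative assumptions.

```python
def update_beta(beta, kl_measured, kl_target, tolerance=1.5, factor=2.0):
    """Adapt the KL penalty coefficient after one policy update."""
    if kl_measured < kl_target / tolerance:
        return beta / factor      # update was too timid: weaken the penalty
    if kl_measured > kl_target * tolerance:
        return beta * factor      # update overshot: strengthen the penalty
    return beta                   # close enough to target: leave it alone

# Illustrative run: measured KLs are made up, target is 0.01.
beta = 1.0
history = []
for measured in [0.001, 0.002, 0.012, 0.040, 0.011]:
    beta = update_beta(beta, measured, kl_target=0.01)
    history.append(beta)
print(history)
```

The updated coefficient is used for the next policy update, so the rule reacts one iteration late; it only needs to keep the measured KL in the right neighborhood, not hit the target exactly.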

The whole loop

A real PPO objective bundles two more terms onto the clip. If the policy and value function share a network, you need a value-prediction loss to train the shared part; and a small entropy bonus keeps the policy from collapsing to determinism too early, which preserves exploration:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[\, L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S\big[\pi_\theta\big](s_t) \,\right] \tag{9}$$

Mind the signs, because they trip people up. The value loss is a squared error you want small,

$$L_t^{VF} = \big(V_\theta(s_t) - V_t^{\text{targ}}\big)^2$$

where $V_t^{\text{targ}}$ is the return observed from $s_t$, in practice usually computed as $\hat{A}_t + V(s_t)$. The value net is trained to predict the very returns the advantage was built from, and because $L_t^{VF}$ is a loss, it is subtracted. The entropy $S$ is a bonus you want large, so it is added. Both signs are correct precisely because the whole expression is maximized; code usually minimizes the negative of this, which flips all three signs back, and the paper calling $L$ a "loss" is just framework habit. For the continuous-control runs the policy and value nets are separate, so $c_1$ does not even apply and there is no entropy bonus; the extra terms earn their keep mainly on the shared-network Atari agent, where $c_1 = 1$ and $c_2 = 0.01$.

Now the loop, which is short enough to hold in your head. Each iteration, $N$ actors run the current policy in parallel for $T$ timesteps, tagging every step with an advantage estimate. That fills a buffer of $N\,T$ samples. Then you optimize the clipped objective on that buffer for $K$ epochs of minibatch ascent, refresh $\theta_\text{old} \leftarrow \theta$, and collect again. The $K$ is the entire point: it is the data reuse that vanilla policy gradients could not safely do.

Figure 5 · one PPO iteration
N actors collect T timesteps each into a buffer of NT samples; the same buffer is reused for K epochs of clipped ascent before θ_old is refreshed and fresh data is collected. At K=1 this is ordinary policy gradient; PPO's win is the reuse (MuJoCo uses K=10).

And the same thing in code, where the ascent-as-negative-loss sign convention shows up directly:

# one PPO iteration (actor-critic style pseudocode: run, gae, traj_with,
# minibatches, logp and ent stand in for your own rollout and model code)
data = []                              # collect with the CURRENT policy
for actor in range(N):                 # N actors run in parallel
    traj = run(pi_old, T)              # T timesteps each
    adv  = gae(traj, V, gamma, lam)    # advantage per step (Eqs 11-12)
    ret  = adv + V(traj.s)             # value targets: V_targ = A_hat + V(s)
    data += traj_with(adv, ret, traj.logp)   # store returns and action log-probs under pi_old

for epoch in range(K):                 # reuse the SAME data K times
    for mb in minibatches(data, M):
        r      = exp(logp(pi, mb) - mb.logp_old)    # ratio r_t (logp_old is a constant)
        unclip = r * mb.adv
        clip   = clamp(r, 1 - eps, 1 + eps) * mb.adv
        # take the smaller, elementwise: once the ratio drifts past the clip
        # range in the rewarded direction, the objective stops improving and
        # the gradient for that sample dies
        L_clip = mean(minimum(unclip, clip))        # Eq 7
        L_vf   = mean((V(mb.s) - mb.ret) ** 2)      # value loss
        loss   = -(L_clip - c1 * L_vf + c2 * ent(pi, mb))  # ascend: minimize -L
        opt.zero_grad(); loss.backward(); opt.step()       # clear stale grads first

pi_old = copy(pi)                      # refresh with a frozen snapshot, then collect again

That is the whole algorithm. Collect a batch, store each action's probability under the old policy, then take $K$ epochs of clipped minibatch steps before the data goes stale. A few lines on top of REINFORCE, and the sample efficiency of a trust-region method. You can watch why the repeated epochs stay safe by scrubbing through them:

Figure 6 · K epochs on one stale batch
One fixed minibatch, optimized by plain ascent for 10 epochs under each objective. The unclipped L^CPI keeps pushing every ratio harder, and the per-sample ratios (strip below, one dot each) drift far outside the band. L^CLIP flattens after a few epochs: each sample's gradient dies once its ratio crosses 1±ε, so the ratios pile up at the band edges and extra epochs change almost nothing. A toy simulation of the mechanism (real formulas, synthetic advantages, ε=0.2, K=10 as in MuJoCo).
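The mechanism in Figure 6 can be reproduced in a few lines with no deep-learning library. The sketch below is a toy under stated assumptions, not the paper's setup: a two-action policy with a single sigmoid parameter, a hand-written batch of two samples standing in for a rollout, full-batch gradient ascent with a step size of 1.0, and an analytic gradient. All names (`surrogate_grad`, `run_epochs`, `max_ratio_dev`) are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_prob(theta, action):
    # Two-action policy: pi(action = 1) = sigmoid(theta).
    p1 = sigmoid(theta)
    return math.log(p1 if action == 1 else 1.0 - p1)

def surrogate_grad(theta, batch, eps=None):
    """d/dtheta of the batch-mean surrogate; eps=None means unclipped."""
    p1 = sigmoid(theta)
    total = 0.0
    for action, adv, logp_old in batch:
        r = math.exp(log_prob(theta, action) - logp_old)
        if eps is not None:
            clipped = max(1.0 - eps, min(1.0 + eps, r))
            if r * adv > clipped * adv:
                continue          # min() picked the flat branch: zero gradient
        score = (1.0 - p1) if action == 1 else -p1   # d log pi / d theta
        total += adv * r * score
    return total / len(batch)

def run_epochs(batch, epochs, lr=1.0, eps=None):
    theta = 0.0                   # theta_old: both actions equally likely
    for _ in range(epochs):
        theta += lr * surrogate_grad(theta, batch, eps)   # gradient ASCENT
    return theta

def max_ratio_dev(theta, batch):
    return max(abs(math.exp(log_prob(theta, a) - lp) - 1.0) for a, _, lp in batch)

# Hand-written batch: (action, advantage, log-prob under the old policy).
logp_old = math.log(0.5)
batch = [(1, 0.5, logp_old), (0, -0.5, logp_old)]

theta_clip = run_epochs(batch, 10, eps=0.2)
theta_free = run_epochs(batch, 10, eps=None)
print("max |r - 1| clipped:  ", max_ratio_dev(theta_clip, batch))
print("max |r - 1| unclipped:", max_ratio_dev(theta_free, batch))
```

In this toy the two ratios always sum to 2, so both samples leave the band at the same moment and the clipped run then stops moving entirely; extra epochs cost nothing. With a real network the samples interact through shared parameters, so ratios can drift somewhat past the band, which is the caveat raised in the questions further down.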

So what does it do

The first experiment is an ablation that earns the design. On seven MuJoCo continuous-control tasks, the paper sweeps the surrogate variants and scores each (0 = a random policy, 1 = the best result), averaged over all runs. Drag through them:

Figure 7 · which surrogate wins
Average normalized score on 7 MuJoCo tasks (Table 1). Clipping at ε=0.2 wins (0.82). Dropping clipping and penalty entirely is catastrophic (−0.39): with nothing holding the ratio back, one task runs off so badly the average drops below the random policy. The KL-penalty variants work but trail.

Clipping at $\epsilon = 0.2$ wins cleanly across the whole sweep, which is why the paper reuses it for its later continuous-control comparisons; that $\epsilon = 0.2$ lives in this sweep in the body of the paper, not in the MuJoCo hyperparameter table, and the often-quoted $c_1 = 0.5$ value comes from later libraries, not from here. The $-0.39$ for "no clipping or penalty" is the runaway ratio made visible: with nothing restraining the step, one environment (half cheetah) goes so far wrong it drags the whole average below the score of a random policy.

Beyond the ablation, PPO with $\epsilon = 0.2$ beats or matches the contenders of its day (TRPO, A2C, the cross-entropy method, vanilla policy gradients) on almost all of the continuous-control benchmarks, and it scales up to 3D humanoids that learn to run, steer toward moving targets, and get up after being knocked down. On Atari the picture is more nuanced. Across 49 games PPO won 30 on the fast-learning metric (average reward over all of training), ahead of A2C and ACER. On final performance, the average over the last 100 episodes, ACER edged it, 28 games to 19, with one tie. PPO learns quickly and reliably, and it is not uniformly the strongest final performer.

None of that fully explains why PPO took over. The deeper reason is the one this whole post has been building toward: it is simple. It is a few lines on top of a policy gradient, it has essentially one hyperparameter that matters, it tolerates the architectures people actually use, and it almost never blows up. When the RLHF pipeline needed a reinforcement-learning algorithm to fine-tune language models against a reward model, PPO was the obvious choice, and it is the algorithm in InstructGPT (with a per-token KL penalty bolted on, the same "stay close" instinct in a new place).

Step back and the argument is four moves long. A policy gradient nudges each action by its advantage. Reusing the batch needs an importance ratio, and an unrestrained ratio is what destroys the policy. A trust region fixes that but is heavy. And clipping the ratio recreates the trust region with a $\min$ and a $\mathrm{clip}$, cheap enough to run for ten epochs a batch. The hard-won stability of trust-region methods turned out to be available for the price of one line of arithmetic. That is the whole reason PPO is everywhere.

Provenance · Verified against primary literature
REINFORCE / PG theorem: The score-function policy gradient ∇logπ·A (Williams 1992; Sutton et al. 2000).
CPI (Kakade & Langford 2002): The ratio×advantage surrogate L^CPI that PPO clips; the source of the CPI superscript.
TRPO (Schulman et al. 2015): The KL trust region and monotonic-improvement bound PPO emulates first-order.
GAE (Schulman et al. 2015): The advantage estimator Â that the surrogate is weighted by.
Correction: L^CLIP is a lower bound on the surrogate L^CPI, not on the true return. TRPO’s monotonic-improvement guarantee (a bound on the return) is a separate result that needs the max KL over states; practical TRPO relaxes that to the mean KL, and PPO keeps the intuition, not the proof.

Questions you might still have

Why take many small clipped steps instead of one big gradient step?
A big step on a stale batch overshoots: once the policy has moved, the collected data no longer reflects it, and the surrogate stops being a good guide. Many small clipped steps keep the policy near where the data was collected, so each step stays trustworthy, and you still extract far more from the batch than one step would.

Does clipping actually keep the KL divergence small?
Not as a hard guarantee. Clipping is a soft, per-sample heuristic: it removes the incentive to push a single sample’s ratio past 1±ε, but it does not bound the KL the way TRPO’s constraint does. The average KL usually stays small, yet a few samples can still move, which is why many implementations add an explicit KL early-stop on top.

What does PPO actually inherit from TRPO?
The intuition, not the theorem. TRPO’s monotonic-improvement guarantee needs the maximum KL over states and a second-order solve. PPO keeps only the idea (stay near the old policy so the surrogate stays trustworthy) and gets it first-order with a clip, carrying no such proof.

Is the clipped objective a lower bound on the true return?
No. It is a lower bound on the unclipped surrogate L^CPI, not on the policy’s true performance. The bound-on-return story belongs to TRPO, through a different, max-KL penalty. Conflating the two is the most common PPO misconception.

Footnotes & further reading

  1. The paper: Schulman, Wolski, Dhariwal, Radford, Klimov, Proximal Policy Optimization Algorithms (OpenAI, 2017). A clear secondary reference is OpenAI's Spinning Up page.
  2. The trust region it descends from: Schulman, Levine, Abbeel, Jordan, Moritz, Trust Region Policy Optimization (2015), with the monotonic-improvement bound PPO emulates first-order.
  3. The advantage estimator: Schulman, Moritz, Levine, Jordan, Abbeel, High-Dimensional Continuous Control Using Generalized Advantage Estimation (GAE, 2015). The clean GAE weights are $\gamma^l$ on rewards and $(\gamma\lambda)^l$ on residuals.
  4. The surrogate's origin: Kakade and Langford, Approximately Optimal Approximate Reinforcement Learning (ICML 2002), the conservative policy iteration that names $L^{CPI}$.
  5. The policy gradient itself: Williams, Simple Statistical Gradient-Following Algorithms (REINFORCE, 1992), and Sutton, McAllester, Singh, Mansour, Policy Gradient Methods for RL with Function Approximation (2000).
  6. PPO in the wild: Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT, 2022), whose RL stage is PPO with a per-token KL penalty.