Proximal Policy Optimization Algorithms

Clip one ratio, and the policy can't run away.

Standard policy gradients use a batch of experience once and throw it away, because reusing it makes the policy overshoot into something worse. PPO clips the update so you can squeeze many epochs out of each batch and still stay near the policy you started from.

Explaining the paper: Proximal Policy Optimization Algorithms. Schulman, Wolski, Dhariwal, Radford, Klimov · OpenAI · 2017 · arXiv:1707.06347

TRPO bought its stability with a heavy second-order solve; PPO gets the same stay-close behavior from a clip cheap enough to reuse each batch for ten epochs, and it became the RL stage behind the first instruction-following language models.

Reinforcement learning is expensive in a way ordinary supervised learning is not. A classifier has a fixed dataset sitting on disk. An agent has to generate its own data by acting in the world, and useful data only comes from a halfway-decent policy, which you only get by training on data. Every gradient step depends on fresh experience that costs real interaction to collect. So the currency that matters most is sample efficiency: how much the policy improves per unit of experience spent.

The classic policy-gradient methods spend that currency carelessly. They collect a batch of experience, take exactly one gradient step, and discard the batch. Take a second step on the same data and the update is no longer justified. It can push the policy somewhere much worse, sometimes badly enough that it never recovers.

PPO, out of OpenAI in 2017, is the version that stuck. It sits at the end of a line of work about a single question: how far is a policy allowed to move in one update? Conservative policy iteration asked it, then TRPO answered it with a real but heavy machine. PPO keeps the answer and drops the heavy machinery. A change of a few lines to a vanilla policy gradient lets you reuse each batch for ten epochs of updates, held in check by a clip that stops rewarding the policy once it moves too far. It became the default reinforcement-learning algorithm, and the one behind the RL stage of RLHF (reinforcement learning from human feedback) that produced the first instruction-following language models.

Seeing why a clip is enough takes four ideas: what a policy gradient is and why the advantage is the right thing to weight it by, why reusing a batch is dangerous, what a trust region is, and finally how the clipped objective recreates the trust region with a clip and a minimum.

Learning from your own actions

The object being trained is a policy, written $\pi_\theta$ , a function with parameters $\theta$ that reads the current situation (the state $s$ ) and outputs a probability distribution over what to do (the action $a$ ). It is stochastic on purpose: the agent samples its action, which both explores and makes the math differentiable. The world responds with a reward and a new state, and the sum of (discounted) future reward from a state is its return. The goal is to adjust $\theta$ so the policy collects more return.

How do you nudge $\theta$ using only sampled experience? The policy gradient, older than deep RL, falls out of one identity, the score-function trick:

$$\nabla_\theta\, \pi = \pi\,\nabla_\theta \log \pi$$

Plug it into the derivative of expected return and the gradient turns into something you can estimate from samples:

$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \,\right] \tag{1}$$

Equation (1) reads as an instruction. For each action you actually took, take the gradient of its log-probability, the direction in parameter space that would make that action more likely, and scale it by a number $\hat{A}_t$ that says how good the action turned out to be. Sum over the batch. The hat on $\hat{\mathbb{E}}_t$ is a reminder that this is an empirical average over a finite batch of samples, not an exact expectation. This is REINFORCE (Williams, 1992), and nothing about (1) is new to PPO. Note the sign: $\hat{g}$ already points uphill. You add it to $\theta$ (gradient ascent), because the aim is to maximize return, not minimize a loss.
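
For the simplest case, a single action with return $R(a)$ and no state, the step from the score-function identity to an estimable gradient can be written out in full. This is a sketch for the one-step setting only; $R(a)$ is introduced here for illustration, and the multi-step version follows the same pattern with trajectories in place of actions.

```latex
\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\big]
  = \sum_a \nabla_\theta \pi_\theta(a)\, R(a)
  = \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, R(a)
  = \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a)\, R(a)\big]
```

The last expression is an expectation under the policy itself, so averaging $\nabla_\theta \log \pi_\theta(a)\, R(a)$ over sampled actions estimates it. Equation (1) has the same form, with $\hat{A}_t$ in the role of $R$.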

Everything now rides on that weight $\hat{A}_t$ . The naive choice is the raw return: reward good outcomes, punish bad ones. It works, but it is needlessly noisy. If every action in a good state earns a return of around 100, the gradient mostly encodes "this state was good" rather than "this action was good," even for an average action. What you actually want to know is whether an action did better or worse than the state's own baseline. That is the advantage:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

Here $V^\pi(s)$ is the value of the state, the return you expect from $s$ if you just follow the policy, and $Q^\pi(s,a)$ is the value of taking action $a$ first. Their difference is "how much better than usual was this action." Subtracting a baseline that depends only on the state, not the action, leaves the gradient's expectation unchanged: the baseline is the same number no matter which action was taken, and the gradient weights it by $\nabla_\theta \log \pi$ , whose average over the policy's own action distribution is exactly zero, so the baseline terms sum to zero,

$$\sum_a b(s)\,\nabla_\theta \pi(a \mid s) = b(s)\,\nabla_\theta \sum_a \pi(a \mid s) = b(s)\,\nabla_\theta 1 = 0$$

and the subtraction leaves the gradient's average direction unchanged while cutting its variance. Recentering around zero only sharpens the signal: the gradient stops spending itself on the state and concentrates on the choice.
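
The zero-mean argument can be checked numerically. The sketch below assumes a hypothetical three-action softmax policy at a single state, with made-up returns near 100, and computes the exact expected gradient with respect to the logits three ways: weighted by raw return, by advantage, and by the baseline alone. None of the names come from the paper.

```python
import math

def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def expected_gradient(probs, weights):
    """Exact E_{a~pi}[ grad log pi(a) * weights[a] ], summed over all actions."""
    grad = [0.0] * len(probs)
    for a, p in enumerate(probs):
        g = grad_log_pi(probs, a)
        for i in range(len(probs)):
            grad[i] += p * g[i] * weights[a]
    return grad

probs = softmax([0.2, -0.5, 1.0])        # hypothetical 3-action policy
returns = [101.0, 99.0, 100.5]           # hypothetical returns, all near 100
baseline = sum(p * r for p, r in zip(probs, returns))   # V(s) under this policy

g_raw = expected_gradient(probs, returns)
g_adv = expected_gradient(probs, [r - baseline for r in returns])
g_base = expected_gradient(probs, [baseline] * len(probs))   # baseline-only term
```

Here `g_base` is zero up to rounding and `g_raw` matches `g_adv`, while the per-action weights shrink from about 100 to order 1, which is where the variance reduction comes from when the expectation is replaced by a finite sample.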

Below, the amber curve is the (unknown) true return over a one-dimensional action. The teal bell is the current policy. Each step it samples a handful of actions, scores each by its advantage (return minus the dashed baseline), then pushes the likely-to-help actions up and the likely-to-hurt actions down. The bell climbs:

Figure 1 · the policy gradient

The return R(a) over actions is fixed but unknown. The policy samples actions and weights each by its advantage: actions above the baseline V get an up arrow (make more likely), below it a down arrow. Averaged, that is exactly ∇logπ·A, and the policy climbs toward higher return.

In practice $\hat{A}_t$ is itself estimated, by generalized advantage estimation (GAE), from a learned value function $V$ . GAE sums one-step surprises with a decay $\lambda$ :

$$\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^{l}\, \delta_{t+l} \tag{11--12}$$

The term $\delta_t$ is a one-step advantage estimate (reward plus discounted next-value, minus current value), and $\gamma$ is the discount, $\lambda$ a knob trading variance for bias. The clean "baseline subtraction is unbiased" story is about the ideal advantage; GAE is a deliberately slightly biased estimate of it, a distinction that matters again later. The rest of PPO is about what you do with the gradient, not how you estimate the advantage.
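
The sum over future residuals is usually computed with a single backward pass, using $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch, assuming one trajectory segment truncated after $T$ steps with a bootstrap value at the end and no episode-termination handling; the function name and argument layout are illustrative, not the paper's:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory segment.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), one more entry than rewards (bootstrap value last)
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step surprise: reward plus discounted next value, minus current value.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + (gamma * lam) * A_{t+1}.
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam = 0 keeps only the one-step residual; lam = 1 sums all discounted residuals.
one_step = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=0.0)
full_sum = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

With a zero value function and $\gamma = 0.5$, `one_step` is `[1.0, 1.0, 1.0]` and `full_sum` is `[1.75, 1.5, 1.0]`, the discounted reward-to-go.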

The waste: one update per rollout

The inefficiency, stated plainly: Equation (1) is an average over data drawn from the current policy $\pi_\theta$ . The instant you take a gradient step, $\theta$ changes, so the policy changes, so the batch you collected was drawn from a policy that no longer exists. Use it for a second step and you are averaging the gradient of the new policy against actions sampled from the old one. The estimate is biased, and the more steps you take, the more wrong it gets: each step moves the policy a little further from the one that collected the batch, the recorded actions become less and less representative of what the current policy would actually do, and the importance ratios that correct for the mismatch (defined below) only get noisier as the gap widens.

There is a standard fix for "I have samples from the wrong distribution": importance sampling. Reweight each sample by how much more or less likely the new policy is to have taken that action. The weight is a ratio, and this ratio is central to the rest of PPO:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad r_t(\theta_\text{old}) = 1$$

At the start of an update the new policy equals the old one, so every $r_t = 1$ . As you optimize, an action the new policy favors more gets $r_t > 1$ , one it favors less gets $r_t < 1$ . Weighting the advantage by this ratio gives a surrogate objective you can keep optimizing on the fixed batch:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\, \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t \,\right] = \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\, \hat{A}_t \,\right] \tag{6}$$

The superscript $CPI$ stands for conservative policy iteration (Kakade and Langford, 2002), where this objective was proposed. It is a good objective for a small step. Near $\theta_\text{old}$ it agrees with the true return to first order, which is the formal version of "if you barely move, the reweighted batch is a sound guide." The hard part is keeping the step that small.
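
In code the ratio is normally formed from stored log-probabilities rather than from the probabilities themselves. A small sketch on a hypothetical batch of three sampled actions; the numbers are made up for illustration:

```python
import math

def ratio(logp_new, logp_old):
    """Importance ratio r = pi_new(a|s) / pi_old(a|s), computed from log-probs."""
    return math.exp(logp_new - logp_old)

def surrogate_cpi(logp_new, logp_old, adv):
    """Unclipped surrogate: the batch mean of r_t * A_t."""
    terms = [ratio(n, o) * a for n, o, a in zip(logp_new, logp_old, adv)]
    return sum(terms) / len(terms)

# Hypothetical batch: log-probs recorded under the old policy, plus advantages.
logp_old = [math.log(0.5), math.log(0.2), math.log(0.3)]
adv = [1.0, -2.0, 0.5]

# Before any update every ratio is 1, so the surrogate is just the mean advantage.
at_start = surrogate_cpi(logp_old, logp_old, adv)

# After a hypothetical update: the good action is likelier, the bad one less likely.
logp_new = [math.log(0.6), math.log(0.1), math.log(0.3)]
after_step = surrogate_cpi(logp_new, logp_old, adv)
```

The ratios after the step are 1.2, 0.5 and 1.0, so the surrogate rises from the mean advantage (about -0.17) to about 0.23 without any new data being collected.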

Maximize $L^{CPI}$ without restraint and there is nothing to stop the optimizer. For an action with positive advantage, the objective $r_t \hat{A}_t$ grows without bound as $r_t$ grows, so gradient ascent drives $r_t$ as high as the data allows, making a handful of actions far more likely, well outside the range where the old batch says anything reliable. The policy lurches, the next batch is collected by a worse policy, and the process can spiral. This is exactly the destructive update from before, now with a name: an unbounded importance ratio. We need a way to let the policy move a little and refuse to let it move a lot.

TRPO: move, but not too far

TRPO's answer is to make "don't move too far" a literal constraint. Measure the distance between the old policy and the new one with the KL divergence, a standard way to score how different two probability distributions are (zero when identical, growing as they diverge), and forbid any update that moves more than a small budget $\delta$ :

$$\max_\theta\ \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\, \hat{A}_t \,\right] \quad\text{subject to}\quad \hat{\mathbb{E}}_t\!\left[\, \mathrm{KL}\big[\pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big] \,\right] \le \delta \tag{3--4}$$

The set of new policies that satisfy this KL budget is the policy's trust region, the neighborhood of $\theta_\text{old}$ within which the surrogate stays a faithful stand-in for the return. This is not arbitrary caution. There is a theorem underneath: a surrogate like (6), minus a penalty proportional to the KL, is a genuine lower bound on the policy's true return. Improve the bounded quantity and you are guaranteed to improve the real thing, which gives monotonic improvement, a sequence of policies that never gets worse. The guarantee, though, needs the maximum KL over all states, while the constraint above (and every implementation) uses the mean KL, a deliberate relaxation that trades the proof for tractability. The KL is also not symmetric. Written $\mathrm{KL}[\pi_\text{old}, \pi_\text{new}]$, old policy first, it averages over actions drawn from the old policy and asks how surprised the new policy would be by them, so it penalizes the new policy most for withdrawing probability from actions the old one took often. The guarantee is real in theory and approximate in practice: TRPO keeps the structure of the bound and relaxes the part that made it provable.
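
For a discrete action space the KL in the constraint is a short sum. The sketch below uses two hypothetical action distributions at a single state to show that the two argument orders give different numbers, and that a budget of $\delta = 0.01$ (an illustrative value) is easy to exceed. It assumes the second distribution is nonzero wherever the first is.

```python
import math

def kl(p, q):
    """KL[p, q] = sum_a p(a) * log(p(a) / q(a)) for two categorical distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

pi_old = [0.70, 0.20, 0.10]     # hypothetical old policy at one state
pi_new = [0.40, 0.20, 0.40]     # hypothetical new policy at the same state

kl_old_new = kl(pi_old, pi_new)      # the direction TRPO constrains
kl_new_old = kl(pi_new, pi_old)      # the reverse direction, a different number
within_budget = kl_old_new <= 0.01   # delta = 0.01, an illustrative budget
```

For these two distributions the values are roughly 0.25 and 0.33, both far outside the budget. In a real constraint the quantity is averaged over the states in the batch.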

Below, the amber bell is the old policy, fixed. Drag the new policy's mean and watch the KL grow. Inside the shaded band the step is small enough to trust; push past $\delta$ and you are in the region where the surrogate stops being a reliable proxy for the return:

Figure 2 · the trust region

The old policy is fixed; shifting the new policy grows the KL divergence between them. The shaded band is the trust region, the set of new policies within KL ≤ δ. TRPO permits any step inside it and rejects the rest. (δ here is enlarged for legibility; real runs keep KL near 0.01.)

It works, and it is a pain. The constraint turns each update into a constrained optimization, solved with conjugate gradients, a linear approximation to the objective, and a quadratic approximation to the KL built from the Fisher information matrix. That is second-order machinery. It does not play nicely with the things modern networks do all the time: parameter sharing between the policy and the value function, dropout, any architecture with noise. PPO aims to get TRPO's "stay close" behavior with nothing but first-order gradient steps.

PPO's move: clip the ratio

The same problem from $L^{CPI}$ returns here. The objective $r_t \hat{A}_t$ increases as $r_t$ is pushed further in the helpful direction, with no diminishing returns. PPO caps the reward. Past a small band around 1, it stops counting any further change in the ratio.

Concretely, clip the ratio to $[1-\epsilon,\, 1+\epsilon]$ (with $\epsilon$ a small number like 0.2), and take the minimum of the clipped and unclipped objectives:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\, \min\!\Big( r_t(\theta)\, \hat{A}_t,\ \ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \,\right] \tag{7}$$

Splitting the objective by the sign of the advantage shows the mechanism, because the two cases behave differently.

Good action ( $\hat{A}_t > 0$ ). You want to make it more likely, so you push $r_t$ up. The objective rises along $r_t \hat{A}_t$ until $r_t$ hits $1+\epsilon$ , then goes flat: the clip caps the reward at $(1+\epsilon)\hat{A}_t$ , so beyond that point there is nothing to gain and the gradient for this sample is zero, so no further update is applied to it. The lower clip $1-\epsilon$ never enters; for a good action it is inert.

Bad action ( $\hat{A}_t < 0$ ). Now you want it less likely, so you push $r_t$ down. By symmetry the binding clip is the lower one: the objective is flat at $(1-\epsilon)\hat{A}_t$ for $r_t < 1-\epsilon$ , and only the upper region is sloped. So the clip that actually bites is $1+\epsilon$ for good actions and $1-\epsilon$ for bad ones, with the other left idle.
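
The two cases can be checked on a single sample. This sketch evaluates the per-sample term of the clipped objective and its derivative with respect to the ratio for the four corner situations; the helper names are illustrative.

```python
def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def l_clip(r, adv, eps=0.2):
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)

def dl_clip_dr(r, adv, eps=0.2):
    """Derivative of l_clip w.r.t. r: A where the unclipped term is the min, else 0."""
    unclipped = r * adv
    clipped = clip(r, 1.0 - eps, 1.0 + eps) * adv
    if unclipped <= clipped:      # min selects r * A (inside the band they tie)
        return adv
    return 0.0                    # min selects the flat, clipped branch

good_too_high = (l_clip(1.5, 1.0), dl_clip_dr(1.5, 1.0))     # capped at (1 + eps) * A
good_too_low = (l_clip(0.5, 1.0), dl_clip_dr(0.5, 1.0))      # still sloped
bad_too_low = (l_clip(0.5, -1.0), dl_clip_dr(0.5, -1.0))     # flat at (1 - eps) * A
bad_too_high = (l_clip(1.5, -1.0), dl_clip_dr(1.5, -1.0))    # still sloped
```

The gradient vanishes only in the two cases where the ratio has already moved past the band in the direction the advantage rewards: $r_t > 1+\epsilon$ for a good action and $r_t < 1-\epsilon$ for a bad one.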

Drag the ratio and flip the sign of the advantage. The thick teal curve is $L^{CLIP}$ ; the thin amber line is the unclipped $r_t \hat{A}_t$ it would have followed. The red dot marks $r_t = 1$ , where every update starts:

Figure 3 · the clipped objective

One timestep's objective versus the ratio r. The clipped objective follows the unclipped r·A until r leaves [1−ε, 1+ε], then flattens, so the gradient dies. The binding clip is the upper 1+ε when A>0 and the lower 1−ε when A<0. Drag r and toggle the sign of the advantage.

This is the trust region of TRPO, achieved with arithmetic instead of a constrained solve. Keeping each update near the old policy is what proximal means, the P in PPO. There is no Fisher matrix, no conjugate gradient, no line search. There is a $\min$ , a $\mathrm{clip}$ , and a hyperparameter $\epsilon$ . And because the clip only flattens the objective, it composes with anything: shared value heads, dropout, recurrent policies, all fine.

Here is the precise sense in which this stands in for a trust region. TRPO enforces a hard limit: it measures the KL distance from the old policy to the new one and refuses any step that exceeds a budget $\delta$, so each update stays inside the region where the local surrogate still tracks the true return. PPO's clip is the soft, cheap version of that idea. Once a sample's ratio $r_t$ leaves $[1-\epsilon,\, 1+\epsilon]$ in the direction its advantage rewards, the objective goes flat and its gradient goes to zero, so that sample no longer contributes to the update. Nothing computes a KL or solves a constraint, yet the new policy is held near the old one because the only samples still contributing gradient are the ones that have not yet moved far (or that have moved the wrong way and are being pulled back). This approximates the trust region; it does not reproduce it: the clip caps each sample one at a time and never bounds the KL the way TRPO's constraint does, so it is a stand-in for TRPO's stay-close behavior, not for its monotonic-improvement proof.

Why the minimum makes it a floor

The objective in (7) is a $\min$ , not a clip alone, and that is what keeps it safe. Why take the smaller of the clipped and unclipped values, rather than always using the clipped one?

The paper's own phrasing is the clearest: you ignore the change in the ratio only when it would make the objective improve, and you keep the unclipped ratio when it makes the objective worse. Consider the good-action case first. For $r_t > 1+\epsilon$ , the unclipped $r_t \hat{A}_t$ is larger than the clipped value, so the $\min$ discards it: you decline the extra reward, the gradient for this sample goes to zero, and the policy is not pushed further. Good. But suppose an optimization step has overshot the wrong way and made a good action less likely, dropping $r_t$ below $1-\epsilon$ . There the unclipped value is smaller, so the $\min$ keeps it, the gradient remains nonzero and moves $r_t$ back toward 1. The $\min$ drops the term that would reward over-improving while retaining the gradient that corrects an over-correction.

The compact way to say it: $L^{CLIP}$ is a pessimistic lower bound on the unclipped surrogate $L^{CPI}$ . It never reports a rosier number than $L^{CPI}$ would, so optimizing it will not produce a large step from a runaway ratio. This is a lower bound on the surrogate, not on the true return. The lower-bound-on-return story is TRPO's, through its max-KL penalty.
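
The bound holds sample by sample, simply because a minimum of two numbers never exceeds either one. A quick check over randomly drawn ratios and advantages (synthetic values, fixed seed) makes that concrete, along with the fact that the two objectives coincide inside the band:

```python
import random

def l_cpi(r, adv):
    """Per-sample unclipped surrogate term."""
    return r * adv

def l_clip(r, adv, eps=0.2):
    """Per-sample clipped surrogate term."""
    clipped_r = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * adv, clipped_r * adv)

rng = random.Random(0)
# Synthetic (ratio, advantage) pairs covering both signs and far-off ratios.
samples = [(rng.uniform(0.0, 3.0), rng.uniform(-1.0, 1.0)) for _ in range(10_000)]

never_rosier = all(l_clip(r, a) <= l_cpi(r, a) for r, a in samples)
equal_inside_band = all(
    l_clip(r, a) == l_cpi(r, a) for r, a in samples if 0.8 < r < 1.2
)
```

Both flags come out true. Because the inequality holds per sample, it also holds for the batch mean.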

You can watch the bound do its job. As the policy moves further from $\theta_\text{old}$ (optimizing the one batch harder), the unclipped $L^{CPI}$ keeps climbing, which is what drove the optimizer too far before. The clipped $L^{CLIP}$ rises, peaks while the KL is still small, then falls: a built-in penalty for moving too far. The maximum sits right where you want the update to stop:

Figure 4 · a lower bound with a penalty

Optimizing one batch harder moves the policy rightward. The unclipped surrogate L^CPI rises without bound; L^CLIP rises, peaks near a small KL, then declines, so its maximum is a natural stopping point. KL (violet) climbs throughout. A close reconstruction of the paper's Figure 2, computed on a synthetic minibatch.

With ordinary gradient ascent on the unclipped surrogate the policy would keep moving far past the old policy; the clip turns the objective into one whose maximum sits near the old policy, so plain SGD, run for several epochs, lands in roughly the right place on its own.

The other knob: a KL penalty

Clipping is not the only way to discourage large steps, and the paper is careful to test the obvious alternative: put the KL back in the objective as a penalty, the way the TRPO theory suggests, and adapt its strength so each update lands near a target KL $d_\text{targ}$ :

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\big[\pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big] \,\right] \tag{8}$$

After each update you measure the KL you actually got. If it came in well under target, the penalty was too strong, so you halve $\beta$ . If it overshot, you double it. The coefficient converges toward the right value on its own, which sidesteps TRPO's complaint that no single fixed $\beta$ works across problems or even across one run. It is a clean idea and it works, though worse than clipping in the paper's experiments: its best setting reaches 0.74, against 0.82 for the clip, so the penalty version is kept as a baseline and the clip is the recommendation. The rest of the algorithm adds only routine machinery.
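
The adaptation rule fits in a few lines. In this sketch "well under" and "overshot" are taken to mean a factor of 1.5 either side of the target, with a halving or doubling in response, which are the values the paper reports using and describes as heuristic. The function name is illustrative.

```python
def update_beta(beta, kl, d_targ, tol=1.5, factor=2.0):
    """Adapt the KL-penalty coefficient after one policy update.

    kl is the mean KL measured between the old and the updated policy.
    tol and factor are heuristic constants, not tuned values.
    """
    if kl < d_targ / tol:      # well under target: the penalty was too strong
        return beta / factor
    if kl > d_targ * tol:      # overshot the target: the penalty was too weak
        return beta * factor
    return beta                # close enough: leave the coefficient alone

# Hypothetical measurements against a target of 0.01.
beta_after_small_step = update_beta(1.0, kl=0.001, d_targ=0.01)
beta_after_big_step = update_beta(1.0, kl=0.05, d_targ=0.01)
beta_on_target = update_beta(1.0, kl=0.01, d_targ=0.01)
```

The updated coefficient is used for the next policy update, so an occasional overshoot is corrected one iteration later rather than prevented.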

The full loop

A real PPO objective bundles two more terms onto the clip. If the policy and value function share a network, you need a value-prediction loss to train the shared part; and a small entropy bonus keeps the policy from collapsing to determinism too early, which preserves exploration:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[\, L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S\big[\pi_\theta\big](s_t) \,\right] \tag{9}$$

The three terms do not all carry the same sign. The value loss is a squared error you want small,

$$L_t^{VF} = \big(V_\theta(s_t) - V_t^{\text{targ}}\big)^2$$

where $V_t^{\text{targ}}$ is the return actually observed from $s_t$ , equivalently $\hat{A}_t + V(s_t)$ . The value net is trained to predict the very returns the advantage was built from, and because $L_t^{VF}$ is a loss, it is subtracted. The entropy $S$ is a bonus you want large, so it is added. Both signs are correct because the expression is maximized; code usually minimizes the negative of this, which flips all three signs back, and the paper calling $L$ a "loss" is framework habit. For the continuous-control runs the policy and value nets are separate, so $c_1$ does not even apply and there is no entropy bonus; the extra terms matter mainly on the shared-network Atari agent, where $c_1 = 1$ and $c_2 = 0.01$ .
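
The sign bookkeeping is easiest to see for a single timestep. The sketch below assumes a discrete policy, so the entropy is a plain sum, and returns the negative of the combined objective, which is what a minimizing optimizer would be handed. The function names and the example inputs are illustrative.

```python
import math

def entropy(probs):
    """Entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_loss(l_clip, value_pred, value_target, probs, c1=1.0, c2=0.01):
    """Negative of the combined objective for one timestep.

    l_clip is the already-computed clipped surrogate term for this sample.
    """
    l_vf = (value_pred - value_target) ** 2                   # squared error, want small
    objective = l_clip - c1 * l_vf + c2 * entropy(probs)      # quantity to maximize
    return -objective                                         # quantity to minimize

# Hypothetical sample: surrogate 0.5, value off by 0.5, a coin-flip policy.
loss = ppo_loss(0.5, value_pred=1.0, value_target=1.5, probs=[0.5, 0.5])
```

After the sign flip, the surrogate and the entropy enter the loss negatively and the value error positively, matching the "flips all three signs" remark above.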

Now the loop, which is short enough to hold in your head. Each iteration, $N$ actors run the current policy in parallel for $T$ timesteps, tagging every step with an advantage estimate. That fills a buffer of $N\,T$ samples. Then you optimize the clipped objective on that buffer for $K$ epochs of minibatch ascent, refresh $\theta_\text{old} \leftarrow \theta$ , and collect again. $K$ is what enables the data reuse that vanilla policy gradients could not safely do.
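
The reuse is just index bookkeeping: reshuffle the buffer each epoch and walk it once in minibatches. A sketch with sizes in the style of the paper's MuJoCo setup; treat the specific numbers as illustrative, and note that $N = 1$ is an assumption made here.

```python
import random

def minibatch_indices(n_samples, minibatch_size, epochs, seed=0):
    """Yield lists of buffer indices: each epoch reshuffles and covers the buffer once."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, n_samples, minibatch_size):
            yield order[start:start + minibatch_size]

N, T, M, K = 1, 2048, 64, 10      # actors, horizon, minibatch size, epochs
batches = list(minibatch_indices(N * T, M, K))
steps_per_rollout = len(batches)  # gradient steps taken before collecting new data
```

With these sizes one rollout of 2048 samples supports 320 gradient steps (10 epochs of 32 minibatches), where a single-update policy gradient would take one.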

Figure 5 · one PPO iteration

N actors collect T timesteps each into a buffer of NT samples; the same buffer is reused for K epochs of clipped ascent before θ_old is refreshed and fresh data is collected. Drag K. At K=1 this is ordinary policy gradient; PPO's win is the reuse (MuJoCo uses K=10).

And the same thing in code, where the ascent-as-negative-loss sign convention shows up directly:

# one PPO iteration (actor-critic style, PyTorch-flavored pseudocode)
data = []                              # collect with the CURRENT policy
for actor in range(N):                 # N actors run in parallel
    traj = run(pi_old, T)              # T timesteps each
    adv  = gae(traj, V, gamma, lam)    # advantage per step (Eqs 11-12)
    ret  = adv + V(traj.s)             # value targets: advantage + V(s)
    data += traj_with(adv, ret, logp_old)   # also store log-probs under pi_old

for epoch in range(K):                 # reuse the SAME data K times
    for mb in minibatches(data, M):
        r      = exp(logp(pi, mb) - mb.logp_old)    # ratio r_t
        unclip = r * mb.adv
        clip   = clamp(r, 1 - eps, 1 + eps) * mb.adv
        # take the smaller: once the ratio drifts past the clip range,
        # the objective stops improving and the gradient dies
        L_clip = mean(minimum(unclip, clip))        # Eq 7, elementwise min
        L_vf   = mean((V(mb.s) - mb.ret) ** 2)      # value loss
        loss   = -(L_clip - c1 * L_vf + c2 * ent(pi, mb))  # ascend: minimize -L
        opt.zero_grad()                # clear gradients left from the last step
        loss.backward(); opt.step()

pi_old = copy(pi)                      # refresh with a snapshot, then collect again

That is the algorithm, start to finish. Collect a batch, store each action's probability under the old policy, then take $K$ epochs of clipped minibatch steps before the data goes stale. A few lines on top of REINFORCE, and the sample efficiency of a trust-region method. You can watch why the repeated epochs stay safe by scrubbing through them:

Figure 6 · K epochs on one stale batch

One fixed minibatch, optimized by plain ascent for 10 epochs under each objective. The unclipped L^CPI keeps pushing every ratio harder, and the per-sample ratios (strip below, one dot each) drift far outside the band. L^CLIP flattens after a few epochs: each sample's gradient dies once its ratio crosses 1±ε, so the ratios pile up at the band edges and extra epochs change almost nothing. A toy simulation of the mechanism (real formulas, synthetic advantages, ε=0.2, K=10 as in MuJoCo).
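
The whole loop can be run end to end on a toy problem. The sketch below is a deliberately stripped-down stand-in, not the paper's setup: a stateless three-armed bandit with fixed rewards, a softmax policy over logits, the batch-mean reward as the baseline in place of a learned value function, full-batch steps instead of minibatches, and no value loss or entropy bonus. The gradient of the clipped objective is written out by hand so that nothing beyond the standard library is needed.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def clipped_grad(theta, batch, eps):
    """Gradient of the mean clipped objective w.r.t. the logits theta."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for action, p_old, adv in batch:
        r = pi[action] / p_old
        # The clip is active (zero gradient) only when the ratio has already
        # moved past the band in the direction the advantage rewards.
        if (adv > 0 and r > 1 + eps) or (adv < 0 and r < 1 - eps):
            continue
        for i in range(len(theta)):
            dlogp = (1.0 if i == action else 0.0) - pi[i]   # d log pi(a) / d theta_i
            grad[i] += adv * r * dlogp / len(batch)
    return grad

def train(rewards, iterations=20, batch_size=64, K=10, eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)                  # start from a uniform policy
    for _ in range(iterations):
        pi_old = softmax(theta)                   # freeze the sampling policy
        actions = rng.choices(range(len(rewards)), weights=pi_old, k=batch_size)
        baseline = sum(rewards[a] for a in actions) / batch_size
        batch = [(a, pi_old[a], rewards[a] - baseline) for a in actions]
        for _ in range(K):                        # reuse the same batch K times
            g = clipped_grad(theta, batch, eps)
            theta = [t + lr * gi for t, gi in zip(theta, g)]   # gradient ascent
    return softmax(theta)

final_pi = train(rewards=[0.0, 0.5, 1.0])         # arm 2 pays the most
```

After training, `final_pi` puts most of its probability on the best arm. Setting `eps` very large removes the clip and turns the inner loop into repeated ascent on the unclipped surrogate, which is a simple way to compare the two behaviors on the same batch.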

Why the clip won out

The first experiment is an ablation that earns the design. On seven MuJoCo continuous-control tasks, the paper sweeps the surrogate variants and scores each (0 = a random policy, 1 = the best result), averaged over all runs. Drag through them:

Figure 7 · which surrogate wins

Average normalized score on 7 MuJoCo tasks (Table 1). Clipping at ε=0.2 wins (0.82). Dropping clipping and penalty entirely is catastrophic (−0.39): with nothing holding the ratio back, one task runs off so badly the average drops below the random policy. The KL-penalty variants work but trail. Click a bar.

Clipping at $\epsilon = 0.2$ wins cleanly across the sweep, which is why the paper reuses it for every later experiment; that $\epsilon = 0.2$ lives in this sweep in the body of the paper, not in the hyperparameter tables, and the often-quoted $c_1 = 0.5$ value comes from later libraries, not from here. The $-0.39$ for "no clipping or penalty" is the runaway ratio made visible: with nothing restraining the step, one environment goes so far wrong it drags the whole average below a policy that does nothing.

Beyond the ablation, PPO with $\epsilon = 0.2$ beats or matches the contenders of its day (TRPO, A2C, the cross-entropy method, vanilla policy gradients) on almost all of the continuous-control benchmarks, and it scales up to 3D humanoids that learn to run, steer toward moving targets, and get up after being knocked down. On Atari the picture is more nuanced. Across 49 games PPO won 30 on the fast-learning metric (average reward over all of training), ahead of A2C and ACER. On final performance, the average over the last 100 episodes, ACER edged it, 28 games to 19, with one tie. PPO learns quickly and reliably, and it is not uniformly the strongest final performer.

None of that fully explains why PPO took over. The deeper reason is the one everything here has been building toward: it is simple. It is a few lines on top of a policy gradient, it has essentially one hyperparameter that matters, it tolerates the architectures people actually use, and it almost never blows up. When the RLHF pipeline needed a reinforcement-learning algorithm to fine-tune language models against a reward model, PPO was the obvious choice, and it is the algorithm in InstructGPT (with a per-token KL penalty bolted on, the same "stay close" instinct in a new place).

To retrace the path: a policy gradient nudges each action by its advantage, and reusing the batch requires an importance ratio whose unrestrained growth is what destroys the policy. A trust region restrains it but is heavy, whereas clipping the ratio recreates the trust region with a $\min$ and a $\mathrm{clip}$, cheap enough to run for ten epochs a batch. The hard-won stability of trust-region methods came down to the price of one line of arithmetic, which is why PPO is everywhere.

Provenance: verified against primary literature

REINFORCE / PG theorem: The score-function policy gradient ∇logπ·A (Williams 1992; Sutton et al. 2000).

CPI (Kakade & Langford 2002): The ratio×advantage surrogate L^CPI that PPO clips; the source of the CPI superscript.

TRPO (Schulman et al. 2015): The KL trust region and monotonic-improvement bound PPO emulates first-order.

GAE (Schulman et al. 2015): The advantage estimator Â that the surrogate is weighted by.

Correction: L^CLIP is a lower bound on the surrogate L^CPI, not on the true return. TRPO's monotonic-improvement guarantee (a bound on the return) is a separate result that needs the max KL over states; practical TRPO relaxes that to the mean KL, and PPO keeps the intuition, not the proof.

Questions you might still have

Why take many small clipped steps instead of one big gradient step?
A big step on a stale batch overshoots: once the policy has moved, the collected data no longer reflects it, and the surrogate stops being a good guide. Many small clipped steps keep the policy near where the data was collected, so each step stays trustworthy, and you still extract far more from the batch than one step would.

Does clipping actually keep the KL divergence small?
Not as a hard guarantee. Clipping is a soft, per-sample heuristic: it removes the incentive to push a single sample’s ratio past 1±ε, but it does not bound the KL the way TRPO’s constraint does. The average KL usually stays small, yet a few samples can still move, which is why many implementations add an explicit KL early-stop on top.

What does PPO actually inherit from TRPO?
The intuition, not the theorem. TRPO’s monotonic-improvement guarantee needs the maximum KL over states and a second-order solve. PPO keeps only the idea (stay near the old policy so the surrogate stays trustworthy) and gets it first-order with a clip, carrying no such proof.

Is the clipped objective a lower bound on the true return?
No. It is a lower bound on the unclipped surrogate L^CPI, not on the policy’s true performance. The bound-on-return story belongs to TRPO, through a different, max-KL penalty.
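
The KL early-stop mentioned in the answers above can be sketched in a few lines. The estimator is the batch mean of the log-probability difference on actions sampled under the old policy; the target of 0.01 and the slack factor of 1.5 are assumed values for illustration, not settings from the paper.

```python
def approx_kl(logp_old, logp_new):
    """Sample estimate of KL[pi_old, pi_new] from actions drawn under pi_old."""
    diffs = [o - n for o, n in zip(logp_old, logp_new)]
    return sum(diffs) / len(diffs)

def should_stop(logp_old, logp_new, target_kl=0.01, slack=1.5):
    """Skip the remaining epochs once the estimated KL exceeds slack * target_kl."""
    return approx_kl(logp_old, logp_new) > slack * target_kl

# Hypothetical log-probs for three sampled actions, before and after some epochs.
logp_old = [-1.0, -2.0, -0.5]
logp_after_small_update = [-1.005, -2.0, -0.5]
logp_after_large_update = [-1.1, -2.1, -0.6]

stop_small = should_stop(logp_old, logp_after_small_update)
stop_large = should_stop(logp_old, logp_after_large_update)
```

A check like this would run once per epoch or per minibatch. On a finite batch the estimate is noisy and can even come out slightly negative, so it is a guard rail rather than a measurement.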

Footnotes & further reading

The paper: Schulman, Wolski, Dhariwal, Radford, Klimov, Proximal Policy Optimization Algorithms (OpenAI, 2017). A clear secondary reference is OpenAI's Spinning Up page.
The trust region it descends from: Schulman, Levine, Abbeel, Jordan, Moritz, Trust Region Policy Optimization (2015), with the monotonic-improvement bound PPO emulates first-order.
The advantage estimator: Schulman, Moritz, Levine, Jordan, Abbeel, High-Dimensional Continuous Control Using Generalized Advantage Estimation (GAE, 2015). GAE weights the one-step residuals $\delta_{t+l}$ by $(\gamma\lambda)^l$; at $\lambda = 1$ this collapses to the discounted return, with weights $\gamma^l$ on rewards, minus $V(s_t)$.
The surrogate's origin: Kakade and Langford, Approximately Optimal Approximate Reinforcement Learning (ICML 2002), the conservative policy iteration that names $L^{CPI}$ .
The policy gradient itself: Williams, Simple Statistical Gradient-Following Algorithms (REINFORCE, 1992), and Sutton, McAllester, Singh, Mansour, Policy Gradient Methods for RL with Function Approximation (2000).
PPO in the wild: Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT, 2022), whose RL stage is PPO with a per-token KL penalty.
