Proximal Policy Optimization Algorithms
Clip one ratio, and the policy can't run away.
Standard policy gradients use a batch of experience once and throw it away, because reusing it makes the policy overshoot into something worse. PPO clips the update so you can squeeze many epochs out of each batch and still stay near the policy you started from.
What if you could squeeze ten gradient steps out of one batch of experience, instead of one, without the policy moving somewhere much worse?
Reinforcement learning is expensive in a way ordinary supervised learning is not. A classifier has a fixed dataset sitting on disk. An agent has to generate its own data by acting in the world, and useful data only comes from a halfway-decent policy, which you only get by training on data. Every gradient step depends on fresh experience that costs real interaction to collect. So the currency that matters most is sample efficiency: how much the policy improves per unit of experience spent.
The classic policy-gradient methods spend that currency carelessly. They collect a batch of experience, take exactly one gradient step, and discard the batch. Take a second step on the same data and the update is no longer justified. It can push the policy somewhere much worse, sometimes badly enough that it never recovers. You pay full price for the data and use it once.
PPO, out of OpenAI in 2017, is the fix that stuck. It sits at the end of a line of work about a single question: how far is a policy allowed to move in one update? Conservative policy iteration asked it, then TRPO answered it with a real but heavy machine. PPO keeps the answer and throws away the weight. A change of a few lines to a vanilla policy gradient lets you reuse each batch for ten epochs of updates, held in check by a clip that stops rewarding the policy once it moves too far. It became the default reinforcement-learning algorithm, and the one behind the RL stage of RLHF that produced the first instruction-following language models.
To see why a clip is enough, we build a short tower. What a policy gradient is, and why advantage is the right thing to weight it by. Why reusing a batch is dangerous. What a trust region is. And finally the clipped objective that recreates the trust region with a clip and a minimum.
Learning from your own actions
Start with the object being trained: a policy, written $\pi_\theta$, a function with parameters $\theta$ that reads the current situation (the state $s$) and outputs a probability distribution over what to do (the action $a$). It is stochastic on purpose: the agent samples its action, which both explores and makes the math differentiable. The world responds with a reward and a new state, and the sum of (discounted) future reward from a state is its return. The whole goal is to adjust $\theta$ so the policy collects more return.
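How do you nudge $\theta$ using only sampled experience? The answer is the policy gradient, and it is older than deep RL. It comes from one identity, the score-function trick:
$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$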
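Plug it into the derivative of expected return and the gradient turns into something you can estimate from samples:
$$\hat g = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat A_t\right] \tag{1}$$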
Read it as an instruction. For each action you actually took, take the gradient of its log-probability, the direction in parameter space that would make that action more likely, and scale it by a number that says how good the action turned out to be. Sum over the batch. The hat on $\hat{\mathbb{E}}_t$ is a reminder that this is an empirical average over a finite batch of samples, not an exact expectation. This is REINFORCE (Williams, 1992), and nothing about (1) is new to PPO. Note the sign: $\hat g$ already points uphill. You add it to $\theta$ (gradient ascent), because the aim is to maximize return, not minimize a loss.
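To make that instruction concrete, here is a minimal sketch on a toy problem: a stateless policy over three actions, parameterized by softmax logits. The helper names (`softmax`, `grad_log_prob`, `policy_gradient`) and the hand-picked weights are illustrative assumptions, not anything from the paper. For a softmax, the gradient of an action's log-probability with respect to the logits is the one-hot vector for that action minus the probability vector.

```python
import math

def softmax(theta):
    """Turn logits into action probabilities (shifted by the max for stability)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(theta, action):
    """Gradient of log pi(action) w.r.t. each logit: one-hot(action) - pi."""
    pi = softmax(theta)
    return [(1.0 if k == action else 0.0) - pi[k] for k in range(len(theta))]

def policy_gradient(theta, actions, weights):
    """Empirical average over the batch of grad log pi(a_t) * weight_t."""
    g = [0.0] * len(theta)
    for a, w in zip(actions, weights):
        for k, d in enumerate(grad_log_prob(theta, a)):
            g[k] += d * w / len(actions)
    return g

theta = [0.0, 0.0, 0.0]            # uniform policy to start (toy assumption)
actions = [0, 1, 2, 2]             # actions that were sampled
weights = [-1.0, 0.0, 1.0, 1.0]    # how good each one turned out (made-up numbers)

g = policy_gradient(theta, actions, weights)
step = 0.5                         # hypothetical step size
new_theta = [t + step * gk for t, gk in zip(theta, g)]  # ascent: ADD the gradient
```

After the step, action 2 (positive weight) has gained probability and action 0 (negative weight) has lost it, which is all the estimator is asking for.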
Everything now rides on that weight $\hat A_t$. The naive choice is the raw return: reward good outcomes, punish bad ones. It works, but it is needlessly noisy. If every action in a good state earns a return of around 100, the gradient mostly encodes "this state was good" rather than "this action was good," even for an average action. What you actually want to know is whether an action did better or worse than the state's own baseline. That is the advantage:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
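Here $V^{\pi}(s)$ is the value of the state, the return you expect from $s$ if you just follow the policy, and $Q^{\pi}(s, a)$ is the value of taking action $a$ first. Their difference is "how much better than usual was this action." Subtracting a baseline that depends only on the state, not the action, is free in expectation: the baseline is the same number no matter which action was taken, and the gradient weights it by $\nabla_\theta \log \pi_\theta(a \mid s)$, whose average over the policy's own action distribution is exactly zero, so the baseline terms sum to zero,
$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, V^{\pi}(s)\right] = V^{\pi}(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = V^{\pi}(s)\, \nabla_\theta 1 = 0,$$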
and the subtraction leaves the gradient's average direction unchanged while cutting its variance. Recentering around zero only sharpens the signal: the gradient stops spending itself on the state and concentrates on the choice.
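Both halves of that claim can be checked exactly on a small example. The sketch below assumes a three-action softmax policy and made-up returns near 100; the function names are hypothetical. It sums over all actions rather than sampling, so the numbers are exact expectations, not estimates.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def score(pi, action, k):
    """d/d(logit k) of log pi(action) for a softmax policy."""
    return (1.0 if k == action else 0.0) - pi[k]

def expected_baseline_term(theta, baseline):
    """Exact E_{a~pi}[score * baseline] for each logit; should be all zeros."""
    pi = softmax(theta)
    n = len(pi)
    return [sum(pi[a] * score(pi, a, k) * baseline for a in range(n))
            for k in range(n)]

def mean_and_variance(theta, weights, k):
    """Exact mean and variance of the one-sample estimate score_k(a) * weights[a]."""
    pi = softmax(theta)
    xs = [score(pi, a, k) * weights[a] for a in range(len(pi))]
    mean = sum(p * x for p, x in zip(pi, xs))
    var = sum(p * (x - mean) ** 2 for p, x in zip(pi, xs))
    return mean, var

theta = [0.0, 0.0, 0.0]
returns = [99.0, 100.0, 101.0]               # every action looks "good"
advantages = [r - 100.0 for r in returns]    # recentered on the state's value

raw_mean, raw_var = mean_and_variance(theta, returns, k=0)
adv_mean, adv_var = mean_and_variance(theta, advantages, k=0)
```

The two means agree, so the direction is untouched, while the variance drops by several orders of magnitude once the shared "this state was good" component is removed.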
Below, the amber curve is the (unknown) true return over a one-dimensional action. The teal bell is the current policy. Each step it samples a handful of actions, scores each by its advantage (return minus the dashed baseline), then pushes the likely-to-help actions up and the likely-to-hurt actions down. The bell climbs:
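In practice $\hat A_t$ is itself estimated, by generalized advantage estimation (GAE), from a learned value function $V(s)$. GAE sums one-step surprises with a decay $\gamma\lambda$:
$$\hat A_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1} \tag{11}$$
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{12}$$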
The term $\delta_t$ is a one-step advantage estimate (reward plus discounted next-value, minus current value), $\gamma$ is the discount, and $\lambda$ is a knob trading variance for bias. The clean "baseline subtraction is unbiased" story is about the ideal advantage; GAE is a deliberately slightly biased estimate of it, a distinction that matters again later. The rest of PPO is about what you do with the gradient, not how you estimate the advantage.
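The GAE sum is usually computed backward in time, since each advantage is the current one-step surprise plus a decayed copy of the next advantage. A minimal sketch, assuming a single unbroken trajectory segment (no episode boundaries inside it) and a `values` list with one extra entry for the bootstrap value of the final state; the function name and the default `gamma` and `lam` are illustrative choices:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory segment.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), so len(values) == len(rewards) + 1
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # one-step surprise: reward plus discounted next value, minus current value
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # decayed sum of this and all later surprises
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With `lam=0` this collapses to the one-step surprises alone (low variance, more bias); with `lam=1` it becomes the full discounted sum of rewards minus the value baseline (less bias, more variance).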
The waste: one update per rollout
The inefficiency, stated plainly: Equation (1) is an average over data drawn from the current policy $\pi_\theta$. The instant you take a gradient step, $\theta$ changes, so the policy changes, so the batch you collected was drawn from a policy that no longer exists. Use it for a second step and you are averaging the gradient of the new policy against actions sampled from the old one. The estimate is biased, and the more steps you take, the more wrong it gets: each step moves the policy a little further from the one that collected the batch, the recorded actions become less and less representative of what the current policy would actually do, and the importance ratios that correct for the mismatch, introduced next, only get noisier as the gap widens.
There is a standard fix for "I have samples from the wrong distribution": importance sampling. Reweight each sample by how much more or less likely the new policy is to have taken that action. The weight is a ratio, and it is the single most important symbol in PPO:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
At the start of an update the new policy equals the old one, so every $r_t = 1$. As you optimize, an action the new policy favors more gets $r_t > 1$, one it favors less gets $r_t < 1$. Weighting the advantage by this ratio gives a surrogate objective you can keep optimizing on the fixed batch:
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat A_t\right] \tag{6}$$
The superscript $CPI$ stands for conservative policy iteration (Kakade and Langford, 2002), where this objective was proposed. It is a good objective for a small step. Near $\theta_{\text{old}}$ it agrees with the true return to first order, which is the formal version of "if you barely move, the reweighted batch is a faithful guide." The trouble is the word barely.
Maximize $L^{CPI}$ without restraint and the optimizer notices something it can exploit. For an action with positive advantage, the objective just keeps growing as $r_t$ grows. So gradient ascent drives $r_t$ as high as it can, making a handful of actions far more likely, well outside the range where the old batch says anything reliable. The policy lurches, the next batch is collected by a worse policy, and the whole process can spiral. This is exactly the destructive update from before, now with a name: an unbounded importance ratio. We need a way to let the policy move a little and refuse to let it move a lot.
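The runaway is easy to see numerically. The sketch below computes the importance ratio from stored log-probabilities (the usual, numerically safer route) and evaluates the unclipped surrogate for one good action as the new policy makes it more and more likely. The function names and the single-sample batch are illustrative assumptions.

```python
import math

def ratios(logp_new, logp_old):
    """Importance ratio per sample, computed from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def surrogate_cpi(logp_new, logp_old, adv):
    """Unclipped surrogate: batch mean of ratio * advantage."""
    r = ratios(logp_new, logp_old)
    return sum(ri * ai for ri, ai in zip(r, adv)) / len(adv)

logp_old = [math.log(0.1)]   # the old policy gave this action probability 0.1
adv = [2.0]                  # and it turned out well

# Surrogate as the new policy raises that probability to 0.1, 0.2, 0.4, 0.8.
curve = [surrogate_cpi([math.log(p)], logp_old, adv) for p in (0.1, 0.2, 0.4, 0.8)]
```

The values come out at roughly 2, 4, 8 and 16: each doubling of the action's probability doubles the objective, so nothing in the surrogate itself ever tells the optimizer to stop.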
TRPO: move, but not too far
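TRPO's answer is to make "don't move too far" a literal constraint. Measure the distance between the old policy and the new one with the KL divergence, a standard way to score how different two probability distributions are (zero when identical, growing as they diverge), and forbid any update that moves more than a small budget $\delta$:
$$\max_\theta\; \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat A_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$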
The set of new policies that satisfy this KL budget is the policy's trust region, the neighborhood of $\pi_{\theta_{\text{old}}}$ within which the surrogate stays a faithful stand-in for the return. This is not arbitrary caution. There is a theorem underneath: a surrogate like (6), minus a penalty proportional to the KL, is a genuine lower bound on the policy's true return. Improve the bounded quantity and you are guaranteed to improve the real thing, which gives monotonic improvement, a sequence of policies that never gets worse. The guarantee, though, needs the maximum KL over all states, while the constraint above (and practical implementations) uses the mean KL, a deliberate relaxation that trades the proof for tractability. And the KL is written $\mathrm{KL}[\pi_{\theta_{\text{old}}}, \pi_\theta]$, old policy first, which is not symmetric: it averages over the actions the old policy takes and asks how surprised the new policy would be by them, so it penalizes the new policy most for withdrawing probability from actions the old one chose often. The guarantee is real in theory and approximate in practice: TRPO keeps the structure of the bound and relaxes the part that made it provable.
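For discrete actions the quantities in that constraint are a few lines of arithmetic. The sketch below is only the membership test for the trust region, not TRPO's solver; the function names, the two-state example, and the budget `delta=0.01` are assumptions chosen for illustration, and it assumes the new policy gives nonzero probability to every action the old one could take.

```python
import math

def kl(p, q):
    """KL[p, q] = sum_a p(a) * log(p(a) / q(a)); p is the first (reference) argument."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

def mean_kl(old_dists, new_dists):
    """Average KL[old, new] over a batch of states (the relaxed, mean-KL form)."""
    return sum(kl(p, q) for p, q in zip(old_dists, new_dists)) / len(old_dists)

def within_trust_region(old_dists, new_dists, delta=0.01):
    """True if the update stays inside the KL budget."""
    return mean_kl(old_dists, new_dists) <= delta

old = [[0.9, 0.1], [0.5, 0.5]]            # action distributions in two states
small_step = [[0.88, 0.12], [0.52, 0.48]]
big_step = [[0.5, 0.5], [0.9, 0.1]]

forward = kl([0.9, 0.1], [0.5, 0.5])      # old first
reverse = kl([0.5, 0.5], [0.9, 0.1])      # new first: a different number
```

The small step passes the test and the big one does not, and swapping the argument order changes the value, which is the asymmetry in practice.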
Below, the amber bell is the old policy, fixed. Drag the new policy's mean and watch the KL grow. Inside the shaded band the step is small enough to trust; push past and you are in the region where the surrogate stops being a reliable stand-in for the return:
It works, and it is a pain. The constraint turns each update into a constrained optimization, solved with conjugate gradients, a linear approximation to the objective, and a quadratic approximation to the KL built from the Fisher information matrix. That is second-order machinery. It does not play nicely with the things modern networks do all the time: parameter sharing between the policy and the value function, dropout, any architecture with noise. PPO's whole ambition is to get TRPO's "stay close" behavior with nothing but first-order gradient steps.
PPO's move: clip the ratio
Look again at what went wrong with $L^{CPI}$. The objective rewards the optimizer for pushing $r_t$ ever further in the helpful direction, with no diminishing returns. PPO's fix is direct: cap the reward. Past a small band around 1, stop counting any further change in the ratio.
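Concretely, clip the ratio to $[1-\epsilon,\, 1+\epsilon]$ (with $\epsilon$ a small number like 0.2), and take the minimum of the clipped and unclipped objectives:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat A_t,\; \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat A_t\right)\right] \tag{7}$$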
That $\min$ is doing more than it looks. Take it apart by the sign of the advantage, because the two cases behave differently, and the difference is the entire idea.
Good action ($\hat A_t > 0$). You want to make it more likely, so you push $r_t$ up. The objective rises along $r_t \hat A_t$ until $r_t$ hits $1+\epsilon$, then goes flat: the clip caps the reward at $(1+\epsilon)\hat A_t$, so beyond that point there is nothing to gain and the gradient for this sample is zero. The optimizer stops pushing. The lower clip never enters; for a good action it is inert.
Bad action ($\hat A_t < 0$). Now you want it less likely, so you push $r_t$ down. By symmetry the binding clip is the lower one: the objective is flat at $(1-\epsilon)\hat A_t$ for $r_t < 1-\epsilon$, and only the upper region is sloped. So the clip that actually bites is $1+\epsilon$ for good actions and $1-\epsilon$ for bad ones. The other one sits idle in each case.
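The two cases can be checked one sample at a time. In this sketch `clipped_term` is the per-sample quantity inside the batch average, and `slope` estimates its derivative with respect to the ratio by a central finite difference, so a slope of zero means the sample contributes no gradient. The names and the step `h` are illustrative assumptions.

```python
def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def clipped_term(r, adv, eps=0.2):
    """Per-sample clipped objective: min of unclipped and clipped ratio * advantage."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)

def slope(r, adv, eps=0.2, h=1e-6):
    """Finite-difference derivative of clipped_term w.r.t. the ratio r."""
    return (clipped_term(r + h, adv, eps) - clipped_term(r - h, adv, eps)) / (2.0 * h)

# Good action (adv > 0): capped once r passes 1 + eps, still sloped below it.
good_capped = clipped_term(1.5, 1.0)    # (1 + eps) * adv
good_low = clipped_term(0.5, 1.0)       # unclipped value, r * adv

# Bad action (adv < 0): flat once r drops below 1 - eps, still sloped above it.
bad_capped = clipped_term(0.5, -1.0)    # (1 - eps) * adv
bad_high = clipped_term(1.5, -1.0)      # unclipped value, r * adv
```

Calling `slope` at those four points gives zero for the two capped cases and the advantage itself for the other two, which is the "only one clip bites per sign" behavior described above.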
Drag the ratio and flip the sign of the advantage. The thick teal curve is $L^{CLIP}$; the thin amber line is the unclipped $r_t \hat A_t$ it would have followed. The red dot at $r_t = 1$ is where every update starts:
This is the trust region of TRPO, achieved with arithmetic instead of a constrained solve. Keeping each update near the old policy is what proximal means, the P in PPO. There is no Fisher matrix, no conjugate gradient, no line search. There is a $\min$, a $\operatorname{clip}$, and a hyperparameter $\epsilon$. And because the clip only flattens the objective, it composes with anything: shared value heads, dropout, recurrent policies, all fine.
It is worth being precise about why this stands in for a trust region. TRPO enforces a hard limit: it measures the KL distance from the old policy to the new one and refuses any step that exceeds a budget $\delta$, so each update stays inside the region where the local surrogate is still a trustworthy guide to the true return. PPO's clip is the soft, cheap version of that idea. Once a sample's ratio leaves $[1-\epsilon,\, 1+\epsilon]$ in the direction its advantage favors, the objective goes flat, its gradient dies, and the optimizer stops pushing that sample any further. Nothing computes a KL or solves a constraint, yet the new policy is held near the old one because the only samples still contributing gradient are the ones that have not yet moved far. This approximates the trust region, it does not reproduce it: the clip caps each sample one at a time and never bounds the KL the way TRPO's constraint does, so it is a stand-in for TRPO's stay-close behavior, not for its monotonic-improvement proof.
Why the minimum makes it a floor
One detail in (7) is easy to skim past and is the reason the whole thing is safe: it is a $\min$, not just a clip. Why take the smaller of the clipped and unclipped values, rather than always using the clipped one?
The paper's own phrasing is the clearest: you ignore the change in the ratio only when it would make the objective improve, and you keep the unclipped ratio when it makes the objective worse. Walk the good-action case. For $r_t > 1+\epsilon$, the unclipped $r_t \hat A_t$ is larger than the clipped value, so the $\min$ discards it: you decline the extra reward, the gradient dies, the policy is not encouraged to push further. Good. But suppose an optimization step has overshot the wrong way and made a good action less likely, dropping $r_t$ below $1-\epsilon$. There the unclipped value is smaller, so the $\min$ keeps it, the gradient stays alive, and it pulls $r_t$ back up where it belongs. The $\min$ removes the reward for over-improving, and keeps the gradient that fixes an over-correction.
The compact way to say it: $L^{CLIP}$ is a pessimistic lower bound on the unclipped surrogate $L^{CPI}$. It never reports a rosier number than $L^{CPI}$ would, so optimizing it cannot be fooled into a large step by a runaway ratio. A crucial qualifier, and the most common misconception about PPO: this is a lower bound on the surrogate, not on the true return. The lower-bound-on-return story is TRPO's, through its max-KL penalty. PPO keeps the stay-close behavior and drops the second-order proof.
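The pessimism is a property of the arithmetic, so it can be spot-checked on random batches. A small sketch, with made-up ratio and advantage distributions and hypothetical function names; it tests the bound on the surrogate only and says nothing about the true return.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def l_cpi(ratios, advs):
    """Unclipped surrogate: batch mean of ratio * advantage."""
    return sum(r * a for r, a in zip(ratios, advs)) / len(advs)

def l_clip(ratios, advs, eps=0.2):
    """Clipped surrogate: batch mean of min(unclipped, clipped) terms."""
    return sum(min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
               for r, a in zip(ratios, advs)) / len(advs)

def bound_holds(trials=1000, batch=8, seed=0):
    """Check l_clip <= l_cpi on random batches of ratios and advantages."""
    rng = random.Random(seed)
    for _ in range(trials):
        ratios = [rng.uniform(0.0, 3.0) for _ in range(batch)]
        advs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        if l_clip(ratios, advs) > l_cpi(ratios, advs) + 1e-12:
            return False
    return True
```

At a ratio of exactly 1 the two objectives coincide, which is why the first gradient step of every update is identical to an ordinary policy-gradient step; they only part ways once ratios drift.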
You can watch the bound do its job. As the policy moves further from $\theta_{\text{old}}$ (optimizing the one batch harder), the unclipped $L^{CPI}$ keeps climbing, the way that tempted the optimizer into trouble. The clipped $L^{CLIP}$ rises, peaks while the KL is still small, then falls: a built-in penalty for moving too far. The maximum sits right where you want the update to stop:
Figure readout: $L^{CLIP}$ peaks near KL = 0.02, then falls, while $L^{CPI}$ keeps rising.
With ordinary gradient ascent on the unclipped surrogate you would keep climbing the amber curve straight off the edge; the clip turns the objective into one that wants you to stop near the old policy, so plain SGD, run for several epochs, lands in roughly the right place on its own.
The other knob: a KL penalty
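Clipping is not the only way to discourage large steps, and the paper is careful to test the obvious alternative: put the KL back in the objective as a penalty, the way the TRPO theory suggests, and adapt its strength so each update lands near a target KL $d_{\text{targ}}$:
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat A_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$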
After each update you measure the KL you actually got. If it came in well under target (the paper's threshold is a factor of 1.5), the penalty was too strong, so you halve $\beta$. If it overshot by the same factor, you double it. The coefficient chases the right value on its own, which sidesteps TRPO's complaint that no single fixed $\beta$ works across problems or even across one run. It is a clean idea and it works, just worse than clipping in the paper's experiments: its best setting reaches 0.74, against 0.82 for the clip, so the penalty version is kept as a baseline and the clip is the recommendation. The rest of the algorithm is packaging.
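The adaptation rule is a three-way branch. A minimal sketch, assuming the caller supplies per-sample ratios, advantages, and per-state KL values; the function names, the starting `beta`, and the sequence of measured KLs are made up for illustration, and the 1.5 dead-band factor follows the paper's description of the rule.

```python
def kl_penalized_objective(ratios, advs, kls, beta):
    """Batch mean of ratio * advantage minus beta * KL (to be maximized)."""
    n = len(advs)
    return sum(r * a - beta * k for r, a, k in zip(ratios, advs, kls)) / n

def update_beta(beta, kl_measured, kl_target, factor=1.5):
    """Halve beta if the step was too timid, double it if too bold, else keep it."""
    if kl_measured < kl_target / factor:
        return beta / 2.0
    if kl_measured > kl_target * factor:
        return beta * 2.0
    return beta

beta = 1.0
history = []
for measured in (0.05, 0.04, 0.012, 0.002):   # hypothetical KLs after four updates
    beta = update_beta(beta, measured, kl_target=0.01)
    history.append(beta)
```

Two overshoots double the coefficient twice, an on-target update leaves it alone, and an undershoot halves it, so `history` traces the coefficient chasing the target.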
The whole loop
A real PPO objective bundles two more terms onto the clip. If the policy and value function share a network, you need a value-prediction loss to train the shared part; and a small entropy bonus keeps the policy from collapsing to determinism too early, which preserves exploration:
$$L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1\, L^{VF}_t(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$
Mind the signs, because they trip people up. The value loss is a squared error you want small,
$$L^{VF}_t = \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2,$$
where $V_t^{\text{targ}}$ is the return actually observed from $s_t$, or, as commonly implemented, $\hat A_t + V(s_t)$. The value net is trained to predict the very returns the advantage was built from, and because $L^{VF}$ is a loss, it is subtracted. The entropy $S$ is a bonus you want large, so it is added. Both signs are correct precisely because the whole expression is maximized; code usually minimizes the negative of this, which flips all three signs back, and the paper calling $L^{CLIP}$ a "loss" is just framework habit. For the continuous-control runs the policy and value nets are separate, so $c_1$ does not even apply and there is no entropy bonus; the extra terms earn their keep mainly on the shared-network Atari agent, where $c_1 = 1$ and $c_2 = 0.01$.
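The sign bookkeeping is easier to trust once it is written down. This sketch keeps the objective (maximized) and the loss (minimized) as two separate functions so the flip is explicit; the function names are hypothetical, the inputs are plain numbers rather than tensors, and the default coefficients are the Atari values quoted above.

```python
import math

def entropy(probs):
    """Entropy of a categorical action distribution (larger = more exploratory)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def value_loss(predicted, targets):
    """Mean squared error between predicted values and return targets."""
    return sum((v - t) ** 2 for v, t in zip(predicted, targets)) / len(targets)

def ppo_objective(l_clip, l_vf, ent, c1=1.0, c2=0.01):
    """Quantity to MAXIMIZE: clip term, minus value error, plus entropy bonus."""
    return l_clip - c1 * l_vf + c2 * ent

def ppo_loss(l_clip, l_vf, ent, c1=1.0, c2=0.01):
    """Quantity to MINIMIZE in a gradient-descent framework: the negated objective."""
    return -ppo_objective(l_clip, l_vf, ent, c1, c2)

objective = ppo_objective(0.2, value_loss([1.0, 2.0], [0.8, 2.1]), entropy([0.5, 0.5]))
loss = ppo_loss(0.2, value_loss([1.0, 2.0], [0.8, 2.1]), entropy([0.5, 0.5]))
```

A larger value error lowers the objective and a larger entropy raises it; after negation the value error raises the loss and the entropy lowers it, which is the form most training code uses.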
Now the loop, which is short enough to hold in your head. Each iteration, $N$ actors run the current policy in parallel for $T$ timesteps, tagging every step with an advantage estimate. That fills a buffer of $NT$ samples. Then you optimize the clipped objective on that buffer for $K$ epochs of minibatch ascent, refresh $\theta_{\text{old}} \leftarrow \theta$, and collect again. The $K$ is the entire point: it is the data reuse that vanilla policy gradients could not safely do.
one rollout → 10 gradient passes over the same data (MuJoCo uses K=10)
And the same thing in code, where the ascent-as-negative-loss sign convention shows up directly:
```python
# one PPO iteration (actor-critic style)
# pseudocode: run, gae, traj_with, minibatches, logp and ent are placeholder helpers
data = []                                     # collect with the CURRENT policy
for actor in range(N):                        # N actors run in parallel
    traj = run(pi_old, T)                     # T timesteps each
    adv = gae(traj, V, gamma, lam)            # advantage per step (Eqs 11-12)
    data += traj_with(adv, logp_old)          # store action log-probs under pi_old
for epoch in range(K):                        # reuse the SAME data K times
    for mb in minibatches(data, M):
        r = exp(logp(pi, mb) - mb.logp_old)   # ratio r_t
        unclip = r * mb.adv
        clip = clamp(r, 1 - eps, 1 + eps) * mb.adv
        # take the smaller: once the ratio drifts past the clip range,
        # the objective stops improving and the gradient dies
        L_clip = mean(min(unclip, clip))      # Eq 7
        L_vf = mean((V(mb.s) - mb.ret) ** 2)  # value loss
        loss = -(L_clip - c1 * L_vf + c2 * ent(pi, mb))  # ascend: minimize -L
        opt.zero_grad()                       # clear gradients from the previous minibatch
        loss.backward(); opt.step()
pi_old = pi                                   # refresh (copy the weights in real code), then collect again
```

That is the whole algorithm. Collect a batch, store each action's probability under the old policy, then take $K$ epochs of clipped minibatch steps before the data goes stale. A few lines on top of REINFORCE, and the sample efficiency of a trust-region method. You can watch why the repeated epochs stay safe by scrubbing through them:
So what does it do
The first experiment is an ablation that earns the design. On seven MuJoCo continuous-control tasks, the paper sweeps the surrogate variants and scores each (0 = a random policy, 1 = the best result), averaged over all runs. Drag through them:
ε=0.2 → 0.82 · Clipping with ε=0.2. The winner, and the value the paper reuses for every later experiment.
Clipping at $\epsilon = 0.2$ wins cleanly across the whole sweep, which is why the paper reuses it for every later experiment; that choice lives in this sweep in the body of the paper, not in the hyperparameter tables, and the often-quoted value comes from later libraries, not from here. The negative score for "no clipping or penalty" is the runaway ratio made visible: with nothing restraining the step, one environment (half cheetah) goes so far wrong it drags the whole average below a policy that does nothing.
Beyond the ablation, PPO with $\epsilon = 0.2$ beats or matches the contenders of its day (TRPO, A2C, the cross-entropy method, vanilla policy gradients) on almost all of the continuous-control benchmarks, and it scales up to 3D humanoids that learn to run, steer toward moving targets, and get up after being knocked down. On Atari the picture is more nuanced. Across 49 games PPO won 30 on the fast-learning metric (average reward over all of training), ahead of A2C and ACER. On final performance, the average over the last 100 episodes, ACER edged it, 28 games to 19, with one tie. PPO learns quickly and reliably, and it is not uniformly the strongest final performer.
None of that fully explains why PPO took over. The deeper reason is the one this whole post has been building toward: it is simple. It is a few lines on top of a policy gradient, it has essentially one hyperparameter that matters, it tolerates the architectures people actually use, and it almost never blows up. When the RLHF pipeline needed a reinforcement-learning algorithm to fine-tune language models against a reward model, PPO was the obvious choice, and it is the algorithm in InstructGPT (with a per-token KL penalty bolted on, the same "stay close" instinct in a new place).
Step back and the argument is four moves long. A policy gradient nudges each action by its advantage. Reusing the batch needs an importance ratio, and an unrestrained ratio is what destroys the policy. A trust region fixes that but is heavy. And clipping the ratio recreates the trust region with a $\min$ and a $\operatorname{clip}$, cheap enough to run for ten epochs a batch. The hard-won stability of trust-region methods turned out to be available for the price of one line of arithmetic. That is the whole reason PPO is everywhere.
Questions you might still have
Why take many small clipped steps instead of one big gradient step?
A big step on a stale batch overshoots: once the policy has moved, the collected data no longer reflects it, and the surrogate stops being a good guide. Many small clipped steps keep the policy near where the data was collected, so each step stays trustworthy, and you still extract far more from the batch than one step would.
Does clipping actually keep the KL divergence small?
Not as a hard guarantee. Clipping is a soft, per-sample heuristic: it removes the incentive to push a single sample’s ratio past 1±ε, but it does not bound the KL the way TRPO’s constraint does. The average KL usually stays small, yet a few samples can still move, which is why many implementations add an explicit KL early-stop on top.
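One common form of that early stop estimates the KL from the log-probabilities already stored in the batch and abandons the remaining epochs once it passes a threshold. This is a sketch of the general pattern, not the paper's algorithm: the function names, `target_kl=0.01`, and the `slack` factor are assumptions, and the estimator is the simple sample mean of the log-probability differences.

```python
def approx_kl(logp_old, logp_new):
    """Sample estimate of KL[old, new] using actions drawn under the old policy."""
    return sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)

def epochs_before_stop(logp_old, logp_new_per_epoch, target_kl=0.01, slack=1.5):
    """Number of epochs completed before the estimated KL exceeds slack * target_kl."""
    for epoch, logp_new in enumerate(logp_new_per_epoch):
        if approx_kl(logp_old, logp_new) > slack * target_kl:
            return epoch
    return len(logp_new_per_epoch)

logp_old = [-1.0, -1.0]
# Hypothetical log-probs of the same actions under the policy at each epoch.
drift = [[-1.0, -1.0], [-1.01, -1.0], [-1.05, -1.03]]
completed = epochs_before_stop(logp_old, drift)
```

Here the first two epochs stay under the threshold and the third trips it, so the update stops after two epochs instead of running all of them.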
What does PPO actually inherit from TRPO?
The intuition, not the theorem. TRPO’s monotonic-improvement guarantee needs the maximum KL over states and a second-order solve. PPO keeps only the idea (stay near the old policy so the surrogate stays trustworthy) and gets it first-order with a clip, carrying no such proof.
Is the clipped objective a lower bound on the true return?
No. It is a lower bound on the unclipped surrogate L^CPI, not on the policy’s true performance. The bound-on-return story belongs to TRPO, through a different, max-KL penalty. Conflating the two is the most common PPO misconception.
Footnotes & further reading
- The paper: Schulman, Wolski, Dhariwal, Radford, Klimov, Proximal Policy Optimization Algorithms (OpenAI, 2017). A clear secondary reference is OpenAI's Spinning Up page.
- The trust region it descends from: Schulman, Levine, Moritz, Jordan, Abbeel, Trust Region Policy Optimization (2015), with the monotonic-improvement bound PPO emulates first-order.
- The advantage estimator: Schulman, Moritz, Levine, Jordan, Abbeel, High-Dimensional Continuous Control Using Generalized Advantage Estimation (GAE, 2015). The clean GAE weights are $\gamma^l$ on rewards and $(\gamma\lambda)^l$ on residuals.
- The surrogate's origin: Kakade and Langford, Approximately Optimal Approximate Reinforcement Learning (ICML 2002), the conservative policy iteration that names $L^{CPI}$.
- The policy gradient itself: Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE, 1992), and Sutton, McAllester, Singh, Mansour, Policy Gradient Methods for Reinforcement Learning with Function Approximation (2000).
- PPO in the wild: Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT, 2022), whose RL stage is PPO with a per-token KL penalty.