Trust Region Policy Optimization
Bound how much the policy changes each step, and each step can be large without the policy collapsing.
One policy-gradient step that is too large can wreck a working policy. Using KL divergence to measure how much the policy's behavior changes, TRPO caps that change and takes the largest improving step inside the cap.
Explaining the paperTrust Region Policy OptimizationThe policy gradient tells you which way to nudge a policy, not how far. Nudge too far and a good policy can fall apart in a single update. TRPO is a principled answer to "how far," and the answer is a bound on how much the policy's behavior is allowed to shift.
Reinforcement learning by policy gradients has a recurring failure. You collect some experience, estimate which way to push the policy to get more reward, take a step, and sometimes the policy that comes out is far worse than the one you started with. Not a little worse. Collapsed. It has forgotten how to stand, or it has locked onto one bad action and stopped exploring. You throw away the step, shrink the learning rate, and try again. The gradient was pointing the right way; the step was too big, and the landscape it was walking on turned out to bend sharply just past where it looked flat.
The reason a step can betray you is that the thing you optimized is not the thing you care about. You care about the policy's expected return. What you can actually estimate from a batch of old experience is an approximation to that return, and the approximation is only trustworthy near the policy you collected the batch with. Step a short distance and the approximation tracks the truth. Step far and the two peel apart, so the update that looked like a big win on the approximation is a loss on the real objective.
TRPO turns that observation into an algorithm. It writes down the exact relationship between two policies' returns, replaces the uncomputable part with a tractable surrogate, proves how far the surrogate can be trusted, and then, at every update, takes the largest step that stays inside the region where the trust holds. That region is measured in how much the policy's output distribution moves, its KL divergence from the old policy, rather than in raw parameter distance. A few ideas assemble the method: the exact improvement identity, the local surrogate and why it is only local, a distance on policies and the bound it buys, the trust region the bound justifies, and the natural-gradient machinery that solves for that step cheaply. None is heavy on its own.
The step-size problem, stated once
Fix the vocabulary first, because every later step leans on it. The agent lives in a Markov decision process: at each state it draws an action from a stochastic policy (for a neural policy, a network reads and outputs the parameters of a distribution over actions), the environment pays a reward and moves it to a new state, and this repeats. The single number we want to maximize is the expected discounted return, weighting a reward steps out by for a discount so the infinite sum stays finite:
Three bookkeeping quantities make workable. The value is the expected return from state if you follow forever. The action value is the same, but you take one specific action first. Their difference is the advantage:
The advantage answers a local, practical question: at this state, is action better or worse than what the policy would have done on average? Positive means better than the policy's own habit, negative means worse. It has one property worth pinning down now, because a cancellation later depends on it: averaged over the policy's own actions, the advantage is exactly zero. A quick instance makes it concrete. Suppose at some state two actions are available, the policy picks with probability and with , and their action values are , . Then , so the advantages are and , and their policy-weighted mean is . The advantage is a re-centered score, and its baseline is the current policy. Under a different policy the same advantages need not average to zero, and that nonzero average is precisely the improvement signal the next section is built on.
One last object, and one convention flag. Write for how often the policy visits state , discounted by when the visit happens: . The paper leaves this unnormalized, so it sums to rather than to one. That is not a probability distribution, and the stray factor of reappears later as a harmless constant when the sum over states becomes an expectation. Keep it in view; it is the one piece of bookkeeping that silently changes shape between the theory and the implementation.
The exact improvement identity
One fact, due to Kakade and Langford (2002), carries the rest of the paper. The return of any new policy can be written as the old return plus a single correction, the discounted sum of the old policy's advantages, but accumulated along the new policy's trajectories:
This is exact, not an approximation, and it holds for any two policies. It is also easy to believe once you see the one-line reason: the advantage is, on average, the one-step temporal-difference error . Summed with discounting along a trajectory the terms telescope, every intermediate value cancels against the next, and all that survives is plus the trajectory's actual discounted reward. Take the expectation over trajectories generated by and those two pieces are exactly and . Rearrange and you have (1).
Roll the trajectory expectation up into a sum over states, weighting each state by how often visits it, and (1) becomes a form that shows exactly where the trouble is:
Read (2) as a recipe for improvement: to make better than , put probability on actions whose advantage under is positive. If every state has nonnegative expected advantage the new policy is guaranteed at least as good, which is the classical reason greedy policy iteration works. But (2) cannot be optimized as written, because the visitation weight is the new policy's state distribution, and you do not have it: to know which states visits you would already have to run , which is the thing you are trying to design. The exact identity is true and useless in the same breath.
A surrogate you can only trust nearby
TRPO makes a single, deliberate cheat: use the old policy's visitation in place of the unknown . That gives the surrogate objective:
Now everything on the right is computable from data the old policy already collected. You keep the same states it visited and just reweight their actions by the new policy's probabilities. The reweighting is like re-scoring an old survey under new respondent weights, while pretending the new policy would have knocked on the same doors. For a tiny change in policy the doors barely move and the pretense costs almost nothing; for a large change the new policy genuinely visits a different population of states, and the pretense is exactly what makes the surrogate wrong.
"Almost nothing" can be made precise. For a parameterized policy , the surrogate matches the true return to first order at the current parameters : same value, and same gradient.
The gradients agree because the term the surrogate throws away, the change in state visitation, is itself second order in the size of the policy change: at the policy has not moved yet, so the visitation has not moved either, and its derivative contributes nothing to first order. That first-order match is the license for optimizing at all. Improve by a small step and, because their slopes match at the start, you improve too. Improve by a large step and the guarantee evaporates, because past the immediate neighborhood the two curves have different curvature and can keep climbing while has already turned over.
The figure below makes that sentence visible; it shows the method in one slice. The amber curve is the true return as the policy moves a distance from where it started. The teal line is the surrogate : it touches the true return at the start with the same height and the same slope, then keeps rising because it holds the state visitation fixed even as the true distribution drifts. The violet curve is a lower bound derived two sections from now; ignore it for a moment. Drag the trust region and watch the chosen step slide right. A small region buys a small, certain gain. Widen it toward the peak and the true gain is largest. Widen it too far and the step sails past the peak, the surrogate is still promising more, and the true return has dropped below where you began. That last frame, the overshoot, is the collapse that opened this article.
So the surrogate is trustworthy in a neighborhood and treacherous outside it. The rest of the paper is a precise account of how big that neighborhood is, measured in the right units.
Measuring policy change: total variation and KL
"How far the policy moved" needs a definition, and parameter distance is the wrong one: two policies with nearly identical weights can behave very differently, and two with very different weights can behave the same. What matters is how much the action distributions move. Take one state; the policy there is a distribution over actions. The bluntest distance between two distributions is the total variation, half the summed absolute difference:
Total variation is intuitive but awkward to differentiate through, so TRPO trades it for the KL divergence , the same quantity that shows up whenever you compare distributions in machine learning. The two are linked by a standard inequality (the paper cites Pollard, and uses a deliberately loose form of it):
So bounding the KL bounds the total variation too, hence bounds how differently the two policies can ever behave. One convention to fix, because it is asymmetric and easy to get backwards: TRPO puts the old policy first, . The figure lets you drag the new policy away from a fixed old one and watch both distances grow, and watch the KL cross the trust-region bound . It also lets you check the inequality by eye: stays under the KL at every setting.
The improvement bound, and a guarantee
Now the payoff. Kakade and Langford proved a lower bound for their conservative policy-iteration scheme, which only mixed in a new policy a little at a time. TRPO's central theoretical contribution is to extend that bound to any two stochastic policies, with the KL divergence as the distance. The true return is never worse than the surrogate minus a penalty proportional to the worst-case KL between the policies:
Two details in (5) are load-bearing and routinely garbled, so state them carefully. First, the constant grew from the conservative-policy-iteration version by a factor of two (its coefficient was , here ), and that factor does not come from the KL step. It comes from a coupling argument: run the two policies with a shared random seed and they pick the same action a fraction of the time, where is how often they disagree (bounded by the total-variation distance from the previous section), and the error you accrue on the rare disagreements is a difference of two advantages, each as large as , which is where the extra factor of two enters. Second, itself was redefined: in the earlier bound it was the worst state's expected advantage, here it is the worst single anywhere, a larger quantity. Both changes make the general bound looser, which is the price of dropping the restriction to gently mixed policies.
The bound is the violet curve you saw in Figure 1. Because it equals at the old policy and sits below it everywhere else, it is a minorizer: a floor under the true return that touches the ground exactly under your feet. That single property yields a guarantee. Let be the minorizer built at the current policy. Maximize it, move to wherever its top is, and:
The first inequality holds because everywhere while ; the second holds because was chosen to maximize . The real return cannot decrease. This is a minorization-maximization scheme, the same logic that makes the EM algorithm climb: build a floor under the thing you cannot optimize directly, jump to the top of the floor, rebuild the floor there, repeat. The picture is an inflatable floor under a foggy mountain: it touches the terrain under your feet and lies below it everywhere else, so the highest point of the floor is real ground at least as high as where you stood.
It is worth being exact about what this guarantee covers, because the popular one-line summary of TRPO overstates it. The monotonic-improvement result is a theorem about the idealized procedure in (6): exact advantages, an exact maximization over all policies, the theoretical penalty , and the worst-case (maximum) KL. The algorithm you actually run replaces all four of those with approximations, and in doing so gives up the formal guarantee. The paper is candid about this: in practice TRPO "tends to" give monotonic improvement. Read the guarantee as the motivation for the design, not as a promise about any single update.
From a penalty to a trust region
If the theory hands you a bound, why not just maximize it, surrogate minus penalty, and enjoy the guarantee? Because of the constant. With a realistic discount the penalty coefficient is enormous. At , the denominator is , so , roughly forty thousand times the largest advantage. Faithfully maximizing with a penalty that steep moves the policy only a hair per iteration: the bound holds, but the steps it licenses are uselessly small.
TRPO makes the pragmatic swap that names the method. Drop the penalty and impose the KL as a hard constraint instead, a trust region of a chosen radius , and take the biggest surrogate-improving step that keeps the policy inside it:
The swap trades a toll for a fence. The theoretical penalty is a forty-thousand-per-unit toll on policy change, so you never leave the driveway; the constraint is a fence at KL-distance , and you drive freely right up to it. Two more adjustments make (7) practical. The bound used the maximum KL over all states, which is one constraint per state and hopeless to enforce, so TRPO uses the average KL over the states it actually visited, written above, an admitted heuristic that works well. And becomes a single tuned dial; the paper uses for every experiment, a small budget of about a hundredth of a nat of average KL, so the action distribution shifts only slightly each step. That fixed, small budget is a large part of why TRPO needs so little per-task tuning: each step is as large as the trust region allows, and the trust region is the same size on every problem.
Estimating the objective from rollouts
Problem (7) is still written in terms of sums over all states and actions and the true advantage function. To run it you estimate everything from sampled experience. Three substitutions reduce (7) to something you can evaluate on a batch. The sum over states weighted by becomes an expectation over sampled states, which is where that mass reappears as a constant multiplier that does not move the maximizer. The advantage is replaced by the action value , which shifts the objective only by a per-state that is constant in and so changes neither the maximizer nor the gradient. And the sum over actions becomes an importance-sampling estimate: sample actions from some distribution and reweight by . Together they give the objective TRPO actually optimizes:
The importance weight is also a second reason the trust region is not optional. That reweighting is exact for any , but the weights blow up as pulls away from the sampling policy, so the estimate's variance explodes exactly when the surrogate stops being accurate. Keeping the KL small keeps the weights tame and the surrogate accurate at the same time, one constraint doing two jobs.
That leaves how to get the estimates and what to sample from, and the paper gives two schemes. The single path method is ordinary policy-gradient sampling: run the current policy for whole trajectories, treat every state on a path as a training point, and estimate its as the discounted reward that actually followed it. One rollout per state, no special requirements, and it runs on a real robot. The vine method spends more to get cleaner numbers: from a set of states along the trajectories it fires several short rollouts, one per candidate action, and reuses the same random-number seed across them so the shared randomness cancels when it compares actions (common random numbers). Many rollouts per state give much lower-variance advantage estimates, but the branching needs the simulator reset to arbitrary states, so vine is simulation-only. Toggle between them:
Taking the step: the natural gradient
One piece remains: how to actually solve the constrained problem (8) at each update, over a policy with tens of thousands of parameters, without anything blowing up. TRPO does it with two approximations and a line search. Approximate the objective by its gradient (linear in the step) and the KL constraint by its curvature (quadratic in the step). The curvature of the average KL at the current policy is a specific matrix, the Fisher information matrix (the code below writes it F too, since is already the advantage), and the step that maximizes a linear objective inside a quadratic constraint points along , where is the surrogate gradient. That direction has a name: the natural gradient (Kakade's natural policy gradient, built on Amari's natural gradient).
The natural gradient is worth a picture, because it shows geometrically why KL is the right way to measure the step. It is the same gradient , re-measured in KL rather than in raw parameter distance. Where the policy is sensitive, a small parameter change moves the behavior a lot, so the trust region is short in that direction; where the policy is insensitive, the region stretches. A plain gradient step ignores all of this and moves the same Euclidean distance in every direction, which can change the behavior wildly along a sensitive axis. The natural step reshapes itself to spend an equal KL budget in every direction instead. Drag the anisotropy and watch the two steps, the plain one and the KL one, pull apart:
Computing naively would mean forming and inverting a matrix with a row and column per parameter, which is out of the question at this scale. TRPO never forms . It uses conjugate gradient, which solves using only the ability to multiply by a vector, and each such Fisher-vector product costs about one gradient evaluation. Ten conjugate-gradient iterations suffice, and the products are computed on a tenth of the batch, so the search direction costs roughly one more gradient on top of the one you already needed. One subtlety: is the analytic curvature of the KL, which integrates over actions in closed form and is positive semidefinite by construction, not the covariance of sampled gradients that some implementations substitute for it.
The direction fixes only where to go, not how far. The quadratic KL model gives a first guess at the step length: along the direction the model says the KL grows as , so setting that equal to gives . Then, because the model is only a model, TRPO runs a backtracking line search from that length: shrink the step until the true surrogate actually improves and the true KL is within . The paper notes that without this line search the algorithm occasionally takes a step that looks fine to the quadratic model and is catastrophic in reality. This is also the exact point where TRPO parts ways with the plain natural policy gradient, which takes a fixed-size step without re-checking it: TRPO re-solves the step length against the real constraint every iteration.
Putting one full iteration in one place makes the shape concrete. The policy is a small network: for continuous control, a Gaussian whose mean comes from a two-layer network (30 to 50 hidden units) and whose diagonal covariance is a separate set of parameters independent of the state; for Atari, a convnet with two 16-filter layers and a 20-unit hidden layer, about parameters in all. One iteration collects a batch, estimates advantages, computes the surrogate gradient, finds the natural direction by conjugate gradient, scales it to the trust region, and backtracks to be safe:
# one TRPO iteration, single-path, reward-maximization convention
batch = collect_trajectories(pi_old) # states, actions, rewards
Qhat = discounted_returns(batch) # Monte-Carlo Q along each path
Ahat = Qhat - value_baseline(batch.s) # advantage estimates
g = grad_surrogate(pi, batch, Ahat) # gradient of L at theta_old
# search direction s ~= F^-1 g, F = average KL Hessian (never formed)
s = conjugate_gradient(fisher_vector_product, g, iters=10)
sAs = s @ fisher_vector_product(s)
beta = sqrt(2 * delta / sAs) # delta = 0.01 (the KL budget)
step = beta * s
# backtrack: shrink until the TRUE surrogate improves AND TRUE KL <= delta
for _ in range(max_backtracks):
theta_new = theta_old + step
if improved(theta_new) and mean_kl(theta_old, theta_new) <= delta:
break
step *= 0.5Three parts make up the complete algorithm: an objective that is a first-order-faithful proxy for the return, a KL trust region that says how far to trust it, and a natural-gradient step with a safety line search that moves as far as the region allows. Everything upstream, the identity through the bound, exists to justify the update rule (9).
Locomotion and Atari from pixels
The experiments were chosen to stress two things at once: whether a single, lightly-tuned method can learn hard control from scratch, and whether it beats the alternatives it is related to. On simulated robotic locomotion, swimming, hopping, and walking with general-purpose networks and a bare reward for moving forward, TRPO learned all three gaits. The comparison that matters is with the natural policy gradient using a fixed step instead of the KL constraint: that method handled the easy tasks but could not produce hopping or walking gaits that made forward progress. The gradient-free methods (the cross-entropy method, CMA) scaled badly with the number of parameters and did poorly on the larger problems. The reading is direct: enforcing a fixed KL trust region, rather than a fixed step or a fixed penalty, is what made the updates robust enough to solve the hard tasks with the same everywhere.
The Atari results test generality rather than dominance. The same recipe, learning from raw pixels with a convnet, was run once on each of the seven games from the deep Q-learning benchmark. It reaches reasonable scores on all seven and occasionally beats the specialized prior methods, the vine variant on Q*bert scoring several times deep Q-learning's, single path topping it on Enduro and Seaquest, but it is not state of the art and does not claim to be. What it demonstrates is that one general policy-search method, unchanged, spans problems as different as a swimming robot and a pixel game. Slide through the games:
The machinery outlived the benchmark numbers. The trust-region idea, take the largest step that keeps the policy's behavior close in KL, became the template for the policy-optimization methods that followed. TRPO's own successor, PPO, keeps the goal and throws out the expensive parts: it replaces the hard KL constraint and the conjugate-gradient solve with a clipped surrogate you can optimize with ordinary stochastic gradient descent, trading a little of the theory for a lot of simplicity. The line from "prove how far you can trust the surrogate" to "just clip the surrogate so a single step cannot go too far" runs straight through this paper.
Questions you might still have
Does TRPO actually guarantee the policy never gets worse?
No. Only the idealized procedure, with exact advantages, an exact maximization over all policies, the theoretical penalty C, and the maximum KL, provably never decreases return. The practical algorithm makes four approximations (a hard constraint instead of the penalty, average instead of maximum KL, sampled estimates, and a linear/quadratic step) that give up the formal guarantee. The paper says it only "tends to" improve.
Why constrain KL instead of just capping the parameter step size?
Because equal parameter distance is not equal behavior change. In a sensitive direction a tiny weight change moves the policy a lot; in a flat direction a big weight change barely moves it. KL measures the behavior change directly, so a fixed KL budget spends an equal amount of real change in every direction. A fixed Euclidean step, which vanilla policy gradient uses, over-moves the sensitive directions.
What is the difference between TRPO and PPO?
PPO keeps TRPO’s goal, a bounded policy update, but discards the machinery. Instead of a hard KL constraint solved by conjugate gradient and a line search, PPO clips the surrogate objective so the probability ratio cannot move past a small window, and optimizes that with ordinary SGD over reused minibatches. Simpler, no Fisher matrix, comparable performance. Our explainer of Proximal Policy Optimization covers it.
Single path or vine, which should I use?
Single path if you cannot reset the simulator to arbitrary states, including on real hardware: it needs one rollout per state and no resets. Vine when you can reset and want lower-variance advantage estimates, and can afford many more simulator calls for the branch rollouts.
Why does the surrogate ever overestimate the true return?
It freezes the state visitation at the old policy’s. When you move the policy far it visits a genuinely different set of states, and the surrogate ignores that shift, so it can promise gains the real objective does not deliver. The KL penalty in the bound is exactly the price of that frozen-visitation approximation.
Footnotes & further reading
- The paper: Schulman, Levine, Moritz, Jordan, Abbeel, Trust Region Policy Optimization (UC Berkeley, ICML 2015). OpenAI's Spinning Up has a clean implementation-oriented writeup (it writes the KL constraint as new-from-old, the reverse of the paper's old-from-new; the two agree to second order, which is all the natural gradient sees).
- The advantage identity, conservative policy iteration, and the improvement bound TRPO extends: Kakade and Langford, Approximately Optimal Approximate Reinforcement Learning (ICML 2002).
- The natural gradient direction: Kakade, A Natural Policy Gradient (NeurIPS 2001), building on Amari's Natural Gradient Works Efficiently in Learning (1998).
- The inequality bridging the two distances, , is the form the paper cites from Pollard's Asymptopia. It is looser than the sharp Pinsker inequality, which with the convention reads ; using the sharp constant would halve the penalty . Both are valid, and the looser one only makes the bound more conservative.
- A small inconsistency in the paper worth knowing if you reproduce it: the walker's state dimension is given as 18 in the main text and 20 in the appendix table (8206 policy parameters either way). Every experiment uses and .
- The direct successor that traded the constraint for a clip: Schulman et al., Proximal Policy Optimization. TRPO and its relatives were also the reinforcement-learning engine behind early learning-from-feedback work such as Deep RL from Human Preferences.
How could this explainer be improved? Found an error, or something unclear? I read every message.