VerifiedarXiv:1502.0547726 min
Reinforcement learning · Policy optimization

Trust Region Policy Optimization

Bound how much the policy changes each step, and each step can be large without the policy collapsing.

One policy-gradient step that is too large can wreck a working policy. Using KL divergence to measure how much the policy's behavior changes, TRPO caps that change and takes the largest improving step inside the cap.

Explaining the paperTrust Region Policy OptimizationSchulman, Levine, Moritz, Jordan, Abbeel · UC Berkeley · ICML 2015 · arXiv:1502.05477

The policy gradient tells you which way to nudge a policy, not how far. Nudge too far and a good policy can fall apart in a single update. TRPO is a principled answer to "how far," and the answer is a bound on how much the policy's behavior is allowed to shift.

Reinforcement learning by policy gradients has a recurring failure. You collect some experience, estimate which way to push the policy to get more reward, take a step, and sometimes the policy that comes out is far worse than the one you started with. Not a little worse. Collapsed. It has forgotten how to stand, or it has locked onto one bad action and stopped exploring. You throw away the step, shrink the learning rate, and try again. The gradient was pointing the right way; the step was too big, and the landscape it was walking on turned out to bend sharply just past where it looked flat.

The reason a step can betray you is that the thing you optimized is not the thing you care about. You care about the policy's expected return. What you can actually estimate from a batch of old experience is an approximation to that return, and the approximation is only trustworthy near the policy you collected the batch with. Step a short distance and the approximation tracks the truth. Step far and the two peel apart, so the update that looked like a big win on the approximation is a loss on the real objective.

TRPO turns that observation into an algorithm. It writes down the exact relationship between two policies' returns, replaces the uncomputable part with a tractable surrogate, proves how far the surrogate can be trusted, and then, at every update, takes the largest step that stays inside the region where the trust holds. That region is measured in how much the policy's output distribution moves, its KL divergence from the old policy, rather than in raw parameter distance. A few ideas assemble the method: the exact improvement identity, the local surrogate and why it is only local, a distance on policies and the bound it buys, the trust region the bound justifies, and the natural-gradient machinery that solves for that step cheaply. None is heavy on its own.

The step-size problem, stated once

Fix the vocabulary first, because every later step leans on it. The agent lives in a Markov decision process: at each state ss it draws an action aa from a stochastic policy π(as)\pi(a\mid s) (for a neural policy, a network reads ss and outputs the parameters of a distribution over actions), the environment pays a reward and moves it to a new state, and this repeats. The single number we want to maximize is the expected discounted return, weighting a reward tt steps out by γt\gamma^t for a discount γ(0,1)\gamma\in(0,1) so the infinite sum stays finite:

η(π)=Es0,a0,[t=0γtr(st)],s0ρ0, atπ(st), st+1P(st,at)\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t\, r(s_t)\Big], \qquad s_0\sim\rho_0,\ a_t\sim\pi(\cdot\mid s_t),\ s_{t+1}\sim P(\cdot\mid s_t,a_t)

Three bookkeeping quantities make η\eta workable. The value Vπ(s)V_\pi(s) is the expected return from state ss if you follow π\pi forever. The action value Qπ(s,a)Q_\pi(s,a) is the same, but you take one specific action aa first. Their difference is the advantage:

Aπ(s,a)=Qπ(s,a)Vπ(s)A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)

The advantage answers a local, practical question: at this state, is action aa better or worse than what the policy would have done on average? Positive means better than the policy's own habit, negative means worse. It has one property worth pinning down now, because a cancellation later depends on it: averaged over the policy's own actions, the advantage is exactly zero. A quick instance makes it concrete. Suppose at some state two actions are available, the policy picks a1a_1 with probability 0.60.6 and a2a_2 with 0.40.4, and their action values are Q(s,a1)=2Q(s,a_1)=2, Q(s,a2)=0Q(s,a_2)=0. Then V(s)=0.62+0.40=1.2V(s)=0.6\cdot 2 + 0.4\cdot 0 = 1.2, so the advantages are A(s,a1)=+0.8A(s,a_1)=+0.8 and A(s,a2)=1.2A(s,a_2)=-1.2, and their policy-weighted mean is 0.6(0.8)+0.4(1.2)=00.6(0.8)+0.4(-1.2)=0. The advantage is a re-centered score, and its baseline is the current policy. Under a different policy the same advantages need not average to zero, and that nonzero average is precisely the improvement signal the next section is built on.

One last object, and one convention flag. Write ρπ(s)\rho_\pi(s) for how often the policy visits state ss, discounted by when the visit happens: ρπ(s)=tγtP(st=s)\rho_\pi(s)=\sum_{t}\gamma^t P(s_t=s). The paper leaves this unnormalized, so it sums to 1/(1γ)1/(1-\gamma) rather than to one. That is not a probability distribution, and the stray factor of 1/(1γ)1/(1-\gamma) reappears later as a harmless constant when the sum over states becomes an expectation. Keep it in view; it is the one piece of bookkeeping that silently changes shape between the theory and the implementation.

The exact improvement identity

One fact, due to Kakade and Langford (2002), carries the rest of the paper. The return of any new policy π~\tilde\pi can be written as the old return plus a single correction, the discounted sum of the old policy's advantages, but accumulated along the new policy's trajectories:

η(π~)=η(π)+Eτπ~[t=0γtAπ(st,at)]\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]
(1)

This is exact, not an approximation, and it holds for any two policies. It is also easy to believe once you see the one-line reason: the advantage Aπ(s,a)A_\pi(s,a) is, on average, the one-step temporal-difference error r(s)+γVπ(s)Vπ(s)r(s) + \gamma V_\pi(s') - V_\pi(s). Summed with discounting along a trajectory the VπV_\pi terms telescope, every intermediate value cancels against the next, and all that survives is Vπ(s0)-V_\pi(s_0) plus the trajectory's actual discounted reward. Take the expectation over trajectories generated by π~\tilde\pi and those two pieces are exactly η(π)-\eta(\pi) and η(π~)\eta(\tilde\pi). Rearrange and you have (1).

Roll the trajectory expectation up into a sum over states, weighting each state by how often π~\tilde\pi visits it, and (1) becomes a form that shows exactly where the trouble is:

η(π~)=η(π)+sρπ~(s)aπ~(as)Aπ(s,a)\eta(\tilde\pi) = \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a\mid s)\, A_\pi(s,a)
(2)

Read (2) as a recipe for improvement: to make π~\tilde\pi better than π\pi, put probability on actions whose advantage under π\pi is positive. If every state has nonnegative expected advantage the new policy is guaranteed at least as good, which is the classical reason greedy policy iteration works. But (2) cannot be optimized as written, because the visitation weight ρπ~\rho_{\tilde\pi} is the new policy's state distribution, and you do not have it: to know which states π~\tilde\pi visits you would already have to run π~\tilde\pi, which is the thing you are trying to design. The exact identity is true and useless in the same breath.

A surrogate you can only trust nearby

TRPO makes a single, deliberate cheat: use the old policy's visitation ρπ\rho_\pi in place of the unknown ρπ~\rho_{\tilde\pi}. That gives the surrogate objective:

Lπ(π~)=η(π)+sρπ(s)aπ~(as)Aπ(s,a)L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde\pi(a\mid s)\, A_\pi(s,a)
(3)

Now everything on the right is computable from data the old policy already collected. You keep the same states it visited and just reweight their actions by the new policy's probabilities. The reweighting is like re-scoring an old survey under new respondent weights, while pretending the new policy would have knocked on the same doors. For a tiny change in policy the doors barely move and the pretense costs almost nothing; for a large change the new policy genuinely visits a different population of states, and the pretense is exactly what makes the surrogate wrong.

"Almost nothing" can be made precise. For a parameterized policy πθ\pi_\theta, the surrogate matches the true return to first order at the current parameters θ0\theta_0: same value, and same gradient.

Lπθ0(πθ0)=η(πθ0),θLπθ0(πθ)θ0=θη(πθ)θ0L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta_0}
(4)

The gradients agree because the term the surrogate throws away, the change in state visitation, is itself second order in the size of the policy change: at θ0\theta_0 the policy has not moved yet, so the visitation has not moved either, and its derivative contributes nothing to first order. That first-order match is the license for optimizing LL at all. Improve LL by a small step and, because their slopes match at the start, you improve η\eta too. Improve LL by a large step and the guarantee evaporates, because past the immediate neighborhood the two curves have different curvature and LL can keep climbing while η\eta has already turned over.

The figure below makes that sentence visible; it shows the method in one slice. The amber curve is the true return as the policy moves a distance θ\theta from where it started. The teal line is the surrogate LL: it touches the true return at the start with the same height and the same slope, then keeps rising because it holds the state visitation fixed even as the true distribution drifts. The violet curve is a lower bound derived two sections from now; ignore it for a moment. Drag the trust region δ\delta and watch the chosen step slide right. A small region buys a small, certain gain. Widen it toward the peak and the true gain is largest. Widen it too far and the step sails past the peak, the surrogate is still promising more, and the true return has dropped below where you began. That last frame, the overshoot, is the collapse that opened this article.

Figure 1 · the surrogate overshoots
δ = 0.40
Moving the policy a distance θ from the old one. η is the true return (it rises, peaks, then falls); L is the surrogate, tangent to η at the start but climbing forever. The shaded band is the trust region: the largest move whose KL stays within δ. Small δ, small safe gain; too large a δ and the chosen step overshoots the true peak, past the dotted collapse marker, and the real return drops below the start. The violet curve is the minorizer from the next sections.

So the surrogate is trustworthy in a neighborhood and treacherous outside it. The rest of the paper is a precise account of how big that neighborhood is, measured in the right units.

Measuring policy change: total variation and KL

"How far the policy moved" needs a definition, and parameter distance is the wrong one: two policies with nearly identical weights can behave very differently, and two with very different weights can behave the same. What matters is how much the action distributions move. Take one state; the policy there is a distribution over actions. The bluntest distance between two distributions is the total variation, half the summed absolute difference:

DTV(pq)=12ipiqi[0,1]D_{\mathrm{TV}}(p\,\|\,q) = \tfrac12 \textstyle\sum_i |p_i - q_i| \in [0,1]

Total variation is intuitive but awkward to differentiate through, so TRPO trades it for the KL divergence DKL(pq)=ipilog(pi/qi)D_{\mathrm{KL}}(p\,\|\,q)=\sum_i p_i \log(p_i/q_i), the same quantity that shows up whenever you compare distributions in machine learning. The two are linked by a standard inequality (the paper cites Pollard, and uses a deliberately loose form of it):

DTV(pq)2DKL(pq)D_{\mathrm{TV}}(p\,\|\,q)^2 \le D_{\mathrm{KL}}(p\,\|\,q)

So bounding the KL bounds the total variation too, hence bounds how differently the two policies can ever behave. One convention to fix, because it is asymmetric and easy to get backwards: TRPO puts the old policy first, DKL(πoldπnew)D_{\mathrm{KL}}(\pi_{\mathrm{old}}\,\|\,\pi_{\mathrm{new}}). The figure lets you drag the new policy away from a fixed old one and watch both distances grow, and watch the KL cross the trust-region bound δ\delta. It also lets you check the inequality by eye: DTV2D_{\mathrm{TV}}^2 stays under the KL at every setting.

Figure 2 · a distance on policies
shift 0.50
At one state the policy is a distribution over actions. π_old is fixed; drag π_new away from it. Total variation and KL both grow; the new bars turn amber once the KL leaves the trust region δ. The bottom line shows the bound TRPO relies on holding at every setting: D_TV² stays below the KL (and below half the KL, the sharper version).

The improvement bound, and a guarantee

Now the payoff. Kakade and Langford proved a lower bound for their conservative policy-iteration scheme, which only mixed in a new policy a little at a time. TRPO's central theoretical contribution is to extend that bound to any two stochastic policies, with the KL divergence as the distance. The true return is never worse than the surrogate minus a penalty proportional to the worst-case KL between the policies:

η(π~)Lπ(π~)CDKLmax(π,π~),C=4ϵγ(1γ)2,ϵ=maxs,aAπ(s,a)\eta(\tilde\pi) \ge L_\pi(\tilde\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi), \qquad C = \frac{4\,\epsilon\,\gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a}|A_\pi(s,a)|
(5)

Two details in (5) are load-bearing and routinely garbled, so state them carefully. First, the constant grew from the conservative-policy-iteration version by a factor of two (its coefficient was 22, here 44), and that factor does not come from the KL step. It comes from a coupling argument: run the two policies with a shared random seed and they pick the same action a fraction 1α1-\alpha of the time, where α\alpha is how often they disagree (bounded by the total-variation distance from the previous section), and the error you accrue on the rare disagreements is a difference of two advantages, each as large as ϵ\epsilon, which is where the extra factor of two enters. Second, ϵ\epsilon itself was redefined: in the earlier bound it was the worst state's expected advantage, here it is the worst single Aπ(s,a)|A_\pi(s,a)| anywhere, a larger quantity. Both changes make the general bound looser, which is the price of dropping the restriction to gently mixed policies.

The bound is the violet curve you saw in Figure 1. Because it equals η\eta at the old policy and sits below it everywhere else, it is a minorizer: a floor under the true return that touches the ground exactly under your feet. That single property yields a guarantee. Let Mi(π)=Lπi(π)CDKLmax(πi,π)M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \pi) be the minorizer built at the current policy. Maximize it, move to wherever its top is, and:

η(πi+1)η(πi)  Mi(πi+1)Mi(πi)  0\eta(\pi_{i+1}) - \eta(\pi_i) \ \ge\ M_i(\pi_{i+1}) - M_i(\pi_i) \ \ge\ 0
(6)

The first inequality holds because ηMi\eta \ge M_i everywhere while η(πi)=Mi(πi)\eta(\pi_i) = M_i(\pi_i); the second holds because πi+1\pi_{i+1} was chosen to maximize MiM_i. The real return cannot decrease. This is a minorization-maximization scheme, the same logic that makes the EM algorithm climb: build a floor under the thing you cannot optimize directly, jump to the top of the floor, rebuild the floor there, repeat. The picture is an inflatable floor under a foggy mountain: it touches the terrain under your feet and lies below it everywhere else, so the highest point of the floor is real ground at least as high as where you stood.

It is worth being exact about what this guarantee covers, because the popular one-line summary of TRPO overstates it. The monotonic-improvement result is a theorem about the idealized procedure in (6): exact advantages, an exact maximization over all policies, the theoretical penalty CC, and the worst-case (maximum) KL. The algorithm you actually run replaces all four of those with approximations, and in doing so gives up the formal guarantee. The paper is candid about this: in practice TRPO "tends to" give monotonic improvement. Read the guarantee as the motivation for the design, not as a promise about any single update.

From a penalty to a trust region

If the theory hands you a bound, why not just maximize it, surrogate minus penalty, and enjoy the guarantee? Because of the constant. With a realistic discount the penalty coefficient is enormous. At γ=0.99\gamma = 0.99, the denominator (1γ)2(1-\gamma)^2 is 10410^{-4}, so C=4γϵ/(1γ)24×104ϵC = 4\gamma\,\epsilon/(1-\gamma)^2 \approx 4\times 10^4\,\epsilon, roughly forty thousand times the largest advantage. Faithfully maximizing LCDKLL - C\,D_{\mathrm{KL}} with a penalty that steep moves the policy only a hair per iteration: the bound holds, but the steps it licenses are uselessly small.

TRPO makes the pragmatic swap that names the method. Drop the penalty and impose the KL as a hard constraint instead, a trust region of a chosen radius δ\delta, and take the biggest surrogate-improving step that keeps the policy inside it:

maxθ  Lθold(θ)subject toDˉKLρθold(θold,θ)δ\max_{\theta}\; L_{\theta_{\mathrm{old}}}(\theta) \quad\text{subject to}\quad \bar{D}_{\mathrm{KL}}^{\,\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta
(7)

The swap trades a toll for a fence. The theoretical penalty is a forty-thousand-per-unit toll on policy change, so you never leave the driveway; the constraint is a fence at KL-distance δ\delta, and you drive freely right up to it. Two more adjustments make (7) practical. The bound used the maximum KL over all states, which is one constraint per state and hopeless to enforce, so TRPO uses the average KL over the states it actually visited, written DˉKL\bar{D}_{\mathrm{KL}} above, an admitted heuristic that works well. And δ\delta becomes a single tuned dial; the paper uses δ=0.01\delta = 0.01 for every experiment, a small budget of about a hundredth of a nat of average KL, so the action distribution shifts only slightly each step. That fixed, small budget is a large part of why TRPO needs so little per-task tuning: each step is as large as the trust region allows, and the trust region is the same size on every problem.

Estimating the objective from rollouts

Problem (7) is still written in terms of sums over all states and actions and the true advantage function. To run it you estimate everything from sampled experience. Three substitutions reduce (7) to something you can evaluate on a batch. The sum over states weighted by ρθold\rho_{\theta_{\mathrm{old}}} becomes an expectation over sampled states, which is where that 1/(1γ)1/(1-\gamma) mass reappears as a constant multiplier that does not move the maximizer. The advantage AA is replaced by the action value QQ, which shifts the objective only by a per-state Vθold(s)V_{\theta_{\mathrm{old}}}(s) that is constant in θ\theta and so changes neither the maximizer nor the gradient. And the sum over actions becomes an importance-sampling estimate: sample actions from some distribution qq and reweight by πθ/q\pi_\theta/q. Together they give the objective TRPO actually optimizes:

maxθ  Esρθold, aq ⁣[πθ(as)q(as)Qθold(s,a)]  s.t.  Esρθold ⁣[DKL(πθoldπθ)]δ\max_{\theta}\; \mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}},\ a\sim q}\!\Big[\frac{\pi_\theta(a\mid s)}{q(a\mid s)}\, Q_{\theta_{\mathrm{old}}}(s,a)\Big] \ \ \text{s.t.}\ \ \mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}}}\!\big[D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}\,\|\,\pi_\theta)\big] \le \delta
(8)

The importance weight is also a second reason the trust region is not optional. That reweighting is exact for any θ\theta, but the weights πθ/q\pi_\theta/q blow up as πθ\pi_\theta pulls away from the sampling policy, so the estimate's variance explodes exactly when the surrogate stops being accurate. Keeping the KL small keeps the weights tame and the surrogate accurate at the same time, one constraint doing two jobs.

That leaves how to get the QQ estimates and what to sample from, and the paper gives two schemes. The single path method is ordinary policy-gradient sampling: run the current policy for whole trajectories, treat every state on a path as a training point, and estimate its QQ as the discounted reward that actually followed it. One rollout per state, no special requirements, and it runs on a real robot. The vine method spends more to get cleaner numbers: from a set of states along the trajectories it fires several short rollouts, one per candidate action, and reuses the same random-number seed across them so the shared randomness cancels when it compares actions (common random numbers). Many rollouts per state give much lower-variance advantage estimates, but the branching needs the simulator reset to arbitrary states, so vine is simulation-only. Toggle between them:

Figure 3 · single path vs vine
simulator-only
Two ways to turn rollouts into advantage estimates. Single path scores every state on a trajectory and reads its Q off the same path: cheap, no resets, runs on real hardware, but each estimate is one noisy sample. Vine branches K rollouts per chosen state with a shared random seed, giving lower-variance advantages, at the cost of many more simulator calls and the need to reset to arbitrary states.

Taking the step: the natural gradient

One piece remains: how to actually solve the constrained problem (8) at each update, over a policy with tens of thousands of parameters, without anything blowing up. TRPO does it with two approximations and a line search. Approximate the objective by its gradient (linear in the step) and the KL constraint by its curvature (quadratic in the step). The curvature of the average KL at the current policy is a specific matrix, the Fisher information matrix FF (the code below writes it F too, since AA is already the advantage), and the step that maximizes a linear objective inside a quadratic constraint points along F1gF^{-1} g, where gg is the surrogate gradient. That direction has a name: the natural gradient (Kakade's natural policy gradient, built on Amari's natural gradient).

The natural gradient is worth a picture, because it shows geometrically why KL is the right way to measure the step. It is the same gradient gg, re-measured in KL rather than in raw parameter distance. Where the policy is sensitive, a small parameter change moves the behavior a lot, so the trust region is short in that direction; where the policy is insensitive, the region stretches. A plain gradient step ignores all of this and moves the same Euclidean distance in every direction, which can change the behavior wildly along a sensitive axis. The natural step reshapes itself to spend an equal KL budget in every direction instead. Drag the anisotropy and watch the two steps, the plain one and the KL one, pull apart:

Figure 4 · why KL, not parameter distance
2.40×
Parameter space, old policy at the centre. The gradient ∇L points uphill on the surrogate. The gray dashed circle is a Euclidean ball; the gray arrow is the plain gradient step to its edge. The teal ellipse is the KL trust region, stretched by the Fisher metric, and the teal arrow is TRPO's step: short where the policy is sensitive, long where it is not. At anisotropy 1 the two coincide.

Computing F1gF^{-1} g naively would mean forming and inverting a matrix with a row and column per parameter, which is out of the question at this scale. TRPO never forms FF. It uses conjugate gradient, which solves Fs=gF s = g using only the ability to multiply FF by a vector, and each such Fisher-vector product costs about one gradient evaluation. Ten conjugate-gradient iterations suffice, and the products are computed on a tenth of the batch, so the search direction costs roughly one more gradient on top of the one you already needed. One subtlety: FF is the analytic curvature of the KL, which integrates over actions in closed form and is positive semidefinite by construction, not the covariance of sampled gradients that some implementations substitute for it.

The direction fixes only where to go, not how far. The quadratic KL model gives a first guess at the step length: along the direction ss the model says the KL grows as 12β2sFs\tfrac12\beta^2 s^\top F s, so setting that equal to δ\delta gives β=2δ/(sFs)\beta = \sqrt{2\delta/(s^\top F s)}. Then, because the model is only a model, TRPO runs a backtracking line search from that length: shrink the step until the true surrogate actually improves and the true KL is within δ\delta. The paper notes that without this line search the algorithm occasionally takes a step that looks fine to the quadratic model and is catastrophic in reality. This is also the exact point where TRPO parts ways with the plain natural policy gradient, which takes a fixed-size step without re-checking it: TRPO re-solves the step length against the real constraint every iteration.

sF1g,β=2δsFs,θnew=θold+βs  (then backtrack)s \approx F^{-1} g, \qquad \beta = \sqrt{\frac{2\delta}{s^\top F\, s}}, \qquad \theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \beta\, s\ \ (\text{then backtrack})
(9)

Putting one full iteration in one place makes the shape concrete. The policy is a small network: for continuous control, a Gaussian whose mean comes from a two-layer network (30 to 50 hidden units) and whose diagonal covariance is a separate set of parameters independent of the state; for Atari, a convnet with two 16-filter layers and a 20-unit hidden layer, about 33,50033{,}500 parameters in all. One iteration collects a batch, estimates advantages, computes the surrogate gradient, finds the natural direction by conjugate gradient, scales it to the trust region, and backtracks to be safe:

# one TRPO iteration, single-path, reward-maximization convention
batch  = collect_trajectories(pi_old)      # states, actions, rewards
Qhat   = discounted_returns(batch)         # Monte-Carlo Q along each path
Ahat   = Qhat - value_baseline(batch.s)    # advantage estimates
g      = grad_surrogate(pi, batch, Ahat)   # gradient of L at theta_old

# search direction s ~= F^-1 g, F = average KL Hessian (never formed)
s      = conjugate_gradient(fisher_vector_product, g, iters=10)
sAs    = s @ fisher_vector_product(s)
beta   = sqrt(2 * delta / sAs)             # delta = 0.01 (the KL budget)
step   = beta * s

# backtrack: shrink until the TRUE surrogate improves AND TRUE KL <= delta
for _ in range(max_backtracks):
    theta_new = theta_old + step
    if improved(theta_new) and mean_kl(theta_old, theta_new) <= delta:
        break
    step *= 0.5

Three parts make up the complete algorithm: an objective that is a first-order-faithful proxy for the return, a KL trust region that says how far to trust it, and a natural-gradient step with a safety line search that moves as far as the region allows. Everything upstream, the identity through the bound, exists to justify the update rule (9).

Locomotion and Atari from pixels

The experiments were chosen to stress two things at once: whether a single, lightly-tuned method can learn hard control from scratch, and whether it beats the alternatives it is related to. On simulated robotic locomotion, swimming, hopping, and walking with general-purpose networks and a bare reward for moving forward, TRPO learned all three gaits. The comparison that matters is with the natural policy gradient using a fixed step instead of the KL constraint: that method handled the easy tasks but could not produce hopping or walking gaits that made forward progress. The gradient-free methods (the cross-entropy method, CMA) scaled badly with the number of parameters and did poorly on the larger problems. The reading is direct: enforcing a fixed KL trust region, rather than a fixed step or a fixed penalty, is what made the updates robust enough to solve the hard tasks with the same δ=0.01\delta = 0.01 everywhere.

The Atari results test generality rather than dominance. The same recipe, learning from raw pixels with a convnet, was run once on each of the seven games from the deep Q-learning benchmark. It reaches reasonable scores on all seven and occasionally beats the specialized prior methods, the vine variant on Q*bert scoring several times deep Q-learning's, single path topping it on Enduro and Seaquest, but it is not state of the art and does not claim to be. What it demonstrates is that one general policy-search method, unchanged, spans problems as different as a swimming robot and a pixel game. Slide through the games:

Figure 5 · Atari, one recipe, run once
Q*bert
Scores on the seven Atari games (Table 1, verbatim). TRPO's two variants in teal against a random policy, the human expert, deep Q-learning, and the tree-search method UCC-I. TRPO lands reasonable scores everywhere with one recipe and beats deep Q-learning on several games, but the task-specific methods and the human are usually ahead. Pong is a margin in [−21, 21], hence the zero line.

The machinery outlived the benchmark numbers. The trust-region idea, take the largest step that keeps the policy's behavior close in KL, became the template for the policy-optimization methods that followed. TRPO's own successor, PPO, keeps the goal and throws out the expensive parts: it replaces the hard KL constraint and the conjugate-gradient solve with a clipped surrogate you can optimize with ordinary stochastic gradient descent, trading a little of the theory for a lot of simplicity. The line from "prove how far you can trust the surrogate" to "just clip the surrogate so a single step cannot go too far" runs straight through this paper.

Provenance Verified against primary literature
Kakade & Langford (2002)The advantage identity, conservative policy iteration, and the improvement lower bound TRPO generalizes.
Kakade (2002)The natural policy gradient: the A⁻¹g search direction.
Amari (1998)Natural gradient and the Fisher-information metric on distributions.
Pollard, AsymptopiaThe D_TV² ≤ D_KL bound bridging total variation and KL.
Nocedal & WrightTrust-region optimization and conjugate gradient with matrix-vector products.
correctionThe popular summary that TRPO guarantees monotonic improvement is an overstatement. The guarantee is a theorem about the idealized penalty algorithm (exact advantages, exact maximization, the theoretical constant C, and maximum KL); practical TRPO swaps in a hard constraint, average KL, sampled estimates, and a linearized step, and only 'tends to' improve. We teach the guarantee as the motivation, not a runtime promise.

Questions you might still have

?

Does TRPO actually guarantee the policy never gets worse?
No. Only the idealized procedure, with exact advantages, an exact maximization over all policies, the theoretical penalty C, and the maximum KL, provably never decreases return. The practical algorithm makes four approximations (a hard constraint instead of the penalty, average instead of maximum KL, sampled estimates, and a linear/quadratic step) that give up the formal guarantee. The paper says it only "tends to" improve.

?

Why constrain KL instead of just capping the parameter step size?
Because equal parameter distance is not equal behavior change. In a sensitive direction a tiny weight change moves the policy a lot; in a flat direction a big weight change barely moves it. KL measures the behavior change directly, so a fixed KL budget spends an equal amount of real change in every direction. A fixed Euclidean step, which vanilla policy gradient uses, over-moves the sensitive directions.

?

What is the difference between TRPO and PPO?
PPO keeps TRPO’s goal, a bounded policy update, but discards the machinery. Instead of a hard KL constraint solved by conjugate gradient and a line search, PPO clips the surrogate objective so the probability ratio cannot move past a small window, and optimizes that with ordinary SGD over reused minibatches. Simpler, no Fisher matrix, comparable performance. Our explainer of Proximal Policy Optimization covers it.

?

Single path or vine, which should I use?
Single path if you cannot reset the simulator to arbitrary states, including on real hardware: it needs one rollout per state and no resets. Vine when you can reset and want lower-variance advantage estimates, and can afford many more simulator calls for the branch rollouts.

?

Why does the surrogate ever overestimate the true return?
It freezes the state visitation at the old policy’s. When you move the policy far it visits a genuinely different set of states, and the surrogate ignores that shift, so it can promise gains the real objective does not deliver. The KL penalty in the bound is exactly the price of that frozen-visitation approximation.

Footnotes & further reading

  1. The paper: Schulman, Levine, Moritz, Jordan, Abbeel, Trust Region Policy Optimization (UC Berkeley, ICML 2015). OpenAI's Spinning Up has a clean implementation-oriented writeup (it writes the KL constraint as new-from-old, the reverse of the paper's old-from-new; the two agree to second order, which is all the natural gradient sees).
  2. The advantage identity, conservative policy iteration, and the improvement bound TRPO extends: Kakade and Langford, Approximately Optimal Approximate Reinforcement Learning (ICML 2002).
  3. The natural gradient direction: Kakade, A Natural Policy Gradient (NeurIPS 2001), building on Amari's Natural Gradient Works Efficiently in Learning (1998).
  4. The inequality bridging the two distances, DTV2DKLD_{\mathrm{TV}}^2 \le D_{\mathrm{KL}}, is the form the paper cites from Pollard's Asymptopia. It is looser than the sharp Pinsker inequality, which with the convention DTV=12pqD_{\mathrm{TV}}=\tfrac12\sum|p-q| reads DTV212DKLD_{\mathrm{TV}}^2 \le \tfrac12 D_{\mathrm{KL}}; using the sharp constant would halve the penalty CC. Both are valid, and the looser one only makes the bound more conservative.
  5. A small inconsistency in the paper worth knowing if you reproduce it: the walker's state dimension is given as 18 in the main text and 20 in the appendix table (8206 policy parameters either way). Every experiment uses δ=0.01\delta = 0.01 and γ=0.99\gamma = 0.99.
  6. The direct successor that traded the constraint for a clip: Schulman et al., Proximal Policy Optimization. TRPO and its relatives were also the reinforcement-learning engine behind early learning-from-feedback work such as Deep RL from Human Preferences.