Verified · arXiv:2104.09864 · 19 min

RoFormer: Enhanced Transformer with Rotary Position Embedding

Rotate the query and key by their positions, and their dot product depends only on the distance between them.

Self-attention computes scores from content alone, so swapping any two tokens leaves every pairwise score unchanged. RoPE fixes that with one geometric trick: rotate the query and key before they meet, by an angle proportional to their position. The dot product then depends only on their relative distance, costs nothing extra at inference, and survives the linear-attention rewrite intact.

Explaining the paper · RoFormer: Enhanced Transformer with Rotary Position Embedding · Su, Lu, Pan, Murtadha, Wen, Liu · Zhuiyi Technology · arXiv:2104.09864

Encode a token's position by rotating it, and the attention score between any two tokens comes to depend only on how far apart they sit.

Self-attention is the only operation in a Transformer that lets a token look at any other token. It is also the only one that, by itself, cannot tell which token came first. The score between a query at position $m$ and a key at position $n$ is just the dot product of their content vectors, $\mathbf{q}_m^\top\mathbf{k}_n$; rearrange the sentence and every pairwise score is the same. So every Transformer ships with a separate position encoding, a signal injected somewhere into $\mathbf{q}_m$ and $\mathbf{k}_n$ that tells the model who sits where. The original paper added a fixed sinusoidal vector to the embedding; later variants learned the position vectors, or added a bias to the score, or built relative-distance tables. None of them is clean: most spend extra parameters, the additive-bias variants do not compose with the linear-attention rewrite, and most degrade outside their training length.

RoFormer, from Su and collaborators at Zhuiyi Technology, encodes position as a rotation. It splits the query and key vectors into $d/2$ two-dimensional planes and rotates each plane by an angle that depends on the token's absolute position. Two rotated vectors, when you dot them, produce a value that depends only on the angular difference between their rotations, which is exactly the relative offset. The encoding has no learned parameters, costs a few elementwise multiplies per query and key at inference, and slots into linear attention without breaking anything. Four ideas carry the paper: why attention without position is blind, how a rotation turns absolute positions into a relative offset, why rotating at many speeds covers both local and global, and why the resulting attention score damps with distance.

Attention is order-blind

The math hammers it home in one line. Self-attention computes a weighted average of values where the weight between a query at position $m$ and a key at position $n$ is

$$a_{m,n} = \mathrm{softmax}_n\!\left(\frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d}}\right)$$

and both $\mathbf{q}_m$ and $\mathbf{k}_n$ come straight from the token embedding, functions of the word, not its slot in the sentence. Swap any two tokens and the same pair of content vectors reappears between the same two words: the score between them never changes. Permute the entire sentence and the matrix of pairwise scores is the same matrix, just relabelled. Attention is permutation-equivariant over the input, which means the model literally cannot tell "the cat sat on the mat" from "the mat sat on the cat", since both produce the same pile of content vectors and the same scores between them.
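The symmetry is easy to check numerically. The sketch below is a toy illustration, not the paper's code: it assumes five made-up 4-dimensional embeddings (`embed` is hypothetical), uses the raw embeddings as both queries and keys with no learned projections, and compares the weight from "sat" to "mat" under two word orders.

```python
import math
import random

def attention_weights(queries, keys):
    """Row-wise softmax of q.k / sqrt(d) for lists of equal-length vectors."""
    d = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

random.seed(0)
words = ["the", "cat", "sat", "on", "mat"]
# Hypothetical content embeddings: one random 4-d vector per word.
embed = {w: [random.gauss(0, 1) for _ in range(4)] for w in words}

def weight_between(sentence, query_word, key_word):
    """Attention weight from query_word to key_word, with no position signal."""
    vecs = [embed[w] for w in sentence]
    w = attention_weights(vecs, vecs)  # queries and keys are the raw embeddings
    return w[sentence.index(query_word)][sentence.index(key_word)]

original = ["the", "cat", "sat", "on", "mat"]
shuffled = ["mat", "on", "the", "sat", "cat"]

# The same pair of words gets the same weight in either order.
same = math.isclose(weight_between(original, "sat", "mat"),
                    weight_between(shuffled, "sat", "mat"))
print(same)
```

Adding learned query and key projections would not change the outcome, because they act on each token independently of its slot.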

Figure 1 makes the symmetry visible. A query word (“sat”) and a highlighted key (“mat”) trade attention through a fixed score, drawn as a chip in the corner. Slide the arrangement: every word slides too, and so does its bar. Heights stay frozen. The readout in the corner never moves. That is the hole RoPE fills.

Figure 1 · the order-blind baseline
Five words and their attention weights from the query “sat”. The highlighted key “mat” carries a corner readout for $\mathbf{q}^\top\mathbf{k}/\sqrt{d}$ and its softmax weight (in the original order, $\mathbf{q}\cdot\mathbf{k} = -0.05$ and weight $= 0.13$). Slide the word order: every bar travels with its word and no height changes. The dot product depends only on the two content vectors, so the model gets no signal about who sat where unless position is injected.

Position as a rotation

The trick lives in the two-dimensional case. Suppose the query and key are 2D vectors $\mathbf{q}, \mathbf{k}\in\mathbb{R}^2$ and we want a function $f$ of position and content such that the dot product

$$\langle f(\mathbf{q}, m), f(\mathbf{k}, n)\rangle = g(\mathbf{q}, \mathbf{k}, m-n)$$
(goal)

depends on positions only through their difference $m-n$. The paper's answer is a rotation. Let $\mathbf{R}_\theta$ be the planar rotation matrix by angle $\theta$, and apply it with an angle proportional to the position:

$$f(\mathbf{q}, m) = \mathbf{R}_{m\theta}\,\mathbf{q}, \qquad f(\mathbf{k}, n) = \mathbf{R}_{n\theta}\,\mathbf{k}$$
(1)

Rotations are orthogonal, so $\mathbf{R}_\alpha^\top\mathbf{R}_\beta = \mathbf{R}_{\beta-\alpha}$, and the dot product collapses to

$$\langle\mathbf{R}_{m\theta}\mathbf{q},\;\mathbf{R}_{n\theta}\mathbf{k}\rangle = \mathbf{q}^\top \mathbf{R}_{(n-m)\theta}\,\mathbf{k}$$
(2)

The right side is a function of the content vectors and the relative offset $n-m$ alone. The absolute positions $m$ and $n$ have disappeared: any pair with the same gap scores identically. That is the property every prior relative-position scheme was hand-engineering by adding scalar biases $b_{m,n}$ to the score; RoPE gets it from geometry, with no extra term in the formula and no learned parameters. Figure 2 makes the property concrete. Two unit vectors stand in for $\mathbf{q}$ and $\mathbf{k}$; two sliders rotate them by $m\theta$ and $n\theta$; the score on the right is $\mathbf{q}^\top \mathbf{R}_{(n-m)\theta}\,\mathbf{k} = \cos((n-m)\theta + \varphi)$, where $\varphi$ is the original angle between the two un-rotated vectors. Lock the offset and drag either slider: both move together and the score never changes. Unlock it and move one alone: the score sweeps through the cosine.
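A few lines of arithmetic confirm the cancellation. This is a minimal sketch under assumed values: $\theta = 0.4$ (the figure's demo rate), a unit query along the x-axis, and a unit key at an arbitrary angle `phi = 0.3`; the helper names are made up for illustration.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def rope_score(q, k, m, n, theta):
    """Dot product of q rotated by m*theta and k rotated by n*theta."""
    qm, kn = rotate(q, m * theta), rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

theta = 0.4                         # demo rotation rate (assumed)
phi = 0.3                           # angle between the un-rotated q and k (assumed)
q = (1.0, 0.0)
k = (math.cos(phi), math.sin(phi))

# Two pairs with different absolute positions but the same gap n - m = 3.
near = rope_score(q, k, m=2, n=5, theta=theta)
far = rope_score(q, k, m=40, n=43, theta=theta)
closed_form = math.cos((5 - 2) * theta + phi)

print(math.isclose(near, far, abs_tol=1e-12))          # same gap, same score
print(math.isclose(near, closed_form, abs_tol=1e-12))  # matches cos((n-m)θ + φ)
```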

Figure 2 · rotation gives relative position for free
Two unit vectors drawn as $\mathbf{R}_m\mathbf{q}$ (teal) and $\mathbf{R}_n\mathbf{k}$ (amber), each rotated by its absolute position. The dashed ghosts show the un-rotated $\mathbf{q}_0$ and $\mathbf{k}_0$. The right panel reads out $n-m$ and the resulting score $\cos((n-m)\theta + \varphi)$. Slide $m$ and $n$ independently and the score moves; flip "lock offset" on and slide either one and both move together with the score frozen. $\theta = 0.4$ here, picked for legibility; in real RoPE with $d = 128$ the fastest plane is $\theta_1 = 1$ and the slowest is $\theta_{64} \approx 1.16\times 10^{-4}$.

The 2D case is the whole idea; the general case stacks it. A $d$-dimensional query splits into $d/2$ pairs of coordinates, each treated as a 2D plane and rotated independently:

$$\mathbf{R}^d_{m,\Theta} = \begin{pmatrix} \mathbf{R}_{m\theta_1} & & \\ & \ddots & \\ & & \mathbf{R}_{m\theta_{d/2}} \end{pmatrix}$$
(3)

with a different rotation rate $\theta_i$ per plane (the next section sets the rates). The matrix is block-diagonal and almost entirely zero, so in code RoPE is implemented as a handful of elementwise multiplies and adds on paired coordinates, not an actual matrix multiply: the cost is linear in $d$, and it slots in just before the dot product in every attention head.
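To make the "no matrix multiply" point concrete, here is a hedged sketch that applies the rotation pair by pair and checks it against the dense block-diagonal matrix of Equation 3. It assumes adjacent coordinates $(x_1, x_2), (x_3, x_4), \ldots$ form the planes (some code bases pair the first half with the second half instead) and uses the base-10000 rate schedule; the function names are illustrative.

```python
import math

def rope_angles(d, base=10000.0):
    """One rate per plane; index i = 0..d/2-1 here is (i-1) in 1-indexed notation."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def apply_rope(x, position, base=10000.0):
    """Rotate each adjacent coordinate pair by position * theta_i (elementwise)."""
    out = []
    for i, theta in enumerate(rope_angles(len(x), base)):
        c, s = math.cos(position * theta), math.sin(position * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [c * a - s * b, s * a + c * b]
    return out

def rope_matrix(d, position, base=10000.0):
    """Dense block-diagonal rotation matrix, built only for comparison."""
    R = [[0.0] * d for _ in range(d)]
    for i, theta in enumerate(rope_angles(d, base)):
        c, s = math.cos(position * theta), math.sin(position * theta)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]   # arbitrary 8-d query
y = [1.0, 0.5, -0.5, 2.0, 0.1, -0.2, 0.7, 0.3]     # arbitrary 8-d key

fast = apply_rope(x, position=7)
dense = [dot(row, x) for row in rope_matrix(len(x), position=7)]

# Same gap (7) at different absolute positions gives the same score.
score_near = dot(apply_rope(x, 3), apply_rope(y, 10))
score_far = dot(apply_rope(x, 103), apply_rope(y, 110))
```

The elementwise version touches each coordinate a constant number of times, while the dense version does $d^2$ multiplications that are mostly by zero.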

Many speeds, one position

One rotation rate would only encode one wavelength of position. The paper uses many, on the same schedule the original Transformer used for its sinusoidal embedding, scaled geometrically from fast to slow:

$$\theta_i = 10000^{-2(i-1)/d}, \qquad i = 1, 2, \ldots, d/2$$
(4)

The fastest plane ($i=1$) has $\theta_1 = 1$ radian per token, so it whips through a full revolution every $2\pi \approx 6.28$ tokens. The slowest plane ($i=d/2$, with $d=128$) has $\theta_{64}=10000^{-126/128} \approx 1.16\times 10^{-4}$, so its wavelength $\lambda = 2\pi/\theta$ is about $54{,}000$ tokens. Across the $d/2$ planes the wavelengths span four orders of magnitude, from a handful of tokens to tens of thousands. Fast planes resolve adjacent positions; slow planes never repeat across any context the model will see. The base $10000$ is borrowed verbatim from the sinusoidal encoding of Vaswani et al., and it is one of the knobs the long-context extensions reach for: NTK-aware scaling and YaRN raise the base to stretch the slow planes; Position Interpolation instead rescales the position index $m$ itself so the same wavelengths cover a longer context.
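The numbers in this paragraph follow directly from Equation 4. The short sketch below recomputes them for $d = 128$; the function name `rope_rates` is illustrative.

```python
import math

def rope_rates(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1..d/2 (Equation 4)."""
    return [base ** (-2 * (i - 1) / d) for i in range(1, d // 2 + 1)]

rates = rope_rates(128)
wavelengths = [2 * math.pi / t for t in rates]   # tokens per full revolution

fastest, slowest = rates[0], rates[-1]
print(f"fastest: {fastest:.3g} rad/token, wavelength {wavelengths[0]:.2f} tokens")
print(f"slowest: {slowest:.3g} rad/token, wavelength {wavelengths[-1]:.0f} tokens")
```

The ratio between the slowest and fastest wavelengths is just under $10^4$, which is the "four orders of magnitude" quoted above.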

Figure 3 stacks five representative planes as clock dials. Slide the position: every dial turns at once, but the top one whips around dozens of times while the bottom one barely moves. The single token index $m$ drives all the rotations; the speed gradient is what gives one position encoding many length-scales.

Figure 3 · five frequencies, one position
Five of the $d/2 = 64$ planes (head width $d=128$), drawn as clock dials. Each dial rotates at its own rate $\theta_i = 10000^{-2(i-1)/d}$, so the fastest plane turns one radian per token while the slowest plane needs tens of thousands of tokens for one revolution. The same position slider drives them all; the wavelengths printed beside each dial run from about 6 tokens (fast, local) to 54,000 tokens (slow, global).

Two observations matter for the next sections. First, the encoding is built from absolute positions but the score it produces depends only on the relative offset (Equation 2 again, now stacked across planes): absolute and relative are not in conflict, they are different views of the same construction. Second, because nothing is learned, RoPE does not run out of frequencies at sequence lengths it has never seen: Equation 4 assigns a well-defined angle to every position. What degrades past the training length is the model, not the encoding. The fast planes have already wrapped around many times during training, but the slow planes reach rotation angles the model never saw, and that is the failure mode the long-context patches target.

Distance damps the score

Group the rotated coordinates of $\mathbf{q}_m$ and $\mathbf{k}_n$ into $d/2$ planes, and the attention score is a sum of complex phasors,

$$\mathbf{q}_m^\top \mathbf{R}^d_{(n-m),\Theta}\,\mathbf{k}_n \;=\; \mathrm{Re}\sum_{i=1}^{d/2} h_i\,e^{\,\mathrm{i}\,(n-m)\theta_i}$$
(5)

with one complex coefficient $h_i$ per plane from the content vectors. The geometry behind the decay is easier to see than the algebra. Drop the content for a moment and look at the bare phasors $e^{\mathrm{i}(n-m)\theta_i}$: at $|m-n|=0$ every phasor sits at $+1$ on the real axis (the rotation hasn't started yet) and their sum has magnitude $d/2$, the maximum. As $|m-n|$ grows the fastest plane whips around the unit circle while the slowest barely moves, so the phasors fan out into a cloud. By the time the slowest plane has rotated a substantial fraction of a revolution the phasors point in roughly independent directions and their sum behaves like a random walk of $d/2$ steps: magnitude of order $\sqrt{d/2}$ by the central limit theorem. The score collapses from $d/2 = 64$ to $\sqrt{d/2} \approx 8$, a factor of eight, just from this geometric loss of coherence.
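The fan-out can be measured directly. The sketch below drops the content coefficients (it sets every $h_i = 1$, which is an assumption made purely for illustration) and reports the magnitude of the bare phasor sum at a few offsets for $d = 128$.

```python
import cmath

def phasor_sum_magnitude(offset, d=128, base=10000.0):
    """|sum_i exp(1j * offset * theta_i)| over the d/2 RoPE planes."""
    rates = [base ** (-2 * i / d) for i in range(d // 2)]
    return abs(sum(cmath.exp(1j * offset * t) for t in rates))

offsets = (0, 1, 10, 100, 1000)
magnitudes = {off: phasor_sum_magnitude(off) for off in offsets}

for off, mag in magnitudes.items():
    print(f"|m-n| = {off:5d}   |sum| = {mag:6.2f}")
```

At zero offset the sum is exactly $d/2 = 64$; at every other offset it is strictly smaller, and how quickly it approaches the $\sqrt{d/2}$ scale depends on how many slow planes are still close to in-phase.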

Figure 4 · phasors fan out as the offset grows
Sixty-four phasors $e^{\mathrm{i}(n-m)\theta_i}$, one per RoPE plane, drawn on the unit circle. The colours run from teal (the fastest plane) to amber (the slowest). The bold teal arrow is their sum; the dashed amber circle is the central-limit floor at $\sqrt{d/2}$. At $|m-n|=0$ every arrow points right and the sum reaches the full radius. Drag the offset up: fast arrows whip around while slow ones barely move, the cloud spreads, and the bold sum shrinks toward the dashed floor. The score in Figure 5 is the same story plotted against distance instead of as a snapshot.

Section 3.4.3 of the paper turns that geometric story into a bound. By Abel summation, factor the magnitude of the phasor sum into a content-only constant and a sum-of-partial-sums of the phasors,

$$\Big|\sum_i h_i\, e^{\mathrm{i}(n-m)\theta_i}\Big| \;\le\; \big(\max_i |h_{i+1}-h_i|\big)\cdot \sum_{j=1}^{d/2} |S_j|, \quad S_j = \sum_{i=1}^{j} e^{\mathrm{i}(n-m)\theta_i}$$
(6)

The first factor is a constant of the content; the second is purely geometric and is the long-term decay RoPE inherits. The damping is not a bias the model learns or a window the architect chooses (the way ALiBi or sliding windows do); it is a property of summing many rotating numbers at incommensurate speeds. The curve is an oscillating envelope, not a monotone slide. At certain offsets several planes line up by accident and the bound briefly rises again. Figure 5 plots the geometry factor $B(|m-n|) = \tfrac{2}{d}\sum_j |S_j|$ against distance, and you can see the wiggles riding on a falling trend. The figure also exposes the knobs the later long-context work reaches for. Slide the base and the curve flattens or steepens. Toggle Position Interpolation and the same curve gets stretched, because the position index $m$ is divided by the extension factor before the angles are computed (Chen et al. 2023). Toggle NTK-aware scaling and the base is multiplied by $s^{d/(d-2)}$ instead, where $s$ is the extension factor (Peng et al.'s YaRN builds on the same base change); the half-decay marker jumps past the training-length line as either patch kicks in.

Figure 5 · the score damps with distance
The geometry factor $B(|m-n|) = \tfrac{2}{d}\sum_j |S_j|$ against relative distance. Plotted relative to its zero-offset value, it starts at $1$ (every phasor in phase at zero offset) and trends down as distance grows, because phasors at different frequencies fall out of phase. The decay is not monotone: visible wiggles ride the falling envelope where planes accidentally re-align. The dashed vertical is a demo training length $L_{\text{train}} = 32$. Slide the base to see how raw RoPE responds; toggle PI×4 (rescales the position index, paper: Chen et al. 2023) or NTK×4 (multiplies the base by $s^{d/(d-2)} \approx 4.09$ for $s = 4$ and $d=128$; YaRN builds on the same base change). The faint dashed teal curve is the vanilla baseline at the same base, so the regime's shift is visible.

The decay comes with no learned bias, no window, and no extra parameter. The paper frames it as a built-in inductive prior toward local attention; the modern reading is that it is a soft recency bias that can be tuned away with one constant or by rescaling the position index when you want longer context.
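The geometry factor and both patches fit in a few lines. The sketch below is an illustration under stated assumptions: it divides by the zero-offset value so the curve starts at 1 (my normalization, chosen to match the figure caption), models Position Interpolation as dividing the offset by the scale, and models NTK-aware scaling as multiplying the base by $s^{d/(d-2)}$. Function and parameter names are made up.

```python
import cmath

def geometry_factor(offset, d=128, base=10000.0, pi_scale=1.0):
    """B = (2/d) * sum_j |S_j| with S_j = sum_{i<=j} exp(1j * (offset/pi_scale) * theta_i)."""
    partial, total = 0j, 0.0
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        partial += cmath.exp(1j * (offset / pi_scale) * theta)  # running partial sum S_j
        total += abs(partial)
    return total / (d // 2)

def ntk_base(base, scale, d):
    """NTK-aware base for extension factor `scale`."""
    return base * scale ** (d / (d - 2))

d = 128
b0 = geometry_factor(0, d)   # every phasor in phase: (d/2 + 1) / 2

offset = 64                  # a distance beyond the demo training length of 32
vanilla = geometry_factor(offset, d) / b0
interpolated = geometry_factor(offset, d, pi_scale=4.0) / b0
ntk = geometry_factor(offset, d, base=ntk_base(10000.0, 4.0, d)) / b0

print(f"NTK base for s=4: {ntk_base(10000.0, 4.0, d):.0f}")
print(f"vanilla {vanilla:.3f}   PIx4 {interpolated:.3f}   NTKx4 {ntk:.3f}")
```

With interpolation, the factor at offset 64 is by construction the vanilla factor at offset 16, which is the sense in which the curve is stretched rather than reshaped.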

It rides linear attention

Plain softmax attention reads every query-key pair, an $N\times N$ matrix at sequence length $N$, and additive relative-position schemes (T5's bias, ALiBi) inherit the same cost: the bias $b_{m,n}$ is a per-pair number, so you still touch every pair to apply it. RoPE does not. A rotation is multiplicative, and that means it composes with the trick that turns attention from $O(N^2)$ into $O(N)$: the linear-attention rewrite.

Linear attention replaces the softmax with a feature map $\phi$ applied to queries and keys, so the score factors as $\phi(\mathbf{q})^\top \phi(\mathbf{k})$, and that lets you swap the order of operations. Instead of computing every $\phi(\mathbf{q}_m)^\top \phi(\mathbf{k}_n)\,\mathbf{v}_n^\top$, you build one running accumulator over keys and values,

$$\mathbf{S} \;=\; \sum_{n=1}^{N} \phi(\mathbf{k}_n)\,\mathbf{v}_n^\top \quad\in\mathbb{R}^{d\times d}$$
(7)

and let each query read from it: $\phi(\mathbf{q}_m)^\top\mathbf{S}$. One pass to build the accumulator, one read per query, total cost $O(N)$ instead of $O(N^2)$. The catch is positions. An additive bias $b_{m,n}$ sits inside the softmax and depends on the pair, so it can't be pulled out and you have to walk every pair anyway. RoPE's rotation acts on each vector separately, outside the pairwise score, and factors as

$$\langle\mathbf{R}_{m\theta}\mathbf{q},\;\mathbf{R}_{n\theta}\mathbf{k}\rangle \;=\; \mathbf{q}^\top\mathbf{R}_{m\theta}^\top\,\mathbf{R}_{n\theta}\mathbf{k} \;=\; \mathbf{q}^\top\mathbf{R}_{(n-m)\theta}\mathbf{k}$$

and Section 3.3 of the paper does the natural composition: apply the feature map first, then rotate its output. The key side stores $\mathbf{R}_{n\theta}\phi(\mathbf{k}_n)$, the query side computes $\mathbf{R}_{m\theta}\phi(\mathbf{q}_m)$, and the inner product is $\phi(\mathbf{q}_m)^\top \mathbf{R}_{(n-m)\theta}\,\phi(\mathbf{k}_n)$, a bilinear form in the feature-map outputs with the relative-offset rotation in the middle. The order matters: for a general nonlinear $\phi$ the rotation does not commute with the feature map, so rotating before $\phi$ would not produce this relative-offset form. Rotating after $\phi$ does, and the accumulator structure carries over: build $\mathbf{S} = \sum_n \mathbf{R}_{n\theta}\phi(\mathbf{k}_n)\mathbf{v}_n^\top$ in one pass, read from it once per query, and the cost stays $O(N)$. One caveat from the paper: the rotated terms can be negative, so the normalizing denominator is left un-rotated to avoid dividing by zero. Among the relative-position schemes of the era, RoPE is the one that composes with linear attention this way, which is part of why so many later architectures adopted it.
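The accumulator argument can be checked on a toy example. The sketch below is not the paper's implementation: it assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$ from Katharopoulos et al., rotates the feature-map outputs (matching the $\mathbf{R}_{n\theta}\phi(\mathbf{k}_n)$ form in the text), computes only the un-normalized numerator, and is non-causal. It then compares the pairwise $O(N^2)$ loop with the single-accumulator $O(N)$ version.

```python
import math
import random

def phi(x):
    """elu(x) + 1: x + 1 for positive entries, exp(x) otherwise (always positive)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def rotate_pairs(x, pos, base=10000.0):
    """RoPE on adjacent coordinate pairs at integer position `pos`."""
    d, out = len(x), []
    for i in range(d // 2):
        ang = pos * base ** (-2 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        out += [c * x[2 * i] - s * x[2 * i + 1], s * x[2 * i] + c * x[2 * i + 1]]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def numerator_pairwise(Q, K, V):
    """O(N^2): visit every (query, key) pair."""
    out = []
    for m, q in enumerate(Q):
        rq = rotate_pairs(phi(q), m)
        row = [0.0] * len(V[0])
        for n, (k, v) in enumerate(zip(K, V)):
            w = dot(rq, rotate_pairs(phi(k), n))
            row = [r + w * x for r, x in zip(row, v)]
        out.append(row)
    return out

def numerator_accumulated(Q, K, V):
    """O(N): fold every rotated key into S once, then one read per query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    for n, (k, v) in enumerate(zip(K, V)):
        rk = rotate_pairs(phi(k), n)
        for a in range(d):
            for b in range(dv):
                S[a][b] += rk[a] * v[b]
    return [[sum(rq[a] * S[a][b] for a in range(d)) for b in range(dv)]
            for rq in (rotate_pairs(phi(q), m) for m, q in enumerate(Q))]

random.seed(1)
N, d, dv = 6, 4, 3   # toy sizes (assumed)
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(dv)] for _ in range(N)]

slow = numerator_pairwise(Q, K, V)
fast = numerator_accumulated(Q, K, V)
agree = all(math.isclose(a, b, abs_tol=1e-9)
            for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print(agree)
```

The pair dependence never has to be stored: it appears only when a rotated query is multiplied into the accumulator.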

Figure 6 · multiplicative beats additive
Toggle the regime and drag $N$. With RoPE + linear attention, every key is rotated by its position and folded into one accumulator $\mathbf{S}$ that every query reads: cost $O(N)$, one box for any sequence length. With softmax or additive relative bias, the position term sits inside the softmax and forces the full $N\times N$ grid: cost $O(N^2)$. The cost bars at the bottom show the quadratic bar dwarfing the linear one well before $N=100$.

Modest tables, total adoption

The paper validates RoPE on the small benchmarks that were standard in early 2021. On machine translation (WMT 2014 English-to-German, Table 1) RoFormer reaches 27.5 BLEU against the original Transformer's 27.3, a slim +0.2. On enwik8 character-level pre-training the paper reports a faster convergence curve for Performer + RoPE versus Performer alone, not a parity claim against BERT. On downstream GLUE (Table 2) RoFormer trails BERT on SST-2, QNLI, and MNLI and beats it on MRPC, STS-B, and QQP, a mixed result. The cleanest gains are on a long-document Chinese benchmark (CAIL2019-SCM, a legal case-matching task), which is where the ability to handle longer inputs starts to matter. The empirical case in 2021 was modest, and the authors are careful to say so.

What carried RoPE out of the paper and into nearly every open-weights LLM was the engineering shape, not the table. The rotation is parameter-free, so it cannot be the thing that gets fine-tuned wrong on a new corpus. It is local: every query and key carries its own rotation and need not consult any other token, which composes with linear attention (where the rotation is applied to the feature-map outputs) and with kernel-level fast-attention rewrites alike. It exposes one interpretable knob, the base $10{,}000$, so when later work needed to push context past the training length it had a clean lever to pull: NTK-aware scaling and YaRN raise the base; Position Interpolation rescales the position index itself. It separates the question of "which token is this" from "where does it sit", so weights trained at one context length transfer cleanly to another. LLaMA, GPT-NeoX, PaLM, Mistral, and most open-weights models since 2022 use RoPE, and nearly every long-context patch is a tweak of its schedule (the base, or an effective rescaling of position) rather than a replacement.

Everything rests on one move: a 2D rotation by the absolute position, which turns the dot product into a function of the relative position with no extra term. Stack $d/2$ such rotations at geometric speeds and one position index yields many wavelengths, fast for local and slow for global. The phasor sum that results has an upper bound that falls with distance (by Abel summation), an inductive prior toward local attention you can re-tune with a single constant. None of it is learned and it costs next to nothing at inference, and because the rotation factors out of the linear-attention accumulator, it survives the fast-attention rewrites that additive biases cannot. That combination, not the modest 2021 tables, is why it became the default.

Provenance · Verified against primary literature
RoFormer (2021). The source paper: the 2D rotation, the d/2-plane stack, Section 3.4.3 decay, Section 3.3 linear-attention compatibility, and the empirical tables on MT and GLUE.
Vaswani et al. (2017). The sinusoidal schedule with base 10000 that RoPE adopts verbatim for the per-plane angular speeds.
Shaw / T5 / ALiBi. The additive relative-bias schemes RoPE replaces; the contrast with multiplicative rotation is the basis of the linear-attention argument.
Katharopoulos et al. (2020). The linear-attention rewrite that RoPE survives because rotations factor out of the accumulator.
Position Interpolation / YaRN. Long-context extensions that all tune RoPE’s base or effective wavelength; they did not exist when the paper was written but justified the design retroactively.
Correction. RoPE was not adopted because of RoFormer’s empirical tables, which are modest (WMT 2014 En→De +0.2 BLEU; GLUE mixed; enwik8 shown only as a Performer convergence curve, not a parity claim against BERT). It was adopted because the encoding has no learned parameters, slots into kernel-level fast attention, and exposes interpretable knobs (the base 10000 and the position index) that long-context work can retune without retraining. The decay claim is an upper bound on the score, not an exact decay curve, and it is an oscillating envelope rather than a monotone slide. The 10000 base is borrowed from Vaswani et al., not derived in the paper. We write the relative offset as n−m to match the natural construction R_mᵀ R_n = R_{n−m}; the paper’s Eq 11 writes it as m−n. The two are equivalent up to a sign in θ, and the decay envelope depends only on |m−n|. The schedule index in Eq 4 starts at i=1 (paper Section 3.2.2); some code bases use i=0, with the same set of rates.

Questions you might still have

Why does a rotation turn absolute position into relative position?
Rotations are orthogonal, so when you transpose one and multiply it by another, only the difference of the angles survives: R_mᵀ R_n = R_{n−m}. Rotate the query by mθ and the key by nθ, then take the dot product, and the result depends only on n−m (which the paper, with a sign-flipped convention, writes as m−n). The absolute positions cancel out of the score by construction, with no extra term in the formula and no learned parameters.

Why d/2 planes and not just one rotation in the whole d-dim space?
One rotation has one wavelength. You want short wavelengths to resolve nearby tokens and long wavelengths to span the whole sequence, like a single position index driving a multi-band radio. Splitting into d/2 independent 2D planes lets each plane spin at its own rate θ_i = 10000^{-2(i-1)/d}, geometrically spaced from ≈1 radian per token down to ≈1.16e-4 radians per token. The same position drives all of them, so you encode every length-scale at once.

Why does an additive relative bias break linear attention but rotation doesn’t?
An additive bias b_{m,n} sits inside the softmax as a per-pair number. To apply it you have to walk every query-key pair, which is exactly the N² cost linear attention was designed to avoid. A rotation is multiplicative and acts on each vector separately, outside the pairwise score. You can push the query’s rotation onto the query side and the key’s rotation onto the key side, fold every (rotated) key into one running accumulator, and let each (rotated) query read from it. The pair-dependence emerges in the read, not in the storage. The paper applies the rotation to the outputs of the feature map φ, so the numerator is exactly φ(q)ᵀ R_{n−m} φ(k); rotating before a nonlinear φ would not give this form, because the rotation does not commute with φ.

How do people stretch RoPE to longer context?
Two knobs. NTK-aware scaling and YaRN raise the base 10000 to flatten the decay curve and stretch the slow planes’ wavelengths past anything in training. Position Interpolation rescales the position index m itself, so the model sees the same range of angles even on a longer sequence. The structural payoff of separating the rotation construction from the trained weights is that both knobs can be turned at inference time with a small fine-tune or none at all.

Footnotes & further reading

  1. The paper: Su, Lu, Pan, Murtadha, Wen, Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding (Zhuiyi Technology, 2021; revised through 2023).
  2. The sinusoidal schedule RoPE borrows the base 10000 from: Vaswani et al., Attention Is All You Need (2017).
  3. The relative-position predecessors RoPE displaces: Shaw, Uszkoreit, Vaswani, Self-Attention with Relative Position Representations (2018), and T5's scalar bias, in Raffel et al., Exploring the Limits of Transfer Learning (2019).
  4. The linear-attention rewrite RoPE composes with: Katharopoulos, Vyas, Pappas, Fleuret, Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020).
  5. Long-context extensions of RoPE: Chen et al., Extending Context Window of Large Language Models via Positional Interpolation (2023), which rescales the position index; Peng et al., YaRN (2023), which combines a base change with attention temperature; plus the NTK-aware scaling community write-ups, which raise the base.