RoFormer: Enhanced Transformer with Rotary Position Embedding
Rotate the query and key by their positions, and their dot product depends only on the distance between them.
Self-attention computes scores from content alone, so swapping any two tokens leaves every pairwise score unchanged. RoPE fixes that with one geometric trick: rotate the query and key before they meet, each by an angle proportional to its position. The dot product then depends on position only through the relative distance, adds no parameters and almost no compute at inference, and survives the linear-attention rewrite intact.
Encode a token's position by rotating it, and the attention score between any two tokens comes to depend only on how far apart they sit.
Self-attention is the only operation in a Transformer that lets a token look at any other token. It is also the only one that, by itself, cannot tell which token came first. The score between a query at position $m$ and a key at position $n$ is just the dot product of their content vectors, $q_m^\top k_n$; rearrange the sentence and every pairwise score is the same. So every Transformer ships with a separate position encoding, a signal injected somewhere into $q$ and $k$ that tells the model who sits where. The original paper added a fixed sinusoidal vector to the embedding; later variants learned the position vectors, or added a bias to the score, or built relative-distance tables. None of them are clean: most spend extra parameters, none compose with the modern fast-attention rewrites, and most degrade outside their training length.
RoFormer, from Su and collaborators at Zhuiyi Technology, encodes position as a rotation. It splits the query and key vectors into two-dimensional planes and rotates each plane by an angle that depends on the token's absolute position. Two rotated vectors, when you dot them, produce a value that depends only on the angular difference between their rotations, which is exactly the relative offset. The encoding has no learned parameters, costs a few cheap elementwise operations per query and key at inference, and slots into linear attention without breaking anything. Four ideas carry the paper: why attention without position is blind, how a rotation turns absolute positions into a relative offset, why rotating at many speeds covers both local and global, and why the resulting attention score damps with distance.
Attention is order-blind
The math hammers it home in one line. Self-attention computes a weighted average of values where the weight between a query at position $m$ and a key at position $n$ is
$$a_{m,n} = \frac{\exp\!\big(q_m^\top k_n / \sqrt{d}\big)}{\sum_{j} \exp\!\big(q_m^\top k_j / \sqrt{d}\big)},$$
and both $q_m$ and $k_n$ come straight from the token embedding, functions of the word, not its slot in the sentence. Swap any two tokens and the same pair of content vectors reappears between the same two words: the score between them never changes. Permute the entire sentence and the matrix of pairwise scores is the same matrix, just relabelled. Attention is permutation-equivariant over the input, which means the model literally cannot tell “the cat sat on the mat” from “the mat sat on the cat”, since both produce the same pile of content vectors and the same scores between them.
Figure 1 makes the symmetry visible. A query word (“sat”) and a highlighted key (“mat”) trade attention through a fixed score, drawn as a chip in the corner. Slide the arrangement: every word slides too, and so does its bar. Heights stay frozen. The readout in the corner never moves. That is the hole RoPE fills.
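The symmetry is easy to check numerically. The sketch below is a minimal illustration, not the paper's code: it assumes queries and keys are the raw word embeddings (no learned projections), and the `EMBED` table and the helper names are hypothetical values made up for the example.

```python
import math

def attention_weights(vectors):
    """Softmax attention weights computed from content vectors alone."""
    d = len(vectors[0])
    rows = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        top = max(scores)                      # subtract the max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Hypothetical content vectors, one per word (illustrative values only).
EMBED = {
    "the": [0.1, 0.3], "cat": [1.0, 0.2], "sat": [0.4, 0.9],
    "on": [0.2, 0.1], "mat": [0.8, 0.7],
}

def weight_between(sentence, query_word, key_word):
    """Attention weight from query_word to key_word inside `sentence`."""
    vectors = [EMBED[w] for w in sentence]
    rows = attention_weights(vectors)
    return rows[sentence.index(query_word)][sentence.index(key_word)]

original = ["the", "cat", "sat", "on", "mat"]
shuffled = ["mat", "on", "the", "sat", "cat"]

w_original = weight_between(original, "sat", "mat")
w_shuffled = weight_between(shuffled, "sat", "mat")
print(w_original, w_shuffled)  # equal up to floating-point rounding
```

Reordering the words only reorders the rows and columns of the weight matrix; the weight from “sat” to “mat” is untouched, which is the blindness a position encoding has to repair.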
Position as a rotation
The trick lives in the two-dimensional case. Suppose the query and key are 2D vectors and we want a function $f$ of position and content such that the dot product
$$\langle f(q, m),\, f(k, n) \rangle$$
depends on positions only through their difference $n - m$. The paper's answer is a rotation. Let $R(\alpha)$ be the planar rotation matrix by angle $\alpha$, and apply it with an angle proportional to the position:
$$f(q, m) = R(m\theta)\, q, \qquad f(k, n) = R(n\theta)\, k.$$
Rotations are orthogonal, so $R(m\theta)^\top R(n\theta) = R\big((n - m)\theta\big)$, and the dot product collapses to
$$\big(R(m\theta)\, q\big)^\top \big(R(n\theta)\, k\big) = q^\top R\big((n - m)\theta\big)\, k.$$
The right side is a function of the content vectors and the relative offset alone. The absolute positions $m$ and $n$ have disappeared: any pair with the same gap scores identically. That is the property every prior relative-position scheme was hand-engineering by adding scalar biases to the score; RoPE gets it from geometry, with no extra term in the formula and no learned parameters. Figure 2 makes the property concrete. Two unit vectors stand in for $q$ and $k$; two sliders rotate them by $m\theta$ and $n\theta$; the score on the right is $\cos\big(\gamma + (n - m)\theta\big)$, where $\gamma$ is the original angle between the two un-rotated vectors. Lock the offset and drag either slider: both move together and the score never changes. Unlock it and move one alone: the score sweeps through the cosine.
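The two-slider experiment can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the rotation rate `theta=0.5` and the two unit vectors are arbitrary illustrative choices, and `rotate` and `rope_score_2d` are hypothetical helper names.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def rope_score_2d(q, k, m, n, theta=0.5):
    """Dot product of q rotated by m*theta and k rotated by n*theta."""
    qm, kn = rotate(q, m * theta), rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 0.0), (0.6, 0.8)            # two unit vectors
same_gap_early = rope_score_2d(q, k, m=2, n=5)
same_gap_late = rope_score_2d(q, k, m=40, n=43)
wider_gap = rope_score_2d(q, k, m=2, n=6)

# Closed form for unit vectors: cos(gamma + (n - m) * theta),
# where gamma is the angle from q to k before any rotation.
gamma = math.atan2(k[1], k[0]) - math.atan2(q[1], q[0])
closed_form = math.cos(gamma + 3 * 0.5)

print(same_gap_early, same_gap_late, closed_form, wider_gap)
```

The first three printed values agree up to rounding because they share the offset of 3; the last differs because the offset is 4. Nothing about the absolute positions 2, 5, 40, or 43 survives in the score.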
The 2D case is the whole idea; the general case stacks it. A $d$-dimensional query splits into $d/2$ pairs of coordinates, each treated as a 2D plane and rotated independently:
$$f(q, m) = R^{d}_{\Theta,m}\, q, \qquad R^{d}_{\Theta,m} = \operatorname{diag}\big(R(m\theta_1),\, R(m\theta_2),\, \dots,\, R(m\theta_{d/2})\big),$$
with a different rotation rate $\theta_i$ per plane (the next section sets the rates). The matrix $R^{d}_{\Theta,m}$ is block-diagonal and almost entirely zero, so in code RoPE is implemented as a per-coordinate multiply-and-add against precomputed cosines and sines, not an actual matrix multiply: it costs a few elementwise operations per vector, and slots in just before the dot product in every attention head.
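As a rough illustration of the elementwise form, here is a minimal sketch in plain Python. It assumes adjacent coordinates are paired, (x[0], x[1]), (x[2], x[3]) and so on, and that the rates follow the base-10000 geometric schedule; the helper names and the sample vectors are hypothetical. Some libraries instead pair coordinate i with coordinate i + d/2, which is the same construction up to a fixed permutation of the coordinates.

```python
import math

def rope_rates(d, base=10000.0):
    """Rotation rate per plane; zero-based i here plays the role of i - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by pos * theta_i."""
    out = [0.0] * len(x)
    for i, theta in enumerate(rope_rates(len(x), base)):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s       # same arithmetic as the 2x2 rotation,
        out[2 * i + 1] = a * s + b * c   # written out elementwise
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [0.8, 0.1, -0.5, 1.3, 0.2, -0.7, 0.4, 1.0]

score_near_start = dot(apply_rope(q, 3), apply_rope(k, 10))
score_far_along = dot(apply_rope(q, 103), apply_rope(k, 110))
print(score_near_start, score_far_along)  # same offset of 7, same score
```

No matrix is ever built: each coordinate pair needs one cosine, one sine, four multiplies, and two adds, and in practice the cosines and sines are cached per position.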
Many speeds, one position
One rotation rate would only encode one wavelength of position. The paper uses many, on the same schedule the original Transformer used for its sinusoidal embedding, scaled geometrically from fast to slow:
$$\theta_i = 10000^{-2(i-1)/d}, \qquad i = 1, 2, \dots, d/2.$$
The fastest plane ($i = 1$) has $\theta_1 = 1$ radian per token, so it whips through a full revolution every $2\pi \approx 6.3$ tokens. The slowest plane ($i = d/2$, with $d = 128$) has $\theta_{64} \approx 1.15 \times 10^{-4}$, so its wavelength is about $54{,}000$ tokens. Across the $d/2$ planes the wavelengths span four orders of magnitude, from a handful of tokens to tens of thousands. Fast planes resolve adjacent positions; slow planes never repeat across any context the model will see. The base $10000$ is borrowed verbatim from the sinusoidal encoding of Vaswani et al., and it is one of the knobs the long-context extensions reach for: NTK-aware scaling and YaRN raise the base to stretch the slow planes; Position Interpolation instead rescales the position index itself so the same wavelengths cover a longer context.
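The wavelength arithmetic is quick to reproduce. The sketch below assumes a head dimension of d = 128 and the base of 10000; `rope_rates` is a hypothetical helper name, not a library function.

```python
import math

def rope_rates(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1 .. d/2."""
    return [base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]

d = 128                                   # assumed head dimension
rates = rope_rates(d)
wavelengths = [2 * math.pi / theta for theta in rates]

print(f"planes:           {len(rates)}")
print(f"fastest rate:     {rates[0]:.6f} rad/token")
print(f"slowest rate:     {rates[-1]:.3e} rad/token")
print(f"fastest period:   {wavelengths[0]:.2f} tokens")
print(f"slowest period:   {wavelengths[-1]:.0f} tokens")
```

Each plane's wavelength is a constant multiple of the previous one (a factor of $10000^{2/d}$), which is what "geometric" means here: the planes are evenly spaced on a log scale between the two extremes.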
Figure 3 stacks five representative planes as clock dials. Slide the position: every dial turns at once, but the top one whips around dozens of times while the bottom one barely moves. The single token index drives all the rotations; the speed gradient is what gives one position encoding many length-scales.
Two observations matter for the next sections. First, the encoding is built from absolute positions but the score it produces depends only on the relative offset (Equation 2 again, now stacked across planes): absolute and relative are not in conflict, they are different views of the same construction. Second, because nothing is learned, RoPE does not run out of frequencies at sequence lengths it has never seen. The slowest plane covers any length you can fit in memory; you only lose accuracy when you push past where the fast planes start to wrap around in ways the model wasn't trained to recognize.
Distance damps the score
Group the rotated coordinates of $q$ and $k$ into $d/2$ planes, and the attention score is a sum of complex phasors,
$$\big(R^{d}_{\Theta,m}\, q\big)^\top \big(R^{d}_{\Theta,n}\, k\big) = \mathrm{Re}\left[\sum_{j=1}^{d/2} h_j\, e^{\mathrm{i}(n-m)\theta_j}\right],$$
with one complex coefficient $h_j$ per plane from the content vectors (here $\mathrm{i}$ is the imaginary unit and $j$ indexes the planes). The geometry behind the decay is easier to see than the algebra. Drop the content for a moment and look at the bare phasors $e^{\mathrm{i}(n-m)\theta_j}$: at $n - m = 0$ every phasor sits at $1$ on the real axis (the rotation hasn't started yet) and their sum has magnitude $d/2 = 64$ for $d = 128$, the maximum. As $n - m$ grows the fastest plane whips around the unit circle while the slowest barely moves, so the phasors fan out into a cloud. By the time the slowest plane has rotated a substantial fraction of a revolution the phasors point in roughly independent directions and their sum behaves like a random walk of $64$ steps: magnitude of order $\sqrt{64} = 8$ by the central limit theorem. The score collapses from $64$ to $8$, a factor of eight, just from this geometric loss of coherence.
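The loss of coherence can be watched directly by summing the bare phasors at a few offsets. This is a minimal sketch, assuming d = 128 (so 64 planes) and the base-10000 schedule; the function names and the list of offsets are illustrative choices.

```python
import cmath

def rope_rates(d, base=10000.0):
    """One rotation rate per plane, geometric from 1 down to about 1/base."""
    return [base ** (-2.0 * j / d) for j in range(d // 2)]

def phasor_sum_magnitude(delta, d=128, base=10000.0):
    """|sum_j exp(i * delta * theta_j)| with the content coefficients dropped."""
    return abs(sum(cmath.exp(1j * delta * theta) for theta in rope_rates(d, base)))

offsets = [0, 1, 10, 100, 1000, 10000, 100000]
profile = {delta: phasor_sum_magnitude(delta) for delta in offsets}
for delta, magnitude in profile.items():
    print(f"offset {delta:>6}: |sum| = {magnitude:6.2f}")
```

At offset 0 all 64 phasors align and the magnitude is exactly 64. The fall from there is gradual rather than sudden, because the slow planes stay nearly aligned for thousands of tokens; the square-root figure is the heuristic level for fully scrambled phasors, not a value reached after a few steps.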
Section 3.4.3 of the paper turns that geometric story into a bound. By Abel summation, the magnitude of the phasor sum is bounded by the product of a content-only constant and a sum of partial sums of the phasors,
$$\left|\sum_{j=1}^{d/2} h_j\, e^{\mathrm{i}(n-m)\theta_j}\right| \;\le\; \Big(\max_{j} \big|h_{j+1} - h_j\big|\Big) \sum_{j=1}^{d/2} \big|S_j\big|, \qquad S_j = \sum_{l=1}^{j} e^{\mathrm{i}(n-m)\theta_l},$$
where $h_j$ is the complex content coefficient of plane $j$, $\mathrm{i}$ is the imaginary unit, and $h_{d/2+1} = 0$ by convention.
The first factor is a constant of the content; the second is purely geometric and is the long-term decay RoPE inherits. The damping is not a bias the model learns or a window the architect chooses (the way ALiBi or sliding windows do); it is a property of summing many rotating numbers at incommensurate speeds. The curve is an oscillating envelope, not a monotone slide. At certain offsets several planes line up by accident and the bound briefly rises again. Figure 5 plots the geometry factor against distance and you can see the wiggles riding on a falling trend. The figure also exposes the knobs the later long-context work reaches for. Slide the base and the curve flattens or steepens. Toggle Position Interpolation and the same curve gets stretched, because the position index is divided by the extension factor before the angles are computed (Chen et al. 2023). Toggle NTK-aware scaling and the base is multiplied by $s^{d/(d-2)}$ instead, where $s$ is the extension factor (Peng et al.'s YaRN follows the same recipe); the half-decay marker jumps past the training-length line as either patch kicks in.
The decay comes with no learned bias, no window, and no extra parameter. The paper frames it as a built-in inductive prior toward local attention; the modern reading is that it is a soft recency bias that can be tuned away with one constant or by rescaling the position index when you want longer context.
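The Abel-summation bound can be checked numerically. The sketch below is illustrative only: it assumes d = 128 and the base-10000 schedule, uses randomly drawn complex numbers as stand-ins for the per-plane content coefficients, and pads the coefficient list with a trailing zero, which is the convention the summation-by-parts step needs. `abel_terms` is a hypothetical name.

```python
import cmath
import random

def rope_rates(d, base=10000.0):
    return [base ** (-2.0 * j / d) for j in range(d // 2)]

def abel_terms(h, delta, base=10000.0):
    """Return (|phasor sum|, content factor, geometry factor) for offset delta."""
    d = 2 * len(h)
    phasors = [cmath.exp(1j * delta * theta) for theta in rope_rates(d, base)]
    lhs = abs(sum(hj * pj for hj, pj in zip(h, phasors)))
    # Partial sums S_j of the bare phasors.
    partial_sums, running = [], 0j
    for p in phasors:
        running += p
        partial_sums.append(running)
    geometry = sum(abs(s) for s in partial_sums)
    # Largest jump between neighbouring coefficients, with h padded by a zero.
    padded = list(h) + [0j]
    content = max(abs(padded[j + 1] - padded[j]) for j in range(len(h)))
    return lhs, content, geometry

rng = random.Random(0)   # hypothetical random content coefficients
h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(64)]

for delta in (0, 1, 16, 256, 4096):
    lhs, content, geometry = abel_terms(h, delta)
    # geometry / len(h) is the per-plane average of |S_j|, the quantity that decays.
    print(delta, lhs <= content * geometry, round(geometry / len(h), 2))
```

The inequality holds at every offset by construction; the interesting column is the last one, which depends on the offset and the rates but not on the content.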
It rides linear attention
Plain softmax attention reads every query-key pair, an $N \times N$ matrix at sequence length $N$, and additive relative-position schemes (T5's bias, ALiBi) inherit the same cost: the bias is a per-pair number, so you still touch every pair to apply it. RoPE does not. A rotation is multiplicative, and that means it composes with the trick that turns attention from $O(N^2)$ into $O(N)$: the linear-attention rewrite.
Linear attention replaces the softmax with a feature map $\phi$ applied to queries and keys, so the score factors as $\phi(q_m)^\top \phi(k_n)$, and that lets you swap the order of operations. Instead of computing every $\phi(q_m)^\top \phi(k_n)$, you build one running accumulator over keys and values,
$$S = \sum_{n} \phi(k_n)\, v_n^\top,$$
and let each query read from it: $\phi(q_m)^\top S$. One pass to build the accumulator, one read per query, total cost $O(N)$ instead of $O(N^2)$. The catch is positions. An additive bias sits inside the softmax and depends on the pair, so it can't be pulled out and you have to walk every pair anyway. RoPE's rotation sits outside the dot product and, writing $R_m$ for the rotation at position $m$, factors as
$$R_{n-m} = R_m^\top R_n,$$
and Section 3.3 of the paper does the natural composition: apply the feature map first, then the rotation. The key side stores $R_n\, \phi(k_n)$, the query side computes $R_m\, \phi(q_m)$, and the inner product is $\phi(q_m)^\top R_{n-m}\, \phi(k_n)$, a quadratic form in the feature map with the relative-offset rotation in the middle. For a general nonlinear $\phi$ the rotation does not commute with the feature map (so this is not the same as feeding rotated queries and keys through $\phi$), but the accumulator structure still works: build the accumulator over the rotated features $R_n\, \phi(k_n)$ in one pass, read from it once per query, and the cost stays $O(N)$. By the paper's account RoPE is the only relative-position scheme of the era that composes with linear attention this way, which is part of why so many fast-attention architectures have gone with it.
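Here is a minimal sketch of the accumulator trick with rotations, in plain Python. It makes several assumptions that should be read as illustrative: the rotation is applied to the feature-mapped vectors (the ordering that yields the quadratic form with the relative rotation in the middle), the feature map is elu(x) + 1 as in Katharopoulos et al., the normalizer is left un-rotated so it stays positive, and attention is non-causal. All function names are hypothetical.

```python
import math
import random

def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumption; other positive maps work)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def rotate_pairs(x, pos, base=10000.0):
    """RoPE rotation of adjacent coordinate pairs by pos * theta_i."""
    d, out = len(x), [0.0] * len(x)
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_attention_rope(Q, K, V):
    """One pass over keys builds the accumulator; each query reads it once."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]   # sum_n (R_n phi(k_n)) v_n^T
    z = [0.0] * d                        # sum_n phi(k_n), left un-rotated
    for n, (k, v) in enumerate(zip(K, V)):
        fk = feature_map(k)
        rk = rotate_pairs(fk, n)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += rk[a] * v[b]
    outputs = []
    for m, q in enumerate(Q):
        fq = feature_map(q)
        rq = rotate_pairs(fq, m)
        norm = dot(fq, z)
        outputs.append([sum(rq[a] * S[a][b] for a in range(d)) / norm
                        for b in range(dv)])
    return outputs

def pairwise_reference(Q, K, V):
    """Same quantity computed the O(N^2) way, one query-key pair at a time."""
    outputs = []
    for m, q in enumerate(Q):
        fq = feature_map(q)
        rq = rotate_pairs(fq, m)
        numerator, norm = [0.0] * len(V[0]), 0.0
        for n, (k, v) in enumerate(zip(K, V)):
            fk = feature_map(k)
            weight = dot(rq, rotate_pairs(fk, n))
            norm += dot(fq, fk)
            numerator = [acc + weight * vb for acc, vb in zip(numerator, v)]
        outputs.append([x / norm for x in numerator])
    return outputs

rng = random.Random(0)
N, d, dv = 6, 4, 3
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[rng.gauss(0, 1) for _ in range(dv)] for _ in range(N)]

fast = linear_attention_rope(Q, K, V)
slow = pairwise_reference(Q, K, V)
```

The two results agree up to rounding, even though the first never forms a per-pair score. Because the rotated weights can be negative, the output is no longer a convex average of the values; only the un-rotated normalizer keeps the scale sensible.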
Modest tables, total adoption
The paper validates RoPE on the small benchmarks that were standard in early 2021. On machine translation (WMT 2014 English-to-German, Table 1) RoFormer reaches 27.5 BLEU against the original Transformer's 27.3, a slim +0.2. On enwik8 character-level pre-training the paper reports a faster convergence curve for Performer + RoPE versus Performer alone, not a parity claim against BERT. On downstream GLUE (Table 2) RoFormer trails BERT on SST-2, QNLI, and MNLI and beats it on MRPC, STS-B, and QQP, a mixed result. The cleanest gains are on a long-document Chinese benchmark (CAIL2019-SCM, a legal case-matching task), which is where the geometric decay starts to matter. The empirical case in 2021 was modest, and the authors are careful to say so.
What carried RoPE out of the paper and into nearly every open-weights LLM was the engineering shape, not the table. The rotation is parameter-free, so it cannot be the thing that gets fine-tuned wrong on a new corpus. It is local: every query and key carry their own rotation and need not consult any other token, which composes with linear attention (in its Performer form, where the rotation is applied to the feature-mapped vectors) and with kernel-level fast-attention rewrites alike. It exposes one interpretable knob, the base $10000$, so when later work needed to push context past the training length it had a clean lever to pull: NTK-aware scaling and YaRN raise the base; Position Interpolation rescales the position index itself. It separates the question of “which token is this” from “where does it sit”, so weights trained at one context length transfer to another with only a small adjustment. LLaMA, GPT-NeoX, PaLM, Mistral, and most open-weights models since 2022 use RoPE, and nearly every long-context patch is a tweak of its schedule (the base, or an effective rescaling of position) rather than a replacement.
Everything rests on one move: a 2D rotation by the absolute position, which turns the dot product into a function of the relative position with no extra term. Stack such rotations at geometric speeds and one position index yields many wavelengths, fast for local and slow for global. The phasor sum that results is bounded, via Abel summation, by an envelope that damps with distance, an inductive prior toward local attention you can re-tune with a single constant. None of it is learned and it adds only a few elementwise operations at inference, and because the rotation factors out of the linear-attention accumulator, it survives the fast-attention rewrites that additive biases cannot. That combination, not the modest 2021 tables, is why it became the default.
Questions you might still have
Why does a rotation turn absolute position into relative position?
Rotations are orthogonal, so when you transpose one and multiply it by another, only the difference of the angles survives: R_mᵀ R_n = R_{n−m}. Rotate the query by mθ and the key by nθ, then take the dot product, and the result depends only on n−m (which the paper, with a sign-flipped convention, writes as m−n). The absolute positions cancel out of the score by construction, with no extra term in the formula and no learned parameters.
Why d/2 planes and not just one rotation in the whole d-dim space?
One rotation has one wavelength. You want short wavelengths to resolve nearby tokens and long wavelengths to span the whole sequence, like a single position index driving a multi-band radio. Splitting into d/2 independent 2D planes lets each plane spin at its own rate θ_i = 10000^{-2(i-1)/d}, geometrically spaced from 1 radian per token down to ≈1.15e-4 radians per token (for d = 128). The same position drives all of them, so you encode every length-scale at once.
Why does an additive relative bias break linear attention but rotation doesn’t?
An additive bias b_{m,n} sits inside the softmax as a per-pair number. To apply it you have to walk every query-key pair, which is exactly the N² cost linear attention was designed to avoid. A rotation is multiplicative and sits outside the dot product. You can push the query’s rotation onto the query side and the key’s rotation onto the key side, fold every (rotated) key into one running accumulator, and let each (rotated) query read from it. The pair-dependence emerges in the read, not in the storage. For a general nonlinear feature map φ the rotation does not commute with φ, so the order matters: the paper rotates the feature-mapped vectors, which makes the numerator exactly φ(q)ᵀ R_{n-m} φ(k), and the accumulator structure carries over unchanged.
How do people stretch RoPE to longer context?
Two knobs. NTK-aware scaling and YaRN raise the base 10000 to flatten the decay curve and stretch the slow planes’ wavelengths past anything in training. Position Interpolation rescales the position index m itself, so the model sees the same range of angles even on a longer sequence. The structural payoff of separating the rotation construction from the trained weights is that both knobs can be turned at inference time with a small fine-tune or none at all.
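Both knobs fit in a few lines. This is a minimal sketch under stated assumptions: the NTK-aware rule is taken to be the commonly quoted one, multiplying the base by scale^(d/(d-2)); YaRN's per-plane blending and attention temperature are not modelled; the sizes and the function names are hypothetical.

```python
def rope_rates(d, base=10000.0):
    """One rotation rate per plane, fastest first."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def interpolated_angles(pos, d, scale, base=10000.0):
    """Position Interpolation: shrink the position index by the extension factor."""
    return [(pos / scale) * theta for theta in rope_rates(d, base)]

def ntk_scaled_base(d, scale, base=10000.0):
    """NTK-aware scaling: raise the base so the slowest plane stretches by `scale`."""
    return base * scale ** (d / (d - 2))

d, train_len, scale = 128, 2048, 4       # hypothetical sizes
plain = rope_rates(d)
stretched = rope_rates(d, ntk_scaled_base(d, scale))

# Position Interpolation: the last position of the extended context produces
# the same angles the model saw at the end of its training context.
at_extended_end = interpolated_angles(scale * train_len, d, scale)
at_training_end = [train_len * theta for theta in plain]

print(plain[0], stretched[0])            # fastest plane: unchanged by the base
print(plain[-1] / stretched[-1])         # slowest plane: slowed by about `scale`
```

The difference in character shows up in the fast planes. Position Interpolation slows every plane by the same factor, so adjacent tokens end up closer in angle than in training. Raising the base leaves the fastest plane alone and puts most of the stretch on the slow planes.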
Footnotes & further reading
- The paper: Su, Lu, Pan, Murtadha, Wen, Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding (Zhuiyi Technology, 2021; revised through 2023).
- The sinusoidal schedule RoPE borrows the base 10000 from: Vaswani et al., Attention Is All You Need (2017).
- The relative-position predecessors RoPE displaces: Shaw, Uszkoreit, Vaswani, Self-Attention with Relative Position Representations (2018), and T5's scalar bias, in Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019).
- The linear-attention rewrite RoPE composes with: Katharopoulos, Vyas, Pappas, Fleuret, Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020).
- Long-context extensions of RoPE: Chen et al., Extending Context Window of Large Language Models via Positional Interpolation (2023), which rescales the position index; Peng et al., YaRN (2023), which combines a base change with attention temperature; plus the NTK-aware scaling community write-ups, which raise the base.