RoFormer: Enhanced Transformer with Rotary Position Embedding

Rotate the query and key by their positions, and their dot product depends only on the distance between them.

Self-attention computes scores from content alone, so swapping any two tokens leaves every pairwise score unchanged. RoPE fixes that geometrically: rotate the query and key before they meet, by an angle proportional to their position. The dot product then depends only on their relative distance; the rotation adds no parameters and almost no inference cost, and it survives the linear-attention rewrite intact.

Explaining the paper: RoFormer: Enhanced Transformer with Rotary Position Embedding · Su, Lu, Pan, Murtadha, Wen, Liu · Zhuiyi Technology · arXiv:2104.09864

RoPE borrows its rotation speeds straight from the original sinusoidal encoding, yet nearly every open-weights model built since has adopted it.

Self-attention is the only operation in a Transformer that lets a token look at any other token. It is also the only one that, by itself, cannot tell which token came first. The score between a query at position $m$ and a key at position $n$ is the dot product of their content vectors, $\mathbf{q}_m^\top\mathbf{k}_n$; rearrange the sentence and every pairwise score is the same. So every Transformer ships with a separate position encoding, a signal injected somewhere into $\mathbf{q}_m$ and $\mathbf{k}_n$ that tells the model who sits where. The original paper added a fixed sinusoidal vector to the embedding; later variants learned the position vectors, or added a bias to the score, or built relative-distance tables. None of them is ideal: most spend extra parameters, the per-pair relative schemes do not combine with the linear-attention rewrites, and most degrade outside their training length.

RoFormer, from Su and collaborators at Zhuiyi Technology, encodes position as a rotation. It splits the query and key vectors into $d/2$ two-dimensional planes and rotates each plane by an angle proportional to the token's absolute position. Two rotated vectors, when you dot them, produce a value that depends only on the angular difference between their rotations, which is exactly the relative offset. The encoding has no learned parameters, costs a few cheap elementwise operations per vector at inference, and slots into linear attention without breaking anything. Four ideas carry the paper: why attention without position is blind, how a rotation turns absolute positions into a relative offset, why rotating at many speeds covers both local and global structure, and why the resulting attention score damps with distance.

Attention is order-blind

Self-attention computes a weighted average of values where the weight between a query at position $m$ and a key at position $n$ is

a_{m,n} = \mathrm{softmax}_n\!\left(\frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d}}\right)

and both $\mathbf{q}_m$ and $\mathbf{k}_n$ come straight from the token embedding, functions of the word, not its slot in the sentence. Swap any two tokens and the same pair of content vectors reappears between the same two words: the score between them never changes. Permute the entire sentence and the matrix of pairwise scores is the same matrix, just relabelled. Attention is permutation-equivariant over the input, which means the model literally cannot tell “the cat sat on the mat” from “the mat sat on the cat”, since both produce the same pile of content vectors and the same scores between them.

Figure 1 makes the symmetry visible. A query word (“sat”) and a highlighted key (“mat”) trade attention through a fixed score, drawn as a chip in the corner. When the arrangement slides, every word slides too, and so does its bar. Heights stay frozen. The readout in the corner never moves. RoPE supplies the missing position signal.

Figure 1 · the order-blind baseline

[Interactive figure. Control: reorder words (shown: original order). Readout: q·k(mat) = -0.05, weight = 0.13, never moves.]

Five words and their attention weights from the query “sat”. The highlighted key “mat” carries a corner readout for $\mathbf{q}^\top\mathbf{k}/\sqrt{d}$ and its softmax weight. Slide the word order: every bar travels with its word and no height changes. The dot product depends only on the two content vectors, so the model gets no signal about who sat where unless position is injected.
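The order-blindness can be checked numerically. The following is a minimal sketch, not code from the paper: the word list, the toy dimension `d = 4`, the random content vectors, and the helper names `attention_scores` and `score_table` are all assumptions made for illustration. It scores every query-key pair for two orderings of the same words and compares the scores word-pair by word-pair.

```python
import math
import random

def attention_scores(queries, keys):
    """Scaled dot-product scores q_m . k_n / sqrt(d) for every pair (m, n)."""
    d = len(queries[0])
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
            for q in queries]

random.seed(0)
words = ["the", "cat", "sat", "on", "mat"]
d = 4  # toy head width, chosen only for the demo
# One fixed random content vector per word: no position signal anywhere.
q_vec = {w: [random.gauss(0, 1) for _ in range(d)] for w in words}
k_vec = {w: [random.gauss(0, 1) for _ in range(d)] for w in words}

def score_table(order):
    """Map each (query word, key word) pair to its score for a given word order."""
    s = attention_scores([q_vec[w] for w in order], [k_vec[w] for w in order])
    return {(a, b): s[i][j] for i, a in enumerate(order) for j, b in enumerate(order)}

original = score_table(words)
shuffled = score_table(["mat", "sat", "the", "on", "cat"])

# Every word pair keeps its score: the score matrix is only relabelled.
same = all(math.isclose(original[p], shuffled[p]) for p in original)
print(same)
```

The comparison is keyed by word rather than by slot, which is the point: the slot never enters the computation.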

Position as a rotation

The two-dimensional case is enough to set up the rest. Suppose the query and key are 2D vectors $\mathbf{q}, \mathbf{k}\in\mathbb{R}^2$ and we want a function $f$ of position and content such that the dot product

\langle f(\mathbf{q}, m), f(\mathbf{k}, n)\rangle = g(\mathbf{q}, \mathbf{k}, m-n)

(goal)

depends on positions only through their difference $m-n$ . The paper's answer is a rotation. Let $\mathbf{R}_\theta$ be the planar rotation matrix by angle $\theta$ , and apply it with an angle proportional to the position:

f(\mathbf{q}, m) = \mathbf{R}_{m\theta}\,\mathbf{q}, \qquad f(\mathbf{k}, n) = \mathbf{R}_{n\theta}\,\mathbf{k}

(1)

Rotations are orthogonal, so $\mathbf{R}_\alpha^\top\mathbf{R}_\beta = \mathbf{R}_{\beta-\alpha}$ , and the dot product collapses to

\langle\mathbf{R}_{m\theta}\mathbf{q},\;\mathbf{R}_{n\theta}\mathbf{k}\rangle = \mathbf{q}^\top \mathbf{R}_{(n-m)\theta}\,\mathbf{k}

(2)

The right side is a function of the content vectors and the relative offset $n-m$ alone. The absolute positions $m$ and $n$ have disappeared: any pair with the same gap scores identically. Earlier relative-position schemes engineered that property by hand, typically by adding scalar biases $b_{m,n}$ to the score; RoPE gets it from geometry, with no extra term in the formula and no learned parameters.

Figure 2 makes the property concrete. Two unit vectors stand in for $\mathbf{q}$ and $\mathbf{k}$; two sliders rotate them by $m\theta$ and $n\theta$; the score on the right is $\mathbf{q}^\top \mathbf{R}_{(n-m)\theta}\,\mathbf{k} = \cos((n-m)\theta + \varphi)$, where $\varphi$ is the angle from $\mathbf{q}$ to $\mathbf{k}$ before either is rotated. With the offset locked, dragging either slider moves both together and the score never changes. Unlocked, moving one alone sweeps the score through the cosine.

Figure 2 · rotation gives relative position for free

[Interactive figure. Controls: query position m (shown: 3), key position n (shown: 11), lock offset toggle.]

Two unit vectors drawn as $\mathbf{R}_m\mathbf{q}$ (teal) and $\mathbf{R}_n\mathbf{k}$ (amber), each rotated by its absolute position. The dashed ghosts show the un-rotated $\mathbf{q}_0$ and $\mathbf{k}_0$. The right panel reads out $n-m$ and the resulting score $\cos((n-m)\theta + \varphi)$. Slide $m$ and $n$ independently and the score moves; flip lock offset on and slide either one and both move together with the score frozen. $\theta = 0.4$ here, picked for legibility; in real RoPE the fastest plane is $\theta_1 = 1$ and the slowest is $\theta_{64} \approx 1.15\!\times\!10^{-4}$.
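Equation 2 is easy to verify in the 2D case. This is a small sketch under assumptions: the helper names `rotate` and `rope_score_2d`, the starting angle `phi = 0.3`, and the particular positions are illustrative choices, and $\theta = 0.4$ is the figure's demo value, not a real RoPE rate.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def rope_score_2d(q, k, m, n, theta=0.4):
    """Dot product of q rotated by m*theta and k rotated by n*theta."""
    rq, rk = rotate(q, m * theta), rotate(k, n * theta)
    return rq[0] * rk[0] + rq[1] * rk[1]

q = (1.0, 0.0)                      # unit query at angle 0
phi = 0.3                           # angle from q to k before any rotation
k = (math.cos(phi), math.sin(phi))  # unit key

a = rope_score_2d(q, k, m=3, n=11)    # offset n - m = 8
b = rope_score_2d(q, k, m=40, n=48)   # same offset, different absolute positions
closed_form = math.cos(8 * 0.4 + phi) # cos((n - m) * theta + phi)

print(a, b, closed_form)
```

The three printed numbers agree to floating-point precision; changing only one of `m` or `n` changes the score.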

The general case stacks copies of the 2D case. A $d$-dimensional query splits into $d/2$ pairs of coordinates, each treated as a 2D plane and rotated independently:

\mathbf{R}^d_{m,\Theta} = \begin{pmatrix} \mathbf{R}_{m\theta_1} & & \\ & \ddots & \\ & & \mathbf{R}_{m\theta_{d/2}} \end{pmatrix}

(3)

with a different rotation rate $\theta_i$ per plane (the rates are set by the schedule $\theta_i = 10000^{-2(i-1)/d}$). The matrix is block-diagonal and almost entirely zero, so in code RoPE is not an actual matrix multiply: each coordinate is multiplied by a cosine, its plane partner is multiplied by a sine, and the two are added. That is a handful of elementwise operations per vector, and it slots in just before the dot product in every attention head.
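The block-diagonal rotation can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than a reference implementation: the function names `rope_angles` and `apply_rope` are hypothetical, the toy width `d = 8` is chosen for the demo, and planes are formed from adjacent coordinates (some code bases pair coordinate $i$ with $i + d/2$ instead, which permutes the planes but keeps the same rates).

```python
import math
import random

def rope_angles(position, d, base=10000.0):
    """Angles m * theta_i for i = 1..d/2, with theta_i = base^(-2(i-1)/d)."""
    return [position * base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(x, position, base=10000.0):
    """Rotate consecutive coordinate pairs (x[0], x[1]), (x[2], x[3]), ... of x."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(x), base)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [c * a - s * b, s * a + c * b]  # one 2x2 rotation per plane
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
d = 8  # toy head width for the demo
q = [random.gauss(0, 1) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]

near = dot(apply_rope(q, 5), apply_rope(k, 9))        # offset 4
far = dot(apply_rope(q, 1005), apply_rope(k, 1009))   # offset 4, shifted by 1000
print(near, far)
```

The two scores agree up to rounding even though the absolute positions differ by 1000, and the rotation leaves each vector's length unchanged.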

Many speeds, one position

One rotation rate would only encode one wavelength of position. The paper uses many, on the same schedule the original Transformer used for its sinusoidal embedding, scaled geometrically from fast to slow:

\theta_i = 10000^{-2(i-1)/d}, \qquad i = 1, 2, \ldots, d/2

(4)

The fastest plane ($i=1$) has $\theta_1 = 1$ radian per token, so it completes a full revolution every $2\pi \approx 6.28$ tokens. The slowest plane ($i=d/2$, with $d=128$) has $\theta_{64}=10000^{-126/128} \approx 1.15\!\times\!10^{-4}$, so its wavelength $\lambda = 2\pi/\theta$ is about $54{,}000$ tokens. Across the $d/2$ planes the wavelengths span nearly four orders of magnitude, from a handful of tokens to tens of thousands. Fast planes resolve adjacent positions; slow planes do not repeat within any context shorter than tens of thousands of tokens. The base $10000$ is borrowed verbatim from the sinusoidal encoding of Vaswani et al., and it is one of the settings the long-context extensions reach for: NTK-aware scaling and YaRN raise the base to stretch the slow planes; Position Interpolation instead rescales the position index $m$ itself so the same wavelengths cover a longer context.
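The wavelength arithmetic can be reproduced directly from Equation 4. A short sketch, assuming the same $d = 128$ as the text; the function name `rope_wavelengths` is hypothetical.

```python
import math

def rope_wavelengths(d=128, base=10000.0):
    """Tokens per full revolution, 2*pi / theta_i, for each of the d/2 planes."""
    return [2 * math.pi / base ** (-2.0 * i / d) for i in range(d // 2)]

wl = rope_wavelengths()
print(f"fastest plane: {wl[0]:.2f} tokens per revolution")
print(f"slowest plane: {wl[-1]:.0f} tokens per revolution")
print(f"ratio slowest/fastest: {wl[-1] / wl[0]:.0f}")
```

Passing a larger `base` leaves the fastest plane at $2\pi$ tokens and lengthens every other wavelength, which is the lever the base-raising extensions pull.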

Figure 3 stacks five representative planes as clock dials. Slide the position: every dial turns at once, but the top one rotates dozens of times while the bottom one rotates only slightly. The single token index $m$ drives all the rotations; the speed gradient gives one position encoding many length-scales.

Figure 3 · five frequencies, one position

[Interactive figure. Control: position m (shown: 64).]

Five of the $d/2 = 64$ planes (head width $d=128$), drawn as clock dials. Each dial rotates at its own rate $\theta_i = 10000^{-2(i-1)/d}$, so the fastest plane turns 1 radian per token while the slowest plane needs tens of thousands of tokens for one revolution. The same position slider drives them all; the wavelengths printed beside each dial run from about 6 tokens (fast, local) to 54,000 tokens (slow, global).

Two observations matter for the next sections. First, the encoding is built from absolute positions but the score it produces depends only on the relative offset (Equation 2 again, now stacked across planes): absolute and relative are not in conflict, they are different views of the same construction. Second, because nothing is learned, RoPE does not run out of frequencies at sequence lengths it has never seen: the formula yields a valid rotation at any position. What degrades is the model around it. Past the training length the slow planes reach rotation angles the model never saw during training, while the fast planes have already been through many full turns, and that mismatch is what the long-context patches address.

Distance damps the score

Group the rotated coordinates of $\mathbf{q}_m$ and $\mathbf{k}_n$ into $d/2$ planes, and the attention score is a sum of complex phasors,

\mathbf{q}_m^\top \mathbf{R}^d_{(n-m),\Theta}\,\mathbf{k}_n \;=\; \mathrm{Re}\sum_{i=1}^{d/2} h_i\,e^{\,\mathrm{i}\,(n-m)\theta_i}

(5)

with one complex coefficient $h_i$ per plane from the content vectors. The geometry behind the decay is easier to see than the algebra. Set the content aside and consider the bare phasors $e^{\mathrm{i}(n-m)\theta_i}$ : at $|m-n|=0$ every phasor sits at $+1$ on the real axis (the rotation hasn't started yet) and their sum has magnitude $d/2$ , the maximum. As $|m-n|$ grows the fastest plane rotates a full circle while the slowest rotates only slightly, so the phasors fan out into a cloud. By the time the slowest plane has rotated a substantial fraction of a revolution the phasors point in roughly independent directions and their sum behaves like a random walk of $d/2$ steps. Summing that many unit vectors in random directions grows the resultant like the square root of the number of steps rather than linearly (the same $\sqrt{N}$ law behind diffusion), giving magnitude order $\sqrt{d/2}$ . The score collapses from $d/2 = 64$ to $\sqrt{d/2} \approx 8$ , a factor of eight, just from this geometric loss of coherence.
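The loss of coherence can be watched numerically by summing the bare phasors at a few offsets. This is a sketch under the text's own simplification (content coefficients $h_i$ set aside, $d = 128$, base $10000$); the function name `phasor_sum_magnitude` and the list of offsets are illustrative assumptions.

```python
import cmath

def phasor_sum_magnitude(offset, d=128, base=10000.0):
    """|sum_i exp(1j * offset * theta_i)| over the d/2 RoPE planes."""
    thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]
    return abs(sum(cmath.exp(1j * offset * t) for t in thetas))

# At offset 0 all 64 phasors point along +1; larger offsets spread them out.
for offset in (0, 1, 10, 100, 1000, 10000):
    print(offset, round(phasor_sum_magnitude(offset), 2))
```

The printed magnitudes start at $d/2 = 64$ and fall as the offset grows. How closely they approach the $\sqrt{d/2}$ random-walk estimate depends on the offset, since the slowest planes stay nearly aligned until the offset is a sizeable fraction of their wavelength.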

Figure 4 · phasors fan out as the offset grows

[Interactive figure. Control: offset |m − n| (shown: 0).]

Sixty-four phasors $e^{\mathrm{i}(n-m)\theta_i}$, one per RoPE plane, drawn on the unit circle. The colours run from teal (the fastest plane) to amber (the slowest). The bold teal arrow is their sum; the dashed amber circle is the central-limit floor at $\sqrt{d/2}$. At $|m-n|=0$ every arrow points right and the sum reaches the full radius. Drag the offset up: fast arrows whip around while slow ones barely move, the cloud spreads, and the bold sum shrinks toward the dashed floor. The score in Figure 5 is the same story plotted against distance instead of as a snapshot.

Section 3.4.3 of the paper turns that geometric story into a bound. By Abel summation, factor the magnitude of the phasor sum into a content-only constant and a sum-of-partial-sums of the phasors,

\Big|\sum_i h_i\, e^{\mathrm{i}(n-m)\theta_i}\Big| \;\le\; \big(\max_i |h_{i+1}-h_i|\big)\cdot \sum_{j=1}^{d/2} |S_j|, \quad S_j = \sum_{i=1}^{j} e^{\mathrm{i}(n-m)\theta_i}

(6)

The first factor is a constant of the content; the second is purely geometric and is the long-term decay RoPE inherits. The damping is not a bias the model learns or a window the architect chooses, as with sliding windows or ALiBi (Attention with Linear Biases, a hand-set distance penalty on each score); it is a property of summing many rotating numbers at incommensurate speeds. The curve is an oscillating envelope, not a monotone slide: at certain offsets several planes line up by accident and the bound briefly rises again. Figure 5 plots the geometry factor $B(|m-n|) = \tfrac{2}{d}\sum_j |S_j|$ (normalized in the plot to start at 1) against distance, and you can see the wiggles riding on a falling trend.

The figure also exposes the knobs the later long-context work reaches for. Raising the base slows every plane except the fastest, which flattens the curve; lowering it steepens the curve. Position Interpolation stretches the same curve horizontally, because the position index $m$ is divided by the extension factor $s$ before the angles are computed (Chen et al. 2023). NTK-aware scaling instead multiplies the base by $s^{d/(d-2)}$ (Peng et al.'s YaRN builds on the same base-change idea); the half-decay marker moves past the training-length line once either patch is applied.
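The geometry factor in Equation 6 takes one running sum to compute. A minimal sketch, assuming $d = 128$ and the normalization by the zero-offset value described for Figure 5; the function name `decay_bound` and the plotted range of 0 to 256 are assumptions for the demo.

```python
import cmath

def decay_bound(offset, d=128, base=10000.0):
    """Geometry factor B = (2/d) * sum_j |S_j|, with S_j the j-th partial phasor sum."""
    partial, total = 0j, 0.0
    for i in range(d // 2):
        partial += cmath.exp(1j * offset * base ** (-2.0 * i / d))  # S_j grows by one phasor
        total += abs(partial)
    return 2.0 * total / d

b0 = decay_bound(0)  # all phasors in phase: (d/2 + 1) / 2 = 32.5 for d = 128
curve = [decay_bound(r) / b0 for r in range(0, 257)]

for r in (0, 1, 8, 32, 128, 256):
    print(r, round(curve[r], 3))
```

Since each $|S_j|$ is at most $j$, the normalized curve never exceeds 1; calling `decay_bound` with a different `base` shows how that knob reshapes the curve.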

Figure 5 · the score damps with distance

[Interactive figure. Controls: base (the 10000; shown: 10k), PI×4 toggle, NTK×4 toggle.]

The paper's bound $B(|m-n|) = \tfrac{2}{d}\sum_j |S_j|$ against relative distance, here plotted normalized to its zero-offset value (the raw bound starts near $d/4$). It therefore starts at $1$ (every phasor in phase at zero offset) and trends down as distance grows, because phasors at different frequencies fall out of phase. The decay is not monotone: visible wiggles ride the falling envelope where planes accidentally re-align. The dashed vertical is a demo training length $L_{\text{train}} = 32$. Slide the base to see how the raw RoPE curve responds; toggle PI×4 (rescales the position index; Chen et al. 2023) or NTK×4 (multiplies the base by $s^{d/(d-2)} \approx 4.09$ for $d=128$; YaRN builds on the same base change). The faint dashed teal curve is the vanilla baseline at the same base, so the regime's shift is visible.

The decay comes with no learned bias, no window, and no extra parameter. The paper frames it as a built-in inductive prior toward local attention; the modern reading is that it is a soft recency bias that can be tuned away with one constant or by rescaling the position index when you want longer context.

It rides linear attention

Plain softmax attention reads every query-key pair, an $N\times N$ matrix at sequence length $N$ , and additive relative-position schemes (T5's learned scalar bias per distance bucket, ALiBi's fixed linear distance penalty) inherit the same cost: the bias $b_{m,n}$ is a per-pair number, so you still touch every pair to apply it. RoPE does not. A rotation is multiplicative, and that means it composes with the rewrite that turns attention from $O(N^2)$ into $O(N)$ : the linear-attention rewrite.

Linear attention replaces the softmax with a feature map $\phi$ applied to queries and keys, so the score factors as $\phi(\mathbf{q})^\top \phi(\mathbf{k})$, and that lets you swap the order of operations. Instead of computing every $\phi(\mathbf{q}_m)^\top \phi(\mathbf{k}_n)\mathbf{v}_n^\top$, you build one running accumulator over keys and values,

\mathbf{S} \;=\; \sum_{n=1}^{N} \phi(\mathbf{k}_n)\,\mathbf{v}_n^\top \quad\in\mathbb{R}^{d\times d}

(7)

and let each query read from it: $\phi(\mathbf{q}_m)^\top\mathbf{S}$. One pass to build the accumulator, one read per query, total cost $O(N)$ instead of $O(N^2)$. The schemes differ in how they handle positions. An additive bias $b_{m,n}$ sits inside the softmax and depends on the pair, so it can't be pulled out and you have to walk every pair anyway. RoPE's rotation sits outside the dot product and factors as

\langle\mathbf{R}_{m\theta}\mathbf{q},\;\mathbf{R}_{n\theta}\mathbf{k}\rangle \;=\; \mathbf{q}^\top\mathbf{R}_{m\theta}^\top\,\mathbf{R}_{n\theta}\mathbf{k} \;=\; \mathbf{q}^\top\mathbf{R}_{(n-m)\theta}\mathbf{k}

and Section 3.3 of the paper does the natural composition: apply the feature map first, then rotate its output. The key side stores $\mathbf{R}_{n\theta}\phi(\mathbf{k}_n)$, the query side computes $\mathbf{R}_{m\theta}\phi(\mathbf{q}_m)$, and the inner product is $\phi(\mathbf{q}_m)^\top \mathbf{R}_{(n-m)\theta} \phi(\mathbf{k}_n)$, a quadratic form in the feature map with the relative-offset rotation in the middle. The order matters: for a general nonlinear $\phi$ the rotation does not commute with the feature map, so $\phi(\mathbf{R}_{m\theta}\mathbf{q}_m)^\top\phi(\mathbf{R}_{n\theta}\mathbf{k}_n)$ would not reduce to a function of $n-m$. With the rotation applied last, the accumulator structure carries over unchanged: build $\mathbf{S} = \sum_n \mathbf{R}_{n\theta}\phi(\mathbf{k}_n)\mathbf{v}_n^\top$ in one pass, read from it once per query, and the cost stays $O(N)$. The paper presents RoPE as the relative-position scheme of its era that composes with linear attention this way, which is part of why so many later architectures adopted it.
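The accumulator identity can be checked on a toy problem. This is a sketch under several assumptions, not the paper's code: it uses the stored form $\mathbf{R}_{n\theta}\phi(\mathbf{k}_n)$ given above, takes $\phi(x) = \mathrm{elu}(x) + 1$ as the feature map (the choice in Katharopoulos et al.), computes only the unnormalized, non-causal numerator, and all function names and sizes are hypothetical.

```python
import math
import random

def feature_map(x):
    """elu(x) + 1, a positive feature map (the choice in Katharopoulos et al.)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def apply_rope(x, position, base=10000.0):
    """Rotate consecutive coordinate pairs of x by position * theta_i."""
    d, out = len(x), []
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [c * x[2 * i] - s * x[2 * i + 1], s * x[2 * i] + c * x[2 * i + 1]]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise(queries, keys, values):
    """O(N^2) reference: visit every (m, n) pair."""
    rq = [apply_rope(feature_map(q), m) for m, q in enumerate(queries)]
    rk = [apply_rope(feature_map(k), n) for n, k in enumerate(keys)]
    dv = len(values[0])
    return [[sum(dot(q, k) * v[j] for k, v in zip(rk, values)) for j in range(dv)]
            for q in rq]

def accumulated(queries, keys, values):
    """O(N): fold rotated keys into S = sum_n R_n phi(k_n) v_n^T, then read per query."""
    d, dv = len(keys[0]), len(values[0])
    S = [[0.0] * dv for _ in range(d)]
    for n, (k, v) in enumerate(zip(keys, values)):
        rk = apply_rope(feature_map(k), n)
        for a in range(d):
            for j in range(dv):
                S[a][j] += rk[a] * v[j]
    out = []
    for m, q in enumerate(queries):
        rq = apply_rope(feature_map(q), m)
        out.append([sum(rq[a] * S[a][j] for a in range(d)) for j in range(dv)])
    return out

random.seed(0)
N, d = 6, 4  # toy sequence length and width
Q, K, V = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(N)] for _ in range(3))
ref, fast = pairwise(Q, K, V), accumulated(Q, K, V)
max_gap = max(abs(a - b) for r, f in zip(ref, fast) for a, b in zip(r, f))
print(max_gap)
```

The two routes agree up to rounding because the position-dependent factor on each key is fixed before it enters `S`; a per-pair bias $b_{m,n}$ has no such per-key form.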

Figure 6 · multiplicative beats additive

[Interactive figure. Controls: regime toggle, sequence length N (shown: 24).]

Toggle the regime and drag $N$. With RoPE + linear attention, every key is rotated by its position and folded into one accumulator $\mathbf{S}$ that every query reads: cost $O(N)$, one box for any sequence length. With softmax or additive relative bias, the position term sits inside the softmax and forces the full $N\times N$ grid: cost $O(N^2)$. The cost bars at the bottom show the quadratic bar dwarfing the linear one well before $N=100$.

Modest tables, total adoption

The paper validates RoPE on the small benchmarks that were standard in early 2021. On machine translation (WMT 2014 English-to-German, Table 1) RoFormer reaches 27.5 BLEU against the original Transformer's 27.3, a slim +0.2. On enwik8 character-level pre-training the paper reports a faster convergence curve for Performer (a linear-attention Transformer that replaces softmax with a kernel feature map) + RoPE versus Performer alone, not a parity claim against BERT. On downstream GLUE (Table 2) RoFormer trails BERT on SST-2, QNLI, and MNLI and beats it on MRPC, STS-B, and QQP, a mixed result. The cleanest gains are on a long-document Chinese benchmark (CAIL2019-SCM, a legal case-matching task), which is where handling long inputs starts to matter. The empirical case in 2021 was modest, and the authors are careful to say so.

RoPE spread to nearly every open-weights LLM because of its engineering shape rather than its benchmark numbers. The rotation is parameter-free, so it cannot be the thing that gets fine-tuned wrong on a new corpus. It is local: every query and key carries its own rotation and need not consult any other token, which is compatible with linear attention (in the paper's form, where the rotation is applied to the feature-mapped vectors) and with kernel-level fast-attention rewrites alike. It exposes one interpretable knob, the base $10{,}000$, so when later work needed to push context past the training length it had a clean lever to pull: NTK-aware scaling and YaRN raise the base; Position Interpolation rescales the position index itself. It separates the question of “which token is this” from “where does it sit”, so weights trained at one context length transfer to another with a small fine-tune or none at all. LLaMA, GPT-NeoX, PaLM, Mistral, and most open-weights models since 2022 use RoPE, and nearly every long-context patch is a tweak of its schedule (the base, or an effective rescaling of position) rather than a replacement.

Everything rests on one operation: a 2D rotation by the absolute position, which turns the dot product into a function of the relative position with no extra term. Stack $d/2$ such rotations at geometric speeds and one position index yields many wavelengths, fast for local and slow for global. The phasor sum that results damps with distance, as the Abel-summation bound shows, giving an inductive prior toward local attention you can re-tune with a single constant. None of it is learned and it adds almost nothing at inference, and because the rotation factors out of the linear-attention accumulator, it survives the fast-attention rewrites that additive biases cannot. That combination, rather than the modest 2021 tables, made it the default.

Provenance: verified against primary literature

RoFormer (2021): The source paper: the 2D rotation, the d/2-plane stack, Section 3.4.3 decay, Section 3.3 linear-attention compatibility, and the empirical tables on MT and GLUE.

Vaswani et al. (2017): The sinusoidal schedule with base 10000 that RoPE adopts verbatim for the per-plane angular speeds.

Shaw / T5 / ALiBi: The additive relative-bias schemes RoPE replaces; the contrast with multiplicative rotation is the basis of the linear-attention argument.

Katharopoulos et al. (2020): The linear-attention rewrite that RoPE survives because rotations factor out of the accumulator.

Position Interpolation / YaRN: Long-context extensions that all tune RoPE’s base or effective wavelength; they did not exist when the paper was written but justified the design retroactively.

Correction: RoPE was not adopted because of RoFormer’s empirical tables, which are modest (WMT 2014 En→De +0.2 BLEU; GLUE mixed; enwik8 shown only as a Performer convergence curve, not a parity claim against BERT). It was adopted because the encoding has no learned parameters, slots into kernel-level fast attention, and exposes interpretable knobs (the base 10000 and the position index) that long-context work can retune without retraining. The decay claim is an upper bound on the score, not an exact decay curve, and it is an oscillating envelope rather than a monotone slide. The 10000 base is borrowed from Vaswani et al., not derived in the paper. We write the relative offset as n−m to match the natural construction R_m^T R_n = R_{n−m}; the paper’s Eq 11 writes it as m−n. The two are equivalent up to a sign in θ, and the decay envelope depends only on |m−n|. The schedule index in Eq 4 starts at i=1 (paper Section 3.2.2); some code bases use i=0, with the same set of rates.

Questions you might still have

Why does a rotation turn absolute position into relative position?
Rotations are orthogonal, so when you transpose one and multiply it by another, only the difference of the angles survives: R_mᵀ R_n = R_{n−m}. Rotate the query by mθ and the key by nθ, then take the dot product, and the result depends only on n−m (which the paper, with a sign-flipped convention, writes as m−n). The absolute positions cancel out of the score by construction, with no extra term in the formula and no learned parameters.

Why d/2 planes and not just one rotation in the full d-dimensional space?
One rotation has one wavelength. You want short wavelengths to resolve nearby tokens and long wavelengths to span the entire sequence, like a single position index driving a multi-band radio. Splitting into d/2 independent 2D planes lets each plane spin at its own rate θ_i = 10000^{-2(i-1)/d}, geometrically spaced from 1 radian per token down to ≈1.15e-4 radians per token (for d = 128). The same position drives all of them, so you encode every length-scale at once.

Why does an additive relative bias break linear attention but rotation doesn’t?
An additive bias b_{m,n} sits inside the softmax as a per-pair number. To apply it you have to walk every query-key pair, which is exactly the N² cost linear attention was designed to avoid. A rotation is multiplicative and sits outside the dot product. You can push the query’s rotation onto the query side and the key’s rotation onto the key side, fold every (rotated) key into one running accumulator, and let each (rotated) query read from it. The pair-dependence emerges in the read, not in the storage. For a general nonlinear feature map φ the rotation does not commute with φ, so the order matters: the paper rotates the feature-mapped vectors, which gives the score φ(q)ᵀ R_{n-m} φ(k), whereas applying φ after the rotation would not reduce to a function of n−m. The accumulator structure works as long as the rotation is applied last.

How do people stretch RoPE to longer context?
Two approaches. NTK-aware scaling and YaRN raise the base 10000 to flatten the decay curve and stretch the slow planes’ wavelengths past anything in training. Position Interpolation rescales the position index m itself, so the model sees the same range of angles even on a longer sequence. The structural payoff of separating the rotation construction from the trained weights is that both can be used at inference time with a small fine-tune or none at all.
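The two approaches differ only in where the extension factor enters the angle $m\theta_i$. A hedged sketch, assuming $d = 128$, an extension factor $s = 4$, and an example position of 8192; the function names are hypothetical, and the NTK-aware base multiplier $s^{d/(d-2)}$ is the one quoted in this explainer rather than taken from any particular library.

```python
def rope_thetas(d=128, base=10000.0):
    """Per-plane rates theta_i = base^(-2(i-1)/d), i = 1..d/2."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def angles_position_interpolation(m, s, d=128, base=10000.0):
    """Position Interpolation: divide the position index by the extension factor s."""
    return [(m / s) * t for t in rope_thetas(d, base)]

def angles_ntk_aware(m, s, d=128, base=10000.0):
    """NTK-aware scaling: multiply the base by s^(d/(d-2)), leave the index alone."""
    return [m * t for t in rope_thetas(d, base * s ** (d / (d - 2)))]

s, d, m = 4.0, 128, 8192  # example extension factor, head width, position
plain = [m * t for t in rope_thetas(d)]
interp = angles_position_interpolation(m, s, d)
ntk = angles_ntk_aware(m, s, d)

print("fastest plane:", plain[0], interp[0], ntk[0])
print("slowest plane:", plain[-1], interp[-1], ntk[-1])
```

Position Interpolation shrinks every plane's angle by $s$. With this base multiplier, NTK-aware scaling leaves the fastest plane untouched and shrinks the slowest plane's angle by exactly $s$, with the planes in between scaled by intermediate amounts.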

Footnotes & further reading

The paper: Su, Lu, Pan, Murtadha, Wen, Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding (Zhuiyi Technology, 2021; revised through 2023).
The sinusoidal schedule RoPE borrows the base 10000 from: Vaswani et al., Attention Is All You Need (2017).
The relative-position predecessors RoPE displaces: Shaw, Uszkoreit, Vaswani, Self-Attention with Relative Position Representations (2018), and T5's scalar bias, in Raffel et al., Exploring the Limits of Transfer Learning (2019).
The linear-attention rewrite RoPE composes with: Katharopoulos, Vyas, Pappas, Fleuret, Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020).
Long-context extensions of RoPE: Chen et al., Extending Context Window of Large Language Models via Positional Interpolation (2023), which rescales the position index; Peng et al., YaRN (2023), which combines a base change with attention temperature; plus the NTK-aware scaling community write-ups, which raise the base.
