VerifiedarXiv:2405.2106026 min
Architecture · Sequence models

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

The selective SSM, rewritten as a matrix you can multiply on a GPU.

Mamba walked the sequence one token at a time and kept a small state. Useful, but the work was scalar, and modern GPUs are built for big matrix multiplications. Mamba-2 restricts one piece of the math, and the recurrence collapses into a structured matrix you can chunk and multiply, two to eight times faster.

Explaining the paperTransformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space DualityDao, Gu · Princeton & CMU · ICML 2024 · arXiv:2405.21060

A 2.7B model that, on the same data as Pythia-2.8B, beats Pythia-6.9B on common downstream tasks. The architecture is a selective state space model, the same family as Mamba, but the inner layer is computed by a chunked matrix multiplication. Same big-O cost as Mamba's scan, much higher arithmetic intensity.

Mamba-2, from Tri Dao and Albert Gu, is in one sense a follow-up to Mamba: same selective state space layer at the core, same fixed-size running state, same linear-time scan over the sequence. In another sense it is a much larger claim. Mamba-1 made the argument that a carefully gated recurrence could compete with attention on language. Mamba-2 makes the argument that the recurrence and attention are the same object, viewed through different decompositions of one structured matrix. That matrix is what the paper calls structured state space duality (SSD), and the algorithm that comes out of it looks like attention from the outside (big matrix multiplications, tensor cores busy) and like a recurrence on the inside (linear cost in sequence length, constant state).

The mechanism is one restriction and one decomposition. The restriction tightens the per-step dynamics of the SSM from a diagonal matrix down to a scalar times the identity, so that across NN state slots there is one shared decay per token rather than NN independent ones. That restriction is what factors the sequence-transformation matrix into a 1-semiseparable mask (one cumulative-product scalar per pair of positions) multiplied entrywise by an attention-shaped Gram matrix: every coordinate of MM is now (decay) \cdot (Gram entry). The decomposition then chunks MM into a grid of small blocks: diagonals are computed by a small attention-shaped matmul, off-diagonals are absorbed by carrying a state across chunk boundaries. The work runs in linear time in the sequence length, with the inner operation being a dense matrix multiplication GPUs are built for.

A few pieces carry the argument: an SSM written as a triangular matrix, the semiseparable structure of that matrix, the duality between this matrix and a kind of masked attention, and the chunked algorithm that exploits both views. Each piece is short on its own, and they only do their job once you see them chain together.

Why Mamba was slow on a GPU

Start with the problem Mamba-2 is solving. Mamba introduced a selective state space model: a small running state htRNh_t \in \mathbb{R}^N updated at every token by

ht=Atht1+Btxt,yt=Cthth_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top}\, h_t
(1)

with AtA_t a diagonal N×NN\times N matrix per token and Bt,CtRNB_t, C_t \in \mathbb{R}^N linear projections of the input. The state has size NN (typically 16); the head dimension PP is one channel, so the whole layer runs EDE\cdot D independent SSMs in parallel.

The cost is not the FLOP count. A scan over TT tokens does O(TN)O(T\,N) work per channel, the same asymptotic as everyone else. The cost is what those FLOPs look like to a GPU: each step is a small element-wise multiply-add followed by a reduction, nothing wider than a vector. Modern accelerators do dense matrix multiplications on tensor cores at hundreds of teraFLOPs per second, while scalar code on the same chip runs at a small fraction of that. Mamba's selective scan needs a careful CUDA kernel just to keep the GPU vaguely busy, and even with that kernel it leaves headroom on the table.

FlashAttention had already shown that for attention, the work-per-second on a GPU matters more than the FLOP count, and that phrasing the work as matrix multiplications and loading data into SRAM in tiles is what unlocks tensor cores. Mamba-2 asks the same question of selective SSMs: can the recurrence be rewritten so the inner work is a matrix multiplication, without giving up linear-in-TT cost? It can, and seeing why takes one move: write the SSM as a matrix.

A recurrence is a triangular matrix

Unroll equation (1) from h0=B0x0h_0 = B_0\, x_0. The state at step tt is a sum of every past input pushed through the chain of AA matrices since:

ht  =  i=0tAtAt1Ai+1Bixi.h_t \;=\; \sum_{i=0}^{t}\, A_t\, A_{t-1}\, \cdots\, A_{i+1}\, B_i\, x_i.

Multiplying by CtC_t^{\top} collapses the output to a single sum:

yj  =  i=0j  CjAjAj1Ai+1BiMji  xiy_j \;=\; \sum_{i=0}^{j}\; \underbrace{C_j^{\top}\, A_j\, A_{j-1}\, \cdots\, A_{i+1}\, B_i}_{M_{ji}}\; x_i
(2)

That is a matrix multiplication y=Mxy = M\, x, where MM is lower triangular (the SSM is causal: yjy_j only sees xix_i for iji \le j) and each entry is the product of a contraction (CjC_j^{\top}), a chain of step matrices (AjAi+1A_j \cdots A_{i+1}), and an expansion (BiB_i). Nothing has been added or lost. The recurrence and the matrix multiplication compute the same numbers; they are two algorithms for the same map.

The matrix MM is not a generic T×TT \times T array. It has very specific structure: the entry at (j,i)(j, i) with j>ij > i is a product whose "middle" depends on Ai+1,,AjA_{i+1}, \ldots, A_j and whose "endpoints" depend on CjC_j and BiB_i. That structure has a name.

Figure 1 · the SSM as a triangular matrix
4
The same selective SSM, written as a matrix MM with Mji=CjAjAi+1BiM_{ji} = C_j^{\top} A_j \cdots A_{i+1} B_i. The diagonal is the immediate contribution of token ii to its own output; the lower triangle is the SSM's memory, with intensity decaying as the cumulative product AjAi+1A_j \cdots A_{i+1} shrinks. The dashed box outlines one off-diagonal submatrix; its rank is at most NN, the SSM's state size. Slide NN and watch the off-diagonal cells gain or lose information capacity.

Off-diagonal blocks are low rank

Pick any block of MM that lies strictly below the diagonal. Pick rows j,,j1j, \ldots, j'-1 and columnsi,,i1i', \ldots, i-1, with iji \le j, so the block sits entirely below the diagonal (no diagonal entries inside). Every entry of that block is

Mjj,ii  =  (CjAj:j)Aj:i(Ai:iBi)M_{j' j,\, i' i} \;=\; \big(C_{j'}^{\top}\, A_{j':j}\big)\, \cdot\, A_{j:i}\, \cdot\, \big(A_{i:i'}\, B_{i'}\big)

for the corresponding indices. The center factor Aj:iA_{j:i} is a single matrix that does not depend on which row or column you are in: it is fixed by the block. The left factor varies with the row, the right factor varies with the column, and every entry of the block is therefore (left vector) \cdot (fixed middle) \cdot (right vector). That is a rank-NN factorization: the block factors through an NN-dimensional middle.

A matrix whose every off-diagonal block has rank at most NN is called N-semiseparable. The structured representation that writes each entry as CjAjAi+1BiC_j^{\top} A_j \cdots A_{i+1} B_i is called the sequentially semiseparable (SSS) form. Two non-trivial facts about this class. First, every NN-semiseparable matrix can be written in SSS form, so the SSM family and the semiseparable family are the same matrix family. Second, an NN-semiseparable matrix of size TT can be stored in O(NT)O(N\,T) parameters and multiplied by a vector in O(NT)O(N\,T) work, even when the underlying AtA_t blocks are dense (Pernet, Signargout & Villard 2023, restated as Proposition 3.6 in the paper).

The first fact, when you take it seriously, is the centre of the paper: any algorithm for multiplying a semiseparable matrix is an algorithm for evaluating an SSM, and conversely. Years of work on rank-structured matrix algebra apply directly to SSMs, and the SSM literature can borrow whichever decomposition is most convenient.

Mamba-2: one scalar per step

Mamba uses a diagonal AtA_t: each of the NN state slots has its own decay rate per token. That is flexible (each slot can remember on its own time scale), but with NN independent decays per token there is no single scalar to pull out as a 1-semiseparable mask. To get one, we want one decay per token, shared across the slots. So Mamba-2 makes a restriction: At=atINA_t = a_t \cdot I_N, a scalar times the identity. Per-step dynamics now boils down to a single number atRa_t \in \mathbb{R}.

Figure 2 · scalar-identity A versus diagonal A
In Mamba-1, the per-step matrix AtA_t is diagonal, with one decay rate per state slot. In Mamba-2 it is restricted to scalar times identity: every slot decays at the same rate ata_t at this token. Toggle between the two and watch the per-step block collapse from NN independent decays to one number.

What does the restriction cost? In principle, model capacity per step. In practice, on the kinds of sequence the paper measures, the cost vanishes: language modeling perplexity matches Mamba and Transformer++ at every size from 125M to 1.3B (Figure 9). On the MQAR synthetic the picture is even stranger: Mamba-2 with N=16N = 16 is noticeably better than Mamba-1 with the same N=16N = 16, despite being strictly less expressive per step. The paper does not pin down the cause and is honest about it.

What does the restriction buy? With AtA_t a scalar, the cumulative product AjAi+1=ajaj1ai+1A_j \cdots A_{i+1} = a_j a_{j-1} \cdots a_{i+1} is also a scalar, call it LjiL_{ji}. The matrix

Lji  =  {ajaj1ai+1ji0j<iL_{ji} \;=\; \begin{cases} a_j\, a_{j-1}\, \cdots\, a_{i+1} & j \ge i \\ 0 & j < i \end{cases}
(3)

is itself a structured matrix: lower-triangular, with each entry a chain product. This is a 1-semiseparablematrix, the simplest member of the family we just defined. And the SSM matrix MM factors through it. Writing Qj=Cj,Ki=Bi,Vi=xiQ_j = C_j, K_i = B_i, V_i = x_i,

Mji  =  CjBiajai+1  =  (QjKi)LjiM_{ji} \;=\; C_j^{\top} B_i \cdot a_j\, \cdots\, a_{i+1} \;=\; (Q_j^{\top} K_i) \cdot L_{ji}
(4)

and therefore y=Mx=(LQK)Vy = M\, x = (L \circ Q K^{\top}) \cdot V, where \circ is the elementwise (Hadamard) product. We have walked the SSM into the form of masked attention without softmax, with the causal mask LL replaced by a data-dependent decay mask.

The same matrix as masked attention

Equation (4) is the duality. The left side is the SSM, written as a linear recurrence on a state of size NN; you compute it by walking the sequence and updating ht=atht1+Btxth_t = a_t h_{t-1} + B_t x_t and yt=Cthty_t = C_t^{\top} h_t, work linear in TT. The right side is masked attention with the softmax removed and the causal mask replaced byLL; you compute it by forming the T×TT \times T Gram matrix QKQ K^{\top}, multiplying by LL elementwise, and contracting with VV, work quadratic in TT.

Same MM. Same numbers out. Two algorithms.

Figure 3 · two faces of the same matrix
Left toggle: the linear (recurrent) mode walks one column of inputs through a small state, with work O(TN2)O(T N^2) per channel and an inference state of size O(N2)O(N^2). Right toggle: the quadratic (attention-like) mode fills every cell of LQKL \circ Q K^{\top} in parallel, with work O(T2N)O(T^2 N). Same matrix, two ways to fill it in. The recurrent mode is cheap in theory and scalar in practice; the quadratic mode is expensive in theory but matmul-heavy and tensor-core friendly.

The duality is more than an analogy. The paper proves, in the other direction, that any masked-attention variant with an efficient autoregressive recurrence must use a semiseparable mask (Theorem 5.2). So the matrix class is the right place to look for fast sequence layers: structured masked attention with semiseparable LL is exactly the family that admits a linear-time recurrence, and that family includes both selective SSMs and linear attention.

What we have so far is two views and a cost: linear-in-TT but scalar, or quadratic-in-TT but matmul. To win we need an algorithm that is both. That is what the SSD chunked decomposition is for.

The SSD algorithm: chunk the matrix

Partition the T×TT \times T matrix MM into a grid of chunks of size Q×QQ \times Q. There are now T/QT/Q diagonal chunks and a triangular array of (T/Q2)\binom{T/Q}{2} off-diagonal chunks below the main diagonal. The diagonal chunks are full-rank, but they are small (Q128Q \le 128 in practice). The off-diagonal chunks inherit the structure of MM: they are off-diagonal submatrices of anNN-semiseparable matrix, so each one has rank at most NN.

Now compute by parts. The diagonal chunks are Q×QQ \times Q blocks where you actually do want the quadratic form, because Q2Q^2 is small and the form is a dense matmul. The off-diagonal chunks you do not store at all, you just remember the chunk-boundary states and let a much shorter recurrence carry them across the diagonal. Concretely:

Figure 4 · the SSD block decomposition
Q = 4
The full causal SSM matrix, chunked. Diagonal blocks are computed as small attention (a batched matmul of CBC B^{\top}, the 1-SS mask, and XX); off-diagonal blocks are low-rank, so they are never stored. Their effect is carried by a short T/QT/Q-length scan on chunk-states. Drag QQ from 1 to T to slide between pure recurrence (no matmul) and pure quadratic attention (no recurrence); the sweet spot in the middle is where SSD lives.

Count the cost on the structure above. With N=P=QN = P = Q all of the matmul terms become BMM(T/N,N,N,N)\text{BMM}(T/N, N, N, N), so the total FLOP count is O(TN2)O(T\, N^2), the same as the linear recurrent form. The inner-most operation is now a dense N×NN \times N matrix multiplication, which is what GPU tensor cores are designed for; the chunk-state scan is a length-T/QT/Q reduction with negligible cost. There is no T2T^2 blow-up because no Q×QQ \times Q' block for Q>QQ' > Q is ever materialized.

The full algorithm fits in a page of PyTorch. The paper's Listing 1, slightly annotated:

# the full SSD layer, PyTorch. block_len is the chunk length Q.
# X: (batch, T, n_heads, d_head)        the inputs (analog of V)
# A: (batch, T, n_heads)                 the SCALAR per-step decay log a_t
# B: (batch, T, n_heads, d_state)        the input projection (analog of K)
# C: (batch, T, n_heads, d_state)        the output projection (analog of Q)

# 1. cut everything into chunks of length Q along the time axis.
X, A, B, C = [rearrange(x, "b (c l) ... -> b c l ...", l=Q) for x in (X, A, B, C)]
A = rearrange(A, "b c l h -> b h c l")
A_cumsum = torch.cumsum(A, dim=-1)              # log a_{t:t0}

# 2. INTRA-CHUNK (the diagonal blocks): small attention, in parallel.
L = torch.exp(segsum(A))                         # the 1-SS mask, exp of segment sum
Y_diag = einsum("bclhn,bcshn,bhcls,bcshp->bclhp", C, B, L, X)

# 3. each chunk's exit state, supposing its initial state was 0:
decay_states = torch.exp(A_cumsum[..., -1:] - A_cumsum)
states = einsum("bclhn,bhcl,bclhp->bchpn", B, decay_states, X)

# 4. INTER-CHUNK recurrence on the chunk-states  (1-SS scan, length T/Q):
decay_chunk = torch.exp(segsum(F.pad(A_cumsum[..., -1], (1, 0))))
new_states = einsum("bhzc,bchpn->bzhpn", decay_chunk, states)
states, final_state = new_states[:, :-1], new_states[:, -1]

# 5. CARRY each chunk's true initial state to its outputs:
state_decay_out = torch.exp(A_cumsum)
Y_off = einsum("bclhn,bchpn,bhcl->bclhp", C, states, state_decay_out)

# 6. fuse: intra-chunk + cross-chunk.
Y = rearrange(Y_diag + Y_off, "b c l h p -> b (c l) h p")
return Y, final_state

segsum is a segment-cumulative-sum that turns the log-decay sequence into the 1-SS mask:L = exp(segsum(A)) assembles the lower-triangular matrix of cumulative products ajai+1a_j \cdots a_{i+1}. Everything else is one of two operations: a chunk-shaped Einstein summation (the matmul lines) or a length- T/QT/Q recurrence on chunk-states (the decay_chunk line). Even in vanilla PyTorch without custom CUDA, this version is competitive with Mamba's heavily-tuned selective scan, because the work the GPU is being asked to do is precisely the kind it does best.

The Mamba-2 block

The block itself is one small structural change from Mamba-1 and one normalization addition. In Mamba-1 the projections that produce Δ,B,C\Delta, B, C are functions of the SSM input XX, so they sit serially inside the block:XX is projected from the residual stream, then a second projection computes Δ,B,C\Delta, B, C from XX (Figure 6 in the paper). In Mamba-2 the layer is viewed as a map (A,X,B,C)Y(A, X, B, C) \mapsto Y, so all four are projected in parallel at the start of the block, the way Q,K,VQ, K, V are in attention. That removes the inner sync point, which lets tensor parallelism shard the layer cleanly across GPUs the way Megatron shards attention and MLP layers.

A GroupNorm sits just before the final output projection, after the multiplicative gate. The paper finds this normalization fixes instabilities at larger scale, and notes the same pattern in TransNormerLLM and RetNet (NormFormer-style late normalization). The head pattern Mamba-2 uses is MVA / MIS: BB and CC are shared across the HH heads of XX, the same head structure as multi-value attention.

Figure 5 · Mamba-1 vs Mamba-2 block
In Mamba-1, Δ,B,C\Delta, B, C are computed inside the block from the SSM input XX, which is itself an internal projection. That serial chain blocks tensor parallelism. In Mamba-2, a single input projection produces X,A,B,CX, A, B, C in parallel, all four feed the SSD layer, and a GroupNorm runs before the output projection. One all-reduce per block, the same as for Transformer attention and MLP layers.

The other minor change is a larger head dimension: Mamba-1 uses P=1P = 1 (one channel per SSM), Mamba-2 uses P=64P = 64 or P=128P = 128, the same head dimension a Transformer would use. That change is what makes the inner matmul N×NN \times N instead of N×1N \times 1.

Bigger state, sharper recall

One thing the SSD algorithm enables, and the change of head dimension cheaply allows, is a much larger state size NN. Mamba uses N=16N = 16 by default; Mamba-2 ships with N=128N = 128, and the paper shows the algorithm continues to be efficient up to N=256N = 256 or higher. That matters most for tasks that ask the model to memorize many key/value associations in its context, the place fixed-state recurrences traditionally lose to attention.

The multi-query associative recall (MQAR) synthetic is the cleanest test. The model is shown a sequence of key-value pairs and is later prompted with some of the keys; it must produce the matching values. Attention does this well at any sequence length because it caches every token; a small-state SSM has to compress everything into NNslots, so at short sequence length or low NN it fails outright. Watch what happens to Mamba-2 as you increase NN:

Figure 6 · MQAR accuracy versus state size
1024
Multi-query associative recall, by model dimension and state size. Softmax attention sits near 100% across the board. Mamba (N=16) struggles, especially at sequence length 1024. Mamba-2 (N=16) already beats Mamba at matched state, and Mamba-2 (N=64) and Mamba-2 (N=256) push toward attention. Bigger state, more room for in-context associations. Numbers approximate paper Fig 8.

Two things to read off the figure. First, Mamba-2 at N=16N=16 outperforms Mamba at N=16N=16: the architecture change wins even at matched state size, which the paper flags as surprising and does not fully explain. Second, the trend up to N=256N=256 is monotonic; SSD's ability to keep the inner work in matmul form is what lets a model that size remain fast enough to train.

Speed and language modeling

The headline efficiency claim is on an A100 80GB: a single SSD layer is two to eight times faster than Mamba's fused selective scan, depending on state size, and faster than FlashAttention-2 from sequence length two thousand onward. At sixteen thousand tokens, SSD is about six times faster than FlashAttention-2. Both comparisons turn on the same mechanism: SSD does the same number of FLOPs as Mamba, but those FLOPs are in big matmuls instead of small scalar steps; SSD does fewer FLOPs than FlashAttention-2 (linear vs quadratic in TT), and runs them at the same matmul utilization. The crossover with FlashAttention-2 sits around two thousand tokens.

Figure 7 · SSD time per layer, vs Mamba scan and FlashAttention-2
16k
Time per layer on A100 80GB PCIe, log-log. SSD is linear-in-TT and matmul-heavy, so it runs flat-out on tensor cores. Mamba selective scan is also linear-in-TT but does scalar work, so its line is a constant multiple above SSD's. FlashAttention-2 is quadratic in TT and crosses SSD near two thousand tokens. Drag the slider to read the gap at any sequence length. Numbers approximate paper Fig 10 left.

On standard language modeling the picture is competitive rather than dominant. Mamba-2 matches Mamba and the modern Transformer recipe (RoPE + SwiGLU + RMSNorm, no biases) on Chinchilla-style scaling sweeps from 125M to 1.3B parameters. At 2.7B trained to 300B tokens on the Pile, Mamba-2 lands a Pile perplexity of 6.09 against Mamba's 6.22 and Pythia-2.8B's 6.73, with consistent gains on common downstream tasks (LAMBADA, HellaSwag, PIQA, Arc-E, Arc-C, WinoGrande, OpenbookQA). And the 2.7B Mamba-2 outperforms Pythia-6.9B trained on the same data at most of those tasks, despite less than half the parameter count.

The hybrid results are the most interesting forward-looking piece. With ~10% of the layers in a 350M / 48-layer model switched from SSD to softmax attention, Pile perplexity drops from 8.60 (pure Mamba-2) to 8.26 (six attention layers), below the 8.68 of pure Transformer++. A 2.7B Mamba-2-Attention hybrid with six attention layers (out of 64) hits 5.95 Pile perplexity, below both pure Mamba-2 (6.09) and Transformer++ (6.13). So the architecture that emerges is mostly recurrence with a few attention layers stacked in; the recurrence does the bulk-mixing work cheaply, and a few attention layers handle the lookups the recurrence cannot.

Where Mamba-2 gives ground: the scalar-A restriction is real, and any task that genuinely needed per-slot decay rates will see a difference (none of the language benchmarks did, but the existence proof is not the same as a no-cost result). Tensor parallelism, sequence parallelism, and variable-length packing all need new conventions for SSMs that the paper has to introduce from scratch, and the ecosystem is still much thinner than Transformers'. Pure associative recall at very long sequences still favors attention even at N=256N=256, which is why the hybrids exist.

An SSM is a triangular matrix whose off-diagonal blocks have rank at most NN, the state size. Restrict AtA_t to a scalar and that matrix becomes the same object as masked attention without softmax, with a 1-semiseparable mask in place of the causal mask. Chunk that matrix into a grid: diagonals are small attention, off-diagonals collapse to a short scan on chunk-states. The work is linear in TT, the inner operation is a dense matmul, and the language-modeling cost of the scalar restriction is, on the kinds of sequence anyone runs, zero.

Provenance Verified against primary literature
Mamba-2 / SSD (2024)Dao & Gu, ICML 2024: the scalar-identity A restriction, the duality with 1-semiseparable masked attention, the chunked block-decomposition algorithm (Theorem 6.1), and the Mamba-2 block.
Mamba (2023)Gu & Dao: selective SSMs (S6), the diagonal A and input-dependent Δ, B, C, and the hardware-aware scan that Mamba-2 replaces.
Linear Attention (2020)Katharopoulos et al.: kernelized softmax + causal-mask trick that turns (Q K^T) · V into a recurrence; Mamba-2 generalizes this to any 1-semiseparable mask L.
Semiseparable matricesPernet, Signargout & Villard 2023 (and references therein): N-semiseparable matrices admit O(N·T) representation and O(N·T) matrix-vector multiplication; SSMs ARE this matrix class.
mamba_ssm code (state-spaces/mamba)Reference implementation of SSD in PyTorch (segsum, einsum block recurrence); see Listing 1.
correctionThe paper writes the Euler-step displacement with the opposite sign. We teach the correct sign and explain the discrepancy.

Questions you might still have

?

Is Mamba-2 the same as Mamba with a faster scan?
No. It is a different layer. Mamba uses a diagonal A_t, so each of the N state slots has its own decay; that flexibility is what stopped you from collapsing the recurrence into a single matrix multiply. Mamba-2 restricts A_t to a scalar times identity, so the per-step decay is one number. With that restriction the recurrence becomes a structured matrix you can chunk and multiply with tensor cores, which is what the SSD algorithm does. Same family, smaller model class, faster compute.

?

Does the scalar-A restriction cost accuracy?
On language modeling, no measurable cost: Mamba-2 matches Mamba and Transformer++ on the Chinchilla-style scaling sweep, and at 2.7B trained to 300B tokens it slightly outperforms Mamba and Pythia at twice the size. On MQAR (the multi-query associative recall synthetic) Mamba-2 actually beats Mamba-1 at matched state size, which is surprising and the paper does not fully explain.

?

If SSD is "attention without softmax", does it inherit attention's long-context behavior?
Only partly. The state is still finite (N²) and the mask L is data-dependent decay, not an unbounded lookup. So associative recall still hurts at very long sequence lengths unless N grows. In practice this line of work has converged on hybrids: stack mostly SSD layers and sprinkle a few full-attention layers in. The paper finds ~10% attention layers is optimal at the 350M scale.

?

Why does the algorithm care so much about matrix multiplications, not FLOPs?
Modern GPUs do dense matmuls on tensor cores at hundreds of teraFLOP/s; the same number of scalar FLOPs in a sequential scan runs at a small fraction of that, because tensor cores sit idle. Mamba-1's selective scan is linear-in-T but scalar; SSD is linear-in-T and matmul-shaped. Same big-O, much higher utilization. The Mamba-2 paper is largely an argument that this is the lever that matters.

Footnotes & further reading

  1. The paper: Dao, Gu, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Princeton & CMU, ICML 2024). Code.
  2. The predecessor: Gu, Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces. The Mamba explainer covers the selective scan and the input-dependent Δ,B,C\Delta, B, C.
  3. Linear attention and the masked-attention recurrence trick: Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. The Mamba-2 title is a deliberate echo.
  4. Semiseparable matrices: Pernet, Signargout, Villard, Time and space efficient generators for quasiseparable matrices (the O(NT)O(NT) parameter result), and the textbook survey Vandebril, Van Barel, Mastronardi, Matrix Computations and Semiseparable Matrices.
  5. The IO-aware mindset SSD inherits: Dao, FlashAttention-2, and the original FlashAttention explainer.
  6. The two NormFormer-style adjacent papers Mamba-2 cites for late normalization: Qin et al., TransNormerLLM and Sun et al., RetNet.
  7. The MQAR task: Arora et al., Zoology: Measuring and Improving Recall in Efficient Language Models.
  8. A more discursive walk through SSD from the authors: State Space Duality (Mamba-2) on the Goomba Lab blog.