Mamba: Linear-Time Sequence Modeling with Selective State Spaces
A model that chooses what to remember.
Attention reads the whole past at once and pays for it, every token against every other. Mamba carries a fixed-size running summary instead, and adds the one thing earlier recurrences lacked: it lets the current token decide what the summary keeps.
Explaining the paperMamba: Linear-Time Sequence Modeling with Selective State SpacesWhat if a sequence model could read a million tokens, remember the few that mattered, and never build a cache?
The Transformer won by reading everything. At every step, attention compares the current token against all the tokens before it, weighs them, and mixes them in. Nothing gets compressed and nothing gets forgotten on the way, which is exactly why it is so good. It is also why it is expensive. Reading a sequence of length means roughly comparisons, and at generation time the model keeps a growing cache of every past token's keys and values so it can attend to them again. Double the context and you quadruple the attention work and double the cache. For a paragraph this is fine. For a book, a genome, or an hour of audio it is a wall.
Mamba, from Albert Gu and Tri Dao, goes back to an older and cheaper idea: a recurrence. Walk through the sequence once, left to right, carrying a small fixed-size summary of everything seen so far. Cost grows with , not , and at generation time there is nothing to cache because the whole past lives in that one summary. Recurrences are not new, and for years they lost to attention on language for one specific reason: a plain recurrence treats every token the same way, so it cannot choose to remember a name and forget a comma. Mamba's fix is a single idea with a precise mechanism. Let the summary's update rule depend on the token it is reading. The model learns to fling its memory gate open on the tokens that matter and slam it shut on the ones that do not.
That one change, plus the engineering to make it run fast on a GPU, is the whole paper. To see why it works we build a small tower: what a state space model is, how a continuous one becomes a discrete recurrence, why that recurrence is secretly also a convolution, what breaks when you make it selective, and how a careful scan puts the speed back. None of the pieces is hard alone.
The bill attention runs up
Start with the cost, because it is the thing Mamba is trying to dodge. Two ways to read a sequence, side by side. On the left, attention: token looks back at every token , so the work it does is the area of a triangle, and that area grows like . On the right, a recurrence: one fixed-size state slides down the sequence, and each token does the same bounded amount of work to fold itself in, so the total grows like .
The recurrence has a second prize that matters even more in deployment. To generate the next token, attention has to keep a KV cache, the keys and values of every token so far, and that cache grows without bound. A recurrence keeps one state of fixed size no matter how long the sequence gets, so per-token generation is constant time and constant memory. That is where Mamba's reported 4–5× higher inference throughput comes from: no cache means you can run far larger batches on the same hardware.
So why did the field not just use recurrences? Because the obvious recurrence, a plain RNN, is slow to train (you cannot parallelize a sequential loop across the sequence) and tends to forget. The work Mamba builds on solved the first problem. Mamba solves the second. We need the machinery first.
A state space model: one running summary
The recurrence Mamba uses is a state space model (SSM), an idea borrowed from control theory. Picture a small hidden state , a vector of numbers, that summarizes the input stream up to time . As a new input arrives, the state evolves by a linear rule, and the output is read off from the state:
Three matrices run the show. is the state transition: how the summary decays and reshuffles itself on its own, ignoring new input. is how a new input gets written into the state. is how the state gets read out into an answer. The state is the running summary, decides how long memories linger, and are the write and read heads. (Earlier SSM work, the HiPPO line, showed you can choose so the state is a near-optimal compression of the whole history. Mamba inherits a simple diagonal version of that.)
One thing to hold onto, because it comes back. This equation is for a single scalar input channel. A real model has channels of width (the hidden dimension), and Mamba runs of these SSMs in parallel, one per channel, each with its own -dimensional state. So the actual hidden state is a grid, not a single -vector. Keep small (Mamba uses ) and that grid stays cheap even though it is doing a lot of bookkeeping.
From continuous to discrete: the step Δ
Equation (1) is written in continuous time, with a derivative . Real data is a discrete sequence of tokens, so we need the discrete version: given the state at step , what is the state at step ? Turning a continuous dynamical system into a discrete step is a standard move called discretization, and it introduces the single most important number in Mamba: a step size .
Think of as how much continuous time passes between two tokens. The clean way to integrate (1) over a slice of width , assuming the input is held constant across the slice, is the zero-order hold (ZOH). It gives discrete matrices that play the role of in a plain recurrence:
with the recurrence and readout
The shape is exactly a gated RNN: keep some of the old state, write in some of the new input. The is doing the work. With negative (Mamba parameterizes it as , so it stays negative and the state decays rather than blows up), a larger means more continuous time elapsed, so is smaller, the old state decays harder, and the new input dominates. A smaller means almost no time passed, , and the state coasts along nearly unchanged. So is a dial between reset to the current input and ignore the current input and persist. File that away.
The picture below makes concrete. The smooth amber curve is the true continuous state; the teal dots are the discrete recurrence sampling it every units. Small samples often and hugs the curve; large takes big coarse jumps. The discrete model only ever sees the dots.
One honest aside, and the first place the paper and the code part ways. Equation (2) is the exact ZOH formula for . The released kernel does not bother with the term; it uses the first-order approximation (a plain Euler step), keeping only exact. The two agree to first order in , and the simpler form trains fine. We flag it in the Provenance panel and cite the file.
Two faces: a recurrence and a convolution
Here is the property that made these models trainable at all, and it holds only when are the same at every step (the model is time-invariant). Unroll the recurrence (3) from a zero start. The output at position is a fixed weighted sum of the current and past inputs:
Those coefficients do not depend on . They are one fixed sequence, a convolution kernel:
So the very same model has two faces. As a recurrence you step through it one token at a time, carrying the state, which is perfect for generating tokens one by one. As a convolution you apply that single fixed kernel to the whole sequence at once, which (via the FFT) is fast and parallel, perfect for training on a long sequence in one shot. Same numbers, two schedules. This duality is what let structured SSMs train at scale.
Read the kernel and you can feel the limitation coming. is fixed before the model sees any data. The weight on "the token 5 steps ago" is whether that token was a crucial name or a throwaway "the." The model decays the past on a schedule it set in advance, blind to content. That is the flaw.
The flaw: one filter for every token
To expose it, the authors use a deliberately mean little task: selective copying. The model sees a stream of tokens, most of them random filler, a few of them marked content, and it has to reproduce only the content tokens, in order, ignoring the filler. The catch is that the filler sits at random positions, so the model cannot just learn "copy positions 3, 7, and 11." It has to decide what to keep based on what each token is.
A time-invariant SSM cannot do this, and the reason is the fixed kernel. It applies the same decay to every token, so it either keeps too much (and the filler floods the state) or forgets too fast (and loses the content). It has no knob that looks at the token and says "this one, remember; that one, drop." The numbers are stark: a non-selective SSM gets 18.3% on selective copying. Bolt a content-blind gate onto the architecture and it climbs to 57.0%, still poor. The selective model gets 99.8%.
This is the content-based reasoning the abstract says these models lacked. Attention has it for free: it scores each past token against the current one and weights it accordingly. A fixed convolution kernel cannot. So the question becomes sharp. How do you give a recurrence the ability to look at a token and choose, without giving up the linear-time cost that made it attractive?
Selection: let the input steer
The answer is direct. Make the SSM parameters functions of the input. Instead of one fixed for the whole sequence, compute a fresh at every position from the token there, via small linear layers:
where are linear maps into , is a linear map whose output is broadcast across the channels, and keeps the step positive. The authors call this selective model S6. (One matrix stays fixed: is still input-independent. It does not need to be selective, because already multiplies it inside , so steering already steers the effective decay. The paper finds is the main source of the improvement.)
Now look back at the discretization. With now a per-token quantity, is a per-token decay. A token the model deems important gets a large , which (with negative ) shrinks , wipes the stale state, and writes the new input in hard. A filler token gets a small , , and the state coasts past it untouched. has become a learned, content-dependent forget gate.
This is not a loose analogy; the paper proves it. In the simplest case (state size , , , with the softplus step), the selective recurrence collapses exactly to the classic gated update:
which is the forget gate at the heart of an LSTM or GRU. Mamba's selection is that gate, generalized to an -dimensional state and derived from the SSM rather than bolted on. Drag the gate below: push it toward persist and the state ignores new inputs and coasts; push it toward reset and the state snaps to each new input. Selection means the model learns this dial per token.
The cost of this is that the model is no longer time-invariant. Every position now has its own , so there is no single kernel to slide. The convolution face is gone. We are back to a genuine recurrence, and a naive recurrence is slow to train on a GPU. If the story stopped here, selection would be a good idea you could not afford. The last piece is making it affordable.
Running it fast: the scan
The recurrence looks irreducibly sequential, each step waiting on the one before. But it is a chain of affine maps (scale by , add ), and composing affine maps is associative. That single fact unlocks a parallel scan: instead of stepping through times, combine the maps pairwise up a tree in rounds, with every pair in a round done at once. The same prefix sums, computed in a fraction of the sequential depth.
The scan handles parallelism. The other half of the speed is about memory traffic, and it is pure systems work. The selective SSM's expanded state is a tensor of shape , big enough that writing it to the GPU's slow main memory (HBM) and reading it back would dominate the runtime. Mamba never does. The kernel is fused: it loads the small inputs into fast on-chip SRAM, performs the discretization and the scan there, and writes back only the final outputs, never the giant state. And for the backward pass it recomputes the intermediate states on the fly rather than storing them. The result, the authors report, is the same memory footprint as a well-optimized attention layer with FlashAttention, and a scan that runs far faster than a naive sequential loop. The three tricks together are kernel fusion, the parallel scan, and recomputation.
Here is the selective recurrence for one channel, unrolled, so the symbols become code:
# the selective SSM, one channel, unrolled (B,L,N math elided)
# inputs for this step: x_t (a token's value on this channel)
A = -exp(A_log) # (N,) per-channel, learned, stays negative
dt = softplus(dt_bias + s_dt(x_t)) # (1,) the step Δ, from the input
B_t = s_B(x_t) # (N,) input-dependent (was fixed in S4)
C_t = s_C(x_t) # (N,) input-dependent
Abar = exp(dt * A) # (N,) discretize: decay this step
Bbar = dt * B_t # (N,) Euler form the code ships
h = Abar * h + Bbar * x_t # (N,) carry the state forward
y_t = (C_t * h).sum() # scalar output for this channel, this token
# y_t then gets the residual skip (+ x_t * D) and the SiLU gateThe Mamba block, end to end
The selective SSM is the engine. The Mamba block is the car built around it, and the design goal was to make one simple block you can stack, with no separate attention or MLP layers. It fuses the previous SSM layer (called H3) with a gated MLP into a single repeated unit.
The block does five things, in order:
- Expand and split. A linear
in_projwidens the input from to (expansion ) and splits it into two equal branches. - Mix locally. The main branch runs a short causal
Conv1d(width 4) so each position sees a few of its neighbors, then a SiLU activation. - Select. The selective SSM runs on the main branch. Its are computed from the branch (eq 5), so this is where content-based memory happens.
- Gate. The second branch is a plain SiLU, and the two branches multiply elementwise: . This is the gated-MLP half folded in.
- Project back. A linear
out_projbrings the width back to , and a residual connection wraps the whole block.
in_proj expands and splits into an SSM branch (Conv1d, SiLU, selective SSM) and a SiLU gate. The branches multiply, then out_proj returns to width D, inside a residual.Make it concrete with shapes. Take one block at hidden width , sequence length , batch 1, the defaults from the released code: state slots, conv width 4, expansion . Note who is per-channel and who is shared. and are per channel (there are independent SSMs, each with its own 16-slot state), while and are shared across channels (shape ), broadcast over the 2048 channels. The carried state never grows with :
# one Mamba block, batch B=1, length L=2048, width D=1024
in_proj : (1, 2048, 1024) -> (1, 2048, 4096) # expand E=2, then split
x, z : each (1, 2048, 2048) # SSM branch / gate branch
conv1d : (1, 2048, 2048) width-4 causal, then SiLU
# selective SSM parameters, produced from x by a single linear (x_proj):
dt : (1, 2048, 2048) # Δ, per channel (dt_rank=64 -> 2048)
B, C : each (1, 2048, 16) # shared across the 2048 channels
A : (2048, 16) # per channel, N=16 state slots
state h : (1, 2048, 16) # carried, never grows with L
y : (1, 2048, 2048) = y * silu(z) # gate
out_proj: (1, 2048, 2048) -> (1, 2048, 1024) # back to DMost of the parameters live in the projections, about per block, with the SSM itself adding little. So Mamba is, in parameter terms, mostly a stack of big linear layers with a cheap, selective recurrence threaded between them. Stack the block, add an embedding and a final norm, and that is the model. No attention, no MLP blocks, no positional encodings (the recurrence is ordered by construction).
So what does it actually do
On language, Mamba is the first attention-free model to keep pace with a strong modern Transformer recipe (the LLaMA-style "Transformer++") at matched compute. The headline: Mamba-3B matches Transformers twice its size, both in pretraining loss and downstream evaluation, and beats same-size baselines like Pythia by a few points on common-sense reasoning. It does this while reading the sequence in linear time and generating with no KV cache, at 4–5× the throughput.
The synthetic tasks are the cleanest evidence that selection is doing the work. On selective copying, selection takes the model from 18.3% to 99.8%. On induction heads, the canonical in-context-recall task, Mamba reaches near-perfect accuracy and then does something the others cannot: trained on length 256, it extrapolates to sequences of a million tokens, 4000× longer than anything it saw in training. A fixed schedule cannot do that. A content-driven gate can, because the rule it learned ("remember this kind of token") does not depend on the length.
Selection generalizes past language too, which is the real argument for it as a backbone. On raw audio waveforms (the SC09 generation benchmark) a 24M-parameter Mamba reaches an FID of 0.67 against 1.42 for a comparable SaShiMi, beating much larger GAN and diffusion baselines. On DNA, where the alphabet is four letters and the useful context can be enormous, Mamba keeps improving as the sequence grows toward a million bases and matches strong baselines at 3–4× fewer parameters. The same block, the same selective scan, three modalities.
The limits are honest. A fixed-size state is a compression, so on tasks that need exact recall of an arbitrary past token, attention's full cache still has an edge, and hybrids that sprinkle a few attention layers among Mamba blocks tend to win on those. Selection also gives up the convolution view, so the speed depends on a custom CUDA kernel rather than a stock FFT, which is more to maintain. And the architecture is newer than the Transformer, with less of the tooling and folklore that makes a thing easy to train at the largest scales.
Step back and the argument is four facts long. A recurrence reads a sequence in linear time and generates with no cache. A structured SSM is such a recurrence, and when it is time-invariant it is also a convolution you can train fast. Making it time-invariant is exactly what stops it from choosing what to remember, so Mamba makes the gate depend on the token, which costs the convolution but buys content-based memory. A careful scan puts the speed back. The result is a model that reads everything cheaply and keeps only what matters, which turns out to be most of what attention was buying at quadratic cost.
Questions you might still have
If the state is a fixed size, doesn’t Mamba forget the distant past?
It compresses, the way an RNN does. The bet is that selection makes the compression smart: the gate keeps the few tokens that matter and drops filler, so the fixed-size state spends its room on signal. On synthetic recall it extrapolates to a million tokens; on tasks needing exact lookup of arbitrary past tokens, attention’s full cache still has an edge.
Why does making B, C, Δ depend on the input kill the convolution?
A convolution works only because the kernel is the same at every position (time-invariant). Once Δ, B, C change with the token, every position has its own dynamics, so there is no single kernel to slide. You are back to a recurrence, which is why the scan matters.
Is the selective scan actually fast, given it cannot use the FFT?
Yes, because it never materializes the big (B, L, D, N) state in slow memory. The kernel loads inputs into SRAM, runs the scan and discretization there, writes only the outputs back, and recomputes the states in the backward pass. Same memory profile as FlashAttention, and the scan itself runs far faster than a naive sequential loop.
Where did the convolution and SiLU in the block come from?
They are not part of the SSM math. The short causal Conv1d mixes a few neighboring tokens before the SSM sees them, and the SiLU gate is the gated-MLP half of the block folded in. Mamba is one repeated block that fuses the old H3 SSM layer and a gated MLP, with no separate attention or MLP blocks.
Footnotes & further reading
- The paper: Gu, Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Carnegie Mellon & Princeton, COLM 2024). Code.
- The structured SSM Mamba builds on, with the convolution view and the diagonal initialization: Gu, Goel, Ré, Efficiently Modeling Long Sequences with Structured State Spaces (S4), and Gu et al., On the Parameterization and Initialization of Diagonal State Space Models (S4D).
- The argument that an SSM state can optimally compress history: Gu et al., HiPPO: Recurrent Memory with Optimal Polynomial Projections.
- The IO-aware kernel idea Mamba's scan mirrors: Dao et al., FlashAttention.
- The block fuses the SSM layer from Fu et al., Hungry Hungry Hippos (H3), with a gated MLP in the SwiGLU family.
- The reference selective scan and block defaults are in the official repo, mamba_simple.py and selective_scan_interface.py.
How could this explainer be improved? Found an error, or something unclear? I read every message.