Architecture · Sequence models

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A model that chooses what to remember.

Attention reads everything seen so far at once and pays for it, every token against every other. Mamba carries a fixed-size running summary instead, and adds the one thing earlier recurrences lacked: it lets the current token decide what the summary keeps.

Explaining the paperMamba: Linear-Time Sequence Modeling with Selective State SpacesGu, Dao · CMU & Princeton · COLM 2024 · arXiv:2312.00752 ↗

Trained on sequences of 256 tokens, this model recalls the right ones a million tokens out, and keeps no cache to do it.

The Transformer won by reading everything. At every step, attention compares the current token against all the tokens before it, weighs them, and mixes them in. Nothing gets compressed and nothing gets forgotten on the way, which is exactly why it is so good. It is also why it is expensive. Reading a sequence of length $L$ means roughly $L^2$ comparisons, and at generation time the model keeps a growing cache of every past token's keys and values so it can attend to them again. Double the context and you quadruple the attention work and double the cache. For a paragraph this is fine. For a book, a genome, or an hour of audio it is a wall.

Mamba, from Albert Gu and Tri Dao, goes back to an older and less expensive idea: a recurrence. Walk through the sequence once, left to right, carrying a small fixed-size summary of everything seen so far. Cost grows with $L$ , not $L^2$ , and at generation time there is nothing to cache because all of the past lives in that one summary. Recurrences are not new, and for years they lost to attention on language for one specific reason: a plain recurrence treats every token the same way, so it cannot choose to remember a name and forget a comma. Mamba addresses this directly. Let the summary's update rule depend on the token it is reading. The model learns to open its memory gate on the tokens that matter and shut it on the ones that do not.

That one change, plus the engineering to make it run fast on a GPU, is most of the paper. To see why it works, a few ideas carry it: what a state space model is, how a continuous one becomes a discrete recurrence, why that recurrence is also a convolution, what breaks when you make it selective, and how a careful scan puts the speed back. Each is a short step; what matters is how they chain together.

The bill attention runs up

The cost comes first, because Mamba was built to reduce it. Two ways to read a sequence, side by side. On the left, attention: token $i$ looks back at every token $j \le i$ , so the work it does is the area of a triangle, and that area grows like $L^2$ . On the right, a recurrence: one fixed-size state slides down the sequence, and each token does the same bounded amount of work to fold itself in, so the total grows like $L$ .

Figure 1 · two ways to read a sequence

length LL = 12

Attention fills a triangle of token-to-token comparisons, so its work grows like L². A recurrence carries one fixed-size state down a single column, doing the same work per token, so its work grows like L. Drag L and watch the gap open.

The recurrence has a second advantage that matters even more in deployment. To generate the next token, attention has to keep a KV cache, the keys and values of every token so far, and that cache grows without bound. A recurrence keeps one state of fixed size no matter how long the sequence gets, so per-token generation is constant time and constant memory. Mamba's reported 4–5× higher inference throughput comes from that constant footprint: no cache means you can run far larger batches on the same hardware.

So why did the field not just use recurrences? Because the obvious recurrence, a plain RNN, is slow to train (you cannot parallelize a sequential loop across the sequence) and loses information from early in the sequence. The work Mamba builds on solved the first problem. Mamba solves the second. We need the machinery first.

A state space model: one running summary

The recurrence Mamba uses is a state space model (SSM), an idea borrowed from control theory. It carries a small hidden state $h(t) \in \mathbb{R}^N$ , a vector of $N$ numbers, that summarizes the input stream up to time $t$ . As a new input $x(t)$ arrives, the state evolves by a linear rule, and the output $y(t)$ is read off from the state:

h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t)

(1)

Three matrices govern the dynamics. $\mathbf{A} \in \mathbb{R}^{N\times N}$ is the state transition: how the summary decays and reshuffles itself on its own, ignoring new input. $\mathbf{B} \in \mathbb{R}^{N\times 1}$ is how a new input gets written into the state. $\mathbf{C} \in \mathbb{R}^{1\times N}$ is how the state gets read out into an answer. The state $h$ is the running summary, $\mathbf{A}$ decides how long memories linger, and $\mathbf{B}, \mathbf{C}$ are the write and read heads. (Earlier SSM work, the HiPPO line, showed you can choose $\mathbf{A}$ so the state is a near-optimal compression of the full history. Mamba inherits a simple diagonal version of that.)

This equation is for a single scalar input channel, a detail that comes back. A real model has $D$ channels of width (the hidden dimension), and Mamba runs $D$ of these SSMs in parallel, one per channel, each with its own $N$ -dimensional state. So the actual hidden state is a $D \times N$ grid, not a single $N$ -vector. Keep $N$ small (Mamba uses $N = 16$ ) and that grid stays inexpensive even though it is doing a lot of bookkeeping.

From continuous to discrete: the step Δ

Equation (1) is written in continuous time, with a derivative $h'(t)$ . Real data is a discrete sequence of tokens, so we need the discrete version: given the state at step $t-1$ , what is the state at step $t$ ? Turning a continuous dynamical system into a discrete step is a standard move called discretization, and it introduces the single most important number in Mamba: a step size $\Delta$ .

$\Delta$ measures how much continuous time passes between two tokens. The natural way to integrate (1) over a slice of width $\Delta$ , assuming the input is held constant across the slice, is the zero-order hold (ZOH). The name is literal: hold the input flat, at order zero, for the duration of the step, then solve the ODE exactly over that slice. Under that one assumption the formula below is the exact solution, not an approximation, which is why ZOH is the standard discretization rather than one option among many. It gives discrete matrices $\bar{\mathbf{A}}, \bar{\mathbf{B}}$ that play the role of $\mathbf{A}, \mathbf{B}$ in a plain recurrence:

\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\big(\exp(\Delta\mathbf{A}) - \mathbf{I}\big)\cdot\Delta\mathbf{B}

(2)

with the recurrence and readout

h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t, \qquad y_t = \mathbf{C}\,h_t

(3)

The shape $h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t$ is exactly a gated RNN: keep some of the old state, write in some of the new input. The $\exp(\Delta\mathbf{A})$ term controls how much of the old state carries over. With $\mathbf{A}$ negative (Mamba parameterizes it as $\mathbf{A} = -\exp(\text{A\_log})$ , so it stays negative and the state decays rather than blows up), a larger $\Delta$ means more continuous time elapsed, so $\bar{\mathbf{A}} = \exp(\Delta\mathbf{A})$ is smaller, the old state decays harder, and the new input dominates. A smaller $\Delta$ means almost no time passed, $\bar{\mathbf{A}} \approx 1$ , and the state coasts along nearly unchanged. So $\Delta$ is a dial between reset to the current input and ignore the current input and persist.

The picture below makes $\Delta$ concrete. The smooth amber curve is the true continuous state; the teal dots are the discrete recurrence sampling it every $\Delta$ units. Small $\Delta$ samples often and hugs the curve; large $\Delta$ takes big coarse jumps. The discrete model only ever sees the dots.

Figure 2 · discretization

step ΔΔ = 0.50

The continuous state h(t) evolves smoothly; the discrete recurrence samples it at step size Δ, with decay ā = exp(Δa) per step. Small Δ tracks the curve; large Δ jumps coarsely and lets the new input dominate. Δ sets how much continuous time passes between tokens.

This is also the first place the paper and the code part ways. Equation (2) is the exact ZOH formula for $\bar{\mathbf{B}}$ , but the released kernel does not bother with the $(\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A})-\mathbf{I})$ term; it uses the first-order approximation $\bar{\mathbf{B}} \approx \Delta\mathbf{B}$ (a plain Euler step), keeping only $\bar{\mathbf{A}} = \exp(\Delta\mathbf{A})$ exact. The two agree to first order in $\Delta$ , the simpler form trains fine, and the Provenance panel flags it and cites the file.

Two faces: a recurrence and a convolution

The property that made these models trainable at all holds only when $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \mathbf{C}$ are the same at every step (the model is time-invariant). Unroll the recurrence (3) from a zero start. The output at position $t$ is a fixed weighted sum of the current and past inputs:

y_t = \mathbf{C}\bar{\mathbf{B}}\,x_t + \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}}\,x_{t-1} + \mathbf{C}\bar{\mathbf{A}}^2\bar{\mathbf{B}}\,x_{t-2} + \cdots

Those coefficients do not depend on $t$ . They are one fixed sequence, a convolution kernel:

\bar{\mathbf{K}} = \big(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}^2\bar{\mathbf{B}},\ \dots\big), \qquad y = x * \bar{\mathbf{K}}

(4)

So the very same model has two faces. As a recurrence you step through it one token at a time, carrying the state, which is perfect for generating tokens one by one. As a convolution you apply that single fixed kernel $\bar{\mathbf{K}}$ to the entire sequence at once, which (via the FFT, the Fast Fourier Transform, which computes a length-L convolution in L log L time across the whole sequence at once) is fast and parallel, perfect for training on a long sequence in one shot. Same numbers, two schedules. This duality allowed structured SSMs to train at scale.

Figure 3 · recurrence and convolution

view

One time-invariant SSM, two computations. The recurrent view carries a state left to right, one token at a time. The convolution view slides one fixed kernel K̄ over the entire sequence at once. Both give the same output. Toggle between them.

The convolution kernel shows where the limitation comes from. $\bar{\mathbf{K}}$ is fixed before the model sees any data. The weight on "the token 5 steps ago" is $\mathbf{C}\bar{\mathbf{A}}^5\bar{\mathbf{B}}$ whether that token was a crucial name or a throwaway "the." The model decays the past on a fixed schedule, independent of token content.

The flaw: one filter for every token

To expose it, the authors use a task built to defeat a fixed schedule: selective copying. The model sees a stream of tokens, most of them random filler, a few of them marked content, and it has to reproduce only the content tokens, in order, ignoring the filler. Because the filler sits at random positions, the model cannot just learn "copy positions 3, 7, and 11." It has to decide what to keep based on what each token is.

A time-invariant SSM cannot do this, and the fixed kernel is why. It applies the same decay to every token, so it either keeps too much (and the filler floods the state) or forgets too fast (and loses the content). It has no knob that looks at the token and says "this one, remember; that one, drop." The numbers are stark: a non-selective SSM gets 18.3% on selective copying. Bolt a content-blind gate onto the architecture and it climbs to 57.0%, still poor. The selective model gets 99.8%.

Figure 4 · selective copying

model

A stream where diamonds are content to remember and dots are filler to ignore. A time-invariant model uses one gate for every token and blurs content with filler. A selective model opens its gate on content and shuts it on filler, keeping the state clean. Toggle to compare.

This is the content-based reasoning the abstract says these models lacked. Attention has it built in, scoring each past token against the current one and weighting it accordingly; a fixed convolution kernel cannot. So the question is: how do you give a recurrence the ability to look at a token and choose, without giving up the linear-time cost that made it attractive?

Selection: let the input steer

Mamba does this directly: it makes the SSM parameters functions of the input. Instead of one fixed $\Delta, \mathbf{B}, \mathbf{C}$ for the entire sequence, compute a fresh $\Delta_t, \mathbf{B}_t, \mathbf{C}_t$ at every position from the token there, via small linear layers:

\mathbf{B}_t = s_B(x_t),\quad \mathbf{C}_t = s_C(x_t),\quad \Delta_t = \tau_\Delta\big(\text{param} + s_\Delta(x_t)\big)

(5)

where $s_B, s_C$ are linear maps into $\mathbb{R}^N$ , $s_\Delta$ is a linear map whose output is broadcast across the $D$ channels, and $\tau_\Delta = \text{softplus}$ keeps the step positive. The authors call this selective model S6. (One matrix stays fixed: $\mathbf{A}$ is still input-independent. It does not need to be selective, because it never acts alone: it only ever reaches the recurrence through the product $\Delta_t\mathbf{A}$ inside $\bar{\mathbf{A}} = \exp(\Delta_t \mathbf{A})$ , so once $\Delta$ is input-dependent the effective decay already changes token by token, and making $\mathbf{A}$ selective too would buy little at extra cost. The paper finds $\Delta$ is the main source of the improvement.)

The discretization changes once $\Delta$ becomes per-token. With $\Delta$ now a per-token quantity, $\bar{\mathbf{A}}_t = \exp(\Delta_t\mathbf{A})$ is a per-token decay. A token the model deems important gets a large $\Delta_t$ , which with negative $\mathbf{A}$ shrinks $\bar{\mathbf{A}}_t$ , so the stale state decays away and the new input dominates the write. A filler token gets a small $\Delta_t$ , $\bar{\mathbf{A}}_t \approx 1$ , and the state slips past the filler token untouched. $\Delta$ has become a learned, content-dependent forget gate.

This is not a loose analogy; the paper proves it. In the simplest case (state size $N=1$ , $\mathbf{A}=-1$ , $\mathbf{B}=1$ , with the softplus step), the selective recurrence collapses exactly to the classic gated update:

g_t = \sigma\big(\text{Linear}(x_t)\big), \qquad h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t

(6)

which is the forget gate at the heart of an LSTM or GRU. Mamba's selection is that gate, generalized to an $N$ -dimensional state and derived from the SSM rather than bolted on. Drag the gate below: push it toward persist and the state ignores new inputs and stays put; push it toward reset and the state snaps to each new input. Selection means the model learns this dial per token.

Figure 5 · Δ is a forget gate

gate biasmixed

The selective step Δ acts as a gate (eq 6). The amber staircase is the input; the teal curve is the state h. Toward persist, the state coasts and ignores inputs; toward reset, it snaps to each new input. The bottom bars show the gate at each step.

Three tokens through it show the gate at work. Feed the channel the sequence ["important", "filler", "important"], where the step is $\Delta_t = \text{softplus}(\text{param} + s_\Delta(x_t))$ , computed fresh from each token. The model has learned that "important"-shaped tokens deserve a large step and filler a small one, so it produces a large $\Delta_1$ , a small $\Delta_2$ , and a large $\Delta_3$ . Token one: the large $\Delta_1$ makes $\bar{\mathbf{A}}_1 = \exp(\Delta_1\mathbf{A})$ small, so whatever the state held decays toward nothing and the write term $\bar{\mathbf{B}}_1 x_1$ stamps the first important token into the state, the gate flung open. Token two: the small $\Delta_2$ makes $\bar{\mathbf{A}}_2 \approx 1$ and the write term tiny, so the state glides past the filler almost untouched, still carrying token one, the gate shut. Token three: the large $\Delta_3$ opens the gate again and writes the second important token in. The state that comes out the far end carries the two important tokens and never spent any of its fixed $N$ slots on the filler, which is exactly the discrimination a fixed convolution kernel could not make, because its decay was set in advance, blind to which token was which.

The cost of this is that the model is no longer time-invariant. Every position now has its own $\bar{\mathbf{A}}_t, \bar{\mathbf{B}}_t, \mathbf{C}_t$ , so there is no single kernel to slide. The convolution face is gone. We are back to a genuine recurrence, and a naive recurrence is slow to train on a GPU. If the story stopped here, selection would be a good idea you could not afford. Making it affordable is the remaining problem.

Running it fast: the scan

The recurrence $h_t = \bar{\mathbf{A}}_t\,h_{t-1} + \bar{\mathbf{B}}_t\,x_t$ looks irreducibly sequential, each step waiting on the one before. But it is a chain of affine maps (scale by $\bar{\mathbf{A}}_t$ , add $\bar{\mathbf{B}}_t x_t$ ), and composing affine maps is associative. That single fact makes possible a parallel scan: instead of stepping through $L$ times, combine the maps pairwise up a tree in $\log_2 L$ rounds, with every pair in a round done at once.

Associativity is the entire license here, so it pays to be precise about it. Each step is an affine map, scale by $\bar{\mathbf{A}}_t$ and add $\bar{\mathbf{B}}_t x_t$ , and the composition of two affine maps is itself an affine map of the same form, with its own scale and offset. To get the state at position $t$ you compose the first $t$ of these maps and apply the result to the initial state. Composing a chain of maps in order is a prefix computation, and a prefix computation over an associative operation does not care how you bracket it: you can fold the left half and the right half independently and then join them, because $(f \circ g) \circ h$ equals $f \circ (g \circ h)$ . That freedom to re-bracket allows a length- $L$ chain to be evaluated as a balanced binary tree in $\log_2 L$ rounds of independent, parallel joins instead of $L$ joins one after another. An ordinary RNN has no such freedom: its update $h_t = \tanh(W h_{t-1} + U x_t)$ wraps a nonlinearity around the recurrence, two such steps do not compose into one step of the same form, there is no associative join to re-bracket, and the chain stays stubbornly sequential. Mamba keeps its recurrence affine precisely so this door stays open.

Figure 6 · the parallel scan

schedule

The recurrence is a chain of affine maps, and composing them is associative. The sequential view steps through all L positions in order. The parallel scan combines them pairwise up a tree in log₂L rounds, with a whole round done at once. Toggle to compare.

The scan handles the parallelism; the other half of the speed is memory traffic, and that half is pure systems work. The selective SSM's expanded state is a tensor of shape $(B, L, D, N)$ , big enough that writing it to the GPU's slow main memory (HBM) and reading it back would dominate the runtime. Mamba never does. The kernel is fused: it loads the small inputs $(\Delta, \mathbf{B}, \mathbf{C})$ into fast on-chip SRAM (the GPU's small, fast scratchpad memory), performs the discretization and the scan there, and writes back only the final outputs, never the giant state. And for the backward pass it recomputes the intermediate states on the fly rather than storing them. The result, the authors report, is the same memory footprint as a well-optimized attention layer with FlashAttention, and a scan that runs far faster than a naive sequential loop. The three tricks together are kernel fusion, the parallel scan, and recomputation.

The selective recurrence for one channel, unrolled so the symbols become code:

# the selective SSM, one channel, unrolled (B,L,N math elided)
# inputs for this step: x_t  (a token's value on this channel)
A   = -exp(A_log)              # (N,)  per-channel, learned, stays negative
dt  = softplus(dt_bias + s_dt(x_t))   # (1,)  the step Δ, from the input
B_t = s_B(x_t)                 # (N,)  input-dependent  (was fixed in S4)
C_t = s_C(x_t)                 # (N,)  input-dependent
Abar = exp(dt * A)             # (N,)  discretize: decay this step
Bbar = dt * B_t                # (N,)  Euler form the code ships
h    = Abar * h + Bbar * x_t   # (N,)  carry the state forward
y_t  = (C_t * h).sum()         # scalar output for this channel, this token
# y_t then gets the residual skip (+ x_t * D) and the SiLU gate

The Mamba block, end to end

The selective SSM does the core computation; the Mamba block is the surrounding layer that puts it to use, and the design goal was to make one simple block you can stack, with no separate attention or MLP layers. It fuses the previous SSM layer (called H3) with a gated MLP into a single repeated unit.

The block does five things, in order:

Expand and split. A linear in_proj widens the input from $D$ to $E\cdot D$ (expansion $E = 2$ ) and splits it into two equal branches.
Mix locally. The main branch runs a short causal Conv1d (width 4) so each position sees a few of its neighbors, then a SiLU (Sigmoid Linear Unit, x·σ(x), a smooth ReLU-like activation) activation.
Select. The selective SSM runs on the main branch. Its $\Delta, \mathbf{B}, \mathbf{C}$ are computed from the branch (eq 5), so content-based memory happens here.
Gate. The second branch is a plain SiLU, and the two branches multiply elementwise: $y \cdot \text{silu}(z)$ . This is the gated-MLP half folded in.
Project back. A linear out_proj brings the width back to $D$ , and a residual connection wraps the full block.

Figure 7 · the Mamba block

One repeated block, no attention. in_proj expands and splits into an SSM branch (Conv1d, SiLU, selective SSM) and a SiLU gate. The branches multiply, then out_proj returns to width D, inside a residual.

The shapes make it concrete. Take one block at hidden width $D = 1024$ , sequence length $L = 2048$ , batch 1, the defaults from the released code: $N = 16$ state slots, conv width 4, expansion $E = 2$ . Some quantities are per-channel and some are shared. $\mathbf{A}$ and $\Delta$ are per channel (there are $E\cdot D = 2048$ independent SSMs, each with its own 16-slot state), while $\mathbf{B}$ and $\mathbf{C}$ are shared across channels (shape $(B, L, N)$ ), broadcast over the 2048 channels. The carried state never grows with $L$ :

# one Mamba block, batch B=1, length L=2048, width D=1024
in_proj :  (1, 2048, 1024)  ->  (1, 2048, 4096)   # expand E=2, then split
x, z    :  each (1, 2048, 2048)                    # SSM branch / gate branch
conv1d  :  (1, 2048, 2048)  width-4 causal, then SiLU
# selective SSM parameters, produced from x by a single linear (x_proj):
dt      :  (1, 2048, 2048)        # Δ, per channel  (dt_rank=64 -> 2048)
B, C    :  each (1, 2048, 16)      # shared across the 2048 channels
A       :  (2048, 16)             # per channel, N=16 state slots
state h :  (1, 2048, 16)          # carried, never grows with L
y       :  (1, 2048, 2048)  =  y * silu(z)         # gate
out_proj:  (1, 2048, 2048)  ->  (1, 2048, 1024)    # back to D

Most of the parameters live in the projections, about $3 E D^2 = 6 D^2$ per block, with the SSM itself adding little. So Mamba is, in parameter terms, mostly a stack of big linear layers with a lightweight, selective recurrence threaded between them. Stack the block, add an embedding and a final norm, and that is the model. No attention, no MLP blocks, no positional encodings (the recurrence is ordered by construction).

How it does across modalities

On language, Mamba is the first attention-free model to keep pace with a strong modern Transformer recipe (the LLaMA-style "Transformer++") at matched compute. The headline: Mamba-3B matches Transformers twice its size, both in pretraining loss and downstream evaluation, and beats same-size baselines like Pythia by a few points on common-sense reasoning. It does this while reading the sequence in linear time and generating with no KV cache, at 4–5× the throughput.

The synthetic tasks are the clearest evidence that selection accounts for the improvement. On selective copying, selection takes the model from 18.3% to 99.8%. On induction heads, the canonical in-context-recall task (having seen the pair “A then B” earlier in the sequence, output B the next time A appears), Mamba reaches near-perfect accuracy and then does something the others cannot: trained on length 256, it extrapolates to sequences of a million tokens, 4000× longer than anything it saw in training. A fixed schedule cannot do that. A content-driven gate can, because the rule it learned ("remember this kind of token") does not depend on the length.

Selection generalizes past language too, which is the real argument for it as a backbone. On raw audio waveforms (the SC09 generation benchmark) a small (~6M-parameter) Mamba reaches an FID of 0.94 (Fréchet Inception Distance, a generation-quality score where lower is better) against 1.99 for a comparable-size SaShiMi — already beating larger GAN and diffusion baselines (WaveGAN 19M at 2.03, DiffWave 24M at 1.92); a larger 24M Mamba reaches 0.67, the best on the benchmark. On DNA, where the alphabet is four letters and the useful context can be enormous, Mamba keeps improving as the sequence grows toward a million bases and matches strong baselines at 3–4× fewer parameters.

Mamba gives ground in clear places. A fixed-size state is a compression, so on tasks that need exact recall of an arbitrary past token, attention's full cache still has an edge, and hybrids that sprinkle a few attention layers among Mamba blocks tend to win on those. Selection also gives up the convolution view, and with it the stock FFT: the speed instead depends on a custom CUDA kernel, which is more to maintain. And the architecture is newer than the Transformer, with less of the tooling and folklore that makes a thing easy to train at the largest scales.

A structured SSM is a recurrence that reads a sequence in linear time and generates with no cache, and while it stays time-invariant it doubles as a convolution you can train fast. But time-invariance prevents content-dependent memory: the decay is fixed regardless of the token. Mamba makes the gate depend on the token, trading the convolution away for content-based memory, and a careful scan pays back the speed that trade would have cost. The result reads everything cheaply and keeps only what matters, which is most of what attention was buying at quadratic cost.

Provenance Verified against primary literature

Mamba (2023)Gu & Dao: selective SSMs (S6), the Δ/B/C input-dependence, the hardware-aware scan, and the architecture.

mamba_simple.py (code)Official block: d_state=16, d_conv=4, expand=2, dt_rank=⌈d_model/16⌉, A = −exp(A_log), SiLU gate, out = y·silu(z).

selective_scan_ref (code)The reference scan uses the Euler step B̄ = Δ·B, not the paper’s zero-order-hold B̄. See the flag below.

S4 / S4D (2022)Gu et al.: the structured SSM, the convolution view, and the diagonal A initialization Mamba inherits.

HiPPO (2020)Gu et al.: the original argument that an SSM state can optimally compress a sequence’s history.

correctionThe paper writes B̄ with the full zero-order-hold formula B̄ = (ΔA)⁻¹(exp(ΔA) − I)·ΔB (eq 4). The released selective_scan uses the first-order Euler form B̄ = Δ·B (deltaB_u = einsum(delta, B, u) in selective_scan_ref). Ā = exp(ΔA) is the same in both. We teach the ZOH form and flag that the code ships the cheaper Euler approximation.

Questions you might still have

If the state is a fixed size, doesn’t Mamba forget the distant past?
It compresses, the way an RNN does. Selection is meant to make the compression smart: the gate keeps the few tokens that matter and drops filler, so the fixed-size state spends its room on signal. On synthetic recall it extrapolates to a million tokens; on tasks needing exact lookup of arbitrary past tokens, attention’s full cache still has an edge.

Why does making B, C, Δ depend on the input kill the convolution?
A convolution works only because the kernel is the same at every position (time-invariant). Once Δ, B, C change with the token, every position has its own dynamics, so there is no single kernel to slide. You are back to a recurrence, which is why the scan matters.

Is the selective scan actually fast, given it cannot use the FFT?
Yes, because it never materializes the big (B, L, D, N) state in slow memory. The kernel loads inputs into SRAM, runs the scan and discretization there, writes only the outputs back, and recomputes the states in the backward pass. Same memory profile as FlashAttention, and the scan itself runs far faster than a naive sequential loop.

Where did the convolution and SiLU in the block come from?
They are not part of the SSM math. The short causal Conv1d mixes a few neighboring tokens before the SSM sees them, and the SiLU gate is the gated-MLP half of the block folded in. Mamba is one repeated block that fuses the old H3 SSM layer and a gated MLP, with no separate attention or MLP blocks.

Footnotes & further reading

The paper: Gu, Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Carnegie Mellon & Princeton, COLM 2024). Code.
The structured SSM Mamba builds on, with the convolution view and the diagonal initialization: Gu, Goel, Ré, Efficiently Modeling Long Sequences with Structured State Spaces (S4), and Gu et al., On the Parameterization and Initialization of Diagonal State Space Models (S4D).
The argument that an SSM state can optimally compress history: Gu et al., HiPPO: Recurrent Memory with Optimal Polynomial Projections.
The IO-aware kernel idea Mamba's scan mirrors: Dao et al., FlashAttention.
The block fuses the SSM layer from Fu et al., Hungry Hungry Hippos (H3), with a gated MLP in the SwiGLU family.
The reference selective scan and block defaults are in the official repo, mamba_simple.py and selective_scan_interface.py.

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.