VerifiedarXiv:2312.0075224 min
Architecture · Sequence models

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A model that chooses what to remember.

Attention reads the whole past at once and pays for it, every token against every other. Mamba carries a fixed-size running summary instead, and adds the one thing earlier recurrences lacked: it lets the current token decide what the summary keeps.

Explaining the paperMamba: Linear-Time Sequence Modeling with Selective State SpacesGu, Dao · CMU & Princeton · COLM 2024 · arXiv:2312.00752

What if a sequence model could read a million tokens, remember the few that mattered, and never build a cache?

The Transformer won by reading everything. At every step, attention compares the current token against all the tokens before it, weighs them, and mixes them in. Nothing gets compressed and nothing gets forgotten on the way, which is exactly why it is so good. It is also why it is expensive. Reading a sequence of length LL means roughly L2L^2 comparisons, and at generation time the model keeps a growing cache of every past token's keys and values so it can attend to them again. Double the context and you quadruple the attention work and double the cache. For a paragraph this is fine. For a book, a genome, or an hour of audio it is a wall.

Mamba, from Albert Gu and Tri Dao, goes back to an older and cheaper idea: a recurrence. Walk through the sequence once, left to right, carrying a small fixed-size summary of everything seen so far. Cost grows with LL, not L2L^2, and at generation time there is nothing to cache because the whole past lives in that one summary. Recurrences are not new, and for years they lost to attention on language for one specific reason: a plain recurrence treats every token the same way, so it cannot choose to remember a name and forget a comma. Mamba's fix is a single idea with a precise mechanism. Let the summary's update rule depend on the token it is reading. The model learns to fling its memory gate open on the tokens that matter and slam it shut on the ones that do not.

That one change, plus the engineering to make it run fast on a GPU, is the whole paper. To see why it works we build a small tower: what a state space model is, how a continuous one becomes a discrete recurrence, why that recurrence is secretly also a convolution, what breaks when you make it selective, and how a careful scan puts the speed back. None of the pieces is hard alone.

The bill attention runs up

Start with the cost, because it is the thing Mamba is trying to dodge. Two ways to read a sequence, side by side. On the left, attention: token ii looks back at every token jij \le i, so the work it does is the area of a triangle, and that area grows like L2L^2. On the right, a recurrence: one fixed-size state slides down the sequence, and each token does the same bounded amount of work to fold itself in, so the total grows like LL.

Figure 1 · two ways to read a sequence
L = 12
Attention fills a triangle of token-to-token comparisons, so its work grows like L². A recurrence carries one fixed-size state down a single column, doing the same work per token, so its work grows like L. Drag L and watch the gap open.

The recurrence has a second prize that matters even more in deployment. To generate the next token, attention has to keep a KV cache, the keys and values of every token so far, and that cache grows without bound. A recurrence keeps one state of fixed size no matter how long the sequence gets, so per-token generation is constant time and constant memory. That is where Mamba's reported 4–5× higher inference throughput comes from: no cache means you can run far larger batches on the same hardware.

So why did the field not just use recurrences? Because the obvious recurrence, a plain RNN, is slow to train (you cannot parallelize a sequential loop across the sequence) and tends to forget. The work Mamba builds on solved the first problem. Mamba solves the second. We need the machinery first.

A state space model: one running summary

The recurrence Mamba uses is a state space model (SSM), an idea borrowed from control theory. Picture a small hidden state h(t)RNh(t) \in \mathbb{R}^N, a vector of NN numbers, that summarizes the input stream up to time tt. As a new input x(t)x(t) arrives, the state evolves by a linear rule, and the output y(t)y(t) is read off from the state:

h(t)=Ah(t)+Bx(t),y(t)=Ch(t)h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t)(1)

Three matrices run the show. ARN×N\mathbf{A} \in \mathbb{R}^{N\times N} is the state transition: how the summary decays and reshuffles itself on its own, ignoring new input. BRN×1\mathbf{B} \in \mathbb{R}^{N\times 1} is how a new input gets written into the state. CR1×N\mathbf{C} \in \mathbb{R}^{1\times N} is how the state gets read out into an answer. The state hh is the running summary, A\mathbf{A} decides how long memories linger, and B,C\mathbf{B}, \mathbf{C} are the write and read heads. (Earlier SSM work, the HiPPO line, showed you can choose A\mathbf{A} so the state is a near-optimal compression of the whole history. Mamba inherits a simple diagonal version of that.)

One thing to hold onto, because it comes back. This equation is for a single scalar input channel. A real model has DD channels of width (the hidden dimension), and Mamba runs DD of these SSMs in parallel, one per channel, each with its own NN-dimensional state. So the actual hidden state is a D×ND \times N grid, not a single NN-vector. Keep NN small (Mamba uses N=16N = 16) and that grid stays cheap even though it is doing a lot of bookkeeping.

From continuous to discrete: the step Δ

Equation (1) is written in continuous time, with a derivative h(t)h'(t). Real data is a discrete sequence of tokens, so we need the discrete version: given the state at step t1t-1, what is the state at step tt? Turning a continuous dynamical system into a discrete step is a standard move called discretization, and it introduces the single most important number in Mamba: a step size Δ\Delta.

Think of Δ\Delta as how much continuous time passes between two tokens. The clean way to integrate (1) over a slice of width Δ\Delta, assuming the input is held constant across the slice, is the zero-order hold (ZOH). It gives discrete matrices Aˉ,Bˉ\bar{\mathbf{A}}, \bar{\mathbf{B}} that play the role of A,B\mathbf{A}, \mathbf{B} in a plain recurrence:

Aˉ=exp(ΔA),Bˉ=(ΔA)1(exp(ΔA)I)ΔB\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\big(\exp(\Delta\mathbf{A}) - \mathbf{I}\big)\cdot\Delta\mathbf{B}(2)

with the recurrence and readout

ht=Aˉht1+Bˉxt,yt=Chth_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t, \qquad y_t = \mathbf{C}\,h_t(3)

The shape ht=Aˉht1+Bˉxth_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t is exactly a gated RNN: keep some of the old state, write in some of the new input. The exp(ΔA)\exp(\Delta\mathbf{A}) is doing the work. With A\mathbf{A} negative (Mamba parameterizes it as A=exp(A_log)\mathbf{A} = -\exp(\text{A\_log}), so it stays negative and the state decays rather than blows up), a larger Δ\Delta means more continuous time elapsed, so Aˉ=exp(ΔA)\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}) is smaller, the old state decays harder, and the new input dominates. A smaller Δ\Delta means almost no time passed, Aˉ1\bar{\mathbf{A}} \approx 1, and the state coasts along nearly unchanged. So Δ\Delta is a dial between reset to the current input and ignore the current input and persist. File that away.

The picture below makes Δ\Delta concrete. The smooth amber curve is the true continuous state; the teal dots are the discrete recurrence sampling it every Δ\Delta units. Small Δ\Delta samples often and hugs the curve; large Δ\Delta takes big coarse jumps. The discrete model only ever sees the dots.

Figure 2 · discretization
Δ = 0.50
The continuous state h(t) evolves smoothly; the discrete recurrence samples it at step size Δ, with decay ā = exp(Δa) per step. Small Δ tracks the curve; large Δ jumps coarsely and lets the new input dominate. Δ is the model's clock.

One honest aside, and the first place the paper and the code part ways. Equation (2) is the exact ZOH formula for Bˉ\bar{\mathbf{B}}. The released kernel does not bother with the (ΔA)1(exp(ΔA)I)(\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A})-\mathbf{I}) term; it uses the first-order approximation BˉΔB\bar{\mathbf{B}} \approx \Delta\mathbf{B} (a plain Euler step), keeping only Aˉ=exp(ΔA)\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}) exact. The two agree to first order in Δ\Delta, and the simpler form trains fine. We flag it in the Provenance panel and cite the file.

Two faces: a recurrence and a convolution

Here is the property that made these models trainable at all, and it holds only when Aˉ,Bˉ,C\bar{\mathbf{A}}, \bar{\mathbf{B}}, \mathbf{C} are the same at every step (the model is time-invariant). Unroll the recurrence (3) from a zero start. The output at position tt is a fixed weighted sum of the current and past inputs:

yt=CBˉxt+CAˉBˉxt1+CAˉ2Bˉxt2+y_t = \mathbf{C}\bar{\mathbf{B}}\,x_t + \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}}\,x_{t-1} + \mathbf{C}\bar{\mathbf{A}}^2\bar{\mathbf{B}}\,x_{t-2} + \cdots

Those coefficients do not depend on tt. They are one fixed sequence, a convolution kernel:

Kˉ=(CBˉ, CAˉBˉ, CAˉ2Bˉ, ),y=xKˉ\bar{\mathbf{K}} = \big(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}^2\bar{\mathbf{B}},\ \dots\big), \qquad y = x * \bar{\mathbf{K}}(4)

So the very same model has two faces. As a recurrence you step through it one token at a time, carrying the state, which is perfect for generating tokens one by one. As a convolution you apply that single fixed kernel Kˉ\bar{\mathbf{K}} to the whole sequence at once, which (via the FFT) is fast and parallel, perfect for training on a long sequence in one shot. Same numbers, two schedules. This duality is what let structured SSMs train at scale.

Figure 3 · recurrence and convolution
One time-invariant SSM, two computations. The recurrent view carries a state left to right, one token at a time. The convolution view slides one fixed kernel K̄ over the whole sequence at once. Both give the same output. Toggle between them.

Read the kernel and you can feel the limitation coming. Kˉ\bar{\mathbf{K}} is fixed before the model sees any data. The weight on "the token 5 steps ago" is CAˉ5Bˉ\mathbf{C}\bar{\mathbf{A}}^5\bar{\mathbf{B}} whether that token was a crucial name or a throwaway "the." The model decays the past on a schedule it set in advance, blind to content. That is the flaw.

The flaw: one filter for every token

To expose it, the authors use a deliberately mean little task: selective copying. The model sees a stream of tokens, most of them random filler, a few of them marked content, and it has to reproduce only the content tokens, in order, ignoring the filler. The catch is that the filler sits at random positions, so the model cannot just learn "copy positions 3, 7, and 11." It has to decide what to keep based on what each token is.

A time-invariant SSM cannot do this, and the reason is the fixed kernel. It applies the same decay to every token, so it either keeps too much (and the filler floods the state) or forgets too fast (and loses the content). It has no knob that looks at the token and says "this one, remember; that one, drop." The numbers are stark: a non-selective SSM gets 18.3% on selective copying. Bolt a content-blind gate onto the architecture and it climbs to 57.0%, still poor. The selective model gets 99.8%.

Figure 4 · selective copying
A stream where diamonds are content to remember and dots are filler to ignore. A time-invariant model uses one gate for every token and blurs content with filler. A selective model opens its gate on content and shuts it on filler, keeping the state clean. Toggle to compare.

This is the content-based reasoning the abstract says these models lacked. Attention has it for free: it scores each past token against the current one and weights it accordingly. A fixed convolution kernel cannot. So the question becomes sharp. How do you give a recurrence the ability to look at a token and choose, without giving up the linear-time cost that made it attractive?

Selection: let the input steer

The answer is direct. Make the SSM parameters functions of the input. Instead of one fixed Δ,B,C\Delta, \mathbf{B}, \mathbf{C} for the whole sequence, compute a fresh Δt,Bt,Ct\Delta_t, \mathbf{B}_t, \mathbf{C}_t at every position from the token there, via small linear layers:

Bt=sB(xt),Ct=sC(xt),Δt=τΔ(param+sΔ(xt))\mathbf{B}_t = s_B(x_t),\quad \mathbf{C}_t = s_C(x_t),\quad \Delta_t = \tau_\Delta\big(\text{param} + s_\Delta(x_t)\big)(5)

where sB,sCs_B, s_C are linear maps into RN\mathbb{R}^N, sΔs_\Delta is a linear map whose output is broadcast across the DD channels, and τΔ=softplus\tau_\Delta = \text{softplus} keeps the step positive. The authors call this selective model S6. (One matrix stays fixed: A\mathbf{A} is still input-independent. It does not need to be selective, because Δt\Delta_t already multiplies it inside Aˉ=exp(ΔtA)\bar{\mathbf{A}} = \exp(\Delta_t \mathbf{A}), so steering Δ\Delta already steers the effective decay. The paper finds Δ\Delta is the main source of the improvement.)

Now look back at the discretization. With Δ\Delta now a per-token quantity, Aˉt=exp(ΔtA)\bar{\mathbf{A}}_t = \exp(\Delta_t\mathbf{A}) is a per-token decay. A token the model deems important gets a large Δt\Delta_t, which (with negative A\mathbf{A}) shrinks Aˉt\bar{\mathbf{A}}_t, wipes the stale state, and writes the new input in hard. A filler token gets a small Δt\Delta_t, Aˉt1\bar{\mathbf{A}}_t \approx 1, and the state coasts past it untouched. Δ\Delta has become a learned, content-dependent forget gate.

This is not a loose analogy; the paper proves it. In the simplest case (state size N=1N=1, A=1\mathbf{A}=-1, B=1\mathbf{B}=1, with the softplus step), the selective recurrence collapses exactly to the classic gated update:

gt=σ(Linear(xt)),ht=(1gt)ht1+gtxtg_t = \sigma\big(\text{Linear}(x_t)\big), \qquad h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t(6)

which is the forget gate at the heart of an LSTM or GRU. Mamba's selection is that gate, generalized to an NN-dimensional state and derived from the SSM rather than bolted on. Drag the gate below: push it toward persist and the state ignores new inputs and coasts; push it toward reset and the state snaps to each new input. Selection means the model learns this dial per token.

Figure 5 · Δ is a forget gate
mixed
The selective step Δ acts as a gate (eq 6). The amber staircase is the input; the teal curve is the state h. Toward persist, the state coasts and ignores inputs; toward reset, it snaps to each new input. The bottom bars show the gate at each step.

The cost of this is that the model is no longer time-invariant. Every position now has its own Aˉt,Bˉt,Ct\bar{\mathbf{A}}_t, \bar{\mathbf{B}}_t, \mathbf{C}_t, so there is no single kernel to slide. The convolution face is gone. We are back to a genuine recurrence, and a naive recurrence is slow to train on a GPU. If the story stopped here, selection would be a good idea you could not afford. The last piece is making it affordable.

Running it fast: the scan

The recurrence ht=Aˉtht1+Bˉtxth_t = \bar{\mathbf{A}}_t\,h_{t-1} + \bar{\mathbf{B}}_t\,x_t looks irreducibly sequential, each step waiting on the one before. But it is a chain of affine maps (scale by Aˉt\bar{\mathbf{A}}_t, add Bˉtxt\bar{\mathbf{B}}_t x_t), and composing affine maps is associative. That single fact unlocks a parallel scan: instead of stepping through LL times, combine the maps pairwise up a tree in log2L\log_2 L rounds, with every pair in a round done at once. The same prefix sums, computed in a fraction of the sequential depth.

Figure 6 · the parallel scan
The recurrence is a chain of affine maps, and composing them is associative. The sequential view steps through all L positions in order. The parallel scan combines them pairwise up a tree in log₂L rounds, with a whole round done at once. Toggle to compare.

The scan handles parallelism. The other half of the speed is about memory traffic, and it is pure systems work. The selective SSM's expanded state is a tensor of shape (B,L,D,N)(B, L, D, N), big enough that writing it to the GPU's slow main memory (HBM) and reading it back would dominate the runtime. Mamba never does. The kernel is fused: it loads the small inputs (Δ,B,C)(\Delta, \mathbf{B}, \mathbf{C}) into fast on-chip SRAM, performs the discretization and the scan there, and writes back only the final outputs, never the giant state. And for the backward pass it recomputes the intermediate states on the fly rather than storing them. The result, the authors report, is the same memory footprint as a well-optimized attention layer with FlashAttention, and a scan that runs far faster than a naive sequential loop. The three tricks together are kernel fusion, the parallel scan, and recomputation.

Here is the selective recurrence for one channel, unrolled, so the symbols become code:

# the selective SSM, one channel, unrolled (B,L,N math elided)
# inputs for this step: x_t  (a token's value on this channel)
A   = -exp(A_log)              # (N,)  per-channel, learned, stays negative
dt  = softplus(dt_bias + s_dt(x_t))   # (1,)  the step Δ, from the input
B_t = s_B(x_t)                 # (N,)  input-dependent  (was fixed in S4)
C_t = s_C(x_t)                 # (N,)  input-dependent
Abar = exp(dt * A)             # (N,)  discretize: decay this step
Bbar = dt * B_t                # (N,)  Euler form the code ships
h    = Abar * h + Bbar * x_t   # (N,)  carry the state forward
y_t  = (C_t * h).sum()         # scalar output for this channel, this token
# y_t then gets the residual skip (+ x_t * D) and the SiLU gate

The Mamba block, end to end

The selective SSM is the engine. The Mamba block is the car built around it, and the design goal was to make one simple block you can stack, with no separate attention or MLP layers. It fuses the previous SSM layer (called H3) with a gated MLP into a single repeated unit.

The block does five things, in order:

Figure 7 · the Mamba block
One repeated block, no attention. in_proj expands and splits into an SSM branch (Conv1d, SiLU, selective SSM) and a SiLU gate. The branches multiply, then out_proj returns to width D, inside a residual.

Make it concrete with shapes. Take one block at hidden width D=1024D = 1024, sequence length L=2048L = 2048, batch 1, the defaults from the released code: N=16N = 16 state slots, conv width 4, expansion E=2E = 2. Note who is per-channel and who is shared. A\mathbf{A} and Δ\Delta are per channel (there are ED=2048E\cdot D = 2048 independent SSMs, each with its own 16-slot state), while B\mathbf{B} and C\mathbf{C} are shared across channels (shape (B,L,N)(B, L, N)), broadcast over the 2048 channels. The carried state never grows with LL:

# one Mamba block, batch B=1, length L=2048, width D=1024
in_proj :  (1, 2048, 1024)  ->  (1, 2048, 4096)   # expand E=2, then split
x, z    :  each (1, 2048, 2048)                    # SSM branch / gate branch
conv1d  :  (1, 2048, 2048)  width-4 causal, then SiLU
# selective SSM parameters, produced from x by a single linear (x_proj):
dt      :  (1, 2048, 2048)        # Δ, per channel  (dt_rank=64 -> 2048)
B, C    :  each (1, 2048, 16)      # shared across the 2048 channels
A       :  (2048, 16)             # per channel, N=16 state slots
state h :  (1, 2048, 16)          # carried, never grows with L
y       :  (1, 2048, 2048)  =  y * silu(z)         # gate
out_proj:  (1, 2048, 2048)  ->  (1, 2048, 1024)    # back to D

Most of the parameters live in the projections, about 3ED2=6D23 E D^2 = 6 D^2 per block, with the SSM itself adding little. So Mamba is, in parameter terms, mostly a stack of big linear layers with a cheap, selective recurrence threaded between them. Stack the block, add an embedding and a final norm, and that is the model. No attention, no MLP blocks, no positional encodings (the recurrence is ordered by construction).

So what does it actually do

On language, Mamba is the first attention-free model to keep pace with a strong modern Transformer recipe (the LLaMA-style "Transformer++") at matched compute. The headline: Mamba-3B matches Transformers twice its size, both in pretraining loss and downstream evaluation, and beats same-size baselines like Pythia by a few points on common-sense reasoning. It does this while reading the sequence in linear time and generating with no KV cache, at 4–5× the throughput.

The synthetic tasks are the cleanest evidence that selection is doing the work. On selective copying, selection takes the model from 18.3% to 99.8%. On induction heads, the canonical in-context-recall task, Mamba reaches near-perfect accuracy and then does something the others cannot: trained on length 256, it extrapolates to sequences of a million tokens, 4000× longer than anything it saw in training. A fixed schedule cannot do that. A content-driven gate can, because the rule it learned ("remember this kind of token") does not depend on the length.

Selection generalizes past language too, which is the real argument for it as a backbone. On raw audio waveforms (the SC09 generation benchmark) a 24M-parameter Mamba reaches an FID of 0.67 against 1.42 for a comparable SaShiMi, beating much larger GAN and diffusion baselines. On DNA, where the alphabet is four letters and the useful context can be enormous, Mamba keeps improving as the sequence grows toward a million bases and matches strong baselines at 3–4× fewer parameters. The same block, the same selective scan, three modalities.

The limits are honest. A fixed-size state is a compression, so on tasks that need exact recall of an arbitrary past token, attention's full cache still has an edge, and hybrids that sprinkle a few attention layers among Mamba blocks tend to win on those. Selection also gives up the convolution view, so the speed depends on a custom CUDA kernel rather than a stock FFT, which is more to maintain. And the architecture is newer than the Transformer, with less of the tooling and folklore that makes a thing easy to train at the largest scales.

Step back and the argument is four facts long. A recurrence reads a sequence in linear time and generates with no cache. A structured SSM is such a recurrence, and when it is time-invariant it is also a convolution you can train fast. Making it time-invariant is exactly what stops it from choosing what to remember, so Mamba makes the gate depend on the token, which costs the convolution but buys content-based memory. A careful scan puts the speed back. The result is a model that reads everything cheaply and keeps only what matters, which turns out to be most of what attention was buying at quadratic cost.

Provenance Verified against primary literature
Mamba (2023)Gu & Dao: selective SSMs (S6), the Δ/B/C input-dependence, the hardware-aware scan, and the architecture.
mamba_simple.py (code)Official block: d_state=16, d_conv=4, expand=2, dt_rank=⌈d_model/16⌉, A = −exp(A_log), SiLU gate, out = y·silu(z).
selective_scan_ref (code)The reference scan uses the Euler step B̄ = Δ·B, not the paper’s zero-order-hold B̄. See the flag below.
S4 / S4D (2022)Gu et al.: the structured SSM, the convolution view, and the diagonal A initialization Mamba inherits.
HiPPO (2020)Gu et al.: the original argument that an SSM state can optimally compress a sequence’s history.
correctionThe paper writes B̄ with the full zero-order-hold formula B̄ = (ΔA)⁻¹(exp(ΔA) − I)·ΔB (eq 4). The released selective_scan uses the first-order Euler form B̄ = Δ·B (deltaB_u = einsum(delta, B, u) in selective_scan_ref). Ā = exp(ΔA) is the same in both. We teach the ZOH form and flag that the code ships the cheaper Euler approximation.

Questions you might still have

?

If the state is a fixed size, doesn’t Mamba forget the distant past?
It compresses, the way an RNN does. The bet is that selection makes the compression smart: the gate keeps the few tokens that matter and drops filler, so the fixed-size state spends its room on signal. On synthetic recall it extrapolates to a million tokens; on tasks needing exact lookup of arbitrary past tokens, attention’s full cache still has an edge.

?

Why does making B, C, Δ depend on the input kill the convolution?
A convolution works only because the kernel is the same at every position (time-invariant). Once Δ, B, C change with the token, every position has its own dynamics, so there is no single kernel to slide. You are back to a recurrence, which is why the scan matters.

?

Is the selective scan actually fast, given it cannot use the FFT?
Yes, because it never materializes the big (B, L, D, N) state in slow memory. The kernel loads inputs into SRAM, runs the scan and discretization there, writes only the outputs back, and recomputes the states in the backward pass. Same memory profile as FlashAttention, and the scan itself runs far faster than a naive sequential loop.

?

Where did the convolution and SiLU in the block come from?
They are not part of the SSM math. The short causal Conv1d mixes a few neighboring tokens before the SSM sees them, and the SiLU gate is the gated-MLP half of the block folded in. Mamba is one repeated block that fuses the old H3 SSM layer and a gated MLP, with no separate attention or MLP blocks.

Footnotes & further reading

  1. The paper: Gu, Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Carnegie Mellon & Princeton, COLM 2024). Code.
  2. The structured SSM Mamba builds on, with the convolution view and the diagonal initialization: Gu, Goel, Ré, Efficiently Modeling Long Sequences with Structured State Spaces (S4), and Gu et al., On the Parameterization and Initialization of Diagonal State Space Models (S4D).
  3. The argument that an SSM state can optimally compress history: Gu et al., HiPPO: Recurrent Memory with Optimal Polynomial Projections.
  4. The IO-aware kernel idea Mamba's scan mirrors: Dao et al., FlashAttention.
  5. The block fuses the SSM layer from Fu et al., Hungry Hungry Hippos (H3), with a gated MLP in the SwiGLU family.
  6. The reference selective scan and block defaults are in the official repo, mamba_simple.py and selective_scan_interface.py.