"Mamba vs Transformer: State Space Models Aren't Just Hype"

The Transformer has been the dominant architecture for language models since 2017. Every major model — GPT, Gemini, Claude, Llama — runs on attention. But attention has a fundamental scaling problem that everyone building large models knows about: quadratic compute with sequence length.

Mamba and other state space models (SSMs) start from a different mathematical foundation. The question is whether that foundation is actually better or just different.

## The Attention Bottleneck

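Transformer attention computes pairwise relationships between every token and every other token. That's O(n²) compute in sequence length n, and O(n²) memory as well if the full score matrix is materialized. Fused kernels such as FlashAttention avoid storing that matrix, but the compute stays quadratic.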
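
To make the quadratic growth concrete, here is a minimal sketch that counts query-key scores for a single attention head. The function name `attention_score_count` is illustrative and not taken from any library, and the count ignores heads, layers, and the per-score cost.

```python
def attention_score_count(n: int, causal: bool = True) -> int:
    """Number of query-key scores one attention head computes over n tokens."""
    if causal:
        # Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n.
        return n * (n + 1) // 2
    # Bidirectional attention scores every token against every token.
    return n * n


for n in (4_000, 100_000):
    print(f"{n:,} tokens -> {attention_score_count(n):,} scores")
```

Going from 4,000 to 100,000 tokens is a 25x increase in length, but the causal score count rises from 8,002,000 to 5,000,050,000, roughly 625x.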

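For short sequences (under 4K tokens), this is manageable. At 100K tokens, a length that long-context and retrieval-heavy (RAG) workloads now reach, attention becomes the dominant compute cost. Memory is a separate pressure at inference time: the KV cache grows linearly, not quadratically, with context, but for a 70B-class model at 128K context it can still run to tens of gigabytes per sequence on top of the weights, which is more than many inference GPUs have to spare.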
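
The KV cache figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a hypothetical 70B-class configuration (80 layers, head dimension 128, 16-bit cache values) and compares 8 KV heads (grouped-query attention) with 64 KV heads (full multi-head attention). These numbers are assumptions for illustration, not a statement about any specific model's deployment.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
CONTEXT = 128 * 1024  # 128K tokens

# Assumed configuration; only the number of KV heads differs between the two cases.
gqa = kv_cache_bytes(CONTEXT, layers=80, kv_heads=8, head_dim=128)
mha = kv_cache_bytes(CONTEXT, layers=80, kv_heads=64, head_dim=128)

print(f"grouped-query (8 KV heads): {gqa / GIB:.0f} GiB")   # 40 GiB
print(f"multi-head (64 KV heads):   {mha / GIB:.0f} GiB")   # 320 GiB
```

Under these assumptions a single 128K-token sequence needs 40 GiB of cache with grouped-query attention and 320 GiB without it, before counting the weights. The cache is linear in `tokens`, so halving the context halves the figure.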

---

## What SSMs Do Instead

State space models replace pairwise attention with a compressed state representation. Instead of attending to all previous tokens, the model maintains a fixed-size state that evolves as new tokens arrive.

> ⚡ Linear compute in sequence length. The state is a constant size regardless of sequence length, so at inference the memory it needs is O(1) in n, compared with a KV cache that grows as O(n) and attention compute that grows as O(n²).
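
The fixed-size state is easiest to see in code. What follows is a minimal sketch of a plain (non-selective) linear SSM with a diagonal transition, written in pure Python for scalar inputs. The names `ssm_scan`, `a`, `b`, and `c` are illustrative, and real implementations use learned tensors and parallel scans instead of a Python loop.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear SSM over a sequence of scalar inputs.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimensions)
    y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # the state: its size depends on len(a), never on len(xs)
    ys = []
    for x in xs:
        # One O(1)-in-sequence-length update per token.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


ys, h = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
print(ys)  # [1.0, 0.5, 0.25]
```

Each token costs one state update, so total compute is linear in the sequence length, and the only thing carried forward is `h`. The price is that everything the model remembers about earlier tokens has to fit in that state.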

Mamba specifically adds **selective state spaces**: the model learns which information to keep in the state and which to discard by making the state-update parameters (including the step size) functions of the current input. This addresses the main weakness of earlier SSMs, whose fixed, input-independent dynamics couldn't selectively ignore irrelevant context.
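
A toy version of the selection idea, under heavy simplifying assumptions: a single scalar state, and only the step size `delta` depends on the input (Mamba also makes its input and output projections input-dependent). The weights `w_delta` and `bias_delta` are hypothetical stand-ins for learned parameters.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z)); always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, w_delta, bias_delta, a=-1.0):
    """Scalar SSM whose step size is computed from each input.

    Uses a zero-order-hold discretization of dh/dt = a*h + x with a < 0:
    a large step overwrites the state, a tiny step leaves it almost untouched.
    """
    h = 0.0
    hs = []
    for x in xs:
        delta = softplus(w_delta * x + bias_delta)  # input-dependent step size
        decay = math.exp(delta * a)                 # in (0, 1) because a < 0
        h = decay * h + ((decay - 1.0) / a) * x
        hs.append(h)
    return hs


# With these toy weights, large positive inputs open the gate and negative inputs close it.
hs = selective_scan([5.0, -5.0, -5.0], w_delta=4.0, bias_delta=0.0)
print(hs[-1])  # stays very close to 5.0: the two later inputs were effectively ignored
```

A non-selective SSM would apply the same decay to every token, so the two trailing inputs would have pulled the state away from 5.0.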

---

## Where Mamba Wins and Loses

**Mamba wins on:**
- Very long sequences (100K+ tokens): linear scaling vs. quadratic attention
- Inference throughput: constant memory per step, no KV cache growth
- Real-time streaming: state updates are O(1) per new token

**Transformer wins on:**
- Recall from arbitrary positions: attention can directly look back anywhere
- In-context learning: Transformers are empirically better at few-shot tasks
- Training stability: better understood optimization landscape

The evidence from Mamba-2 and hybrid architectures (attention layers + SSM layers) suggests the real answer is a combination, not a replacement. Models like Jamba and Zamba combine a small number of attention layers, used for global recall, with SSM layers that handle most of the sequence processing efficiently.
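
One way to picture a hybrid stack is as a layer schedule with a tunable ratio. The helper below is a hypothetical sketch: `hybrid_layout` and `attention_every` are made-up names, and the ratio in the example is arbitrary, not the published configuration of Jamba, Zamba, or any other model.

```python
def hybrid_layout(num_layers: int, attention_every: int) -> list[str]:
    """Return a layer-type schedule with one attention layer per `attention_every` layers."""
    if attention_every < 1:
        raise ValueError("attention_every must be at least 1")
    return [
        "attention" if (i + 1) % attention_every == 0 else "ssm"
        for i in range(num_layers)
    ]


layout = hybrid_layout(num_layers=8, attention_every=4)
print(layout)
# ['ssm', 'ssm', 'ssm', 'attention', 'ssm', 'ssm', 'ssm', 'attention']
```

Only the attention layers keep a KV cache, so in this sketch the cache shrinks in proportion to the share of attention layers, while those layers still give the model a direct path back to arbitrary positions.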

---

## The Bigger Picture

SSMs aren't replacing Transformers — they're filling the gaps Transformers leave. Quadratic attention is genuinely a constraint at very long contexts, and the inference cost difference becomes decisive at production scale.

The more interesting question is whether hybrid architectures can capture the best of both: Transformer-quality recall with SSM-class inference efficiency. Early results say yes, with the right mixing ratio. That's where the architecture research is actually going.