# Transformer Architecture

The Transformer is a deep learning architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." It displaced the recurrent networks (RNNs, including LSTMs) that had dominated sequence modeling for nearly a decade, and became the foundation for virtually every major language model developed since, including BERT, GPT, T5, LLaMA, Claude, and Gemini. Understanding the Transformer is a prerequisite for understanding how modern language models work.

## The Problem It Solved

Before Transformers, sequence-to-sequence tasks (translation, summarization, question answering) were handled by recurrent neural networks. RNNs process sequences token by token — each step's output depends on the previous step's hidden state. This creates two critical problems:

1. **Sequential bottleneck**: Each token must wait for all previous tokens to be processed, preventing parallelization during training. Training on long sequences was slow.
2. **Long-range dependency failure**: Information from early tokens is compressed into a fixed-size hidden state. Over long sequences, gradients vanish during training and the network tends to forget early context. By the time the model reaches the last sentence of a long passage, much of what it needs from the first sentence may already be lost.
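
Both problems are visible in a minimal vanilla RNN forward pass. The sketch below is illustrative only: the function name `rnn_forward`, the weight names, and the sizes are assumptions made for this example, not code from any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence, one token at a time."""
    hidden = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # Step t needs the hidden state from step t-1, so the loop
        # cannot be parallelized across time steps.
        hidden = np.tanh(x_t @ W_xh + hidden @ W_hh + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 8  # hypothetical sizes
inputs = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (6, 8): one fixed-size hidden state per token
```

Everything the network knows about the first token at step 6 has to survive five successive updates of the same 8-dimensional vector.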

Attention mechanisms were developed to address the second problem — allowing the network to directly reference any position in the input. The 2017 paper's insight was radical: eliminate the recurrent structure entirely and use attention as the sole information-passing mechanism.

## Architecture Overview

A standard Transformer has two stacks: an **encoder** (reads the input) and a **decoder** (generates the output). Modern language models like GPT use only the decoder stack. Models like BERT use only the encoder.

```
Input Tokens
    |
Token Embeddings + Positional Encoding
    |
[Encoder Stack: N identical layers]
  - Multi-Head Self-Attention
  - Feed-Forward Network
  - Add & Norm after each sub-layer (residual connection + layer normalization)
    |
[Decoder Stack: N identical layers]
  - Masked Multi-Head Self-Attention
  - Cross-Attention (attends to encoder output)
  - Feed-Forward Network
  - Add & Norm after each sub-layer
    |
Linear + Softmax
    |
Output Probabilities
```
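
The "masked" self-attention in the decoder stack prevents a position from attending to later positions, so the model cannot look at tokens it has not generated yet. A minimal NumPy sketch of such a causal mask follows; the helper names `causal_mask` and `masked_softmax` are hypothetical, and the all-zero scores are placeholders chosen to make the effect easy to read.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask: entry [i, j] is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, with disallowed positions set to -inf first."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)  # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform raw scores, for illustration only
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and assigns zero weight to every later position.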

## Self-Attention: The Core Mechanism

Self-attention allows each token in a sequence to directly attend to every other token. For each token, three vectors are computed from its embedding by multiplying it with learned weight matrices:

- **Query (Q)**: What this token is looking for
- **Key (K)**: What this token offers to others
- **Value (V)**: The actual content this token contributes

The attention output is computed as: **Attention(Q, K, V) = softmax(QK^T / √d_k) V**

Here d_k is the dimension of the key vectors. The √d_k scaling keeps the dot products from growing with the dimension, which would otherwise push the softmax into saturated regions where gradients are very small. The softmax produces a probability distribution over all positions for each query, and the output is a weighted sum of the value vectors.
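
The effect of the scaling can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not a property of trained models); under that assumption the dot product has standard deviation √d_k, and dividing by √d_k brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64              # per-head key dimension used in the original paper
num_samples = 10_000  # arbitrary sample count for the estimate

q = rng.normal(size=(num_samples, d_k))
k = rng.normal(size=(num_samples, d_k))

raw = np.sum(q * k, axis=-1)  # unscaled dot products q . k
scaled = raw / np.sqrt(d_k)   # scaled as in the attention formula

# Expect roughly sqrt(d_k) = 8 for the raw scores and roughly 1 after scaling.
print(raw.std(), scaled.std())
```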

In plain terms: for each token, compute how relevant every other token is (via dot products of queries and keys), normalize with softmax, and take a weighted average of the values. Every token can directly attend to every other in one operation, with no sequential bottleneck.
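
The formula translates almost line for line into NumPy. This is a minimal single-head sketch without masking or batching; the projection matrices `W_q`, `W_k`, `W_v` are randomly initialized here purely for illustration, whereas in a real model they are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8  # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                 for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape, weights.shape)  # (5, 8) (5, 5)
```

All five output rows are produced by two matrix multiplications, with no loop over positions.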

## Multi-Head Attention

Rather than performing attention once, the Transformer runs attention h times in parallel with different learned weight matrices. Each "head" can attend to different aspects of the relationships: one head might track syntactic dependencies, another semantic similarity, another coreference.

The outputs of all heads are concatenated and projected back to the original dimension. This is the "multi-head" in Multi-Head Attention.
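
One common way to implement this is to project once to the full model dimension and then split the result into h slices, one per head. The sketch below follows that layout under stated assumptions: `d_model` is divisible by `num_heads`, there is no masking or batching, and the weights are random placeholders rather than learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # (heads, seq, d_head)
    # Concatenate the heads along the feature axis, then project back.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4  # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                      for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 16)
```

Because each head works in a `d_model / num_heads` subspace, the total cost stays close to that of a single full-width attention operation.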

## Positional Encoding

Unlike RNNs, self-attention has no built-in notion of order: it is permutation-equivariant, so shuffling the input tokens simply shuffles the outputs in the same way. To give the model information about token order, the original paper added fixed sinusoidal positional encodings to the token embeddings. Many later models use learned positional embeddings instead, or relative schemes such as RoPE (Rotary Position Embedding), which rotates queries and keys by a position-dependent angle and is generally reported to cope better with sequences longer than those seen during training.
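
The fixed sinusoidal scheme from the original paper uses sine on even feature indices and cosine on odd ones, with wavelengths that grow geometrically up to 10000 · 2π. A minimal sketch, assuming an even `d_model` (the function name is chosen for this example):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
print(pe[0])     # position 0: sin terms are 0, cos terms are 1
```

The encoding is added elementwise to the token embeddings, so it must have the same shape as the embedding matrix for the sequence.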

## Feed-Forward Networks and Residual Connections

After each attention layer, a position-wise feed-forward network (two linear layers with a nonlinearity) applies independent transformations to each token position. The intuition is that attention aggregates information across positions while the FFN processes each token's representation locally.
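
"Position-wise" means the same two-layer network is applied to every row of the sequence independently. The sketch below uses ReLU, as in the original paper, and a hidden width of four times `d_model`, matching the paper's 512 to 2048 ratio; the weights are random placeholders.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to each position."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64  # hypothetical sizes, d_ff = 4 * d_model
X = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = feed_forward(X, W1, b1, W2, b2)
# Processing one position alone gives the same result as processing it
# inside the full sequence: no information moves between positions.
same = np.allclose(out[2], feed_forward(X[2], W1, b1, W2, b2))
print(out.shape, same)  # (5, 16) True
```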

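**Residual connections** (adding a sub-layer's input to its output) give gradients a direct path through the network, which mitigates the vanishing gradient problem and allows very deep stacks to train stably. **Layer normalization** stabilizes training by normalizing each token's activations across the feature dimension. The original paper normalizes after the residual addition (post-norm); many later models normalize the sub-layer input instead (pre-norm).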
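
The "Add & Norm" step can be sketched in a few lines. This follows the post-norm ordering of the original paper, `LayerNorm(x + Sublayer(x))`. The `sublayer` used here is an arbitrary stand-in function, and `gamma` and `beta` are the learned scale and shift, set to their usual initial values of 1 and 0.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row across the feature dimension, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Post-norm residual block: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 16  # hypothetical size
x = rng.normal(size=(5, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

# Stand-in sub-layer; in a Transformer this would be attention or the FFN.
y = add_and_norm(x, lambda h: np.tanh(h @ W), gamma, beta)
print(y.mean(axis=-1).round(6))  # approximately 0 for every position
print(y.std(axis=-1).round(3))   # approximately 1 for every position
```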

## Scaling Properties

The Transformer's impact is amplified by a remarkable empirical finding: test loss falls as a smooth power law in model size, dataset size, and training compute, as long as the other two are not the bottleneck. These are the "scaling laws" documented by Kaplan et al. (2020). Each doubling of parameters, data, or compute therefore yields a roughly constant multiplicative reduction in loss, not a constant absolute gain. This predictability enabled confident investment in much larger models.
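
A short calculation shows what a power law implies in practice. The sketch assumes loss is proportional to N^(-α) in parameter count N, with nothing else limiting performance. The exponent 0.076 is the approximate value reported by Kaplan et al. for parameter count; treat it as illustrative, since fitted exponents depend on the dataset and setup.

```python
ALPHA_N = 0.076  # approximate parameter-count exponent (Kaplan et al., 2020)

def loss_ratio(scale_factor, alpha):
    """New loss divided by old loss when the scaled quantity grows by
    scale_factor, assuming loss is proportional to quantity ** (-alpha)."""
    return scale_factor ** (-alpha)

for factor in (2, 10, 100):
    ratio = loss_ratio(factor, ALPHA_N)
    print(f"{factor:>3}x parameters -> loss multiplied by {ratio:.3f}")
```

Under these assumptions, doubling the parameter count lowers the loss by roughly 5%, and the same relative reduction applies to each further doubling.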

| Model | Year | Parameters | Context Length |
|-------|------|-----------|---------------|
| Original Transformer | 2017 | ~65M | 512 tokens |
| GPT-3 | 2020 | 175B | 2,048 tokens |
| GPT-4 | 2023 | Undisclosed (~1T, unofficial est.) | 8K or 32K tokens at launch; 128K with GPT-4 Turbo |
| Gemini 1.5 Pro | 2024 | Unknown | 1M tokens |
| LLaMA 3.1 | 2024 | 405B | 128K tokens |

## Limitations and Active Research

**Quadratic complexity**: Standard self-attention computes a score between every pair of tokens, so computation and memory scale as O(n²) with sequence length n. For long documents, this becomes prohibitive. Linear-attention approximations (Performer) and attention-free alternatives such as RWKV and state-space models (Mamba) attempt to reduce this cost.
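
The quadratic growth is easy to quantify. The sketch below estimates the memory needed to hold one head's full n × n score matrix, assuming 2 bytes per value (16-bit floats) and assuming the matrix is materialized in full; the function name and defaults are choices made for this example.

```python
def attention_matrix_bytes(seq_len, num_heads=1, bytes_per_value=2):
    """Bytes needed to store the full seq_len x seq_len score matrix."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (1_024, 8_192, 131_072):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:.3f} GiB per head")
# Prints 0.002, 0.125, and 32.000 GiB respectively.
```

Increasing the sequence length by a factor of 128 (from 1,024 to 131,072 tokens) multiplies the memory by 128² = 16,384.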

**Context window**: While recent models have extended to millions of tokens, effectively using very long contexts — especially retrieving specific information from earlier in the sequence — remains an open research problem.

**Hallucination**: Transformers generate text by predicting the next token from patterns in training data. They do not have a separate "knowledge" store and can generate plausible-sounding but factually wrong content with high confidence.

**Interpretability**: Despite enormous effort, it remains difficult to explain why a Transformer produces a specific output. Attention weights are often interpreted as showing "what the model attends to," but this is a known oversimplification — attention patterns do not straightforwardly correspond to which information was actually used.

The Transformer's seven-year dominance of AI research is remarkable, but architectures evolve. State-space models (Mamba) and hybrid approaches explore whether the quadratic attention bottleneck can be resolved at scale, while mixture-of-experts architectures reduce the compute spent per token in the feed-forward layers. All are active areas of research.
