Transformers — The Architecture That Changed AI (Part 1 of 3)

A deep dive into the Transformer architecture — from attention mechanisms to self-attention, multi-head attention, positional encoding, and why this single paper reshaped all of modern AI.


In June 2017, a team at Google published a paper with a deceptively simple title: "Attention Is All You Need." Eight authors, fourteen pages, and one architecture that would go on to power GPT-4, Claude, Gemini, DALL-E, Stable Diffusion, AlphaFold, and virtually every breakthrough in AI since.

The Transformer didn't just improve on existing models. It replaced the entire paradigm. Recurrent neural networks, LSTMs, sequence-to-sequence models with attention — all of them became legacy architectures almost overnight.

This is Part 1 of a 3-part series. Here we cover the Transformer itself — the core architecture, the intuition behind each component, and why it scales so remarkably well. Part 2 will cover Vision Transformers (how this architecture learned to see), and Part 3 will cover Vision-Language Models (when AI learned to see and talk).


The Problem: Why RNNs Hit a Wall

To understand why Transformers matter, you need to understand what came before.

Recurrent Neural Networks (RNNs) process sequences one token at a time, left to right. Each step takes the previous hidden state and the current input, produces a new hidden state, and passes it forward. This is elegant in theory: the hidden state is a compressed summary of everything the model has seen so far.

In practice, it has three devastating problems:

  1. The bottleneck problem. By the time an RNN reaches the 500th word in a paragraph, the information from the 1st word has been compressed through 499 sequential transformations. Important early context gets diluted or lost entirely. Imagine trying to remember the first sentence of a book after reading 500 pages, where each page partially overwrites your memory of the previous one.

  2. No parallelization. Because each step depends on the previous step's output, you cannot process tokens in parallel. Training is inherently sequential. On modern GPUs with thousands of cores designed for parallel computation, this is a catastrophic bottleneck.

  3. Vanishing and exploding gradients. During backpropagation through time, gradients must flow backwards through every sequential step. Over long sequences, they either shrink to near-zero (vanishing) or blow up to infinity (exploding), making it extremely hard to learn long-range dependencies.
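The gradient problem above can be felt with a toy calculation. This is not a real backpropagation, just an illustration under a simplifying assumption: treat the recurrent Jacobian at each step as a single scale factor, slightly below or slightly above 1, and multiply it 500 times.

```python
# Toy illustration of vanishing/exploding gradients over T sequential steps.
# Assumption: the per-step Jacobian is approximated by one scalar factor.
T = 500
grad_vanish = 1.0
grad_explode = 1.0
for _ in range(T):
    grad_vanish *= 0.95   # factor slightly below 1: signal shrinks
    grad_explode *= 1.05  # factor slightly above 1: signal blows up

print(f"after {T} steps: {grad_vanish:.2e} (vanished), {grad_explode:.2e} (exploded)")
```

Even a factor of 0.95 per step leaves essentially nothing of the gradient after 500 steps, which is why long-range dependencies are so hard for plain RNNs to learn.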

LSTMs and GRUs partially addressed problem 3 by adding gating mechanisms — explicit "remember" and "forget" controls. They helped, but they didn't solve the fundamental sequential nature of the computation (problem 2) or the information bottleneck (problem 1).

The sequence-to-sequence model with attention (Bahdanau et al., 2014) made a crucial step forward. Instead of forcing the decoder to work from a single compressed context vector, it allowed the decoder to "look back" at all encoder hidden states and attend to the most relevant ones at each decoding step. This was the birth of attention as a mechanism.

But even seq2seq with attention still relied on an RNN backbone. The encoder still processed tokens sequentially. The Transformer's radical insight was: what if we throw away the recurrence entirely and use only attention?


The Core Idea: Attention as the Only Mechanism

The Transformer computes relationships between all tokens in a sequence simultaneously. Instead of passing information through a chain of hidden states, every token can directly attend to every other token in a single operation.

Think of it this way. An RNN is like a game of telephone — each person whispers the message to the next, and by the end of the line, the message is garbled. A Transformer is like a round table where everyone can hear everyone else directly. No information loss from sequential passing. No bottleneck.

This has a profound consequence: the entire sequence can be processed in parallel. During training, all tokens are known in advance, so every attention computation can happen simultaneously across the GPU. This is why Transformers train orders of magnitude faster than RNNs on the same hardware.


The Architecture: Component by Component

The original Transformer uses an encoder-decoder structure, designed for sequence-to-sequence tasks like machine translation (English to German, for example). Let's walk through each component.

Encoder-Decoder Overview

The encoder takes the input sequence (e.g., an English sentence) and produces a rich representation of it — a set of vectors that capture meaning and context. The decoder takes that representation and generates the output sequence (e.g., the German translation) one token at a time.

The encoder is a stack of 6 identical layers. Each layer has two sub-components: a multi-head self-attention mechanism and a position-wise feed-forward network. The decoder is also 6 layers, but each layer has three sub-components: masked multi-head self-attention, multi-head cross-attention (attending to the encoder output), and a feed-forward network.

Every sub-component is wrapped with a residual connection and layer normalization. We'll cover each piece.

Input Embeddings and Positional Encoding

Before anything else, input tokens are converted to dense vectors via a learned embedding table. If the model dimension is d_model = 512, each token becomes a 512-dimensional vector.

But here's a problem the RNN never had: since the Transformer processes all tokens simultaneously, it has no inherent notion of order. Without some way to encode position, the sentences "the cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns.

The solution is positional encoding — adding a position-dependent signal to each token embedding. The original paper uses sinusoidal functions:

PE(pos, 2i)     = sin(pos / 10000^(2i/d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique pattern of sine and cosine values across the embedding dimensions. The key properties: (1) each position has a unique encoding, (2) the encoding is deterministic (no learned parameters), and (3) the model can generalize to sequence lengths longer than those seen during training because the functions are continuous.
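The two formulas above translate directly into a few lines of NumPy. This is a minimal sketch of the paper's sinusoidal scheme, not production code; the function name is mine:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) sinusoidal encoding from the formulas above."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
# Every row (position) is a distinct pattern; all values stay within [-1, 1],
# so the signal is the same scale as the token embeddings it's added to.
```

In practice this matrix is simply added element-wise to the token embeddings before the first encoder layer.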

The analogy: think of positional encoding as a unique "address" stamped onto each word. The model learns to read these addresses and factor position into its attention decisions.

Modern Transformers often use learned positional embeddings (just another embedding table indexed by position) or Rotary Position Embeddings (RoPE), which encode relative position directly into the attention computation. But the core insight remains the same: you must inject position information explicitly.

Scaled Dot-Product Attention: The Q/K/V Framework

This is the heart of the Transformer. Every attention mechanism in the architecture is built on this single operation.

For each token, we compute three vectors from its embedding:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide if you attend to me?"

These are produced by multiplying the input by three learned weight matrices: W_Q, W_K, and W_V.

The attention computation works in three steps:

Step 1: Compute compatibility scores. Multiply each query by all keys (dot product). This produces a score matrix: how relevant is each key to each query. High dot product = the query and key are aligned = this token is relevant to that token.

Step 2: Scale and normalize. Divide scores by the square root of the key dimension (sqrt(d_k)). This scaling prevents the dot products from growing too large in magnitude, which would push the softmax into regions with tiny gradients. Then apply softmax row-wise to get attention weights that sum to 1.

Step 3: Weighted sum of values. Multiply the attention weights by the value vectors. Each token's output is a weighted combination of all value vectors, with weights determined by how relevant each key was to that token's query.

In matrix form:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The analogy: imagine you're at a library (the sequence). Your query is the question you're researching. Each book has a title (key) and content (value). You scan all the titles, figure out which books are most relevant to your question, and then read those books more carefully — weighting your reading time based on relevance.
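The three steps map almost one-to-one onto NumPy operations. A minimal sketch (single sequence, no batching, no masking), with the max-subtraction added only for numerical stability:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # steps 1-2: scaled scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # step 3: weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))  # 4 tokens, d_k = 64
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 64))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` sums to 1; each row of `out` is that token's
# attention-weighted mixture of all value vectors.
```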

Multi-Head Attention: Attending to Different Things Simultaneously

A single attention head learns one kind of relationship. But language has many simultaneous relationships: syntactic, semantic, coreference, positional, topical.

Multi-head attention runs multiple attention operations in parallel, each with its own learned Q/K/V projections. The original Transformer uses 8 heads with d_k = d_v = 64 each (total: 8 * 64 = 512 = d_model).

Each head can specialize. Research has shown that different heads learn to capture different linguistic phenomena:

  • One head might track subject-verb agreement
  • Another might focus on adjacent word relationships
  • Another might capture long-range coreference ("she" referring to "Dr. Smith" from three sentences ago)

The outputs of all heads are concatenated and linearly projected back to d_model:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_Qi, K * W_Ki, V * W_Vi)

This is one of the Transformer's most powerful design choices — it gets multiple "perspectives" on the same data for the cost of one full-dimensional attention computation.
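The project-per-head, concatenate, then project-back pattern can be sketched as follows. The weight matrices here are random stand-ins for the learned W_Qi, W_Ki, W_Vi, and W_O, so the output is meaningless; only the shapes and data flow are the point:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, n_heads=8, seed=0):
    """Toy MultiHead(X, X, X): per-head projections, concat, output projection."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads          # 512 / 8 = 64, as in the original paper
    rng = np.random.default_rng(seed)
    scale = d_model ** -0.5
    head_outputs = []
    for _ in range(n_heads):
        W_Q = rng.normal(scale=scale, size=(d_model, d_k))  # stand-in for W_Qi
        W_K = rng.normal(scale=scale, size=(d_model, d_k))  # stand-in for W_Ki
        W_V = rng.normal(scale=scale, size=(d_model, d_k))  # stand-in for W_Vi
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        head_outputs.append(weights @ V)                    # (seq_len, d_k)
    concat = np.concatenate(head_outputs, axis=-1)          # (seq_len, h * d_k)
    W_O = rng.normal(scale=scale, size=(d_model, d_model))  # stand-in for W_O
    return concat @ W_O                                     # back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(10, 512))  # 10 tokens, d_model = 512
out = multi_head_self_attention(X)
```

Note that each head operates in a 64-dimensional subspace, so the total cost matches a single 512-dimensional attention, exactly as the text above describes.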

Three Types of Attention in the Transformer

The encoder-decoder Transformer uses attention in three distinct ways:

  1. Encoder self-attention. Each token in the input attends to all other tokens in the input. This builds a contextual representation where each word's embedding is enriched by the full context of the sentence.

  2. Masked decoder self-attention. Each token in the output sequence attends to all previous output tokens (but not future ones). The masking prevents the model from "cheating" by looking ahead during training. The scores for future positions are set to negative infinity before the softmax, driving their attention weights to zero.

  3. Encoder-decoder cross-attention. Each decoder token attends to all encoder outputs. The queries come from the decoder; the keys and values come from the encoder. This is how the decoder "reads" the input sentence to produce the translation.
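The causal mask from type 2 is just an upper-triangular matrix of negative infinities added to the scores. A small sketch with uniform dummy scores makes the effect visible:

```python
import numpy as np

def causal_mask(seq_len):
    """Position i may attend to positions <= i; future positions get -inf,
    so the softmax drives their attention weights to exactly zero."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.ones((4, 4))                 # dummy raw attention scores
weights = softmax(scores + causal_mask(4))
# Row i spreads its weight uniformly over positions 0..i and puts
# zero weight on every position after i.
```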

Feed-Forward Network

After each attention sub-layer, every token passes independently through the same two-layer feed-forward network:

FFN(x) = ReLU(x * W_1 + b_1) * W_2 + b_2

The inner dimension is expanded (typically 4x, from 512 to 2048 in the original), then projected back down. This gives the model per-token nonlinear processing capacity — attention handles inter-token relationships, while the FFN handles intra-token transformation.

Recent research suggests the FFN layers serve as the model's "memory," storing factual knowledge, while the attention layers handle relational reasoning.
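The FFN formula is a one-liner in NumPy. This sketch uses random stand-ins for the learned weights and the original paper's 512-to-2048 expansion; note that the same weights are applied to every token independently:

```python
import numpy as np

def feed_forward(x, W_1, b_1, W_2, b_2):
    """FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2, applied per token."""
    return np.maximum(0, x @ W_1 + b_1) @ W_2 + b_2

d_model, d_ff = 512, 2048  # 4x inner expansion, as in the original paper
rng = np.random.default_rng(0)
W_1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))  # expand
b_1 = np.zeros(d_ff)
W_2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))     # project back
b_2 = np.zeros(d_model)

tokens = rng.normal(size=(10, d_model))  # 10 tokens after attention
out = feed_forward(tokens, W_1, b_1, W_2, b_2)
```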

Residual Connections and Layer Normalization

Every sub-layer (attention and FFN) is wrapped with:

output = LayerNorm(x + Sublayer(x))

The residual connection (x + Sublayer(x)) allows gradients to flow directly through the network without degradation, enabling very deep stacks (6, 12, 24, 96+ layers). Without residuals, training deep Transformers would be nearly impossible.

Layer normalization stabilizes the hidden state magnitudes, preventing the distribution of activations from drifting as the signal passes through many layers. It normalizes across the feature dimension for each token independently.
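The post-norm wrapping used in the original paper can be sketched in a few lines. For simplicity this omits the learned gain and bias that real layer norm carries, and stands in for the sub-layer with a trivial lambda:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance
    (learned gain and bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_wrap(x, sublayer):
    """Post-norm residual wrap from the paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(10, 512))
out = sublayer_wrap(x, lambda h: 0.1 * h)  # stand-in for attention or the FFN
# Each token's output features now have mean ~0 and std ~1.
```

Many modern models move the normalization before the sub-layer ("pre-norm", x + Sublayer(LayerNorm(x))), which tends to train more stably at large depth, but the residual idea is unchanged.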

These aren't glamorous components, but they're essential. The Transformer's depth — and therefore its capacity — depends on them.


Why This Design Scales So Well

The Transformer has a property that no previous architecture achieved to the same degree: predictable, smooth scaling.

In 2020, Kaplan et al. (OpenAI) published the scaling laws paper, showing that Transformer performance improves as a smooth power law with respect to three factors:

  1. Model size (number of parameters)
  2. Dataset size (number of training tokens)
  3. Compute budget (FLOPs spent on training)

Double the parameters, and you get a predictable improvement. Double the data, same thing. This is remarkably different from previous architectures where scaling often hit diminishing returns or instabilities.

Why do Transformers scale so well?

  • Full parallelism means more compute directly translates to faster training and larger batch sizes. No sequential bottleneck limits how much hardware you can throw at the problem.
  • Attention across all positions means the model's capacity to capture relationships grows with sequence length and model width, without architectural changes.
  • Depth composability — stacking more identical layers adds representational power smoothly, thanks to residual connections and layer normalization.
  • No information bottleneck — unlike RNNs, there's no fixed-size hidden state that all information must pass through.

The Chinchilla paper (Hoffmann et al., 2022) later refined these laws, showing that models should be trained on roughly 20 tokens per parameter for optimal compute efficiency. This led to a shift from the "bigger model" paradigm to the "more data" paradigm.
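The Chinchilla rule of thumb is easy to apply by hand. Using the ~20 tokens-per-parameter ratio, plus the standard approximation that training cost is about 6 * N * D FLOPs for N parameters and D tokens:

```python
# Back-of-the-envelope Chinchilla arithmetic.
# Rule of thumb: compute-optimal training uses ~20 tokens per parameter;
# training FLOPs are approximately 6 * N * D.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

for n in (7e9, 70e9):
    d = chinchilla_optimal_tokens(n)
    flops = 6 * n * d
    print(f"{n / 1e9:.0f}B params -> {d / 1e12:.2f}T tokens, ~{flops:.1e} training FLOPs")
```

For a 70B-parameter model this gives roughly 1.4 trillion tokens, which matches what Chinchilla itself was trained on.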


Transformer Variants: A Family of Architectures

The original Transformer is encoder-decoder. But researchers quickly discovered that using parts of the architecture for specific tasks yielded remarkable results. Three major paradigms emerged.

Encoder-Only: BERT and Its Descendants

BERT (Bidirectional Encoder Representations from Transformers, 2018) uses only the encoder stack. During pre-training, it masks random tokens in the input and trains the model to predict them — this is Masked Language Modeling (MLM). Because there's no autoregressive generation, every token can attend to every other token bidirectionally.

BERT excels at understanding tasks: classification, named entity recognition, question answering, semantic similarity. It produces rich contextual embeddings where the same word gets different representations depending on context ("bank" in "river bank" vs. "bank account").

Key descendants:

  • RoBERTa — Same architecture, better training: more data, longer training, no next-sentence prediction objective. Showed BERT was significantly undertrained.
  • ALBERT — Parameter-efficient BERT with cross-layer parameter sharing and factorized embeddings. Smaller model, competitive performance.
  • DeBERTa — Disentangles content and position into separate attention streams, then combines them late. Achieved human-level performance on the SuperGLUE benchmark.
  • XLNet — Uses a permutation-based training objective to capture bidirectional context while maintaining autoregressive formulation. Avoids the pretrain-finetune discrepancy of BERT's masking.

Decoder-Only: GPT and the Autoregressive Revolution

GPT (Generative Pre-trained Transformer, 2018) uses only the decoder stack. It's trained to predict the next token given all previous tokens — pure autoregressive language modeling. The masked self-attention ensures each position can only attend to earlier positions.

This paradigm turned out to be the one that scales the furthest. The progression:

  • GPT-1 (2018): 117M parameters. Showed that unsupervised pre-training + supervised fine-tuning works.
  • GPT-2 (2019): 1.5B parameters. Showed that scale enables zero-shot task performance. "Too dangerous to release" (it wasn't, but the PR worked).
  • GPT-3 (2020): 175B parameters. Showed that in-context learning emerges at scale — the model can perform tasks from a few examples in the prompt, no fine-tuning needed.
  • GPT-4 (2023): Architecture undisclosed, rumored mixture-of-experts. Multimodal (text + vision). State-of-the-art across dozens of benchmarks.

Other major decoder-only models:

  • PaLM (Google, 540B) — Showed breakthrough performance on reasoning tasks with chain-of-thought prompting.
  • LLaMA (Meta) — Open-weight models proving that smaller, well-trained models can match much larger ones. LLaMA 2 (7B-70B) and LLaMA 3 catalyzed the open-source AI ecosystem.
  • Mistral — Efficient open models using Grouped-Query Attention and Sliding Window Attention. Mistral 7B outperformed LLaMA 2 13B.
  • Claude (Anthropic) — Constitutional AI approach with RLHF. Strong reasoning and instruction-following with emphasis on safety and helpfulness.

Encoder-Decoder: T5 and Unified Frameworks

T5 (Text-to-Text Transfer Transformer, 2019) keeps the full encoder-decoder architecture but reframes every NLP task as a text-to-text problem. Classification? Input: "classify: this movie was great", output: "positive". Translation? Input: "translate English to German: Hello", output: "Hallo".

This unified framing is elegant — one architecture, one training procedure, one format for everything. T5 also systematically studied every architectural choice (model size, pre-training objective, dataset), making it one of the most thorough papers in the field.


Model Comparison

| Model | Type | Year | Parameters | Key Innovation |
|---|---|---|---|---|
| Original Transformer | Encoder-Decoder | 2017 | 65M | Self-attention replacing recurrence entirely |
| BERT | Encoder-only | 2018 | 110M / 340M | Bidirectional pre-training with masked language modeling |
| GPT-1 | Decoder-only | 2018 | 117M | Unsupervised pre-training + fine-tuning paradigm |
| GPT-2 | Decoder-only | 2019 | 1.5B | Zero-shot task transfer via scale |
| T5 | Encoder-Decoder | 2019 | 220M - 11B | Unified text-to-text framing for all NLP tasks |
| XLNet | Autoregressive | 2019 | 340M | Permutation-based training for bidirectional context |
| RoBERTa | Encoder-only | 2019 | 355M | Optimized BERT training procedure |
| ALBERT | Encoder-only | 2019 | 12M - 235M | Cross-layer parameter sharing, factorized embeddings |
| DeBERTa | Encoder-only | 2020 | 134M - 1.5B | Disentangled attention for content and position |
| GPT-3 | Decoder-only | 2020 | 175B | In-context learning, few-shot capabilities |
| PaLM | Decoder-only | 2022 | 540B | Pathways system, breakthrough reasoning |
| LLaMA 2 | Decoder-only | 2023 | 7B - 70B | Open-weight, efficient training, GQA |
| Mistral 7B | Decoder-only | 2023 | 7B | Sliding window attention, grouped-query attention |
| GPT-4 | Decoder-only (MoE?) | 2023 | Undisclosed | Multimodal, state-of-the-art reasoning |
| Claude 3.5 | Decoder-only | 2024 | Undisclosed | Constitutional AI, strong reasoning + safety |
| LLaMA 3 | Decoder-only | 2024 | 8B - 405B | 15T tokens training data, extended context |

Why Decoder-Only Won (For Now)

A natural question: if the original Transformer is encoder-decoder, why are the largest and most capable models decoder-only?

Several factors converged:

  1. Simplicity. One stack is easier to scale than two. Fewer architectural decisions, fewer hyperparameters, simpler training pipelines.

  2. Unification of understanding and generation. Encoder-only models are great at understanding but cannot generate. Decoder-only models can do both — they understand context through the process of predicting what comes next.

  3. Emergent capabilities. As decoder-only models scaled, unexpected abilities appeared: chain-of-thought reasoning, in-context learning, instruction following. These emergent behaviors were less pronounced in encoder-only or encoder-decoder models at similar scales.

  4. Training efficiency. Next-token prediction is a dense supervision signal — every token in the corpus contributes to the loss. Masked language modeling trains on only the ~15% of tokens that are masked.

That said, encoder-decoder architectures aren't dead. They excel at tasks where you have a clear input-output mapping (translation, summarization), and models like T5 and its successors remain competitive in many benchmarks.


The Lasting Impact

The Transformer didn't just change NLP. It became the universal architecture for deep learning:

  • Computer Vision: Vision Transformers (ViT) treat image patches as tokens and achieve state-of-the-art classification, detection, and segmentation.
  • Audio: Whisper uses Transformers for speech recognition. MusicLM generates music.
  • Protein Structure: AlphaFold 2 uses a modified attention mechanism to predict 3D protein structures, solving a 50-year-old biology problem.
  • Robotics: RT-2 uses Transformers to translate language instructions into robot actions.
  • Code: Codex, CodeLlama, and StarCoder generate and understand programming languages.

The architecture is so general that the main research question shifted from "what architecture should we use?" to "how much data and compute should we invest?"


What Comes Next: From Text to Vision

The Transformer started with language, but it didn't stay there. In Part 2 of this series, we'll explore Vision Transformers (ViT) — how researchers adapted the attention mechanism to work with images, why it works so well, and how it dethroned CNNs as the dominant architecture in computer vision.

From pixels to patches to attention maps — the next chapter of the Transformer story is just as transformative.

Next up: Part 2 — Vision Transformers: How Transformers Learned to See
