Vision Transformers — How Transformers Learned to See (Part 2 of 3)

Recap: The Transformer Revolution (Part 1)

In Part 1 of this series, we explored how the Transformer architecture — introduced in Google's 2017 paper "Attention Is All You Need" — upended natural language processing. The key ideas were self-attention (letting every token attend to every other token), positional encodings (injecting sequence order without recurrence), and multi-head attention (learning multiple relationship patterns in parallel). Transformers replaced RNNs and LSTMs as the backbone of language models, eventually powering GPT, BERT, and everything that followed.

But Transformers were designed for sequences of tokens — words, subwords, characters. Images are not sequences. They are 2D grids of pixels with spatial structure, local patterns, and hierarchical features. For decades, a completely different family of architectures dominated vision: convolutional neural networks.

So how did Transformers learn to see? That is the story of this post.

The CNN Era: What Worked and What Did Not

Convolutional Neural Networks (CNNs) have been the workhorse of computer vision since AlexNet won ImageNet in 2012. The architecture is elegant: small learnable filters slide across the image, detecting local patterns like edges, textures, and shapes. Stacking convolutional layers builds a hierarchy — early layers detect edges, middle layers detect parts (eyes, wheels), and deep layers detect entire objects.

CNNs come with strong inductive biases baked into their design:

Locality: Each filter looks at a small patch of the image. A 3x3 convolution only sees 9 pixels at a time.
Translation equivariance: The same filter is applied everywhere, so a cat detected in the top-left corner uses the same weights as a cat in the bottom-right.
Hierarchical feature extraction: Pooling layers progressively reduce spatial resolution, forcing the network to build abstract representations.
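
To make locality and translation equivariance concrete, here is a minimal NumPy sketch. The plus-shaped pattern, the 12x12 image size, and the helper names are made up for illustration; real CNN layers add channels, padding, and learned weights.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Locality: each output value depends on a 3x3 neighborhood only.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy plus-shaped pattern and a filter tuned to it (hypothetical values).
pattern = np.array([[0., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 0.]])
kernel = pattern.copy()

def image_with_pattern(top, left, size=12):
    """Blank image with the pattern pasted at (top, left)."""
    img = np.zeros((size, size))
    img[top:top + 3, left:left + 3] = pattern
    return img

# Same pattern at two locations, same filter weights for both.
resp_a = correlate2d_valid(image_with_pattern(1, 2), kernel)
resp_b = correlate2d_valid(image_with_pattern(6, 7), kernel)

peak_a = np.unravel_index(np.argmax(resp_a), resp_a.shape)
peak_b = np.unravel_index(np.argmax(resp_b), resp_b.shape)
print(peak_a, peak_b, resp_a.max(), resp_b.max())
```

Moving the pattern by 5 pixels in each direction moves the peak response by the same amount and leaves its value unchanged, which is what equivariance means here.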

These biases are a gift when data is limited. They tell the network how to look at images before it sees a single training example. Models like ResNet, EfficientNet, and ConvNeXt achieved remarkable accuracy and efficiency.

But CNNs have limitations:

Limited receptive field: Even deep CNNs struggle to capture long-range dependencies. A pixel in the top-left corner has no direct connection to a pixel in the bottom-right until very late in the network. This matters for understanding scene-level context — knowing that a person is holding a surfboard requires relating distant image regions.
Fixed geometric structure: Convolutions are rigid. They process fixed-size local neighborhoods regardless of content. They cannot dynamically decide which parts of the image are most relevant to each other.
Scaling bottlenecks: CNNs do scale, but their gains from more parameters and more data tend to flatten out sooner than the gains Transformers have shown in NLP.

Researchers asked: could Transformers, with their ability to let any part of the input attend to any other part, do better?

The Key Insight: Images as Sequences of Patches

The breakthrough idea behind Vision Transformers is disarmingly simple: treat an image as a sequence of patches, just like a sentence is a sequence of words.

Take a 224x224 pixel image. Divide it into a grid of non-overlapping 16x16 patches. You get 14 x 14 = 196 patches. Each patch is a small image region containing 16 x 16 x 3 = 768 pixel values (for RGB images). Flatten each patch into a vector, project it through a linear layer, and you have a sequence of 196 "tokens" — each one representing a patch of the image.
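
The patch arithmetic above can be checked with a few lines of NumPy. This is a sketch, not code from the ViT release; `patchify` is a hypothetical helper name and the image is random data.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered left-to-right, top-to-bottom.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch size")
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, P, P, C)
    return x.reshape(gh * gw, patch_size * patch_size * c)

image = np.random.default_rng(0).random((224, 224, 3))  # stand-in for a real image
patches = patchify(image, 16)
print(patches.shape)   # (196, 768)
```

The linear projection that turns each 768-value row into a token embedding is a separate, learned step.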

Now you can feed this sequence into a standard Transformer encoder. Self-attention lets every patch attend to every other patch, regardless of spatial distance. The patch in the top-left corner can directly interact with the patch in the bottom-right corner in a single layer. No need to stack dozens of layers to build a large receptive field.

This is the core idea of the Vision Transformer (ViT), published by Google Research in late 2020.

ViT Architecture: A Step-by-Step Walkthrough

Let us trace how ViT processes a single image from input to classification.

Step 1: Patch Embedding

The input image (e.g., 224x224x3) is divided into a grid of P x P patches (typically P=16). Each patch is flattened into a vector of length P^2 x C (where C is the number of channels), giving a 768-dimensional vector for 16x16 RGB patches. A learned linear projection maps each flattened patch to the model's hidden dimension D (e.g., 768). The result: a sequence of N patch embeddings, where N = (224/16)^2 = 196.

Image (224x224x3)
  → Split into 196 patches of 16x16x3
  → Flatten each patch to 768-dim vector
  → Linear projection to D-dim embedding
  → Sequence of 196 token embeddings
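
As a rough sketch of the projection step, the snippet below applies a single linear layer to already-flattened patches. The weights are random stand-ins for learned parameters, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, hidden_dim = 196, 768, 768   # N, P*P*C, D

# Stand-in for the flattened patches of one image.
patches = rng.random((num_patches, patch_dim))

# Learned in a real model; random values here.
W = rng.normal(0.0, 0.02, size=(patch_dim, hidden_dim))
b = np.zeros(hidden_dim)

tokens = patches @ W + b          # one D-dimensional embedding per patch
n_params = W.size + b.size        # parameters in the patch embedding layer
print(tokens.shape, n_params)
```

Many implementations express the same operation as a convolution whose kernel size and stride both equal the patch size, which splits and projects in one step.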

Step 2: The [CLS] Token

ViT prepends a special learnable [CLS] token to the patch sequence, borrowed directly from BERT. This token does not correspond to any image patch. Instead, it serves as an aggregation point: through self-attention across all layers, it collects information from every patch. After the final Transformer layer, the [CLS] token's representation is used for classification.

The sequence is now 197 tokens long: 1 [CLS] token + 196 patch tokens.

Step 3: Position Embeddings

Unlike convolutions, the Transformer has no built-in notion of spatial position. Self-attention is permutation-equivariant: if you shuffle the patch order, the outputs are shuffled in exactly the same way and are otherwise unchanged, so the model cannot tell where any patch came from. To encode spatial information, ViT adds learnable 1D position embeddings to each token. Position 0 is the [CLS] token, and positions 1-196 correspond to patches in raster order (left-to-right, top-to-bottom).
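
The effect of position embeddings can be seen in a small experiment. The sketch below uses a single attention head with random weights and a reduced hidden size; all names and sizes are assumptions chosen for the demo, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 64                      # patch count; small D for the demo

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Random stand-ins for learned single-head attention weights.
Wq, Wk, Wv = (rng.normal(0, D ** -0.5, (D, D)) for _ in range(3))

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(D)) @ v

patch_tokens = rng.normal(size=(N, D))
perm = rng.permutation(N)

# Without position information, shuffling the input rows just shuffles
# the output rows in the same way.
out = self_attention(patch_tokens)
out_shuffled = self_attention(patch_tokens[perm])
equivariant = np.allclose(out_shuffled, out[perm])

# ViT input: prepend [CLS], then add one learned embedding per position.
cls_token = rng.normal(0, 0.02, (1, D))
pos_embed = rng.normal(0, 0.02, (N + 1, D))

def build_sequence(tokens):
    return np.concatenate([cls_token, tokens], axis=0) + pos_embed

seq = build_sequence(patch_tokens)                  # shape (197, D)
seq_shuffled = build_sequence(patch_tokens[perm])

# With position embeddings, shuffled patches give a different result,
# not merely a reordered one.
perm_with_cls = np.concatenate([[0], perm + 1])
still_equivariant = np.allclose(self_attention(seq_shuffled),
                                self_attention(seq)[perm_with_cls])
print(equivariant, still_equivariant)
```

The first check passes and the second fails, which is the point: once each slot carries its own embedding, patch order becomes visible to the model.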

Interestingly, the original ViT paper found no significant gain from explicit 2D-aware position embeddings over simple 1D ones. The model learns the 2D structure from data: the embeddings of spatially nearby patches end up similar to each other, effectively reconstructing a 2D grid.

Step 4: Transformer Encoder

The sequence of 197 position-encoded embeddings is fed into a standard Transformer encoder, essentially the same architecture used in NLP. Each layer consists of:

Layer Normalization (applied before attention, following the Pre-Norm convention)
Multi-Head Self-Attention (MHSA): Every token attends to every other token. For 197 tokens, this means a 197x197 attention matrix per head. Each head can learn different relationships — some might focus on nearby patches, others on semantically related distant patches.
Residual Connection: Add the attention output back to the input
Layer Normalization
MLP (Feed-Forward Network): Two linear layers with GELU activation, expanding the dimension by 4x then projecting back
Residual Connection
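
The six steps above fit in a short NumPy sketch of one Pre-Norm encoder layer. To keep it small, it assumes a reduced hidden size, random weights, and no biases or learned LayerNorm scale and shift; it is meant to show the data flow, not to reproduce any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 197, 64, 4            # tokens, hidden size, heads (small for the demo)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)     # learned scale/shift omitted

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def init(shape):
    return rng.normal(0.0, 0.02, shape)

params = {
    "Wqkv": init((D, 3 * D)), "Wo": init((D, D)),      # attention weights
    "W1": init((D, 4 * D)), "W2": init((4 * D, D)),    # MLP with 4x expansion
}

def mhsa(x, p):
    q, k, v = np.split(x @ p["Wqkv"], 3, axis=-1)      # each (T, D)
    # (T, D) -> (H, T, D/H): one slice of the hidden dimension per head
    split = lambda t: t.reshape(t.shape[0], H, D // H).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D // H))   # (H, T, T)
    out = (attn @ v).transpose(1, 0, 2).reshape(x.shape[0], D)   # merge heads
    return out @ p["Wo"], attn

def encoder_block(x, p):
    a, attn = mhsa(layer_norm(x), p)                   # norm -> attention
    x = x + a                                          # residual
    x = x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]    # norm -> MLP -> residual
    return x, attn

x = rng.normal(size=(T, D))
y, attn = encoder_block(x, params)
print(y.shape, attn.shape)      # (197, 64) (4, 197, 197)
```

The attention tensor holds one 197x197 matrix per head, which is where the quadratic cost in the number of tokens comes from.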

ViT-Base uses 12 such layers, ViT-Large uses 24, and ViT-Huge uses 32.

Step 5: Classification Head

After the final encoder layer, the [CLS] token's output representation is extracted and passed through a simple MLP head (one hidden layer during pre-training, a single linear layer during fine-tuning) to produce class logits.
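
A sketch of the fine-tuning variant of the head, a single linear layer on the [CLS] output, might look like this. The encoder output and weights are random stand-ins, the 1000-class setting is an assumption, and the final layer normalization follows the formulation in the ViT paper with its learned scale and shift left out.

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_classes = 768, 1000

# Stand-in for the output of the last encoder layer: 197 tokens of size D.
encoder_output = rng.normal(size=(197, D))

# Learned in a real model; random values here.
W_head = rng.normal(0.0, 0.02, (D, num_classes))
b_head = np.zeros(num_classes)

cls_repr = encoder_output[0]                       # [CLS] sits at position 0
cls_repr = (cls_repr - cls_repr.mean()) / np.sqrt(cls_repr.var() + 1e-6)

logits = cls_repr @ W_head + b_head                # one score per class
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax over classes
predicted_class = int(np.argmax(probs))
```

Only one of the 197 output vectors feeds the classifier; the patch outputs are discarded for classification.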

Input Image
  → Patch Embedding (196 patches)
  → Prepend [CLS] token (197 tokens)
  → Add Position Embeddings
  → Transformer Encoder (L layers of MHSA + MLP)
  → Extract [CLS] token output
  → MLP Classification Head
  → Class Prediction

The elegance is striking: no pooling layers, no convolutions, no hand-crafted feature extractors. Just patches, linear projections, and attention.

The Data Hunger Problem

Here is the catch. When ViT was trained on ImageNet alone (1.3 million images), it performed worse than comparable CNNs like ResNet. The Transformer's lack of inductive bias is both its strength and its weakness.

CNNs "know" to look locally and share weights spatially. ViT knows nothing — it must learn everything from data, including the fact that nearby pixels are related and that patterns can appear anywhere in the image. Learning these priors from scratch requires enormous amounts of data.

The original ViT paper showed that the picture flips dramatically with scale: when pre-trained on JFT-300M (Google's internal dataset of 300 million images), ViT-Huge matched or outperformed the best CNNs of the time on ImageNet, CIFAR-100, and other benchmarks. The takeaway was clear: Transformers for vision work, but they are data-hungry.

This raised a practical question: most researchers and companies do not have 300 million labeled images. Can Vision Transformers work without massive private datasets?

The Ecosystem: Vision Transformer Variants

The original ViT paper sparked an explosion of follow-up work addressing its limitations. Here are the most significant models and what they contribute.

DeiT — Data-Efficient Image Transformers (Facebook, 2021)

DeiT proved that ViT can be trained effectively on ImageNet alone (1.3M images) — no JFT needed. The key innovations:

Strong data augmentation (RandAugment, Mixup, CutMix, random erasing) to compensate for the lack of inductive bias
Regularization techniques (stochastic depth, repeated augmentation)
Knowledge distillation from a CNN teacher (RegNet): a special distillation token learns to mimic the CNN's predictions, effectively transferring the CNN's inductive bias to the Transformer
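
The distillation idea can be sketched as a loss function. The version below follows our reading of the hard-label variant described in the DeiT paper: the class-token head is trained on the true label and the distillation-token head on the teacher's top prediction, weighted equally. The function names and the toy batch are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Mean cross-entropy for integer class targets."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # The teacher's argmax acts as a second set of "labels".
    teacher_labels = teacher_logits.argmax(axis=-1)
    loss_cls = cross_entropy(cls_logits, labels)             # class token vs. ground truth
    loss_dist = cross_entropy(dist_logits, teacher_labels)   # distillation token vs. teacher
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy batch: 8 images, 10 classes, random logits as stand-ins.
rng = np.random.default_rng(0)
B, K = 8, 10
labels = rng.integers(0, K, size=B)
cls_logits = rng.normal(size=(B, K))
dist_logits = rng.normal(size=(B, K))
teacher_logits = rng.normal(size=(B, K))   # would come from the frozen CNN teacher

loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```

Because the teacher sees the same augmented image as the student, its prediction can differ from the ground-truth label, which is part of what the distillation token learns from.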

DeiT-Base reached accuracy competitive with ViT-Base while training only on ImageNet, on a single 8-GPU machine in under three days, a massive reduction in compute requirements.

Swin Transformer — Shifted Windows (Microsoft, 2021)

Swin Transformer addressed ViT's two biggest architectural issues: quadratic attention cost and single-scale representation.

Hierarchical feature maps: Like a CNN, Swin produces feature maps at multiple resolutions (1/4, 1/8, 1/16, 1/32 of input size) by merging patches between stages. This makes it a drop-in backbone replacement for CNNs in detection and segmentation frameworks like FPN and UPerNet.
Window-based attention: Instead of global self-attention over all patches (O(N^2) cost), Swin computes attention within local windows of fixed size (e.g., 7x7 patches). Because the window size stays constant, the cost grows linearly with the number of patches, O(N), rather than quadratically.
Shifted windows: In alternating layers, the window partition is shifted by half the window size, allowing cross-window information flow without the cost of global attention.
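
A short NumPy sketch can illustrate the window partition, the shift, and the cost difference. It assumes a 56x56 patch grid (1/4 resolution of a 224x224 input) and 7x7 windows, and it implements the shift as a cyclic roll. The attention masking that Swin applies to shifted windows is left out.

```python
import numpy as np

def window_partition(grid, window):
    """Split an (H, W) grid of patch indices into (num_windows, window * window)."""
    h, w = grid.shape
    x = grid.reshape(h // window, window, w // window, window)
    return x.transpose(0, 2, 1, 3).reshape(-1, window * window)

H = W = 56          # patch grid at 1/4 resolution of a 224x224 input
M = 7               # window size in patches
grid = np.arange(H * W).reshape(H, W)   # each patch labeled by its index

# Regular partition: 8 x 8 = 64 windows of 49 patches each.
windows = window_partition(grid, M)

# Shifted partition: roll the grid by half a window, then partition again.
shift = M // 2
shifted = window_partition(np.roll(grid, (-shift, -shift), axis=(0, 1)), M)

N = H * W
global_pairs = N * N                              # every patch attends to every patch
window_pairs = windows.shape[0] * (M * M) ** 2    # equals N * M^2
print(global_pairs, window_pairs, global_pairs // window_pairs)
```

On this grid, windowed attention computes 64 times fewer query-key pairs than global attention, and the first shifted window contains patches that belonged to different windows in the regular partition.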

Swin became the de facto backbone for dense prediction tasks (object detection, semantic segmentation, instance segmentation) and won the best paper award (the Marr Prize) at ICCV 2021.

BEiT — BERT Pre-Training for Images (Microsoft, 2021)

BEiT brought BERT-style masked pre-training to vision. During pre-training, random image patches are masked, and the model must predict the visual tokens of the masked patches (using a discrete visual codebook from a tokenizer called dVAE). This self-supervised objective lets ViT learn powerful representations without any labels.
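
The masked-prediction objective can be sketched as follows. Everything here is a simplification under stated assumptions: masking is uniformly random at a 40% ratio, the visual token ids are random integers standing in for tokenizer output, the vocabulary size of 8192 is assumed, and a single random linear map replaces the Transformer encoder and prediction head.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, V = 196, 64, 8192       # patches, hidden size, visual vocabulary size
mask_ratio = 0.4

patch_embeddings = rng.normal(size=(N, D))
visual_token_ids = rng.integers(0, V, size=N)     # stand-in for tokenizer output
mask_embedding = rng.normal(0.0, 0.02, size=D)    # learned [MASK] vector in a real model

# Choose which patches to hide.
num_masked = int(mask_ratio * N)
masked_idx = rng.choice(N, size=num_masked, replace=False)
is_masked = np.zeros(N, dtype=bool)
is_masked[masked_idx] = True

# Replace masked patch embeddings with the shared mask embedding.
corrupted = np.where(is_masked[:, None], mask_embedding, patch_embeddings)

# Stand-in for "encoder + prediction head": one random linear map to vocabulary logits.
W_out = rng.normal(0.0, 0.02, size=(D, V))
logits = corrupted @ W_out                        # (N, V)

# Cross-entropy on the masked positions only.
logits = logits - logits.max(axis=-1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[masked_idx, visual_token_ids[masked_idx]].mean()
```

With untrained weights the loss sits close to log(V), the value for uniform guessing; training lowers it by forcing the encoder to infer hidden content from visible context.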

BEiT showed that self-supervised pre-training dramatically improves ViT's performance when fine-tuned on smaller labeled datasets, partially solving the data hunger problem.

CvT and CoAtNet — Hybrid CNN + Transformer

These models combine CNN and Transformer strengths:

CvT (Convolutional Vision Transformer): Replaces the linear patch embedding with convolutional token embeddings and uses depthwise convolutions inside the attention projection. This injects locality bias into the Transformer while keeping the global attention mechanism.
CoAtNet (Google, 2021): Stacks depthwise convolution layers (for local patterns in early stages) with Transformer layers (for global attention in later stages). By systematically combining convolutions and attention, CoAtNet reached what was then state-of-the-art ImageNet accuracy (90.88% top-1 for its largest variant, pre-trained on JFT-3B) with strong efficiency.

The hybrid approach is pragmatic: use convolutions where locality matters most (early layers processing raw pixels) and attention where global reasoning matters (later layers composing high-level features).

DINO and DINOv2 — Self-Supervised Learning (Meta)

DINO (Self-DIstillation with NO Labels, 2021) showed that ViT trained with self-supervised distillation learns remarkably structured features. The attention maps of self-supervised ViTs spontaneously learn to segment objects — without ever seeing a segmentation label. The model learns by having a student network match the output of a momentum-updated teacher network on different augmented views of the same image.
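
The student-teacher loop can be sketched in two small functions. The temperatures, the momentum value, and the centering of teacher outputs reflect our understanding of the DINO paper and should be treated as illustrative assumptions; the function names and toy data are hypothetical.

```python
import numpy as np

def log_softmax(z, temperature):
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy from the centered, sharpened teacher distribution to the student."""
    # The teacher distribution is a fixed target; no gradient would flow through it.
    p_teacher = np.exp(log_softmax(teacher_out - center, t_teacher))
    log_p_student = log_softmax(student_out, t_student)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student's value."""
    return {k: momentum * teacher_params[k] + (1.0 - momentum) * student_params[k]
            for k in teacher_params}

rng = np.random.default_rng(0)
B, K = 4, 16
student_out = rng.normal(size=(B, K))    # student output on one augmented view
teacher_out = rng.normal(size=(B, K))    # teacher output on another view
center = teacher_out.mean(axis=0)        # a running mean in practice; one batch here
loss = dino_loss(student_out, teacher_out, center)

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student)   # moves 0.4% of the way toward the student
```

Only the student is trained by gradient descent; the teacher changes solely through the moving average, which keeps its targets stable from step to step.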

DINOv2 (2023) scaled this approach with curated data, larger models, and improved training recipes. DINOv2 features are so general that they work as frozen feature extractors for depth estimation, segmentation, classification, and retrieval — often matching or beating supervised models without any fine-tuning.

EVA and InternImage — Pushing Scale

EVA (Exploring the Limits of Masked Visual Representation Learning at Scale, 2022): Combined masked image modeling with CLIP-style vision-language alignment at billion-parameter scale. Its successor, EVA-02, showed that scaling ViT with the right pre-training recipe achieves state-of-the-art results across many benchmarks.
InternImage (2023): Took a different path. It is a large-scale model based on deformable convolutions (not Transformers) that matched or exceeded ViT performance, showing that the CNN vs. Transformer debate is not settled. Its deformable convolutions sample from input-dependent locations, which gives some of the adaptive, long-range aggregation of attention while remaining a convolutional network.

SAM — Segment Anything (Meta, 2023)

SAM is a foundation model for image segmentation. Its image encoder is a ViT-Huge pre-trained with MAE (Masked Autoencoder). Given an image and a prompt (points, a box, or a coarse mask; text prompts were explored only as a proof of concept), SAM produces high-quality segmentation masks for any object, including objects it has never seen before.

SAM demonstrated that ViT backbones, trained at scale with the right objectives, can power general-purpose visual understanding that transfers to virtually any segmentation task. SAM 2 extended this to video.

Comparison: CNN vs. ViT vs. Hybrid

Property	CNN (e.g., ResNet, ConvNeXt)	ViT (Pure Transformer)	Hybrid (e.g., CoAtNet, CvT)
Inductive bias	Strong (locality, translation equivariance)	Minimal (learns from data)	Moderate (conv early, attention late)
Data efficiency	Good with small datasets	Poor without large-scale pre-training	Good — best of both worlds
Scalability	Moderate — diminishing returns at scale	Excellent — performance scales with data and compute	Excellent
Global context	Limited — requires deep stacking	Full — every patch sees every patch	Progressive — local to global
Compute cost	Efficient (linear in image size)	Expensive (quadratic in patch count)	Moderate
Dense prediction	Natural multi-scale features	Single-scale (plain ViT), needs adaptation; multi-scale with Swin	Multi-scale (CvT, CoAtNet)
Transfer learning	Strong	Exceptional at scale	Strong

Comparison of Major Vision Transformer Models

Model	Year	Key Innovation	Pre-training	Best For
ViT	2020	Pure Transformer for vision	Supervised (JFT-300M)	Classification at scale
DeiT	2021	Data-efficient training + distillation	Supervised (ImageNet-1K)	Classification without massive data
Swin	2021	Hierarchical + shifted windows	Supervised (ImageNet)	Detection, segmentation
BEiT	2021	Masked image modeling	Self-supervised	Low-data fine-tuning
CoAtNet	2021	Conv + Attention hybrid staging	Supervised (JFT)	Top accuracy on ImageNet
DINO/DINOv2	2021/2023	Self-supervised distillation	Self-supervised	General-purpose features
EVA	2022	Scaled masked modeling + CLIP alignment	Self-supervised + CLIP	Vision-language tasks
InternImage	2023	Large-scale deformable convolutions	Supervised	Dense prediction at scale
SAM	2023	Promptable segmentation foundation model	MAE + SA-1B dataset	Zero-shot segmentation

When to Use CNN vs. ViT in Practice

Choosing between a CNN and a Vision Transformer is not about which is "better" — it depends on your constraints.

Use a CNN (ResNet, EfficientNet, ConvNeXt) when:

You have a small to medium dataset (under 100K images) and no pre-trained ViT is available for your domain
You need real-time inference on edge devices or mobile — CNNs are still more efficient at small model sizes
Your task is straightforward classification or detection with well-established CNN pipelines
You want a battle-tested, well-understood architecture with extensive tooling

Use a ViT (or Swin/DeiT) when:

You can leverage a strong pre-trained checkpoint (ImageNet-21K, CLIP, DINOv2, etc.)
Your task benefits from global context (scene understanding, medical imaging with long-range dependencies, satellite imagery)
You are working at scale — more data and compute reliably improve ViT performance
You need a backbone that connects to modern multimodal systems (CLIP, LLaVA, GPT-4V)

Use a hybrid (CoAtNet, CvT, or ConvNeXt + attention) when:

You want the best accuracy-efficiency tradeoff
You need multi-scale features for dense prediction (detection, segmentation) without Swin's complexity
You are building a production system where you need both the efficiency of convolutions and the expressiveness of attention

A practical note: in 2025-2026, the default starting point for most vision tasks is a pre-trained ViT or Swin backbone, fine-tuned on your data. The pre-training handles the data hunger problem. If you are training from scratch on a small dataset with no relevant pre-trained model, CNNs remain the safer choice.

The Bigger Picture

Vision Transformers did more than improve accuracy numbers. They unified the architecture across modalities. The same Transformer that processes text can now process images — and critically, the same architecture can process both at the same time.

This unification is what enables the next wave: models that see and read simultaneously. CLIP learns to align images and text in a shared embedding space. Flamingo, LLaVA, and GPT-4V connect an image encoder (a CLIP ViT in LLaVA's case) to a language model to answer questions about images, describe scenes, and reason visually.

What is Next: Part 3 — Vision Language Models

In Part 3 of this series, we will explore Vision Language Models (VLMs) — the architectures that combine Vision Transformers with Large Language Models. We will cover how models like CLIP, LLaVA, and GPT-4V bridge the gap between seeing and understanding, enabling AI systems that can describe images, answer visual questions, and reason about the visual world in natural language.

The Transformer learned to read in 2017. It learned to see in 2020. Now it is learning to do both at once.

Stay tuned.