Vision Language Models — When AI Learns to See and Talk (Part 3 of 3)

The final piece: combining vision and language into unified models. From CLIP to GPT-4V, LLaVA, and Gemini — how VLMs understand images and text together, and why this changes everything.


This is Part 3 of a 3-part series on the transformer revolution in vision and language:

In Part 1, we covered how the transformer architecture replaced RNNs and CNNs as the backbone of modern AI. In Part 2, we saw how Vision Transformers (ViTs) brought that same architecture to image understanding — splitting images into patches and treating them like tokens.

Now comes the question that drives this entire field forward: what happens when you combine both?


Why Combine Vision and Language?

Humans don't process the world in isolated channels. When you look at a photo of a dog catching a frisbee in a park, you don't separately "see" the image and then "think" in language. Your understanding is multimodal from the start — you perceive the scene, recognize objects, understand spatial relationships, and can describe it all in natural language without effort.

Traditional AI couldn't do this. Computer vision models could classify images or detect objects, but they couldn't explain what they saw. Language models could write eloquently, but they were blind. These were separate systems with separate training pipelines, separate datasets, and no shared understanding.

Vision Language Models (VLMs) change this. They bridge the gap between pixels and words, creating systems that can look at an image and answer questions about it, generate descriptions, follow visual instructions, or reason about what they see.

The applications are enormous: a doctor uploads a medical scan and asks "What do you see?"; a warehouse robot reads labels and navigates shelves; a visually impaired user points their phone at a restaurant menu and gets it read aloud. All of these require a model that sees and speaks.


The Evolution: From Captioning to True Multimodal Understanding

The journey from "AI that describes pictures" to "AI that understands images and reasons about them" happened in distinct phases.

Phase 1: Image Captioning (2015-2019)

Early systems used a CNN encoder (like ResNet) to extract image features, then fed those features into an RNN or LSTM decoder to generate a caption. The architecture was straightforward: image -> CNN -> feature vector -> RNN -> "A dog catches a frisbee". These systems worked but were brittle — they could generate grammatically correct captions, but didn't truly understand the scene. Ask a follow-up question and they'd fall apart.

Phase 2: Contrastive Pre-training (2021)

CLIP changed the game by learning to align images and text in a shared embedding space, without generating anything. This allowed zero-shot classification, image search, and open-vocabulary recognition. It was the first time a single model could handle visual concepts it had never been explicitly trained on.

Phase 3: Generative Multimodal Models (2022-2023)

Models like Flamingo, BLIP-2, and LLaVA took things further — they could not only align images and text but also generate free-form text responses about images. You could have a conversation about a photo.

Phase 4: Natively Multimodal Systems (2023-present)

GPT-4V, Gemini, and Claude represent the current frontier: models trained from the ground up to handle text, images, video, and audio as first-class inputs. These aren't vision modules bolted onto a language model — they are unified systems.


Four Architectural Approaches to VLMs

Not all VLMs are built the same way. There are four fundamental design patterns, each with distinct trade-offs.

1. Contrastive Learning (The CLIP Approach)

CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) uses a dual-encoder architecture. One encoder processes images (a ViT or ResNet), and a separate encoder processes text (a transformer). Both encoders map their inputs into the same embedding space.

During training, CLIP sees 400 million image-text pairs scraped from the internet. For each batch, it computes the cosine similarity between every image embedding and every text embedding. The training objective is simple: maximize the similarity between matching image-text pairs and minimize it for non-matching ones. This is contrastive learning — the model learns by contrasting positives against negatives.

Image Encoder (ViT)  ──→  Image Embedding  ──┐
                                               ├──→  Cosine Similarity
Text Encoder (Transformer) → Text Embedding ──┘
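The contrastive objective described above fits in a few lines. Below is an illustrative NumPy sketch with random vectors standing in for real encoder outputs; the actual CLIP implementation additionally uses a learned temperature and gathers negatives across devices.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    img_emb, txt_emb: (batch, dim) arrays; row i of each forms a matching pair.
    """
    # L2-normalize so dot products become cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (batch, batch) matrix: logits[i, j] = similarity(image i, text j)
    logits = img_emb @ txt_emb.T / temperature

    # Cross-entropy in both directions; the "correct" class for row i is column i
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))
txt = rng.normal(size=(8, 512))
print(clip_contrastive_loss(img, txt))
```

Minimizing this loss pulls matching pairs together along the diagonal of the similarity matrix while pushing every other (image, text) combination in the batch apart.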

Why was CLIP revolutionary? Three reasons:

  • Zero-shot transfer. To classify an image, you don't fine-tune. You just compute similarity between the image embedding and text embeddings like "a photo of a dog" or "a photo of a cat". The highest similarity wins. This means CLIP can classify images into any categories you define at inference time — no retraining needed.
  • Open vocabulary. Traditional classifiers are limited to their fixed label set. CLIP understands free-form language, so you can classify images using any text description you can think of.
  • Web-scale training. By using image-text pairs from the internet instead of hand-labeled datasets, CLIP trained on far more diverse data than any supervised model.
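The zero-shot transfer from the first bullet reduces to a similarity lookup. A minimal sketch, with random vectors standing in for real CLIP embeddings (in practice a library such as open_clip or Hugging Face transformers would produce them):

```python
import numpy as np

def zero_shot_classify(image_emb, class_prompts, text_embs):
    """Pick the class whose text embedding is most similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb            # cosine similarity per class prompt
    return class_prompts[int(np.argmax(sims))]

prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
rng = np.random.default_rng(1)
text_embs = rng.normal(size=(3, 512))
# Fake an image embedding that lies close to the "dog" prompt embedding
image_emb = text_embs[0] + 0.1 * rng.normal(size=512)
print(zero_shot_classify(image_emb, prompts, text_embs))  # prints "a photo of a dog"
```

Changing the classifier is just a matter of changing the prompt list — no retraining, exactly as described above.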

The limitation: CLIP is an alignment model, not a generative model. It can tell you which text best matches an image, but it can't generate a detailed description or answer complex questions.

2. Cross-Attention Fusion (The Flamingo Approach)

Flamingo (DeepMind, 2022) takes a different strategy. Instead of aligning two separate encoders, it injects visual information directly into a frozen large language model using cross-attention layers.

The architecture works like this: a vision encoder (a frozen NFNet or ViT) extracts visual features from the input image. These features are then compressed by a Perceiver Resampler — a small transformer module that reduces the variable number of visual tokens into a fixed set (typically 64). These compressed visual tokens are fed into newly added cross-attention layers that are interleaved between the existing self-attention layers of the frozen LLM.

Image ──→ Vision Encoder ──→ Perceiver Resampler ──→ Visual Tokens
                                                         │
Text ──→ [Frozen LLM with interleaved cross-attention] ←─┘
                         │
                    Generated Text

The key insight: the LLM itself stays frozen. Only the Perceiver Resampler and the cross-attention layers are trained. This preserves the language model's capabilities while teaching it to attend to visual information.

Flamingo excelled at few-shot learning. You could show it a few image-text examples as context, and it would generalize to new tasks — much like how GPT-3 demonstrated few-shot language capabilities.
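The Perceiver Resampler's core operation — a small set of learned queries cross-attending over a variable number of visual features — can be sketched as follows. This is a single-head, untrained toy in NumPy; the real module stacks several such layers with feed-forward blocks and trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(visual_feats, queries, Wq, Wk, Wv):
    """Compress a variable number of visual features into len(queries) tokens.

    visual_feats: (n_patches, dim) — varies with image size
    queries:      (n_latents, dim) — learned and fixed (e.g. 64 in Flamingo)
    """
    q = queries @ Wq                       # (n_latents, dim)
    k = visual_feats @ Wk                  # (n_patches, dim)
    v = visual_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_latents, n_patches)
    return attn @ v                        # (n_latents, dim) — fixed size

rng = np.random.default_rng(0)
dim, n_latents = 64, 8
queries = rng.normal(size=(n_latents, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) * dim**-0.5 for _ in range(3))
for n_patches in (49, 256, 1024):          # different input image sizes...
    out = perceiver_resample(rng.normal(size=(n_patches, dim)), queries, Wq, Wk, Wv)
    print(out.shape)                       # ...always yield (8, 64)
```

However many patches come out of the vision encoder, the LLM's cross-attention layers always see the same small, fixed number of visual tokens.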

3. Visual Tokens into LLM (The LLaVA Approach)

LLaVA (Large Language and Vision Assistant, 2023) takes the simplest possible approach: use a linear projection (or MLP) to map visual features into the token embedding space of an LLM, then just prepend them to the text tokens.

The architecture is refreshingly minimal:

  1. Pass the image through a pre-trained CLIP ViT to get visual features.
  2. Project those features through a trained MLP to match the LLM's embedding dimension.
  3. Concatenate the projected visual tokens with the text tokens.
  4. Feed everything into the LLM as a single sequence.

Image ──→ CLIP ViT ──→ MLP Projection ──→ Visual Tokens
                                              │
                              [v1, v2, ..., vN, text tokens] ──→ LLM ──→ Response

From the LLM's perspective, visual tokens look just like any other tokens. The model processes them using its standard self-attention mechanism. No architectural changes to the LLM are needed.
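Steps 1-4 amount to very little code. A toy NumPy sketch, with random features in place of real CLIP ViT outputs and a random, untrained projection matrix (LLaVA-1.5 actually trains a 2-layer MLP; a single linear layer is shown for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
vit_dim, llm_dim = 1024, 4096              # e.g. CLIP ViT-L features → a 7B LLM's embedding size

# 1. Visual features from the frozen CLIP ViT (random stand-ins here);
#    a 336px image with 14px patches yields 24 x 24 = 576 patch features
visual_feats = rng.normal(size=(576, vit_dim))

# 2. Trained projection into the LLM's token embedding space
W_proj = rng.normal(size=(vit_dim, llm_dim)) * vit_dim**-0.5
visual_tokens = visual_feats @ W_proj      # (576, llm_dim)

# 3-4. Concatenate with text token embeddings and feed the LLM one sequence
text_tokens = rng.normal(size=(12, llm_dim))   # e.g. "What is in this image?"
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)                      # (588, 4096): one ordinary token sequence
```

Everything downstream of the concatenation is the unmodified LLM; the only new trainable piece in stage one is `W_proj`.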

LLaVA's training happens in two stages:

  1. Pre-training the projection. The vision encoder and LLM are frozen; only the MLP is trained on image-caption pairs to align the visual feature space with the language embedding space.
  2. Visual instruction tuning. The MLP and LLM are fine-tuned together on instruction-following data — conversations about images, visual question answering, and complex reasoning tasks.

LLaVA proved that you don't need exotic architectural innovations. A well-trained projection layer and good instruction-tuning data can produce remarkably capable multimodal models. Its open-source nature made it enormously influential — LLaVA-NeXT, LLaVA-OneVision, and many derivative models followed.

4. Natively Multimodal (The Gemini Approach)

Gemini (Google, 2023) takes the most ambitious approach: train a single transformer from scratch on interleaved text, images, audio, and video. There is no separate vision encoder bolted on — the model natively processes all modalities through a unified architecture.

Images are converted into visual tokens via learned patch embeddings (the published architectural details are sparse), and these visual tokens are interleaved with text tokens during both training and inference. The model processes everything through the same transformer layers.

[text tokens, image tokens, text tokens, audio tokens, ...] ──→ Unified Transformer ──→ Output

The advantage is deep fusion: visual and textual understanding develop together during training, rather than being stitched together after the fact. The model can reason across modalities in a way that adapter-based approaches struggle with.

The disadvantage is cost. Training a natively multimodal model from scratch requires enormous compute budgets and carefully curated multimodal training data. This is why only a handful of labs (Google, OpenAI, Anthropic) have built models in this category.


Major VLM Models: A Landscape Overview

The VLM space has exploded. Here's a tour of the most important models and what makes each one notable.

CLIP (OpenAI, 2021)

The model that started the modern VLM era. Trained on 400M image-text pairs using contrastive learning. CLIP is not generative — it aligns images and text in a shared space — but it became the backbone vision encoder for dozens of subsequent models (LLaVA, BLIP-2, and more all use CLIP's ViT).

BLIP and BLIP-2 (Salesforce, 2022-2023)

BLIP introduced "bootstrapping" — using a model to generate and then filter its own training captions, creating higher-quality data. BLIP-2 took this further with the Q-Former (Querying Transformer), a lightweight module that bridges a frozen image encoder and a frozen LLM. The Q-Former uses a set of learnable query tokens that interact with visual features through cross-attention, then pass the result to the LLM. This made it possible to combine powerful pre-trained components with minimal training.

Flamingo (DeepMind, 2022)

The few-shot champion. By interleaving cross-attention layers into a frozen Chinchilla LLM, Flamingo showed that you could give a language model vision capabilities without retraining it. Its few-shot performance on visual QA benchmarks was remarkable — you could show it 4-8 example image-text pairs and it would generalize effectively.

LLaVA / LLaVA-NeXT (University of Wisconsin, 2023-2024)

The open-source workhorse. LLaVA proved that a simple projection MLP between a CLIP ViT and a Vicuna/LLaMA LLM, combined with high-quality visual instruction-tuning data, could match or exceed far more complex architectures. LLaVA-NeXT improved resolution handling with dynamic image partitioning — splitting high-resolution images into tiles, encoding each tile, and concatenating the features.
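The dynamic partitioning idea can be illustrated with a small helper that splits an image array into fixed-size tiles. This is a simplification: LLaVA-NeXT also selects an aspect-ratio-matched grid from a set of candidates and appends a downscaled overview image alongside the tiles.

```python
import numpy as np

def tile_image(image, tile=336):
    """Split an (H, W, C) image into (tile, tile, C) crops, zero-padding the edges."""
    h, w, c = image.shape
    ph, pw = -h % tile, -w % tile          # padding needed to reach a multiple of tile
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)))
    tiles = []
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            tiles.append(padded[y:y + tile, x:x + tile])
    return tiles

# A 672x1008 image becomes a 2x3 grid of 336px tiles,
# each encoded separately before the features are concatenated.
tiles = tile_image(np.zeros((672, 1008, 3)), tile=336)
print(len(tiles))   # 6
```

Each tile is pushed through the vision encoder at its native training resolution, so fine detail survives that a single global resize would destroy.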

GPT-4V / GPT-4o (OpenAI, 2023-2024)

GPT-4V brought multimodal capabilities to the most capable commercial LLM. Architectural details are not published, but it handles complex visual reasoning, OCR, chart understanding, and multi-image comparisons. GPT-4o ("omni") extended this to audio and real-time interaction, processing text, images, and audio natively rather than through separate pipelines.

Gemini (Google, 2023-2024)

Google's natively multimodal family. Available in Ultra, Pro, and Nano sizes. Gemini processes text, images, audio, and video through a unified transformer trained from scratch on multimodal data. Gemini 1.5 Pro introduced a 1M+ token context window, enabling processing of hour-long videos or hundreds of pages of documents alongside text queries.

Claude (Anthropic, 2024-present)

Anthropic's Claude models support image understanding with strong performance on document analysis, chart reading, and visual reasoning. Like GPT-4V and Gemini, the exact architecture is proprietary, but Claude demonstrates particularly strong performance on tasks requiring careful analysis and reduced hallucination.

PaliGemma (Google, 2024)

An open-weight, lightweight VLM combining a SigLIP vision encoder with a Gemma language model. Designed for fine-tuning on specific tasks — OCR, visual QA, object detection, image segmentation. At 3B parameters, PaliGemma showed that you don't need massive models for practical VLM applications.

Qwen-VL (Alibaba, 2023-2024)

Alibaba's open-source multimodal model. Supports image, video, and text inputs. Qwen2-VL introduced Naive Dynamic Resolution — handling images at their native resolution by dynamically adjusting the number of visual tokens rather than resizing all images to a fixed size. This was a meaningful advance for tasks requiring fine-grained detail.

InternVL (Shanghai AI Lab, 2023-2024)

A strong open-source contender, combining an InternViT vision encoder with an InternLM language model. InternVL 2.0 scaled to 108B parameters and achieved competitive performance with commercial models on benchmarks. Notable for its progressive training strategy — scaling both the vision encoder and LLM together.


VLM Comparison

| Model | Architecture Type | Open/Closed | Key Strength | Parameters |
|---|---|---|---|---|
| CLIP | Dual encoder (contrastive) | Open | Zero-shot classification, backbone encoder | ~400M |
| BLIP-2 | Q-Former bridge | Open | Efficient frozen model connection | ~3-12B |
| Flamingo | Cross-attention fusion | Closed | Few-shot multimodal learning | 80B |
| LLaVA / LLaVA-NeXT | Projection into LLM | Open | Simple, effective, easy to reproduce | 7-34B |
| GPT-4V / GPT-4o | Natively multimodal | Closed | Strongest general reasoning | Undisclosed |
| Gemini | Natively multimodal | Closed (API) | Long context, video understanding | Undisclosed |
| Claude | Natively multimodal | Closed (API) | Document analysis, reduced hallucination | Undisclosed |
| PaliGemma | SigLIP + Gemma projection | Open | Lightweight, fine-tunable | 3B |
| Qwen2-VL | Dynamic resolution + LLM | Open | Native resolution, multilingual | 2-72B |
| InternVL 2.0 | Progressive scaling ViT + LLM | Open | Competitive with closed models | 1-108B |

Real-World Applications

VLMs have moved well beyond research benchmarks into production systems.

Visual Question Answering and Conversational AI

The most visible application: upload an image and ask questions about it. This powers customer support (photograph a broken product and describe the issue), education (point at a math problem and get step-by-step solutions), and accessibility (describe scenes for visually impaired users).

Document Understanding and OCR

VLMs excel at understanding the structure of documents, not just the text. They can read invoices, parse tables, understand forms, and extract information from complex layouts that traditional OCR systems struggle with. Financial services, legal, and healthcare all benefit here.

Autonomous Driving and Robotics

Self-driving systems need to understand scenes in context: "Is that person about to cross the street?" requires combining visual perception with semantic reasoning. VLMs can provide this contextual understanding as part of the driving stack. In robotics, VLMs enable robots to follow natural language instructions in the physical world — "pick up the red cup next to the keyboard."

Medical Imaging

Radiologists can use VLMs as a second-opinion tool — upload a chest X-ray and ask about potential findings. Models like Med-PaLM M (Google) were specifically fine-tuned for medical multimodal tasks. The combination of visual understanding and natural language output makes findings more accessible to non-specialists.

Creative and Design Work

VLMs can critique designs ("What's wrong with this UI layout?"), provide alt-text for images at scale, analyze competitors' visual branding, and help with content moderation by understanding both images and their context.


Challenges and Open Problems

Despite rapid progress, VLMs still have significant limitations.

Hallucination

This is the biggest problem. VLMs confidently describe objects that don't exist in the image, misread text, or invent details. A model might claim there are three people in an image that shows two, or describe a red car as blue. The language model's tendency to generate plausible-sounding text sometimes overrides what it actually "sees." Reducing multimodal hallucination is an active research area, with approaches like RLHF on visual tasks and grounding mechanisms showing promise.

Spatial Reasoning

VLMs struggle with precise spatial relationships. "Is the cup to the left or right of the book?" or "How many windows are on the second floor?" often produce wrong answers. The image-to-tokens pipeline loses fine-grained spatial information, and current training data doesn't emphasize spatial reasoning enough.

Fine-Grained Understanding

Counting objects accurately, reading small text in images, distinguishing between visually similar items (different bird species, similar product models) — these remain difficult. Higher image resolutions help, but processing 4K images means thousands of visual tokens, which strains context windows and compute budgets.
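The token-count pressure is easy to quantify. Assuming a standard ViT with 14-pixel patches (the exact patch size and any token compression vary by model):

```python
def visual_token_count(height, width, patch=14):
    """Number of ViT patch tokens for an image, ignoring any CLS or extra tokens."""
    return (height // patch) * (width // patch)

for h, w in [(336, 336), (1344, 1344), (2160, 3840)]:   # base, hi-res, 4K frame
    print(f"{h}x{w}: {visual_token_count(h, w)} tokens")
```

Under these assumptions a 336px image costs 576 tokens, but a single 4K frame costs over 40,000 — a large fraction of many models' entire context window before any text is added.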

Temporal Reasoning in Video

While models like Gemini can process video, true temporal reasoning ("What happened right before the person fell?") remains limited. Most video VLMs sample frames rather than processing continuous video, losing temporal dynamics.

Safety and Bias

VLMs inherit biases from both their visual and textual training data. They may generate stereotypical descriptions, fail to recognize people from underrepresented groups, or be manipulated through adversarial images. Multimodal safety is harder than text-only safety because the attack surface is larger — adversarial patterns can be embedded in images in ways that are invisible to humans but manipulate model outputs.


Where Multimodal AI Is Headed

The trajectory is clear: modality barriers are dissolving.

Unified models are winning. The trend is away from adapter-based approaches (bolt a vision encoder onto an LLM) and toward natively multimodal training. Future models will likely process text, images, video, audio, 3D data, and sensor readings through a single architecture, trained together from the start.

Reasoning over visual information is improving fast. Chain-of-thought prompting is being extended to visual reasoning — models that can "think step by step" about what they see, breaking complex visual scenes into sequential reasoning steps. This addresses spatial reasoning and counting weaknesses.

Smaller, specialized models are becoming practical. Not every application needs GPT-4V. Models like PaliGemma and Qwen2-VL show that focused, open-weight models in the 2-7B parameter range can handle specific visual tasks effectively. Expect more task-specific VLMs that can run on edge devices.

The agentic future is multimodal. AI agents that can browse the web, interact with GUIs, and operate in physical environments need vision-language understanding as a core capability. VLMs are the perceptual backbone of autonomous AI systems — from computer-use agents that navigate screens to robots that manipulate objects based on verbal instructions.

Real-time multimodal interaction is arriving. GPT-4o's ability to process audio, video, and text simultaneously in real-time conversation points toward a future where AI assistants see, hear, and respond as naturally as a human conversation partner.


Conclusion

The three-part journey from transformers to VLMs tells a coherent story about how a single architectural idea — self-attention over sequences — scaled from text to images to true multimodal understanding.

Transformers gave us attention. Vision Transformers showed that images are just sequences of patches. And Vision Language Models proved that you can unify perception and language in a single model.

We are still early. Current VLMs hallucinate, struggle with spatial reasoning, and require enormous compute. But the rate of improvement is remarkable — the gap between the best VLMs in early 2024 and late 2025 is larger than the gap between 2018 and 2023. The models are getting smaller, faster, more capable, and more accessible.

The endgame isn't just a model that can look at a picture and answer a question. It's an AI that perceives the world as richly as humans do — that can watch a video, read a document, listen to a conversation, and reason across all of it simultaneously. VLMs are the foundation that makes this possible.
