LLM Model Types — Reasoning, Thinking, and Beyond
A practical guide to understanding the different types of language models: base models, instruction-tuned, reasoning models, thinking models, and MoE architectures — what they are, how they differ, and when to use each.
Not all language models are built the same. In 2024-2025, the landscape split into distinct categories — each designed to solve different problems in different ways. If you've been confused by terms like "reasoning model," "thinking tokens," or "mixture of experts," this post is for you.
We'll walk through every major model type, what makes it different under the hood, and when you should actually use it.
The Evolution: From Base Models to Reasoning
The journey of LLMs follows a clear progression in how they think:
- Base models — raw next-token predictors
- Instruction-tuned models — aligned to follow instructions
- Reasoning models — trained to think step-by-step before answering
- Specialized & capability models — multimodal, small/distilled, and coding-focused
Each stage builds on the previous one. Alongside this evolution, there are also architectural choices (like Mixture of Experts) that can apply to any of the above categories. Let's break them all down.
Base Models
A base model (sometimes called a "foundation model") is trained purely on next-token prediction. You feed it internet-scale text — books, code, Wikipedia, forums — and it learns to predict what comes next.
Examples: GPT-4 base, Llama 3.1 base, Qwen 2.5 base
Characteristics:
- Incredible knowledge breadth, but unpredictable behavior
- May continue your prompt as if it's a document, not answer your question
- No concept of "helpfulness" — it just completes text
- Used as the foundation for everything else
You rarely interact with base models directly. They're the raw material that gets refined through fine-tuning.
Instruction-Tuned Models (Chat Models)
Take a base model, fine-tune it on thousands of (instruction, response) pairs, then apply RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization) to align it with human preferences.
Examples: GPT-4o, Claude Sonnet/Opus, Gemini 2.5 Flash, Llama 3.1 Instruct, Qwen 2.5 Instruct
What changes:
- The model learns to follow instructions rather than just complete text
- It becomes "helpful, harmless, and honest" (the alignment goal)
- It understands the chat format: system prompts, user messages, assistant responses
- It can refuse harmful requests
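The preference-alignment step mentioned above can be made concrete. Below is a minimal numpy sketch of the DPO objective for a single preference pair; the function name, argument names, and the beta value are illustrative assumptions, not any library's API:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under a frozen reference model. Lower
    loss = the policy prefers the chosen response more strongly
    than the reference model does.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# A zero margin gives exactly log(2); a positive margin (policy already
# prefers the chosen answer) gives a smaller loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Training on thousands of such pairs pushes the margin positive, which is what "aligning to human preferences" means in loss terms.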
When to use: This is your default choice for 90% of tasks — writing, summarizing, coding, Q&A, analysis. When people say "ChatGPT" or "Claude," they mean instruction-tuned models.
Reasoning Models — The Big Shift
In September 2024, OpenAI released o1, and the game changed. Instead of answering immediately, reasoning models think before they respond. They generate an internal chain of thought — sometimes thousands of tokens of deliberation — before producing the final answer.
How Reasoning Models Work
A standard chat model:
Input --> Generate answer immediately
A reasoning model:
Input --> Think (hidden reasoning tokens) --> Generate answer
The "thinking" step is where the model:
- Breaks the problem into sub-problems
- Considers multiple approaches
- Checks its own work
- Backtracks when it hits dead ends
- Verifies the final answer
This is trained through reinforcement learning on reasoning tasks. The model learns that spending more compute on thinking leads to better answers — a concept called test-time compute scaling.
The Key Players
| Model | Provider | Parameters | AIME 2024 | GPQA Diamond | Cost (per 1M tokens) |
|---|---|---|---|---|---|
| o3 | OpenAI | Proprietary | 96.7% | 87.7% | $10-40 |
| o1 | OpenAI | Proprietary | 92.3% | 85.2% | $15-60 |
| DeepSeek R1 | DeepSeek | 671B (MoE) | 87.5% | High | $0.55-2.19 |
| QwQ-32B | Alibaba | 32B | 79.5% | Good | Free (local) |
| Gemini 2.5 Pro | Google | Proprietary | High | High | $1.25-10 |
The cost story is dramatic: going by the price columns above, DeepSeek R1 delivers competitive reasoning at roughly 25-30x lower cost than OpenAI o1. QwQ-32B approaches o1-class performance on some benchmarks with a model you can run locally.
Thinking Tokens — What Are They?
When a reasoning model "thinks," it generates special tokens that represent its internal deliberation. Different providers handle these differently:
- OpenAI (o1/o3): Thinking tokens are hidden. You pay for them but never see them. A 200-token response might have 3,000 thinking tokens underneath.
- DeepSeek R1: Thinking is visible inside special think tags. You can see exactly how the model reasons.
- Claude: Shows a summary of its thinking process.
- Gemini 2.5 Pro: Thinking mode can be toggled on/off.
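For the visible-thinking case, separating deliberation from the final answer is a one-liner. Here is a minimal sketch that splits a DeepSeek-R1-style completion on its think tags (the tag name matches R1's published output format; the function name is an assumption):

```python
import re

def split_think(completion: str):
    """Split an R1-style completion into (thinking, answer).

    R1 wraps its internal deliberation in <think>...</think>;
    everything after the closing tag is the user-facing answer.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No thinking block found: treat the whole text as the answer.
        return "", completion.strip()
    thinking = match.group(1).strip()
    answer = completion[match.end():].strip()
    return thinking, answer

raw = "<think>2+2: add the units digits. Check: yes.</think>The answer is 4."
thinking, answer = split_think(raw)
print(answer)  # the visible answer, with deliberation stripped
```

Comparing `len(thinking)` to `len(answer)` on real completions is an easy way to see the thinking-to-output ratio for yourself.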
Cost implication: Thinking tokens often outnumber output tokens by 5-20x on complex problems. A math problem that produces a 100-word answer might require 2,000 words of internal reasoning.
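That multiplier translates directly into dollars. A back-of-envelope helper, using an assumed 15x thinking multiplier and an assumed $60/1M output-token rate for illustration:

```python
def billed_output_tokens(answer_tokens: int, thinking_multiplier: float) -> int:
    """Total billed completion tokens when hidden thinking tokens are
    charged alongside the visible answer (as with OpenAI's o-series)."""
    return int(answer_tokens * (1 + thinking_multiplier))

def cost_usd(tokens: int, price_per_million: float) -> float:
    """Dollar cost for a token count at a given per-million price."""
    return tokens / 1_000_000 * price_per_million

# A 200-token answer with a 15x thinking multiplier bills 3,200 tokens.
tokens = billed_output_tokens(200, 15)
print(tokens, cost_usd(tokens, 60.0))  # at a hypothetical $60/1M rate
```

The visible 200 tokens are a small fraction of what you pay for, which is why reasoning models belong on hard problems rather than routine ones.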
When to Use Reasoning Models
Reasoning models shine on tasks that require multi-step logic:
- Mathematical proofs and competition problems
- Complex coding challenges (algorithms, system design)
- Scientific reasoning and analysis
- Legal or compliance analysis with many interacting rules
They're overkill for:
- Simple Q&A, creative writing, summarization, translation
- Quick code generation or editing
Rule of thumb: If a smart human would need to sit down and think carefully, use a reasoning model. If they'd answer instantly, use a standard chat model.
Chain of Thought vs. Thinking Tokens
These terms get confused constantly. Here's the distinction:
Chain of Thought (CoT)
A prompting technique where you ask any model to "think step by step." The reasoning is visible in the output. You can use CoT with any instruction-tuned model — GPT-4o, Claude, Gemini Flash. It works because the model generates intermediate reasoning steps that help it arrive at the correct answer.
CoT is free (no extra cost), works with any model, and the reasoning is fully visible. But the quality depends on the model — a small model doing CoT won't match a large reasoning model.
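Because CoT is just prompting, it fits in a few lines. This sketch builds a chat-completion message list that elicits visible step-by-step reasoning from any instruction-tuned model; the system-prompt wording and function name are illustrative choices, not a standard:

```python
def cot_messages(question: str) -> list[dict]:
    """Build a chat message list that elicits visible chain-of-thought
    from an ordinary instruction-tuned model."""
    return [
        {"role": "system",
         "content": ("Think step by step. Show your reasoning, then give "
                     "the final answer on a line starting with 'Answer:'.")},
        {"role": "user", "content": question},
    ]

msgs = cot_messages("A train travels 120 km in 1.5 hours. Average speed?")
# These messages can be passed to any OpenAI-compatible chat client, e.g.:
#   client.chat.completions.create(model="gpt-4o", messages=msgs)
print(msgs[0]["role"], "->", msgs[1]["content"])
```

The reasoning arrives in the normal output stream, visible and billed at the normal rate.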
Thinking Tokens (Internal Reasoning)
A training technique where the model is specifically trained to reason internally before responding. The reasoning happens in a separate "thinking" phase, often using specialized tokens.
Key differences from CoT:
- The model was trained to reason this way (not just prompted)
- Reasoning quality is much higher — the model learned what good reasoning looks like
- Costs more because you're paying for thinking tokens
- Can be hidden from the user (OpenAI) or visible (DeepSeek)
The bottom line: CoT is a prompting technique that helps any model. Thinking tokens are a trained-in capability that produces dramatically better reasoning on hard problems.
Mixture of Experts (MoE) — Scaling Without the Cost
MoE is not a model type per se — it's an architecture pattern that can be applied to any of the above categories. But it's so important to the current landscape that it deserves its own section.
The Problem MoE Solves
A traditional "dense" model activates every parameter for every token. GPT-4 is rumored to have ~1.8T parameters; run dense, that would mean all 1.8T parameters firing for every single token. That's incredibly expensive.
How MoE Works
Instead of one massive network, MoE uses many smaller expert networks plus a router that decides which experts handle each token:
Token --> Router --> Expert 3 + Expert 7 (out of 64 experts) --> Output
Only 2-8 experts activate per token (out of 64+), so you get the knowledge capacity of a huge model with the compute cost of a small one.
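The routing step above can be sketched in a few lines of numpy. This is a toy top-k router in the Mixtral/DeepSeek style (softmax renormalized over only the selected experts); all shapes and names are illustrative, and a real model would evaluate only the chosen experts rather than precomputing all of them:

```python
import numpy as np

def moe_route(token_hidden, router_weights, expert_outputs, k=2):
    """Top-k mixture-of-experts routing for a single token.

    token_hidden:   (d,) hidden state for the token
    router_weights: (d, n_experts) learned router projection
    expert_outputs: (n_experts, d) each expert's output for this token
                    (precomputed here only for simplicity)
    """
    logits = token_hidden @ router_weights   # one score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best experts
    # Softmax over the selected experts only.
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()
    # Weighted combination of the chosen experts' outputs.
    return weights @ expert_outputs[top_k], sorted(top_k.tolist())

rng = np.random.default_rng(0)
d, n_experts = 8, 64
out, chosen = moe_route(rng.normal(size=d),
                        rng.normal(size=(d, n_experts)),
                        rng.normal(size=(n_experts, d)))
print(chosen)  # only 2 of the 64 experts contribute to this token
```

All 64 experts' parameters exist in memory, but per token only 2 do any work: that asymmetry is the whole trick.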
MoE in Practice
| Model | Total Params | Active Params | Ratio |
|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 28% |
| DeepSeek R1 | 671B | 37B | 5.5% |
| Qwen 3 30B (MoE) | 30B | 3B | 10% |
| GPT-4 (rumored) | ~1.8T | ~280B | ~15% |
DeepSeek R1 is a perfect example: it has 671B total parameters (massive knowledge base) but only activates 37B per token (manageable compute). This is how it achieves reasoning performance close to o1 at a fraction of the cost.
Why MoE Matters for You
If you're using models via API (like Gemini 2.5 Flash Lite via OpenRouter), MoE is why some models are both cheap and good. The provider runs a huge MoE model, but each request only uses a fraction of the compute. You get big-model quality at small-model prices.
Hybrid and Emerging Categories
Multimodal Models
Models that handle text, images, audio, and video. Not a separate "type" — more of a capability layer added on top of any model category.
Examples: GPT-4o (text + image + audio), Gemini 2.5 (text + image + video + audio), Claude (text + image)
Distilled Models
Take a large powerful model and train a smaller model to mimic its behavior. The small model learns the "knowledge" of the large model without needing all its parameters.
Examples: DeepSeek R1 distilled into 7B, 14B, and 32B variants. Phi-4-mini (3.8B) distilled from larger Microsoft models.
This is particularly relevant for local deployment — a distilled 7B model can capture much of what a 70B model knows.
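The "mimic its behavior" part has a standard loss behind it: KL divergence between temperature-softened teacher and student distributions. A minimal numpy sketch, with the temperature value and function names chosen for illustration:

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Softmax with temperature; higher T produces softer targets."""
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions
    over the vocabulary. The student minimizes this, usually mixed with
    the ordinary cross-entropy on the ground-truth next token."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student matching the teacher exactly has zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
```

The softened targets carry more signal than hard labels (they say which wrong answers are *almost* right), which is why a 7B student can absorb so much from a 70B teacher.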
Coding-Specialized Models
Models fine-tuned specifically for code generation, debugging, and software engineering tasks.
Examples: Codestral, DeepSeek Coder, Qwen Coder, StarCoder
These outperform general-purpose models of the same size on coding tasks because their training data and fine-tuning are code-focused.
The Decision Matrix
Here's how to pick the right model type for your task:
| Task | Best Model Type | Example |
|---|---|---|
| General chat, Q&A | Instruction-tuned | GPT-4o, Claude Sonnet, Gemini Flash |
| Hard math/logic | Reasoning | o3, DeepSeek R1, QwQ-32B |
| Code generation | Coding-specialized | DeepSeek Coder, Qwen Coder |
| High volume, low cost | MoE via API | Gemini 2.5 Flash Lite, DeepSeek V3 |
| Offline/private | Small + quantized | Phi-4-mini, Llama 3.2 3B |
| Image understanding | Multimodal | GPT-4o, Gemini 2.5, Claude |
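In code, the matrix above is just a lookup with a sensible fallback. The task categories and model identifiers here are hypothetical stand-ins mirroring the table, not real API model IDs:

```python
# Hypothetical task categories and model names, mirroring the matrix above.
MODEL_BY_TASK = {
    "chat":    "gpt-4o",                 # instruction-tuned default
    "math":    "deepseek-r1",            # reasoning model
    "code":    "qwen-coder",             # coding-specialized
    "bulk":    "gemini-2.5-flash-lite",  # cheap MoE via API
    "private": "phi-4-mini",             # small local model
    "vision":  "gemini-2.5",             # multimodal
}

def pick_model(task: str) -> str:
    """Route a task category to a model, falling back to the
    instruction-tuned default for anything unrecognized."""
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK["chat"])

print(pick_model("math"))     # deepseek-r1
print(pick_model("unknown"))  # gpt-4o
```

A routing layer this simple already captures most of the cost savings: the default handles the easy 90%, and the expensive models only see the tasks that need them.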
The Cost Landscape
One of the most important practical considerations is cost. Here's the current landscape for API models:
| Tier | Cost (per 1M output tokens) | Examples |
|---|---|---|
| Budget | $0.30-1 | Gemini 2.5 Flash Lite, DeepSeek V3 |
| Mid-range | $1-10 | Gemini 2.5 Flash, Claude Sonnet, GPT-4o |
| Premium | $10-30 | Claude Opus, Gemini 2.5 Pro |
| Reasoning | $15-60 | o1, o3 |
| Free (local) | $0 + electricity | Qwen, Llama, Phi, Gemma |
The gap between budget API models and premium ones has narrowed dramatically. Gemini 2.5 Flash Lite at $0.30/1M tokens delivers much of the quality of models costing 50x more. For many production use cases, that's more than enough.
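Those per-million rates are easiest to feel at realistic traffic. A quick monthly-spend estimate, using assumed per-tier prices and an assumed traffic profile:

```python
def monthly_cost(requests_per_day, avg_output_tokens, price_per_million):
    """Rough monthly output-token spend for a given traffic level
    (ignores input tokens for simplicity)."""
    tokens = requests_per_day * avg_output_tokens * 30  # 30-day month
    return tokens / 1_000_000 * price_per_million

# 10,000 requests/day at 500 output tokens each:
budget  = monthly_cost(10_000, 500, 0.40)   # assumed budget-tier rate
premium = monthly_cost(10_000, 500, 15.00)  # assumed premium-tier rate
print(f"${budget:,.0f} vs ${premium:,.0f} per month")
```

At that volume the tier choice is the difference between tens and thousands of dollars a month, which is why routing easy traffic to the budget tier matters.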
Conclusion
The model landscape is no longer "bigger is better." It's a spectrum of trade-offs:
- Need raw intelligence? Reasoning models (o3, R1)
- Need speed and cost efficiency? MoE models via API (Gemini Flash Lite, DeepSeek)
- Need privacy? Small local models with quantization
- Need reliability for production? Instruction-tuned models from major providers
The most important skill isn't picking the "best" model — it's picking the right model for each task. A $0.30/1M token model handling your summarization while a reasoning model handles your complex logic is both cheaper and better than using one expensive model for everything.
In Part 2 of this series, we'll get practical: running small models locally on your own GPU, what quantization actually does, and whether a 4GB GPU can replace your API subscription.