LLM Model Types — Reasoning, Thinking, and Beyond
A practical guide to understanding the different types of language models: base models, instruction-tuned, reasoning models, thinking models, and MoE architectures — what they are, how they differ, and when to use each.
Not all language models are built the same. In 2024-2025, the landscape split into distinct categories — each designed to solve different problems in different ways. If you've been confused by terms like "reasoning model," "thinking tokens," or "mixture of experts," this post is for you.
We'll walk through every major model type, what makes it different under the hood, and when you should actually use it.
The Evolution: From Base Models to Reasoning
The journey of LLMs follows a clear progression in how they think:
- Base models — raw next-token predictors
- Instruction-tuned models — aligned to follow instructions
- Reasoning models — trained to think step-by-step before answering
- Specialized & capability models — multimodal, small/distilled, and coding-focused
Each stage builds on the previous one. Alongside this evolution, there are also architectural choices (like Mixture of Experts) that can apply to any of the above categories. Let's break them all down.
Base Models
A base model (sometimes called a "foundation model") is trained purely on next-token prediction. You feed it internet-scale text — books, code, Wikipedia, forums — and it learns to predict what comes next.
Examples: GPT-4 base, Llama 3.1 base, Qwen 2.5 base
Characteristics:
- Incredible knowledge breadth, but unpredictable behavior
- May continue your prompt as if it's a document, not answer your question
- No concept of "helpfulness" — it just completes text
- Used as the foundation for everything else
You rarely interact with base models directly. They're the raw material that gets refined through fine-tuning.
Instruction-Tuned Models (Chat Models)
Take a base model, fine-tune it on thousands of (instruction, response) pairs, then apply RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization) to align it with human preferences.
Examples: GPT-4o, Claude Sonnet/Opus, Gemini 2.5 Flash, Llama 3.1 Instruct, Qwen 2.5 Instruct
What changes:
- The model learns to follow instructions rather than just complete text
- It becomes "helpful, harmless, and honest" (the alignment goal)
- It understands the chat format: system prompts, user messages, assistant responses
- It can refuse harmful requests
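The preference-alignment step mentioned above can be made concrete. Below is a minimal numpy sketch of the DPO objective for a single preference pair; the function name, argument names, and the beta value are illustrative assumptions, not any library's API:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under a frozen reference model. Lower
    loss = the policy prefers the chosen response more strongly
    than the reference model does.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# A zero margin gives exactly log(2); a positive margin (policy already
# prefers the chosen answer) gives a smaller loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Training on thousands of such pairs pushes the margin positive, which is what "aligning to human preferences" means in loss terms.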
When to use: This is your default choice for 90% of tasks — writing, summarizing, coding, Q&A, analysis. When people say "ChatGPT" or "Claude," they mean instruction-tuned models.
Reasoning Models — The Big Shift
In September 2024, OpenAI released o1, and the game changed. Instead of answering immediately, reasoning models think before they respond. They generate an internal chain of thought — sometimes thousands of tokens of deliberation — before producing the final answer.
How Reasoning Models Work
A standard chat model:
Input --> Generate answer immediately
A reasoning model:
Input --> Think (hidden reasoning tokens) --> Generate answer
The "thinking" step is where the model:
- Breaks the problem into sub-problems
- Considers multiple approaches
- Checks its own work
- Backtracks when it hits dead ends
- Verifies the final answer
This is trained through reinforcement learning on reasoning tasks. The model learns that spending more compute on thinking leads to better answers — a concept called test-time compute scaling.
The Key Players
| Model | Provider | Parameters | AIME 2024 | GPQA Diamond | Cost (per 1M tokens) |
|---|---|---|---|---|---|
| o3 | OpenAI | Proprietary | 96.7% | 87.7% | $10-40 |
| o1 | OpenAI | Proprietary | 92.3% | 85.2% | $15-60 |
| DeepSeek R1 | DeepSeek | 671B (MoE) | 87.5% | High | $0.55-2.19 |
| QwQ-32B | Alibaba | 32B | 79.5% | Good | Free (local) |
| Gemini 2.5 Pro | Google | Proprietary | High | High | $1.25-10 |
The cost story is dramatic: going by the price columns above, DeepSeek R1 delivers competitive reasoning at roughly 25-30x lower cost than OpenAI o1. QwQ-32B approaches o1-class performance on some benchmarks with a model you can run locally.
Thinking Tokens — What Are They?
When a reasoning model "thinks," it generates special tokens that represent its internal deliberation. Different providers handle these differently:
- OpenAI (o1/o3): Thinking tokens are hidden. You pay for them but never see them. A 200-token response might have 3,000 thinking tokens underneath.
- DeepSeek R1: Thinking is visible inside special think tags. You can see exactly how the model reasons.
- Claude: Shows a summary of its thinking process.
- Gemini 2.5 Pro: Thinking mode can be toggled on/off.
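For the visible-thinking case, separating deliberation from the final answer is a one-liner. Here is a minimal sketch that splits a DeepSeek-R1-style completion on its think tags (the tag name matches R1's published output format; the function name is an assumption):

```python
import re

def split_think(completion: str):
    """Split an R1-style completion into (thinking, answer).

    R1 wraps its internal deliberation in <think>...</think>;
    everything after the closing tag is the user-facing answer.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No thinking block found: treat the whole text as the answer.
        return "", completion.strip()
    thinking = match.group(1).strip()
    answer = completion[match.end():].strip()
    return thinking, answer

raw = "<think>2+2: add the units digits. Check: yes.</think>The answer is 4."
thinking, answer = split_think(raw)
print(answer)  # the visible answer, with deliberation stripped
```

Comparing `len(thinking)` to `len(answer)` on real completions is an easy way to see the thinking-to-output ratio for yourself.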
Cost implication: Thinking tokens often outnumber output tokens by 5-20x on complex problems. A math problem that produces a 100-word answer might require 2,000 words of internal reasoning.
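That multiplier translates directly into dollars. A back-of-envelope helper, using an assumed 15x thinking multiplier and an assumed $60/1M output-token rate for illustration:

```python
def billed_output_tokens(answer_tokens: int, thinking_multiplier: float) -> int:
    """Total billed completion tokens when hidden thinking tokens are
    charged alongside the visible answer (as with OpenAI's o-series)."""
    return int(answer_tokens * (1 + thinking_multiplier))

def cost_usd(tokens: int, price_per_million: float) -> float:
    """Dollar cost for a token count at a given per-million price."""
    return tokens / 1_000_000 * price_per_million

# A 200-token answer with a 15x thinking multiplier bills 3,200 tokens.
tokens = billed_output_tokens(200, 15)
print(tokens, cost_usd(tokens, 60.0))  # at a hypothetical $60/1M rate
```

The visible 200 tokens are a small fraction of what you pay for, which is why reasoning models belong on hard problems rather than routine ones.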
When to Use Reasoning Models
Reasoning models shine on tasks that require multi-step logic:
- Mathematical proofs and competition problems
- Complex coding challenges (algorithms, system design)
- Scientific reasoning and analysis
- Legal or compliance analysis with many interacting rules
They're overkill for:
- Simple Q&A, creative writing, summarization, translation
- Quick code generation or editing
Rule of thumb: If a smart human would need to sit down and think carefully, use a reasoning model. If they'd answer instantly, use a standard chat model.
Chain of Thought vs. Thinking Tokens
These terms get confused constantly. Here's the distinction:
Chain of Thought (CoT)
A prompting technique where you ask any model to "think step by step." The reasoning is visible in the output. You can use CoT with any instruction-tuned model — GPT-4o, Claude, Gemini Flash. It works because the model generates intermediate reasoning steps that help it arrive at the correct answer.
CoT is free (no extra cost), works with any model, and the reasoning is fully visible. But the quality depends on the model — a small model doing CoT won't match a large reasoning model.
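Because CoT is just prompting, it fits in a few lines. This sketch builds a chat-completion message list that elicits visible step-by-step reasoning from any instruction-tuned model; the system-prompt wording and function name are illustrative choices, not a standard:

```python
def cot_messages(question: str) -> list[dict]:
    """Build a chat message list that elicits visible chain-of-thought
    from an ordinary instruction-tuned model."""
    return [
        {"role": "system",
         "content": ("Think step by step. Show your reasoning, then give "
                     "the final answer on a line starting with 'Answer:'.")},
        {"role": "user", "content": question},
    ]

msgs = cot_messages("A train travels 120 km in 1.5 hours. Average speed?")
# These messages can be passed to any OpenAI-compatible chat client, e.g.:
#   client.chat.completions.create(model="gpt-4o", messages=msgs)
print(msgs[0]["role"], "->", msgs[1]["content"])
```

The reasoning arrives in the normal output stream, visible and billed at the normal rate.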
Thinking Tokens (Internal Reasoning)
A training technique where the model is specifically trained to reason internally before responding. The reasoning happens in a separate "thinking" phase, often using specialized tokens.
Key differences from CoT:
- The model was trained to reason this way (not just prompted)
- Reasoning quality is much higher — the model learned what good reasoning looks like
- Costs more because you're paying for thinking tokens
- Can be hidden from the user (OpenAI) or visible (DeepSeek)
The bottom line: CoT is a prompting technique that helps any model. Thinking tokens are a trained-in capability that produces dramatically better reasoning on hard problems.
Mixture of Experts (MoE) — Scaling Without the Cost
MoE is not a model type per se — it's an architecture pattern that can be applied to any of the above categories. But it's so important to the current landscape that it deserves its own section.
The Problem MoE Solves
A traditional "dense" model activates every parameter for every token. GPT-4 is rumored to have ~1.8T parameters; run dense, that would mean all 1.8T parameters firing for every single token. That's incredibly expensive.
How MoE Works
Instead of one massive network, MoE uses many smaller expert networks plus a router that decides which experts handle each token:
Token --> Router --> Expert 3 + Expert 7 (out of 64 experts) --> Output
Only 2-8 experts activate per token (out of 64+), so you get the knowledge capacity of a huge model with the compute cost of a small one.
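The routing step above can be sketched in a few lines of numpy. This is a toy top-k router in the Mixtral/DeepSeek style (softmax renormalized over only the selected experts); all shapes and names are illustrative, and a real model would evaluate only the chosen experts rather than precomputing all of them:

```python
import numpy as np

def moe_route(token_hidden, router_weights, expert_outputs, k=2):
    """Top-k mixture-of-experts routing for a single token.

    token_hidden:   (d,) hidden state for the token
    router_weights: (d, n_experts) learned router projection
    expert_outputs: (n_experts, d) each expert's output for this token
                    (precomputed here only for simplicity)
    """
    logits = token_hidden @ router_weights   # one score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best experts
    # Softmax over the selected experts only.
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()
    # Weighted combination of the chosen experts' outputs.
    return weights @ expert_outputs[top_k], sorted(top_k.tolist())

rng = np.random.default_rng(0)
d, n_experts = 8, 64
out, chosen = moe_route(rng.normal(size=d),
                        rng.normal(size=(d, n_experts)),
                        rng.normal(size=(n_experts, d)))
print(chosen)  # only 2 of the 64 experts contribute to this token
```

All 64 experts' parameters exist in memory, but per token only 2 do any work: that asymmetry is the whole trick.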
MoE in Practice
| Model | Total Params | Active Params | Ratio |
|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 28% |
| DeepSeek R1 | 671B | 37B | 5.5% |
| Qwen 3 30B (MoE) | 30B | 3B | 10% |
| GPT-4 (rumored) | ~1.8T | ~280B | ~15% |
DeepSeek R1 is a perfect example: it has 671B total parameters (massive knowledge base) but only activates 37B per token (manageable compute). This is how it achieves reasoning performance close to o1 at a fraction of the cost.
Why MoE Matters for You
If you're using models via API (like Gemini 2.5 Flash Lite via OpenRouter), MoE is why some models are both cheap and good. The provider runs a huge MoE model, but each request only uses a fraction of the compute. You get big-model quality at small-model prices.
Hybrid and Emerging Categories
Multimodal Models
Models that handle text, images, audio, and video. Not a separate "type" — more of a capability layer added on top of any model category.
Examples: GPT-4o (text + image + audio), Gemini 2.5 (text + image + video + audio), Claude (text + image)
Distilled Models
Take a large powerful model and train a smaller model to mimic its behavior. The small model learns the "knowledge" of the large model without needing all its parameters.
Examples: DeepSeek R1 distilled into 7B, 14B, and 32B variants. Phi-4-mini (3.8B) distilled from larger Microsoft models.
This is particularly relevant for local deployment — a distilled 7B model can capture much of what a 70B model knows.
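The "mimic its behavior" part has a standard loss behind it: KL divergence between temperature-softened teacher and student distributions. A minimal numpy sketch, with the temperature value and function names chosen for illustration:

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Softmax with temperature; higher T produces softer targets."""
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions
    over the vocabulary. The student minimizes this, usually mixed with
    the ordinary cross-entropy on the ground-truth next token."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student matching the teacher exactly has zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
```

The softened targets carry more signal than hard labels (they say which wrong answers are *almost* right), which is why a 7B student can absorb so much from a 70B teacher.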
Coding-Specialized Models
Models fine-tuned specifically for code generation, debugging, and software engineering tasks.
Examples: Codestral, DeepSeek Coder, Qwen Coder, StarCoder
These outperform general-purpose models of the same size on coding tasks because their training data and fine-tuning are code-focused.
The Decision Matrix
Here's how to pick the right model type for your task:
| Task | Best Model Type | Example |
|---|---|---|
| General chat, Q&A | Instruction-tuned | GPT-4o, Claude Sonnet, Gemini Flash |
| Hard math/logic | Reasoning | o3, DeepSeek R1, QwQ-32B |
| Code generation | Coding-specialized | DeepSeek Coder, Qwen Coder |
| High volume, low cost | MoE via API | Gemini 2.5 Flash Lite, DeepSeek V3 |
| Offline/private | Small + quantized | Phi-4-mini, Llama 3.2 3B |
| Image understanding | Multimodal | GPT-4o, Gemini 2.5, Claude |
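In code, the matrix above is just a lookup with a sensible fallback. The task categories and model identifiers here are hypothetical stand-ins mirroring the table, not real API model IDs:

```python
# Hypothetical task categories and model names, mirroring the matrix above.
MODEL_BY_TASK = {
    "chat":    "gpt-4o",                 # instruction-tuned default
    "math":    "deepseek-r1",            # reasoning model
    "code":    "qwen-coder",             # coding-specialized
    "bulk":    "gemini-2.5-flash-lite",  # cheap MoE via API
    "private": "phi-4-mini",             # small local model
    "vision":  "gemini-2.5",             # multimodal
}

def pick_model(task: str) -> str:
    """Route a task category to a model, falling back to the
    instruction-tuned default for anything unrecognized."""
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK["chat"])

print(pick_model("math"))     # deepseek-r1
print(pick_model("unknown"))  # gpt-4o
```

A routing layer this simple already captures most of the cost savings: the default handles the easy 90%, and the expensive models only see the tasks that need them.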
The Cost Landscape
One of the most important practical considerations is cost. Here's the current landscape for API models:
| Tier | Cost (per 1M output tokens) | Examples |
|---|---|---|
| Budget | $0.30-1 | Gemini 2.5 Flash Lite, DeepSeek V3 |
| Mid-range | $1-10 | Gemini 2.5 Flash, Claude Sonnet, GPT-4o |
| Premium | $10-30 | Claude Opus, Gemini 2.5 Pro |
| Reasoning | $15-60 | o1, o3 |
| Free (local) | $0 + electricity | Qwen, Llama, Phi, Gemma |
The gap between budget API models and premium ones has narrowed dramatically. Gemini 2.5 Flash Lite at $0.30/1M tokens delivers much of the quality of models costing 50x more. For many production use cases, that's more than enough.
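Those per-million rates are easiest to feel at realistic traffic. A quick monthly-spend estimate, using assumed per-tier prices and an assumed traffic profile:

```python
def monthly_cost(requests_per_day, avg_output_tokens, price_per_million):
    """Rough monthly output-token spend for a given traffic level
    (ignores input tokens for simplicity)."""
    tokens = requests_per_day * avg_output_tokens * 30  # 30-day month
    return tokens / 1_000_000 * price_per_million

# 10,000 requests/day at 500 output tokens each:
budget  = monthly_cost(10_000, 500, 0.40)   # assumed budget-tier rate
premium = monthly_cost(10_000, 500, 15.00)  # assumed premium-tier rate
print(f"${budget:,.0f} vs ${premium:,.0f} per month")
```

At that volume the tier choice is the difference between tens and thousands of dollars a month, which is why routing easy traffic to the budget tier matters.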
Conclusion
The model landscape is no longer "bigger is better." It's a spectrum of trade-offs:
- Need raw intelligence? Reasoning models (o3, R1)
- Need speed and cost efficiency? MoE models via API (Gemini Flash Lite, DeepSeek)
- Need privacy? Small local models with quantization
- Need reliability for production? Instruction-tuned models from major providers
The most important skill isn't picking the "best" model — it's picking the right model for each task. A $0.30/1M token model handling your summarization while a reasoning model handles your complex logic is both cheaper and better than using one expensive model for everything.
In Part 2 of this series, we'll get practical: running small models locally on your own GPU, what quantization actually does, and whether a 4GB GPU can replace your API subscription.