Running LLMs Locally — Small Models, Quantization, and Your 4GB GPU

A hands-on guide to running language models on consumer hardware. What fits on a 4GB GPU, what quantization actually does, how llama.cpp and Ollama work, and whether local models can replace your API subscription.

I use Gemini 2.5 Flash Lite via OpenRouter as my daily driver LLM. It's fast, cheap, and handles most tasks well. But I started wondering: could I run something locally on my 4GB GPU instead? What would I gain, what would I lose, and what does "quantization" even mean?

This post is the practical answer. No theory fluff — just what works, what fits, and real performance numbers.


The Question: Can a 4GB GPU Run a Useful LLM?

Yes. But with trade-offs. Here's the short answer:

| Model | Parameters | VRAM (Q4) | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~2.2GB | Good | 40-60 tok/s |
| Llama 3.2 3B Instruct | 3B | ~2.0GB | Good | 40-60 tok/s |
| Phi-4-mini | 3.8B | ~2.8GB | Very good | 35-50 tok/s |
| Gemma 3 4B | 4B | ~3.0GB | Good | 30-45 tok/s |

All of these fit comfortably on a 4GB GPU when quantized to Q4. They won't match GPT-4o or Gemini 2.5 Pro, but for many tasks — code completion, summarization, Q&A, drafting — they're surprisingly capable.


What Is Quantization?

This is the key concept that makes local LLMs possible on consumer hardware.

The Problem

A 3B parameter model stored in standard 16-bit floating point (FP16) takes up:

3 billion parameters x 2 bytes each = 6 GB

That won't fit on a 4GB GPU. And a 7B model would need 14GB. Game over for consumer hardware.

The Solution

Quantization reduces the precision of each parameter from 16 bits to 8, 4, or even 2 bits. Instead of storing each weight as a precise decimal number, you round it to fewer possible values.

Think of it like image compression. A RAW photo is 50MB. A JPEG is 5MB. You lose some subtle detail, but 95% of the visual information is preserved. Quantization does the same thing to model weights.

FP16 (16-bit):  3B params x 2 bytes = 6.0 GB
Q8   (8-bit):   3B params x 1 byte  = 3.0 GB  (50% smaller)
Q4   (4-bit):   3B params x 0.5 byte = 1.5 GB  (75% smaller)
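
These figures follow from a one-line formula. Here is a small sketch (decimal GB, weights only; real GGUF files add a little metadata, and the KV cache needs extra VRAM on top):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: params * bits / 8, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# The 3B example from above at three precisions
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: {model_size_gb(3, bits):.1f} GB")
```

The same function shows why a 7B model at Q4 (~3.5 GB of weights) is borderline on a 4GB card once you add cache and overhead.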

How Much Quality Do You Lose?

This is the critical question. The answer: surprisingly little.

| Quantization | Size Reduction | Quality Retention | Best For |
|---|---|---|---|
| Q8 (8-bit) | 50% | ~99% | When you have the VRAM |
| Q4_K_M | 75% | ~92-95% | Sweet spot for most use cases |
| Q4_K_S | 75% | ~90% | When VRAM is very tight |
| Q2 | 87% | ~80% | Experimental, noticeable degradation |

Q4_K_M is the industry standard recommendation. The "K" means it uses mixed precision — important layers (attention, output) keep higher precision while less critical layers get compressed more aggressively. The "M" stands for medium (between S=small and L=large quality tiers).

Recent research shows that task-specific fine-tuned models retain more quality after quantization than general-purpose models. Their weight distributions are narrower, making them more resistant to rounding errors.
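
To build intuition for where these quality numbers come from, here is a toy round-trip through symmetric quantization. This is a deliberately simplified sketch, not the actual K-quant scheme (which uses per-block scales and mixed precision), but it shows how error grows as bits shrink:

```python
def quantize_dequantize(weights, bits):
    """Toy symmetric quantization: snap each weight to one of 2**(bits-1)-1
    signed levels, then map back to floats."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 levels each side for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

weights = [0.82, -0.41, 0.13, -0.97, 0.05, 0.66]  # made-up example weights
for bits in (8, 4, 2):
    approx = quantize_dequantize(weights, bits)
    max_err = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"{bits}-bit: max error {max_err:.4f}")
```

The 8-bit error is tiny, 4-bit is small but visible, and 2-bit is large, which mirrors the quality-retention column above.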

Quantization Formats

There are three main formats you'll encounter:

GGUF — The universal format. Works on CPU, GPU, Apple Silicon, ARM. Created by the llama.cpp project. This is what you want for local inference on consumer hardware.

GPTQ — Optimized for NVIDIA GPUs. About 20% faster than GGUF on NVIDIA hardware, but doesn't work on CPU. Good if you have an NVIDIA card and want maximum speed.

AWQ — Newest format, highest quality retention (95% at Q4). Preserves the most important weights ("activation-aware"). Slightly less compatible than GGUF.

For a 4GB GPU: Use GGUF Q4_K_M. It works everywhere, has excellent quality, and the tooling is mature.


What Is llama.cpp?

llama.cpp is a C++ inference engine that runs quantized models efficiently on consumer hardware. It's the backbone of the local LLM ecosystem.

What It Does

  • Loads GGUF quantized models
  • Runs inference on CPU, GPU, or hybrid (some layers on GPU, rest on CPU)
  • Optimized for each platform: ARM NEON (Raspberry Pi, phones), AVX2 (Intel/AMD), CUDA (NVIDIA), Metal (Apple)
  • Provides a simple CLI and HTTP server

Why It Matters

Before llama.cpp, running a local LLM required Python, PyTorch, CUDA toolkit, and significant RAM overhead. llama.cpp is a single compiled binary with no dependencies. It's fast, lean, and runs anywhere.

# Example: run a model with llama.cpp
./llama-cli -m qwen2.5-3b-instruct-q4_k_m.gguf \
  -p "Explain Docker in 3 sentences" \
  -n 256 \
  --gpu-layers 99

The --gpu-layers 99 flag tells it to offload as many layers as possible to your GPU. On a 4GB card with a 3B Q4 model, all layers fit on the GPU.
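
As a back-of-envelope check on whether a model's layers fit, you can assume the weights split roughly evenly across transformer layers. The layer count and overhead figure below are rough assumptions for illustration, not measured values:

```python
def layers_that_fit(model_gb: float, num_layers: int, vram_gb: float,
                    overhead_gb: float = 0.6) -> int:
    """Rough estimate of how many layers fit in VRAM.
    overhead_gb is a guess covering KV cache, activations, and driver context."""
    per_layer = model_gb / num_layers
    usable = vram_gb - overhead_gb
    return min(num_layers, int(usable / per_layer))

# A ~2.2 GB Q4 model with ~36 layers on a 4 GB card (approximate figures)
print(layers_that_fit(2.2, 36, 4.0))  # every layer fits, so -ngl 99 offloads all
```

When the result comes back smaller than the layer count, llama.cpp keeps the remainder on the CPU, which is the hybrid mode described later.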


Ollama — The Easy Way

If llama.cpp is the engine, Ollama is the car. It wraps llama.cpp in a user-friendly tool with one-command model downloads, an API server, and simple management.

Getting Started

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model (one command)
ollama run qwen2.5:3b

# Or run Phi-4-mini
ollama run phi4-mini

# Or Llama 3.2 3B
ollama run llama3.2:3b

That's it. Ollama downloads the quantized model, detects your GPU, and starts an interactive chat. No Python, no CUDA setup, no config files.

Ollama as an API Server

Ollama also runs as a REST API — perfect for integrating with your own applications:

# Start the server (runs in background by default after install)
ollama serve

# Query it like OpenAI's API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:3b",
  "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists"}]
}'

The API also has an OpenAI-compatible endpoint at http://localhost:11434/v1, so you can often swap base_url from OpenAI/OpenRouter to that address and your existing client code just works.
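
A minimal sketch of that swap, assuming Ollama's OpenAI-compatible endpoint under /v1. The request is only built here, not sent, so it runs without a server:

```python
import json

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL

def chat_request(model: str, prompt: str) -> tuple[str, str]:
    """Build an OpenAI-style chat completion request for a local Ollama server."""
    url = f"{OLLAMA_BASE}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request("qwen2.5:3b", "Say hello")
print(url)
# To actually send it:
# requests.post(url, data=body, headers={"Content-Type": "application/json"})
```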

Ollama vs llama.cpp: Which One?

| | Ollama | llama.cpp |
|---|---|---|
| Setup | 1 minute | 10+ minutes (compile from source) |
| Model management | Automatic downloads | Manual GGUF file management |
| Performance | 10-30% slower (overhead) | Maximum speed (bare metal) |
| API server | Built-in | Separate llama-server binary |
| Customization | Limited | Full control over every parameter |
| Use case | Development, demos, daily use | Embedded, production, edge devices |

Recommendation: Start with Ollama. Move to raw llama.cpp only if you need every last token/second or are deploying to constrained hardware.


vLLM — When You Need to Serve Multiple Users

If you're building a product that serves multiple concurrent users, neither Ollama nor llama.cpp is the right choice. You need vLLM.

vLLM uses PagedAttention, a memory management technique that cuts KV cache waste from the 60-80% typical of naive serving down to under 4%. In practice, this means:

  • Far higher request throughput than single-user engines like llama.cpp under concurrent load
  • Stable latency even with dozens of simultaneous users
  • Designed for production APIs, not desktop use

vLLM is overkill for personal use, but essential if you're hosting a model for your app's backend.


The Models: What to Run on 4GB

Let's get specific about each option.

Qwen 2.5 3B Instruct

The well-rounded choice. Trained on 18 trillion tokens across 29 languages. Strong at coding, math, and instruction following. Alibaba's Qwen series has been consistently improving and the 3B variant punches well above its weight.

ollama run qwen2.5:3b
  • VRAM: ~2.2GB (Q4_K_M)
  • Context: 32K tokens (extendable to 128K with RoPE scaling)
  • Strengths: Multilingual, good at structured output (JSON), solid coding
  • Weakness: Can be verbose, sometimes over-explains

Phi-4-mini (3.8B)

The overachiever. Microsoft's Phi series focuses on data quality over data quantity. Phi-4-mini beats GPT-4o on several math benchmarks despite being 3.8B parameters. If your use case involves reasoning or math, this is your local model.

ollama run phi4-mini
  • VRAM: ~2.8GB (Q4_K_M)
  • Context: 128K tokens
  • Strengths: Math, reasoning, code — best benchmarks in the 3-4B range
  • Weakness: Weaker on creative writing, less multilingual than Qwen

Llama 3.2 3B Instruct

The community standard. Meta's Llama models have the largest ecosystem of fine-tunes, tools, and community support. The 3B variant is the most downloaded small model.

ollama run llama3.2:3b
  • VRAM: ~2.0GB (Q4_K_M)
  • Context: 128K tokens
  • Strengths: Best community support, tons of fine-tuned variants, smallest VRAM footprint
  • Weakness: Slightly behind Qwen/Phi on benchmarks

Gemma 3 4B

Google's efficient entry. Good all-around performance, but at 4B parameters it's the tightest fit on 4GB VRAM.

ollama run gemma3:4b
  • VRAM: ~3.0GB (Q4_K_M)
  • Context: 128K tokens
  • Strengths: Well-balanced, good instruction following
  • Weakness: Leaves little VRAM headroom on 4GB

Local Model vs. Gemini 2.5 Flash Lite via API

This is the practical comparison that matters. I use Gemini 2.5 Flash Lite through OpenRouter — how does a local 3B model compare?

| Metric | Gemini 2.5 Flash Lite (API) | Qwen 2.5 3B (Local, Q4) |
|---|---|---|
| Quality | ~75% of Gemini Pro | ~50-60% of Gemini Pro |
| Latency (first token) | 200-400ms (network) | 50-100ms (instant) |
| Speed | 100+ tok/s | 40-60 tok/s (GPU) |
| Cost | ~$0.30/1M tokens | Free (electricity only) |
| Context window | 1M tokens | 32-128K tokens |
| Privacy | Data sent to Google | 100% local |
| Availability | Requires internet | Always available |
| Multilingual | Excellent | Good (Qwen) to limited (Phi) |
| Complex reasoning | Good | Limited |

When Local Wins

  • Privacy-sensitive tasks: Medical notes, legal documents, personal data
  • Offline environments: No internet, air-gapped systems, travel
  • High-volume simple tasks: Thousands of small completions (no API cost)
  • Latency-critical: No network round-trip, instant first token
  • Learning and experimenting: No cost to experiment

When API Wins

  • Quality matters: Anything requiring nuanced understanding
  • Long context: Processing large documents (1M tokens vs 32K)
  • Complex coding: Multi-file refactors, architecture decisions
  • Reasoning tasks: Math, logic, analysis
  • Production reliability: SLAs, uptime guarantees

The Hybrid Approach (My Recommendation)

Don't choose one — use both:

  1. Gemini 2.5 Flash Lite via OpenRouter for your primary work (quality + cost-effective)
  2. Local Qwen 2.5 3B for offline use, privacy-sensitive tasks, and experimentation
  3. Reasoning model (DeepSeek R1 via API) when you need heavy thinking

This way you get the best quality where it matters, zero cost for experimenting, and complete privacy when you need it.
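
The hybrid policy fits in a few lines of routing logic. This is a toy sketch; the backend identifiers are placeholders to adapt to your own client setup:

```python
def pick_backend(task: str, *, private: bool = False, offline: bool = False,
                 needs_reasoning: bool = False) -> str:
    """Route a request in the hybrid setup: local model for private or offline
    work, a reasoning model for hard problems, the cheap API otherwise."""
    if private or offline:
        return "ollama/qwen2.5:3b"          # never leaves the machine
    if needs_reasoning:
        return "openrouter/deepseek-r1"     # heavy thinking via API
    return "openrouter/gemini-2.5-flash-lite"  # default daily driver

print(pick_backend("summarize these medical notes", private=True))
```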


Practical Setup: From Zero to Running

Here's the complete setup for your 4GB GPU:

Step 1: Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify the install, then confirm your GPU is visible
ollama --version
nvidia-smi  # should show your GPU

Step 2: Pull Your First Model

# Start with Qwen 2.5 3B (best balance for 4GB)
ollama pull qwen2.5:3b

# Check model size
ollama list

Step 3: Test It

# Interactive chat
ollama run qwen2.5:3b

# Or one-shot query
echo "Explain what a Docker volume is in 2 sentences" | ollama run qwen2.5:3b

Step 4: Use It From Your Code

import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen2.5:3b",
    "messages": [{"role": "user", "content": "Write a FastAPI health check endpoint"}],
    "stream": False
})
print(response.json()["message"]["content"])

Step 5: Try Different Models

# Pull alternatives and compare
ollama pull phi4-mini
ollama pull llama3.2:3b

# Quick benchmark: same prompt, different models
time echo "Write a Python binary search function" | ollama run qwen2.5:3b
time echo "Write a Python binary search function" | ollama run phi4-mini
time echo "Write a Python binary search function" | ollama run llama3.2:3b

Performance Tips for 4GB GPUs

Maximize GPU Offloading

Ollama does this automatically, but if using llama.cpp directly:

# Offload all layers to GPU (if they fit); -ngl is the short form of --gpu-layers
./llama-cli -m model.gguf -ngl 99

Monitor VRAM Usage

# Watch GPU memory in real-time
watch -n 1 nvidia-smi

If you see VRAM at 95%+, the model barely fits. Consider a smaller quantization (Q4_K_S instead of Q4_K_M) or a smaller model.

CPU Fallback Is Fine

If a model doesn't fully fit in VRAM, llama.cpp splits it: some layers on GPU, the rest on CPU. You lose speed but it works. A 7B Q4 model can run on a 4GB GPU + 8GB RAM at ~15-20 tok/s instead of ~40 tok/s.

Memory Bandwidth Matters More Than Compute

Here's a counterintuitive fact: for LLM inference, memory bandwidth matters more than raw GPU compute power. An RTX 3090 (936 GB/s bandwidth) often outperforms an RTX 4080 (717 GB/s) for LLM inference, despite the 4080 being a newer card. If you're shopping for a used GPU specifically for local LLMs, prioritize bandwidth.
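
The rule of thumb behind this: during decoding, every generated token streams the full set of weights through memory once, so bandwidth divided by model size gives a theoretical ceiling on tokens per second. A quick sketch (the ~4 GB figure is a rough Q4 7B weight size, for illustration):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for memory-bound inference:
    each token requires reading all weights from VRAM once."""
    return bandwidth_gb_s / model_size_gb

# RTX 3090 (936 GB/s) vs RTX 4080 (717 GB/s) on ~4 GB of Q4 weights
for name, bw in [("RTX 3090", 936), ("RTX 4080", 717)]:
    print(f"{name}: ~{max_tokens_per_sec(bw, 4.0):.0f} tok/s ceiling")
```

Real throughput lands well below the ceiling, but the ratio between cards holds, which is why the higher-bandwidth 3090 wins.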


Beyond 4GB: What Opens Up With More VRAM

If you upgrade in the future:

| VRAM | What Fits (Q4) | Notable Models |
|---|---|---|
| 4GB | Up to 4B | Qwen 3B, Phi-4-mini, Llama 3.2 3B |
| 8GB | Up to 8B | Llama 3.1 8B, Qwen 2.5 7B, Gemma 9B |
| 12GB | Up to 14B | Qwen 2.5 14B, Phi-3 Medium 14B |
| 16GB | Up to 32B | QwQ-32B (reasoning!), Qwen 2.5 32B |
| 24GB | Up to 70B | Llama 3.1 70B, Qwen 2.5 72B |

The sweet spot in the current market is a used RTX 3090 (24GB, ~$800). It runs QwQ-32B — an open-source reasoning model that rivals OpenAI o1 — entirely locally.


Can Open-Source Models Actually Replace Gemini 2.5 Flash?

This is the question that matters most if you're considering a hardware upgrade. I use Gemini 2.5 Flash Lite via OpenRouter as my daily driver, and Gemini 2.5 Flash for harder tasks. Can any open-source model match them locally?

The Benchmarks: Gemini Flash vs Open-Source

| Benchmark | Gemini 2.5 Flash | Gemini 2.5 Flash Lite | Qwen3 235B-A22B | QwQ-32B | Qwen 2.5 72B |
|---|---|---|---|---|---|
| MMLU (knowledge) | 88.4% | ~75% | ~85% | ~82% | ~83% |
| AIME 2024 (math) | 88.0% | -- | 85.7% | ~75% | ~65% |
| AIME 2025 (math) | 72.0% | -- | 81.5% | -- | -- |
| GPQA Diamond (science) | 82.8% | -- | 81.1% | -- | ~68% |
| LiveCodeBench v5 (coding) | 63.9% | -- | 70.7% | -- | ~50% |
| SWE-Bench (real coding) | 60.4% | -- | -- | -- | -- |

vs. Gemini 2.5 Flash Lite (Your Daily Driver)

Flash Lite delivers roughly 75% of Flash quality at 30% of the cost. The open-source models that match or exceed Flash Lite:

  • Qwen 2.5 72B — Matches Flash Lite on general tasks. Needs 2x RTX 4090 (48GB) or 1x RTX 4090 + 64GB RAM with offloading.
  • QwQ-32B — Beats Flash Lite on reasoning/math tasks. Fits on a single RTX 4090 (24GB).
  • Llama 3.1 70B — Similar to Qwen 72B. Same hardware requirements.

For a 4GB GPU, no open-source model comes close to Flash Lite. The 3B models are roughly 40-50% of Flash Lite quality. The gap is real.

vs. Gemini 2.5 Flash (The Full Model)

Only two open-source models genuinely compete with Flash:

1. Qwen3 235B-A22B (MoE) — The closest all-rounder.

  • 235B total parameters, but only 22B active per token (MoE architecture)
  • Matches Flash on science (GPQA 81.1 vs 82.8)
  • Beats Flash on math (AIME 2025: 81.5 vs 72.0) and coding (LiveCodeBench: 70.7 vs 63.9)
  • Q4 quantized needs ~110-140GB total memory
  • Practical setup: 1x RTX 4090 (24GB) + 128GB DDR5 system RAM using CPU/GPU hybrid offload
  • Speed: ~3-8 tok/s (usable for complex tasks where you'd wait anyway)

2. DeepSeek V3.2 (thinking mode) — Crushes Flash on coding and math.

  • LiveCodeBench: 83.3% vs Flash's 63.9%
  • AIME 2025: 93.1% vs Flash's 72.0%
  • But at 671B parameters, it needs ~245GB+ even quantized — essentially impossible on consumer hardware

The Hardware Investment Math

| Setup | Cost | Best Model | Flash Equivalent? |
|---|---|---|---|
| Current 4GB GPU | $0 | Qwen 2.5 3B | ~40% of Flash Lite |
| RTX 4090 (24GB) | ~$2,000 | QwQ-32B | ~70% of Flash, strong reasoning |
| RTX 4090 + 128GB RAM | ~$3,000 | Qwen3 235B-A22B | ~90-100% of Flash |
| 2x RTX 4090 | ~$4,500 | Qwen 2.5 72B (full VRAM) | ~85% of Flash |

The Economic Reality

Let's do the math. At Gemini 2.5 Flash Lite pricing (~$0.30/1M output tokens):

  • $3,000 hardware / $0.30 per 1M tokens = 10 billion tokens before break-even
  • That's roughly 15 million full responses — you'd need decades of heavy use

At full Flash pricing (~$1-2/1M tokens):

  • Break-even drops to 1.5-3 billion tokens — still years of use for a single developer
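
The break-even arithmetic above, as a reusable one-liner. Prices are the rough figures quoted in this post, not official rates:

```python
def break_even_tokens(hardware_cost_usd: float, price_per_million_usd: float) -> float:
    """Tokens you would have to buy from the API before local hardware pays off."""
    return hardware_cost_usd / price_per_million_usd * 1_000_000

flash_lite = break_even_tokens(3000, 0.30)  # 10 billion tokens
flash = break_even_tokens(3000, 2.00)       # 1.5 billion tokens
print(f"Flash Lite break-even: {flash_lite:.0e} tokens; Flash: {flash:.0e} tokens")
```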

Local makes economic sense only if:

  • You process massive volumes (batch jobs, RAG pipelines, thousands of queries/day)
  • Privacy/compliance requires keeping data off third-party servers
  • You need zero-latency, zero-downtime, offline-capable inference
  • You want to fine-tune models on your own data

For most individual developers, the API remains cheaper. But if you need privacy or offline access, even a modest upgrade opens real possibilities.


Conclusion

Running LLMs locally on a 4GB GPU is absolutely viable in 2026. Quantization (specifically Q4_K_M GGUF) shrinks models by 75% with minimal quality loss. Tools like Ollama make setup trivial — one command to install, one command to run.

For a 4GB GPU, Qwen 2.5 3B or Phi-4-mini are your best options — good for code completion, Q&A, summarization, and drafting. They won't replace Gemini 2.5 Flash Lite, but they're perfect for offline and privacy-sensitive work.

If you're willing to invest in an RTX 4090 + 128GB RAM (~$3,000), you can run Qwen3 235B-A22B — an open-source model that genuinely matches Gemini 2.5 Flash on most benchmarks. That's the current frontier of what's possible on consumer hardware.

The practical setup for most developers:

  • Daily driver: Gemini 2.5 Flash Lite via OpenRouter (quality + cost)
  • Local backup: Qwen 2.5 3B via Ollama (privacy + offline + free)
  • Heavy lifting: DeepSeek R1 via API when you need serious reasoning
  • If you upgrade: QwQ-32B on an RTX 4090 for local reasoning that rivals o1

The API vs. local debate isn't either/or — it's about using the right tool for the right job. A 4GB GPU and 10 minutes of Ollama setup gives you a capable, private, always-available AI assistant. And if you catch the local LLM bug, there's a clear upgrade path all the way to Flash-level performance on your own machine.
