PostGIS vs DuckDB — Choosing the Right Tool for Spatial Data
A practical guide to OLTP vs OLAP for geospatial workloads. When to use PostGIS for serving and transactions, when DuckDB wins for analytics and batch processing, and how to combine both in a modern spatial data stack.
You have 50 million spatial points. You need to store them, query them, and compute statistics across them.
Which tool do you reach for?
If your answer is "PostgreSQL with PostGIS" — you are not wrong. But you might be using a screwdriver to hammer a nail.
This post breaks down the two fundamental modes of spatial data processing — transactional (PostGIS) and analytical (DuckDB) — when each one shines, where each one breaks down, and how to combine them in a modern geospatial stack.
Table of Contents
- OLTP vs OLAP — The Fundamental Split
- What is PostGIS?
- PostGIS in Action
- Where PostGIS Struggles
- What is DuckDB?
- DuckDB Spatial in Action
- Where DuckDB Struggles
- Side-by-Side Comparison
- When to Use PostGIS
- When to Use DuckDB
- Combining Both: A Modern Spatial Stack
- Other OLAP Options
- Practical Tips
- Summary
OLTP vs OLAP — The Fundamental Split
Before we compare tools, you need to understand the fundamental architectural split in databases. Almost every database is optimized primarily for one of two workloads:
OLTP (Online Transaction Processing) — optimized for many small, fast read/write operations. A user signs up, an API returns one record, a sensor writes a new reading. PostgreSQL, MySQL, and SQLite are OLTP databases.
OLAP (Online Analytical Processing) — optimized for scanning and aggregating massive amounts of data. "What was the average elevation across 10 million LiDAR points?" or "Show me total road assets per canton." ClickHouse, BigQuery, Snowflake, and DuckDB are OLAP engines.
The key technical difference is storage layout:
Row-based storage (OLTP):
Row 1: [id=1, x=7.45, y=47.21, z=432.1, class=2, intensity=128]
Row 2: [id=2, x=7.45, y=47.21, z=433.5, class=6, intensity=95]
Row 3: [id=3, x=7.46, y=47.22, z=431.8, class=2, intensity=142]
Every field in a row is stored together on disk. Great for fetching a complete row by ID. Terrible for scanning one column across millions of rows — the engine has to read every field just to get to class.
Column-based storage (OLAP):
id: [1, 2, 3, ...]
x: [7.45, 7.45, 7.46, ...]
y: [47.21, 47.21, 47.22, ...]
z: [432.1, 433.5, 431.8, ...]
class: [2, 6, 2, ...]
intensity: [128, 95, 142, ...]
Each column is stored as a contiguous block. To compute AVG(z) WHERE class = 2, the engine only reads two columns (z and class), skipping everything else. On 50 million points, this is the difference between seconds and minutes.
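To make the difference concrete, here is a toy sketch in Python — not real engine internals, just the access pattern — of the same AVG(z) WHERE class = 2 aggregation against both layouts. The data is made up:

```python
# Illustrative sketch (not a real engine): the same aggregation over
# a row-based vs a column-based layout.

rows = [  # row-based: every field of a record stored together
    {"id": 1, "x": 7.45, "y": 47.21, "z": 432.1, "cls": 2, "intensity": 128},
    {"id": 2, "x": 7.45, "y": 47.21, "z": 433.5, "cls": 6, "intensity": 95},
    {"id": 3, "x": 7.46, "y": 47.22, "z": 431.8, "cls": 2, "intensity": 142},
]

columns = {  # column-based: each field is its own contiguous array
    "z":   [432.1, 433.5, 431.8],
    "cls": [2, 6, 2],
}

# Row layout: the scan must touch every full record to answer the query
zs_row = [r["z"] for r in rows if r["cls"] == 2]

# Column layout: only two arrays are read; x, y, intensity never leave disk
zs_col = [z for z, c in zip(columns["z"], columns["cls"]) if c == 2]

assert zs_row == zs_col  # same answer, very different I/O profile
print(round(sum(zs_col) / len(zs_col), 2))  # prints 431.95
```

On three rows the difference is invisible; on 50 million rows, reading two columns instead of six is the whole ballgame.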
This is not about one being "better." It is about choosing based on your access pattern.
What is PostGIS?
PostGIS is the spatial extension for PostgreSQL. It adds geometry and geography types, spatial indexing (GIST), and hundreds of spatial functions to the world's most popular open-source relational database.
PostGIS turns PostgreSQL into a full-featured spatial OLTP database: it can store points, lines, polygons, rasters, and 3D geometries with transactional guarantees (ACID), concurrent access, and standard SQL.
PostGIS in Action
Store and query spatial data with standard SQL:
-- Create a table for survey points
CREATE TABLE survey_points (
id SERIAL PRIMARY KEY,
geom GEOMETRY(PointZ, 2056), -- Swiss LV95 coordinate system
classification INTEGER,
intensity FLOAT,
captured_at TIMESTAMP DEFAULT NOW()
);
-- Create a spatial index
CREATE INDEX idx_survey_geom ON survey_points USING GIST(geom);
-- Insert a point
INSERT INTO survey_points (geom, classification, intensity)
VALUES (ST_SetSRID(ST_MakePoint(2600000, 1200000, 450.5), 2056), 2, 128.0);
-- Find all ground points within 100m of a location
SELECT id, ST_AsText(geom), intensity
FROM survey_points
WHERE classification = 2
AND ST_DWithin(geom, ST_SetSRID(ST_MakePoint(2600100, 1200100, 0), 2056), 100);
-- Spatial join: find which parcel each point belongs to
SELECT p.id, p.classification, parcels.owner
FROM survey_points p
JOIN parcels ON ST_Contains(parcels.geom, p.geom);
PostGIS is excellent here. The spatial index makes point-in-polygon queries fast even with millions of rows. ACID transactions mean two users can write simultaneously without data corruption. Your web API can serve single-record lookups in milliseconds.
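For intuition: in a projected, metre-based CRS like LV95, ST_DWithin boils down to a planar distance predicate per candidate row — the GIST index just prunes candidates by bounding box first so most rows are never tested. A toy sketch of the predicate alone, with made-up coordinates:

```python
import math

def within_distance(px, py, qx, qy, d):
    # Planar distance test -- roughly what ST_DWithin evaluates per
    # candidate in a projected CRS, after the index has pruned by bbox
    return math.hypot(px - qx, py - qy) <= d

# Hypothetical LV95 coordinates (metres)
print(within_distance(2600000, 1200000, 2600100, 1200100, 100))  # prints False (~141 m apart)
print(within_distance(2600000, 1200000, 2600050, 1200050, 100))  # prints True (~71 m apart)
```

The index is what makes this fast: without it, PostgreSQL would run the distance test on every row in the table.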
Where PostGIS Struggles
Now try this:
-- Compute average elevation per classification across 50M points
SELECT classification, AVG(ST_Z(geom)), COUNT(*)
FROM survey_points
GROUP BY classification;
On 50 million rows, this query will take minutes. PostgreSQL has to read every row from disk (row-based storage), extract the Z coordinate from each geometry, and aggregate. The spatial index does not help — there is no spatial filter to narrow the scan.
Or this:
-- Compute a 1m grid density map across the entire dataset
SELECT
FLOOR(ST_X(geom)) AS grid_x,
FLOOR(ST_Y(geom)) AS grid_y,
COUNT(*) AS point_count,
AVG(ST_Z(geom)) AS avg_elevation
FROM survey_points
GROUP BY grid_x, grid_y;
This is a full table scan with aggregation — exactly the workload PostgreSQL was not designed for. It will work, but it will be slow and use significant memory.
PostGIS is not the right tool when you need to:
- Scan and aggregate millions of rows
- Run exploratory analytics ("show me distribution of X across all data")
- Process batch ETL transformations
- Compute statistics across entire datasets
For these workloads, you need an OLAP engine.
What is DuckDB?
DuckDB is an in-process, columnar analytical database. Think of it as "SQLite for analytics" — it runs inside your application with no server, no configuration, and no external dependencies.
But do not let the simplicity fool you. DuckDB can process hundreds of millions of rows on a single machine, reads Parquet and CSV files directly (no import step), and has a growing spatial extension.
DuckDB Spatial in Action
-- Install and load the spatial extension
INSTALL spatial;
LOAD spatial;
-- Read a 50M-row Parquet file and analyze it directly
-- No import, no schema creation, no waiting
SELECT
classification,
COUNT(*) AS point_count,
AVG(z) AS avg_elevation,
MIN(z) AS min_z,
MAX(z) AS max_z
FROM read_parquet('survey_points_50M.parquet')
GROUP BY classification
ORDER BY point_count DESC;
On a standard laptop, this finishes in seconds. Not minutes. Seconds. The columnar storage means DuckDB only reads the classification and z columns — it never touches x, y, intensity, or any other field.
Spatial queries work too:
-- Compute point density per 100m grid cell
SELECT
FLOOR(x / 100) * 100 AS grid_x,
FLOOR(y / 100) * 100 AS grid_y,
COUNT(*) AS density,
AVG(z) AS avg_z
FROM read_parquet('survey_points_50M.parquet')
GROUP BY grid_x, grid_y
HAVING density > 100;
-- Spatial filtering with geometry
SELECT COUNT(*)
FROM read_parquet('survey_points_50M.parquet')
WHERE ST_Within(
ST_Point(x, y),
ST_GeomFromText('POLYGON((7.4 47.2, 7.5 47.2, 7.5 47.3, 7.4 47.3, 7.4 47.2))')
);
DuckDB also reads directly from remote files (this requires the httpfs extension — INSTALL httpfs; LOAD httpfs; — plus S3 credentials configured in your environment):
-- Query a Parquet file on S3 without downloading it
SELECT classification, COUNT(*)
FROM read_parquet('s3://my-bucket/lidar/zurich_2024.parquet')
GROUP BY classification;
Python Integration
DuckDB shines in Python workflows — it integrates natively with Pandas, Polars, and Arrow:
import duckdb
# Query a local Parquet file
result = duckdb.sql("""
SELECT
classification,
COUNT(*) as count,
AVG(z) as mean_z,
STDDEV(z) as std_z
FROM read_parquet('pointcloud.parquet')
GROUP BY classification
""").df()
print(result)
# classification count mean_z std_z
# 2 (Ground) 8234521 432.1 2.3
# 6 (Building) 3122843 445.7 5.1
# 3 (Vegetation) 5891204 438.9 8.7
You can also query Pandas DataFrames directly — no import needed:
import pandas as pd
import duckdb
# Load data from any source into a DataFrame
df = pd.read_csv('sensors.csv')
# Query it with SQL instantly
result = duckdb.sql("""
SELECT sensor_id, AVG(temperature), MAX(humidity)
FROM df
WHERE timestamp > '2026-01-01'
GROUP BY sensor_id
""").df()
Where DuckDB Struggles
DuckDB is not a server. It has no:
- Concurrent write access — one writer at a time
- User authentication — no roles, no permissions
- Network protocol — it runs in-process, not as a service
- Multi-user transactional guarantees — transactions work within a single process, but there is no row-level locking across concurrent writers
You cannot put DuckDB behind a web API serving 100 concurrent users. You cannot have two processes writing to the same DuckDB file. It is an analytical tool, not an application database.
Side-by-Side Comparison
| | PostGIS | DuckDB |
|---|---|---|
| Type | OLTP database | OLAP engine |
| Storage | Row-based | Column-based |
| Server | Yes (PostgreSQL daemon) | No (in-process, like SQLite) |
| Concurrency | Hundreds of users | Single process |
| Best at | CRUD, spatial joins, API serving | Aggregations, batch analytics, ETL |
| 50M row aggregation | Minutes | Seconds |
| Single row lookup by ID | Microseconds | Overkill |
| Updates | Yes (row-level, concurrent) | Append-mostly |
| Spatial indexing | GIST, SP-GIST (mature) | R-tree (basic, growing) |
| File format support | Import required | Reads Parquet, CSV, JSON directly |
| Use behind a web API | Yes | No |
| Setup complexity | Server installation, config | pip install duckdb |
| Geospatial maturity | 20+ years, 300+ functions | Young, but growing fast |
| Cost | Server + RAM + storage | Free, runs on your laptop |
When to Use PostGIS
Use PostGIS when your workload is transactional and interactive:
- Your web API needs to serve spatial queries to users (GET /api/points?bbox=...)
- Multiple users read and write concurrently
- You need spatial joins with other relational data (parcels, users, projects)
- You need ACID guarantees (financial data, official records)
- Your queries filter by spatial region and return small result sets
- You are managing metadata: project ownership, sharing, permissions
Example: A surveying platform where clients upload datasets, share them with team members, and query specific regions.
-- This is a PostGIS workload
SELECT p.id, p.name, ST_AsGeoJSON(p.bbox)
FROM projects p
JOIN project_members pm ON p.id = pm.project_id
WHERE pm.user_id = 42
AND ST_Intersects(p.bbox, ST_MakeEnvelope(7.4, 47.2, 7.6, 47.4, 4326));
Another example: An IoT platform receiving GPS coordinates from 10,000 fleet vehicles and serving their last-known positions to a dashboard.
-- Insert new position (happens thousands of times per second)
INSERT INTO vehicle_positions (vehicle_id, geom, speed, timestamp)
VALUES (742, ST_SetSRID(ST_MakePoint(8.54, 47.37), 4326), 62.5, NOW());
-- Query last position for a single vehicle (API endpoint)
SELECT vehicle_id, ST_AsGeoJSON(geom), speed, timestamp
FROM vehicle_positions
WHERE vehicle_id = 742
ORDER BY timestamp DESC
LIMIT 1;
PostGIS handles this effortlessly — high write throughput, fast single-row reads, concurrent access.
When to Use DuckDB
Use DuckDB when your workload is analytical and batch-oriented:
- You need to compute statistics across millions of rows
- You are building ETL pipelines (raw data → cleaned → transformed)
- You want to explore data interactively ("how many points per class?")
- You are processing files that do not live in a database
- You need to join multiple large files without importing them first
Example: A classification pipeline that processes raw LiDAR scans, computes quality metrics, and exports results.
import duckdb
# Analyze classification results across all scans in a directory
stats = duckdb.sql("""
SELECT
filename,
classification,
COUNT(*) as points,
AVG(z) as mean_z,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY z) as p95_z
FROM read_parquet('output/classified/*.parquet', filename=true)
GROUP BY filename, classification
ORDER BY filename, classification
""").df()
# Flag scans with suspicious class distributions
anomalies = stats[stats['classification'] == 2].query('points < 1000')
print(f"Warning: {len(anomalies)} scans with very few ground points")
Another example: An analytics dashboard for a geospatial platform — "Show me asset density per region for the last 12 months."
-- DuckDB reads directly from Parquet on S3
SELECT
region,
DATE_TRUNC('month', captured_at) AS month,
COUNT(*) AS asset_count,
COUNT(DISTINCT asset_type) AS unique_types
FROM read_parquet('s3://data-lake/assets/2025/*.parquet')
GROUP BY region, month
ORDER BY region, month;
This query scans hundreds of millions of rows across dozens of Parquet files and returns results in seconds — no data import, no schema setup.
Combining Both: A Modern Spatial Stack
The real power comes from using both together. PostGIS handles the application layer, DuckDB handles the analytics layer.
┌──────────────────────────────────────────────────┐
│ Your Application │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PostGIS │ │ DuckDB │ │
│ │ (OLTP) │ │ (OLAP) │ │
│ │ │ │ │ │
│ │ • User accounts │ │ • Batch stats │ │
│ │ • Project CRUD │ │ • ETL pipelines │ │
│ │ • API queries │ │ • Data quality │ │
│ │ • Spatial search │ │ • Reporting │ │
│ │ • Permissions │ │ • Exploration │ │
│ │ • Real-time data │ │ • Aggregations │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │ │
│ Serves the API Feeds dashboards │
│ Handles writes Reads from Parquet/S3 │
└──────────────────────────────────────────────────┘
Concrete example: A geospatial data platform.
PostGIS manages users, projects, and dataset metadata:
-- PostGIS: application data
CREATE TABLE datasets (
id UUID PRIMARY KEY,
name TEXT,
owner_id UUID REFERENCES users(id),
storage_path TEXT, -- path to Parquet files
bbox GEOMETRY(Polygon, 4326),
point_count BIGINT,
created_at TIMESTAMP
);
-- API query: "Show me all datasets near Zurich"
SELECT name, point_count, storage_path
FROM datasets
WHERE ST_DWithin(bbox, ST_SetSRID(ST_MakePoint(8.54, 47.37), 4326), 0.1)
AND owner_id = current_user_id();
DuckDB handles the heavy analytical lifting:
import duckdb
# DuckDB: analytics on the actual data
# User clicks "Show Statistics" on a dataset
def get_dataset_stats(storage_path: str) -> list[dict]:
stats = duckdb.sql(f"""
SELECT
classification,
COUNT(*) as points,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as pct,
ROUND(AVG(z), 2) as avg_z,
ROUND(STDDEV(z), 2) as std_z
FROM read_parquet('{storage_path}/*.parquet')
GROUP BY classification
ORDER BY points DESC
""").df()
return stats.to_dict(orient='records')
Each tool does what it does best. PostGIS serves the API and manages state. DuckDB crunches the numbers.
Other OLAP Options
DuckDB is not the only analytical engine with spatial capabilities. Depending on your scale and infrastructure, you might also consider:
ClickHouse
A server-based columnar database designed for real-time analytics at massive scale. If you have billions of rows and need sub-second queries with concurrent users, ClickHouse is worth evaluating. It has basic geo functions but is not as spatially mature as DuckDB's extension.
Best for: Real-time dashboards over billions of events, multi-user analytical workloads.
Google BigQuery / Snowflake / Databricks
Cloud-managed OLAP services with built-in spatial functions. If your data already lives in the cloud and you want zero infrastructure management, these are strong options. The trade-off is cost and vendor lock-in.
Best for: Teams already on cloud infrastructure who want managed analytics.
Apache Sedona (GeoSpark)
A distributed spatial engine on top of Apache Spark. If you need to process terabytes of spatial data across a cluster, Sedona handles the distribution. Heavy to set up, but scales horizontally.
Best for: Truly massive datasets (TB+) that cannot fit on a single machine.
When to Stick with DuckDB
For most teams — especially small companies, consultants, and single-engineer projects — DuckDB is the right starting point. Zero infrastructure, runs on a laptop, handles hundreds of millions of rows. You do not need ClickHouse until you have billions of rows and concurrent analytical users. You do not need Spark until your data exceeds a single machine's RAM.
Practical Tips
Start Simple
If you only have PostGIS and your dataset fits in memory, do not add DuckDB just because this article exists. PostGIS can handle analytical queries on datasets up to a few million rows without problems. Add complexity only when you hit actual performance walls.
Store Raw Data as Parquet
If you are collecting spatial data at scale, store the raw data as Parquet files (not in PostgreSQL). This gives you:
- Cheap storage (S3 or local disk)
- Direct queryability with DuckDB (no import)
- Easy backup and versioning
- Freedom to switch analytical engines later
PostGIS for Metadata, DuckDB for Data
A good rule of thumb: if the table has fewer than 1 million rows and serves an API, put it in PostGIS. If the table has tens of millions of rows and serves dashboards or reports, keep it as Parquet and query with DuckDB.
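If it helps, that rule of thumb can be written down as a toy decision function. This is illustrative only — the thresholds are the rough ones above, not hard limits:

```python
def choose_engine(row_count: int, serves_api: bool) -> str:
    """Toy encoding of the rule of thumb -- illustrative, not a law."""
    if serves_api and row_count < 1_000_000:
        return "postgis"           # small, interactive, needs concurrency
    if row_count >= 10_000_000 and not serves_api:
        return "duckdb"            # big, batch, dashboards/reports
    return "benchmark first"       # grey zone: measure before deciding

print(choose_engine(50_000, True))        # prints postgis
print(choose_engine(80_000_000, False))   # prints duckdb
```

The grey zone in the middle is real: a 5M-row table serving an API is fine in PostGIS, and a 5M-row table feeding a nightly report is fine in either.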
Use DuckDB for One-Off Analysis
DuckDB is perfect for questions you ask once: "How many ground points do we have across all 2024 datasets?" You do not need to set up a database for this — just point DuckDB at your Parquet files and query.
# One-liner from the command line
duckdb -c "SELECT classification, COUNT(*) FROM read_parquet('*.parquet') GROUP BY 1"
Benchmark Before Deciding
If you are unsure whether PostGIS is fast enough, benchmark it. Run your actual query on your actual data. If PostGIS returns results in under a second, there is no reason to add DuckDB. If it takes minutes, now you have a reason.
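A minimal timing harness for that benchmark might look like this — the workload below is a stdlib dummy; swap in your actual PostGIS or DuckDB query call:

```python
import time

def benchmark(fn, repeats=5):
    """Run fn several times and return the best wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)  # best-of-N filters out warm-up and OS noise

# Dummy workload standing in for e.g. a psycopg2 or duckdb query call
best = benchmark(lambda: sum(range(100_000)))
print(f"best of 5: {best:.4f}s")
```

Best-of-N (rather than the mean) is the usual choice for micro-benchmarks because caches and connection warm-up skew the first runs upward.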
Summary
There is no single database that handles all spatial workloads well. The choice depends on your access pattern:
| Access Pattern | Tool | Why |
|---|---|---|
| Serve an API, manage users, spatial lookups | PostGIS | ACID, concurrent access, spatial indexing, 20 years of maturity |
| Scan millions of rows, compute statistics | DuckDB | Columnar storage, in-process, reads files directly, zero setup |
| Both: app + analytics | PostGIS + DuckDB | Each layer does what it does best |
The key insight is simple: row-based databases are fast at finding one row. Column-based engines are fast at scanning all rows. Once you internalize this, the choice for any given workload becomes obvious.
PostGIS is not slow. DuckDB is not better. They solve different problems. The mature spatial architect picks the right tool for each layer of the stack — and lets each one do what it was built for.