PostGIS vs DuckDB — Choosing the Right Tool for Spatial Data
A practical guide to OLTP vs OLAP for geospatial workloads. When to use PostGIS for serving and transactions, when DuckDB wins for analytics and batch processing, and how to combine both in a modern spatial data stack.
You have 50 million spatial points. You need to store them, query them, and compute statistics across them.
Which tool do you reach for?
If your answer is "PostgreSQL with PostGIS" — you are not wrong. But you might be using a screwdriver to hammer a nail.
This post breaks down the two fundamental modes of spatial data processing — transactional (PostGIS) and analytical (DuckDB) — when each one shines, where each one breaks down, and how to combine them in a modern geospatial stack.
Table of Contents
- OLTP vs OLAP — The Fundamental Split
- What is PostGIS?
- PostGIS in Action
- Where PostGIS Struggles
- What is DuckDB?
- DuckDB Spatial in Action
- Where DuckDB Struggles
- Side-by-Side Comparison
- When to Use PostGIS
- When to Use DuckDB
- Combining Both: A Modern Spatial Stack
- Other OLAP Options
- Practical Tips
- Summary
OLTP vs OLAP — The Fundamental Split
Before we compare tools, you need to understand the fundamental architectural split in databases. Almost every database is optimized primarily for one of two workloads:
OLTP (Online Transaction Processing) — optimized for many small, fast read/write operations. A user signs up, an API returns one record, a sensor writes a new reading. PostgreSQL, MySQL, and SQLite are OLTP databases.
OLAP (Online Analytical Processing) — optimized for scanning and aggregating massive amounts of data. "What was the average elevation across 10 million LiDAR points?" or "Show me total road assets per canton." ClickHouse, BigQuery, Snowflake, and DuckDB are OLAP engines.
The key technical difference is storage layout:
Row-based storage (OLTP):
Row 1: [id=1, x=7.45, y=47.21, z=432.1, class=2, intensity=128]
Row 2: [id=2, x=7.45, y=47.21, z=433.5, class=6, intensity=95]
Row 3: [id=3, x=7.46, y=47.22, z=431.8, class=2, intensity=142]
Every field in a row is stored together on disk. Great for fetching a complete row by ID. Terrible for scanning one column across millions of rows — the engine has to read every field just to get to class.
Column-based storage (OLAP):
id: [1, 2, 3, ...]
x: [7.45, 7.45, 7.46, ...]
y: [47.21, 47.21, 47.22, ...]
z: [432.1, 433.5, 431.8, ...]
class: [2, 6, 2, ...]
intensity: [128, 95, 142, ...]
Each column is stored as a contiguous block. To compute AVG(z) WHERE class = 2, the engine only reads two columns (z and class), skipping everything else. On 50 million points, this is the difference between seconds and minutes.
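To make the difference concrete, here is a toy sketch in Python — not real engine internals, just the access pattern — of the same AVG(z) WHERE class = 2 aggregation against both layouts. The data is made up:

```python
# Illustrative sketch (not a real engine): the same aggregation over
# a row-based vs a column-based layout.

rows = [  # row-based: every field of a record stored together
    {"id": 1, "x": 7.45, "y": 47.21, "z": 432.1, "cls": 2, "intensity": 128},
    {"id": 2, "x": 7.45, "y": 47.21, "z": 433.5, "cls": 6, "intensity": 95},
    {"id": 3, "x": 7.46, "y": 47.22, "z": 431.8, "cls": 2, "intensity": 142},
]

columns = {  # column-based: each field is its own contiguous array
    "z":   [432.1, 433.5, 431.8],
    "cls": [2, 6, 2],
}

# Row layout: the scan must touch every full record to answer the query
zs_row = [r["z"] for r in rows if r["cls"] == 2]

# Column layout: only two arrays are read; x, y, intensity never leave disk
zs_col = [z for z, c in zip(columns["z"], columns["cls"]) if c == 2]

assert zs_row == zs_col  # same answer, very different I/O profile
print(round(sum(zs_col) / len(zs_col), 2))  # prints 431.95
```

On three rows the difference is invisible; on 50 million rows, reading two columns instead of six is the whole ballgame.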
This is not about one being "better." It is about choosing based on your access pattern.
What is PostGIS?
PostGIS is the spatial extension for PostgreSQL. It adds geometry and geography types, spatial indexing (GIST), and hundreds of spatial functions to the world's most popular open-source relational database.
PostGIS turns PostgreSQL into a full-featured spatial OLTP database: it can store points, lines, polygons, rasters, and 3D geometries with transactional guarantees (ACID), concurrent access, and standard SQL.
PostGIS in Action
Store and query spatial data with standard SQL:
-- Create a table for survey points
CREATE TABLE survey_points (
id SERIAL PRIMARY KEY,
geom GEOMETRY(PointZ, 2056), -- Swiss LV95 coordinate system
classification INTEGER,
intensity FLOAT,
captured_at TIMESTAMP DEFAULT NOW()
);
-- Create a spatial index
CREATE INDEX idx_survey_geom ON survey_points USING GIST(geom);
-- Insert a point
INSERT INTO survey_points (geom, classification, intensity)
VALUES (ST_SetSRID(ST_MakePoint(2600000, 1200000, 450.5), 2056), 2, 128.0);
-- Find all ground points within 100m of a location
SELECT id, ST_AsText(geom), intensity
FROM survey_points
WHERE classification = 2
AND ST_DWithin(geom, ST_SetSRID(ST_MakePoint(2600100, 1200100, 0), 2056), 100);
-- Spatial join: find which parcel each point belongs to
SELECT p.id, p.classification, parcels.owner
FROM survey_points p
JOIN parcels ON ST_Contains(parcels.geom, p.geom);
PostGIS is excellent here. The spatial index makes point-in-polygon queries fast even with millions of rows. ACID transactions mean two users can write simultaneously without data corruption. Your web API can serve single-record lookups in milliseconds.
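For intuition: in a projected, metre-based CRS like LV95, ST_DWithin boils down to a planar distance predicate per candidate row — the GIST index just prunes candidates by bounding box first so most rows are never tested. A toy sketch of the predicate alone, with made-up coordinates:

```python
import math

def within_distance(px, py, qx, qy, d):
    # Planar distance test -- roughly what ST_DWithin evaluates per
    # candidate in a projected CRS, after the index has pruned by bbox
    return math.hypot(px - qx, py - qy) <= d

# Hypothetical LV95 coordinates (metres)
print(within_distance(2600000, 1200000, 2600100, 1200100, 100))  # prints False (~141 m apart)
print(within_distance(2600000, 1200000, 2600050, 1200050, 100))  # prints True (~71 m apart)
```

The index is what makes this fast: without it, PostgreSQL would run the distance test on every row in the table.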
Where PostGIS Struggles
Now try this:
-- Compute average elevation per classification across 50M points
SELECT classification, AVG(ST_Z(geom)), COUNT(*)
FROM survey_points
GROUP BY classification;
On 50 million rows, this query will take minutes. PostgreSQL has to read every row from disk (row-based storage), extract the Z coordinate from each geometry, and aggregate. The spatial index does not help — there is no spatial filter to narrow the scan.
Or this:
-- Compute a 1m grid density map across the entire dataset
SELECT
FLOOR(ST_X(geom)) AS grid_x,
FLOOR(ST_Y(geom)) AS grid_y,
COUNT(*) AS point_count,
AVG(ST_Z(geom)) AS avg_elevation
FROM survey_points
GROUP BY grid_x, grid_y;
This is a full table scan with aggregation — exactly the workload PostgreSQL was not designed for. It will work, but it will be slow and use significant memory.
PostGIS is not the right tool when you need to:
- Scan and aggregate millions of rows
- Run exploratory analytics ("show me distribution of X across all data")
- Process batch ETL transformations
- Compute statistics across entire datasets
For these workloads, you need an OLAP engine.
What is DuckDB?
DuckDB is an in-process, columnar analytical database. Think of it as "SQLite for analytics" — it runs inside your application with no server, no configuration, and no external dependencies.
But do not let the simplicity fool you. DuckDB can process hundreds of millions of rows on a single machine, reads Parquet and CSV files directly (no import step), and has a growing spatial extension.
DuckDB Spatial in Action
-- Install and load the spatial extension
INSTALL spatial;
LOAD spatial;
-- Read a 50M-row Parquet file and analyze it directly
-- No import, no schema creation, no waiting
SELECT
classification,
COUNT(*) AS point_count,
AVG(z) AS avg_elevation,
MIN(z) AS min_z,
MAX(z) AS max_z
FROM read_parquet('survey_points_50M.parquet')
GROUP BY classification
ORDER BY point_count DESC;
On a standard laptop, this finishes in seconds. Not minutes. Seconds. The columnar storage means DuckDB only reads the classification and z columns — it never touches x, y, intensity, or any other field.
Spatial queries work too:
-- Compute point density per 100m grid cell
SELECT
FLOOR(x / 100) * 100 AS grid_x,
FLOOR(y / 100) * 100 AS grid_y,
COUNT(*) AS density,
AVG(z) AS avg_z
FROM read_parquet('survey_points_50M.parquet')
GROUP BY grid_x, grid_y
HAVING density > 100;
-- Spatial filtering with geometry
SELECT COUNT(*)
FROM read_parquet('survey_points_50M.parquet')
WHERE ST_Within(
ST_Point(x, y),
ST_GeomFromText('POLYGON((7.4 47.2, 7.5 47.2, 7.5 47.3, 7.4 47.3, 7.4 47.2))')
);
DuckDB also reads directly from remote files (this requires the httpfs extension — INSTALL httpfs; LOAD httpfs; — plus S3 credentials configured in your environment):
-- Query a Parquet file on S3 without downloading it
SELECT classification, COUNT(*)
FROM read_parquet('s3://my-bucket/lidar/zurich_2024.parquet')
GROUP BY classification;
Python Integration
DuckDB shines in Python workflows — it integrates natively with Pandas, Polars, and Arrow:
import duckdb
# Query a local Parquet file
result = duckdb.sql("""
SELECT
classification,
COUNT(*) as count,
AVG(z) as mean_z,
STDDEV(z) as std_z
FROM read_parquet('pointcloud.parquet')
GROUP BY classification
""").df()
print(result)
# classification count mean_z std_z
# 2 (Ground) 8234521 432.1 2.3
# 6 (Building) 3122843 445.7 5.1
# 3 (Vegetation) 5891204 438.9 8.7
You can also query Pandas DataFrames directly — no import needed:
import pandas as pd
import duckdb
# Load data from any source into a DataFrame
df = pd.read_csv('sensors.csv')
# Query it with SQL instantly
result = duckdb.sql("""
SELECT sensor_id, AVG(temperature), MAX(humidity)
FROM df
WHERE timestamp > '2026-01-01'
GROUP BY sensor_id
""").df()
Where DuckDB Struggles
DuckDB is not a server. It has no:
- Concurrent write access — one writer at a time
- User authentication — no roles, no permissions
- Network protocol — it runs in-process, not as a service
- Multi-user transactional guarantees — transactions work within a single process, but there is no row-level locking across concurrent writers
You cannot put DuckDB behind a web API serving 100 concurrent users. You cannot have two processes writing to the same DuckDB file. It is an analytical tool, not an application database.
Side-by-Side Comparison
| | PostGIS | DuckDB |
|---|---|---|
| Type | OLTP database | OLAP engine |
| Storage | Row-based | Column-based |
| Server | Yes (PostgreSQL daemon) | No (in-process, like SQLite) |
| Concurrency | Hundreds of users | Single process |
| Best at | CRUD, spatial joins, API serving | Aggregations, batch analytics, ETL |
| 50M row aggregation | Minutes | Seconds |
| Single row lookup by ID | Microseconds | Overkill |
| Updates | Yes (row-level, concurrent) | Append-mostly |
| Spatial indexing | GIST, SP-GIST (mature) | R-tree (basic, growing) |
| File format support | Import required | Reads Parquet, CSV, JSON directly |
| Use behind a web API | Yes | No |
| Setup complexity | Server installation, config | pip install duckdb |
| Geospatial maturity | 20+ years, 300+ functions | Young, but growing fast |
| Cost | Server + RAM + storage | Free, runs on your laptop |
When to Use PostGIS
Use PostGIS when your workload is transactional and interactive:
- Your web API needs to serve spatial queries to users (GET /api/points?bbox=...)
- Multiple users read and write concurrently
- You need spatial joins with other relational data (parcels, users, projects)
- You need ACID guarantees (financial data, official records)
- Your queries filter by spatial region and return small result sets
- You are managing metadata: project ownership, sharing, permissions
Example: A surveying platform where clients upload datasets, share them with team members, and query specific regions.
-- This is a PostGIS workload
SELECT p.id, p.name, ST_AsGeoJSON(p.bbox)
FROM projects p
JOIN project_members pm ON p.id = pm.project_id
WHERE pm.user_id = 42
AND ST_Intersects(p.bbox, ST_MakeEnvelope(7.4, 47.2, 7.6, 47.4, 4326));
Another example: An IoT platform receiving GPS coordinates from 10,000 fleet vehicles and serving their last-known positions to a dashboard.
-- Insert new position (happens thousands of times per second)
INSERT INTO vehicle_positions (vehicle_id, geom, speed, timestamp)
VALUES (742, ST_SetSRID(ST_MakePoint(8.54, 47.37), 4326), 62.5, NOW());
-- Query last position for a single vehicle (API endpoint)
SELECT vehicle_id, ST_AsGeoJSON(geom), speed, timestamp
FROM vehicle_positions
WHERE vehicle_id = 742
ORDER BY timestamp DESC
LIMIT 1;
PostGIS handles this effortlessly — high write throughput, fast single-row reads, concurrent access.
When to Use DuckDB
Use DuckDB when your workload is analytical and batch-oriented:
- You need to compute statistics across millions of rows
- You are building ETL pipelines (raw data → cleaned → transformed)
- You want to explore data interactively ("how many points per class?")
- You are processing files that do not live in a database
- You need to join multiple large files without importing them first
Example: A classification pipeline that processes raw LiDAR scans, computes quality metrics, and exports results.
import duckdb
# Analyze classification results across all scans in a directory
stats = duckdb.sql("""
SELECT
filename,
classification,
COUNT(*) as points,
AVG(z) as mean_z,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY z) as p95_z
FROM read_parquet('output/classified/*.parquet', filename=true)
GROUP BY filename, classification
ORDER BY filename, classification
""").df()
# Flag scans with suspicious class distributions
anomalies = stats[stats['classification'] == 2].query('points < 1000')
print(f"Warning: {len(anomalies)} scans with very few ground points")
Another example: An analytics dashboard for a geospatial platform — "Show me asset density per region for the last 12 months."
-- DuckDB reads directly from Parquet on S3
SELECT
region,
DATE_TRUNC('month', captured_at) AS month,
COUNT(*) AS asset_count,
COUNT(DISTINCT asset_type) AS unique_types
FROM read_parquet('s3://data-lake/assets/2025/*.parquet')
GROUP BY region, month
ORDER BY region, month;
This query scans hundreds of millions of rows across dozens of Parquet files and returns results in seconds — no data import, no schema setup.
Combining Both: A Modern Spatial Stack
The real power comes from using both together. PostGIS handles the application layer, DuckDB handles the analytics layer.
┌──────────────────────────────────────────────────┐
│ Your Application │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PostGIS │ │ DuckDB │ │
│ │ (OLTP) │ │ (OLAP) │ │
│ │ │ │ │ │
│ │ • User accounts │ │ • Batch stats │ │
│ │ • Project CRUD │ │ • ETL pipelines │ │
│ │ • API queries │ │ • Data quality │ │
│ │ • Spatial search │ │ • Reporting │ │
│ │ • Permissions │ │ • Exploration │ │
│ │ • Real-time data │ │ • Aggregations │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │ │
│ Serves the API Feeds dashboards │
│ Handles writes Reads from Parquet/S3 │
└──────────────────────────────────────────────────┘
Concrete example: A geospatial data platform.
PostGIS manages users, projects, and dataset metadata:
-- PostGIS: application data
CREATE TABLE datasets (
id UUID PRIMARY KEY,
name TEXT,
owner_id UUID REFERENCES users(id),
storage_path TEXT, -- path to Parquet files
bbox GEOMETRY(Polygon, 4326),
point_count BIGINT,
created_at TIMESTAMP
);
-- API query: "Show me all datasets near Zurich"
SELECT name, point_count, storage_path
FROM datasets
WHERE ST_DWithin(bbox, ST_SetSRID(ST_MakePoint(8.54, 47.37), 4326), 0.1)
AND owner_id = current_user_id();
DuckDB handles the heavy analytical lifting:
import duckdb
# DuckDB: analytics on the actual data
# User clicks "Show Statistics" on a dataset
def get_dataset_stats(storage_path: str) -> list[dict]:
stats = duckdb.sql(f"""
SELECT
classification,
COUNT(*) as points,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as pct,
ROUND(AVG(z), 2) as avg_z,
ROUND(STDDEV(z), 2) as std_z
FROM read_parquet('{storage_path}/*.parquet')
GROUP BY classification
ORDER BY points DESC
""").df()
return stats.to_dict(orient='records')
Each tool does what it does best. PostGIS serves the API and manages state. DuckDB crunches the numbers.
Other OLAP Options
DuckDB is not the only analytical engine with spatial capabilities. Depending on your scale and infrastructure, you might also consider:
ClickHouse
A server-based columnar database designed for real-time analytics at massive scale. If you have billions of rows and need sub-second queries with concurrent users, ClickHouse is worth evaluating. It has basic geo functions but is not as spatially mature as DuckDB's extension.
Best for: Real-time dashboards over billions of events, multi-user analytical workloads.
Google BigQuery / Snowflake / Databricks
Cloud-managed OLAP services with built-in spatial functions. If your data already lives in the cloud and you want zero infrastructure management, these are strong options. The trade-off is cost and vendor lock-in.
Best for: Teams already on cloud infrastructure who want managed analytics.
Apache Sedona (GeoSpark)
A distributed spatial engine on top of Apache Spark. If you need to process terabytes of spatial data across a cluster, Sedona handles the distribution. Heavy to set up, but scales horizontally.
Best for: Truly massive datasets (TB+) that cannot fit on a single machine.
When to Stick with DuckDB
For most teams — especially small companies, consultants, and single-engineer projects — DuckDB is the right starting point. Zero infrastructure, runs on a laptop, handles hundreds of millions of rows. You do not need ClickHouse until you have billions of rows and concurrent analytical users. You do not need Spark until your data exceeds a single machine's RAM.
Practical Tips
Start Simple
If you only have PostGIS and your dataset fits in memory, do not add DuckDB just because this article exists. PostGIS can handle analytical queries on datasets up to a few million rows without problems. Add complexity only when you hit actual performance walls.
Store Raw Data as Parquet
If you are collecting spatial data at scale, store the raw data as Parquet files (not in PostgreSQL). This gives you:
- Cheap storage (S3 or local disk)
- Direct queryability with DuckDB (no import)
- Easy backup and versioning
- Freedom to switch analytical engines later
PostGIS for Metadata, DuckDB for Data
A good rule of thumb: if the table has fewer than 1 million rows and serves an API, put it in PostGIS. If the table has tens of millions of rows and serves dashboards or reports, keep it as Parquet and query with DuckDB.
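If it helps, that rule of thumb can be written down as a toy decision function. This is illustrative only — the thresholds are the rough ones above, not hard limits:

```python
def choose_engine(row_count: int, serves_api: bool) -> str:
    """Toy encoding of the rule of thumb -- illustrative, not a law."""
    if serves_api and row_count < 1_000_000:
        return "postgis"           # small, interactive, needs concurrency
    if row_count >= 10_000_000 and not serves_api:
        return "duckdb"            # big, batch, dashboards/reports
    return "benchmark first"       # grey zone: measure before deciding

print(choose_engine(50_000, True))        # prints postgis
print(choose_engine(80_000_000, False))   # prints duckdb
```

The grey zone in the middle is real: a 5M-row table serving an API is fine in PostGIS, and a 5M-row table feeding a nightly report is fine in either.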
Use DuckDB for One-Off Analysis
DuckDB is perfect for questions you ask once: "How many ground points do we have across all 2024 datasets?" You do not need to set up a database for this — just point DuckDB at your Parquet files and query.
# One-liner from the command line
duckdb -c "SELECT classification, COUNT(*) FROM read_parquet('*.parquet') GROUP BY 1"
Benchmark Before Deciding
If you are unsure whether PostGIS is fast enough, benchmark it. Run your actual query on your actual data. If PostGIS returns results in under a second, there is no reason to add DuckDB. If it takes minutes, now you have a reason.
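A minimal timing harness for that benchmark might look like this — the workload below is a stdlib dummy; swap in your actual PostGIS or DuckDB query call:

```python
import time

def benchmark(fn, repeats=5):
    """Run fn several times and return the best wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)  # best-of-N filters out warm-up and OS noise

# Dummy workload standing in for e.g. a psycopg2 or duckdb query call
best = benchmark(lambda: sum(range(100_000)))
print(f"best of 5: {best:.4f}s")
```

Best-of-N (rather than the mean) is the usual choice for micro-benchmarks because caches and connection warm-up skew the first runs upward.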
Summary
There is no single database that handles all spatial workloads well. The choice depends on your access pattern:
| Access Pattern | Tool | Why |
|---|---|---|
| Serve an API, manage users, spatial lookups | PostGIS | ACID, concurrent access, spatial indexing, 20 years of maturity |
| Scan millions of rows, compute statistics | DuckDB | Columnar storage, in-process, reads files directly, zero setup |
| Both: app + analytics | PostGIS + DuckDB | Each layer does what it does best |
The key insight is simple: row-based databases are fast at finding one row. Column-based engines are fast at scanning all rows. Once you internalize this, the choice for any given workload becomes obvious.
PostGIS is not slow. DuckDB is not better. They solve different problems. The mature spatial architect picks the right tool for each layer of the stack — and lets each one do what it was built for.