<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Paul Ramsey | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/paul-ramsey/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/paul-ramsey</link>
<image><url>https://www.crunchydata.com/build/_assets/paul-ramsey.png-POMCJCK4.webp</url>
<title>Paul Ramsey | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/paul-ramsey</link>
<width>834</width>
<height>835</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Tue, 09 Dec 2025 08:00:00 EST</pubDate>
<dc:date>2025-12-09T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ PostGIS Performance: Simplification ]]></title>
<link>https://www.crunchydata.com/blog/postgis-performance-simplification</link>
<description><![CDATA[ Slim down the size of geometries with ST_Simplify. Also learn about ST_SimplifyVW, ST_RemoveRepeatedPoints, ST_SnapToGrid, ST_ReducePrecision, and ST_CoverageClean to make your PostGIS as snappy as ever. ]]></description>
<content:encoded><![CDATA[ <p>There’s nothing simple about simplification! It is very common to want to slim down the size of geometries, and there are lots of different approaches to the problem.<p>We will explore the different methods using this rendering of the letter “a”, generated with <a href=https://postgis.net/docs/ST_Letters.html>ST_Letters</a>.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/87b6bf75-85ef-4198-0bc3-6ff8ee860f00/public><pre><code class=language-sql>SELECT ST_Letters('a');
</code></pre><p>This is a good starting point, but to show the different effects of different algorithms on things like redundant linear points, we need a shape with more vertices along the straights, and fewer along the curves.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7ef69e54-94ed-4c81-2ac0-49b908572f00/public><pre><code class=language-sql>SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1);
</code></pre><p>Here we add vertices every one meter with <a href=https://postgis.net/docs/ST_Segmentize.html>ST_Segmentize</a>, then use <a href=https://postgis.net/docs/ST_RemoveRepeatedPoints.html>ST_RemoveRepeatedPoints</a> to thin out the points along the curves. Already we are simplifying!<p>Let’s apply the same “remove repeated” algorithm with a 10 meter tolerance.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f9915d0e-ee91-4c92-3953-9c6bbd6c3f00/public><pre><code class=language-sql>WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_RemoveRepeatedPoints(a, 10) FROM a;
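
-- A quick check on how much the tolerance thins the shape
-- (an illustrative query, not from the original post):
WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_NPoints(a) AS n_before,
       ST_NPoints(ST_RemoveRepeatedPoints(a, 10)) AS n_after
  FROM a;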
</code></pre><p>We do have a lot fewer points, and the constant-angle curves are well preserved, but some straight lines are no longer legible as such, and there are redundant vertices in the vertical straight lines.<p>The <a href=https://postgis.net/docs/ST_Simplify.html>ST_Simplify</a> function applies the <a href=https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm>Douglas-Peucker</a> line simplification algorithm to the rings of the polygon. Because it is a line simplifier, it does a cruder job of preserving areal aspects of the polygon, like the squareness of the top ligature.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/30262e64-b2e3-4972-ca52-5cd0b6cdc100/public><pre><code class=language-sql>WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_Simplify(a, 1) FROM a;
</code></pre><p>The <a href=https://postgis.net/docs/ST_SimplifyVW.html>ST_SimplifyVW</a> function applies the Visvalingam–Whyatt algorithm to the rings of the polygon. Visvalingam–Whyatt is better at preserving the shapes of polygons than Douglas-Peucker, but the differences are subtle.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3a27cbbf-963f-418e-1e73-48fb92344600/public><pre><code class=language-sql>WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_SimplifyVW(a, 5) FROM a;
</code></pre><p>Coercing a shape onto a fixed precision grid is another form of simplification, sometimes used to force the edges of adjacent objects to line up exactly. The original such function, <a href=https://postgis.net/docs/ST_SnapToGrid.html>ST_SnapToGrid</a>, does exactly what its name says: every vertex is rounded to a fixed grid point.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/c5d2b0aa-8e11-4108-2f81-3b903fc63200/public><pre><code class=language-sql>WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_SnapToGrid(a, 5) FROM a;
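
-- The snapped output can be tested for validity with ST_IsValid
-- (an illustrative check, not from the original post):
WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_IsValid(ST_SnapToGrid(a, 5)) FROM a;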
</code></pre><p>However, as you can see at the top left, the grid snapper frequently generates invalidity in polygons, such as the self-intersecting ring in this example.<p>A more modern alternative is precision reduction.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/5b44a882-66f6-4009-d2a8-88115dac5200/public><pre><code class=language-sql>WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_ReducePrecision(a, 5) FROM a;
</code></pre><p>The <a href=https://postgis.net/docs/ST_ReducePrecision.html>ST_ReducePrecision</a> function not only snaps geometries to a fixed precision grid, it also ensures that outputs are always valid.<p>Because grid snapping tends to introduce a lot of vertices along straight edges, combining it with a line simplifier makes a lot of sense.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/6b7d8270-4d68-4a95-2afe-9a1f1a7ef000/public><pre><code class=language-sql>WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_Simplify(ST_ReducePrecision(a, 5),1) FROM a;
</code></pre><p>Simplifying single geometries is all well and good, but what about simplifying groups of geometries? Specifically, ones that share boundaries?<p>Fortunately, since PostGIS 3.6 there is a complete set of functions for that problem.<p>We start with a pair of polygons whose shared boundary does not match.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/87cc07bf-1bad-4b6d-9870-108204f06d00/public><p>Non-clean boundaries can be cleaned up with the <a href=https://postgis.net/docs/ST_CoverageClean.html>ST_CoverageClean</a> window function.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/28bdf680-61db-4403-f7a7-c8f8df052100/public><pre><code class=language-sql>SELECT ST_CoverageClean(geom) OVER () AS geom FROM polys;
</code></pre><p>And once the coverage is clean, the shapes, including their shared borders, can be simplified with <a href=https://postgis.net/docs/ST_CoverageSimplify.html>ST_CoverageSimplify</a>.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7c683fde-0aa2-4225-9f40-5a4f3ef0d200/public><pre><code class=language-sql>WITH clean AS (
  SELECT ST_CoverageClean(geom) OVER () AS geom FROM polys
)
SELECT ST_CoverageSimplify(geom, 10) OVER () AS geom FROM clean;
</code></pre> ]]></content:encoded>
<category><![CDATA[ PostGIS Performance ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">457287f269c3bd9a2b4a707e0d61a3a311672065875f1932040c593ac774b7b5</guid>
<pubDate>Tue, 09 Dec 2025 08:00:00 EST</pubDate>
<dc:date>2025-12-09T13:00:00.000Z</dc:date>
<atom:updated>2025-12-09T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ PostGIS Performance: Data Sampling ]]></title>
<link>https://www.crunchydata.com/blog/postgis-performance-data-sampling</link>
<description><![CDATA[ Paul shows off some tricks for sampling data, instead of querying everything. This works for regular Postgres queries too! ]]></description>
<content:encoded><![CDATA[ <p>One of the temptations database users face, when presented with a huge table of interesting data, is to run queries that interrogate every record. Got a billion measurements? What’s the average of that?!<p>One way to find out is to just calculate the average.<pre><code class=language-sql>SELECT avg(value) FROM mytable;
</code></pre><p>For a billion records, that could take a while!<p>Fortunately, the “Law of Large Numbers” is here to bail us out, stating that the average of a sample approaches the average of the population, as the sample size grows. And amazingly, the sample does not even have to be particularly large to be quite close.<p>Here’s a table of 10M values, randomly generated from a normal distribution. We know the average is zero. What will a sample of 10K values tell us it is?<pre><code class=language-sql>CREATE TABLE normal AS
  SELECT random_normal(0,1) AS values
    FROM generate_series(1,10000000);
</code></pre><p>We can take a sample using a sort, or using the <code>random()</code> function, but both of those techniques first scan the whole table, which is exactly what we want to avoid.<p>Instead, we can use the PostgreSQL <code>TABLESAMPLE</code> feature to get a quick sample of the pages in the table and an estimate of the average.<pre><code class=language-sql>SELECT avg(values)
  FROM normal TABLESAMPLE SYSTEM (1);
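
-- For contrast (an illustrative query, not from the original post):
-- a random() filter takes a similar-sized sample, but still scans
-- all ten million rows first.
SELECT avg(values)
  FROM normal
 WHERE random() < 0.001;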
</code></pre><p>I get an answer – 0.0031, very close to the population average – and it takes just 43 milliseconds.<p>Can this work with spatial? For the right data, it can. Imagine you had a table that had one point in it for every person in Canada (36 million of them) and you wanted to find out how many people lived in Toronto (or this red circle around Toronto).<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/97df422e-dc48-43c4-4976-a76dac474100/public><pre><code class=language-sql>SELECT count(*)
  FROM census_people
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);
</code></pre><p>The answer is 5,010,266, and it takes 7.2 seconds to return. What if we take a 10% sample?<pre><code class=language-sql>SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (10)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);
</code></pre><p>The sample is 10%, and the answer comes back as 508,292 (near one tenth of our actual measurement) in 2.2 seconds. What about a 1% sample?<pre><code class=language-sql>SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);
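
-- When page-level clustering would bias the result, the BERNOULLI method
-- samples individual rows instead of pages, at the cost of visiting every
-- page (an illustrative variant, not from the original post):
SELECT count(*)
  FROM census_people TABLESAMPLE BERNOULLI (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);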
</code></pre><p>The sample is 1%, and the answer comes back as 50,379 (near one hundredth of our actual measurement) in 0.2 seconds. Still a good estimate!<p>Is this black magic? No, the <code>TABLESAMPLE SYSTEM</code> mode gets its speed by reading pages randomly. In our last example, it randomly chose 1% of the pages. Here’s what that looks like in Toronto.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/5aecb50d-6b63-453b-3d96-0c2d55d82400/public><p>See in particular how blotchy the data are in the suburban areas outside the circle. The data in the table are not randomly distributed across the pages: they came from the census data in order, and ended up loaded into the database in order. So for any given database page, the actual rows in the page will tend to be near to one another.<p>This works for this example because the amount of data is high, and the area we are summarizing is a large proportion of the total data – a seventh of the Canadian population lives in that circle.<p>If we were summarizing a smaller area, the results would not have been so good.<p><code>TABLESAMPLE SYSTEM</code> is a powerful tool, but <strong>you have to be sure that any given page has a random selection of the data you are sampling for</strong>. Our random normal example worked perfectly, because the data were perfectly random. A sample of time series data would not work well for sampling time windows (the data were probably stored in order of arrival) but might work for sampling some other value. ]]></content:encoded>
<category><![CDATA[ PostGIS Performance ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">e72f061428ac799d9d20d237d604aff51c0c0fa58b180bbff4ee094e412d0245</guid>
<pubDate>Fri, 21 Nov 2025 08:00:00 EST</pubDate>
<dc:date>2025-11-21T13:00:00.000Z</dc:date>
<atom:updated>2025-11-21T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ PostGIS Performance: Intersection Predicates and Overlays ]]></title>
<link>https://www.crunchydata.com/blog/postgis-performance-intersection-predicates-and-overlays</link>
<description><![CDATA[ What is the difference between the boolean true / false ST_Intersects and ST_Contains and the overlay options of ST_Intersection and ST_Difference? Also, combining these two ideas can get you really fast queries for geometries fully contained inside areas. ]]></description>
<content:encoded><![CDATA[ <p>In this <a href=https://www.crunchydata.com/blog/topic/postgis-performance>series</a>, we talk about the many different ways you can speed up PostGIS. A common geospatial operation is to clip out a collection of smaller shapes that are contained within a larger shape. Today let's review the most efficient ways to query for things <strong>inside</strong> something else.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f257c746-d3ab-40b7-d3e6-32ad2211f900/public><p>Frequently the smaller shapes are clipped where they cross the boundary, using the ST_Intersection function.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/e0552ac5-a465-4366-a199-ab62b6a2b800/public><p>The naive SQL is a simple spatial join on ST_Intersects.<pre><code class=language-sql>SELECT ST_Intersection(polygon.geom, p.geom) AS geom
  FROM parcels p
  JOIN polygon
    ON ST_Intersects(polygon.geom, p.geom);
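
-- The fully contained parcels alone can be found with the cheaper
-- ST_Contains predicate (an illustrative query, not from the original post):
SELECT p.geom
  FROM parcels p
  JOIN polygon
    ON ST_Contains(polygon.geom, p.geom);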
</code></pre><p>When run on the small test area shown in the pictures, the query takes about 14ms. That’s fast, but the problem is small, and larger operations will be slower.<p>There is a simple way to speed up the query that takes advantage of the fact that <strong>boolean spatial predicates are faster than spatial overlay operations</strong>.<p>What?<ul><li>“Boolean spatial predicates” are functions like <a href=https://postgis.net/docs/ST_Intersects.html>ST_Intersects</a> and <a href=https://postgis.net/docs/ST_Contains.html>ST_Contains</a>. They take in two geometries and return “true” or “false” for whether the geometries pass the named test.<li>“Spatial overlay operations” are functions like <a href=https://postgis.net/docs/ST_Intersection.html>ST_Intersection</a> or <a href=https://postgis.net/docs/ST_Difference.html>ST_Difference</a> that take in two geometries, and generate a new geometry based on the named rule.</ul><p>Predicates are faster because their tests often allow for logical short circuits (once you find any two edges that intersect, you know the geometries intersect) and because they can make use of the <a href=https://libgeos.org/doxygen/classgeos_1_1geom_1_1prep_1_1PreparedGeometry.html>prepared geometry optimizations</a> to cache and index edges between function calls.<p>The speed-up for spatial overlay simply observes that, for most overlays there is a large set of features that can be added to the result set unchanged – the features that are fully contained in the clipping shape. We can identify them using ST_Contains.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/392aa63d-d53f-497e-140f-ede402e15a00/public><p>Similarly, there is a smaller set of features that cross the border, and thus do need to be clipped. 
These are the features that pass ST_Intersects but fail ST_Contains.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/1d76ee16-b712-475f-d15a-437b6c311100/public><p>The higher-performance query uses the faster predicates to filter the smaller shapes into two streams: one for intersection, and one for unchanged inclusion.<pre><code class=language-sql>SELECT
  CASE
    WHEN ST_Contains(polygon.geom, p.geom) THEN p.geom
    ELSE ST_Intersection(polygon.geom, p.geom)
    END AS geom
  FROM parcels p
  JOIN polygon
    ON ST_Intersects(polygon.geom, p.geom);
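
-- The border-crossing subset (parcels that intersect but are not contained)
-- can be isolated the same way (an illustrative query, not from the original post):
SELECT p.geom
  FROM parcels p
  JOIN polygon
    ON ST_Intersects(polygon.geom, p.geom)
 WHERE NOT ST_Contains(polygon.geom, p.geom);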
</code></pre><p>Two predicates are used here: the ST_Intersects in the join clause ensures that only parcels that might participate in the overlay are fed into the CASE statement, where the ST_Contains predicate no-ops the parcels that do not cross the boundary.<p>When run against our tiny example, the query executes in just 9ms. Amazing that the difference is large enough to measure on such a small example.<h3 id=using-case-statement-to-combine-predicates-and-overlays><a href=#using-case-statement-to-combine-predicates-and-overlays>Using a <code>CASE</code> statement to combine predicates and overlays</a></h3><p>The core idea here is to recognize that boolean spatial predicates like <code>ST_Contains</code> and <code>ST_Intersects</code> are computationally much faster than spatial overlay operations like <code>ST_Intersection</code>. The standard, but slow, approach clips all intersecting features. The optimized method uses a <code>CASE</code> statement and an <code>ST_Contains</code> check to create a shortcut: if a smaller geometry is entirely contained within the larger clipping polygon, we return the geometry unchanged (a quick no-op) and completely bypass the slower <code>ST_Intersection</code> calculation.<p>You can apply this optimization pattern to any PostGIS work involving clipping, spatial joins, or overlays where you suspect a significant number of features might be fully contained within a boundary. By filtering and partitioning your geometries into "fully contained" (fast path) and "crossing the border" (slow path) streams, you ensure the expensive overlay operations are only executed when they are strictly necessary to clip the edges.<p><br><br><strong>Need more PostGIS?</strong><br>Join us this year on November 20 for <a href=https://www.snowflake.com/postgis-day-2025/>PostGIS Day 2025</a>, a free, virtual, community event about open source geospatial! <br><br> ]]></content:encoded>
<category><![CDATA[ PostGIS Performance ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">1d5f0e4fe1a74ee90a994b7a01cdd6bca41d6acc81bf53be7029783577937984</guid>
<pubDate>Fri, 14 Nov 2025 08:00:00 EST</pubDate>
<dc:date>2025-11-14T13:00:00.000Z</dc:date>
<atom:updated>2025-11-14T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ PostGIS Performance: Improve Bounding Boxes with Decompose and Subdivide ]]></title>
<link>https://www.crunchydata.com/blog/postgis-performance-improve-bounding-boxes-with-decompose-and-subdivide</link>
<description><![CDATA[ Large multi-part geometries can have massive poorly-fitting bounding boxes that cover large areas of the ocean. Paul shows you how to make bounding boxes more efficient by decomposing with ST_Dump and subdividing with ST_Subdivide.  ]]></description>
<content:encoded><![CDATA[ <p>In the third installment of the <a href=https://www.crunchydata.com/blog/topic/postgis-performance>PostGIS Performance series</a>, I wanted to talk about performance around bounding boxes.<p>Geometry data is different from most column types you find in a relational database. The objects in a geometry column can be wildly different in the amount of the data domain they cover, and in the amount of physical space they take up on disk.<p>The “admin0” Natural Earth data range from the 1.2 hectare Vatican City to the 1.6 billion hectare Russia, and from the 4-point polygon defining Serranilla Bank to the 68 thousand points of the polygons defining Canada.<pre><code class=language-sql>SELECT ST_NPoints(geom) AS npoints, name
FROM admin0
ORDER BY 1 DESC LIMIT 5;

SELECT ST_Area(geom::geography) AS area, name
FROM admin0
ORDER BY 1 DESC LIMIT 5;
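
-- The full bounding box of a single country can be inspected with ST_Extent
-- (an illustrative query, not from the original post):
SELECT ST_Extent(geom)
  FROM admin0
 WHERE name = 'France';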
</code></pre><p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/b8c16876-23fe-4068-0bd9-695e23511200/public><p>As you can imagine, polygons this different will have different performance characteristics:<ul><li>Physically large objects will take longer to work with: to pull off the disk, to scan, to calculate with.<li>Geographically large objects will cover more other objects, and reduce the effectiveness of your indexes.</ul><p>Your spatial indexes are “r-tree” indexes, where each object is represented by a bounding box.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/972f47d3-218b-461d-90df-aa2ef6803000/public><p>The bounding boxes can overlap, and it is possible for some boxes to cover a lot of the dataset.<p>For example, here is the bounding box of France.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/1ed4545c-5874-4497-c39c-4d853e9e9a00/public><p>What?! How is that France? Well, France is more than just the European parts: it also includes the island of Reunion, in the southern Indian Ocean, and the island of Guadeloupe, in the Caribbean. Taken together they result in this very large bounding box.<p>Such a large box makes a poor addition to the spatial index of all the objects in “admin0”. I could be searching with a query key in the middle of the Atlantic, and the index would still be telling me “maybe it is in France?”.<p>For this testing, I have made a synthetic dataset of one million random points covering the whole world.<pre><code class=language-sql>CREATE TABLE random_normal AS
  SELECT id,
    ST_Point(
      random_normal(0, 180),
      random_normal(0, 80),
      4326) AS geom
  FROM generate_series(0, 1000000) AS id;


CREATE INDEX random_normal_geom_x ON random_normal USING GIST (geom);
</code></pre><p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/09a9b6bc-7194-4ded-c291-28ee5d6a3000/public><p>The un-altered bounds of “admin0”, the bounds that will be used to run the spatial join, look like this. Lots of overlap, lots of places where the bounds cover areas the polygons do not.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/82d7afed-f68b-407c-86d4-0caedd19a700/public><p>The baseline time to do a spatial join using the un-altered “admin0” data is 9 seconds.<pre><code class=language-sql>SELECT Count(*), admin0.name
  FROM admin0 JOIN random_normal
    ON ST_Intersects(random_normal.geom, admin0.geom)
  GROUP BY admin0.name;
</code></pre><p>What if, instead of joining against the raw “admin0” – which includes weird cases like France and a Canada with hundreds of islands – we first decompose every object into the singular polygons that make it up, using <a href=https://postgis.net/docs/ST_Dump.html>ST_Dump</a>?<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/d364439e-e204-465f-ff3d-efcd419cb700/public><pre><code class=language-sql>WITH polys AS (
  SELECT (ST_Dump(geom)).geom AS geom, name
  FROM admin0
)
SELECT Count(*), polys.name
FROM polys JOIN random_normal
ON ST_Intersects(random_normal.geom, polys.geom)
GROUP BY polys.name;
</code></pre><p>There is still a lot of ocean being queried here, and also some of the polygons are not just very spatially large, but include a lot of vertices. What if we make the polygons smaller yet by chopping them up with <a href=https://postgis.net/docs/ST_Subdivide.html>ST_Subdivide</a>?<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/d4a887cc-ac69-4709-2416-ea3ebb5b0200/public><p>These bounds are almost perfect: they cover very little of the ocean, and they also cap each polygon at the 128-vertex limit passed in the query. And the performance, even including the very expensive subdivision step, gets faster yet.<pre><code class=language-sql>WITH polys AS (
  SELECT ST_Subdivide(geom,128) AS geom, name FROM admin0
)
SELECT Count(*), polys.name
FROM polys JOIN random_normal
ON ST_Intersects(random_normal.geom, polys.geom)
GROUP BY polys.name;
</code></pre><p>The final query takes just 1.8 seconds, twice as fast as the decomposed polygons, and five times faster than the naive spatial join. For smaller collections of points, the naive approach can work as fast as the subdivision, but for this 1M point test set the overhead of doing the subdivision is still far less than the gains from using the more effective bounds.<p>Investing computation into creating better, smaller, and simpler geometries pays off significantly for large datasets by making the spatial index much more effective.<p><br><br><strong>Need more PostGIS?</strong><br>Join us this year on November 20 for <a href=https://www.snowflake.com/postgis-day-2025/>PostGIS Day 2025</a>, a free, virtual, community event about open source geospatial! <br><br> ]]></content:encoded>
<category><![CDATA[ PostGIS Performance ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">28b2f1dc071dde118799314a159732e42dd8101f8468029d114f8ebe98c7320b</guid>
<pubDate>Thu, 06 Nov 2025 08:00:00 EST</pubDate>
<dc:date>2025-11-06T13:00:00.000Z</dc:date>
<atom:updated>2025-11-06T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ PostGIS Performance: pg_stat_statements and Postgres tuning ]]></title>
<link>https://www.crunchydata.com/blog/postgis-performance-postgres-tuning</link>
<description><![CDATA[ PostGIS performance basics. Second post in a series covering pg_stat_statements, shared buffers, work_mem, and parallel queries. ]]></description>
<content:encoded><![CDATA[ <p>In this <a href=https://www.crunchydata.com/blog/topic/postgis-performance>series</a>, we talk about the many different ways you can speed up PostGIS. Today let’s talk about looking across the queries with pg_stat_statements and some basic tuning.<h2 id=showing-postgres-query-times-with-pg_stat_statements><a href=#showing-postgres-query-times-with-pg_stat_statements>Showing Postgres query times with pg_stat_statements</a></h2><p>A reasonable question to ask, if you are managing a system with variable performance is: “what queries on my system are running slowly?”<p>Fortunately, PostgreSQL includes an extension called “pg_stat_statements” that tracks query performance over time and maintains a list of high cost queries.<pre><code class=language-sql>CREATE EXTENSION pg_stat_statements;
</code></pre><p>Now you will have to leave your database running for a while, so the extension can gather up data about the kind of queries that are run on your database.<p>Once it has been running for a while, you have a whole view – <code>pg_stat_statements</code> – that collects your query statistics. You can query it directly with <code>SELECT *</code> or you can write individual queries to find the slowest queries, the longest running ones, and so on.<p>Here is an example that lists the 10 slowest queries, ranked by mean execution time.<pre><code class=language-sql>SELECT
  total_exec_time,
  mean_exec_time,
  calls,
  rows,
  query
FROM pg_stat_statements
WHERE calls > 0
ORDER BY mean_exec_time DESC
LIMIT 10;
</code></pre><p>“pg_stat_statements” is good at finding individual queries to tune, and the most frequent cause of slow queries is just inefficient SQL or a need for <a href=https://www.crunchydata.com/blog/postgis-performance-indexing-and-explain>indexing</a> – see the first post in the series.<p>Occasionally, though, performance issues crop up at the system level. The most frequent culprit is memory pressure. PostgreSQL ships with conservative default settings for memory usage, and some workloads benefit from more memory.<h3 id=shared-buffers><a href=#shared-buffers>Shared buffers</a></h3><p>A database server looks like an infinite, accessible, reliable bucket of data. In fact, the server orchestrates data between the disk – which is permanent and slow – and the random access memory – which is volatile and fast – in order to provide the illusion of such a system.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/5f3e27bd-de65-4fa6-6dbf-98c5543b5900/public><p>When the balance between slow storage and fast memory is out of whack, system performance falls. When data being read is found in fast memory, that is a “cache hit”; when it is not, the read continues on to the slow disk (a “cache miss”).<p>You can check the balance of your system by looking at the “cache hit ratio”.<pre><code class=language-sql>SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) +  sum(heap_blks_read)) as ratio
FROM
  pg_statio_user_tables;
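
-- If the ratio sits well below 0.99, shared_buffers may be too small.
-- Raising it requires a server restart; the value below is illustrative
-- (roughly 25% of RAM on a 16GB server), not from the original post:
-- ALTER SYSTEM SET shared_buffers = '4GB';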
</code></pre><p>A result in the 99% range is a good sign. Below 90% means that your database could be memory constrained, so increasing the “shared_buffers” parameter may help. As a general rule, “shared buffers” should be about 25% of physical RAM.<h3 id=working-memory><a href=#working-memory>Working memory</a></h3><p>Working memory is controlled by the “work_mem” parameter, and it controls how much memory is available for in-memory sorting, index building, and other short-term processes. Unlike the “shared buffers”, which are permanent and fully allocated on startup, the “working memory” is allocated on an as-needed basis.<p>However, the working memory limit is applied for each database connection, so it is possible for the total working memory to radically exceed the “work_mem” value. If 1000 connections each allocate 100MB, your server will probably run out of memory.<p>You can speed up known memory-hungry processes, like building spatial indexes, by temporarily increasing the working memory available to your particular connection, then reducing it when the process is complete.<pre><code class=language-sql>SET work_mem = '2GB';
CREATE INDEX roads_geom_x ON roads USING GIST (geom);
SET work_mem = '100MB';
</code></pre><p>The same principle holds for maintenance tasks, like the “VACUUM” command. You can speed up the maintenance of a large table by increasing the “maintenance_work_mem” temporarily.<pre><code class=language-sql>SET maintenance_work_mem = '2GB';
VACUUM roads;
SET maintenance_work_mem = '128MB';
</code></pre><h3 id=parallelism><a href=#parallelism>Parallelism</a></h3><p>It is common for modern database servers to have multiple CPU cores available, but your PostgreSQL configuration may not be tuned to use them all. Postgres does have <a href=https://www.crunchydata.com/blog/parallel-queries-in-postgres>parallel query support</a>. PostgreSQL is conservative about making use of multiple cores, because executing and coordinating multi-process queries has overheads, but in general large aggregations or scans can frequently make effective use of two to four cores at once.<p>Check what limits are set on your database.<pre><code class=language-sql>SHOW max_worker_processes;

SHOW max_parallel_workers;
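
-- Raising the limits (the values are illustrative, not from the original
-- post; match them to your core count, and note that changing
-- max_worker_processes requires a restart):
-- ALTER SYSTEM SET max_worker_processes = 8;
-- ALTER SYSTEM SET max_parallel_workers = 8;
-- ALTER SYSTEM SET max_parallel_workers_per_gather = 4;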
</code></pre><p>Setting the maximums to the number of cores on your server is good practice. In particular, don’t be afraid to reduce the number of workers if you have fewer cores – there is no benefit to be had in workers contending for cores.<h2 id=tuning-postgres-basics><a href=#tuning-postgres-basics>Tuning Postgres basics</a></h2><p>To wrap up:<ul><li>Check the slowest queries with pg_stat_statements.<li>Use <a href=https://www.crunchydata.com/blog/postgis-performance-indexing-and-explain>EXPLAIN and Indexing</a> to experiment with improvements.<li>Check for inefficient memory use by looking at:<ul><li>shared buffers<li>working memory (work_mem)<li>parallelism</ul></ul><p>After you do some tuning, don’t forget to <a href=https://docs.crunchybridge.com/guides/refreshing-statistics#when-to-reset-pg_stat_statements>reset pg_stat_statements</a> and check again to see if/how things have improved!<p><br><br><strong>Need more PostGIS?</strong><br>Join us this year on November 20 for <a href=https://www.snowflake.com/postgis-day-2025/>PostGIS Day 2025</a>, a free, virtual, community event about open source geospatial! <br><br> ]]></content:encoded>
<category><![CDATA[ PostGIS Performance ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">271b87cca2e30243304f9ebd7d2afbf4877776350b9e81a979f692f59f4c0f26</guid>
<pubDate>Mon, 20 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-20T13:00:00.000Z</dc:date>
<atom:updated>2025-10-20T13:00:00.000Z</atom:updated></item></channel></rss>