<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Marco Slot | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/marco-slot/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/marco-slot</link>
<image><url>https://www.crunchydata.com/build/_assets/marco-slot.png-TPE3Y5IS.webp</url>
<title>Marco Slot | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/marco-slot</link>
<width>400</width>
<height>400</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Tue, 17 Dec 2024 08:30:00 EST</pubDate>
<dc:date>2024-12-17T13:30:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ pg_incremental: Incremental Data Processing in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/pg_incremental-incremental-data-processing-in-postgres</link>
<description><![CDATA[ We are excited to release a new open source extension called pg_incremental. pg_incremental works with pg_cron to do incremental batch processing for data aggregations, data transformations, or imports/exports. ]]></description>
<content:encoded><![CDATA[ <p>Today I’m excited to introduce <a href=https://github.com/crunchydata/pg_incremental>pg_incremental</a>, a new open source PostgreSQL extension for automated, incremental, reliable batch processing. This extension helps you create processing pipelines for append-only streams of data, such as IoT / time series / event data workloads.<p>Notable pg_incremental use cases include:<ul><li>Creation and incremental maintenance of rollups, aggregations, and interval aggregations<li>Incremental data transformations<li>Periodic import or export of new data using standard SQL</ul><p>After you set up a pg_incremental pipeline, it runs forever until you tell Postgres to stop. There’s a lot you can do with pg_incremental, and we have a lot of thoughts on why we think it’s valuable. To help you navigate, you can jump directly to the example that is most relevant to you:<ul><li><a href=/blog/pg_incremental-incremental-data-processing-in-postgres#why-incremental-processing>Why incremental processing</a><li><a href=/blog/pg_incremental-incremental-data-processing-in-postgres#practical-data-pipelines-using-parameterized-sql>Practical data pipelines using parameterized SQL</a><li><a href=/blog/pg_incremental-incremental-data-processing-in-postgres#example-1-unpacking-raw-json-data-with-a-sequence-pipeline>Example 1: Unpacking raw JSON data with a sequence pipeline</a><li><a href=/blog/pg_incremental-incremental-data-processing-in-postgres#example-2-complex-aggregations-with-a-time-interval-pipeline>Example 2: Complex aggregations with a time interval pipeline</a><li><a href=/blog/pg_incremental-incremental-data-processing-in-postgres#example-3-periodic-export-to-parquet-in-s3-with-a-time-interval-pipeline>Example 3: Periodic export to Parquet in S3 with a time interval pipeline</a><li><a href=/blog/pg_incremental-incremental-data-processing-in-postgres#example-4-import-new-files-with-a-file-list-pipeline>Example 4: Import new files with a file list pipeline</a></ul><h2 id=why-incremental-processing><a href=#why-incremental-processing>Why incremental processing?</a></h2><p>My team has been working on handling data-intensive workloads in PostgreSQL for many years. The most data-intensive workloads are usually the ones with a machine-generated stream of event data, and we often find that the best solution for handling those workloads in PostgreSQL involves incremental data processing.<p>For example, a common pattern in PostgreSQL is to periodically pre-aggregate incoming event data into a summary table. In that model, writes (esp. batch loads) are fast because they do not trigger any immediate processing. The incremental aggregation is fast because it only processes new rows, and queries from dashboards are fast because they hit an indexed summary table. I originally developed <a href=https://github.com/citusdata/pg_cron>pg_cron</a> for this purpose, but creating an end-to-end pipeline still required a lot of bookkeeping and careful concurrency considerations.<p>There are some existing solutions to this problem, such as incremental materialized views and logical decoding-based approaches, but the implementations are complex and come with many limitations. Moreover, there are other incremental processing scenarios such as collecting data from multiple sources, or periodic import/export. I also still hear from people about an old blog post I wrote on <a href=https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/>incremental data processing in PostgreSQL</a>, so I know this topic remains unsolved for many Postgres users.<p>I felt it was time for a new incremental processing tool. One that isn't particularly magical, but is simple, versatile, and gets the job done.
That tool is pg_incremental.<h2 id=practical-data-pipelines-using-parameterized-sql><a href=#practical-data-pipelines-using-parameterized-sql>Practical data pipelines using parameterized SQL</a></h2><p>The basic idea behind pg_incremental is simple: You define a pipeline using a SQL command that is executed with parameters ($1, $2) that specify a range of values to be processed.<p>When you first define the pipeline, it executes the command with a range that covers all existing data, and also sets up a background job using <a href=https://github.com/citusdata/pg_cron>pg_cron</a> to periodically execute the command for new ranges. Every execution of the pipeline is transactional, such that each value is processed successfully exactly once. The dimension used to identify new data can be a sequence, time, or list of files.<p>Let’s think about a sample data aggregation pipeline:<ol><li>You have an indexed raw data table of events<li>You have a summary table called view_counts that summarizes daily data from your events table<li>pg_incremental is used for incrementally upserting existing and new event data into view_counts</ol><p>Sample code:<pre><code class=language-sql>/* define the raw data and summary table */
create table events (event_id bigserial, event_time timestamptz, user_id bigint, response_time double precision);
create table view_counts (day timestamptz, user_id bigint, count bigint, primary key (day, user_id));

/* enable fast range scans on the sequence column */
create index on events using brin (event_id);

/* for demo: generate some random data */
insert into events (event_time, user_id, response_time)
select now(), random() * 100, random() from generate_series(1,1000000) s;

/* define a sequence pipeline that periodically upserts view counts */
select incremental.create_sequence_pipeline('view-count-pipeline', 'events',
  $$
    insert into view_counts
    select date_trunc('day', event_time), user_id, count(*)
    from events where event_id between $1 and $2
    group by 1, 2
    on conflict (day, user_id) do update set count = view_counts.count + EXCLUDED.count;
  $$
);
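
/* the pipeline's progress is tracked in pg_incremental's own bookkeeping
   tables (described in the monitoring section below); for example: */
select pipeline_name, last_processed_sequence_number
from incremental.sequence_pipelines
where pipeline_name = 'view-count-pipeline';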

/* get the most active users of today */
select user_id, sum(count) from view_counts where day = now()::date group by 1 order by 2 desc limit 3;
┌─────────┬───────┐
│ user_id │  sum  │
├─────────┼───────┤
│      32 │ 20486 │
│      77 │ 20404 │
│      75 │ 20378 │
└─────────┴───────┘
</code></pre><p>A “sequence pipeline” takes advantage of sequence values in PostgreSQL being monotonically increasing with every insert. It is not normally safe to just start processing a range of sequence values, because there might be ongoing transactions that are about to insert lower sequence values. However, pg_incremental waits for those transactions to complete before processing a range, which guarantees that the range is safe.<p>Not every table has a sequence, and sometimes the source is not a table at all. Therefore, pg_incremental has 3 types of pipelines:<ul><li><strong>Sequence pipelines</strong> can process ranges of new sequence values in small batches with upserts.<li><strong>Time interval pipelines</strong> can process data that falls within a time interval after the time interval has passed.<li><strong>File list pipelines</strong> (in preview) can process new files that appear in a directory.</ul><p>Let's look at some more examples:<h3 id=example-1-unpacking-raw-json-data-with-a-sequence-pipeline><a href=#example-1-unpacking-raw-json-data-with-a-sequence-pipeline>Example 1: Unpacking raw JSON data with a sequence pipeline</a></h3><p>PostgreSQL has great JSON support, but I often run into scenarios where you need to unpack raw JSON data into the columns of a table to simplify querying or add indexes and constraints.<p>Below is an example of using a pg_incremental sequence pipeline to transform raw JSON. We create a table with a sequence and a JSONB column to load raw files directly using COPY. We then set up a pipeline that extracts relevant values from the new JSON objects, and inserts them into an events table with columns.<pre><code class=language-sql>/* create a table with a single JSONB column and a sequence to track new objects */
create table events_json (id bigint generated by default as identity, payload jsonb);
create index on events_json using brin (id);

/* load some data from a local newline-delimited JSON file */
\copy events_json (payload) from '2024-12-15-00.json' with (format 'csv', quote e'\x01', delimiter e'\x02', escape e'\x01')

/* periodically unpack the new JSON objects into the events table */
select incremental.create_sequence_pipeline('unpack-json-pipeline', 'events_json',
  $$
    insert into events (event_id, event_time, user_id, response_time)
    select
      nextval('events_event_id_seq'),
      (payload->>'created_at')::timestamptz,
      (payload->'actor'->>'id')::bigint,
      (payload->>'response_time')::double precision
    from events_json
    where id between $1 and $2
  $$
);
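
/* optional sanity check: confirm the unpacked rows landed in the events table */
select count(*), min(event_time), max(event_time) from events;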
</code></pre><p>After setting up the pipeline, future data loads into events_json will automatically be transformed and added to the events table.<h3 id=example-2-complex-aggregations-with-a-time-interval-pipeline><a href=#example-2-complex-aggregations-with-a-time-interval-pipeline>Example 2: Complex aggregations with a time interval pipeline</a></h3><p>A time interval pipeline runs after an interval has passed, when all the data in the interval is available. Compared to sequence pipelines, time interval pipelines are more suitable for aggregations that cannot be merged, such as exact distinct counts.<p>Below is an example of using a pg_incremental time interval pipeline to aggregate the number of unique users in an hour into a user_counts table. The $1 and $2 parameters will be set to the start and end (exclusive) of a range of time intervals.<pre><code class=language-sql>/* create a table for number of active users per hour */
create table user_counts (hour timestamptz, user_count bigint, primary key (hour));

/* enable fast range scans on the event_time column */
create index on events using brin (event_time);

/* aggregates a range of 1 hour intervals after an hour has passed */
select incremental.create_time_interval_pipeline('distinct-user-count', '1 hour',
  $$
    insert into user_counts
    select date_trunc('hour', event_time), count(distinct user_id)
    from events where event_time >= $1 and event_time < $2
    group by 1
  $$
);

/* get number of active users per hour */
select hour, user_count from user_counts order by 1;
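
/* the interval pipeline's progress can be checked in pg_incremental's
   bookkeeping tables (shown in the monitoring section below) */
select pipeline_name, time_interval, last_processed_time
from incremental.time_interval_pipelines
where pipeline_name = 'distinct-user-count';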
</code></pre><p>A downside of time interval pipelines is that they do not process data with older timestamps if the corresponding interval has already been processed. By default, a time interval pipeline waits for 1 minute after the interval. You can configure a higher min_delay and can also specify a source_table_name to wait for writers to finish.<h3 id=example-3-periodic-export-to-parquet-in-s3-with-a-time-interval-pipeline><a href=#example-3-periodic-export-to-parquet-in-s3-with-a-time-interval-pipeline>Example 3: Periodic export to Parquet in S3 with a time interval pipeline</a></h3><p>A common requirement with event data is to export it into a remote storage system like S3, for instance using the <a href=https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet>pg_parquet</a> extension.<p>Below is an example of using a pg_incremental time interval pipeline to export the data in the events table to one Parquet file per day, starting at Jan 1st 2024, and automatically after a day has passed.<pre><code class=language-sql>/* define a function that wraps a COPY TO command to export data */
create or replace function export_events(start_time timestamptz, end_time timestamptz)
returns void language plpgsql as $function$ begin

  /* select all rows in a time range and export them to a Parquet file */
  execute format(
    'copy (select * from events where event_time >= %L and event_time < %L) to %L',
    start_time, end_time, format('s3://mybucket/events/%s.parquet', start_time::date)
  );

end; $function$;
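
/* optional smoke test: invoke the wrapper by hand for a single day before
   wiring it into a pipeline (assumes the bucket above is writable) */
select export_events('2024-01-01', '2024-01-02');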

/* export data as 1 file per day, starting at Jan 1st */
select incremental.create_time_interval_pipeline(
  'export-events',
  '1 day',
  'select export_events($1, $2)',

  source_table_name := 'events', /* wait for writes on events to finish */
  batched := false,              /* separate execution for each day     */
  start_time := '2024-01-01'     /* export all days from Jan 1st now    */
);
</code></pre><p>In this case, I disabled “batching” of time intervals, such that time intervals are processed one at a time, starting from Jan 1st 2024. I also specified a source_table_name, which means the execution waits for any ongoing writes. If the event_time is generated via now(), this helps ensure we do not skip any rows.<h3 id=example-4-import-new-files-with-a-file-list-pipeline><a href=#example-4-import-new-files-with-a-file-list-pipeline>Example 4: Import new files with a file list pipeline</a></h3><p>One of the things that triggered me to write pg_incremental was that I found myself writing a script to incrementally process new files in S3 for a <a href=https://www.crunchydata.com/blog/crunchy-data-warehouse-postgres-with-iceberg-for-high-performance-analytics>Crunchy Data Warehouse</a> use case, and I realized that processing new files in a directory had a lot in common with the other incremental processing scenarios, except we find new data by listing files.<p>Below is an example of using a pg_incremental file list pipeline to import all files that match a wildcard and automatically load new files as they appear (in Crunchy Data Warehouse). The $1 parameter will be set to the path of a file that has not been processed yet, as returned by the underlying list function.<pre><code class=language-sql>/* define function that wraps a COPY FROM command to import data */
create or replace function import_events(path text)
returns void language plpgsql as $function$ begin

  /* load a file into the events table */
  execute format('copy events from %L', path);

end; $function$;
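
/* optional: the wrapper can also be invoked directly on a single file for
   testing (the path below is hypothetical) */
select import_events('s3://mybucket/events/2024-12-15.csv');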

/* load all the files under a prefix, and automatically load new files, one at a time */
select incremental.create_file_list_pipeline(
    'import-events',
    's3://mybucket/events/*.csv',
    'select import_events($1)'
);
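
/* all pipelines, regardless of type, also appear in incremental.pipelines */
select * from incremental.pipelines;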
</code></pre><p>The list function is configurable via the <code>list_function</code> argument. For instance, you could wrap around the <a href=https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-GENFILE>pg_ls_dir()</a> function to load files on the server, or use a function that returns a synthetic range to load public (not listable) data.<p>The API of the file list pipeline might still undergo small changes, hence it’s in preview.<h2 id=monitoring-pg_incremental><a href=#monitoring-pg_incremental>Monitoring pg_incremental</a></h2><p>You can see all your pipelines in the <code>incremental.pipelines</code> table and monitor the progress of your pipelines via the tables that pg_incremental uses to do its own bookkeeping, which contain the last processed value:<pre><code class=language-sql>select * from incremental.sequence_pipelines;
┌─────────────────────┬────────────────────────────┬────────────────────────────────┐
│    pipeline_name    │       sequence_name        │ last_processed_sequence_number │
├─────────────────────┼────────────────────────────┼────────────────────────────────┤
│ view-count-pipeline │ public.events_event_id_seq │                        3000000 │
└─────────────────────┴────────────────────────────┴────────────────────────────────┘

select * from incremental.time_interval_pipelines;
┌───────────────┬───────────────┬─────────┬───────────┬────────────────────────┐
│ pipeline_name │ time_interval │ batched │ min_delay │  last_processed_time   │
├───────────────┼───────────────┼─────────┼───────────┼────────────────────────┤
│ export-events │ 1 day         │ f       │ 00:00:30  │ 2024-12-17 00:00:00+01 │
└───────────────┴───────────────┴─────────┴───────────┴────────────────────────┘
</code></pre><p>In addition, you can view the result of the underlying pg_cron jobs via the regular pg_cron tables.<pre><code class=language-sql>select jobname, start_time, status, return_message
from cron.job_run_details join cron.job using (jobid)
where jobname like 'pipeline:event-import%' order by start_time desc limit 3;
┌───────────────────────┬───────────────────────────────┬───────────┬────────────────┐
│        jobname        │          start_time           │  status   │ return_message │
├───────────────────────┼───────────────────────────────┼───────────┼────────────────┤
│ pipeline:event-import │ 2024-12-17 13:27:00.090057+01 │ succeeded │ CALL           │
│ pipeline:event-import │ 2024-12-17 13:26:00.055813+01 │ succeeded │ CALL           │
│ pipeline:event-import │ 2024-12-17 13:25:00.086688+01 │ succeeded │ CALL           │
└───────────────────────┴───────────────────────────────┴───────────┴────────────────┘
</code></pre><p>Note that the jobs run more frequently than the pipeline command is executed. The command is skipped if there is no new work to do.<h2 id=get-started-with-incremental-processing-in-postgresql><a href=#get-started-with-incremental-processing-in-postgresql>Get started with incremental processing in PostgreSQL</a></h2><p>Crunchy Data is proud to release pg_incremental under the PostgreSQL license. We believe it is a foundational building block for building IoT applications on PostgreSQL that should be available to everyone, similar to <a href=https://github.com/citusdata/pg_cron>pg_cron</a>, <a href=https://github.com/crunchydata/pg_parquet/>pg_parquet</a>, and <a href=https://github.com/pgpartman/pg_partman>pg_partman</a>.<p>You can find code and documentation on the <a href=https://github.com/crunchydata/pg_incremental>pg_incremental GitHub repo</a>, and let us know if you have any feedback (always appreciate a star!).<p>Starting today, pg_incremental is also available on <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a> and <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Marco.Slot@crunchydata.com (Marco Slot) ]]></author>
<dc:creator><![CDATA[ Marco Slot ]]></dc:creator>
<guid isPermalink="false">402056d593bec50e6bc664d9eea176a721bd2aaaa9603ca4c4b8cfc04de8b6dd</guid>
<pubDate>Tue, 17 Dec 2024 08:30:00 EST</pubDate>
<dc:date>2024-12-17T13:30:00.000Z</dc:date>
<atom:updated>2024-12-17T13:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres Powered by DuckDB: The Modern Data Stack in a Box ]]></title>
<link>https://www.crunchydata.com/blog/postgres-powered-by-duckdb-the-modern-data-stack-in-a-box</link>
<description><![CDATA[ Marco reviews the challenges and strengths of analytical and transactional workloads and what a modern data stack that merges the two might look like. ]]></description>
<content:encoded><![CDATA[ <style>
    .black-box {
        background-color: black;
        color: white;
        padding: 20px;
        text-align: left;
        align-items: left;
        margin: 20px auto;
        border-radius: 10px;
        width: auto;
        height: auto;
    }
    .black-box a {
        color: white;
        text-decoration: underline;
    }
</style> <div class="black-box">
Looking for Postgres with the power of DuckDB? <a href="https://www.crunchydata.com/products/warehouse">Crunchy Data Warehouse</a> is the latest Postgres-native tool with full Iceberg support and DuckDB integration.
</div><p><a href=https://www.crunchydata.com/products/crunchy-bridge-for-analytics>Postgres for analytics</a> has always been a huge question mark. By using PostgreSQL's extension APIs to <a href=https://www.crunchydata.com/solutions/postgres-with-duckdb>integrate DuckDB</a> as a query engine for state-of-the-art analytics performance without forking either project, could Postgres be the analytics database too?<p>Bringing an analytical query engine into a transactional database system raises many interesting possibilities and questions. In this blog post I want to reflect on what makes these workloads and system architectures so different and what bringing them together means.<h2 id=olap--oltp-never-the-twain-shall-meet><a href=#olap--oltp-never-the-twain-shall-meet>OLAP & OLTP: Never the twain shall meet</a></h2><p><a href=https://www.crunchydata.com/blog/an-overview-of-distributed-postgresql-architectures>Database systems</a> have always been divided into two worlds: Transactional and Analytical, traditionally referred to as OLTP and OLAP (Online Transactional/Analytical Processing).<p>Both types of data stores use very similar concepts. The relational data model and SQL dominate. Writes, schema management, transactions and indexes use similar syntax and semantics. Many tools can interact with both types. Why then are they separate systems?<p>The answer has multiple facets. At a high level, OLTP involves doing a very large number of small queries and OLAP involves doing a small number of very large queries. They represent two extremes of the database workload spectrum. While many workloads fall somewhere in between, they can often be optimized or split until they can reasonably be handled by conventional OLTP or OLAP systems.<p>For many applications, the database system does the bulk of the critical computational work. Deep optimizations are essential for a database system to be useful and appealing, but optimization inherently comes with complexity. 
Consider that relational database systems have a vast amount of functionality and need to cater to a wide range of workloads. Building a versatile yet well-optimized database system can take a very long time.<p>Optimization is most effective when specializing for the characteristics of specific workloads, which practically always comes with the trade-off of being less optimized for other workloads. As it turns out, doing many small things or a few large things, when optimized to the extreme across a wide range of system functions over a long period, results in fundamentally different system architectures. The interface may be similar, but everything from the way queries and transactions are processed down to the storage architecture is going to be vastly different.<p>Let's have a look at what the main challenges are for each type of system.<h2 id=challenges-of-transactional-systems><a href=#challenges-of-transactional-systems>Challenges of transactional systems</a></h2><p>The biggest challenge in transactional systems is the efficient handling of a high rate of small update/delete transactions, in a way that guarantees ACID properties, while also handling concurrent read queries.<p>Storage in transactional systems is organized around small in-memory buffers that can be modified in less than a microsecond and written to disk in under a millisecond. Rows are packed together in the buffers, with tree data structures across the buffers (indexes) used to efficiently find the rows.<p>An insert/update/delete command involves modifying several buffers to write the new rows and add index entries, or a larger number when modifying multiple rows at once. The changes to the buffers are also written to a write ahead log (WAL) to ensure that they can be recovered in case of a crash.<p>In the cloud, the buffers and WAL can be stored in elastic block storage or other network-attached storage systems that replicate small disk blocks with minimal latency. 
Only the WAL is synchronously flushed to disk when committing a transaction. Recently modified buffers may only exist in the memory of the database server. The disk can have many stale and sometimes even truncated buffers. It can only be correctly interpreted by the server to which it is attached or through a crash recovery process that restores changes from the WAL.<p>Many write operations and read queries will be running concurrently. Database systems typically try to prevent queries from seeing ongoing changes by versioning the rows during a write, such that concurrent reads can skip row versions that were added after the query started ("snapshot isolation"). This comes with additional challenges. For instance, consider that strange anomalies would occur if concurrent updates operated on the same snapshot without considering each other’s effects. PostgreSQL resolves this through a combination of row-level locks, a chain of forward pointers from past to current row versions, and using the latest row version in the update regardless of the snapshot.<p>To build a high performance transactional system it is essential to minimize the overhead from synchronization across all these operations, as well as storage access, query processing, transaction management, I/O, and concurrency control.<h2 id=challenges-of-analytical-systems><a href=#challenges-of-analytical-systems>Challenges of analytical systems</a></h2><p>The biggest challenge in analytical systems is to process a very large number of records in as little time as possible within the context of a single query.<p>Analytical queries compute statistics and trends across historical data. The number of records involved in a single query can easily be in the billions, which can be a billion times more than the number of records involved in a typical transactional query (1-100). If your database system needed 1 second per record, a single query might not finish in your lifetime. 
Hence, spending as little time as possible per record matters above all else.<p>One of the techniques analytical systems use to minimize per-record overhead is columnar storage, which dissects records into fixed-sized vectors of values from the same column. Analytical queries often only use a subset of the columns while involving most of the records, and columnar storage enables skipping unused columns during reads. The vectors can also be compressed effectively because columns often contain many similar values.<p>Database systems that use columnar storage can be architected for vectorized execution, which means they process a vector at a time rather than a record at a time. For instance, a filter might be evaluated on a vector from column A, which produces a list of indices to be retrieved from a vector from column B. Vectorized execution minimizes the overhead of switching between different expression states, and can take advantage of modern CPU instructions that process multiple values at once (SIMD).<p>Analytical database systems also optimize for parallel execution within a single query. Data needs to be passed around efficiently between different parts of a parallel execution pipeline.<p>Managing the flows of data in a parallel, vectorized executor with minimal data copying, while also using data structures that maximize performance is the most complex part of building an analytical database system. The query planner and executor are very different than in transactional systems, with higher processing time per query, but much lower processing time per row when a query spans many rows.<p>In modern analytical database systems, storage is organized around files in distributed storage systems like Amazon S3, because they can scale the amount of storage and retrieval bandwidth, and can be accessed directly by various applications. 
The latency of such systems is relatively high, but acceptable for analytical queries which typically range from hundreds of milliseconds to minutes.<p>The overhead of synchronization between components and concurrent queries is also less of an issue in analytical systems than in transactional systems because they are negligible compared to the time required to execute a single query.<h2 id=can-you-build-a-unified-database-architecture><a href=#can-you-build-a-unified-database-architecture>Can you build a unified database architecture?</a></h2><p>It is technically possible to build a database engine that can simultaneously handle a high rate of low latency transactions and perform fast analytical queries on a <em>single copy of the data</em>, with ACID properties. Such systems are often referred to as HTAP (Hybrid Transactional/Analytical Processing). However, hybrid systems generally underperform against dedicated OLTP or OLAP systems or make other invasive trade-offs such as limited functionality, or lower durability.<p>It is hard to precisely identify the workloads that benefit substantially from a hybrid approach. Moreover, the incremental cost of keeping a second copy of transactional data in a format that is optimized for analytics and compression in object storage is relatively small. Hence, replicating transactional data into an analytical system with some lag has become the dominant way of doing analytics on transactional data.<p>The OLTP vs. OLAP disparity is likely to stay, though it is not without downsides. The capabilities, tools, ecosystem, and practices differ significantly between different parts of the data stack, which becomes a source of complexity and high maintenance cost. In addition, expensive, brittle data movement tools are needed to get data from transactional to analytical stores.<p>OLTP and OLAP workloads are often managed by different teams, so some differences in tooling and practices are to be expected. 
Still, organizations spend a huge amount of time and money on moving data and integrating different systems. Moreover, an application that is purely transactional or purely analytical is not likely to remain so for long.<p>Consider that analytics teams often create dashboards and popular dashboards end up having various types of materialized views, typically kept in transactional database systems. Data management also involves a lot of metadata and bookkeeping, and those are best done in a transactional way to avoid duplicate or missing data.<p>Application teams use transactional database systems, but often want to add value by providing their customers with insights. However, they do not want the complexity of funneling the data through several (company-wide) analytics systems.<p>OLTP and OLAP are fundamentally different workloads that benefit from fundamentally different techniques and storage solutions, but they don't necessarily benefit from using wholly different database systems. The fact that most database systems are focused on only one type of workload is because it is extremely hard for a database builder to be simultaneously successful in two different worlds.<p>The solution to this conundrum, we believe, lies in extensibility.<h2 id=database-extensibility-bringing-disparate-systems-together><a href=#database-extensibility-bringing-disparate-systems-together>Database extensibility: Bringing disparate systems together</a></h2><p>From its inception, PostgreSQL has been designed to be extensible. It supports many forms of extensibility, with extensions able to control the behavior of query processing and data storage at many different levels. There is a flourishing ecosystem of extensions.<p>DuckDB is an embedded OLAP database, which is taking the analytics landscape by storm. 
DuckDB takes inspiration from SQLite for its deployment model, and PostgreSQL for its functionality and extensibility.<p>With both of these systems being extensible, and having a similar interface, they can be integrated in interesting ways. In Crunchy Bridge for Analytics, we introduced the notion of analytics tables that are backed by files in S3 and integrated DuckDB as a query engine for queries that involve analytics tables. PostgreSQL gives sufficient flexibility to use DuckDB for the full query or specific parts of the query. We used DuckDB extensions to incrementally fill any gaps that DuckDB might have compared to PostgreSQL, to ensure a wide range of PostgreSQL queries can take advantage of parallel, vectorized execution on columnar Parquet files within DuckDB.<p>Our goal in Crunchy Bridge for Analytics is not necessarily to enable HTAP tables. It is meant "for analytics". We do believe it is very useful to have your analytics database use the same system as your transactional databases, and that it is also very useful to be able to handle arbitrary transactional workloads in your analytics database, including handling of metadata, fast insert queues, partitioned time series tables, materialized views, etc. We also think it is useful to be able to easily move data between transactional and analytical tables without needing external tools, or do fast analytics without needing to switch to a different set of tools, data types, syntax, and ecosystem.<p>Effectively, extensibility can reduce the traditional OLTP-OLAP gap to a difference between tables, rather than a difference between database systems. 
Multiple query engines that make very different trade-offs and use different storage layers can be blended together into a single environment.<h2 id=the-power-single-machine-database-systems><a href=#the-power-single-machine-database-systems>The power of single-machine database systems</a></h2><p>One of the most surprising and disruptive aspects of DuckDB is that it is taking fast analytics away from the complex world of distributed systems back into the much simpler single-machine realm. While transactional database systems like PostgreSQL can be distributed, the vast majority of database systems run on a single machine. Hence, we now have two state-of-the-art, open source OLAP &#38 OLTP systems which we can reasonably run together on the same machine.<p>There is still a difference between running a single machine briefly to handle a large analytics query (common for OLAP) vs. running it all the time to handle a steady rate of transactions (common for OLTP). However, we found that using a persistent machine for analytics on modern hardware has one tremendous benefit: long-lived caches in memory and on large NVMe drives.<p>Retrieving a file from S3 over a single connection is generally limited to 50-80MB/sec. Multiple concurrent connections can help, but in most scenarios the aggregate throughput on a single machine only goes a few times higher. Conversely, locally-attached NVMe drives easily reach 2-3GB/sec of read throughput and are usually big enough to hold a large part of the data. Keeping our data cached lets us take advantage of DuckDB’s processing power at a limited, predictable cost.<p>Even with a local cache, you do want your machine to have a high bandwidth connection to S3 for querying files that are not in cache or for loading into cache within a reasonable amount of time. The best way to achieve that is to run the machine on EC2 in the same AWS region as your S3 buckets. 
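<p>As a back-of-the-envelope illustration, here is how long a full scan of a 100GB working set would take at each tier. The throughput figures below are the rough ranges cited above, not measurements:</p>

```python
# Rough scan-time estimates using the approximate throughput numbers
# cited above (assumptions, not benchmarks).
DATA_GB = 100

s3_single_conn_mb_s = 65     # ~50-80 MB/sec per S3 connection
s3_parallel_mb_s = 65 * 4    # aggregate over a few concurrent connections
nvme_mb_s = 2500             # ~2-3 GB/sec locally-attached NVMe

def scan_seconds(size_gb, throughput_mb_s):
    """Time to sequentially read size_gb at the given throughput."""
    return size_gb * 1024 / throughput_mb_s

print(f"S3, one connection:    {scan_seconds(DATA_GB, s3_single_conn_mb_s):6.0f} s")
print(f"S3, a few connections: {scan_seconds(DATA_GB, s3_parallel_mb_s):6.0f} s")
print(f"local NVMe cache:      {scan_seconds(DATA_GB, nvme_mb_s):6.0f} s")
```

<p>Even granting S3 a few parallel connections, the local cache is faster by roughly an order of magnitude, which is what makes a persistent, cache-warm machine so attractive.</p>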
A major benefit DuckDB-in-PostgreSQL has over plain DuckDB in that regard is that it has a well-defined network protocol and a huge ecosystem of tools that support it. Moreover, PostgreSQL can be managed for you in EC2 by Crunchy Bridge.<p>Of course, when we talk about single machine systems, you can still have as many of those as you need to handle various applications and workflows with high concurrency. The big advantage is that you avoid a lot of the cost and complexity of gluing together operationally complex data processing systems, and instead you have a set of versatile units that are managed for you.<h2 id=duckdb--postgresql--the-everything-database><a href=#duckdb--postgresql--the-everything-database>DuckDB + PostgreSQL = The Everything Database?</a></h2><p>So, with <a href=https://www.crunchydata.com/products/crunchy-bridge-for-analytics>Crunchy Bridge for Analytics</a> you can get the stellar analytics performance of DuckDB along with all the familiar transactional capabilities and versatility of PostgreSQL in one box, which is managed for you by the team at Crunchy Data.<p>You can do fast ad-hoc queries on <a href=https://www.crunchydata.com/solutions/postgres-for-parquet-and-iceberg>Parquet</a> in S3 or create materialized views, you can schedule your ETL processes via <a href=https://www.crunchydata.com/blog/annoucing-the-scheduler-for-crunchy-bridge>pg_cron</a>, you can track data operations via transactions, you can efficiently import and export Parquet/CSV/JSON, you can use any PostgreSQL-compatible tool (incl. most BI tools), and you can use all the PostgreSQL <a href=https://docs.crunchybridge.com/extensions-and-languages>extensions</a> and <a href=https://docs.crunchybridge.com/concepts>managed database features</a> offered by Crunchy Bridge. Finally, you get a predictable price with great price-performance thanks to long-lived caches.<p>It might just be time for a new data stack. ]]></content:encoded>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Marco.Slot@crunchydata.com (Marco Slot) ]]></author>
<dc:creator><![CDATA[ Marco Slot ]]></dc:creator>
<guid isPermalink="false">e344a825d30807b621355f47dc26ca82f2fcace91b05d63bc3c588701936cbd0</guid>
<pubDate>Fri, 16 Aug 2024 12:00:00 EDT</pubDate>
<dc:date>2024-08-16T16:00:00.000Z</dc:date>
<atom:updated>2024-08-16T16:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ An Overview of Distributed PostgreSQL Architectures ]]></title>
<link>https://www.crunchydata.com/blog/an-overview-of-distributed-postgresql-architectures</link>
<description><![CDATA[ Marco just joined Crunchy Data and he reflects on his career in distributed systems in this post. He provides an overview of several options for approaching distributed Postgres workloads and the pros and cons of each approach. ]]></description>
<content:encoded><![CDATA[ <p>I've always found distributed systems to be the most fascinating branch of computer science. I think the reason is that distributed systems are subject to the rules of the physical world just like we are. Things are never perfect, you cannot get everything you want, you’re always limited by physics, and often by economics, or by who you can communicate with. Many problems in distributed systems simply do not have a clean solution; instead, there are different trade-offs you can make.<p>While at Citus Data, Microsoft, and now Crunchy Data, the focus of my work has been on distributed PostgreSQL architectures. At the last <a href=http://PGConf.EU>PGConf.EU</a> in December, I gave a talk titled “<a href=https://www.postgresql.eu/events/pgconfeu2023/sessions/session/4826-postgresql-distributed-architectures-best-practices/>PostgreSQL Distributed: Architectures &#38 Best Practices</a>” where I went over various kinds of distributed PostgreSQL architectures that I’ve encountered over the years.<p>Many distributed database discussions focus on algorithms for distributed query planning, transactions, etc. These are very interesting topics, but the truth is that only a small part of my time as a distributed database engineer goes into algorithms, and an excessive amount of time goes into making very careful trade-offs at every level (and of course, failure handling, testing, fixing bugs). Similarly, what many users notice within the first few minutes of using a distributed database is how unexpectedly slow it can be, because you quickly start hitting performance trade-offs.<p>There are many types of distributed PostgreSQL architectures, and they each make a different set of trade-offs. 
Let’s go over some of these architectures.<h2 id=single-machine-postgresql><a href=#single-machine-postgresql>Single machine PostgreSQL</a></h2><p>To set the stage for discussing distributed PostgreSQL architectures, we first need to understand a bit about the simplest possible architecture: running PostgreSQL on a single machine, or "node".<p>PostgreSQL on a single machine can be incredibly fast. There’s virtually no network latency on the database layer and you can even co-locate your application server. Millions of IOPS are available depending on the machine configuration. Disk latency is measured in microseconds. In general, running PostgreSQL on a single machine is a performant and cost-efficient choice.<p>So why doesn’t everyone just use a single machine?<p>Many companies do. However, PostgreSQL on a single machine comes with operational hazards. If the machine fails, there’s inevitably some kind of downtime. If the disk fails, you’re likely facing some data loss. An overloaded system can be difficult to scale. And you’re limited to the storage size of a disk; once the disk fills up, the database can no longer store new data. That very low latency and efficiency clearly comes at a price.<p>Distributed PostgreSQL architectures are ultimately trying to address the operational hazards of a single machine in different ways. In doing so, they do lose some of its efficiency, and especially the low latency.<h2 id=goals-of-a-distributed-database-architecture><a href=#goals-of-a-distributed-database-architecture>Goals of a Distributed Database Architecture</a></h2><p>The goal of a distributed database architecture is to try to meet the availability, durability, performance, regulatory, and scale requirements of large organizations, subject to the laws of physics. 
The ultimate goal is to do so with the same rich functionality and precise transactional semantics as a single node RDBMS.<p>There are several mechanisms that distributed database systems employ to achieve this, namely:<ul><li>Replication - Place copies of data on different machines<li>Distribution - Place partitions of data on different machines<li>Decentralization - Place different DBMS activities on different machines</ul><p>In practice, each of these mechanisms inherently comes with concessions in terms of performance, transactional semantics, functionality, and/or operational complexity.<p>To get a nice thing, you’ll have to give up a nice thing, but there are many different combinations of what you can get and what you need to give up.<h2 id=the-importance-of-latency-in-oltp-systems><a href=#the-importance-of-latency-in-oltp-systems>The importance of latency in OLTP systems</a></h2><p>Of course, distributed systems have already taken over the world, and most of the time we don’t really need to worry a lot about trade-offs when using them. Why would distributed <em>database</em> systems be any different?<p>The difference lies in a combination of storing the authoritative state for the application, the rich functionality that an RDBMS like PostgreSQL offers, and the relatively high impact of latency on client-perceived performance in OLTP systems.<p>PostgreSQL, like most other RDBMSs, uses a synchronous, interactive protocol where transactions are performed step-by-step. The client waits for the database to answer before sending the next command, and the next command might depend on the answer to the previous.<p>Any network latency between client and database server will already be a noticeable factor in the overall duration of a transaction. When PostgreSQL itself is a distributed system that makes internal network round trips (e.g. 
while waiting for WAL commit), the duration can get many times higher.<p><img alt="latency diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/41566cae-dec8-4d8d-9b32-a8c23cb70200/public><p>Why is it bad for transactions to take longer? Surely humans won’t notice if they need to wait 10-20ms? Well, if transactions take on average 20ms, then a single (interactive) session can only do 50 transactions per second. You then need a lot of concurrent sessions to actually achieve high throughput.<p>Having many sessions is not always practical from the application point of view, and each session uses significant resources like memory on the database server. Most PostgreSQL setups limit the maximum number of sessions to the hundreds or low thousands, which puts a hard limit on achievable transaction throughput when network latency is involved. In addition, any operation that is holding locks while waiting for network round trips is also going to affect the achievable concurrency.<p><img alt="connections and processes diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f094a883-daa1-4a52-caa5-5590bce9fe00/public><p>While in theory latency does not have to affect performance this much, in practice it almost always does. The CIDR ‘23 paper <a href=https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf>“Is Scalable OLTP in the Cloud a solved problem?”</a> gives a nice discussion of the issue of latency in section 2.5.<h2 id=postgresql-distributed-architectures><a href=#postgresql-distributed-architectures>PostgreSQL Distributed Architectures</a></h2><p>PostgreSQL can be distributed at many different layers that hook into different parts of its own architecture and make different trade-offs. In the following sections, we will discuss these well-known architectures:<ul><li><strong>Network-attached block storage</strong> (e.g. EBS)<li><strong>Read replicas</strong><li><strong>DBMS-optimized cloud storage</strong> (e.g. 
Aurora)<li><strong>Active-active</strong> (e.g. BDR)<li><strong>Transparent Sharding</strong> (e.g. Citus)<li><strong>Distributed key-value stores with SQL</strong> (e.g. Yugabyte)</ul><p>We will describe the pros and cons of each architecture, relative to running PostgreSQL on a single machine.<p>Note that many of these architectures are orthogonal. For instance, you could have a sharded system with read replicas using network-attached storage, or an active-active system that uses DBMS-optimized cloud storage.<h3 id=network-attached-block-storage><a href=#network-attached-block-storage>Network-attached block storage</a></h3><p>Network-attached block storage is a common technique in cloud-based architectures where the database files are stored on a different device. The database server typically runs in a virtual machine on a hypervisor, which exposes a block device to the VM. Any reads and writes to the block device will result in network calls to a block storage API. The block storage service internally replicates the writes to 2-3 storage nodes.<p><img alt="Network-attached block storage architecture in the cloud"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/4cd769ee-098e-4f1e-072c-1b15dcfcb500/public><p>Practically all managed PostgreSQL services use network-attached block devices because the benefits are critical to most organizations. The internal replication results in high durability and also allows the block storage service to remain available when a storage node fails. The data is stored separately from the database server, which means the database server can easily be respawned on a different machine in case of failure, or when scaling up/down. Finally, the disk itself is easily resizable and supports snapshots for fast backups and creating replicas.<p>Getting so many nice things does come at a significant performance cost. 
Where modern NVMe drives generally achieve over 1M IOPS with disk latency in the tens of microseconds, network-attached storage is often below 10K IOPS and >1ms disk latency, especially for writes. That is a difference of roughly two orders of magnitude.<p><strong>Pros:</strong><ul><li>Higher durability (replication)<li>Higher uptime (replace VM, reattach)<li>Fast backups and replica creation (snapshots)<li>Disk is resizable</ul><p><strong>Cons:</strong><ul><li>Higher disk latency (~20μs -> ~1000μs)<li>Lower IOPS (~1M -> ~10k IOPS)<li>Crash recovery on restart takes time<li>Cost can be high</ul><p>💡 <strong>Guideline</strong>: the durability and availability benefits of network-attached storage usually outweigh the performance downsides, but it’s worth keeping in mind that PostgreSQL can be much faster.<h3 id=read-replicas><a href=#read-replicas>Read replicas</a></h3><p>PostgreSQL has built-in support for physical replication to read-only replicas. The most common way of using a replica is to set it up as a hot standby that takes over when the primary fails in a <a href=https://www.crunchydata.com/blog/database-terminology-explained-postgres-high-availability-and-disaster-recovery>high availability setup</a>. There are many blogs, books, and talks describing the trade-offs of high availability setups, so in this post I will focus on other architectures.<p>Another common use for read replicas is to scale read throughput when reads are CPU- or I/O-bottlenecked: load balancing queries across replicas scales reads linearly, and it also offloads the primary, which speeds up writes!<p><img alt="read replica diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/67558330-e428-41c2-a876-3d6fe88ba000/public><p>A challenge with read replicas is that there is no prescribed way of using them. 
You have to decide on the topology and how you query them, and in doing so you will be making distributed systems trade-offs yourself.<p>The primary usually does not wait for replication when committing a write, which means read replicas are always slightly behind. That can become an issue when your application does a read that, from the user’s perspective, depends on a write that happened earlier. For example, a user clicks “Add to cart”, which adds the item to the shopping cart and immediately sends the user to the shopping cart page. If reading the shopping cart contents happens on the read replica, the shopping cart might then appear empty. Hence, you need to be very careful about which reads use a read replica.<p><img alt="diagram of inserts lagging behind a read replica"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/646e4756-c11e-49e8-c4b3-b6b139e32900/public><p>Even if reads do not directly depend on a preceding write, at least from the client perspective, there may still be strange time travel anomalies. When load balancing between different nodes, clients might repeatedly get connected to a different replica and see a different state of the database. As distributed systems engineers, we say that there is no “monotonic read consistency”.<p>Another issue with read replicas is that, when queries are load balanced randomly, the replicas will each end up with similar cache contents. While that is great when there are certain extremely hot queries, it becomes painful when the frequently read data (working set) no longer fits in memory and each read replica will be performing a lot of redundant I/O. 
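<p>A toy model makes the anomaly concrete. The sketch below is plain Python rather than a real replication setup: two replicas have applied the primary's writes with different lag, and a client whose reads are round-robin load balanced across them watches the value travel back in time:</p>

```python
# Toy model of non-monotonic reads across lagging replicas.
# This illustrates the anomaly; it is not a real replication protocol.
primary_log = [1, 2, 3, 4, 5]   # successive committed values of one row

class Replica:
    def __init__(self, lag):
        self.lag = lag  # number of primary writes not yet replayed here

    def read(self, committed):
        # The replica has replayed all but the last `lag` writes.
        return primary_log[max(1, committed - self.lag) - 1]

fast, slow = Replica(lag=0), Replica(lag=2)

# Round-robin load balancing: consecutive reads hit different replicas.
observed = [replica.read(committed=5) for replica in (fast, slow, fast)]
print(observed)  # [5, 3, 5] -- the value appears to go backwards
```

<p>Pinning a session to a single replica, or routing reads only to replicas that have replayed past the WAL position of the session's last write, restores monotonic reads at the cost of extra complexity.</p>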
In contrast, a sharded architecture would divide the data over the memory of multiple nodes and avoid much of that redundant I/O.<p>Read replicas are a powerful tool for scaling reads, but you should consider whether your workload is really appropriate for them.<p><strong>Pros:</strong><ul><li>Read throughput scales linearly<li>Low latency stale reads if read replica is closer than primary<li>Lower load on primary</ul><p><strong>Cons:</strong><ul><li>Eventual read-your-writes consistency<li>No monotonic read consistency<li>Poor cache usage</ul><p>💡 <strong>Guideline:</strong> Consider using read replicas when you need >100k reads/sec or observe a CPU bottleneck due to reads; best avoided for dependent transactions and large working sets.<h3 id=dbms-optimized-cloud-storage><a href=#dbms-optimized-cloud-storage>DBMS-optimized cloud storage</a></h3><p>There are now a number of cloud services, like Aurora and AlloyDB, that provide a network-attached storage layer that is optimized specifically for a DBMS.<p>In particular, a DBMS normally performs every write in two different ways: immediately to the write-ahead log (WAL), and in the background to a data page (or several pages, when indexes are involved). Normally, PostgreSQL performs both of these writes, but in the DBMS-optimized storage architecture the background page writes are performed by the storage layer instead, based on the incoming WAL. This reduces the amount of write I/O on the primary node.<p><img alt="diagram of dbms-optimized storage"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/8f4c2648-db97-40f8-1581-7d4a6ce0fa00/public><p>The WAL is typically replicated directly from the primary node to several availability zones to parallelize the network round trips, which increases I/O again. Always writing to multiple availability zones also increases the write latency, which can result in lower per-session performance. In addition, read latency can be higher because the storage layer does not always materialize pages in memory. 
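<p>The division of labor can be sketched in a few lines. This is a deliberately simplified model (invented record format, no caching or checkpoints), only meant to show that the primary writes nothing but WAL while the storage layer replays the log into pages:</p>

```python
# Simplified sketch of "the log is the database": the primary ships WAL
# records; the storage layer materializes pages from them on demand.
wal = []  # records shipped from the primary: (page_no, offset, new_byte)

def primary_write(page_no, offset, value):
    """The only write the primary performs is appending to the WAL."""
    wal.append((page_no, offset, value))

def storage_read_page(page_no, page_size=8):
    """The storage layer rebuilds a page by replaying the WAL."""
    page = bytearray(page_size)
    for p, off, val in wal:
        if p == page_no:
            page[off] = val
    return bytes(page)

primary_write(0, 0, 7)
primary_write(0, 1, 9)
primary_write(1, 0, 3)
print(storage_read_page(0))  # page rebuilt from the log, never written twice
```

<p>A real storage layer applies the log incrementally and caches materialized pages, which is exactly why a read can be slower when the page it needs has not been materialized yet.</p>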
Architecturally, PostgreSQL is also not optimized for these storage characteristics.<p>While the theory behind DBMS-optimized storage is sound, in practice the performance benefits are often not very pronounced (and can be negative), and the cost can be much higher than regular network-attached block storage. It does offer a greater degree of flexibility to the cloud service provider, for instance in terms of attach/detach times, because storage is controlled in the data plane rather than the hypervisor.<p><strong>Pros:</strong><ul><li>Potential performance benefits by avoiding page writes from primary<li>Replicas can reuse storage, incl. hot standby<li>Can do faster reattach, branching than network-attached storage</ul><p><strong>Cons:</strong><ul><li>Write latency is high by default<li>High cost / pricing<li>PostgreSQL is not designed for it, not OSS</ul><p>💡 <strong>Guideline:</strong> Can be beneficial for complex workloads, but important to measure whether price-performance under load is actually better than using a bigger machine.<h3 id=active-active><a href=#active-active>Active-active</a></h3><p>In the active-active architecture, any node can locally accept writes without coordination with other nodes. It is typically used with replicas in multiple sites, each of which will then see low read and write latency, and can survive failure of other sites. These benefits are phenomenal, but of course come with a significant downside.<p>First, you have the typical eventual consistency downsides of read replicas. However, the main challenge with an active-active setup is that update conflicts are not resolved upfront. Normally, if two concurrent transactions try to update the same row in PostgreSQL, the first one will take a “row-level lock”. 
In the case of active-active, both updates might be accepted concurrently.<p>For instance, when you perform two simultaneous updates of a counter on different nodes, the nodes might both see 4 as the current value and set the new value to 5. When replication happens, they’ll happily agree that the new value is 5 even though there were two increment operations.<p><img alt="diagram of an active-active architecture"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/627e299d-d26a-439d-cfe4-83049c73b200/public><p>Active-active systems do not have a linear history, even at the row level, which makes them very hard to program against. However, if you are prepared to live with that, the benefits can be attractive, especially when very high availability is required.<p><strong>Pros:</strong><ul><li>Very high read and write availability<li>Low read and write latency<li>Read throughput scales linearly</ul><p><strong>Cons:</strong><ul><li>Eventual read-your-writes consistency<li>No monotonic read consistency<li>No linear history (updates might conflict after commit)</ul><p>💡 <strong>General guideline:</strong> Consider only for very simple workloads (e.g. queues) and only if you really need the benefits.<h3 id=transparent-sharding><a href=#transparent-sharding>Transparent sharding</a></h3><p>Transparent sharding systems like Citus distribute tables by a shard key and/or replicate tables across multiple primary nodes. Each node shows the distributed tables as if they were regular PostgreSQL tables, and queries &#38 transactions are transparently routed or parallelized across nodes.<p>Data is stored in shards, which are regular PostgreSQL tables, which can take advantage of indexes, constraints, etc. 
In addition, the shards can be co-located by the shard key (in “shard groups”), such that joins and foreign keys that include the shard key can be performed locally.<p><img alt="transparent sharding diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/1cb5e4db-c901-40e3-5327-73a5c6caa500/public><p>The advantage of distributing the data this way is that you can take advantage of the memory, I/O bandwidth, storage, and CPU of all the nodes in an efficient manner. You could even ensure that your data or at least your working set always fits in memory by scaling out.<p>Scaling out transactional workloads is most effective when queries have a filter on the shard key, such that they can be routed to a single shard group (e.g. single tenant in a <a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>multi-tenant app</a>). That way, there is only a marginal amount of overhead compared to running a query on a single server, but you have a lot more capacity. Another effective way of scaling out is when you have compute-heavy analytical queries that can be parallelized across the shards (e.g. <a href=https://www.crunchydata.com/blog/postgres-citus-partman-your-iot-database>time series / IoT</a>).<p>However, there is also higher latency, which reduces the per-session throughput compared to a single machine. And, if you have a simple lookup that does not have a shard key filter, you will still experience all the overhead of parallelizing the query across nodes. Finally, there may be restrictions in terms of data model (e.g. unique and foreign key constraints must include the shard key), SQL (non-co-located correlated subqueries), and transactional guarantees (snapshot isolation only at shard level).<p>Using a sharded system often means that you will need to adjust your application to deal with higher latency and a more rigid data model. 
For instance, if you are building a <a href=https://docs.citusdata.com/en/stable/get_started/tutorial_multi_tenant.html>multi-tenant application</a> you will need to add tenant ID columns to all your tables to use as a shard key, and if you are currently loading data using INSERT statements then you might want to switch to COPY to avoid waiting for every row.<p>If you are willing to adjust your application, sharding can be one of the most powerful tools in your arsenal for dealing with data-intensive applications.<p><strong>Pros:</strong><ul><li>Scale throughput for reads &#38 writes (CPU &#38 IOPS)<li>Scale memory for large working sets<li>Parallelize analytical queries, batch operations</ul><p><strong>Cons:</strong><ul><li>High read and write latency<li>Data model decisions have high impact on performance<li>Snapshot isolation concessions</ul><p>💡 <strong>General guideline:</strong> Use for multi-tenant apps, otherwise use for large working set (>100GB) or compute heavy queries.<h3 id=distributed-key-value-storage-with-sql><a href=#distributed-key-value-storage-with-sql>Distributed key-value storage with SQL</a></h3><p>About a decade ago, Google Spanner introduced the notion of a distributed key-value store that supports transactions across nodes (key ranges) with snapshot isolation in a scalable manner by using globally synchronized clocks. Subsequent evolutions of Spanner then added a SQL layer on top, and ultimately even a PostgreSQL interface. Open source alternatives like CockroachDB and Yugabyte followed a similar approach without the requirement of synchronized clocks, at the cost of significantly higher latency.<p>These systems have built on top of existing key-value storage techniques for availability and scalability, such as shard-level replication and failover using Paxos or Raft. Tables are then stored in the key-value store, with the key being a combination of the table ID and the primary key. 
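<p>The key encoding can be sketched as follows; the format here is illustrative, not the actual encoding used by Spanner, CockroachDB, or Yugabyte:</p>

```python
# Sketch of mapping SQL rows onto an ordered key-value store:
# key = (table ID, primary key), encoded so byte order matches key order.
import struct

def encode_key(table_id: int, pk: int) -> bytes:
    # Big-endian fixed-width integers compare correctly as raw bytes, so
    # a primary-key range scan becomes a contiguous key-range scan.
    return struct.pack(">II", table_id, pk)

USERS, ORDERS = 1, 2
kv = {
    encode_key(USERS, 42): b"alice",
    encode_key(ORDERS, 7): b"order for user 42",
    encode_key(USERS, 7): b"bob",
}

# "SELECT * FROM users" becomes a scan of the users key range; related
# tables and indexes may live in other key ranges, on other nodes.
users = [v for k, v in sorted(kv.items()) if k[:4] == struct.pack(">I", USERS)]
print(users)  # [b'bob', b'alice'] -- ordered by primary key
```

<p>Because the encoding is order-preserving, range scans on the primary key stay within one contiguous key range, but a join or secondary-index lookup touches other key ranges that may sit on different nodes, which is where the extra internal network hops come from.</p>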
The SQL engine is adjusted accordingly, distributing queries where possible.<p><img alt="key value store diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/de7bc03c-8f80-4384-383a-0d38dde0d800/public><p>In my view, the relational data model (or, your typical PostgreSQL app) is not well-served by using a distributed key-value store underneath. Related tables and indexes are not necessarily stored together, meaning typical operations such as joins and evaluating foreign keys or even simple index lookups might incur an excessive number of internal network hops. The relatively strong transactional guarantees that involve additional locks and coordination can also become a drag on performance.<p>In comparison to PostgreSQL or Citus, performance and efficiency are often <a href=https://gigaom.com/report/transaction-processing-price-performance-testing/>disappointing</a>. However, these systems offer much richer (PostgreSQL-like) functionality than existing key-value stores, and better scalability than consensus stores like etcd, so they can be a great alternative for those.<p><strong>Pros:</strong><ul><li>Good read and write availability (shard-level failover)<li>Single table, single key operations scale well<li>No additional data modeling steps or snapshot isolation concessions</ul><p><strong>Cons:</strong><ul><li>Many internal operations incur high latency<li>No local joins in current implementations<li>Not actually PostgreSQL, and less mature and optimized</ul><p>💡 <strong>General guideline:</strong> Just use PostgreSQL 😉 For simple applications, the availability and scalability benefits can be useful.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>PostgreSQL can be distributed at different layers. Each architecture can introduce severe trade-offs. 
Almost nothing comes for free.<p>When deciding on the database architecture, keep asking yourself:<ul><li>What do I really want?<li>Which architecture achieves that?<li>What are the downsides?<li>What can my application tolerate? (can I change my application?)</ul><p>Even with state-of-the-art tools, deploying a distributed database system is never a solved problem, and perhaps never will be. You will need to spend some time understanding the trade-offs. I hope this blog post will help.<p>If you’re still feeling a bit lost, <a href=https://www.crunchydata.com/contact>our PostgreSQL experts</a> at Crunchy Data will be happy to help you pick the right architecture for your application. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ Marco.Slot@crunchydata.com (Marco Slot) ]]></author>
<dc:creator><![CDATA[ Marco Slot ]]></dc:creator>
<guid isPermalink="false">99d1188374afe4c88e5bc72203932eff398fa46f50b8d397efe201165120c3d7</guid>
<pubDate>Mon, 08 Jan 2024 08:00:00 EST</pubDate>
<dc:date>2024-01-08T13:00:00.000Z</dc:date>
<atom:updated>2024-01-08T13:00:00.000Z</atom:updated></item></channel></rss>