<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>David Christensen | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/david-christensen/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/david-christensen</link>
<image><url>https://www.crunchydata.com/build/_assets/david-christensen.png-XJU5DKX6.webp</url>
<title>David Christensen | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/david-christensen</link>
<width>512</width>
<height>512</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 17 Oct 2025 08:00:00 EDT</pubDate>
<dc:date>2025-10-17T12:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Is Postgres Read Heavy or Write Heavy? (And Why You Should Care) ]]></title>
<link>https://www.crunchydata.com/blog/is-postgres-read-heavy-or-write-heavy-and-why-should-you-care</link>
<description><![CDATA[ A query to find out if Postgres is read heavy or write heavy and tips for optimizing Postgres for both read and write workloads. ]]></description>
<content:encoded><![CDATA[ <p>When someone asks about Postgres tuning, I always say “it depends”. What “it” is can vary widely, but one major factor is the read and write traffic of a Postgres database. Today let’s dig into how to tell whether your Postgres database is read heavy or write heavy.<p>Of course, write heavy or read heavy can largely be inferred from your business logic. Social media app - read heavy. IoT logger - write heavy. But many of us have mixed-use applications. Knowing your read and write load can help you make other decisions about tuning and architecture priorities with your Postgres fleet.<p>Understanding whether a Postgres database is read-heavy or write-heavy is paramount for effective database administration and performance tuning. For example, a read-heavy database might benefit more from extensive <a href=https://www.crunchydata.com/blog/postgres-indexes-for-newbies>indexing</a>, query caching, and read replicas, while a write-heavy database might require optimizations like faster storage, efficient WAL (Write-Ahead Log) management, table design considerations (such as <a href=https://www.crunchydata.com/blog/postgres-performance-boost-hot-updates-and-fill-factor>fill factor</a> and autovacuum tuning), and careful attention to transaction isolation levels.<p>By reviewing a detailed read/write estimation, you can gain valuable insights into the underlying workload characteristics, enabling informed decisions for optimizing resource allocation and improving overall database performance.<h3 id=read-and-writes-are-not-really-equal><a href=#read-and-writes-are-not-really-equal>Reads and writes are not really equal</a></h3><p>The challenge in looking at Postgres this way is that reads and writes are not equal in cost.<ul><li>Postgres reads data in whole 8kB units, called blocks on disk, or pages once they’re in shared memory. The cost of reading is much lower than writing. 
Since the most frequently used data generally resides in the shared buffers or the OS cache, many queries never need additional <a href=https://www.crunchydata.com/blog/understanding-postgres-iops>physical IO</a> and can return results just from memory.<li>Postgres writes, by comparison, are a little more complicated. When changing an individual tuple, Postgres needs to write data to the WAL describing what happened. If this is the first write after a checkpoint, this could include a copy of the full data page. It can also involve writing additional data for any index changes, <a href=https://www.crunchydata.com/blog/postgres-toast-the-greatest-thing-since-sliced-bread>toast</a> table changes, or toast table indexes. This is the direct write cost of a single database change, which is done before the commit is accepted. There is also the IO cost for writing out all dirty page buffers, but this is generally done in the background by the background writer. In addition to these write IO costs, the data pages need to be in memory in order to make changes, so every write operation has potential read overhead as well.</ul><p>That said, I’ve put together a query using internal table statistics that loosely estimates read and write load.<h2 id=query-postgres-for-read-and-write-traffic><a href=#query-postgres-for-read-and-write-traffic>Query Postgres for read and write traffic</a></h2><p>This query leverages Postgres’ internal metadata to provide an estimate of the number of disk pages (or blocks) that have been directly affected by changes to a given number of tuples (rows). 
This estimation is crucial for understanding the read/write profile of a database, which in turn can inform optimization strategies (see below).<p>The query's logic is broken down into several Common Table Expressions (<a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral#what-is-a-common-table-expression-cte>CTE</a>s) to enhance readability and modularity:<p><strong>ratio_target CTE:</strong><p>This initial CTE establishes a predefined threshold. It allows the user to specify a target ratio of read pages per write page. This ratio serves as the primary criterion for classifying a database or table as either read-heavy or write-heavy.<p>I’ve set the ratio in the query to 5 reads : 1 write, which means that roughly 20% of the database activity would be writes. This is a bit of a fudge factor, and the exact definition of what makes up a write-heavy database may differ. If you set it to 100, the query would consider 100 reads equivalent to 1 write (1%); adjust this value to tweak the classification thresholds.<p>By defining this threshold explicitly, the query provides a flexible mechanism for evaluating different performance characteristics based on specific application requirements. For instance, a higher ratio_target might indicate a preference for read-intensive operations, while a lower one might suggest a workload dominated by writes.<p><strong>table_list CTE:</strong><p>This CTE is responsible for the core calculations necessary to determine the read and write page counts. It performs the following key functions:<p><strong>Total read pages:</strong><p>It calculates the total number of pages that are typically read for the tables under consideration. 
This metric is fundamental to assessing the read demand placed on the database.<p><strong>Estimated changed pages for writes:</strong><p>To estimate the number of pages affected by write operations, the table_list CTE utilizes the existing relpages (total pages) and reltuples (total tuples) statistics from the pg_class system catalog. By calculating the ratio of relpages to reltuples, the query derives an estimated density of tuples per page. This density is then applied to the observed number of tuple writes to project how many physical pages were likely impacted by these write operations. This approach provides a practical way to infer disk I/O related to writes without needing to track every individual page modification.<p><strong>Final comparison and classification</strong><p>After the table_list CTE has computed the estimated read pages and write-affected pages, the final stage of the query involves a comparative analysis. The calculated number of read pages is directly compared against the estimated number of write pages. Based on this comparison, and in conjunction with the ratio_target defined earlier, the query then classifies each table (or the database as a whole) into one of several categories. These categories typically include:<ul><li><strong>Read-heavy:</strong> This classification is applied when the proportion of read pages significantly outweighs the write pages, based on the defined ratio_target.<li><strong>Write-heavy:</strong> Conversely, this classification indicates that write operations are more prevalent, with a higher number of write-affected pages relative to read pages.<li><strong>Other scenarios:</strong> The query can also identify other scenarios, such as balanced workloads where read and write operations are roughly equivalent, or cases where the data volume is too low to make a definitive classification.</ul><p>The read/write Postgres query:<pre><code class=language-sql>WITH
ratio_target AS (SELECT 5 AS ratio),
table_list AS (SELECT
 s.schemaname,
 s.relname AS table_name,
 -- Sum of heap and index blocks read from disk (from pg_statio_user_tables)
 si.heap_blks_read + si.idx_blks_read AS blocks_read,
 -- Sum of all write operations (tuples) (from pg_stat_user_tables)
 s.n_tup_ins + s.n_tup_upd + s.n_tup_del AS write_tuples,
 -- Estimate pages touched by writes: tuples written times the pages-per-tuple density;
 -- greatest() guards against reltuples being 0 or -1 (table never analyzed)
 relpages * (s.n_tup_ins + s.n_tup_upd + s.n_tup_del) / greatest(reltuples, 1) AS blocks_write
FROM
 -- Join the user tables statistics view with the I/O statistics view
 pg_stat_user_tables AS s
JOIN pg_statio_user_tables AS si ON s.relid = si.relid
JOIN pg_class c ON c.oid = s.relid
WHERE
 -- Filter to only show tables that have had some form of read or write activity
(s.n_tup_ins + s.n_tup_upd + s.n_tup_del) > 0
OR
 (si.heap_blks_read + si.idx_blks_read) > 0
 )
SELECT *,
 CASE
   -- Handle case with no activity
   WHEN blocks_read = 0 and blocks_write = 0 THEN
     'No Activity'
   -- Handle write-heavy tables
   WHEN blocks_write * ratio > blocks_read THEN
     CASE
       WHEN blocks_read = 0 THEN 'Write-Only'
       ELSE
         ROUND(blocks_write :: numeric / blocks_read :: numeric, 1)::text || ':1 (Write-Heavy)'
     END
   -- Handle read-heavy tables
   WHEN blocks_read > blocks_write * ratio THEN
     CASE
       WHEN blocks_write = 0 THEN 'Read-Only'
       ELSE
         '1:' || ROUND(blocks_read::numeric / blocks_write :: numeric, 1)::text || ' (Read-Heavy)'
     END
   -- Handle balanced tables
   ELSE
     '1:1 (Balanced)'
 END AS activity_ratio
FROM table_list, ratio_target
ORDER BY
 -- Order by the most active tables first (sum of all operations)
 (blocks_read + blocks_write) DESC;
</code></pre><p>Results will look something like this:<pre><code class=language-sql>schemaname |  table_name   | blocks_read | write_tuples | blocks_write | ratio |     activity_ratio
------------+---------------+-------------+--------------+--------------+-------+------------------------
public     | audit_logs    |           2 |      1500000 |        18519 |     5 | 9259.5:1 (Write-Heavy)
public     | orders        |           8 |            4 |            0 |     5 | Read-Only
public     | articles      |           2 |           10 |            1 |     5 | 0.5:1 (Write-Heavy)
public     | user_profiles |           1 |            3 |            0 |     5 | Read-Only
</code></pre><h3 id=pg_stat_statements><a href=#pg_stat_statements>pg_stat_statements</a></h3><p>Another way to look at read and write traffic is through the pg_stat_statements extension. It aggregates statistics for every unique query run on your database, including the number of rows read and written.<p>While the earlier query gives a better per-table picture of the workload, pg_stat_statements is a good cross-check on overall traffic volume.<pre><code class=language-sql>SELECT
  SUM(shared_blks_hit) AS cache_hits,
  SUM(shared_blks_read) AS disk_reads,
  SUM(CASE WHEN query ILIKE 'SELECT%' THEN rows ELSE 0 END) AS rows_read,
  SUM(CASE WHEN query ILIKE 'INSERT%' OR query ILIKE 'UPDATE%' OR query ILIKE 'DELETE%' THEN rows ELSE 0 END) AS rows_written
FROM pg_stat_statements;

 cache_hits | disk_reads | rows_read | rows_written
------------+------------+-----------+--------------
      27586 |        998 |    443628 |           30
(1 row)
</code></pre><h2 id=performance-tuning-for-high-write-traffic-in-postgres><a href=#performance-tuning-for-high-write-traffic-in-postgres>Performance Tuning for High Write Traffic in Postgres</a></h2><p>For write-heavy systems, the bottleneck is often <a href=https://www.crunchydata.com/blog/understanding-postgres-iops>I/O</a> and transaction throughput. You're constantly writing to the disk, which is slower than reading from memory.<ol><li>Faster Storage: The most direct way to improve write performance is to use faster storage, such as NVMe SSDs, and provision more I/O operations per second (IOPS).<li>More RAM: While reads benefit from RAM for caching too, writes also benefit from a larger shared_buffers pool, which can hold more dirty pages before they need to be flushed to disk.<li>I/O burst systems: Many cloud-based systems come with burstable I/O out of the box, so checking these limits may also be helpful.<li>Minimize Indexes: While essential for reads, every index needs to be updated during a write operation. Over-indexing can significantly slow down writes, so remove unused indexes.<li><a href=https://www.crunchydata.com/blog/postgres-performance-boost-hot-updates-and-fill-factor>Utilizing HOT updates</a>: Postgres can skip index maintenance for updates that don't change any indexed columns, so adjusting fill factor to take advantage of this could be worth looking into.<li>Tune the WAL (Write-Ahead Log): The WAL is where every change is recorded before it's applied to the main database files. Tuning parameters like wal_buffers can reduce the number of disk flushes and improve write performance.<li>Optimize Checkpoints: Checkpoints sync the data from shared memory to disk. Frequent or large checkpoints can cause I/O spikes. 
Adjusting checkpoint_timeout and checkpoint_completion_target can smooth out these events.</ol><h2 id=performance-tuning-for-read-traffic><a href=#performance-tuning-for-read-traffic>Performance tuning for read traffic</a></h2><p>For <strong>read-heavy</strong> systems, the primary goal is to get data to the user as quickly as possible, ideally keeping as much data as possible in the buffer cache so Postgres is not reading from disk.<ol><li>Effective Caching: Ensure your shared_buffers and effective_cache_size are configured to take advantage of available RAM. This lets Postgres keep frequently accessed data in memory, avoiding costly disk reads.<li>Optimize Queries and Indexes: Use <a href=https://www.crunchydata.com/blog/get-started-with-explain-analyze>EXPLAIN ANALYZE</a> to pinpoint slow SELECT queries and add indexes on columns used in WHERE clauses, JOIN conditions, and ORDER BY statements. Remember, indexes speed up lookups at the cost of slower writes.<li>Scaling out with read replicas: A read replica is a copy of your primary database that's kept in sync asynchronously. All write operations go to the primary, but you can distribute read queries across one or more replicas. This distributes the read load, offloads traffic from your primary server, and can dramatically improve read throughput without impacting your write performance.<li><a href=https://www.crunchydata.com/blog/get-excited-about-postgres-18>Postgres 18 now has asynchronous I/O</a>, which should mean better read performance than traditional synchronous reads. Upgrade soon if you can.</ol><h2 id=most-postgres-databases-are-read-heavy><a href=#most-postgres-databases-are-read-heavy>Most Postgres databases are read heavy</a></h2><p>Most Postgres databases are going to be far more read heavy than write heavy. Based on experience, I'd estimate that anything below roughly 10:1 reads to writes is starting to get write heavy. 
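<p>As a quick cluster-wide sanity check, the cumulative counters in pg_stat_database can approximate the same ratio. Note this sketch counts rows touched rather than pages, so treat the result as a rough signal only:<pre><code class=language-sql>-- Rough database-wide read:write ratio from cumulative row counters
SELECT
  tup_returned + tup_fetched AS rows_read,
  tup_inserted + tup_updated + tup_deleted AS rows_written,
  ROUND((tup_returned + tup_fetched)::numeric /
        GREATEST(tup_inserted + tup_updated + tup_deleted, 1), 1) AS reads_per_write
FROM pg_stat_database
WHERE datname = current_database();
</code></pre>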
Of course, there are outliers to this.<p>The right scaling strategy depends entirely on your workload. By proactively monitoring the internal statistics in the Postgres catalogs, you can make informed decisions that will keep your database healthy and your application fast.<p>Co-authored with <a href=https://www.crunchydata.com/blog/author/elizabeth-christensen>Elizabeth Christensen</a> ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ David.Christensen@crunchydata.com (David Christensen) ]]></author>
<dc:creator><![CDATA[ David Christensen ]]></dc:creator>
<guid isPermalink="false">95162ab93976085baeaa537930e935ea25a3ac47d272dd1391a4d2d30e15a2ac</guid>
<pubDate>Fri, 17 Oct 2025 08:00:00 EDT</pubDate>
<dc:date>2025-10-17T12:00:00.000Z</dc:date>
<atom:updated>2025-10-17T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Building PostgreSQL Extensions: Dropping Extensions and Cleanup ]]></title>
<link>https://www.crunchydata.com/blog/building-postgresql-extensions-dropping-extensions-and-cleanup</link>
<description><![CDATA[ David shares some tricks for cleaning up after dropped extensions. He goes through how to create an event trigger and function for cleaning up a pg_cron task built by an extension. ]]></description>
<content:encoded><![CDATA[ <p>I recently created a Postgres extension which utilizes the <code>pg_cron</code> extension to schedule recurring activities using the <code>cron.schedule()</code> function. Everything worked great. The only problem was that when I dropped my extension, it left the cron job scheduled, which resulted in regular errors:<pre><code class=language-bash>2024-04-06 16:00:00.026 EST [1548187] LOG:  cron job 2 starting: SELECT bridge_stats.update_stats('55 minutes', false)
2024-04-06 16:00:00.047 EST [1580698] ERROR:  schema "bridge_stats" does not exist at character 8
2024-04-06 16:00:00.047 EST [1580698] STATEMENT:  SELECT bridge_stats.update_stats('55 minutes', false)
</code></pre><p>If you look in the <code>cron.job</code> table, you can see the SQL for the cron job is still present, even though the extension/schema isn’t:<pre><code class=language-text>select schedule, command, jobname from cron.job;

schedule  |                        command                        |             jobname
-----------+-------------------------------------------------------+----------------------------------
0 0 * * 0 | SELECT bridge_stats.weekly_stats_update()             | bridge-stats-weekly-maintenance
0 * * * * | SELECT bridge_stats.update_stats('55 minutes', false) | bridge-stats-hourly-snapshot
(2 rows)
</code></pre><p>This got me thinking: how can you create a Postgres extension that can clean up after itself for cases like this?<h2 id=how-extension-creationcleanup-works><a href=#how-extension-creationcleanup-works>How Extension Creation/Cleanup works</a></h2><p>If you’ve created or used an extension in Postgres (such as <code>pg_partman</code>, PostGIS, pg_kaboom, etc) you may know that every extension in PostgreSQL has a SQL file that gets run as part of its creation.<p>This SQL file may create database objects for you, such as schemas, tables, functions, etc. When database objects are created in the context of a <code>CREATE EXTENSION</code> command, they have an object dependency created against the underlying <code>pg_extension</code> object. (These are stored in the <code>pg_depend</code> system catalog, if you are interested in the more fine-grained details.)<p>When Postgres removes an extension (via the <code>DROP EXTENSION</code> command), it will also remove any dependent objects that were created for this extension. (This is true for any dependencies, all of which are tracked in a similar way.)<p>This is how a simple command like <code>DROP EXTENSION</code> can remove dozens or hundreds of associated objects.<h2 id=why-didnt-this-cleanup><a href=#why-didnt-this-cleanup>Why didn’t this cleanup?</a></h2><p>You may be asking why this didn’t clean up the underlying <code>cron</code> jobs, since Postgres is clearly able to track the individual database objects associated with a given extension.<p>This is because dependencies are tracked at the database object level (basically, entries in the system catalogs that depend on each other). The scheduled jobs are just rows in the <code>cron.job</code> table, not catalog objects, so the dependency machinery offers no general-purpose cleanup for them.<h2 id=so-how-to-clean-up><a href=#so-how-to-clean-up>So how to clean up?</a></h2><p>We would like to be able to clean up these rows that were created by our extension. 
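<p>As a quick sketch of what the dependency machinery does track, you can list the catalog objects recorded in <code>pg_depend</code> as belonging to an extension (substitute any installed extension name for <code>pg_cron</code>); the scheduled rows in <code>cron.job</code> will not appear here:<pre><code class=language-sql>-- Objects that DROP EXTENSION would remove for a given extension
SELECT pg_describe_object(d.classid, d.objid, d.objsubid) AS object
FROM pg_depend d
JOIN pg_extension e ON d.refobjid = e.oid
WHERE d.refclassid = 'pg_extension'::regclass
  AND d.deptype = 'e'
  AND e.extname = 'pg_cron';
</code></pre>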
We don’t want to spam the user’s logs with unnecessary errors, particularly since we know exactly what we did to create the external rows.<p>In an ideal world, the extension itself could register a function that could be called when it’s being cleaned up. However, we do not live in an ideal world. (Not to mention there is probably a 125-email thread on the <code>pgsql-hackers</code> mailing list as to why that’s a bad idea; finding it is left as an exercise for the reader…)<p>Since we don’t have that capability, the general advice on the interwebs and in the Postgres docs is to use an <code>EVENT TRIGGER</code>.<h2 id=attempt-1-create-event-trigger><a href=#attempt-1-create-event-trigger>Attempt 1: <code>CREATE EVENT TRIGGER</code></a></h2><p>An event trigger is a function that runs when special “events” occur in a database. The current event trigger types are <code>ddl_command_start</code>, <code>ddl_command_end</code>, <code>sql_drop</code>, and <code>rewrite_table</code>. These let you take special action inside the database and run code when a given event occurs.<p>Since we are trying to run some code when this extension is dropped, clearly we want the <code>sql_drop</code> event trigger type.<p>Let’s take an initial stab at our cleanup function, created in our extension’s SQL file:<pre><code class=language-sql>CREATE FUNCTION bridge_stats.cleanup() RETURNS event_trigger AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects() LOOP
        IF obj.object_identity = 'bridge_stats' AND obj.object_type = 'extension' THEN
            PERFORM cron.unschedule('bridge-stats-weekly-maintenance');
            PERFORM cron.unschedule('bridge-stats-hourly-snapshot');
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER bridge_stats_cleanup ON sql_drop
WHEN TAG IN ('DROP EXTENSION')
EXECUTE FUNCTION bridge_stats.cleanup();
</code></pre><p>This seems like a straightforward attempt. We have created a function and an event trigger pair that end up being run any time a <code>DROP EXTENSION</code> is run. Our <code>bridge_stats.cleanup()</code> function in turn verifies that the extension itself is in the list of the dropped objects (returned by the <code>pg_event_trigger_dropped_objects()</code> function), and if it is, then we run the appropriate commands to unschedule our cron jobs. Easy-peasy.<h2 id=lets-go-ahead-and-verify><a href=#lets-go-ahead-and-verify>Let’s go ahead and verify</a></h2><p>“That was easy,” I say to myself, closing my text editor of choice (<code>emacs</code>, of course), and open my terminal to verify:<pre><code class=language-text>postgres=# create extension bridge_stats;
CREATE EXTENSION

postgres=# drop extension bridge_stats;
DROP EXTENSION

postgres=# select schedule, command, jobname from cron.job;

schedule  |                        command                        |             jobname
-----------+-------------------------------------------------------+----------------------------------
0 0 * * 0 | SELECT bridge_stats.weekly_stats_update()             | bridge-stats-weekly-maintenance
0 * * * * | SELECT bridge_stats.update_stats('55 minutes', false) | bridge-stats-hourly-snapshot
(2 rows)
</code></pre><p>The sweet smell of succ—oh wait. That didn’t work.<p>Adding logging (à la <code>RAISE NOTICE 'BLARGH'</code>), it appears that my event trigger was not even being called.<p>After considering a bit, it occurred to me that this wasn’t working because the event trigger must have been dropped along with the rest of the extension’s objects, so it no longer existed when the <code>sql_drop</code> event fired.<p>Perhaps the <code>sql_drop</code> event was run too late in the process? What about another one of the event trigger types?<h2 id=attempt-2-create-event-trigger-2-the-what-the-heckening><a href=#attempt-2-create-event-trigger-2-the-what-the-heckening>Attempt 2: <code>CREATE EVENT TRIGGER 2: the what the heckening</code></a></h2><p>Looking at other options in the event trigger space, what are we left with?<ul><li><code>ddl_command_start</code> - run at the start of a DDL command<li><code>ddl_command_end</code> - run at the end of a DDL command<li><code>rewrite_table</code> - run when a table is rewritten</ul><p>Clearly <code>rewrite_table</code> is off the, uh, err—you know—menu. Reading the docs for <code>ddl_command_start</code> and <code>ddl_command_end</code> shows that they are triggered before and after a DDL command is run.<p>“Ahh,” I exclaim, quickly transforming my existing event trigger into one based around the <code>ddl_command_start</code> event, since <code>ddl_command_end</code> runs after even <code>sql_drop</code>, so that one was out:<pre><code class=language-sql>CREATE FUNCTION bridge_stats.cleanup() RETURNS event_trigger AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        IF obj.object_identity = 'bridge_stats' AND obj.object_type = 'extension' THEN
            PERFORM cron.unschedule('bridge-stats-weekly-maintenance');
            PERFORM cron.unschedule('bridge-stats-hourly-snapshot');
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER bridge_stats_cleanup ON ddl_command_start
WHEN TAG IN ('DROP EXTENSION')
EXECUTE FUNCTION bridge_stats.cleanup();
</code></pre><p>You can see that I’ve changed a couple of things relative to the previous version:<ul><li>I am using <code>pg_event_trigger_ddl_commands()</code> instead of <code>pg_event_trigger_dropped_objects()</code>; simple API change for this specific filter.<li>I changed the <code>ON</code> action of the <code>CREATE EVENT TRIGGER</code> statement to be <code>ddl_command_start</code></ul><h2 id=verification-part-deux><a href=#verification-part-deux>Verification, part deux</a></h2><p>And now, on to verification:<pre><code class=language-text>postgres=# create extension bridge_stats;
CREATE EXTENSION

postgres=# drop extension bridge_stats;
ERROR:  pg_event_trigger_ddl_commands() can only be called in an event trigger function
CONTEXT:  PL/pgSQL function bridge_stats.cleanup() line 5 at FOR over SELECT rows
</code></pre><p>Cue the reaction gif where I am puzzled at the turn of events.<p>This function is now clearly getting called, since it is giving me an error message related to the specific function I’m calling. It is also clearly an event trigger, since it’s literally a function returning <code>event_trigger</code> and it’s been executed by the event trigger created by <code>CREATE EVENT TRIGGER</code>.<p>Well, for whatever reason, it would empirically appear that there is something odd going on with using <code>ddl_command_start</code> in this way; perhaps something with running this on a <code>DROP</code> command? In any case, rather than trying to debug this clearly odd behavior, I started thinking about a different approach.<h2 id=attempt-3-a-heros-journey><a href=#attempt-3-a-heros-journey>Attempt 3: A Hero’s Journey</a></h2><p>So if we recall my explanation about the dependencies inside Postgres and the objects created by extensions, we can see that the <code>DROP EXTENSION</code> was preemptively deleting my event trigger and the underlying function, meaning that it didn’t exist at the time the <code>sql_drop</code> event was issued. What if there were some way to break that dependency so the event trigger would still exist to be fired? Then it could clean itself up after it was done.<p>This led me down the path to <code>ALTER EXTENSION</code>.<p><code>ALTER EXTENSION</code> lets you dynamically add or remove dependencies between a specific extension and other database objects. While database objects created during <code>CREATE EXTENSION</code> are automatically associated with the creating extension, perhaps we could use this to our advantage.<p>With blazing eyes and a new tool in my hand, I made the following adjustments to my original attempt:<pre><code class=language-sql>CREATE FUNCTION bridge_stats.cleanup() RETURNS event_trigger AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects() LOOP
        IF obj.object_identity = 'bridge_stats' AND obj.object_type = 'extension' THEN
            PERFORM cron.unschedule('bridge-stats-weekly-maintenance');
            PERFORM cron.unschedule('bridge-stats-hourly-snapshot');
        END IF;
    END LOOP;
    DROP SCHEMA bridge_stats CASCADE;  -- the only new line in this function!
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER bridge_stats_cleanup ON sql_drop
WHEN TAG IN ('DROP EXTENSION')
EXECUTE FUNCTION bridge_stats.cleanup();

ALTER EXTENSION bridge_stats DROP EVENT TRIGGER bridge_stats_cleanup;
ALTER EXTENSION bridge_stats DROP FUNCTION bridge_stats.cleanup();
ALTER EXTENSION bridge_stats DROP SCHEMA bridge_stats;
</code></pre><p>As you can see, I have added <code>ALTER EXTENSION</code> commands to exclude the event trigger, the underlying function, and the owning schema from being owned by the extension.<p>I also added a <code>DROP SCHEMA</code> inside the <code>cleanup()</code> function to ensure that the objects I manually detached from the extension would still get cleaned up.<p>Since everything else in the <code>bridge_stats</code> schema would get cleaned up by the <code>DROP EXTENSION</code> command, this would serve to finish the job, since a function can successfully delete itself in Postgres. (It’s true!)<h2 id=final-verification><a href=#final-verification>Final verification</a></h2><p>So of course, we need to verify that everything works as expected:<pre><code class=language-text>postgres=# create extension bridge_stats;
CREATE EXTENSION

postgres=# drop extension bridge_stats;
DROP EXTENSION

postgres=# select schedule, command, jobname from cron.job;
 schedule | command | jobname
----------+---------+---------
(0 rows)
</code></pre><p>Success!<h2 id=tldr><a href=#tldr>TL;DR</a></h2><p>The top-down takeaway here is that if you want to run some sort of cleanup action within a Postgres extension, you will have to:<ul><li>Create your event trigger and associated function<li><code>ALTER EXTENSION DROP</code> the event trigger, the function, and the schema<li>Ensure the cleanup function removes the objects you detached after it finishes its other cleanup work.</ul><p>I hope that my experience of figuring out “just write an event trigger” helps someone else! ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ David.Christensen@crunchydata.com (David Christensen) ]]></author>
<dc:creator><![CDATA[ David Christensen ]]></dc:creator>
<guid isPermalink="false">eb9bc2c67e05e4c84453b74a19e305e763b82158e8e5c10a8338cf1d234cd06b</guid>
<pubDate>Wed, 10 Apr 2024 09:00:00 EDT</pubDate>
<dc:date>2024-04-10T13:00:00.000Z</dc:date>
<atom:updated>2024-04-10T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Tuple shuffling: Postgres CTEs for Moving and Deleting Table Data ]]></title>
<link>https://www.crunchydata.com/blog/tuple-shuffling-postgres-ctes-for-update-and-delete-table-data</link>
<description><![CDATA[ David has some tricks and sample code for using CTEs to manipulate data and move things around inside your database. This can be especially handy for sorting, moving, or labeling data and moving it to an archive. ]]></description>
<content:encoded><![CDATA[ <p>Recently we published an article about some of the <a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral>best SQL subquery tools</a> and we were talking about all the cool things you can do with CTEs. One thing that doesn’t get mentioned nearly enough is using CTEs to move data around inside your database. Did you know you can use CTEs for tuple shuffling? Using CTEs to update, delete, and insert data can be extremely efficient and safe for your Postgres database.<p>PostgreSQL 15 introduced the <a href=https://www.crunchydata.com/blog/a-look-at-postgres-15-merge-command-with-examples>MERGE</a> statement, which covers some similar use cases. However, there are cases MERGE cannot cover, and if you need to support PostgreSQL versions from before it was introduced, this technique may come in handy.<h2 id=deleting-rows-and-inserting-to-another-table><a href=#deleting-rows-and-inserting-to-another-table>Deleting rows and inserting to another table</a></h2><p>A common use case where this technique comes in handy is moving rows from one table to another in a single statement. Imagine that you have a schema with a source table and an archive table, and you want to move rows from the source table to the archive table once they have been inactive for a year.<p>This can be accomplished via something like:<pre><code class=language-sql>WITH
  deleted AS (
    DELETE FROM table_a
    WHERE
      last_modified &#60 now () - interval '1 year' RETURNING *
  )
INSERT INTO
  archive
SELECT
  *
FROM
  deleted;
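-- Assumes archive shares table_a's column layout; a compatible table
-- could hypothetically be created with: CREATE TABLE archive (LIKE table_a);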
</code></pre><p>This straightforward approach simply returns all rows from table_a which were deleted, then inserts them into our archive table (which is assumed to have the same structure as table_a).<p>What happens if this transaction gets interrupted? Fortunately, due to the magic of MVCC, these actions all take place in a single transaction. You can of course use explicit transaction control if you split this up into multiple statements run interactively or by an app, but having it be a single statement means you are guaranteed a single transaction from the get-go.<h2 id=filtering-data><a href=#filtering-data>Filtering data</a></h2><p>Sometimes you might want to delete all of the rows, but only archive some of them; you can accomplish this by putting the qualification in the WHERE clause of the INSERT statement, for example:<pre><code class=language-sql>WITH
  deleted AS (
    DELETE FROM table_a
    WHERE
      last_modified &#60 now () - interval '1 year' RETURNING *
  )
INSERT INTO
  archive
SELECT
  *
FROM
  deleted
WHERE
  priority = 'important';
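-- Caveat: every row older than a year is still deleted; only the rows
-- marked 'important' are archived. To delete only the rows being
-- archived, move the filter into the DELETE instead (a sketch):
--   DELETE FROM table_a
--   WHERE last_modified &#60 now() - interval '1 year'
--     AND priority = 'important' RETURNING *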
</code></pre><p>Here we apply a filter to the rows which were returned by the DELETE statement and only archive those which were already marked as important. For a deeper dive into <a href=https://www.crunchydata.com/blog/simulating-update-or-delete-with-limit-in-postgres-ctes-to-the-rescue>adding filters with LIMIT to UPDATE and DELETE</a>, see my previous post.<h2 id=more-complicated-examples><a href=#more-complicated-examples>More complicated examples</a></h2><p>Diving in deeper, what if we had multiple tables to archive from, each with the same structure? Imagine that we want to track each record’s source table, and have modified the archive table to include this as its first field.<p>We can use more CTE clauses and still handle this in one go:<pre><code class=language-sql>WITH
  deleted_a AS (
    DELETE FROM table_a
    WHERE
      last_modified &#60 now () - interval '1 year' RETURNING *
  ),
  deleted_b AS (
    DELETE FROM table_b
    WHERE
      last_modified &#60 now () - interval '1 year' RETURNING *
  ),
  deleted_c AS (
    DELETE FROM table_c
    WHERE
      last_modified &#60 now () - interval '1 year' RETURNING *
  )
INSERT INTO
  archive
SELECT
  'table_a',
  *
FROM
  deleted_a
UNION ALL
SELECT
  'table_b',
  *
FROM
  deleted_b
UNION ALL
SELECT
  'table_c',
  *
FROM
  deleted_c;
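-- Assumes archive now leads with a text column recording the source
-- table name; hypothetically:
--   CREATE TABLE archive (source_table text, LIKE table_a);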
</code></pre><p>Since our INSERT statement includes all of the DELETE clauses, we will be pulling in all of the rows that were deleted from each of them and inserting them into a single table.<h2 id=update><a href=#update>Update</a></h2><p>This works for UPDATE as well. UPDATE RETURNING will return the contents of each modified row, so we can consolidate more complex logic into a single query, for instance:<pre><code class=language-sql>WITH
  target_accounts AS (
    SELECT
      id
    FROM
      accounts
    WHERE
      type = 'savings'
  ),
  balance_update AS (
    UPDATE balances
    SET
      amount = amount + 100
    FROM
      target_accounts
    WHERE
      account_id = target_accounts.id RETURNING *
  )
INSERT INTO
  awards (account_id, award)
SELECT
  account_id,
  'met savings goal'
FROM
  balance_update
WHERE
  amount >= 1000;
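-- Each savings balance is incremented by 100 in one pass; an award row
-- is inserted only for balances whose updated amount is at least 1000.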
</code></pre><h2 id=partitioning><a href=#partitioning>Partitioning</a></h2><p>CTEs can be used to do more complicated things, like split up a table for partitioning:<pre><code class=language-sql>WITH
  source_rows AS (
    DELETE FROM movies_unsorted RETURNING *
  ),
  action_movie_rows AS (
    INSERT INTO
      action_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'action' RETURNING id
  ),
  comedy_movie_rows AS (
    INSERT INTO
      comedy_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'comedy' RETURNING id
  ),
  romance_movie_rows AS (
    INSERT INTO
      romance_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'romance' RETURNING id
  ),
  horror_movie_rows AS (
    INSERT INTO
      horror_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'horror' RETURNING id
  )
INSERT INTO
  other_movies
SELECT
  *
FROM
  source_rows
WHERE
  id NOT IN (
    SELECT
      id
    FROM
      action_movie_rows
    UNION ALL
    SELECT
      id
    FROM
      comedy_movie_rows
    UNION ALL
    SELECT
      id
    FROM
      romance_movie_rows
    UNION ALL
    SELECT
      id
    FROM
      horror_movie_rows
  );
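  -- Assumes each *_movies table shares movies_unsorted's structure and
  -- that id values are unique in the source table.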
</code></pre><p>With this example, we delete all of the source rows from “movies_unsorted”, then use multiple CTE clauses to categorize the data, inserting each row into the partition that corresponds to the movie type determined by our query. The final catch-all both provides a place to insert any unclassified data and forces the evaluation of the underlying CTEs (and thus actually performs the INSERTs into the appropriate tables).<h2 id=summary><a href=#summary>Summary</a></h2><p>CTEs are an important part of your toolkit and can be used for data manipulation and more complex tuple routing. Being able to name individual query pieces - including data-modifying ones like INSERT, UPDATE, or DELETE - and treat each as an independent tuple source unlocks a lot of power and can be a source of creativity and problem solving. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ David.Christensen@crunchydata.com (David Christensen) ]]></author>
<dc:creator><![CDATA[ David Christensen ]]></dc:creator>
<guid isPermalink="false">6d54cc1a21b2e97a1de13a404ccb6b45aeaddcaabe36b0b02b35b44d0fa0e981</guid>
<pubDate>Thu, 02 Nov 2023 09:00:00 EDT</pubDate>
<dc:date>2023-11-02T13:00:00.000Z</dc:date>
<atom:updated>2023-11-02T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres Data Flow ]]></title>
<link>https://www.crunchydata.com/blog/postgres-data-flow</link>
<description><![CDATA[ What happens when you query Postgres? Data can actually come from many different places like the application cache, buffer cache, and even down into the physical disk cache. This post surveys the data storage and flow of Postgres data. ]]></description>
<content:encoded><![CDATA[ <p>At Crunchy we talk a lot about memory, shared buffers, and cache hit ratios. Even our new <a href=https://www.crunchydata.com/developers/playground/high-level-performance-analytics>playground tutorials</a> can help users learn about memory usage. The gist of many of those conversations is that you want to have most of your frequently accessed data in the memory pool closest to the database, the shared buffer cache.<p>There's a lot more to the data flow of an application using Postgres than that. There could be application-level poolers and Redis caches in front of the database. Even on the database server, data exists at multiple layers, including the kernel and various on-disk caches. So for those of you who like to know the whole story, this post pulls together the full data flow for Postgres reads and writes, stem-to-stern.<h2 id=application-server><a href=#application-server>Application Server</a></h2><p>The application server sends queries to the individual PostgreSQL backend and gets the result set back. However, there may in fact be multiple data layers at play here.<h3 id=application-caching><a href=#application-caching>Application caching</a></h3><p>Application caching can have many layers/places:<ul><li>Browser-level caching: a previously-requested resource can be re-used by the client without needing to request a new copy from the application server.<li>Reverse proxy caches, e.g.
Cloudflare, Nginx: a resource is requested by a user, but does not even need to hit an application server to return the result.<li>Individual per-worker process caches: Within specific application code, each backend could store some state to reduce querying against the database.<li>Framework-specific results or fragment caching: Whole parts of resources could be stored and returned piecemeal, or entire database result sets could be stored locally or in a shared resource outside of the database itself to obviate the need for accessing the database at all. This could be something like Redis or memcached, to name a couple of examples.</ul><h3 id=application-connection-pooler><a href=#application-connection-pooler>Application connection pooler</a></h3><p>When the application requests data that is not cached with one of the above methods, the application initiates an upstream connection to the database. Rather than always connecting directly, many application frameworks support application-level pooling. Application pooling allows multiple workers to share some smaller number of database connections among them. This reduces the resources, such as memory, that are needed. At the same time, reusing open connections decreases the average time spent creating new database connections.<p><img alt="Web and app Data flow diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/04f1dbf6-a290-4443-2dde-074e0b462600/public><h2 id=postgresql-server><a href=#postgresql-server>PostgreSQL server</a></h2><p>Once we reach the level of the database connection, we can see some of the ways that data flows there. Connections to the database may be direct or through a database pooler.<h3 id=connection-poolers><a href=#connection-poolers>Connection Poolers</a></h3><p>Similar to the application-level pooling, a database pooler can be placed between the incoming database connections and the PostgreSQL server backends.
<a href=https://www.crunchydata.com/blog/your-guide-to-connection-management-in-postgres>pgBouncer</a> is the de facto connection pooling tool. A connection pooler allows requests to share database resources among others with similar connection requirements. This also ensures that you are using fewer connections more efficiently, rather than having many idle connections.<p>The database pooler acts as a proxy of sorts, intermixing client requests with a smaller number of upstream PostgreSQL connections.<h3 id=client-backends><a href=#client-backends>Client backends</a></h3><p>When a connection is made to the PostgreSQL postmaster, a client backend is launched to communicate with it. This individual backend services all queries for a specific connection and returns the result sets. The client backend does this by coordinating access to table or index data through use of the shared_buffers memory segment. This is the point at which requests for data stop being "logical" and drill down to the filesystem.<h3 id=shared-buffers--buffer-cache><a href=#shared-buffers--buffer-cache>Shared buffers / buffer cache</a></h3><p>When a query requires data from a specific table, it will first check shared_buffers to see if the target block already exists there. If not, it will read the block into shared_buffers from the disk IO system. Buffers are a shared resource that all PostgreSQL backends use. When a disk block is loaded for one backend, later queries requesting it will find it’s already loaded in memory.<p><img alt="Postgres flow diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/87e34fb9-a360-42f6-bdae-4c26c4d48300/public><h3 id=shared-buffers-and-data-changes><a href=#shared-buffers-and-data-changes>Shared buffers and data changes</a></h3><p>If a query <em>changes</em> data in a table, it must first load the data page into shared_buffers (if it is not already loaded).
The change is then made on the shared memory page, the modified disk blocks are written to the Write-Ahead Log (assuming a LOGGED relation), and the page is marked dirty. Once the WAL page has been successfully written to disk at COMMIT time, the transaction is safe on disk.<p>The block changes of dirty pages are written out asynchronously, with the eventual writer then marking each page clean in shared_buffers. Possible writers include other (or the same) client(s), the database's Background Writer, and the system CHECKPOINT process. When multiple changes are made to the same disk pages in a short period and there is enough memory, this design enables an accelerated write path. Only a delta of additional WAL needs to be written each time the dirty page changes. Ideally the full content of the block is written to disk just once: during the next checkpoint.<p><img alt="Where your Postgres memory is likely to be"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3a8e3912-f23a-47e3-7981-1946bb589b00/public><h2 id=linux-subsystem><a href=#linux-subsystem>Linux Subsystem</a></h2><h3 id=page-removal-from-shared_buffers><a href=#page-removal-from-shared_buffers>Page removal from shared_buffers</a></h3><p>If Postgres needs to load additional pages to answer a query and shared_buffers is full, it will pick a page that is currently unused and evict it. Even though this page is no longer in shared_buffers, it may still be in the filesystem cache from the original disk read.<h3 id=file-system-cache--os-buffer-cache-kernel-cache><a href=#file-system-cache--os-buffer-cache-kernel-cache>File system cache / os buffer cache/ kernel cache</a></h3><p>In Linux, memory not in active use by programs caches recently used disk pages. This transparently accelerates workloads where data is re-read. Keeping the page in memory means we do not need to read it from the disk, a relatively slow process, if another client asks for it.
Indexes are the classic example of a frequently re-read database hot spot.<p>Cached memory is available when needed for other parts of the system, so it doesn’t prevent programs from requesting additional memory. If this happens, the kernel will just drop some number of buffers from the OS cache to fulfill the memory request.<p>For read buffers, there is no issue with dumping the contents of memory here; worst case, the system will just reload the original data from the disk. When the WAL or disk block changes are written, PostgreSQL waits for the write to complete via the appropriate system kernel call, e.g. <code>fsync()</code>. That ensures that the contents of the changed disk buffers have made it to the hardware I/O controller and potentially further.<h3 id=disk-cache><a href=#disk-cache>Disk Cache</a></h3><p>Once you’ve made it to the I/O layer you might assume you’d be done with caching, but caching is everywhere. Most disks have an internal I/O cache for reads/writes which will buffer and reorder I/O access so the device can manage optimal access/throughput.<p>If you read a disk block from the operating system, the internal disk cache layer will likely read ahead surrounding blocks to have them in the internal disk cache, available for subsequent reads. When you write to disk, even if you fsync, the drive itself may have a caching layer (be it battery-backed controller, SSD, or NVMe cache) that will buffer these writes, then flush out to physical storage at some point in the near-immediate future.<h3 id=physical-storage><a href=#physical-storage>Physical storage</a></h3><p>Congratulations, if you got this far then your disk writes have actually been saved on the underlying medium. These days that’s some form of SSD or NVMe storage. At this layer, the hardware disk cache durably writes data changes to disk and reads data from block addresses.
This is generally considered the lowest level of our data layers.<p>Internally SSD and NVMe hardware can have their own caches (yes, even more!) below where the database operates. Examples include a DRAM metadata cache for flash mapping tables and/or a write cache using faster SLC flash cells.<p><img alt="Database server diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f42df755-f274-413f-3073-d2a477329d00/public><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>.....and the diagram you've been scrolling for <img alt="Postgres Data flow diagram"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/69b15707-e5a0-4e56-c53f-bc2126428200/original><p>Feel like you just took a trip to the center of the earth? Data flow from Postgres involves all of these parts to get you the most used data the fastest:<ul><li>Application<li>Possible Application Pooler<li>Individual Client Backend (Postgres connection)<li>Shared Buffers<li>File System Cache<li>Disk Cache<li>Physical Disk Storage</ul><p><br><br><br><br> co-authored with <a href=https://www.crunchydata.com/blog/author/elizabeth-christensen>Elizabeth Christensen</a>, <a href=https://www.crunchydata.com/blog/author/stephen-frost>Stephen Frost</a>, and <a href=https://www.crunchydata.com/blog/author/greg-smith>Greg Smith</a> ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ David.Christensen@crunchydata.com (David Christensen) ]]></author>
<dc:creator><![CDATA[ David Christensen ]]></dc:creator>
<guid isPermalink="false">b356e04cf2f86678789f1272a5ea71c58f92e4aaecea59ed743466bf9034735d</guid>
<pubDate>Mon, 19 Sep 2022 11:00:00 EDT</pubDate>
<dc:date>2022-09-19T15:00:00.000Z</dc:date>
<atom:updated>2022-09-19T15:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres Locking: When is it Concerning? ]]></title>
<link>https://www.crunchydata.com/blog/postgres-locking-when-is-it-concerning</link>
<description><![CDATA[ Seeing locks in your Postgres monitoring but don't know what it means? David takes a look at locks and what to take into consideration. ]]></description>
<content:encoded><![CDATA[ <p>When using monitoring tools like <a href=https://access.crunchydata.com/documentation/pgmonitor/latest/>PgMonitor</a> or <a href=https://docs.crunchybridge.com/container-apps/pganalyze-quickstart/>pganalyze</a>, Crunchy clients will often ask me about high numbers of locks and when to worry. Like most engineering-related questions, the answer is: "it depends".<p>In this post, I will provide a little more information about locks, how they are used in PostgreSQL, and what things to look for to distinguish problems from high usage.<p>PostgreSQL uses locks in all parts of its operation to serialize or share access to key data. This can come in the form of two basic types of locks: shared or exclusive.<ul><li><p><em>Shared locks</em> - the particular resource can be accessed by more than one backend/session at the same time.<li><p><em>Exclusive locks</em> - the particular resource can only be accessed by a single backend/session at a time.</ul><p>The same resource can have different locks taken against it with the same or differing strengths.<h2 id=lock-duration><a href=#lock-duration>Lock duration</a></h2><p>Every statement run in a PostgreSQL session runs in a transaction: either one explicitly created via transaction control statements (<code>BEGIN</code>, <code>COMMIT</code>, etc.) or an implicit transaction created for a single statement.<p>When PostgreSQL takes a lock, it takes it for the duration of the transaction. A lock can never be explicitly released except by the final termination of the transaction. One of the reasons for this is for snapshot consistency and ensuring proper dependencies are in place for existing transactions.
Once the transaction is over, these concerns no longer apply, so the locks can be released.<p>It is worth noting that PostgreSQL (and any multi-process system) uses locks internally for other accesses besides SQL-level backends and transactions.<h2 id=monitoring-locks><a href=#monitoring-locks>Monitoring locks</a></h2><p>If you're reading this far and are curious about monitoring, you are likely already familiar with the <code>pg_locks</code> view. This is a system view that exposes the current state of the built-in lock arrays. The fields available here and their documentation can vary across versions; <a href=https://www.postgresql.org/docs/current/view-pg-locks.html>select your PostgreSQL version from this page for details</a>.<p>The documentation provides a lot of detail about this view. The important thing to know here is that this is the primary way of monitoring/reviewing this system. Some of the relevant fields are:<table><thead><tr><th>Column name<th>Data type<th>Description<tbody><tr><td>granted<td>boolean<td>Whether this backend successfully acquired the lock<tr><td>mode<td>text<td>The mode of the given lock request<tr><td>pid<td>int<td>The backend pid that holds or requested this lock</table><p>Of particular note is the “granted” field, a boolean which shows whether the given lock has been granted to the backend in question. If there is an ungranted lock (i.e., <code>granted = f</code>) then the backend is blocked waiting for a lock. Until the process that successfully has the lock completes in some way (i.e., either commits or rolls back), this process will be stuck in limbo and will not be able to proceed.<p>A related system view that can be used for more information about PostgreSQL backend processes is the venerable <code>pg_stat_activity</code> view, in particular the <code>wait_event</code> field.
The <code>wait_event</code> will show, for a given backend process, whether it is currently waiting for a lock, either a "heavyweight" lock or a "lightweight" lock (indicated by <code>wait_event_type = LWLock</code>).<h2 id=regular-lock-usage><a href=#regular-lock-usage>Regular lock usage</a></h2><p>When a query accesses a table for a <code>SELECT</code> statement, it takes an <code>AccessShare</code> lock against that table. If a query accesses multiple tables, it will take locks against each of these tables. Depending on your query patterns and transaction lengths you could end up with dozens or even hundreds of <code>AccessShare</code> locks per backend connection without this being indicative of an issue. This is also why simply counting the locks in <code>pg_locks</code> isn't necessarily a useful metric for issues in the database. If there are a high number of connections running queries or if the workload changes (say with an application deployment), this can cause high numbers of locks without it being an issue.<h2 id=what-is-an-issue-then><a href=#what-is-an-issue-then>What is an issue then?</a></h2><p>While high numbers of locks do not necessarily indicate a problem, some problems can result in high numbers of locks. For example, if a query is not running efficiently and thus takes a long time, there can be a large number of backed-up connections, resulting in additional lock buildup as the backends wait for the resource to be freed.<p>An ungranted lock for any significant length of time indicates an issue and is something that should be looked into.<pre><code class=language-pgsql>SELECT COUNT(*) FROM pg_locks WHERE NOT granted;
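
-- A related query (a sketch) showing how long each ungranted lock has
-- been waiting, by joining pg_stat_activity on pid:
SELECT l.pid, a.wait_event_type, now() - a.query_start AS waiting, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;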
</code></pre><p>Note that, depending on when this query is run, brief instances of ungranted locks can appear. If the same lock persists across a second invocation of this query, however, that likely indicates a larger issue.<h2 id=investigating-more><a href=#investigating-more>Investigating more</a></h2><p>If you do have an ungranted lock, you will want to look at the process that currently holds the lock; this is the process that would be misbehaving. To do this, you can run the following query to get the information about the specific backend and the query it is running:<pre><code class=language-pgsql>SELECT pid, pg_blocking_pids(pid), wait_event, wait_event_type, query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND wait_event_type ~ 'Lock';
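
-- Once a blocking pid is identified (12345 is a placeholder), possible
-- corrective actions include:
--   SELECT pg_cancel_backend(12345);    -- cancel the current query
--   SELECT pg_terminate_backend(12345); -- terminate the whole backend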
</code></pre><p>Here, the <code>pid</code> process will be the process that is blocked, while the <code>pg_blocking_pids()</code> function will return an array of pids that are currently blocking this process from running. (Effectively, this is a list of processes that have a lock that the <code>pid</code> backend is waiting on.) Depending on what this process is doing, you may want to take some sort of corrective action, such as canceling or terminating that backend. (The correct course of action here will depend on your specific application.)<h3 id=non-blocking-locks><a href=#non-blocking-locks>Non-blocking locks</a></h3><p>Since locks are just a normal way that PostgreSQL controls access to its resources, high lock counts can be expected with high usage. So whether high numbers of locks are an indication of problems can depend on what those locks are and whether any additional issues are seen in the system proper.<p>If IO usage is very high, you will often see the <code>LWLock DataRead</code>, which can affect multiple backends. If IO is overloaded, any processes which are trying to read files from the disk will be in this state. Issuing more IO operations will not accomplish any more reading; the IO bandwidth of a system is finite, and if you are already at its limits, adding more requests will only further fragment and split the resource among additional backends.<p>If the system is reading in high numbers of buffers or has a lot of contention for the same buffers (say, attempting to vacuum buffers in use by other processes) you could end up encountering a <code>BufferContent LWLock</code>. This lock is typically seen when trying to concurrently load large numbers of buffers. There are multiple shared locks used to ensure that there is not a single lock guarding the buffer page load, but this is still a finite resource, so in times of high load you can see this show up as a blocking process.
Any one lock is likely very brief, but in periods of high load, you will see these show up in <code>pg_stat_activity</code> quite frequently.<p>Depending on your system's transaction volume and types of transactions, you could see lots of queries with one of several SLRU locks, either on a primary or a replica. There are several types here, including <code>SubtransSLRU</code> and <code>MultiXactSLRU</code>.<h2 id=advisory-locks><a href=#advisory-locks>Advisory locks</a></h2><p>Clients have also run into some questions about advisory locks, particularly when using a transaction-level connection pooler such as PgBouncer. The explicit advisory lock functions in PostgreSQL allow the user to access lock primitives in their application code and serialize access to resources at the application level, which can be particularly useful when trying to coordinate access with external systems. That said, issues can arise if these primitives are not used properly.<p>Of particular note, if the user calls <code>pg_advisory_lock()</code> from application code while using a database pooler, they can end up with either deadlocks or confusing behavior due to the potential for different database sessions being used. Since the <code>pg_advisory_lock()</code> function grabs a lock for its current database session (not the current transaction), multiple <code>pg_advisory_lock()</code> calls could end up getting run against different backends (since PgBouncer would use a fairly arbitrary backend for separate transactions).<p>Since the locks are being taken and potentially released in separate sessions (even from the same application database connection), there is no guarantee that access to the resource they are intending to serialize is actually being serialized in a consistent manner.
PgBouncer specifically recommends against the usage of these session-based locking functions for just this reason.<p>Applications using a database pool should look at using transaction-based locks in order to serialize these accesses; i.e., <code>pg_advisory_xact_lock()</code> and friends. If this is not possible, then a separate database pool in session mode which allows the session handling to work as expected should be utilized.<p>Note that <code>pg_advisory_lock()</code> comes with its own set of issues outside of a database pooler. It doesn’t release the lock even if the transaction that created it rolls back. It can take careful coordination and exception handling on the part of the application code to use it effectively.<h2 id=final-thoughts><a href=#final-thoughts>Final Thoughts</a></h2><p>I hope this article has given you a little insight into what sorts of locking might be of concern at the application level. These situations are ones which may warrant investigation and/or application changes:<ul><li>Ungranted locks<li>High numbers of LWLocks showing up consistently in <code>pg_stat_activity</code><li>Session-level advisory locks</ul> ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ David.Christensen@crunchydata.com (David Christensen) ]]></author>
<dc:creator><![CDATA[ David Christensen ]]></dc:creator>
<guid isPermalink="false">9835aade1a1e2c693cc3a3cb5001c2a67a471188b23cef836a5812ec40b18ccd</guid>
<pubDate>Fri, 01 Jul 2022 11:00:00 EDT</pubDate>
<dc:date>2022-07-01T15:00:00.000Z</dc:date>
<atom:updated>2022-07-01T15:00:00.000Z</atom:updated></item></channel></rss>