<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/crunchy-data-warehouse/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/crunchy-data-warehouse</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/crunchy-data-warehouse</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Tue, 01 Apr 2025 08:00:00 EDT</pubDate>
<dc:date>2025-04-01T12:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Crunchy Data Warehouse: Postgres with Iceberg Available for Kubernetes and On-premises ]]></title>
<link>https://www.crunchydata.com/blog/crunchy-data-warehouse-postgres-with-iceberg-available-for-kubernetes-and-on-premises</link>
<description><![CDATA[ Crunchy Data brings Postgres-native Apache Iceberg to Kubernetes and on-prem workloads. ]]></description>
<content:encoded><![CDATA[ <p>Today I'm excited to announce the release of <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a> on premises, which provides one of the easiest and yet richest ways to work with your data lake in the environment of your choosing. Built on top of Crunchy Postgres for Kubernetes, Crunchy Data Warehouse extends Postgres with a modern data warehouse solution, giving you:<ul><li><p><strong>The ability to easily query data where it resides in S3 or S3-compatible storage (like MinIO).</strong> With a variety of data formats supported, including CSV, JSON, Parquet, GeoParquet, and Iceberg, you can leave behind complicated ETL processes and work with your data directly. With standard SQL and <code>COPY</code> support in Postgres, you can choose to query data directly, or move it to/from S3 alongside the rest of your data.<li><p><strong>The simplest way of creating and managing data within the Iceberg format.</strong> If you're unfamiliar with Iceberg: it takes Parquet files (an open columnar file format) and transforms them from single immutable files into a full database, with a collection of data files and metadata that represent your database.<li><p><strong>Fast analytical queries.</strong> Iceberg gives you columnar compression of your data. We also include an adaptive query engine that can seamlessly leverage vectorized query execution to provide analytics on your data at speeds of up to 100x over standard Postgres.<li><p><strong>Automatic management and maintenance of your Iceberg data.</strong> A common aspect of working with Iceberg is having to run processes to recompact your data to ensure it is distributed efficiently. 
Crunchy Data Warehouse automatically manages this for you behind the scenes so you have one less thing to think about.</ul><p>All of the above comes in a production-ready package built on the experience of <a href=https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes>Crunchy Postgres for Kubernetes</a>.<h2 id=lets-dig-into-setting-up-a-warehouse-cluster><a href=#lets-dig-into-setting-up-a-warehouse-cluster>Let’s dig into setting up a warehouse cluster</a></h2><p>Assuming you have Crunchy Postgres for Kubernetes installed, as well as access to S3-compatible storage, you can create your first Crunchy Data Warehouse cluster with just a few lines of YAML:<pre><code class=language-yaml>apiVersion: v1
kind: Secret
metadata:
  name: cdw-secret
type: Opaque
stringData:
  s3-key: &lt;s3-key&gt;
  s3-secret: &lt;s3-secret&gt;
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: CrunchyDataWarehouse
metadata:
  name: cdw-sample
spec:
  externalStorage:
    - name: s3bucket
      scope: s3://&lt;s3-bucket&gt;
      region: &lt;s3-region&gt;
      endpoint: &lt;s3-endpoint&gt;
      accessKeyIDRef:
        key: s3-key
        name: cdw-secret
      secretAccessKeyRef:
        key: s3-secret
        name: cdw-secret
  image: registry.crunchydata.com/crunchydata/crunchy-data-warehouse:ubi9-17.4-2.1.2-2513
  postgresVersion: 17
  instances:
    - replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 4Gi
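
# Save the manifests above to a file and apply them with kubectl
# (the file name here is illustrative):
#   kubectl apply -f cdw-sample.yaml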
</code></pre><p>This creates a Postgres instance on version 17 with the requested storage. After running <code>kubectl apply</code>, you'll have a running Crunchy Data Warehouse cluster. This includes a pod for the Postgres database ready to work with the S3 storage, plus all the other services that run with Crunchy Postgres for Kubernetes, like backups, disaster recovery, high availability, and connection pooling. Once you initialize a connection to the Postgres database, you’re ready to start working with data lake files in Postgres.<h3 id=getting-started-with-data-in-csv-format><a href=#getting-started-with-data-in-csv-format>Getting started with data (in CSV format)</a></h3><p>Now that we’ve got a provisioned Crunchy Data Warehouse instance, let’s start working with some data. We’re going to start with a CSV file to see how simple it is to work with existing data in S3, and eventually load it into Iceberg, fully managed by our warehouse. We’ll then be able to see the performance speed-up that columnar Parquet files managed in Iceberg give us over standard Postgres heap tables.<p>To begin, we’re going to load a set of historical daily stock prices, just over 32 million records in total, along with the stock listing data. These files already exist in S3 and we can set them up as lakehouse tables:<pre><code class=language-sql>CREATE FOREIGN TABLE stock_csv(trade_dt timestamptz,
    ticker varchar(20),
    open_price numeric(38, 9),
    high_price numeric(38, 9),
    low_price numeric(38, 9),
    close_price numeric(38, 9),
    trade_volume numeric(38, 9),
    dividends numeric(38, 9),
    stock_splits numeric(38, 9))
SERVER crunchy_lake_analytics
OPTIONS (
    header 'true',
    path 's3://crunchydatawarehouse/stock/stock_history.csv',
    format 'csv');

CREATE FOREIGN TABLE stock_list_csv()
SERVER crunchy_lake_analytics
OPTIONS (
    header 'true',
    path 's3://crunchydatawarehouse/stock/stock_list.csv',
    format 'csv');
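
-- With the foreign tables in place, the CSV files can be queried where they
-- sit with ordinary SQL; an illustrative example using the columns defined above:
SELECT ticker, min(trade_dt), max(trade_dt)
FROM stock_csv
GROUP BY ticker
LIMIT 10;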
</code></pre><h3 id=working-with-parquet-data><a href=#working-with-parquet-data>Working with Parquet data</a></h3><p>Similar to how we referenced our CSV file above, we could do the exact same with Parquet. We also have the ability to easily move data in and out of the Parquet format. Parquet is an open-standard file format that is self-describing of its data types and brings columnar compression. For time series data, compression offers great benefits in storage as well as in the performance of scanning and reading larger amounts of data.<p>In the above, we can work with the data directly as it sits in CSV format within S3, but we can also import it directly into a Postgres table. Let’s go ahead and set up a Postgres table and load the data:<pre><code class=language-sql>CREATE TABLE stock () WITH (load_from = 's3://crunchydatawarehouse/stock/stock_history.csv');
</code></pre><p>From here, if we want to import or export data, whether in CSV or (in this case) Parquet, we can use the Postgres <code>COPY</code> command. As an example, we can export the historical stock data that we loaded into Postgres out to a Parquet file:<pre><code class=language-sql>COPY stock TO 's3://crunchydatawarehouse/stock/stock_history.parquet';
</code></pre><h3 id=the-easiest-way-to-work-with-iceberg><a href=#the-easiest-way-to-work-with-iceberg>The easiest way to work with Iceberg</a></h3><p>Now we’re going to create a table with the Iceberg format. Iceberg is another open standard, one that extends Parquet (an immutable file from a point in time) and maintains metadata about new files, changesets, and more to give you essentially a full database on top of Parquet files. We’re going to create our Iceberg tables and point them at the Parquet files we generated above:<pre><code class=language-sql>CREATE TABLE stock_list_iceberg()
USING iceberg
WITH (load_from = 's3://crunchydatawarehouse/stock/stock_list.parquet');

CREATE TABLE stock_history_iceberg()
USING iceberg
WITH (load_from = 's3://crunchydatawarehouse/stock/stock_history.parquet');
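
-- Unlike raw Parquet files, Iceberg tables accept normal DML; for example
-- (the ticker value below is illustrative):
INSERT INTO stock_history_iceberg (trade_dt, ticker, close_price)
VALUES (now(), 'CRDB', 42.0);

DELETE FROM stock_history_iceberg WHERE ticker = 'CRDB';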
</code></pre><p>By using Iceberg here as our table type, we’re not only getting the columnar compression of Parquet, we also have the ability to add new records, update existing records, or delete data. We can now work with this as if it were a standard Postgres table, with great data compression and amazingly fast analytical querying capabilities. To see just how fast, let’s compare a query that gives us the average closing price of stocks over time. First against standard Postgres, then against our Iceberg data:<pre><code class=language-sql>SELECT extract('year' from trade_dt) trade_year,
       count(distinct ticker) cnt,
       trunc(avg(close_price)) close_price
FROM stock
WHERE trade_dt >= '01/01/2020'
GROUP BY trade_year
ORDER BY 1;

SELECT extract('year' from trade_dt) trade_year,
       count(distinct ticker) cnt,
       trunc(avg(close_price)) close_price
FROM stock_history_iceberg
WHERE trade_dt >= '01/01/2020'
GROUP BY trade_year
ORDER BY 1;
</code></pre><p><img alt="iceberg versus heap tables" loading=lazy src=/blog-assets/warehouse/iceberg-versus-heap-stocks.gif><p>Above we see that our columnar Iceberg table is over 20x faster than the standard Postgres row-based table. With Crunchy Data Warehouse, we see query speed-ups of anywhere from 10 to 100x over standard Postgres for real-time analytics workloads.<p>In a matter of minutes we went from raw data sitting in a data lake within S3 to loading it into our Postgres database, exporting data to Parquet, and creating our table as Iceberg. We had a fully managed Iceberg experience with a single command.<p>In short you get:<ul><li>Native Postgres experience for data warehousing<li>Simple Iceberg data management<li>Fast analytical performance</ul><h2 id=next-steps><a href=#next-steps>Next Steps</a></h2><p>We’re excited to show you what we’ve built!<p>There are a few ways to get started:<ul><li><a href=https://www.crunchydata.com/contact>Contact</a> our team today to talk about building your data warehouse on Kubernetes.<li>If you don't need to self-manage, give our managed warehouse experience a try on <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a>.<li>Join me and Andrew L'Ecuyer for a <a href=https://crunchydata.zoom.us/webinar/register/WN_IuLmPO5WRGGTLY7rRpzHrA>live demo and webinar</a> on April 10.</ul> ]]></content:encoded>
<category><![CDATA[ Kubernetes ]]></category>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">217ea4d6e2c4a0bedfadc7f5bef71c9c3b866f1248c121bf0e7862e9a2232797</guid>
<pubDate>Tue, 01 Apr 2025 08:00:00 EDT</pubDate>
<dc:date>2025-04-01T12:00:00.000Z</dc:date>
<atom:updated>2025-04-01T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Reducing Cloud Spend: Migrating Logs from CloudWatch to Iceberg with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/reducing-cloud-spend-migrating-logs-from-cloudwatch-to-iceberg-with-postgres</link>
<description><![CDATA[ How we migrated our internal logging for our database as a service, Crunchy Bridge, from CloudWatch to S3 with Iceberg and Postgres. The result was simplified logging management, better access with SQL, and significant cost savings. ]]></description>
<content:encoded><![CDATA[ <p>As a database service provider, we store a number of logs internally to audit and oversee what is happening within our systems. When we started out, the volume of these logs was predictably low, but with scale they grew rapidly. Given the number of databases we run for users on Crunchy Bridge, the volume of these logs has grown to a sizable amount. Until last week, we retained those logs in AWS CloudWatch. Spoiler alert: this is expensive.<p>We have a number of strategies to drive efficiency around the logs we retain, and we regularly remove unnecessary noise and prune old logs. Even so, that growth has driven AWS CloudWatch to represent a sizable portion of our infrastructure spend.<p>Going forward, we now have a new workflow that makes use of low-cost S3 storage with Iceberg tables and the power and simplicity of <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>, which has <strong>reduced our spend on logging by over $30,000 a month</strong>.<p>Using this new workflow, we can simply:<ul><li>archive logs directly into S3<li>incrementally load those logs into Iceberg via Crunchy Data Warehouse<li>use SQL to query the logs as needed using Crunchy Data Warehouse</ul><p>The crux of any log ingestion service is more or less: ingest log traffic, index the data, offload the logs to more cost-efficient storage, and access them later when necessary.<p>Historically, we used AWS CloudWatch, but there are many logging services available. These services offer a range of capabilities, but come with a price tag representing a premium over the cost of storing logs directly in S3. While simply exporting logs to S3 always represented a potential cost savings, without a query engine to efficiently investigate these logs when required, exporting logs to S3 was not previously a viable solution. 
Crunchy Data Warehouse's ability to easily query S3 was the breakthrough we needed.<h2 id=setting-up-logs-with-s3-and-iceberg><a href=#setting-up-logs-with-s3-and-iceberg>Setting up logs with S3 and Iceberg</a></h2><p>The first step? Get all of our logs flowing into S3.<p>Every server in our fleet, whether that be a server running our customer’s Postgres workloads or the servers that make up the Crunchy Bridge service itself, is running a logging process that continuously collects a variety of logs. The logs are generated from various sources; a few examples are SSH access, the Linux kernel, and Postgres. These logs all have different schemas and encodings that the logging agent transforms into a consistent CSV structure before batching and flushing them to durable, long-term storage. Once these logs make it off-host, they are indexed and stored where they can be queried as needed.<p>Now that we have our logs flowing into S3, we provision a Crunchy Data Warehouse so we can:<ol><li><p>Move the data from CSV to Iceberg for better compression<li><p>Query our logs using standard SQL with Postgres.</ol><p>Once the warehouse is provisioned, we create a foreign table from within Crunchy Data Warehouse called <code>logs</code> that points at the S3 bucket's CSV files:<pre><code class=language-sql>create foreign table logs (
   /* column names and types */
)
server crunchy_lake_analytics
options (path 's3://crunchy-bridge/tmp/*.tsv.gz', format 'csv', compression 'gzip', delimiter E'\t', filename 'true');
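
-- Because the filename 'true' option is set, each row exposes a _filename
-- column; an illustrative spot check of what has landed in the bucket:
select _filename, count(*) from logs group by 1 order by 2 desc limit 10;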
</code></pre><p>Now we create a fully managed Iceberg table that is an exact copy of the foreign table referencing the CSVs. Here Iceberg is beneficial because it will automatically compact the data into Parquet files of up to 512 MB per file, knows how to add data easily across files, and can push down queries that target only a narrow window. Essentially, we've gone from CSV to a columnar file format, and from flat files to a full database:<pre><code class=language-sql>-- Create an Iceberg table with the same schema
create table logs_iceberg (like logs)
using iceberg;
</code></pre><p>Finally, we're going to layer in the open source extension <code>pg_incremental</code>. <a href=https://github.com/CrunchyData/pg_incremental>Pg_incremental</a> is a Postgres extension that makes it easy to do fast, reliable incremental batch processing within Postgres. <code>pg_incremental</code> is most commonly used for incremental rollups of data. In this case it is equally useful for processing new CSV data as it arrives and moving it into our Iceberg table in S3, all managed from within Postgres.<pre><code class=language-sql>-- Set up a pg_incremental job to process existing files and automatically process new files every hour
select incremental.create_file_list_pipeline('process-logs',
   file_pattern := 's3://crunchy-bridge/tmp/*.tsv.gz',
   batched := true,
   max_batch_size := 20000,
   schedule := '@hourly',
   command := $$
       insert into logs_iceberg select * from logs where _filename = any($1)
   $$);
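
-- Once the pipeline has run, the Iceberg copy can be queried like any other
-- Postgres table (illustrative):
select count(*) from logs_iceberg;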
</code></pre><h2 id=final-thoughts><a href=#final-thoughts>Final thoughts</a></h2><p>And there you have it! Cheaper, cleaner log management. As one of my colleagues described it: “personally, I always hated the imitation SQL query languages of logging providers–just get me real SQL”. From using SQL to query logs, to simplifying our stack, to the cost savings, this project showcases some of our favorite things about Crunchy Data Warehouse.<p>We often get questions on the architecture of Crunchy Bridge. We have talked about it <a href="https://www.youtube.com/watch?v=eZypM_4xlf8">a bit</a>. The short version is that Crunchy Bridge is built from the ground up using public cloud primitives to create a highly scalable and efficiently managed Postgres service. At the time, AWS CloudWatch was chosen due to the lack of better options. We don't want to be a logging provider; it's a fundamentally different business. But seeing how well this works, who knows 😉 ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">b188e0c71dd1df0e2080dc50599a01b101fefc3ebbb681b861fa9bf7490f0bda</guid>
<pubDate>Wed, 26 Mar 2025 12:00:00 EDT</pubDate>
<dc:date>2025-03-26T16:00:00.000Z</dc:date>
<atom:updated>2025-03-26T16:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Automatic Iceberg Maintenance Within Postgres ]]></title>
<link>https://www.crunchydata.com/blog/automatic-iceberg-maintenance-within-postgres</link>
<description><![CDATA[ Iceberg can create orphan files during snapshot changes or transaction rollbacks. Crunchy Data Warehouse automatically cleans up the orphan files using a new autovacuum feature. ]]></description>
<content:encoded><![CDATA[ <p>Today we're excited to announce built-in maintenance for Iceberg in <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>. This enhancement to Crunchy Data Warehouse brings PostgreSQL-style maintenance directly to Iceberg. The warehouse autovacuum workers continuously optimize Iceberg tables by compacting data and cleaning up expired files. In this post, we'll explore how we handle cleanup, and in the follow-up posts, we'll take a deeper dive into compaction.<p>If you use Postgres, you are probably familiar with tables and rows in a relational database. Instead of storing data in Postgres’ pages, Iceberg organizes the data into Parquet files and typically stores them in object storage like S3 with an organizational layer on top. Parquet is a compressed columnar file format that stores data efficiently. And Iceberg is designed to handle analytical queries across large datasets.<p>On Crunchy Data Warehouse, Postgres tables backed by Iceberg behave almost exactly like regular Postgres tables. You can run full SQL queries, perform ACID transactions, and use standard DDL commands like CREATE TABLE or ALTER TABLE. We’re excited to add vacuum processes to Iceberg to create an even better and hassle free user experience.<h2 id=orphan-files-in-iceberg><a href=#orphan-files-in-iceberg>Orphan Files in Iceberg</a></h2><p>In Postgres, when you update or delete rows, the changes happen inside the same table storage. The database keeps track of visibility using MVCC, and old versions of rows are eventually freed up by vacuum.<p>Iceberg works differently because its data files are immutable. When you update or delete data, Iceberg doesn’t modify existing files—it creates new ones with the updated data. 
The table’s metadata is then updated to point to the new files, while the old ones become unreferenced.<p>Over time, as more updates and deletes happen, these orphaned files—ones that are no longer referenced by any active table snapshot—start to accumulate.<p><img alt="iceberg snapshot file changes" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/32a624bf-58ef-4c8e-182f-c224e6543000/public><h2 id=cleaning-up-orphan-files><a href=#cleaning-up-orphan-files>Cleaning up orphan files</a></h2><p>Just like autovacuum keeps your PostgreSQL tables lean, Crunchy Data Warehouse has background workers that automatically clean up orphan files in Iceberg. This helps ensure efficient storage without manual intervention. Crunchy Data Warehouse also does compaction on Iceberg, combining smaller files into larger files for efficiency and performance.<p>With Crunchy Data Warehouse, the autovacuum takes care of this cleanup automatically, just like in Postgres. It scans for unreferenced files and removes them, ensuring that storage does not grow unnecessarily over time. If a data file is no longer referenced by any snapshot and the retention period has passed (by default, 10 days), it is time to delete it. The design of autovacuum for Iceberg tables ensures that only files generated by PostgreSQL transactions are deleted, eliminating any risk of removing unintended files.<p>We support autovacuum for both expired snapshots and transaction rollbacks. There are two primary ways orphan files are created.<h3 id=iceberg-snapshot-expiration><a href=#iceberg-snapshot-expiration>Iceberg Snapshot Expiration</a></h3><p>When data is deleted or updated, the corresponding files become unreachable by queries. Snapshot expiration ensures that these files are safely removed during maintenance. In Crunchy Data Warehouse, the familiar <code>VACUUM</code> command handles snapshot expiration. 
Remember that Crunchy Data Warehouse supports autovacuum on Iceberg tables, so you don’t have to manually keep track of this. By default, we retain files for up to 10 days to provide backups.<p>Here’s a step-by-step example demonstrating snapshot expiration in action:<pre><code class=language-sql>-- Increase log verbosity to see detailed file operations
SET client_min_messages TO DEBUG4;

-- Create a table with 100 rows, generating the "data_0.parquet" file on S3
CREATE TABLE expire_data USING iceberg AS SELECT id FROM generate_series(0,100) id;
....
DEBUG:  adding s3://testbucketcdw/../data_0.parquet with 101 rows
....

-- TRUNCATE the table to remove all rows. This marks underlying files as orphaned.
TRUNCATE expire_data;

-- For demo purposes, set retention periods to 0 to expire files immediately
-- In production, files are retained for 10 days as a backup.
SET crunchy_query_engine.orphaned_file_retention_period TO '0s';
SET crunchy_iceberg.max_snapshot_age TO '0';

-- Trigger snapshot expiration to clean up orphaned files and verify their removal
VACUUM (verbose) expire_data;
...
INFO:  deleting expired file s3://testbucketcdw/../data_0.parquet
...
</code></pre><h3 id=rollback-transactions><a href=#rollback-transactions>Rollback Transactions</a></h3><p>Postgres users are already familiar with transactional rollbacks. Iceberg tables in the warehouse bring this same feature to data files, ensuring that storage is automatically cleaned up for rolled-back operations.<p>Here’s an example demonstrating how unused files from a rolled-back transaction are handled:<pre><code class=language-sql>-- Create a new Iceberg table
CREATE TABLE rollback_data(id INT) USING iceberg;

BEGIN;
    -- Increase log verbosity to see detailed file operations
    SET LOCAL client_min_messages TO DEBUG4;

    -- Insert data, generating the "data_0.parquet" file in S3
    INSERT INTO rollback_data SELECT id FROM generate_series(0,100) id;
    ....
    DEBUG:  adding s3://testbucketcdw/../data_0.parquet with 101 rows
    ....

-- Roll back the transaction to discard the changes
ROLLBACK;

-- Trigger the cleanup process to remove the unused file
VACUUM (verbose) rollback_data;
...
INFO:  deleting unused file s3://testbucketcdw/../data_0.parquet
...
</code></pre><p><img alt="iceberg file changes orphan files snapshot" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a2a4b960-799b-4bf5-1f77-92f7c6a67200/public><p>This example shows how Iceberg tables handle transaction rollbacks cleanly and efficiently. Crunchy Data Warehouse removes any files created during rolled-back transactions during autovacuum, ensuring storage is not wasted on unused files.<h2 id=closing-thoughts><a href=#closing-thoughts>Closing Thoughts</a></h2><p>Orphan files are a natural consequence of how Iceberg manages immutable data files. Without proper cleanup, they can lead to wasted storage and unnecessary costs. In Postgres, autovacuum handles similar maintenance tasks seamlessly, and in Crunchy Data Warehouse, we bring that same convenience to Iceberg tables.<p>By automatically identifying and removing unreferenced files, <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a> ensures your Iceberg tables stay efficient, just like a well-maintained Postgres database. No manual cleanup, no wasted storage—just a streamlined experience for large-scale analytics.<p>Whether you're coming from a Postgres background or an Iceberg-first world, Crunchy Data Warehouse combines the best of both worlds: the power of Iceberg’s scalable architecture with the ease of Postgres-style maintenance. ]]></content:encoded>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Önder.Kalacı@crunchydata.com (Önder Kalacı) ]]></author>
<dc:creator><![CDATA[ Önder Kalacı ]]></dc:creator>
<guid isPermalink="false">897ad17c11d3bbc9a9254490a7c97aa27a4f968e34da1f0ec2d89562ce3c93f3</guid>
<pubDate>Thu, 20 Mar 2025 10:00:00 EDT</pubDate>
<dc:date>2025-03-20T14:00:00.000Z</dc:date>
<atom:updated>2025-03-20T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Citus: The Misunderstood Postgres Extension ]]></title>
<link>https://www.crunchydata.com/blog/citus-the-misunderstood-postgres-extension</link>
<description><![CDATA[ What applications and use cases make the most sense for Citus. ]]></description>
<content:encoded><![CDATA[ <p>Citus is in a small class of the most advanced Postgres extensions that exist. While there are many Postgres extensions out there, few have as many hooks into Postgres or change the storage and query behavior in such a dramatic way. Many who come to Citus arrive with the wrong assumptions. Citus turns Postgres into a sharded, distributed, horizontally scalable database (that's a mouthful), but it does so for very specific purposes.<p>Citus, in general, is a fit for these types of applications, and only these:<ul><li><strong>Sharding a multitenant application</strong>: a SaaS/B2B style app, where data is never joined between customers<li><strong>Low user facing, high data volume analytics</strong>: specifically where the dashboards are hand-curated with minimal levers-and-knobs for the user to change (i.e. the customer cannot generate unknown queries)</ul><p>Use cases that are not a great fit for Citus:<ul><li>Applications that lack rigid control over the queries sent to the database<li>Geographic residency goals or requirements; Citus is distributed for scale, not distributed for edge.</ul><p>Let's look closer at each of the two use cases that Citus is a good fit for.<h2 id=multitenantsaas-applications><a href=#multitenantsaas-applications>Multitenant/SaaS applications</a></h2><p>Multitenant or SaaS applications typically follow a pattern: 1) tenant data is siloed and does not intermingle with any other tenant's data, and 2) a "tenant" is a larger entity like a "team" or "organization".<p>An example of this could be Salesforce. Within Salesforce you have the notion of an organization, and the organization has accounts, customers, and opportunities within them. When you create a Salesforce account, all of your customers and opportunities are solely yours — data is not shared with other Salesforce organizations.<p>For these types of applications, Citus distributes the data for each tenant into a shard. 
Citus handles the splitting of data by creating placement groups that know they are grouped together, and placing the data within shards on specific nodes. A physical node may contain multiple shards. To restate, at a high level Citus has:<ul><li><strong>physical node</strong>: the physical container that holds shards<li><strong>shard</strong>: a logical container for data; resides on a physical node, and can be moved between physical nodes<li><strong>placement group</strong>: uses a hash-based algorithm to assign a tenant id to a shard</ul><p>Regarding shards: while it is possible to split a large shard, it is easier to start with the proper configuration. Getting scaling right in the beginning pays off later, because moving whole shards between nodes is easier than splitting shards that already exist, though that is possible.<p>In a very basic Citus cluster, you might have something that looks like:<p><img alt="simple citus" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/749ce551-e328-4a8b-0803-d87c44be3300/public><p>Within Citus, multitenant/SaaS applications can work well because sharding is at the core of what Citus does. In the case of a tenant application, the tenant id becomes the shard key. When you shard all the tables on the same key, Citus places each table's corresponding shards on the same physical node. Then queries with joins execute locally on that node, and are faster.<p>Alternatively, poor shard key planning would require joining data across the network. This shuffling of data is detrimental to performance within databases – especially distributed ones. For multitenant/SaaS, leveraging Citus requires the tenant id as a column on every table.<p>In a simpler design, the accounts, customers, and opportunities tables might have only a primary key and a foreign key reference to their parent relationship. In Citus, we need to turn those into composite primary keys that leverage both the tenant id and the foreign key. 
Extending the above diagram, if we were to now create accounts, customers, and opportunities tables as sharded tables with Citus, we'd have something roughly like the following:<p><img alt="multitenant citus" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/2d0df3d6-dda3-4602-5827-26958e2dac00/public><p>To speed query performance, include a where condition on the tenant id (below, <code>org_id</code>) in all queries as well — this ensures that Citus knows how to push down the join to a single node. A query for open opportunities for a specific tenant might look something like:<pre><code class=language-sql>SELECT customer.email, customer.first_name, customer.last_name, opportunity.amount, opportunity.notes
FROM opportunity,
     customer,
     account
WHERE customer.org_id = account.org_id
  AND opportunity.org_id = account.org_id
  AND opportunity.account_id = customer.account_id
  AND account.org_id = 4;
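
-- For reference, these tables would have been distributed on the tenant id with
-- Citus's create_distributed_table, co-located so the join above stays on one
-- node (table and column names follow the example):
-- SELECT create_distributed_table('account', 'org_id');
-- SELECT create_distributed_table('customer', 'org_id', colocate_with => 'account');
-- SELECT create_distributed_table('opportunity', 'org_id', colocate_with => 'account');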
</code></pre><p>Citus would then quietly re-write this query to target the appropriate sharded tables, and effectively execute the query against only the relevant tables:<p><img alt="multitenant citus 2" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f9f20655-c035-4645-a528-2d1c0a4ab000/public><p>Now, there is a bit more to <a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>designing multitenant apps</a> to work with Citus. For example, universal data can be placed in reference tables that are replicated across all nodes, or in local tables that live solely on the coordinator. For the bulk of a Citus multitenant workload, tables will:<ol><li>Contain your shard key<li>Be indexed using a composite key on shard key + foreign key<li>Be distributed based on the shard key / tenant id<li>Be queried using the shard key / tenant id</ol><p>Let's shift to the other common use case for Citus: what Citus defines as real-time dashboards or analytics.<h2 id=real-time-analytics-with-citus><a href=#real-time-analytics-with-citus>Real-time analytics with Citus</a></h2><p>Where multitenant leverages the shard separation of Citus, here you're looking to leverage the parallelism of Citus.<p>Real-time analytics is admittedly a bit vague. It is often some kind of event data that is high volume and is presented as a dashboard, report, monitoring, or alerting. Query patterns are often aggregating in some form; while there may be joins, they happen at a lower level then bubble up to a higher level for aggregation.<p>When operating on a small volume of data, you don't necessarily need Citus — plain old Postgres can work just fine. 
With high data volume, Postgres is not as well suited for analytics (unless you're talking Crunchy Data Warehouse, which is optimized for OLAP workloads – <a href=https://www.crunchydata.com/products/warehouse>see more here</a>).<p>With the multitenant/SaaS example, we wanted the query to be pushed down to, and executed entirely within, a single physical node. With real-time analytics, we want the opposite: queries execute across all the nodes using as many cores as are available within the cluster.<p>Let's make this a little more concrete. Start with the idea of a Google Analytics type of event analytics — similar to what is talked about in the Citus docs. Here we may have something like:<pre><code class=language-sql>CREATE TABLE http_request (
  site_id INT,
  ingest_time TIMESTAMPTZ DEFAULT now(),

  url TEXT,
  request_country TEXT,
  ip_address TEXT,

  status_code INT,
  response_time_msec INT
);
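
-- To shard this with Citus you would distribute on site_id (a sketch of the
-- step we're jumping past; create_distributed_table is the standard Citus API):
SELECT create_distributed_table('http_request', 'site_id');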
</code></pre><p>Let's jump ahead and look past how we shard the data and to the query itself. The query shows a better idea of how Citus works in these situations. Let's build a query to return how many 404s and 200s from the country "Australia" along with the average response time for each:<pre><code class=language-sql>SELECT
  status_code,
  COUNT(*) AS request_count,
  AVG(response_time_msec) AS average_response_time_msec
FROM http_request
WHERE request_country = 'Australia'
  AND status_code IN (200, 404)
GROUP BY status_code;
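
-- Roughly how this executes (a sketch, not Citus' exact rewrite): each shard
-- runs the aggregate locally, e.g.:
--   SELECT status_code, count(*), sum(response_time_msec)
--   FROM http_request_102008  -- hypothetical shard placement name
--   WHERE request_country = 'Australia' AND status_code IN (200, 404)
--   GROUP BY status_code;
-- The coordinator then sums the counts per status code and computes the
-- average as sum(sum) / sum(count).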
</code></pre><p>This query will run on every single shard. To process the query as fast as possible, the number of shards should match the number of cores available. If you end up with something like 16 shards in a single node, you'd ideally want 16 cores or more (to handle additional concurrency). The query will be executed as smaller composable building blocks.<p>Citus processing the count of 404s and 200s is easy. It runs the query as a count on the nodes, then the coordinator calculates the sum of counts: we simply get the sum of each shard's count where country = "Australia" for each status code.<p>But! To calculate the average response time we need to get the count from each shard as well as the sum of the <code>response_time_msec</code> values. From there, Citus recombines all of those back on the coordinator: each shard sends four values back (versus all the raw data), and the coordinator does the final math.<p><img alt="analytics citus" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/04c7c141-c53b-4ca0-a7e7-544e2b97eb00/public><p>This results in fast aggregations across large datasets. <strong>But</strong> if you haven't thought ahead, this only works for very specific queries. Counts and averages are great. If you're looking to do something like a median, that gets a little harder: you need the full data set to compute an exact median. (For now we're setting aside that there are probabilistic approaches to getting approximate results that work quite well. Algorithms like t-digest or KLL can work if you're okay with approximate answers.)<p>The other big piece of this is that your queries need to be constructed so that Citus can push any joins down as locally as possible. While our example in this case is a very basic one, most applications still have data they need to join. 
This can work on Citus, but you still need to put some thought into making joins as low level as possible — similar to the multitenant app.<p>Within the "real-time analytics" model you need the following to work in order to be successful:<ol><li>Ability to push joins down vs. joins that move data between nodes<li>Heavy aggregation or roll-up workload<li>Control over crafting the queries that are created</ol><h3 id=concurrency-and-connections><a href=#concurrency-and-connections>Concurrency and connections</a></h3><p>The one "gotcha" of the real-time analytics use case is concurrency. In our simple example of querying <code>http_request</code>, it's great if you only have 4 shards. But in a world of 64 shards spread across 4 nodes, you have 16 shards per node. This means a single query to Postgres could open 16 connections to each node. One weak area of Postgres is connection management at scale, so we recommend and support pgBouncer out of the box across all our products.<h2 id=designing-up-front-for-citus><a href=#designing-up-front-for-citus>Designing up front for Citus</a></h2><p>A major success factor with Citus will be your use case. If it is a fit, the more greenfield the application, the better your chances. Existing applications can absolutely be retrofitted to work with Citus, but it often takes some data maneuvering, schema modifications, and query modifications. As with many technologies, if Citus is the right tool for you, you should absolutely use it. If you have questions about whether Citus may or may not be a fit, reach out to us <a href=https://www.crunchydata.com/contact>@crunchydata</a>. We've helped a number of customers successfully adopt Citus; in other cases, we've helped customers be successful on different paths. While Citus is very powerful, it is a special-purpose tool. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">ca14df7cb65f84286592c6d2b47bb0b7e057db2b9952dced6d71900688dc7310</guid>
<pubDate>Tue, 18 Mar 2025 09:50:00 EDT</pubDate>
<dc:date>2025-03-18T13:50:00.000Z</dc:date>
<atom:updated>2025-03-18T13:50:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres, dbt, and Iceberg: Scalable Data Transformation ]]></title>
<link>https://www.crunchydata.com/blog/postgres-dbt-and-iceberg-scalable-data-transformation</link>
<description><![CDATA[ Crunchy Data Warehouse now supports dbt for data management and transformation with Iceberg. ]]></description>
<content:encoded><![CDATA[ <p>Seamless integration of dbt with Crunchy Data Warehouse automates data movement between Postgres and Apache Iceberg. dbt’s modular SQL approach, combined with Iceberg’s scalable storage and Postgres’ query engine, means you can build fast, efficient, and reliable analytics—with minimal complexity.<p>Today let’s dig into an example of using dbt with Postgres and Iceberg. The steps will be:<ol><li>Set up Iceberg tables in Crunchy Data Warehouse using real-world, real-time data from GitHub events<li>Configure dbt to transform and summarize the data with rollups/aggregations<li>Utilize incremental models to process new data<li>Query and analyze the results for insights with Postgres</ol><h2 id=creating-iceberg-tables-with-dbt><a href=#creating-iceberg-tables-with-dbt>Creating Iceberg tables with dbt</a></h2><p>Typically, when working with a data warehouse, you’ll initially create and stage your source table, then have other systems operate on top of it. Here, instead of manually creating the source table, we can use a dbt macro to automate the process. Creating Iceberg tables with dbt allows you to keep your data pipelines under version control and test them locally. Below is a sample dbt macro that defines the source table to efficiently store and process the events:<pre><code class=language-sql>{% macro create_crunchy_events() %}

{% set sql %}
    set crunchy_iceberg.default_location_prefix TO '{{ env_var('ICEBERG_LOCATION_PREFIX', '') }}';

    create schema if not exists crunchy_gh;

    create table crunchy_gh.events (
        id text,
        type text,
        actor text,
        repo text,
        payload text,
        public boolean,
        created_at timestamptz,
        org text)
    using iceberg;
{% endset %}

{% do run_query(sql) %}
{% do log("create_crunchy_events finished", info=True) %}
{% endmacro %}
</code></pre><p>Before creating the source table, let’s set the location prefix where our Iceberg table’s files will be stored:<pre><code class=language-bash>export ICEBERG_LOCATION_PREFIX='s3://v5zf6dobuac3rmwxnzykbdncdqzckxzh/6xl6nijprvcp3i2dolnfcv6l4e'
</code></pre><p>You can run the macro to create the source table as shown below:<pre><code class=language-bash>dbt run-operation create_crunchy_events
</code></pre><pre><code class=language-jsx>postgres=# \d crunchy_gh.events
                          Foreign table "crunchy_gh.events"
   Column   |           Type           | Collation | Nullable | Default | FDW options
------------+--------------------------+-----------+----------+---------+-------------
 id         | text                     |           |          |         |
 type       | text                     |           |          |         |
 actor      | text                     |           |          |         |
 repo       | text                     |           |          |         |
 payload    | text                     |           |          |         |
 public     | boolean                  |           |          |         |
 created_at | timestamp with time zone |           |          |         |
 org        | text                     |           |          |         |
Server: crunchy_iceberg
FDW options: (location 's3://ipmikgqfjhtnenhmfu2nek7v43pmwxdk/feooahhfg5eolm7js2dsbhg7kq/postgres/crunchy_gh/events/16802')
</code></pre><h2 id=ingesting-sample-data-from-github-events-in-s3><a href=#ingesting-sample-data-from-github-events-in-s3>Ingesting sample data from GitHub events in S3</a></h2><p>In this example, we’ll use GitHub events for the Crunchy Data repos, which include events such as an issue being opened or a pull request being commented on. GitHub event data has been exposed by ClickHouse at a public URL: s3://clickhouse-public-datasets/gharchive/original/.<p>To load new data into the source table, a dbt macro will fetch GitHub events for a given date and insert them into the table:<pre><code class=language-sql>{% macro copy_crunchy_events(event_date) %}
{% set sql %}
    CREATE OR REPLACE PROCEDURE copy_crunchy_events(event_date date)
    LANGUAGE plpgsql
    AS $$
    BEGIN

        FOR h IN 0..23 LOOP
            RAISE NOTICE 'Executing hour: %', h;

            BEGIN
                EXECUTE 'COPY crunchy_gh.events
                         FROM ''s3://clickhouse-public-datasets/gharchive/original/' || event_date || '-' || h || '.json.gz''
                         WITH (format ''json'')
                         WHERE repo LIKE ''%%CrunchyData/%%'';';
            EXCEPTION
                WHEN OTHERS THEN
                    -- sometimes files are not compressed as expected
                    EXECUTE 'COPY crunchy_gh.events
                             FROM ''s3://clickhouse-public-datasets/gharchive/original/' || event_date || '-' || h || '.json.gz''
                             WITH (format ''json'', compression ''none'')
                             WHERE repo LIKE ''%%CrunchyData/%%'';';
            END;
        END LOOP;

    END
    $$;

    CALL copy_crunchy_events('{{ event_date }}');
{% endset %}

{% do run_query(sql) %}
{% do log("copy_crunchy_events finished", info=True) %}
{% endmacro %}
</code></pre><p>You can ingest the events of a specific day as shown below:<pre><code class=language-bash>dbt run-operation copy_crunchy_events --args "{event_date: 2024-10-17}"
</code></pre><pre><code class=language-jsx>postgres=# select count(*) from crunchy_gh.events;
 count
-------
    97
(1 row)

postgres=# select * from crunchy_gh.events;
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id         | 42934521395
type       | ForkEvent
actor      | {"id":144409824,"login":"sagar-shrestha24","display_login":"sagar-shrestha24","gravatar_id":"","url":"https://api.github.com/users/sagar-shrestha24","avatar_url":"https://avatars.githubusercontent.com/u/144409824?"}
repo       | {"id":362921285,"name":"CrunchyData/postgres-operator-examples","url":"https://api.github.com/repos/CrunchyData/postgres-operator-examples"}
payload    | {"forkee":{"id":874076753,"node_id":"R_kgDONBlaUQ","name":"postgres-operator-examples","full_name":"sagar-shrestha24/postgres-operator-examples","private":false,"public":true}}
public     | t
created_at | 2024-10-17 08:04:13+00
org        | {"id":8248870,"login":"CrunchyData","gravatar_id":"","url":"https://api.github.com/orgs/CrunchyData","avatar_url":"https://avatars.githubusercontent.com/u/8248870?"}
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id         | 42931737143
type       | ForkEvent
actor      | {"id":8130747,"login":"ISingularity","display_login":"ISingularity","gravatar_id":"","url":"https://api.github.com/users/ISingularity","avatar_url":"https://avatars.githubusercontent.com/u/8130747?"}
repo       | {"id":83363132,"name":"CrunchyData/postgres-operator","url":"https://api.github.com/repos/CrunchyData/postgres-operator"}
payload    | {"forkee":{"id":874036356,"node_id":"R_kgDONBi8hA","name":"postgres-operator","full_name":"ISingularity/postgres-operator","private":false,"public":true}}
public     | t
created_at | 2024-10-17 06:33:32+00
org        | {"id":8248870,"login":"CrunchyData","gravatar_id":"","url":"https://api.github.com/orgs/CrunchyData","avatar_url":"https://avatars.githubusercontent.com/u/8248870?"}
</code></pre><p>In the next section, we’ll configure dbt to use this table as a source and start building transformations to get daily stars of the repos under the Crunchy Data organization!<h2 id=transform-iceberg-table-via-dbt><a href=#transform-iceberg-table-via-dbt>Transform Iceberg table via dbt</a></h2><p>The transformation process in dbt involves defining a model, which specifies how we want the data to be transformed from the raw source table into a more refined dataset. We’ll explain the model configuration and SQL logic used in this process, along with the key feature of incremental processing.<h3 id=model-configuration><a href=#model-configuration>Model Configuration</a></h3><p>In dbt, the model configuration controls how the transformation process behaves. For this transformation, we use the <code>incremental</code> materialization, which is key to processing new data without reprocessing the entire dataset. The configuration includes a few important options:<ul><li><code>materialized='incremental'</code>: This tells dbt to perform incremental updates instead of fully rebuilding the table each time.<li><code>unique_key='created_at'</code>: This specifies the unique identifier for each record, used to detect new records.<li><code>pre_hook</code> and <code>post_hook</code>: These hooks are executed before and after the model runs. In this case, the <code>pre_hook</code> sets the default access method to <code>iceberg</code> and configures the location prefix for storing Iceberg tables in S3. The <code>post_hook</code> resets these settings after the model has completed.</ul><pre><code class=language-sql>{{
  config(
    materialized='incremental',
    unique_key='created_at',
    pre_hook="SET default_table_access_method TO 'iceberg'; SET crunchy_iceberg.default_location_prefix = '{{ env_var('ICEBERG_LOCATION_PREFIX', '') }}';",
    post_hook='RESET default_table_access_method; RESET crunchy_iceberg.default_location_prefix;'
  )
}}
</code></pre><p>dbt's incremental processing ensures that we only process the data that has changed, reducing computational cost.<h3 id=dbt-sql-for-data-summary-and-rollup><a href=#dbt-sql-for-data-summary-and-rollup>dbt SQL for data summary and rollup</a></h3><p>The transformation SQL aggregates the events from the source table and groups them by day and repo. It then counts the number of stars on each day.<pre><code class=language-sql>select date_trunc('day', created_at)::date as created_at,
       (repo::jsonb)->>'name' AS repo,
       count(*) as stars
from {{ source('crunchy_gh', 'events') }}
where type = 'WatchEvent'
group by date_trunc('day', created_at)::date, (repo::jsonb)->>'name'

{% if is_incremental() %}
having (date_trunc('day', created_at)::date >= (select coalesce(max(created_at),'1900-01-01') from {{ this }} ))
{% endif %}
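
{# On an incremental run the Jinja above renders to plain SQL. As a sketch,
   with {{ this }} resolved to the target table, the HAVING clause becomes:
   having date_trunc('day', created_at)::date >=
     (select coalesce(max(created_at), '1900-01-01')
      from crunchy_demos_crunchy_gh.daily_stars) #}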
</code></pre><p>The key part of this SQL is the conditional <code>HAVING</code> clause, which ensures that only new records are processed during the incremental runs. Here's how it works:<ul><li>The <code>is_incremental()</code> function checks if dbt is running in incremental mode.<li>If the run is incremental, the <code>HAVING</code> clause filters the records to only include those with a <code>created_at</code> value that is greater than or equal to the latest date in the already processed data (<code>select coalesce(max(created_at), '1900-01-01') from {{ this }}</code>).</ul><p>This ensures that dbt only processes the new data that has been ingested since the last run, making the transformation process more efficient.<p>Let's run the model to create its table and feed it with initial data:<pre><code class=language-bash>dbt build --models daily_stars
</code></pre><pre><code class=language-jsx>postgres=# select * from crunchy_demos_crunchy_gh.daily_stars order by created_at;
 created_at |             repo              | stars
------------+-------------------------------+-------
 2024-10-17 | CrunchyData/pg_tileserv       |     1
 2024-10-17 | CrunchyData/postgres-operator |     1
 2024-10-17 | CrunchyData/pg_parquet        |    74
(3 rows)
</code></pre><p>Assume it is the next day and we want to ingest the new data:<pre><code class=language-bash>dbt run-operation copy_crunchy_events --args "{event_date: 2024-10-18}"
</code></pre><pre><code class=language-jsx>postgres=# select count(*) from crunchy_gh.events;
 count
-------
   242
(1 row)
</code></pre><p>Then, rerun the model to incrementally process the new events from the source table and update the daily stars:<pre><code class=language-bash>dbt build --models daily_stars
</code></pre><pre><code class=language-jsx>postgres=# select * from crunchy_demos_crunchy_gh.daily_stars order by created_at;
 created_at |              repo               | stars
------------+---------------------------------+-------
 2024-10-17 | CrunchyData/pg_tileserv         |     1
 2024-10-17 | CrunchyData/postgres-operator   |     1
 2024-10-17 | CrunchyData/pg_parquet          |    74
 2024-10-18 | CrunchyData/pgCompare           |     2
 2024-10-18 | CrunchyData/pg_parquet          |   118
 2024-10-18 | CrunchyData/postgres-operator   |     2
 2024-10-18 | CrunchyData/pgmonitor-extension |     1
(7 rows)
</code></pre><h2 id=summary><a href=#summary>Summary</a></h2><p>We're excited about the new automation capabilities for scalable analytics solutions across Postgres and Iceberg using dbt and Crunchy Data Warehouse. This integration can make real-time analytics in Postgres more accessible to any organization.<p>We just looked at how dbt can create scalable, version-controlled automation with Iceberg and Postgres in different ways:<ul><li>A dbt macro that automates the creation of Iceberg tables<li>A custom dbt macro that loads event data from an S3 dataset using Postgres' COPY command<li>Incremental processing in dbt for processing only new records<li>A dbt SQL transformation model that aggregates event data by day for easy analytics</ul><p>As you start working with dbt, Iceberg, and Postgres, we'd love to <a href=https://www.crunchydata.com/contact>hear from you</a>. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Aykut.Bozkurt@crunchydata.com (Aykut Bozkurt) ]]></author>
<dc:creator><![CDATA[ Aykut Bozkurt ]]></dc:creator>
<guid isPermalink="false">21fa18ef8d521f6cb94f137e7ae4d7b304cfa3e4e30238685ba3bd4b135f8db3</guid>
<pubDate>Tue, 11 Mar 2025 09:30:00 EDT</pubDate>
<dc:date>2025-03-11T13:30:00.000Z</dc:date>
<atom:updated>2025-03-11T13:30:00.000Z</atom:updated></item></channel></rss>