<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/analytics/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/analytics</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/analytics</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Wed, 21 May 2025 10:00:00 EDT</pubDate>
<dc:date>2025-05-21T14:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Archive Postgres Partitions to Iceberg ]]></title>
<link>https://www.crunchydata.com/blog/archive-postgres-partitions-to-iceberg</link>
<description><![CDATA[ Create a clean and simple archive process from Postgres to Iceberg with partitioning and automatic replication. ]]></description>
<content:encoded><![CDATA[ <p>Postgres comes with <a href=https://www.crunchydata.com/blog/native-partitioning-with-postgres>built-in partitioning</a>, and you can layer in <code>pg_partman</code> for additional help with partition maintenance. Partitioning works quite well when your primary workload queries a small, time-focused subset of data: it makes it easy to retain a limited window of data and improves performance. Often, when implementing partitioning, you keep only a portion of your data and drop older partitions as they age out, for cost management.<p>But what if we could move old partitions seamlessly to Iceberg, retaining all our data forever while keeping only recent partitions in Postgres? Could we have a perfect world: a full long-term copy in Iceberg, easily queryable from a warehouse, with Postgres still functioning as the operational database holding the most recent 30 days of data?<p>With the <a href=https://www.crunchydata.com/blog/logical-replication-from-postgres-to-iceberg>latest replication support</a> for Crunchy Data Warehouse this works seamlessly. Let’s dig in.<h2 id=first-lets-setup-our-partitioning><a href=#first-lets-setup-our-partitioning>First, let’s set up our partitioning</a></h2><p>If you’d like to follow along at home, here’s some code to set up a sample set of partitioned data resembling a web analytics data set.<pre><code class=language-sql>CREATE TABLE page_hits (
    id SERIAL,
    site_id INT NOT NULL,
    ingest_time TIMESTAMPTZ NOT NULL,
    url TEXT NOT NULL,
    request_country TEXT,
    ip_address INET,
    status_code INT,
    response_time_msec INT,
    PRIMARY KEY (id, ingest_time)
) PARTITION BY RANGE (ingest_time);
</code></pre><p>This <code>DO</code> block will create a partition for each of the last 30 days.<pre><code class=language-sql>DO $$
DECLARE
  d DATE;
BEGIN
  FOR d IN SELECT generate_series(DATE '2025-04-20', DATE '2025-05-19', INTERVAL '1 day') LOOP
    EXECUTE format($f$
      CREATE TABLE IF NOT EXISTS page_hits_%s PARTITION OF page_hits
      FOR VALUES FROM ('%s') TO ('%s');
    $f$, to_char(d, 'YYYY_MM_DD'), d, d + INTERVAL '1 day');
  END LOOP;
END $$;
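
-- Alternatively (a sketch, not from the original post: assumes the
-- pg_partman extension is installed; argument names per pg_partman v5):
-- SELECT partman.create_parent(
--     p_parent_table => 'public.page_hits',
--     p_control      => 'ingest_time',
--     p_interval     => '1 day'
-- );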
</code></pre><p>Your database should look something like this:<pre><code class=language-sql>                            List of relations
 Schema |          Name           |       Type        |       Owner
--------+-------------------------+-------------------+-------------------
 public | page_hits               | partitioned table | postgres
 public | page_hits_2025_04_20    | table             | postgres
 public | page_hits_2025_04_21    | table             | postgres
...
 public | page_hits_2025_05_18    | table             | postgres
 public | page_hits_2025_05_19    | table             | postgres
 public | page_hits_id_seq        | sequence          | postgres

</code></pre><p>Now we can generate some sample data. In this case we’re going to generate 1000 rows per day for each of our tables:<pre><code class=language-sql>DO $$
DECLARE
  d DATE;
BEGIN
  FOR d IN
    SELECT generate_series(DATE '2025-04-20', DATE '2025-05-19', '1 day'::INTERVAL)
  LOOP
    INSERT INTO page_hits (site_id, ingest_time, url, request_country, ip_address, status_code, response_time_msec)
    SELECT
        (RANDOM() * 30)::INT,
        d + (i || ' seconds')::INTERVAL,
        'http://example.com/' || substr(md5(random()::text), 1, 12),
        (ARRAY['China', 'India', 'Indonesia', 'USA', 'Brazil'])[1 + (random() * 4)::INT],
        inet '10.0.0.0' + (random() * 1000000)::INT,
        (ARRAY[200, 200, 200, 404, 500])[1 + (random() * 4)::INT],
        (random() * 300)::INT
    FROM generate_series(1, 1000) AS s(i);
  END LOOP;
END $$;
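
-- Sanity check (the loop above inserts 1,000 rows for each of 30 days):
SELECT count(*) FROM page_hits;  -- 30000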

</code></pre><p>Now that we have some data within our Postgres setup, let’s connect things to our Crunchy Data Warehouse and get them replicated over.<h2 id=set-up-replication-to-iceberg><a href=#set-up-replication-to-iceberg>Set up replication to Iceberg</a></h2><p>In the publication, specify publishing via the root partition with <code>publish_via_partition_root = true</code>. This keeps partitions in Postgres but does not partition Iceberg, since Iceberg has its own organization of data files.<pre><code class=language-sql>CREATE PUBLICATION hits_to_iceberg
FOR TABLE page_hits
WITH (publish_via_partition_root = true);
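
-- Confirm the publication publishes changes via the root table
-- (a standard Postgres catalog; pubviaroot should be t):
SELECT pubname, pubviaroot FROM pg_publication WHERE pubname = 'hits_to_iceberg';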
</code></pre><p>Set up the replication user:<pre><code class=language-sql>-- create a new user
CREATE USER replication_user WITH REPLICATION PASSWORD '****';

-- grant appropriate permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
</code></pre><p>And on the warehouse end, subscribe to the originating data. Since we’ve specified <code>create_tables_using = 'iceberg'</code>, this data will be stored in Iceberg.<pre><code class=language-sql>CREATE SUBSCRIPTION http_to_iceberg
CONNECTION 'postgres://replication_user:****@p.qzyqhjdg3fhejocnta3zvleomq.db.postgresbridge.com:5432/postgres?sslmode=require'
PUBLICATION hits_to_iceberg
WITH (create_tables_using = 'iceberg', streaming, binary, failover);
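
-- On the source side, replication progress can be watched with the
-- standard Postgres catalogs (not specific to Crunchy Data Warehouse):
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;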
</code></pre><p>And here’s the Iceberg table.<pre><code class=language-sql>                          List of relations
 Schema |          Name           |     Type      |       Owner
--------+-------------------------+---------------+-------------------
 public | page_hits               | foreign table | postgres

</code></pre><h2 id=now-query-data-stored-in-iceberg-from-postgres><a href=#now-query-data-stored-in-iceberg-from-postgres>Now query data stored in Iceberg from Postgres</a></h2><p>Here we can see the daily traffic insights for each country, breaking down the number of hits, success rate, average response time, and top error codes:<pre><code class=language-sql>SELECT
  date_trunc('day', ingest_time) AS day,
  request_country,
  COUNT(*) AS total_hits,
  ROUND(100.0 * SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate_percent,
  ROUND(AVG(response_time_msec), 2) AS avg_response_time_msec,
  MODE() WITHIN GROUP (ORDER BY status_code) AS most_common_status
FROM
  page_hits
GROUP BY
  day, request_country
ORDER BY
  day, request_country;
</code></pre><pre><code class=language-sql>          day           | request_country | total_hits | success_rate_percent | avg_response_time_msec | most_common_status
------------------------+-----------------+------------+----------------------+------------------------+--------------------
 2025-04-20 00:00:00+00 | Brazil          |        128 |                68.75 |                 146.83 |                200
 2025-04-20 00:00:00+00 | China           |        138 |                65.94 |                 145.67 |                200
 2025-04-20 00:00:00+00 | India           |        245 |    64.90000000000001 |                  153.8 |                200
 2025-04-20 00:00:00+00 | Indonesia       |        230 |    64.34999999999999 |                 151.43 |                200

</code></pre><h2 id=now-drop-the-older-postgres-partition><a href=#now-drop-the-older-postgres-partition>Now drop the older Postgres partition</a></h2><p>Since data is replicated and a copy is in Iceberg, we can drop partitions at a specific time to free up storage and memory on our main operational Postgres database.<pre><code class=language-sql>--drop partition
DROP TABLE page_hits_2025_04_20;
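
-- A sketch for automating this on a schedule, assuming the pg_cron
-- extension is available (not part of the original post):
SELECT cron.schedule('drop-oldest-partition', '0 3 * * *', $job$
  DO $body$
  BEGIN
    EXECUTE format('DROP TABLE IF EXISTS page_hits_%s',
                   to_char(now() - interval '30 days', 'YYYY_MM_DD'));
  END $body$;
$job$);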
</code></pre><pre><code class=language-sql>-- show missing partition in the table list
                            List of relations
 Schema |          Name           |       Type        |       Owner
--------+-------------------------+-------------------+-------------------
 public | page_hits               | partitioned table | postgres
 public | page_hits_2025_04_21    | table             | postgres
 public | page_hits_2025_04_22    | table             | postgres

</code></pre><pre><code class=language-sql>-- query iceberg, data is still there
          day           | request_country | total_hits | success_rate_percent | avg_response_time_msec | most_common_status
------------------------+-----------------+------------+----------------------+------------------------+--------------------
 2025-04-20 00:00:00+00 | Brazil          |        128 |                68.75 |                 146.83 |                200
 2025-04-20 00:00:00+00 | China           |        138 |                65.94 |                 145.67 |                200
 2025-04-20 00:00:00+00 | India           |        245 |    64.90000000000001 |                  153.8 |                200
 2025-04-20 00:00:00+00 | Indonesia       |        230 |    64.34999999999999 |                 151.43 |                200

</code></pre><h2 id=summary><a href=#summary>Summary</a></h2><p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/bb816cf6-4b05-4430-324a-9eb7c7623000/public><p>Here’s the recipe for simple Postgres archiving with long-term, cost-effective data retention:<p>1 - Partition your high-throughput data - this is ideal for performance and management anyway.<p>2 - Replicate your data to Iceberg for easy reporting and long-term archiving.<p>3 - Drop partitions at the ideal interval.<p>4 - Continue to query archived data from Postgres. ]]></content:encoded>
<category><![CDATA[ Partitioning ]]></category>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">4596ecd47c785293f3b49cc133441a30a91280d7f163aea74c079786938c604c</guid>
<pubDate>Wed, 21 May 2025 10:00:00 EDT</pubDate>
<dc:date>2025-05-21T14:00:00.000Z</dc:date>
<atom:updated>2025-05-21T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Announcing pg_parquet v.0.4.0: Google Cloud Storage, https storage, and more ]]></title>
<link>https://www.crunchydata.com/blog/announcing-pg_parquet-v-0-4-google-cloud-storage-https-storage-and-more</link>
<description><![CDATA[ pg_parquet adds Parquet support to the Postgres COPY TO/FROM commands. We're excited to announce integration with Google Cloud Storage, https, and additional formats. ]]></description>
<content:encoded><![CDATA[ <p>What began as a hobby Rust project to explore the PostgreSQL extension ecosystem and the Parquet file format has grown into a handy component for folks integrating Postgres and Parquet into their data architecture. Today, we’re excited to release version 0.4 of <a href=https://github.com/CrunchyData/pg_parquet>pg_parquet</a>.<p>This release includes:<ul><li>COPY TO/FROM Google Cloud Storage<li>COPY TO/FROM http(s) stores<li>COPY TO/FROM stdin/stdout with (FORMAT PARQUET)<li>Support for Parquet UUID, JSON, and JSONB types</ul><p>If you're unfamiliar with it, pg_parquet makes it easy to export and import Parquet files directly within Postgres, without relying on third-party tools. It's not a query engine but a migration tool. If you're looking to export data to other locations, you can drop it off in your data lake to be processed by other engines such as Snowflake, ClickHouse, or Redshift, or, if you want something Postgres native, <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>.<h2 id=what-is-parquet><a href=#what-is-parquet>What is Parquet?</a></h2><p>Heard about Parquet but not sure what it is? Parquet is an open standard file format that is self-documenting for data types and comes with columnar compression. It is a flat file: a point-in-time snapshot of the data you're working with or a subset of your tables. If you're looking to leverage cloud storage for a full database, consider Apache Iceberg, which applies a metadata layer and catalog on top of Parquet. For simply moving data around, pg_parquet integrates Postgres and Parquet with a simple SQL handshake.<h2 id=working-with-pg_parquet><a href=#working-with-pg_parquet>Working with pg_parquet</a></h2><p>pg_parquet hooks into Postgres to provide support for moving data in and out of cloud storage via the Postgres <code>COPY</code> command. Work with <code>COPY</code> just like you normally would.<pre><code class=language-sql>-- Copy a Postgres query result into a Parquet file
COPY (SELECT * FROM table) TO '/tmp/data.parquet' WITH (format 'parquet');

-- Copy a Postgres query result into Parquet in S3
COPY (SELECT * FROM table) TO 's3://mybucket/data.parquet'
WITH (format 'parquet');

-- Load data from Parquet in S3 to Postgres
COPY table FROM 's3://mybucket/data.parquet'
WITH (format 'parquet');
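
-- Also new in 0.4: COPY TO/FROM stdout/stdin, e.g. piping between two
-- databases from a shell (a sketch; database and table names are placeholders):
--   psql db1 -c "COPY table TO STDOUT WITH (format 'parquet')" \
--     | psql db2 -c "COPY table FROM STDIN WITH (format 'parquet')"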
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>With version 0.4, pg_parquet continues to simplify the process of moving data between Postgres and Parquet. Whether you’re archiving data, populating a lakehouse, or bridging systems together for data analytics, pg_parquet covers a wide variety of use cases. Now that pg_parquet supports the major public cloud object stores and a wide variety of data types, it is ready to be integrated into modern data workflows. And because it builds on Postgres <code>COPY</code>, pg_parquet is lightweight, performant, and Postgres native.<p>We’re excited to see how the community puts this release to use and look forward to what’s next. Contributions and feedback are always welcome on <a href=https://github.com/CrunchyData/pg_parquet>GitHub</a>. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Aykut.Bozkurt@crunchydata.com (Aykut Bozkurt) ]]></author>
<dc:creator><![CDATA[ Aykut Bozkurt ]]></dc:creator>
<guid isPermalink="false">82362beb9260c0550cf01b67bcbad5c45f02ec1d348321503ed1df7265a40b00</guid>
<pubDate>Wed, 07 May 2025 08:00:00 EDT</pubDate>
<dc:date>2025-05-07T12:00:00.000Z</dc:date>
<atom:updated>2025-05-07T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Logical replication from Postgres to Iceberg ]]></title>
<link>https://www.crunchydata.com/blog/logical-replication-from-postgres-to-iceberg</link>
<description><![CDATA[ We've launched native logical replication from Postgres tables in any Postgres server to Iceberg tables managed by Crunchy Data Warehouse. ]]></description>
<content:encoded><![CDATA[ <p>Operational and analytical workloads have historically been handled by separate database systems, though they are starting to converge. We built <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a> to put PostgreSQL at the frontier of analytics systems, using modern technologies like <a href=https://iceberg.apache.org/>Iceberg</a> and a <a href=https://www.crunchydata.com/blog/postgres-powered-by-duckdb-the-modern-data-stack-in-a-box>hybrid query engine</a>.<p>Combining operational and analytical capabilities is extremely useful, but it is not meant to drive all your workloads into a single system. In most organizations, application developers and analysts work in different teams with different requirements on data modeling, resource management, operational practices, and various other aspects.<p>What will always be needed is a way to bring data and the stream of changes from an operational database into a separate analytics system. 
As it turns out, if both sides are PostgreSQL, magical things can happen…<p>Today, we are announcing the availability of native logical replication from Postgres tables in any Postgres server to Iceberg tables managed by Crunchy Data Warehouse.<p>The latest release of Crunchy Data Warehouse includes full support for:<ul><li>Insert, update, delete, and truncate replication into Iceberg<li>High transaction rates<li>Low apply lag (&#60;60 seconds)<li>Preservation of transaction boundaries: foreign key constraints still hold<li>Automatic table creation and data copy<li>Automatic compaction<li>Advanced replication protocol features like <a href=https://www.postgresql.org/docs/current/logical-replication-row-filter.html>row filters</a>, <a href=https://www.postgresql.org/docs/current/sql-createsubscription.html>streaming</a> (v4 protocol), and <a href=https://www.postgresql.org/docs/current/logical-replication-failover.html>failover slots</a><li>Automatic handling of TOAST columns<li>Ability to rebuild tables while old data remains readable</ul><p>While it sounds like something from the future, logical replication to Iceberg is available right now on <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a>, and will be available for self-managed users in the next release of <a href=https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes>Crunchy Postgres for Kubernetes</a>.<h2 id=setting-up-logical-replication-into-iceberg><a href=#setting-up-logical-replication-into-iceberg>Setting up logical replication into Iceberg</a></h2><p>Getting started with logical replication to Iceberg is very simple. You can literally set up everything with just two commands.<p>On the source:<pre><code class=language-sql>create publication pub for table chats, users;
</code></pre><p>On Crunchy Data Warehouse, after ensuring connectivity to the source:<pre><code class=language-sql>create subscription sub connection '...' publication pub with (create_tables_using = 'iceberg');
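
-- initial sync state per table can be checked from the standard catalog
-- (srsubstate: i = init, d = data copy, s = synced, r = ready)
select srrelid::regclass, srsubstate from pg_subscription_rel;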
</code></pre><p>The create subscription command will create Iceberg tables for all tables in the publication, then copy the initial data in the background, and then replicate changes. You can also set up the Iceberg tables manually before creating the subscription.<p>You can run high performance analytical queries and data transformations directly on the Iceberg tables in Crunchy Data Warehouse once the initial data copy completes, or use other query engines with the SQL/JDBC Iceberg catalog driver.</p><video autoplay loop muted playsinline>
<source src="/blog-assets/logical-replication-from-postgres-to-iceberg/2025-04-22-bridge-logicalrep-sidebyside-demo.mp4" type="video/mp4" />
</video><h2 id=how-postgres-to-iceberg-replication-works><a href=#how-postgres-to-iceberg-replication-works>How Postgres-to-Iceberg replication works</a></h2><p>Conventional tools for applying a stream of changes to a data warehouse take large batches and apply them using merge commands. While effective, the computational cost of running these commands is relatively high, and increases significantly as the table grows.<p>We invented several new techniques to apply insertions and deletions to Iceberg in micro batches by taking advantage of Postgres’ transactional capabilities. Queries use an efficient merge-on-read method to apply deletions. Insertion and deletion files are later merged during automatic compaction, and compaction only accesses files that were (significantly) modified.<p>What that means is that replication can be sustained with relatively low lag and low overhead. The main cost is that the replication requires some disk space, though usually much less than the source data.<h2 id=get-started-with-replication-to-your-postgres-data-warehouse><a href=#get-started-with-replication-to-your-postgres-data-warehouse>Get started with replication to your Postgres Data Warehouse</a></h2><p>Our goal is to bring all PostgreSQL features and extensions to Iceberg with high performance analytics. Logical replication is a useful Postgres feature that becomes essential in the context of a data warehouse, given the need to synchronize data from operational databases.<p>Of course, PostgreSQL isn’t perfect. Where possible we try to go the extra mile to build a seamless experience, for instance by enabling automatic Iceberg table creation in <code>CREATE SUBSCRIPTION</code>. 
There are many other ways in which we think the logical replication experience can be improved, especially for Iceberg, so this is the start of a journey.<p>If you want to get started with this seamless Postgres -> Iceberg replication experience we encourage you to <a href=https://www.crunchydata.com/contact>reach out to us</a> or <a href=https://docs.crunchybridge.com/warehouse/replication>check out the documentation</a>. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Marco.Slot@crunchydata.com (Marco Slot) ]]></author>
<dc:creator><![CDATA[ Marco Slot ]]></dc:creator>
<guid isPermalink="false">1647a97e6cfab9bfa8d97db5ea5af21c61d4d2f6383d4c3f8138145a91eb309d</guid>
<pubDate>Tue, 22 Apr 2025 09:00:00 EDT</pubDate>
<dc:date>2025-04-22T13:00:00.000Z</dc:date>
<atom:updated>2025-04-22T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Creating Histograms with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/histograms-with-postgres</link>
<description><![CDATA[ Histograms are elegant tools for visualizing distribution of values. We walkthrough building a re-usable query for your histogram needs. ]]></description>
<content:encoded><![CDATA[ <p>Histograms were first used in a lecture in 1892 by Karl Pearson, the godfather of mathematical statistics. With how many data presentation tools we have today, it’s hard to think that representing data as a graphic was once classified as “innovation”, but it was. A histogram is a graphic presentation of the distribution and frequency of data. If you haven’t seen one recently, or don’t know the word histogram off the top of your head: it is a bar chart in which each bar represents the count of data within a defined range of values. When Pearson built the first histogram, he calculated it by hand. Today we can use SQL (or even Excel) to extract this data continuously across large data sets.<p>While true statistical histograms have a bit more complexity for choosing bin ranges, for many business intelligence purposes, Postgres <code>width_bucket</code> is good enough for counting data inside bins with minimal effort.<h2 id=postgres-width_bucket-for-histograms><a href=#postgres-width_bucket-for-histograms>Postgres width_bucket for histograms</a></h2><p>Given the number of buckets and the min/max values, <code>width_bucket</code> returns the index of the bucket that a value falls in. For instance, given a minimum value of 0, a maximum value of 100, and 10 buckets, a value of 43 would fall in bucket #5: <code>select width_bucket(43, 0, 100, 10) AS bucket;</code> But is 5 really the right bucket for 43?<p>You can see how the values would fall using <code>generate_series</code> (shown below using <a href=https://metabase.com>Metabase</a>):<pre><code class=language-sql>SELECT value, width_bucket(value, 0, 100, 10) AS bucket FROM generate_series(0, 100) AS value;
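
-- Boundary behavior: values below the minimum land in bucket 0 and values
-- at or above the maximum land in bucket_count + 1:
SELECT width_bucket(-1, 0, 100, 10) AS below_range,   -- 0
       width_bucket(100, 0, 100, 10) AS above_range;  -- 11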
</code></pre><p><img alt="postgres histogram 1-100"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3aec99f0-696b-4d9b-0c7b-8ce9fc415200/public><p>When running the query, the values 0 through 9 go into bucket 1. As you can see in the image above, <code>width_bucket</code> behaves as a step function that starts indexing with 1. In this scenario, when passed a value of 100, <code>width_bucket</code> returns 11, because the maximum value given the width_bucket is an exclusive range (i.e. the logic is minimum &#60= value &#60 maximum).<p>We can use the bucket value to generate more readable labels.<h2 id=auto-formatting-histogram-with-sql><a href=#auto-formatting-histogram-with-sql>Auto-formatting histogram with SQL</a></h2><p>Let’s build out a larger query that creates ranges, range labels, and formats the histogram. We will start by using a synthetic table within a CTE called <code>formatted_data</code>. We are doing it this way so that we can replace that query with new data in the future.<p>Here’s the beginning of the query (this is copy-pastable into Postgres):<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
)
SELECT
  WIDTH_BUCKET(value, 0, 100, 10) AS bucket,
  COUNT(value)
FROM formatted_data
  GROUP BY 1
  ORDER BY 1;
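
-- For the sample values above, this returns:
-- bucket | count
-- -------+-------
--      1 |     1
--      2 |     2
--      5 |     3
--      6 |     1
--      7 |     1
--     10 |     1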
</code></pre><p>Let’s use another CTE to define some settings for our <code>width_bucket</code>:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
	SELECT
		10 as bucket_count,
		0::integer AS min_value, -- can be null::integer or an integer
		100::integer AS max_value -- can be null::integer or an integer
)

SELECT
  WIDTH_BUCKET(value,
	  (SELECT min_value FROM bucket_settings),
		(SELECT max_value FROM bucket_settings),
		(SELECT bucket_count FROM bucket_settings)
	) AS bucket,
  COUNT(value)
FROM formatted_data
  GROUP BY 1
  ORDER BY 1;
</code></pre><p>In the <code>bucket_settings</code> CTE, we use <code>::integer</code> to cast any value there as an integer. We do this since we will want to compare NULL against other integers later. If we don’t cast NULLs then the SQL will fail.<p>Now, we will use a CTE called <code>calculated_bucket_settings</code> to set a dynamic range if the static range is not defined. This will let the data specify the values if they are not defined by the <code>bucket_settings</code>:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
	SELECT
		5 AS bucket_count,
		null::integer AS min_value, -- can be null or an integer
		null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
		(SELECT bucket_count FROM bucket_settings) AS bucket_count,
		COALESCE(
			(SELECT min_value FROM bucket_settings),
			(SELECT min(value) FROM formatted_data)
		) AS min_value,
		COALESCE(
			(SELECT max_value FROM bucket_settings),
			(SELECT max(value) + 1 FROM formatted_data)
		) AS max_value
), histogram AS (
  SELECT
     WIDTH_BUCKET(value, min_value, max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
     COUNT(value) AS frequency
   FROM formatted_data, calculated_bucket_settings
   GROUP BY 1
   ORDER BY 1
)

SELECT
   bucket,
   frequency,
   CONCAT(
     (min_value + (bucket - 1) * (max_value - min_value) / bucket_count)::INT,
     ' - ',
     (((min_value + bucket * (max_value - min_value) / bucket_count)) - 1)::INT) AS range
FROM histogram, calculated_bucket_settings;
</code></pre><p>In the <code>calculated_bucket_settings</code> CTE, we use <code>max(value) + 1</code> because the range of values is treated as an exclusive range. Also, because we are working with integers, when creating the pretty label for the <code>range</code>, we subtract 1 from the maximum value of each range to reduce confusion from what would appear to be overlapping ranges. This decision fits into the “good-enough for business intelligence” caveats listed above. We could have changed the label logic to be <code>75 &#60;= value &#60; 94</code> in lieu of the subtraction, but most folks like to see the dash instead of math logic for a histogram.<p>The query above will give results like the following:<pre><code class=language-sql>bucket   | frequency |  range
---------+-----------+---------
       1 |         3 | 1 - 18
       3 |         4 | 38 - 55
       4 |         1 | 56 - 74
       5 |         1 | 75 - 93
(4 rows)
</code></pre><p>Now we see that not all buckets and frequencies are represented. So, if a bucket is empty, we need to fill in its frequency with a zero. This is where SQL requires thinking in sets. We can use <code>generate_series</code> to generate all values for the buckets, then join the histogram to all values. Flipping the order of the query around makes it simpler than joining an incomplete set. In the following query, we’ve built out the buckets in the <code>all_buckets</code> CTE, then joined that to the histogram values:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
  SELECT
        5 AS bucket_count,
        0::integer AS min_value, -- can be null or an integer
        100::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
	  (SELECT bucket_count FROM bucket_settings) AS bucket_count,
	  COALESCE(
	          (SELECT min_value FROM bucket_settings),
	          (SELECT min(value) FROM formatted_data)
	  ) AS min_value,
	  COALESCE(
	          (SELECT max_value FROM bucket_settings),
	          (SELECT max(value) + 1 FROM formatted_data)
	  ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
    COUNT(value) AS frequency
  FROM formatted_data, calculated_bucket_settings
  GROUP BY 1
  ORDER BY 1
 ), all_buckets AS (
  SELECT
    fill_buckets.bucket AS bucket,
    FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
    FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
  FROM calculated_bucket_settings,
	  generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))

 SELECT
   all_buckets.bucket AS bucket,
   CASE
   WHEN all_buckets IS NULL THEN
	   'out of bounds'
	 ELSE
     CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
   END AS range,
   SUM(COALESCE(histogram.frequency, 0)) AS frequency
 FROM all_buckets
 FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
 GROUP BY 1, 2
 ORDER BY bucket;
</code></pre><p>Try modifying the values in the <code>bucket_settings</code> CTE to see how the histogram responds. Increase the <code>bucket_count</code> or change the <code>min_value</code> and <code>max_value</code>, and the histogram adjusts accordingly. If you modify the range to exclude values, the <code>FULL OUTER JOIN</code> ensures that all non-classified items are bucketed as “out of bounds”.<p>Using a presentation tool, display the histogram as a bar chart (shown below using <a href=https://metabase.com>Metabase</a>):<p><img alt="postgres histogram" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/fb95ee22-4d7d-40c5-3d2c-74a610cbd000/public><h2 id=real-life-data-with-histograms><a href=#real-life-data-with-histograms>Real Life Data with Histograms</a></h2><p>Now that we have a really nice auto-adjusting query, we can easily build a histogram from other examples. I have a little experimental database from the <a href=https://aact.ctti-clinicaltrials.org/download>database of clinical trials</a>.<p>What if we wanted to build a histogram for the count of participants in various clinical trial studies? To start, build the query that finds the number of participants for each study:<pre><code class=language-sql>SELECT
	outcomes.nct_id,
	max(outcome_counts.count) AS value
FROM outcomes
INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
WHERE param_type = 'COUNT_OF_PARTICIPANTS'
GROUP BY 1
</code></pre><p>We can take the above query and place it in the <code>formatted_data</code> CTE:<pre><code class=language-sql>WITH formatted_data AS (
	SELECT
		outcomes.nct_id,
		MAX(outcome_counts.count) AS value
	FROM outcomes
	INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
	WHERE param_type = 'COUNT_OF_PARTICIPANTS'
	GROUP BY 1
), bucket_settings AS (
  SELECT
        20 AS bucket_count,
        null::integer AS min_value, -- can be null or an integer
        null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
	  (SELECT bucket_count FROM bucket_settings) AS bucket_count,
	  COALESCE(
	          (SELECT min_value FROM bucket_settings),
	          (SELECT min(value) FROM formatted_data)
	  ) AS min_value,
	  COALESCE(
	          (SELECT max_value FROM bucket_settings),
	          (SELECT max(value) + 1 FROM formatted_data)
	  ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
     COUNT(value) AS frequency
   FROM formatted_data, calculated_bucket_settings
   GROUP BY 1
   ORDER BY 1
 ), all_buckets AS (
   SELECT
     fill_buckets.bucket AS bucket,
     FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
     FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
   FROM calculated_bucket_settings,
	   generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))
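-- Worked example of the boundary arithmetic in all_buckets: with
-- min_value = 0, max_value = 1000, and bucket_count = 20, bucket 2 spans
-- floor(0 + 1 * 1000 / 20) = 50 through floor(0 + 2 * 1000 / 20) - 1 = 99.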

 SELECT
   all_buckets.bucket AS bucket,
   CASE
     WHEN all_buckets IS NULL THEN
       'out of bounds'
     ELSE
       CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
   END AS range,
   SUM(COALESCE(histogram.frequency, 0)) AS frequency
 FROM all_buckets
 FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
 GROUP BY 1, 2
 ORDER BY bucket;
</code></pre><p>The query outputs the following. This distribution is a bit undesirable because nearly all values land in the first bucket:<pre><code class=language-sql> bucket |       range       | frequency
--------+-------------------+-----------
      1 | 1 - 359943        |     23261
      2 | 359944 - 719886   |         3
      3 | 719887 - 1079829  |         1
      4 | 1079830 - 1439773 |         0
      5 | 1439774 - 1799716 |         1
      6 | 1799717 - 2159659 |         0
      7 | 2159660 - 2519602 |         0
      8 | 2519603 - 2879546 |         0
      9 | 2879547 - 3239489 |         0
     10 | 3239490 - 3599432 |         0
     11 | 3599433 - 3959375 |         0
     12 | 3959376 - 4319319 |         0
     13 | 4319320 - 4679262 |         0
     14 | 4679263 - 5039205 |         0
     15 | 5039206 - 5399148 |         0
     16 | 5399149 - 5759092 |         0
     17 | 5759093 - 6119035 |         0
     18 | 6119036 - 6478978 |         0
     19 | 6478979 - 6838921 |         0
     20 | 6838922 - 7198865 |         1
(20 rows)

</code></pre><p>If you’ve loaded the data, you can improve the presentation by adjusting the <code>bucket_settings</code> CTE to change how the buckets are defined. For instance, with this dataset, if we change the bucket settings to:<pre><code class=language-sql>  SELECT
        20 AS bucket_count,
        0::integer AS min_value, -- can be null or an integer
        1000::integer AS max_value -- can be null or an integer
</code></pre><p>It outputs a much nicer distribution of data:<pre><code class=language-sql> bucket |     range     | frequency
--------+---------------+-----------
      1 | 0 - 49        |     13584
      2 | 50 - 99       |      3612
      3 | 100 - 149     |      1720
      4 | 150 - 199     |       942
      5 | 200 - 249     |       645
      6 | 250 - 299     |       477
      7 | 300 - 349     |       338
      8 | 350 - 399     |       237
      9 | 400 - 449     |       176
     10 | 450 - 499     |       137
     11 | 500 - 549     |       150
     12 | 550 - 599     |       101
     13 | 600 - 649     |        77
     14 | 650 - 699     |        58
     15 | 700 - 749     |        61
     16 | 750 - 799     |        41
     17 | 800 - 849     |        41
     18 | 850 - 899     |        33
     19 | 900 - 949     |        36
     20 | 950 - 999     |        43
        | out of bounds |       758
</code></pre><h2 id=in-brief><a href=#in-brief>In brief</a></h2><ul><li>Postgres <code>width_bucket</code> assigns values to buckets so you can count frequencies and build histograms.<ul><li>The function assigns each value to one of a set of predefined buckets based on a min/max range and a bucket count.<li>By casting the bucket settings, you can work with data that contains null values.<li>Values that fall outside the defined range are captured in an “out of bounds” bucket.</ul><li>Using Common Table Expressions (CTEs), you can define bucket settings dynamically, with auto-adjusting bins based on the dataset.<li>Histograms aid in visualizing the distribution of data in your set. They show how frequently data points appear within specific ranges (bins), making it easier to understand patterns, trends, and outliers. Bin size affects interpretation, so choosing the right number of bins is crucial: too few can oversimplify the data, while too many can create noise and obscure trends.</ul><p>Build an interesting histogram? Show us <a href=https://x.com/crunchydata>@crunchydata</a>! ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">62085653255fdab2276832f69926de399f4b9c4e76871d17191be67b2a96104d</guid>
<pubDate>Fri, 04 Apr 2025 10:00:00 EDT</pubDate>
<dc:date>2025-04-04T14:00:00.000Z</dc:date>
<atom:updated>2025-04-04T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Reducing Cloud Spend: Migrating Logs from CloudWatch to Iceberg with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/reducing-cloud-spend-migrating-logs-from-cloudwatch-to-iceberg-with-postgres</link>
<description><![CDATA[ How we migrated our internal logging for our database as a service, Crunchy Bridge, from CloudWatch to S3 with Iceberg and Postgres. The result was simplified logging management, better access with SQL, and significant cost savings. ]]></description>
<content:encoded><![CDATA[ <p>As a database service provider, we store a number of logs internally to audit and oversee what is happening within our systems. When we started out, the volume of these logs was predictably low, but with scale they grew rapidly. Given the number of databases we run for users on Crunchy Bridge, the volume of these logs has grown to a sizable amount. Until last week, we retained those logs in AWS CloudWatch. Spoiler alert: this is expensive.<p>While we have a number of strategies to drive efficiency around the logs we retain, such as regularly removing unnecessary noise and pruning old logs, that growth has driven AWS CloudWatch to represent a sizable portion of our infrastructure spend.<p>Going forward, we now have a new workflow that makes use of low cost S3 storage with Iceberg tables and the power and simplicity of <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>, which has <strong>reduced our spend on logging by over $30,000 a month</strong>.<p>Using this new workflow, we can simply:<ul><li>archive logs directly into S3<li>incrementally load those logs into Iceberg via Crunchy Data Warehouse<li>use SQL in Crunchy Data Warehouse to query the logs when needed</ul><p>The crux of any log ingestion service is more or less: ingest log traffic, index the data, offload the logs to more cost-efficient storage, and access them later when necessary.<p>Historically, we used AWS CloudWatch, but there are many logging services available. These services offer a range of capabilities, but come with a price tag at a premium over the cost of storing logs directly in S3. While simply exporting logs to S3 always represented a potential cost savings, without a query engine to efficiently investigate those logs when required, exporting logs to S3 was not previously a viable solution.
Crunchy Data Warehouse's ability to easily query S3 was the breakthrough we needed.<h2 id=setting-up-logs-with-s3-and-iceberg><a href=#setting-up-logs-with-s3-and-iceberg>Setting up logs with S3 and Iceberg</a></h2><p>The first step? Get all of our logs flowing into S3.<p>Every server in our fleet, whether it runs our customers’ Postgres workloads or is part of the Crunchy Bridge service itself, runs a logging process that continuously collects a variety of logs. The logs are generated from various sources; a few examples are SSH access, the Linux kernel, and Postgres. These logs all have different schemas and encodings, which the logging agent transforms into a consistent CSV structure before batching and flushing them to durable, long-term storage. Once these logs make it off host, they are indexed and stored where they can be queried as needed.<p>Now that we have our logs flowing into S3, we provision a Crunchy Data Warehouse so we can:<ol><li><p>Move the data from CSV to Iceberg for better compression<li><p>Query our logs using standard SQL with Postgres.</ol><p>Once the warehouse is provisioned, create a foreign table called <code>logs</code> from within Crunchy Data Warehouse that points at the S3 bucket's CSV files:<pre><code class=language-sql>create foreign table logs (
   /* column names and types */
)
server crunchy_lake_analytics
options (path 's3://crunchy-bridge/tmp/*.tsv.gz', format 'csv', compression 'gzip', delimiter E'\t', filename 'true');
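
-- With filename 'true', each row also exposes its source file in a
-- _filename column. An illustrative check that files are arriving:
-- select _filename, count(*) from logs group by 1 order by 2 desc limit 10;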
</code></pre><p>Now we create a fully managed Iceberg table with the same schema as the foreign table referencing the CSVs. Iceberg is beneficial here because it automatically compresses the data into Parquet files of 512 MB per file, makes it easy to add data across files, and pushes down queries that target only a narrow window. Essentially, we've gone from CSV to a columnar file format, and from flat files to a full database:<pre><code class=language-sql>-- Create an Iceberg table with the same schema
create table logs_iceberg (like logs)
using iceberg;
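
-- Queries now read compressed Parquet instead of gzipped CSV.
-- An illustrative scan (column names here are hypothetical):
-- select count(*) from logs_iceberg where log_time >= now() - interval '1 day';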
</code></pre><p>Finally, we're going to layer in the open source extension <code>pg_incremental</code>. <a href=https://github.com/CrunchyData/pg_incremental>pg_incremental</a> is a Postgres extension that makes it easy to do fast, reliable incremental batch processing within Postgres. It is most commonly used for incremental rollups of data, but in this case it is equally useful for processing new CSV data as it arrives and moving it into our Iceberg table in S3, all from within Postgres.<pre><code class=language-sql>-- Set up a pg_incremental job to process existing files and automatically process new files every hour
select incremental.create_file_list_pipeline('process-logs',
   file_pattern := 's3://crunchy-bridge/tmp/*.tsv.gz',
   batched := true,
   max_batch_size := 20000,
   schedule := '@hourly',
   command := $$
       insert into logs_iceberg select * from logs where _filename = any($1)
   $$);
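
-- On each hourly run, pg_incremental passes the batch of newly discovered
-- file names as $1, so only unprocessed CSV files are read from the logs
-- foreign table and inserted into logs_iceberg.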
</code></pre><h2 id=final-thoughts><a href=#final-thoughts>Final thoughts</a></h2><p>And there you have it! Cheaper, cleaner log management. As one of my colleagues described it: “personally, I always hated the imitation SQL query languages of logging providers–just get me real SQL”. From using SQL to query logs, to simplifying our stack, to the cost savings, this project showcases some of our favorite things about Crunchy Data Warehouse.<p>We often get questions on the architecture of Crunchy Bridge. We have talked about it <a href="https://www.youtube.com/watch?v=eZypM_4xlf8">a bit</a>. The short version is that Crunchy Bridge is built from the ground up using public cloud primitives to create a highly scalable and efficiently managed Postgres service. At the time, AWS CloudWatch was chosen due to the lack of better options. We don't want to be a logging provider; it's a fundamentally different business. But seeing how well this works, who knows 😉 ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">b188e0c71dd1df0e2080dc50599a01b101fefc3ebbb681b861fa9bf7490f0bda</guid>
<pubDate>Wed, 26 Mar 2025 12:00:00 EDT</pubDate>
<dc:date>2025-03-26T16:00:00.000Z</dc:date>
<atom:updated>2025-03-26T16:00:00.000Z</atom:updated></item></channel></rss>