<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/postgres-17/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/postgres-17</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/postgres-17</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Wed, 05 Mar 2025 09:30:00 EST</pubDate>
<dc:date>2025-03-05T14:30:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Validating Data Types from Semi-Structured Data Loads in Postgres with pg_input_is_valid ]]></title>
<link>https://www.crunchydata.com/blog/validating-data-types-from-semi-structured-data-loads-in-postgres-with-pg_input_is_valid</link>
<description><![CDATA[ Elizabeth talks about how to validate data with the new Postgres feature pg_input_is_valid. ]]></description>
<content:encoded><![CDATA[ <p>Working on big data loads or data type changes can be tricky - especially finding and correcting individual errors across a large data set. Postgres versions 16, 17, and newer have a function to help with data validation: <code>pg_input_is_valid</code>.<p><code>pg_input_is_valid</code> is a SQL function that determines whether a given input can be parsed into a specific type like numeric, date, JSON, etc. Here’s a super basic query asking whether ‘123’ is a valid integer.<pre><code class=language-bash>SELECT pg_input_is_valid('123', 'integer');
 pg_input_is_valid
-------------------
 t
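-- A companion function, pg_input_error_info (also added in Postgres 16),
-- reports *why* an input is invalid. Illustrative example; only the
-- message column is shown here:
SELECT message FROM pg_input_error_info('abc', 'integer');
                   message
----------------------------------------------
 invalid input syntax for type integer: "abc"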
</code></pre><p>This function returns <code>t</code> (true) or <code>f</code> (false). So if I asked <code>SELECT pg_input_is_valid('123', 'date');</code> the answer would be <code>f</code>, since that's not a date.<p>This requires no special error handling or scripts; it is built right into Postgres and can be used with standard SQL. At Crunchy Data we’ve seen some nice use cases where you validate data before importing it. Generally this works best with a staging table or a temporary table, where validation is run and offending rows are identified before a final data copy or import. Let’s walk through a few examples of how this input-validation function can help.<h2 id=validating-data-for-columns-changes><a href=#validating-data-for-columns-changes>Validating data for column changes</a></h2><p>There are plenty of occasions when a database administrator needs to change data types. You can easily check a conversion like text to integer. You might want to use newer JSON features and move away from old formatting. For moving columns to JSON, <code>pg_input_is_valid</code> can query existing rows to see if they’d conform to JSONB.<pre><code class=language-sql>SELECT pg_input_is_valid(data_column, 'jsonb')
FROM bytea_table;
</code></pre><p>You might also want to use <code>pg_input_is_valid</code> to check text columns you plan to convert to integer or date. You can run a plain validity check, or create a new date column populated only with the data that is valid.<pre><code class=language-sql>UPDATE test_data
SET
    actual_date = CASE
        WHEN pg_input_is_valid (maybe_date, 'date') THEN maybe_date::date
        ELSE NULL
    END;

SELECT * from test_data ;

   name    | maybe_date   | actual_date
-----------+--------------+-------------
 David     | 2023-01-02   | 2023-01-02
 Elizabeth | Jan 1, 2024  |

</code></pre><h2 id=validating-data-for-data-load><a href=#validating-data-for-data-load>Validating data for data load</a></h2><p>Let’s say you have a CSV file containing customer data that you want to import into a table named <code>customers</code>. Before importing, it is a good idea to ensure that the data in the CSV file adheres to the expected format, particularly for the <code>age</code> and <code>signup_date</code> columns.<p>The table has the following structure:<pre><code class=language-sql>customer_id SERIAL PRIMARY KEY,
name TEXT,
email TEXT,
age INTEGER,
signup_date DATE
</code></pre><h3 id=create-a-staging-table><a href=#create-a-staging-table>Create a staging table</a></h3><p>Import the CSV data into a staging table without data type casting yet. Everything will go in as text:<pre><code class=language-sql>CREATE TEMP TABLE staging_customers (
    customer_id TEXT,
    name TEXT,
    email TEXT,
    age TEXT,
    signup_date TEXT
);

-- copy in the data to the temp table
COPY staging_customers FROM '/path/to/customers.csv' CSV HEADER;
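-- Note: COPY ... FROM '/path' reads from the *server* filesystem. If the
-- CSV lives on your client machine, use psql's client-side \copy instead:
-- \copy staging_customers FROM 'customers.csv' CSV HEADER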
</code></pre><h3 id=use-pg_input_is_valid-to-validate-data-types><a href=#use-pg_input_is_valid-to-validate-data-types>Use <code>pg_input_is_valid</code> to validate data types</a></h3><p>Now we can write queries to identify rows with invalid data. For example, validate that the <code>age</code> column can be cast to an integer and the <code>signup_date</code> column to a date.<pre><code class=language-sql>SELECT *
FROM staging_customers
WHERE NOT pg_input_is_valid(age, 'integer')
   OR NOT pg_input_is_valid(signup_date, 'date');
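
-- To also see *why* a value failed, pg_input_error_info (likewise new in
-- Postgres 16) returns the error details as a record:
SELECT customer_id, (pg_input_error_info(age, 'integer')).message
FROM staging_customers
WHERE NOT pg_input_is_valid(age, 'integer');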
</code></pre><p>This query will return all rows with either an invalid <code>age</code> or <code>signup_date</code>.<h3 id=exclude-invalid-rows-and-copy-data-to-your-final-table><a href=#exclude-invalid-rows-and-copy-data-to-your-final-table><strong>Exclude invalid rows and copy data to your final table</strong></a></h3><p>Once the problematic rows have been identified, the rows can be manually fixed or removed. Sometimes an even cleaner option is to use <code>pg_input_is_valid</code> to skip bad rows as the data is copied to the table and insert only valid rows.<pre><code class=language-sql>INSERT INTO customers (name, email, age, signup_date)
SELECT name, email, age::integer, signup_date::date
FROM staging_customers
WHERE pg_input_is_valid(age, 'integer')
  AND pg_input_is_valid(signup_date, 'date');
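
-- Optionally, keep the rejected rows for later review instead of silently
-- dropping them (table name is illustrative):
CREATE TABLE staging_customers_rejects AS
SELECT *
FROM staging_customers
WHERE NOT pg_input_is_valid(age, 'integer')
   OR NOT pg_input_is_valid(signup_date, 'date');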
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p><code>pg_input_is_valid</code> is a great recent addition to the Postgres toolkit for data manipulation - whether moving data or changing data types. Where I’ve seen <code>pg_input_is_valid</code> used best is a two-step data import: a staging table, a validation step to check for errors, and a final migration of the data. Since this is built right into Postgres itself, whether you’re working with small datasets or millions of rows, <code>pg_input_is_valid</code> is a scalable, performant, and reliable way to clean and validate your data. ]]></content:encoded>
<category><![CDATA[ Postgres 17 ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">32b586408cc560e5ca52557765db0ec41581fdc8eef785e6a220b19df1af05c7</guid>
<pubDate>Wed, 05 Mar 2025 09:30:00 EST</pubDate>
<dc:date>2025-03-05T14:30:00.000Z</dc:date>
<atom:updated>2025-03-05T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Loading the World! OpenStreetMap Import In Under 4 Hours ]]></title>
<link>https://www.crunchydata.com/blog/loading-the-world-openstreetmap-import-in-under-4-hours</link>
<description><![CDATA[ Greg has a full OSM load for the entire world running in record time. He digs into tuning and the recent software and hardware updates that make a full planet run in less than 4 hours. ]]></description>
<content:encoded><![CDATA[ <p>The OpenStreetMap (OSM) database builds almost 750GB of location data from a single file download. The <a href=https://www.openstreetmap.org/>OSM</a> import notoriously takes a full day to run. A fresh OpenStreetMap load involves both a massive write process and large index builds. It is a great performance stress-test bulk load for any Postgres system. I use it to stress the latest PostgreSQL versions and state-of-the-art hardware. The stress test validates new tuning tricks and identifies performance regressions.<p>Two years ago, I presented (<a href="https://www.youtube.com/watch?v=BCMnu7xay2Y">video</a> / <a href=https://www.slideshare.net/slideshow/speedrunnin-the-open-street-map-osm2pgsql-loader/254313657>slides</a>) at PostGIS Day on challenges of this workload. In honor of this week’s <a href=https://www.crunchydata.com/community/events/postgis-day-2024>PostGIS Day 2024</a>, I’ve run the same benchmark on Postgres 17 and the very latest hardware. The findings:<ul><li><strong>PostgreSQL</strong> keeps getting better! Core improvements sped up index building in particular.<li>The <strong>osm2pgsql</strong> loader got better too! New takes on indexing speed things up.<li><strong>Hardware</strong> keeps getting better! It has been two years since my last report and the state-of-the-art has advanced.</ul><h2 id=tune-your-instrument><a href=#tune-your-instrument>Tune Your Instrument</a></h2><p>First, we are using bare metal hardware—a server with 128GB RAM—so let’s tune Postgres for loading and to match that server:<pre><code>max_wal_size = 256GB
shared_buffers = 48GB
effective_cache_size = 64GB
maintenance_work_mem = 20GB
work_mem = 1GB
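# (Not from the original benchmark) after restarting, confirm the values
# actually in effect:
#   SELECT name, setting, unit FROM pg_settings
#    WHERE name IN ('shared_buffers', 'maintenance_work_mem', 'max_wal_size');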
</code></pre><p>Second, let’s prioritize bulk load. The following settings do not make sense for a live system under read/write load, but they will improve performance for this bulk load scenario:<pre><code class=language-bash>checkpoint_timeout = 60min
synchronous_commit = off
# if you don't have replication:
wal_level = minimal
max_wal_senders = 0
# if you believe my testing these make things
# faster too
fsync = off
autovacuum = off
full_page_writes = off
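# Revert this second group and restart cleanly before serving production
# traffic: with fsync and full_page_writes off, a crash mid-load can
# corrupt the cluster, so be prepared to re-create it and reload.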
</code></pre><p>It’s also possible to tweak the <a href=https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-BACKGROUND-WRITER>background writer</a> for the particular case of massive data ingestion, but for bulk loads without concurrency it doesn’t make a large difference.<h2 id=how-postgresql-has-improved><a href=#how-postgresql-has-improved>How PostgreSQL has Improved</a></h2><p>In 2022, testing on that year's new AMD AM5 hardware loaded the data in just under 8 hours with Postgres 14. Today the amount of data in the OSM Planet files has grown another 14%. Testing with Postgres 17 still halves the load time, with the biggest drops coming from software improvements in the PG14-16 time-frame.<p><img alt="osm building time" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/d4f15c2b-2a9f-4d95-f397-00101bc60900/public><p>The benchmark orchestration and metrics framework here is my <a href=https://github.com/gregs1104/pgbench-tools>pgbench-tools</a>.
Full hardware details are published to <a href=https://browser.geekbench.com/user/232126>GeekBench</a>.<h3 id=gist-index-building-in-postgresql-15><a href=#gist-index-building-in-postgresql-15>GIST Index Building in PostgreSQL 15</a></h3><p>The biggest PostgreSQL speed gains are from <a href="https://www.youtube.com/watch?v=TG28lRoailE">improvements in the GIST index building code</a>.<p>The new code pre-sorts index pages before merging them, and for large GIST index builds the performance speed-up can be substantial, as <a href=https://osm2pgsql.org/news/2023/01/22/faster-with-postgresql15.html>reported by the author</a> of osm2pgsql.<p>My tests showed going from PostgreSQL 14 to 15 delivered:<ul><li>16% speedup<li>15% size reduction<li>86% GIST index build speedup!</ul><p><img alt="osm index building time" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7fd95086-d103-449e-1336-23566d73b500/public><p>There have been further improvements in PostgreSQL 16 and 17 in <a href=https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans>B-Tree index building</a>, but this osm2pgsql benchmark does not really show them. The GIST index build times wash out the other index builds.<h2 id=how-osm2pgsql-has-improved><a href=#how-osm2pgsql-has-improved>How osm2pgsql has improved</a></h2><p>In Q3 2022, osm2pgsql 1.7 made a technique called the <a href=https://osm2pgsql.org/doc/manual-v1.html#bucket-index-for-slim-mode>Middle Way Node Index ID Shift</a> the new <strong>default</strong>.<p><a href=https://osm2pgsql.org/doc/manual-v1.html#bucket-index-for-slim-mode>Middle Way Node Index ID Shift</a> is a clever design approach that compresses the database's largest index, trading off lookup and update performance for a smaller footprint. It uses a Partial Index to merge nearby values together into less fine-grained sections. When an index is used frequently, this would waste too many CPU cycles.
Similar to hash bucket collisions, partial indexes have to constantly exclude non-matched items. That chews through extra CPU on every read. In addition, because individual blocks hold so many more values, the locking footprint for updates increases proportionately. However, for large but infrequently used indexes like this one, those are satisfactory trade-offs.<p>Applying that improvement dropped my loading times by 37% and cut the database size from 1000GB to under 650GB. Total time at the terabyte size had crept upward to near 10 hours. The speed-up drove it back below 6 hours.<p>The osm2pgsql manual shows the details in its <a href=https://osm2pgsql.org/doc/manual-v1.html#update-for-expert-users>Update for Expert Users</a>. I highly recommend that section and its <a href=https://blog.jochentopf.com/2023-07-25-improving-the-middle-of-osm2pgsql.html>Improving the middle</a> blog entry. It's a great study of how Postgres's flexible indexing system lets applications optimize for their exact workload.<h2 id=how-hardware-has-improved><a href=#how-hardware-has-improved>How hardware has improved</a></h2><h3 id=ssd-write-speed><a href=#ssd-write-speed>SSD Write Speed</a></h3><p>During data import, the osm2pgsql workload writes heavily at medium <a href=https://www.techtarget.com/searchstorage/definition/queue-depth>queue depths</a> for hours. The best results come from SSDs with oversized <a href=https://www.advantech.com/en/resources/news/maximizing-ssd-performance-with-slc-cache#1>SLC caches</a> that also balance cleanup compaction of that cache. The later CREATE TABLE AS (CTAS) sections of the build reach the drive's peak read/write speeds.<p>I saw 11GB/s from a <a href="https://www.google.com/search?client=safari&#38rls=en&#38q=Crucial+T705+PCIe+5.0&#38ie=UTF-8&#38oe=UTF-8">Crucial T705</a> PCIe 5.0 drive during the week (foreshadowing!) when I was running it with an Intel i9-14900K:<p><img alt="read write for osm" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3a260fb6-37cc-43e3-f9ad-930ffe535200/public><p>osm2pgsql has a tuning parameter named <code>--number-processes</code> that guides how many parallel operations the code tries to spawn.<p>For the server and memory I used in this benchmark, increasing <code>--number-processes</code> from my earlier 2 to 5 worked well. However, be careful: you can easily go too far! Bumping up this parameter increases memory usage too. Going wild on concurrent work will run you out of memory and put you into the hands of the Linux Out of Memory (OOM) killer.<h3 id=processor-advances><a href=#processor-advances>Processor advances</a></h3><p>Obviously, every year processors get a little better, but they do so in different ways and at different rates.<p>For late 2023 testing against PostgreSQL 15 and 16, an <strong><a href=https://www.intel.com/content/www/us/en/products/sku/230493/intel-core-i513600k-processor-24m-cache-up-to-5-10-ghz/specifications.html>Intel i5-13600K</a></strong> overtook the earlier <strong>AMD Ryzen 7 7700X</strong>. There was another small bump in 2024 upgrading to an <strong>i9-14900K</strong>.<p>But this is a demanding regression test workload, and it only took a few weeks of running the OSM workload to trigger the <strong>i9-14900K</strong>’s <a href=https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239>voltage bugs</a> to the point where my damaged CPU could not even finish the test.<p>Thankfully I was able to step away from those issues when <strong>AMD's 9600X</strong> launched.
Here are the latest results from PG17 on an AMD 9600X, with the same SK41 2TB drive I tested in 2022 for my <a href="https://www.youtube.com/watch?v=BCMnu7xay2Y">PostGIS Day talk</a>.<h2 id=my-best-osm-import-results-to-date><a href=#my-best-osm-import-results-to-date>My best OSM import results to date</a></h2><pre><code>2024-10-15 10:03:41  [00] Reading input files done in 7851s (2h 10m 51s).
2024-10-15 10:03:41  [00]   Processed 9335778934 nodes in 490s (8m 10s) - 19053k/s
2024-10-15 10:03:41  [00]   Processed 1044011263 ways in 4301s (1h 11m 41s) - 243k/s
2024-10-15 10:03:41  [00]   Processed 12435485 relations in 3060s (51m 0s) - 4k/s
2024-10-15 10:03:41  [00] Overall memory usage: peak=158292MByte current=157746MByte...
2024-10-15 11:32:13  [00] osm2pgsql took 13162s (3h 39m 22s) overall.
</code></pre><p>Completed in <strong>less than 4 hours</strong>!<p>PostgreSQL 17 is about 3% better on this benchmark than PostgreSQL 16 when replication is used, thanks to improvements in the WAL infrastructure in PostgreSQL 17.<p>I look forward to following up on this benchmark in more detail, after my scorched Intel system is fully running again! Like the speed of the Postgres ecosystem, the pile of hardware I've benchmarked to death grows every year. ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<category><![CDATA[ Postgres 17 ]]></category>
<author><![CDATA[ Greg.Smith@crunchydata.com (Greg Smith) ]]></author>
<dc:creator><![CDATA[ Greg Smith ]]></dc:creator>
<guid isPermalink="false">9e2c4eee6da51e862d11c5257ce74d187841b9a7deed0a7f62b44557857a70d8</guid>
<pubDate>Tue, 19 Nov 2024 09:30:00 EST</pubDate>
<dc:date>2024-11-19T14:30:00.000Z</dc:date>
<atom:updated>2024-11-19T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ A change to ResultRelInfo - A Near Miss with Postgres 17.1 ]]></title>
<link>https://www.crunchydata.com/blog/a-change-to-relresultinfo-a-near-miss-with-postgres-17-1</link>
<description><![CDATA[ A new point version was released on Nov 14th for 17.1, 16.5, 15.9, and others. This included an update to the Postgres ABI potentially breaking extensions. Craig digs into the change and what you need to know. ]]></description>
<content:encoded><![CDATA[ <p><span style="background-color: #DCD2CC">Version 17.2 of PostgreSQL has now been released, which rolls back the changes to ResultRelInfo. See the <a href=https://www.postgresql.org/about/news/postgresql-172-166-1510-1415-1318-and-1222-released-2965/>release notes</a> for more details.</span><p>Since its inception <a href=https://www.crunchydata.com/about>Crunchy Data</a> has released new builds and packages of Postgres on the day community packages are released. Yesterday's minor version release was the first time we made the decision to press pause on a release. Why did we not release it immediately? There appeared to be a very real risk of breaking existing installations. Let's back up and walk through a near miss of Postgres release day.<p>Yesterday when Postgres 17.1 was released there appeared to be breaking changes in the Application Binary Interface (ABI). The ABI is the contract that exists between PostgreSQL and its extensions. Initial reports showed that a number of extensions could be affected, triggering <a href=https://www.postgresql.org/message-id/CABOikdNmVBC1LL6pY26dyxAS2f%2BgLZvTsNt%3D2XbcyG7WxXVBBQ%40mail.gmail.com>warning sirens</a> around the community. In other words, if you were to upgrade from 17.0 to 17.1 and use these extensions, you could be left with a non-functioning Postgres database.
Further investigation showed that <em>TimescaleDB</em> and <em>Apache AGE</em> were the primarily affected extensions; if you are using them you should hold off on upgrading to the latest minor release for now, or make sure to rebuild the extension against the latest PostgreSQL release in coordination with your upgrade.<p>The initial list of extensions for those curious:<table><thead><tr><th align=center>Affected<th align=left>Unaffected<tbody><tr><td align=center>Apache AGE<td align=left>HypoPG<tr><td align=center>TimescaleDB<td align=left>pg-query<tr><td align=center><td align=left>Citus<tr><td align=center><td align=left>pglast<tr><td align=center><td align=left>pglogical<tr><td align=center><td align=left>pgpool2<tr><td align=center><td align=left>ogr-fdw<tr><td align=center><td align=left>pg-squeeze<tr><td align=center><td align=left>mysql-fdw</table><p>First, a little bit on Postgres releases. Postgres releases major versions each year, and minor versions roughly every three months. Major versions introduce bigger changes that result in catalog changes, so major version upgrades are intended to be treated with caution. Minor version releases in contrast are intended to be only security and bug fix related. They are meant to drop in and continue working within the same existing major version line.<h2 id=about-the-postgres-abi-and-postgres-extension><a href=#about-the-postgres-abi-and-postgres-extension>About the Postgres ABI and Postgres Extensions</a></h2><p>The Postgres ABI, Application Binary Interface, refers to the binary-level interface between Postgres and compiled extensions, modules, or clients that interact with it. The ABI includes various <strong>structs</strong> that define key components of the system's internal workings. These structs represent how PostgreSQL manages and manipulates data, query execution, and memory.
They typically include things like:<ul><li>System catalogs<li>Function signatures<li>Data structure layouts</ul><h3 id=why-does-the-abi-matter><a href=#why-does-the-abi-matter>Why Does the ABI Matter?</a></h3><p>Developers of extensions ensure their extensions are compatible with the Postgres ABI. Changes to the ABI between major versions necessitate recompiling any extensions to prevent runtime issues.<p>ABI compatibility is typically not maintained across major versions. For instance, an extension compiled for PostgreSQL 14 will likely need to be recompiled for PostgreSQL 15 because ABI changes can occur.<p>PostgreSQL typically aims to maintain compatibility for extensions across minor versions. This means if you build an extension for PostgreSQL 15.1, it should work for 15.2. However, this is not always the case. The nuances of PostgreSQL ABI guarantees have been a sufficiently hot topic that they produced new <a href=https://www.postgresql.org/message-id/E1sZ5TL-0020gU-3t%40gemulon.postgresql.org>documentation</a> on the subject back in July.<p>Yesterday there was a major struct change in 17.1.<h3 id=with-us-so-far-lets-go-deeper><a href=#with-us-so-far-lets-go-deeper>With us so far? Let’s go deeper</a></h3><p>Within a PostgreSQL extension there is C code that includes header files from PostgreSQL itself. When the extension is compiled, functions from those headers are represented as abstract symbols in binary. The symbols are linked to the actual implementations of the functions when the extension is loaded based on the function names. That way, an extension compiled against PostgreSQL 17.0 can usually still be loaded into PostgreSQL 17.1, as long as the function names and signatures from headers do not change (i.e. the application binary interface or "ABI" is stable).<p>The header files also declare structs that are passed to functions (as pointers). Strictly speaking, the struct definitions are also part of the ABI, but there is more subtlety around that.
After compilation, structs are mostly defined by their size and offsets of fields, so for instance a name change does not affect ABI (though does affect API). A size change does affect ABI, a little.<pre><code class=language-c>typedef struct ResultRelInfo
{
	NodeTag		type;

        /*... (130 other lines) ...*/

	/* updates do LockTuple() before oldtup read; see README.tuplock */
	bool		ri_needLockTagTuple;

} ResultRelInfo;
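
/*
 * Illustrative size math (not from the original post): on 64-bit builds the
 * old struct size, 376 bytes, is already a multiple of the 8-byte struct
 * alignment, so appending one bool grows the padded size to the next
 * multiple, 384. A binary compiled against the 17.0 headers still bakes in
 * sizeof(ResultRelInfo) == 376.
 */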
</code></pre><p>Most of the time, PostgreSQL allocates structs on the heap using a macro that looks at the compile-time size of the struct ("makeNode") and initializes the bytes to 0. The discrepancy that arose in 17.1 is that a new boolean was added to the ResultRelInfo struct, which <strong>increased its size from 376 bytes to 384</strong>.<p>What happens next depends on who calls makeNode. If it's PostgreSQL 17.1 code, then it uses the new size. If it's an extension compiled against 17.0, then it uses the old size. When it calls a PostgreSQL function with a pointer to a block allocated using the old size, the PostgreSQL function still assumes the new size and may write past the allocated block.<p>That is in general quite problematic. It could lead to bytes being written into an unrelated section of memory, or the program crashing. When running tests, PostgreSQL has internal checks (asserts) to detect that situation and throw warnings.<p>So, in general this particular change in the struct does not actually affect the allocation size. There may be uninitialized bytes, but that is usually resolved by calling InitResultRelInfo. The issue primarily causes warnings in tests / assert-enabled builds for extensions that allocate ResultRelInfo, though only when running those tests using the new PostgreSQL version with an extension binary that was compiled against the old PostgreSQL version.<h3 id=did-we-lose-you-yet-and-so-whats-the-result><a href=#did-we-lose-you-yet-and-so-whats-the-result>Did we lose you yet, and so what’s the result?</a></h3><p>Unfortunately, that's not the end of the story. Extensions that rely heavily on ResultRelInfo (like TimescaleDB) can do things that suffer from the size change. For instance, in one of TimescaleDB's <a href=https://github.com/timescale/timescaledb/blob/2.17.2/src/nodes/hypertable_modify.c#L1245>code paths</a>, it needs to find the index of a ResultRelInfo pointer in an array, and to do so it does pointer math.
This array was allocated by PostgreSQL (384 bytes), but the Timescale binary assumes 376 bytes, and the result is a nonsense number which then hits an assert failure or segmentation fault.<p>To be clear, the code here is not really at fault. The contract with PostgreSQL was simply not quite as assumed. For developers of Postgres extensions that's an interesting lesson for all of us.<p>It's quite possible that there are other issues like this in other extensions. TimescaleDB is quite popular and thus subject to broader testing that identified the issue. That said, as investigation proceeded over the past 24 hours, most extensions built against this header do seem to be safe. Citus is another advanced extension, and from our investigation it also seems safe.<h2 id=what-should-you-do><a href=#what-should-you-do>What should you do?</a></h2><p><strong>If you’re a Crunchy Data customer you do not need to worry</strong>. If you’re using Crunchy Data Postgres on any platform - Crunchy Bridge or Crunchy Postgres for Kubernetes - our build, release, and certification procedures worked as anticipated and appropriate mitigations were applied to any of our software releases. We are fortunate to have a fantastic build and release team that is largely behind the scenes but ensures issues like this are handled.
If you’re a community Postgres user, or have packaged your own extensions, it is worth reading the pgsql-hackers thread to understand which extensions may be impacted and what the potential mitigations are for the below affected versions:<ul><li>17.0 -> 17.1<li>16.4 (and earlier) -> 16.5<li>15.8 (and earlier) -> 15.9<li>14.13 (and earlier) -> 14.14<li>13.16 (and earlier) -> 13.17<li>12.20 (and earlier) -> 12.21</ul><p>In short:<ul><li><p>If you are using the TimescaleDB extension, Timescale is <a href=https://x.com/michaelfreedman/status/1857148280167174455>recommending</a> that users do not perform the minor version installs at this time.<li><p>If you are using extensions that are indicated as potentially impacted within the pgsql-hackers list thread, additional caution is warranted before upgrading (though our own <a href=https://x.com/marcoslot/status/1857403646134153438>Marco Slot</a> has confirmed that Citus is not impacted)<li><p>If you are compiling Postgres extensions from source, make sure your extensions have been compiled against the latest point version, 17.1<li><p>If you are developing or installing custom Postgres extensions, it is worth taking the time to understand the impact of this particular issue and the Postgres ABI ‘commitments’.<p>Ultimately, the default guidance of performing Postgres minor version upgrades stands, and the impact of this issue was not as broad as was initially feared. The Postgres community once again provided a timely minor version release to address a collection of CVEs and fixes, and the community promptly responded to a report of potential issues. Postgres providers' release processes worked as intended, and it appears any potential impact was largely averted.<p>That said, software is hard, and databases in particular are tricky.
As Postgres extensions grow in popularity, these risks will continue to emerge. It is helpful to understand these details, or to ensure that whoever you select to support your database understands them.</ul> ]]></content:encoded>
<category><![CDATA[ Postgres 17 ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">455f5668fc452c9083751a8b6a1e7f75eacc374196b14cd6808e41185a36b328</guid>
<pubDate>Fri, 15 Nov 2024 10:30:00 EST</pubDate>
<dc:date>2024-11-15T15:30:00.000Z</dc:date>
<atom:updated>2024-11-15T15:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Convert JSON into Columns and Rows with JSON_TABLE ]]></title>
<link>https://www.crunchydata.com/blog/easily-convert-json-into-columns-and-rows-with-json_table</link>
<description><![CDATA[ Paul shows you how to easily load JSON into Postgres relational format with JSON_TABLE, just released in Postgres 17.  ]]></description>
<content:encoded><![CDATA[ <h2 id=json_table-new-in-postgres-17><a href=#json_table-new-in-postgres-17>JSON_TABLE, new in Postgres 17</a></h2><p>If you missed some of the headlines and release notes, Postgres 17 added another huge JSON feature to its growing repository of strong JSON support with the <a href=https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-TABLE>JSON_TABLE</a> feature. JSON_TABLE lets you query JSON and present the data as if it were native relational SQL, so you can easily take JSON data feeds and work with them like any other Postgres data in your database.</p><!--more--><h2 id=shaking-the-earth-with-json_table><a href=#shaking-the-earth-with-json_table>Shaking the Earth with JSON_TABLE</a></h2><p>A few days ago, I was awakened in the middle of the night when my house started to shake. Living in the <a href=https://www.pnsn.org/outreach/earthquakesources/csz>Cascadia subduction zone</a>, when things start to shake I wake up really fast, because you never know if this one is going to be the <a href=https://www.newyorker.com/magazine/2015/07/20/the-really-big-one>Big One</a>.<p>Fortunately this one was only a little one, a 4.0 magnitude quake several miles to the north of the city, captured and memorialized in its own <a href=https://earthquake.usgs.gov/earthquakes/eventpage/uw62050041/executive>USGS earthquake page</a>, almost as soon as it had finished shaking.<p>The <a href=https://earthquake.usgs.gov/>USGS</a> keeps a <a href=https://earthquake.usgs.gov/earthquakes/feed/>near-real-time collection</a> of information about the latest quakes online, served up in a variety of formats including <a href=https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php>GeoJSON</a>.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/c846d472-f72f-4dcc-4f9d-122c59651a00/public><p>The <a href=https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.geojson>weekly feed</a> of magnitude
4.5 quakes has a nice amount of data in it.<p>If we could import this feed into the database, we could use it for other queries, like finding potential customers to sell tents and emergency supplies to! (When the big one hits me, sell me some tents and emergency supplies.)<p>This readily available GeoJSON earthquake file seems like the perfect chance to try out the new JSON_TABLE. And maybe give me something to do in the middle of the night.<h2 id=retrieving-a-json-file-with-http><a href=#retrieving-a-json-file-with-http>Retrieving a JSON file with HTTP</a></h2><p>The first step is to retrieve the feed. The simplest way is to use the <a href=https://github.com/pramsey/pgsql-http>http</a> extension, which provides a simple functional API for making HTTP requests.<pre><code>CREATE EXTENSION http;
</code></pre><p>The <code>http_get(url)</code> function returns an <code>http_response</code>, with a status code, content_type, headers and content. We could write a wrapper to check the status code, but for this example we will just assume the feed works and look at the content.<pre><code>SELECT jsonb_pretty(content::jsonb)
  FROM http_get('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.geojson');
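
-- http_get() itself does not fail on a bad HTTP status, so in real use
-- we might wrap it with a check. A possible sketch (the http_get_json
-- name is ours, not part of the extension):
CREATE OR REPLACE FUNCTION http_get_json(url text)
RETURNS jsonb AS $$
DECLARE
    res http_response;
BEGIN
    SELECT * INTO res FROM http_get(url);
    IF res.status != 200 THEN
        RAISE EXCEPTION 'request to % failed with status %', url, res.status;
    END IF;
    RETURN res.content::jsonb;
END;
$$ LANGUAGE plpgsql;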
</code></pre><h2 id=reading-json-features-with-postgres-json_table><a href=#reading-json-features-with-postgres-json_table>Reading JSON Features with Postgres JSON_TABLE</a></h2><p>The feed is actually a GeoJSON <a href=https://datatracker.ietf.org/doc/html/rfc7946#section-3.3>FeatureCollection</a>, which is just a container for a list of <a href=https://datatracker.ietf.org/doc/html/rfc7946#section-3.2>Feature</a>s. In order to convert it to a table, we need to iterate through the list.<p><a href=https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-TABLE>JSON_TABLE</a> is part of the <a href=https://www.iso.org/standard/78937.html>SQL/JSON</a> standard, and allows users to filter parts of JSON documents using the <a href=https://www.ietf.org/archive/id/draft-goessner-dispatch-jsonpath-00.html>JSONPath</a> filter language.<p><img alt="json to tables" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/301f6f13-526c-42d5-884d-227e6e704200/public><p>We can use JSON_TABLE to take specific fields from the JSON structure (using JSONPath expressions) and map them to corresponding SQL columns:<pre><code class=language-sql>
-- Download the GeoJSON feed from USGS
WITH http AS (
    SELECT * FROM
    http_get('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.geojson')
),
-- Filter the JSON feed into a record set, providing
-- the type information and JSONPath for each column
jt AS (
    SELECT * FROM
    http,
    JSON_TABLE(content, '$.features[*]' COLUMNS (
        title text PATH '$.properties.title',
        mag real PATH '$.properties.mag',
        place text PATH '$.properties.place',
        ts text PATH '$.properties.time',
        url text PATH '$.properties.url',
        detail text PATH '$.properties.detail',
        felt integer PATH '$.properties.felt',
        id text PATH '$.id',
        geometry jsonb PATH '$.geometry'
    ))
)
SELECT * FROM jt
;
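
-- JSONPath expressions can also reach into arrays. As a sketch (not part
-- of the original feed query), the point coordinates of each quake, which
-- GeoJSON stores as a [longitude, latitude, depth] array, can be pulled
-- straight into numeric columns:
WITH http AS (
    SELECT * FROM
    http_get('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.geojson')
)
SELECT jt.* FROM
http,
JSON_TABLE(content, '$.features[*]' COLUMNS (
    title text PATH '$.properties.title',
    lon numeric PATH '$.geometry.coordinates[0]',
    lat numeric PATH '$.geometry.coordinates[1]',
    depth_km numeric PATH '$.geometry.coordinates[2]'
)) AS jt;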
</code></pre><ul><li>The first argument is the JSON document retrieved via http<li>The second argument is the filter that generates the rows, in this case one row for each member of the <code>features</code> list in the GeoJSON <code>FeatureCollection</code><li>The <code>COLUMNS</code> clause provides a path within each <code>Feature</code> to pull the column data from, and the database type to apply to that data. Most of the columns come from the GeoJSON <code>properties</code>, but others, like the <code>id</code> and <code>geometry</code>, come from other attributes.</ul><h2 id=transforms-on-json><a href=#transforms-on-json>Transforms on JSON</a></h2><p>Once we’re reading this JSON as SQL, we might want to do a few more things, like convert timestamps into our standard format, transform the geometry column, and add an SRID. So here’s a query, building on the above, that does all of that. Note that you'll need PostGIS for the geometry functions.<pre><code class=language-sql>-- Download the GeoJSON feed from USGS
WITH http AS (
    SELECT * FROM
    http_get('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.geojson')
),
-- Filter the JSON feed into a record set, providing
-- the type information and JSONPath for each column
jt AS (
    SELECT * FROM
    http,
    JSON_TABLE(content, '$.features[*]' COLUMNS (
        title text PATH '$.properties.title',
        mag real PATH '$.properties.mag',
        place text PATH '$.properties.place',
        ts text PATH '$.properties.time',
        url text PATH '$.properties.url',
        detail text PATH '$.properties.detail',
        felt integer PATH '$.properties.felt',
        id text PATH '$.id',
        geometry jsonb PATH '$.geometry'
    ))
)
-- Apply any remaining transforms to the columns
-- in this case converting the epoch time into a timestamp
-- and the GeoJSON into a geometry
SELECT
    jt.title,
    jt.mag,
    jt.place,
    to_timestamp(jt.ts::bigint / 1000.0) AS ts,
    jt.url,
    jt.detail,
    jt.felt,
    jt.id,
    ST_SetSRID(ST_GeomFromGeoJSON(jt.geometry),4326) AS geom
FROM jt;
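
-- At this point nothing stops us from keeping the result around. As a
-- sketch (the "quakes" table name is ours), prefix the whole query above
-- with CREATE TABLE quakes AS, and then analyze it like any other table:
--
--   SELECT place, mag FROM quakes ORDER BY mag DESC LIMIT 5;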
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Reading data from JSON files in the past might have involved writing functions in PL/pgSQL and building a complicated loop to iterate through each feature, creating relational rows of data. With JSON_TABLE:<ul><li>You can read JSON from a URL or other source<li>Extract specific fields (in this case magnitude, location, and time)<li>Convert the data into usable formats with standard Postgres functions (in this example, PostGIS geometries and timestamp conversions)</ul><p>Now we have JSON data in SQL form, and we can easily do further analysis or visualization.<p>The <a href=https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-TABLE>JSON_TABLE</a> documentation includes some much more complicated examples, but this basic example of JSON document handling is probably good for 80% of use cases, pulling data from web APIs live into the database. ]]></content:encoded>
<category><![CDATA[ Postgres 17 ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">712fc7ef4bdc9989d8f0fa8661a8a9c9a6f82678bf2ab781df611c90fefb34d2</guid>
<pubDate>Fri, 11 Oct 2024 10:30:00 EDT</pubDate>
<dc:date>2024-10-11T14:30:00.000Z</dc:date>
<atom:updated>2024-10-11T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Enhanced Postgres Release Notes ]]></title>
<link>https://www.crunchydata.com/blog/enhanced-postgres-release-notes</link>
<description><![CDATA[ Just out in Postgres 17, the Postgres release notes now link to git commit messages! ]]></description>
<content:encoded><![CDATA[ <p>There is something new you may not have seen in the <a href=https://www.postgresql.org/docs/17/release-17.html>release notes for Postgres 17</a>. No, not a new feature - I mean inside the actual release notes themselves! The Postgres project uses the git program to track commits to the project, and now each item in the release notes has a link to the actual commit (or multiple commits) that enabled it.</p><!--more--><p>You may have missed it if you were scanning the release notes, but after the end of each specific item in the release notes is a small “section” symbol which looks like this: <strong>§</strong>. Each of these symbols is a link leading to the relevant commit for that item. Here’s what it looks like on the Postgres 17 release notes page:<p><img alt=pg_commit_plain.png loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3e213b8a-045b-4db0-befc-ff20f4076500/public><p>Clicking the section symbol will send you to the git commit for each individual patch, for example, this one:<p><img alt="hackers email message" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a7514091-2a9a-4478-f087-0c9f4c92a000/public><p>Note that there’s a “Discussion” link in each commit linking back to the full thread on the hackers mailing list.<p>Writing the release notes is hard work, and involves a good bit of debate in the community. We have to make sure we list all the changes, in a concise yet comprehensible manner, and decide what level of detail to include. Oftentimes, this level is not sufficient for people interested in learning more about a feature. That’s where these new commit links are invaluable. 
They link to the git commit, which not only lets you see the exact code changes that were made, but also shows the actual commit message, which has more detail than what can be provided in the release notes.<h3 id=postgres-notes-in-lots-of-places><a href=#postgres-notes-in-lots-of-places>Postgres Notes in Lots of Places</a></h3><p>Postgres release notes appear in lots of different places, so this addition will surely make its way into other downstream projects. This new link also now appears on “<a href="https://www.notion.so/cff5e34f441140d687828e40e50f603c?pvs=21">postgres all versions</a>” - a project that I maintain that collates the information from all of the release notes for every version of Postgres (over 500 now!) into a single page. To make the link more visible and easier to use, I converted it to a larger Unicode scroll symbol, then added some tooltip tech to make it show information about the link like so:<p><img alt=pg_commit_tooltip.gif loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/2f220f04-b12d-4fe3-bd85-d58ef452ee00/public><p>There was some community debate about including this, and about how prominent to make it. I think we ended up a little too cryptic and subtle with this section mark, but I welcome the addition and find it extraordinarily useful. I’d love to see the www Postgres project adopt a bigger symbol and tooltips in the future too!<p>The Postgres project takes the quality of its code very seriously, and that extends to the git commit messages as well. You will find them quite detailed; they not only describe the change that has been made, but also link back to the mailing list discussion, as well as giving credit to the people who authored the change, discovered the bug, or otherwise helped out. A great new addition to the release notes! ]]></content:encoded>
<category><![CDATA[ Postgres 17 ]]></category>
<author><![CDATA[ Greg.Sabino.Mullane@crunchydata.com (Greg Sabino Mullane) ]]></author>
<dc:creator><![CDATA[ Greg Sabino Mullane ]]></dc:creator>
<guid isPermalink="false">fe2e381fe3d770d3ce23b77158ba10800b5b32895408db4affbd897f334bd59c</guid>
<pubDate>Wed, 09 Oct 2024 17:30:00 EDT</pubDate>
<dc:date>2024-10-09T21:30:00.000Z</dc:date>
<atom:updated>2024-10-09T21:30:00.000Z</atom:updated></item></channel></rss>