<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/postgres-tutorials/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/postgres-tutorials</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/postgres-tutorials</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Wed, 11 Oct 2023 09:00:00 EDT</pubDate>
<dc:date>2023-10-11T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Working with Money in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/working-with-money-in-postgres</link>
<description><![CDATA[ Elizabeth has a primer for working with money in Postgres including what data type to choose, storing currency, and some sample functions. ]]></description>
<content:encoded><![CDATA[ <p>Wouldn’t it be awesome if money worked just like <a href=https://www.crunchydata.com/blog/working-with-time-in-postgres>time</a> in Postgres? You could store one canonical version of it that worked across the globe. Well sadly, money is a whole different ball of wax. Still, like time, money is part of most database implementations, and I wanted to lay out some of the best practices I’ve gathered for working with money in Postgres.<p>I also have a <a href=https://www.crunchydata.com/developers/playground/working-with-money-in-postgres>tutorial</a> up if you want to try this with Postgres running in a web browser.<h2 id=data-types-for-money><a href=#data-types-for-money>Data types for money</a></h2><h3 id=money><a href=#money>Money</a></h3><p>Postgres actually does have a <code>money</code> data type. This is not recommended because it doesn’t handle fractions of a cent and currency is tied to a <a href=https://www.postgresql.org/docs/current/locale.html>database locale</a> setting. While the money type isn’t best practice for storing money, I do think it’s handy for casting when you want your query output to be formatted like money.<h3 id=floats><a href=#floats>Floats</a></h3><p>Float numbers are popular in any system using positive and negative numbers with decimals (the name refers to the decimal point “floating” across the number’s digits). The float (<code>real</code> / <code>float4</code>) and double precision (<code>float8</code>) datatypes could be used for money, but they generally aren’t ideal because they’re fundamentally imprecise.<p>So for example, this is correct:<pre><code class=language-pgsql>select 0.0001::float4;

0.0001
(1 row)
</code></pre><p>And so is this:<pre><code class=language-pgsql>select 0.0001::float4 + 0.0001::float4;

0.0002
(1 row)
</code></pre><p>But if we try to go out to additional fractions, this isn’t really the expected result:<pre><code class=language-pgsql>select 0.0001::float4 + 0.0001::float4 + 0.0001::float4;

0.00029999999
(1 row)
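
-- By contrast, numeric addition stays exact:
select 0.0001::numeric + 0.0001::numeric + 0.0001::numeric;

0.0003
(1 row)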
</code></pre><h3 id=integers><a href=#integers>Integers</a></h3><p>Lots of folks use <code>integer</code> for storing money. Integers do not support any kind of decimals, so 100/3 might equal 33.3333, but in integer math that’s just 33. This can work for storing money if you know what your smallest unit is going to be (even down to fractions of a cent) and can use a multiplier in your database. So the multiplier would be 100 for dealing with a whole number of cents, or 1000000000 if you want to represent an amount like 0.237928372 BTC. This unit is stored whole, which avoids float’s unrepresentable values.<p>There are some size limitations with this technique, as <code>integer</code> can only store numbers up to 2147483647 and <code>bigint</code> only up to 9223372036854775807.<p>Integer is, however, notably performant and storage efficient: it’s only 4 bytes per value, 8 if you’re using <code>bigint</code>. Also keep in mind that storing money as an integer will require division or casting to a different data type so your front end or SQL reports can output dollars, cents, or decimal numbers in a traditional format.<h3 id=numeric><a href=#numeric>Numeric</a></h3><p><code>numeric</code> is widely considered the ideal datatype for storing money in Postgres. <code>numeric</code> and <code>decimal</code> are synonyms; there's no difference in functionality between the two, but I hear numeric used more often in Postgres conversations. Numeric can go out to a lot of decimal places (up to 16,383 digits after the decimal point!) and you get to define the precision. The numeric data type has two qualifiers, precision and scale, to let you define a sensible number of decimal places to use.<p>When you create the type, it will look something like this: <code>NUMERIC(10,5)</code>, where precision is 10 and scale is 5.<ul><li>Precision is the total number of digits before and after the decimal point. 
You need to set this to the highest number of digits you might ever need to store. So here 99,999.99999 is the maximum and -99,999.99999 the minimum.<li>Scale is the number of digits following the decimal point, so this would be 5 decimal places.</ul><p>Choosing a scale means that at some point Postgres will be rounding numbers. If you want to prevent rounding, set your scale higher than you’ll ever need.<p>Compared to integer, numeric values take up more space (they’re variable-length, often 10 or more bytes each). So if space and performance are a huge concern, and decimal precision is not, you might be better off with integer.<h2 id=storing-money><a href=#storing-money>Storing money</a></h2><p>Ok, so we have a data type to store actual cents, dollars, euros, etc. Now how do we store currency? In general it is best practice to store the currency alongside the number itself if you need to store money in multiple currencies at the same time. See ISO 4217 if you want the official currency codes. You can use a <a href=https://www.crunchydata.com/blog/enums-vs-check-constraints-in-postgres>custom check constraint</a> to require that data be entered for only certain currencies. For example, if you’re using dollars, pounds, and euros, that might look like this:<pre><code class=language-pgsql>CREATE TABLE products (
    sku SERIAL PRIMARY KEY,
    name VARCHAR(255),
    price NUMERIC(7,5),
    currency TEXT CHECK (currency IN ('USD', 'EUR', 'GBP'))
);
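
-- A couple of hypothetical sample rows; the CHECK constraint
-- rejects any currency code outside the allowed list:
INSERT INTO products (name, price, currency)
VALUES ('widget', 19.99, 'USD'),
       ('gadget', 24.50, 'EUR');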
</code></pre><p>If you’re working with currency in many formats there’s a lot to consider. In many cases, the conversion work happens at the time of the transaction. Say a price is set in the database in USD but displayed to a user in GBP. You’d have a table like the one above plus a separate table for the GBP exchange rate. Perhaps that table updates via API as currency values fluctuate throughout the day. You may also have prices set in one currency and the price paid in a different one, entered with the amount paid at the time of purchase.<h2 id=functions-for-money><a href=#functions-for-money>Functions for money</a></h2><ul><li><strong>Averages</strong></ul><p>and rounding to the nearest cent<pre><code class=language-pgsql>SELECT ROUND(AVG(price), 2) AS rounded_average_price
FROM products;
</code></pre><ul><li><strong>Rounding up with ceiling</strong></ul><p>totaling and rounding up to the nearest integer<pre><code class=language-pgsql>SELECT CEIL(SUM(price)) AS rounded_total_price
FROM products;
</code></pre><ul><li><strong>Rounding down with floor</strong></ul><p>totaling and rounding down to the nearest integer<pre><code class=language-pgsql>SELECT FLOOR(SUM(price)) AS rounded_total_price
FROM products;
</code></pre><ul><li><strong>Medians</strong></ul><p>Calculating the median can be a bit more involved because PostgreSQL doesn't have a built-in median function, but you can use window functions (or the <code>percentile_cont(0.5)</code> ordered-set aggregate) to calculate this:<pre><code class=language-pgsql>WITH sorted_prices AS (
    SELECT price,
           ROW_NUMBER() OVER (ORDER BY price) as r,
           COUNT(*) OVER () as total_count
    FROM products
)
SELECT ROUND(AVG(price), 2) AS median_price
FROM sorted_prices
-- pick the middle row (or the two middle rows when the count is even)
WHERE r IN ((total_count + 1) / 2, (total_count + 2) / 2);
</code></pre><ul><li><strong>Casting to the money type</strong></ul><p>If you’d like the result formatted with a currency sign, commas, and periods:<pre><code class=language-pgsql>SELECT CEIL(SUM(price))::money AS rounded_total_price_money
FROM products;
</code></pre><p>Note that the currency sign will appear based on your locale settings; <code>show lc_monetary;</code> will tell you what that is, and you can update it to a different currency.<h2 id=summary><a href=#summary>Summary</a></h2><ul><li>Use <code>int</code> or <code>bigint</code> if you can work with whole numbers of cents and you don’t need fractional cents. This saves space and offers better performance. Store your money in cents and convert to a decimal on output. This is also really the preferred method if all currency is the same type. If you’re changing currency often and dealing with fractional cents, move on to <code>numeric</code>.<li>Use <code>decimal</code> / <code>numeric</code> for storing money in fractional cents and even out to many decimal places. If you need to support lots of precision in money, this is the best method, but there’s a storage and performance cost.<li>Store currency separately from the actual monetary values, so you can run calculations on currency conversions.</ul> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">74aa5a5c5d53ba804475d38cb7e69ded765d90378d75a035bf121ab3cff2a5c5</guid>
<pubDate>Wed, 11 Oct 2023 09:00:00 EDT</pubDate>
<dc:date>2023-10-11T13:00:00.000Z</dc:date>
<atom:updated>2023-10-11T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Top 10 Postgres Management Tasks ]]></title>
<link>https://www.crunchydata.com/blog/top-10-postgres-management-tasks</link>
<description><![CDATA[ The must know Postgres management tasks to look at for any scale. Plus a bonus image included showing the recent and upcoming Postgres release schedule. ]]></description>
<content:encoded><![CDATA[ <h2 id=1-add-a-statement-timeout><a href=#1-add-a-statement-timeout>1. Add a statement timeout</a></h2><p>Postgres databases are very compliant: they do what you tell them until you tell them to stop. It is really common for a runaway process, query, or even something a co-worker runs to accidentally start a <a href=https://www.crunchydata.com/blog/control-runaway-postgres-queries-with-statement-timeout>never-ending transaction</a> in your database. This potentially uses up memory, I/O, or other resources.<p>Postgres has no preset default for this. To find out your current setting:<pre><code class=language-pgsql>SHOW statement_timeout;
</code></pre><p>A good rule of thumb is a minute or two:<pre><code class=language-pgsql>ALTER DATABASE mydatabase 
SET statement_timeout = '60s';
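
-- Or override it for just the current session:
SET statement_timeout = '2min';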
</code></pre><p>This is a connection-specific setting, so you (and your application) will need to reconnect for this to take effect on an ongoing basis.<h2 id=2-confirm-you-have-enough-memory><a href=#2-confirm-you-have-enough-memory>2. Confirm you have enough memory</a></h2><p>For application workloads you want your most frequently accessed <a href=https://www.crunchydata.com/blog/postgres-data-flow>Postgres data</a> to be accessible in memory/cache. You can check your cache hit ratio to see how often Postgres is using the cache. Ideally, 98-99% of reads come from the cache. If you see your cache hit ratio below that, you probably need to look at your memory configuration or move to an instance with larger memory.<pre><code class=language-pgsql>SELECT 
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) +  sum(heap_blks_read)) as ratio
FROM 
  pg_statio_user_tables;
</code></pre><p><em>Note: For warehouse or analytical workloads, you will probably have a much lower cache hit ratio.</em><h2 id=3-check-shared-buffers><a href=#3-check-shared-buffers>3. Check shared buffers</a></h2><p>Shared buffers is another key memory check. The default shared_buffers is 128MB. Find your current setting with:<pre><code class=language-pgsql>SHOW shared_buffers;
</code></pre><p>The value should be set to 15% to 25% of the machine’s total RAM. So if you have an 8GB machine, a quarter of that would be 2GB. Since <code>shared_buffers</code> can’t be changed with a plain <code>SET</code>, use <code>ALTER SYSTEM</code> (or edit postgresql.conf):<pre><code class=language-pgsql>ALTER SYSTEM SET shared_buffers = '2GB';
</code></pre><p>Shared buffers is a parameter that requires a restart to take effect.<h2 id=4-use-ssltls-for-data-in-transit><a href=#4-use-ssltls-for-data-in-transit>4. Use SSL/TLS for data in transit</a></h2><p>To find out if you’re currently using ssl:<pre><code class=language-pgsql>SHOW ssl;
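
-- Per-connection detail is in the pg_stat_ssl view:
SELECT pid, ssl, version FROM pg_stat_ssl;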
</code></pre><p>Hopefully you’ll see <code>ssl | on</code>.<p>If not, you’ll need to do some work on the database and application servers to make sure connections are encrypted. See the <a href=https://www.postgresql.org/docs/current/ssl-tcp.html>docs</a> for more.<h2 id=5-set-up-backups><a href=#5-set-up-backups>5. Set up backups</a></h2><p><a href=https://www.crunchydata.com/blog/introduction-to-postgres-backups>Backups</a> are a must-have in database management. There are a few ways to get backup data from Postgres, but here’s the essential info:<ul><li><strong>pg_dump</strong> generates backup files, but it shouldn’t be used as a real backup; it is more of a data manipulation tool<li><strong>pg_basebackup</strong> generates a full binary copy of the database including WAL files, but by itself it is not a complete backup system<li><strong>pgBackRest</strong> is a complete WAL archive and backup tool which can be used for disaster recovery and point-in-time recovery</ul><p>You should be using a full disaster recovery data backup tool or working with a vendor that does it for you.<h2 id=6-stay-on-top-of-postgres-releases-and-upgrade-frequently><a href=#6-stay-on-top-of-postgres-releases-and-upgrade-frequently>6. Stay on top of Postgres releases and upgrade frequently</a></h2><p>The PostgreSQL development community releases about 4 minor versions a year and 1 major version a year.<p>You should be planning to patch your database roughly in line with this schedule. Staying on top of security patches and the most recent versions will make sure you’re running on the most up to date and most efficient software. Here’s a graphic of where we are now and what is coming later this year. Make sure you have plans to upgrade minor versions frequently and major versions annually.<p><img alt="Postgres 14-17 schedule" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/b1c4390d-211c-4126-7cca-77ef3ff8f800/public><h2 id=7-use-pg_stat_statements><a href=#7-use-pg_stat_statements>7. 
Use pg_stat_statements</a></h2><p><a href=https://www.crunchydata.com/blog/tentative-smarter-query-optimization-in-postgres-starts-with-pg_stat_statements>pg_stat_statements</a> has to be the most valuable Postgres tool that’s not part of the out of the box software. I mentioned to some committers at a conference recently that we should get it in core Postgres and they assured me I could have a patch in and rejected before the day was over. <em>To be fair</em> it is a contrib module that generally ships with Postgres so you don’t have to go searching for it.<p>Since pg_stat_statements comes with the Postgres contrib libraries, it’s really easy to add with <code>CREATE EXTENSION pg_stat_statements</code>. You also have to add it to <code>shared_preload_libraries</code> since it uses shared memory. Adding it also requires a restart.<p>Here’s a quick query for checking on your 10 slowest queries. Always a good idea to peek in on these and see if there are any easy fixes to make things work a little faster.<pre><code class=language-pgsql>SELECT
  (total_exec_time / 1000 / 60) as total_min,
  mean_exec_time as avg_ms,
  calls,
  query
FROM pg_stat_statements
ORDER BY 1 DESC
LIMIT 10;
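
-- After tuning, reset the collected stats to measure fresh:
SELECT pg_stat_statements_reset();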
</code></pre><h2 id=8-add-indexes><a href=#8-add-indexes>8. Add indexes</a></h2><p>Indexes are really the foundational key to query performance for Postgres. Without indexes, your database is doing full sequential scans each time you query data, which uses up a lot of memory and precious query time. Adding indexes gives Postgres an easy way to find and sort your data. Using the handy pg_stat_statements view above, you already know which queries are slowest.<p>The pg_indexes view will show you what you’ve got at the moment:<pre><code class=language-pgsql>SELECT * FROM pg_indexes;
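
-- Narrow it to a single table (the table name here is illustrative):
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'products';

-- Adding one is a single statement (hypothetical table and column):
CREATE INDEX products_name_idx ON products (name);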
</code></pre><p>Check out <a href=https://www.crunchydata.com/blog/postgres-indexes-for-newbies>Postgres Indexes for Newbies</a> if you’re just getting started.<h2 id=9-check-for-unused-indexes><a href=#9-check-for-unused-indexes>9. Check for unused indexes</a></h2><p>Indexes are incredibly helpful, but sometimes folks go too far adding indexes for everything. Indexes can take up a fair amount of storage space, and all new writes have to be written to them, so keeping them around if they’re not being used can be bad for performance. The pg_stat_user_indexes view has all the information you need on this, so you can look at index usage with a <code>select * from pg_stat_user_indexes</code>. Here’s a more sophisticated query that excludes unique indexes and primary keys and shows unused indexes ordered by size:<pre><code class=language-pgsql>SELECT schemaname || '.' || relname AS table,
       indexrelname AS index,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS "index size",
       idx_scan as "index scans"
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique
  AND idx_scan &lt; 50
  AND pg_relation_size(relid) > 5 * 8192
ORDER BY
  pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
  pg_relation_size(i.indexrelid) DESC;
</code></pre><p>If you’re using read replicas, don’t forget to check those too before you delete unused indexes. An unused index on the primary might be used on the replica.<h2 id=10-review-your-connection-settings><a href=#10-review-your-connection-settings>10. Review your connection settings</a></h2><p>Postgres has a max_connections setting that defaults at 100. This will show you how many connections your instance is currently configured for:<pre><code class=language-pgsql>SHOW max_connections;
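
-- Changing it requires a restart; as a sketch:
ALTER SYSTEM SET max_connections = 220;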
</code></pre><p>For tuning the max_connections setting in Postgres, you’ll need to know how your application is connecting and how many connections are allowed. You’ll also want to leave a little headroom, say 10%, for other processes or people to connect to the database as well. For example, if you have 4 servers that can use 50 connections each, plus 10%, you’d want to set max_connections to 220.<p>You may also want to look at a <a href=https://www.crunchydata.com/blog/your-guide-to-connection-management-in-postgres>connection pooler</a>. You can check for idle and active connections in your database with the query below.<pre><code class=language-pgsql>SELECT count(*), state
FROM pg_stat_activity
GROUP BY 2;
</code></pre><p>If your connection counts are in the high tens, or you have more idle than active connections, pooling might be a good option.<h2 id=need-more-postgres-tips><a href=#need-more-postgres-tips>Need more Postgres tips?</a></h2><p>We have an <a href=https://www.crunchydata.com/postgres-tips>awesome tips page</a> we’ve been building out. We also just started a new <a href=https://discord.gg/cKdBbJfq>Discord</a> channel to chat about Postgres. Stop by, say hi, and let me know what your Top 10 list for Postgres is. ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">c2df72bf2bc1dd935afb5d9e4af2cca3ce3dce58594178c1b1800b3dae087572</guid>
<pubDate>Tue, 29 Aug 2023 09:00:00 EDT</pubDate>
<dc:date>2023-08-29T13:00:00.000Z</dc:date>
<atom:updated>2023-08-29T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres Subquery Powertools: CTEs, Materialized Views, Window Functions, and LATERAL Join ]]></title>
<link>https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral</link>
<description><![CDATA[ Wondering when to use a Materialized View or a CTE? Elizabeth has summaries, example queries, and comparisons for the most popular subquery tools. ]]></description>
<content:encoded><![CDATA[ <p>Beyond a basic query with a join or two, many queries require extracting subsets of data for comparison, conditionals, or aggregation. Postgres’ use of the SQL language is standards compliant and SQL has a world of tools for subqueries. This post will look at many of the different subquery tools. We’ll talk about the advantages and use cases of each, and provide further reading and tutorials to dig in more.<p>I’ll take a broad definition of “subquery”. Why am I calling all of these subqueries? These are all queries that work on subsets of data. Having read the article title, you might have come here to say that a subquery is a specific thing <em>vs</em> all these other SQL tools I’m talking about. And you’d be right! If you have a better name for this group of tools, let me know.<h2 id=what-is-a-subselect><a href=#what-is-a-subselect>What is a subselect?</a></h2><p>A subquery extracts values from a group of something else. It’s a subset of a set. A basic subquery is a nested select statement inside of another query. The most basic subselects are typically found in WHERE statements.<p>In this example we want to summarize a quantity of SKUs sold that have a certain sale price. The subquery returns the SKU of all products that have a sale price less than 75% of the price. Then the top-level query sums the quantity of each product_order.<pre><code class=language-pgsql>SELECT
   sum(qty) as total_qty,
   sku
FROM
   product_orders
WHERE
   sku in
   (
      SELECT
         sku
      FROM
         products
      WHERE
         sale_price &lt;= price * .75
   )
GROUP BY
   sku;
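
-- The same result, written as a join instead of a subquery:
SELECT
   sum(po.qty) as total_qty,
   po.sku
FROM
   product_orders po
   JOIN
      products p
      ON p.sku = po.sku
WHERE
   p.sale_price &lt;= p.price * .75
GROUP BY
   po.sku;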
</code></pre><p>As with most things SQL, you could build this query a few ways. Most queries that you execute with a join you could also execute with a subquery. Why would you use a subquery instead of a join? Mostly, it depends on your syntax preference and what you want to do with it. So the above could also be written as a plain select statement joining products and product_orders by SKU. See our blog post on <a href=https://www.crunchydata.com/blog/joins-or-subquery-in-postgresql-lessons-learned>choosing joins vs subqueries</a> for more.<h3 id=when-to-use-a-basic-subselect><a href=#when-to-use-a-basic-subselect>When to use a basic subselect</a></h3><ul><li>Your subquery is simple and can go in the WHERE clause</ul><h2 id=what-is-a-postgres-view><a href=#what-is-a-postgres-view>What is a Postgres view?</a></h2><p>A view is a stored query that you access as you would a table. View functionality is quite common across relational database systems. Since a view is a query, it can be data from one table or consolidated data from multiple tables. When called, a view will execute a query or it can be called as a subquery. Commonly, a view is used to save a query that you might be running often from inside your database. Views can be used as a join or a subquery from inside another query.<p>Views are a little more advanced than just a query, they can have separate user settings. You could specify views for certain individuals or applications if you want to show parts of certain tables. Some developers have their applications query a view instead of the base table, so if changes are made to the underlying tables, fewer changes will impact the application code.<p>Using the example we started with above, let’s say we often need to call the SKUs of sale items in other queries, so we want to create a view for that. Here’s sample syntax for a Postgres view. 
We name this view <code>skus_on_sale</code>, which selects SKUs from the product table that have a sale price less than 75% of their original price.<pre><code class=language-pgsql>CREATE VIEW skus_on_sale AS
SELECT
   sku
FROM
   products
WHERE
   sale_price &lt;= price * .75;
</code></pre><p>Previously, we nested a full subquery, this time, we join this view in another query. Logically, this will return the same values as the prior query:<pre><code class=language-pgsql>SELECT
   sum(po.qty) as total_qty,
   sk.sku
FROM
   product_orders po
   JOIN
      skus_on_sale sk
      ON sk.sku = po.sku
GROUP BY
   sk.sku;
</code></pre><h3 id=when-to-use-a-view><a href=#when-to-use-a-view>When to use a view?</a></h3><ul><li>When you want to save a specific query for use later or in other queries<li>You have a security issue or need to show a user or application only the view and not the entire table or tables involved</ul><h2 id=what-is-a-materialized-view><a href=#what-is-a-materialized-view>What is a Materialized View?</a></h2><p>Materialized views are saved queries that you store in the database like you would store a table. Unlike the regular view, a materialized view is stored on disk and information does not need to be re-computed each time it is used. Materialized views can be queried like any other table. Typically materialized views are used for situations where you want to save yourself, or the database, from intensive queries, or for data that is frequently used.<p>The big upside to materialized views is performance. Since the data has been precomputed, materialized views often have better response times than other subquery methods. No matter how complex the query or how many tables are involved, Postgres stores the results as a simple table. Querying it becomes a simple scan or join, and the materialized view hides the complexity of the subquery’s heavy lifting.<p>Here’s an example of a materialized view that will get my SKUs and the shipped quantity by SKU. This shows the most frequently sold SKUs at the top since I’m ordering by qty in descending order.<pre><code class=language-pgsql>CREATE MATERIALIZED VIEW recent_product_sales AS
SELECT
   p.sku,
   SUM(po.qty) AS total_quantity
FROM
   products p
   JOIN
      product_orders po
      ON p.sku = po.sku
   JOIN
      orders o
      ON po.order_id = o.order_id
WHERE
   o.status = 'Shipped'
GROUP BY
   p.sku
ORDER BY
   2 DESC;
</code></pre><p>To improve query performance on materialized views, we can also create indexes on their fields. Here’s an example that indexes the quantity column:<pre><code class=language-pgsql>CREATE INDEX sku_qty ON recent_product_sales(total_quantity);
</code></pre><p>Just like the view, we can call the materialized view in a query. So for example, we can quickly review the top 10 products sold without having to write a subquery to sum or rank.<pre><code class=language-pgsql>SELECT
   sku
FROM
   recent_product_sales LIMIT 10;
</code></pre><p>To update the data held on disk, run a refresh command, <code>REFRESH MATERIALIZED VIEW CONCURRENTLY recent_product_sales;</code>. Use CONCURRENTLY to allow queries to keep reading the existing data while the new data is computed; note that CONCURRENTLY requires a unique index on the materialized view.<p>See our <a href=https://www.crunchydata.com/developers/playground/materialized-views>tutorial on materialized views</a> if you want to see it in action.<h3 id=when-to-use-a-materialized-view><a href=#when-to-use-a-materialized-view>When to use a materialized view?</a></h3><ul><li>Your subquery is intensive, so storing the generated results rather than computing them each time will help overall performance<li>Your data doesn’t need to be updated in real time</ul><h2 id=what-is-a-common-table-expression-cte><a href=#what-is-a-common-table-expression-cte>What is a common table expression (CTE)?</a></h2><p>A <abbr>CTE</abbr>, a <dfn>common table expression</dfn>, allows you to split a complex query into different named parts and reference those parts later in the query.<ul><li>CTEs always start with a WITH statement that creates a subquery first<li>The WITH statement is followed by a select statement that references the CTE; the CTE cannot exist alone</ul><p>Similar to the view statement above, here is a sample CTE that creates a subselect called <code>huge_savings</code>, then uses this in a select statement.<pre><code class=language-pgsql>WITH huge_savings AS
(
   SELECT
      sku
   FROM
      products
   WHERE
      sale_price &lt;= price * .75
)
SELECT
   sum(qty) as total_qty,
   sku
FROM
   product_orders
   JOIN
      huge_savings USING (sku)
GROUP BY
   sku;
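
-- And a minimal recursive CTE, which references itself
-- until a terminating condition (here, n reaching 5):
WITH RECURSIVE counter AS
(
   SELECT 1 AS n
   UNION ALL
   SELECT n + 1 FROM counter WHERE n &lt; 5
)
SELECT n FROM counter;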
</code></pre><p>Often as queries become more and more complex, CTEs are a great way to make queries easier to understand by breaking the data manipulation into sensible parts.<h3 id=what-is-a-recursive-cte><a href=#what-is-a-recursive-cte>What is a recursive CTE?</a></h3><p>A recursive CTE is a CTE that selects against itself. You’ll define an initial condition and then append rows as part of the query. This goes on until a terminating condition is met. We have some <a href=https://www.crunchydata.com/blog/solving-advent-of-code-2022-using-postgres-day-7>awesome examples of recursive CTEs</a> in our Advent of Code series. Recursive CTEs start with <code>WITH RECURSIVE name AS</code>.<h3 id=when-to-use-a-cte><a href=#when-to-use-a-cte>When to use a CTE?</a></h3><ul><li>To separate and define a complicated subquery<li>You have multiple subqueries to include in a larger query<li>Your subquery needs to select against itself, so you need a recursive CTE</ul><h2 id=what-is-a-window-function><a href=#what-is-a-window-function>What is a window function?</a></h2><p>A window function is an aggregate function that looks at a certain set of data, i.e. the window. The aggregate function typically comes first, and the OVER operator defines the group / partition of data you’re looking at. Window functions are often used in subqueries to do averages, summations, max/min, ranks, lead (next row), or lag (previous row).<p>For example, you could write a simple window function to sum product orders by sku. SUM is the aggregation and OVER (PARTITION BY sku) defines the sku set.<pre><code class=language-pgsql>SELECT
   sku,
   SUM(qty) OVER (PARTITION BY sku)
FROM
   product_orders LIMIT 10;
</code></pre><p>We have a nice <a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>tutorial on window functions with CTEs</a> for the birth data set. Here’s one SQL example using the window function lag. We use a CTE to create a count of births per week. Then we use the lag function to return this week’s birth count alongside the birth count for the prior week.<pre><code class=language-pgsql>WITH weekly_births AS
(
   SELECT
      date_trunc('week', day) week,
      sum(births) births
   FROM
      births
   GROUP BY
      1
)
SELECT
   week,
   births,
   lag(births, 1) OVER (
ORDER BY
   week DESC ) prev_births
FROM
   weekly_births;
</code></pre><p>It is worth calling out here that, similar to a window function, the FILTER clause on GROUP BY aggregations is also a powerful SQL tool. I won’t include more here because it’s not a subquery so much as a filter. For more information, Crunchy Data has a <a href=https://www.crunchydata.com/blog/using-postgres-filter>walkthrough on using FILTER with GROUP BY</a>.<h3 id=when-to-use-a-window-function><a href=#when-to-use-a-window-function>When to use a window function?</a></h3><ul><li>If you have a subquery that’s an aggregation, like a sum, rank, or average<li>The subquery applies to a limited set of the overall data to be returned</ul><h2 id=what-is-a-lateral-join><a href=#what-is-a-lateral-join>What is a LATERAL join?</a></h2><p>LATERAL lets you use values from the top-level query in the subquery. So, if you are querying on accounts in the top-level query, you can then reference that in the subquery. When run, LATERAL is kind of like running a subquery for each individual row. LATERAL is commonly used for querying against an array or JSON data, as well as a replacement for the DISTINCT ON syntax. Check out our <a href=https://www.crunchydata.com/developers/playground/lateral-join>LATERAL tutorial</a> to see if you get any ideas about where to add it to your query tools. I would also double-check performance when using LATERAL; in our internal testing, it’s generally not as good as other join options.<p>Below, we use <code>LATERAL</code> to find the last purchase for every account:<pre><code class=language-pgsql>SELECT
   accounts.id,
   last_purchase.*
FROM
   accounts
   INNER JOIN
      LATERAL (
      SELECT
         *
      FROM
         purchases
      WHERE
         account_id = accounts.id
      ORDER BY
         created_at DESC LIMIT 1 ) AS last_purchase
         ON true;
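</code></pre><p>For comparison, this <code>LATERAL</code> join is doing the job that <code>DISTINCT ON</code> is often used for. Here’s a sketch of the equivalent <code>DISTINCT ON</code> query against the same accounts/purchases tables, assuming the same columns as above:<pre><code class=language-pgsql>-- Last purchase per account: keep the first row per
-- account_id after sorting newest-first
SELECT DISTINCT ON (account_id) *
FROM purchases
ORDER BY account_id, created_at DESC;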
</code></pre><p>When to use a LATERAL join?<ul><li>You want to look up data for each row<li>You’re using an array or JSON data in a join</ul><h2 id=summary><a href=#summary>Summary</a></h2><p>Here’s my reference guide for the tools I discussed above:<table><thead><tr><th>what<th>details<th>example<tbody><tr><td>subselect<td>select inside a select<td>SELECT<br /> sum(qty) as total_qty,sku<br />FROM product_orders<br />WHERE<br /> sku in (SELECT sku FROM products WHERE sale_price &#60= price * .75)<br />GROUP BY sku;<tr><td>CTE<td>subqueries with named parts<td>WITH huge_savings AS (<br /> SELECT sku <br /> FROM products <br /> WHERE<br />sale_price &#60= price * .75)<br />SELECT sum(qty) as total_qty, sku <br />FROM product_orders <br />JOIN huge_savings<br />USING (sku)<br />GROUP BY sku;<tr><td>materialized view<td>saved query to a table<td>CREATE MATERIALIZED VIEW recent_product_sales AS <br />SELECT p.sku, SUM(po.qty) AS total_quantity <br />FROM products p <br />JOIN product_orders po ON p.sku = po.sku <br />JOIN orders o ON po.order_id = o.order_id <br />WHERE o.status = 'Shipped' <br />GROUP BY p.sku<br />ORDER BY 2 DESC;<tr><td>window functions<td>aggregations on subsets of data<td>SELECT<br /> sku, SUM(qty) OVER (PARTITION BY sku)<br />FROM<br />product_orders<br />LIMIT 10;<tr><td>lateral join<td>correlated subquery<td>SELECT p.sku, latest.qty<br />FROM products p<br />JOIN LATERAL (<br /> SELECT qty FROM product_orders po<br /> WHERE po.sku = p.sku<br /> ORDER BY po.order_id DESC LIMIT 1) latest<br />ON true;</table><p>If you’re wondering which to use when, you can just get in there and test.
Don’t forget to use your query planning best friend <a href=https://www.crunchydata.com/blog/get-started-with-explain-analyze>EXPLAIN ANALYZE</a> to test query efficiency and plans.<p>Links to our web based Postgres tutorials for more on these topics:<p><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs and Window functions</a><p><a href=https://www.crunchydata.com/developers/playground/materialized-views>Materialized views</a><p><a href=https://www.crunchydata.com/developers/playground/lateral-join>Lateral joins</a> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">fe5f1f1609da04cf1a0237f96a93006d48acef801e6950a5adce4056c0e2d0ca</guid>
<pubDate>Thu, 17 Aug 2023 09:00:00 EDT</pubDate>
<dc:date>2023-08-17T13:00:00.000Z</dc:date>
<atom:updated>2023-08-17T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Tags and Postgres Arrays, a Purrrfect Combination ]]></title>
<link>https://www.crunchydata.com/blog/tags-aand-postgres-arrays-a-purrfect-combination</link>
<description><![CDATA[ Are you using tags in your database with some of your main database properties? Paul reviews some of the ways to store tags in a database from basic relational models to text arrays. He provides some performance tests, sample queries, and guidance on choosing the best path. ]]></description>
<content:encoded><![CDATA[ <p>In a previous life, I worked on a CRM system that really loved the idea of tags. Everything could be tagged, users could create new tags, tags were a key organizing principle of searching and filtering.<p>The trouble was, modeled traditionally, tags can really make for some ugly tables and equally ugly queries. Fortunately, and as usual, Postgres has an answer.<p>Today I’m going to walk through working with tags in Postgres with a sample database of 🐈 cats and their attributes:<ul><li>First, I’ll look at a traditional relational model<li>Second, I’ll look at using an integer array to store tags<li>Lastly, I’ll test text arrays directly embedding the tags alongside the feline information</ul><p>This post is also available as an interactive tutorial in our <a href=https://www.crunchydata.com/developers/playground/tags-and-postgres-arrays>Postgres playground</a>.<h2 id=tags-in-a-relational-model><a href=#tags-in-a-relational-model>Tags in a relational model</a></h2><p>For these tests, we will use a very simple table of 🐈 <code>cats</code>, our entity of interest, and <code>tags</code>, a short table of eleven tags for the cats. In between the two tables, the relationship between tags and cats is stored in the <code>cat_tags</code> table.<p><img alt="diagram of 3 tables called cats, cat_tags, and tags with cat_tags being the join table"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/18c51a25-8dcf-4cb4-259e-bae8b7ef7600/public></p><details><summary>Table Creation SQL</summary><pre><code class=language-pgsql>CREATE TABLE cats (
    cat_id serial primary key,
    cat_name text not null
);

CREATE TABLE cat_tags (
    cat_id integer not null,
    tag_id integer not null,
    unique(cat_id, tag_id)
);

CREATE TABLE tags (
    tag_id serial primary key,
    tag_name text not null,
    unique(tag_name)
);
</code></pre></details><p>I filled the tables with over 1.7M entries for the <code>cats</code>, eleven entries for the <code>tags</code>, and 4.7M entries for the cat/tag relationship.</p><details><summary>Data Generation SQL</summary><pre><code class=language-pgsql>-- Generate random cat names
INSERT INTO cats (cat_name)
WITH
hon AS (
    SELECT *
    FROM unnest(ARRAY['mr', 'ms', 'miss', 'doctor', 'frau', 'fraulein', 'missus', 'governer']) WITH ORDINALITY AS hon(n, i)
),
fn AS (
    SELECT *
    FROM unnest(ARRAY['flopsy', 'mopsey', 'whisper', 'fluffer', 'tigger', 'softly']) WITH ORDINALITY AS fn(n, i)
),
mn AS (
    SELECT *
    FROM unnest(ARRAY['biggles', 'wiggly', 'mossturn', 'leaflittle', 'flower', 'nonsuch']) WITH ORDINALITY AS mn(n, i)
),
ln AS (
    SELECT *
    FROM unnest(ARRAY['smithe-higgens', 'maclarter', 'ipswich', 'essex-howe', 'glumfort', 'pigeod']) WITH ORDINALITY AS ln(n, i)
)
SELECT initcap(concat_ws(' ', hon.n, fn.n, mn.n, ln.n)) AS name
FROM hon, fn, mn, ln, generate_series(1,1000)
ORDER BY random();

-- Fill in the tag names
INSERT INTO tags (tag_name) VALUES
    ('soft'), ('cuddly'), ('brown'), ('red'), ('scratches'), ('hisses'), ('friendly'), ('aloof'), ('hungry'), ('birder'), ('mouser');

-- Generate random tagging. Every cat has 25% chance of getting each tag.
INSERT INTO cat_tags
WITH tag_ids AS (
    SELECT DISTINCT tag_id FROM tags
),
tag_count AS (
    SELECT Count(*) AS c FROM tags
)
SELECT cat_id, tag_id
FROM cats, tag_ids, tag_count
WHERE random() &#60 0.25;

CREATE INDEX cat_tags_x ON cat_tags (tag_id);
</code></pre></details><p>In total, the relational model needs <strong>446MB</strong> for the <code>cats</code>, <code>tags</code> and the tag relationships.<pre><code class=language-pgsql>SELECT pg_size_pretty(
    pg_total_relation_size('cats') +
    pg_total_relation_size('cat_tags') +
    pg_total_relation_size('tags'));
</code></pre><h3 id=performance-of-relational-queries><a href=#performance-of-relational-queries>Performance of relational queries</a></h3><p>There are two standard directions of tag queries:<ul><li>"What are the tags for this particular cat?"<li>"What cats have this particular tag or set of tags?"</ul><h3 id=what-tags-does-this-cat-have><a href=#what-tags-does-this-cat-have>What tags does this cat have?</a></h3><p><img alt="diagram of three tables with a join table between the two"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/4949fc8e-c81d-4655-1fc6-093d0544dd00/public><p>The query is simple, and the performance is very good (<strong>under 1 ms</strong>).<pre><code class=language-pgsql>SELECT tag_name
FROM tags
JOIN cat_tags USING (tag_id)
WHERE cat_id = 444;
</code></pre><h3 id=what-cats-have-this-tag><a href=#what-cats-have-this-tag>What cats have this tag?</a></h3><p><img alt="diagram of 3 tables with a join table between the two"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a98c716c-9a28-4fb9-5524-3ef16d9e9700/public><p>The query is still simple, and the performance is not unexpected (<strong>about 500ms</strong>) for the number of records (431K) that meet the filter criterion.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN tags ta ON (a.tag_id = ta.tag_id)
WHERE ta.tag_name = 'brown';
</code></pre><h3 id=what-cats-have-these-two-tags><a href=#what-cats-have-these-two-tags>What cats have these two tags?</a></h3><p>This is where things start to come off the rails for the relational model, because finding just the records that have <strong>two</strong> particular tags involves quite complicated SQL.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats

JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN cat_tags b ON (a.cat_id = b.cat_id)
JOIN tags ta    ON (a.tag_id = ta.tag_id)
JOIN tags tb    ON (b.tag_id = tb.tag_id)
WHERE ta.tag_name = 'brown' AND tb.tag_name = 'aloof';
</code></pre><p>This query takes around <strong>900ms</strong> to find the 108K cats that are both "brown" and "aloof".<h3 id=what-cats-have-these-three-tags><a href=#what-cats-have-these-three-tags>What cats have these three tags?</a></h3><p>Just so you can see the pattern, here's the three tag version.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN cat_tags b ON (a.cat_id = b.cat_id)
JOIN cat_tags c ON (b.cat_id = c.cat_id)
JOIN tags ta    ON (a.tag_id = ta.tag_id)
JOIN tags tb    ON (b.tag_id = tb.tag_id)
JOIN tags tc    ON (c.tag_id = tc.tag_id)
WHERE ta.tag_name = 'brown'
AND tb.tag_name = 'aloof' AND tc.tag_name = 'red';
</code></pre><p>At this point the decreasing number of records in the result set (27K) is balancing out the growing complexity of the multi-join and query time has only grown to <strong>950ms</strong>.<p>But imagine doing this for five, six or seven tags?<h2 id=tags-in-an-integer-array-model><a href=#tags-in-an-integer-array-model>Tags in an integer array model</a></h2><p>What if we changed our model, and instead of modeling the cat/tag relationship with a correlation table, we model it with an integer array?<pre><code class=language-pgsql>CREATE TABLE cats_array (
    cat_id serial primary key,
    cat_name text not null,
    cat_tags integer[]
);
</code></pre><p>Now our model has just <strong>two</strong> tables, <code>cats_array</code> and <code>tags</code>.<p>We can populate the array-based table from the relational data, so we can compare answers between models.</p><details><summary>Data Generation SQL</summary><pre><code class=language-pgsql>INSERT INTO cats_array
SELECT cat_id, cat_name, array_agg(tag_id) AS cat_tags
FROM cats
JOIN cat_tags USING (cat_id)
GROUP BY cat_id, cat_name;
</code></pre>
</details><p>With this new data model, the size of the required tables has gone down, and we are using only <strong>199MB</strong>.<pre><code class=language-pgsql>SELECT pg_size_pretty(
    pg_total_relation_size('cats_array') +
    pg_total_relation_size('tags'));
</code></pre><p>Once the data are loaded, we need the <strong>most important part</strong> -- an index on the <code>cat_tags</code> integer array.<pre><code class=language-pgsql>CREATE INDEX cats_array_x ON cats_array USING GIN (cat_tags);
</code></pre><p>This <a href=https://www.postgresql.org/docs/current/gin-intro.html>GIN index</a> is a perfect fit for indexing collections (like our array) where there is a fixed and finite number of values in the collection (like our eleven tags). While Postgres ships with an <a href=https://www.postgresql.org/docs/current/intarray.html>intarray</a> extension, the core <a href=https://www.postgresql.org/docs/current/functions-array.html>support for arrays and array indexes</a> has caught up with and rendered much of the extension redundant.<p>As before, we will test common tag-based use cases.<h3 id=what-tags-does-this-cat-have-1><a href=#what-tags-does-this-cat-have-1>What tags does this cat have?</a></h3><p><img alt="diagram of two tables using an array for tags instead of a join table"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/6490dec2-971f-4ff3-5c4d-41b7466c6900/public><p>The query is much less pretty! First we have to look up the <code>tag_id</code> values in <code>cat_tags</code> and use <code>unnest()</code> to expand them out into a relation. Then we're ready to join that relation to the <code>tags</code> table to find the <code>tag_name</code> that corresponds to the <code>tag_id</code>.<pre><code class=language-pgsql>SELECT c.cat_id, c.cat_name, t.tag_name
FROM cats_array c
CROSS JOIN unnest(cat_tags) AS tag_id
JOIN tags t USING (tag_id)
WHERE cat_id = 779;
</code></pre><p>The query hits the <code>cats_array</code> primary key index and returns in the <strong>1ms</strong> range. Great performance!<h3 id=what-cats-have-these-three-tags-1><a href=#what-cats-have-these-three-tags-1>What cats have these (three) tags?</a></h3><p>This is the query that flummoxed our relational model. Let's jump straight to the hardest case and try to find all the cats that are "red", "brown" and "aloof".<pre><code class=language-pgsql>WITH tags AS MATERIALIZED (
    SELECT array_agg(tag_id) AS tag_ids
    FROM tags
    WHERE tag_name IN ('red', 'brown', 'aloof')
    )
SELECT Count(*)
FROM cats_array
CROSS JOIN tags
WHERE cat_tags @> tags.tag_ids;
</code></pre><p>First we have to go into the <code>tags</code> table to make an array of <code>tag_id</code> entries that correspond to our tags. Then we can use the <code>@></code> Postgres <a href=https://www.postgresql.org/docs/current/functions-array.html>array operator</a> to test to see which cats have <code>cat_tags</code> arrays that contain the query array.<p>The query hits the GIN index on <code>cat_tags</code> and returns the count of <strong>26,905</strong> cats in around <strong>120ms</strong>. About <strong>seven times</strong> faster than the same query on the relational model!<h2 id=tags-in-a-text-array-model><a href=#tags-in-a-text-array-model>Tags in a text array model</a></h2><p>So if partially de-normalizing our data from a <code>cats -> cat_tags -> tags</code> model to a <code>cats -> tags</code> model makes things faster... what if we went all the way to the simplest model of all -- just <code>cats</code>?<pre><code class=language-pgsql>CREATE TABLE cats_array_text (
    cat_id serial primary key,
    cat_name text not null,
    cat_tags text[] not null
);
</code></pre><p>Again we can populate this new model directly from the relational model.</p><details><summary>Data Generation SQL</summary><pre><code class=language-pgsql>INSERT INTO cats_array_text
SELECT cat_id, cat_name, array_agg(tag_name) AS cat_tags
FROM cats
JOIN cat_tags USING (cat_id)
JOIN tags USING (tag_id)
GROUP BY cat_id, cat_name;
</code></pre></details><p>The result is <strong>234MB</strong>, about <strong>17%</strong> larger than the integer array version.<pre><code class=language-pgsql>SELECT pg_size_pretty(
    pg_total_relation_size('cats_array_text') +
    pg_total_relation_size('tags'));
</code></pre><p>Now every cat has the tag names right in the record.<h3 id=what-tags-does-this-cat-have-2><a href=#what-tags-does-this-cat-have-2>What tags does this cat have?</a></h3><p><img alt="a table showing a cat&#39s name and the associated tags"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/be47732c-0232-4c16-20f6-82447a67ab00/public><p>Since there's only one table with all the data, finding the tags of a cat is ridiculously easy.<pre><code class=language-pgsql>SELECT cat_name, cat_tags
FROM cats_array_text
WHERE cat_id = 888;
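</code></pre><p>The flip side of this simplicity is that there is no longer a separate table listing the tags themselves. A sketch of how you could recover the full tag list by unnesting the arrays — note this scans the whole table:<pre><code class=language-pgsql>-- List every distinct tag currently in use
SELECT DISTINCT unnest(cat_tags) AS tag_name
FROM cats_array_text
ORDER BY 1;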
</code></pre><h3 id=what-cats-have-these-three-tags-2><a href=#what-cats-have-these-three-tags-2>What cats have these (three) tags?</a></h3><p>Once again, in order to get good performance we need a GIN index on the array we will be searching.<pre><code class=language-pgsql>CREATE INDEX cats_array_text_x ON cats_array_text USING GIN (cat_tags);
</code></pre><p>The query to find the cats are "red", "brown" and "aloof" is also wonderfully simple.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats_array_text
WHERE cat_tags @> ARRAY['red', 'brown', 'aloof'];
</code></pre><p>This query takes about the same amount of time as the integer array based query, <strong>120ms</strong>.<h2 id=wins-and-losses><a href=#wins-and-losses>Wins and losses</a></h2><p>So are array-based models the final answer for tag-oriented query patterns in Postgres? On the plus side, the array models are:<ul><li>Faster to query;<li>Smaller to store; and,<li>Simpler to query!</ul><p>Those are really compelling positives!<p>However, there are some caveats to keep in mind when working with these models:<ul><li>For the text-array model, there's no general place to look up all tags. For a list of all tags, you will have to scan the entire <code>cats_array_text</code> table.<li>For the integer-array model, there's no way to create a simple constraint that guarantees that integers used in the <code>cat_tags</code> integer array exist in the <code>tags</code> table. You can work around this with a <code>TRIGGER</code>, but it's not as clean as a relational foreign key constraint.<li>For both array models, the SQL can get a little crufty when the tags have to be un-nested to work with relational querying.</ul> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">7b3e6ca22c35ed59da64ba51490a49589fc646dc2c1156543342d4fdff98ed7a</guid>
<pubDate>Mon, 22 May 2023 09:00:00 EDT</pubDate>
<dc:date>2023-05-22T13:00:00.000Z</dc:date>
<atom:updated>2023-05-22T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Working with Time in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/working-with-time-in-postgres</link>
<description><![CDATA[ A primer on working with time in Postgres. Covers data types, query formats, intervals, overlaps, range types, indexing, and roll ups. ]]></description>
<content:encoded><![CDATA[ <p>Since humans first started recording data, they’ve been keeping track of time. Time management is one of those absolutely crucial database tasks and Postgres does a great job of it. Postgres has a lot of options for storing and querying time so I wanted to provide an overview of some of the most common needs for storing and retrieving time data.<p>This blog is also available as a <a href=https://www.crunchydata.com/developers/playground/working-with-time-in-postgres>hands on tutorial</a> running in your local browser via our Postgres playground.<p>If you ask Postgres what time it is,<pre><code class=language-pgsql>SELECT now();
</code></pre><p>You’ll get something like<pre><code class=language-text>now
-----------------------------
 2023-05-15 18:23:58.5603+00
</code></pre><p>The default time representation here is a full timestamp string, containing the date, time, and a reference to the timezone. In this case, the +00 offset means the time is equal to UTC. UTC has long been the standard time reference, succeeding Greenwich Mean Time (if you’re as old as I am).<p>If I want to know the time in my local timezone:<pre><code class=language-pgsql>SELECT now() AT TIME ZONE 'America/Chicago';
</code></pre><p>The full list of timezone names you can use is stored in a system table and can be retrieved with <code>select * from pg_timezone_names;</code><h2 id=data-types-for-time><a href=#data-types-for-time>Data types for time</a></h2><p>Postgres has a <code>TIME</code> data type, with and without a time zone, if you want to store time separately from a date. This is generally not recommended since in most cases time requires an accompanying date. There’s a <code>TIMESTAMP</code> datatype. Adding a timezone to TIMESTAMP gives you <code>TIMESTAMP WITH TIME ZONE</code>, aliased as <code>TIMESTAMPTZ</code>. Without a doubt <strong>TIMESTAMPTZ</strong> is going to be the MVP of Postgres time storage. If you store data with the full date, time, and timezone you’ll never have to worry about the server time, what time the user entered the data, what time it is where you’re querying data, or any of those crazy calculations. And you or your application can pull out the time and display it in whatever local user timezone you need.<p>When working with Postgres, you’ll also see the epoch, which is how seconds are represented. This is not a timestamp, it’s a number (a double precision floating-point value, 64 bits) representing the number of seconds since January 1st, 1970. This can be used if you need a specific comparison or need time in that format. Postgres can easily convert back and forth between timestamps and epochs. To find the current epoch:<pre><code class=language-pgsql>SELECT EXTRACT (EPOCH FROM now());
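</code></pre><p>Going the other direction, <code>to_timestamp()</code> turns an epoch value back into a <code>timestamptz</code>:<pre><code class=language-pgsql>-- Convert epoch seconds back to a timestamp with time zone
SELECT to_timestamp(1684167778);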
</code></pre><h2 id=time-formats--functions><a href=#time-formats--functions>Time formats &#38 functions</a></h2><p>I’m an American midwesterner so of course, I would write Bastille Day like - July 14th, 1789 or 7-14-1789. Of course all my French friends would write it 14 July 1789 or 14-07-1789. And while I’d love to debate with you all over beers about the best way to do this, ISO has some standards for time formats, namely ISO 8601, which states that dates will be read like this: 1789-07-14 17:30:00.000, year-month-day-time. This date format is what’s used in TIMESTAMP and what you’ll see most often in the database and engineering world.<p>Time storage has the ISO 8601 best practice; however, depending on your end users or business needs, you may want to change the time format in your queries when they’re output. To change the time format of a query you can use the <code>TO_CHAR</code> function, which formats a timestamp as text.<pre><code class=language-pgsql>SELECT TO_CHAR(NOW(), 'DY, Mon dd, yyyy HH24:MI:SS OF');
</code></pre><p><code>TO_CHAR</code> lets you convert a timestamp to text. Then using some <a href=https://www.postgresql.org/docs/current/functions-formatting.html>formatting functions</a>, I can pull out the day of the week, an American date format, and UTC time. The result of that query would be:<pre><code class=language-text> MON, May 15, 2023 14:22:28 +00
</code></pre><h2 id=time-intervals><a href=#time-intervals>Time intervals</a></h2><p>Now that we’re fancy and can get dates in any format we want, how about calculating intervals and lapsed time in different formats?<p>We’ve loaded in a sample table with some train schedule data, take a peek<pre><code class=language-pgsql>SELECT * FROM train_schedule LIMIT 3;
</code></pre><p>and it looks like this<pre><code class=language-text> trip_id | track_number | train_number |  scheduled_departure   |   scheduled_arrival    |    actual_departure    |     actual_arrival
---------+--------------+--------------+------------------------+------------------------+------------------------+------------------------
       1 |            1 |          683 | 2023-04-29 11:15:00+00 | 2023-04-29 12:35:00+00 | 2023-04-29 11:21:00+00 | 2023-04-29 12:52:00+00
       2 |            1 |          953 | 2023-04-29 13:49:00+00 | 2023-04-29 15:10:00+00 | 2023-04-29 13:50:00+00 | 2023-04-29 15:17:00+00
       3 |            1 |          140 | 2023-04-29 15:06:00+00 | 2023-04-29 15:23:00+00 | 2023-04-29 15:06:00+00 | 2023-04-29 15:22:00+00
(3 rows)
</code></pre><p>Let’s say you are storing timestamp fields like these arrival times. To find the lower and upper bounds of arrival times in your data set you would do:<pre><code class=language-pgsql>SELECT min(actual_arrival) FROM train_schedule;
</code></pre><p>and<pre><code class=language-pgsql> SELECT max(actual_arrival) FROM train_schedule;
</code></pre><p>To find the interval between them:<pre><code class=language-pgsql>SELECT
(SELECT max(actual_arrival) FROM train_schedule)
- (SELECT min(actual_arrival)
FROM train_schedule);
</code></pre><p>Ok, so we have about 10 days of train schedule information in here.<p>Taking this a step further, if I want to look at the intervals between scheduled and actual arrival times, I can create an arrival_delta in a subquery that computes actual arrival minus scheduled arrival.<pre><code class=language-pgsql>SELECT avg(arrival_delta)
FROM (SELECT scheduled_arrival, actual_arrival,
	actual_arrival - scheduled_arrival AS arrival_delta
FROM train_schedule)q;
</code></pre><p>You can also add a filter to find interval sizes. If we build on the above query but look only at arrivals that were more than 10 minutes later than their originally scheduled time, we can add the filter <code>> INTERVAL '10 minutes'</code>.<pre><code class=language-pgsql>SELECT avg(arrival_delta)
FROM (select scheduled_arrival, actual_arrival,
actual_arrival - scheduled_arrival AS arrival_delta
FROM train_schedule WHERE (actual_arrival - scheduled_arrival)
> INTERVAL '10 minutes')q;
</code></pre><h3 id=overlapping--intersecting-time><a href=#overlapping--intersecting-time>Overlapping / intersecting time</a></h3><p>What if I wanted to find all of the trains that were running at a specific time - or right now? You can use the OVERLAPS operator together with an INTERVAL.<pre><code class=language-pgsql>SELECT count(*) FROM train_schedule
WHERE (actual_departure, actual_arrival)
OVERLAPS (now(), now() - INTERVAL '2 hours');
</code></pre><h3 id=time-range-types><a href=#time-range-types>Time Range Types</a></h3><p>Postgres also supports working with time ranges, both as a single range and even as <a href=https://www.crunchydata.com/blog/better-range-types-in-postgres-14-turning-100-lines-of-sql-into-3>multiple ranges</a>. The single range type over timestamptz is called <code>tstzrange</code> and the multirange version is <code>tstzmultirange</code>.<p>For example, if we wanted to create a table in our train database that has some peak travel season fares, we could do:<pre><code class=language-pgsql>CREATE TABLE fares
(peak_id int,
peak_name text,
peak_times tstzmultirange,
fare_change numeric);

INSERT INTO fares(peak_id, peak_name, peak_times, fare_change)
VALUES (1, 'holiday', '{[2023-12-24 00:00, 2023-12-27 00:00],[2023-12-31 00:00, 2024-01-02 00:00]}', 50),
(2, 'peak_summer', '{[2023-05-27 00:00, 2023-05-30 00:00],[2023-07-03 00:00, 2023-08-30 00:00]}', 30);
</code></pre><p>And now to query the multirange, Postgres has a special containment operator for this, <code>@></code>. Let’s see if travel today is during peak time.<pre><code class=language-pgsql>SELECT * from fares WHERE peak_times @> now();
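</code></pre><p>The single-range <code>tstzrange</code> type works with the same <code>@></code> containment operator. A quick sketch checking one timestamp against a single range literal:<pre><code class=language-pgsql>-- Is Christmas morning inside the holiday range?
SELECT tstzrange('2023-12-24 00:00', '2023-12-27 00:00')
       @> '2023-12-25 10:00'::timestamptz;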
</code></pre><h2 id=indexing-time-columns><a href=#indexing-time-columns>Indexing time columns</a></h2><p>Anytime you’re querying time a lot, you’ll want to add an index so that time lookups are faster. Timestamp column indexes work well with the traditional B-tree index as well as <a href=https://www.crunchydata.com/blog/postgresql-brin-indexes-big-data-performance-with-minimal-storage>BRIN</a>. In general, if you have tons of data entered sequentially, a <a href=https://www.crunchydata.com/blog/postgres-indexing-when-does-brin-win>BRIN index is probably recommended</a>.<p>A B-tree would be created like this:<pre><code class=language-pgsql>CREATE INDEX btree_actual_departure ON train_schedule (actual_departure);
</code></pre><p>And a BRIN<pre><code class=language-pgsql>CREATE INDEX brin_sequential ON train_schedule USING BRIN (actual_departure);
</code></pre><h2 id=roll-ups><a href=#roll-ups>Roll ups</a></h2><p>So let’s say you have quite a bit of time data. Using the <code>date_trunc</code> function you can easily truncate timestamp data to the day or month and then use a query to count by that truncated date.<p>If I want to find a count of train trips per day in my train data, that would look like this:<pre><code class=language-pgsql>SELECT
date_trunc('day', train_schedule.actual_departure) d,
COUNT (actual_departure)
FROM
train_schedule
GROUP BY
d
ORDER BY
d;
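</code></pre><p>The same roll up works at any granularity <code>date_trunc</code> supports; a per-month count is a one-word change:<pre><code class=language-pgsql>SELECT
date_trunc('month', train_schedule.actual_departure) m,
COUNT (actual_departure)
FROM
train_schedule
GROUP BY
m
ORDER BY
m;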
</code></pre><p>Roll ups won’t be the only way to deal with lots and lots of time data. <a href=https://www.crunchydata.com/blog/native-partitioning-with-postgres>Partitioning</a> can be really helpful once you have lots of time data that can be easily sectioned off. If you’re getting into <a href=https://www.crunchydata.com/blog/thinking-fast-vs-slow-with-your-data-in-postgres>measuring analytics or metrics</a>, there are some options for that as well, like hyperloglog.<h2 id=summary><a href=#summary>Summary</a></h2><p>Thanks for spending your <em>time</em> learning about <em>time</em> ;) Some takeaways:<ul><li>store time in UTC +/- values<li><code>timestamptz</code> is your bff<li><code>to_char</code> and all of the formatting functions let you query time however you want<li>Postgres has lots of functions for <code>interval</code> and <code>overlap</code> so you can look at data that intersects<li><code>date_trunc</code> can be really helpful if you want to roll up time fields and count by day or month</ul> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">3d28b35bfa0321dfa971affa24e98de6a0bb0c102c0dab39ad00a7d90d02b8d1</guid>
<pubDate>Mon, 15 May 2023 12:00:00 EDT</pubDate>
<dc:date>2023-05-15T16:00:00.000Z</dc:date>
<atom:updated>2023-05-15T16:00:00.000Z</atom:updated></item></channel></rss>