<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/postgres-tutorials/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/postgres-tutorials</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/postgres-tutorials</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Wed, 11 Oct 2023 09:00:00 EDT</pubDate>
<dc:date>2023-10-11T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Working with Money in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/working-with-money-in-postgres</link>
<description><![CDATA[ Elizabeth has a primer for working with money in Postgres including what data type to choose, storing currency, and some sample functions. ]]></description>
<content:encoded><![CDATA[ <p>Wouldn’t it be awesome if money worked just like <a href=https://www.crunchydata.com/blog/working-with-time-in-postgres>time</a> in Postgres? You could store one canonical version of it that worked across the globe. Well sadly, money is a whole different ball of wax. Still, like time, money is part of most database implementations, and I wanted to lay out some of the best practices I’ve gathered for working with money in Postgres.<p>I also have a <a href=https://www.crunchydata.com/developers/playground/working-with-money-in-postgres>tutorial</a> up if you want to try this with Postgres running in a web browser.<h2 id=data-types-for-money><a href=#data-types-for-money>Data types for money</a></h2><h3 id=money><a href=#money>Money</a></h3><p>Postgres actually does have a <code>money</code> data type. This is not recommended because it doesn’t handle fractions of a cent and currency is tied to a <a href=https://www.postgresql.org/docs/current/locale.html>database locale</a> setting. While the money type isn’t best practice for storing money, I do think it’s handy for casting when you want your query output to be formatted like money.<h3 id=floats><a href=#floats>Floats</a></h3><p>Float numbers are popular in any system using positive and negative numbers with decimals (the name refers to the decimal point “floating” across the number’s digits). The float (<code>real</code> / <code>float4</code>) and double precision (<code>float8</code>) datatypes could be used for money, but they generally aren’t ideal because they’re fundamentally imprecise.<p>So for example, this is correct:<pre><code class=language-pgsql>select 0.0001::float4;

0.0001
(1 row)
</code></pre><p>And so is this:<pre><code class=language-pgsql>select 0.0001::float4 + 0.0001::float4;

0.0002
(1 row)
</code></pre><p>But if we try to go out to additional fractions, this isn’t really the expected result:<pre><code class=language-pgsql>select 0.0001::float4 + 0.0001::float4 + 0.0001::float4;

0.00029999999
(1 row)
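
-- By contrast, numeric addition stays exact:
select 0.0001::numeric + 0.0001::numeric + 0.0001::numeric;

0.0003
(1 row)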
</code></pre><h3 id=integers><a href=#integers>Integers</a></h3><p>Lots of folks use <code>integer</code> for storing money. Integers do not support any kind of decimals, so 100/3 might equal 33.3333, but in integer math that’s just 33. This can work for storing money if you know what your smallest unit is going to be (even down to fractions of a cent) and can use a multiplier in your database. So the multiplier would be 100 for dealing with a whole number of cents, or 1000000000 if you want to represent an amount like 0.237928372 BTC. This unit is stored whole, which avoids float’s unrepresentable values.<p>There are some size limitations with this technique, as <code>integer</code> can only store numbers up to 2147483647 and <code>bigint</code> only up to 9223372036854775807.<p>Integer is, however, notably performant and storage efficient: it’s only 4 bytes per value, 8 if you’re using <code>bigint</code>. Also keep in mind that storing money as an integer will require division or casting to a different data type so your front end or SQL reports can output dollars, cents, or decimal numbers in a traditional format.<h3 id=numeric><a href=#numeric>Numeric</a></h3><p><code>numeric</code> is widely considered the ideal datatype for storing money in Postgres. <code>numeric</code> and <code>decimal</code> are synonyms; there's no difference in functionality between the two, but I hear numeric used more often in Postgres conversations. Numeric can go out to a lot of decimal places (up to 16,383 digits after the decimal point!) and you get to define the precision. The numeric data type has two qualifiers, precision and scale, to let you define a sensible number of decimal places to use.<p>When you create the type, it will look something like this: <code>NUMERIC(10,5)</code>, where precision is 10 and scale is 5.<ul><li>Precision is the total number of digits before and after the decimal point. 
You need to set this to the highest number of digits you might ever need to store. So here 99,999.99999 is the maximum and -99,999.99999 the minimum.<li>Scale is the number of digits following the decimal point, so this would be 5 decimal places.</ul><p>Choosing a scale means that at some point Postgres will be rounding numbers. If you want to prevent rounding, set your scale higher than you’ll ever need.<p>Compared to integer, numeric values take up more space (they’re variable-length, often 10 or more bytes each). So if space and performance are a huge concern, and decimal precision is not, you might be better off with integer.<h2 id=storing-money><a href=#storing-money>Storing money</a></h2><p>Ok, so we have a data type to store actual cents, dollars, euros, etc. Now how do we store currency? In general it is best practice to store the currency alongside the number itself if you need to store money in multiple currencies at the same time. See ISO 4217 if you want the official currency codes. You can use a <a href=https://www.crunchydata.com/blog/enums-vs-check-constraints-in-postgres>custom check constraint</a> to require that data be entered for only certain currencies. For example, if you’re using dollars, pounds, and euros, that might look like this:<pre><code class=language-pgsql>CREATE TABLE products (
    sku SERIAL PRIMARY KEY,
    name VARCHAR(255),
    price NUMERIC(7,5),
    currency TEXT CHECK (currency IN ('USD', 'EUR', 'GBP'))
);
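
-- A couple of hypothetical sample rows; the CHECK constraint
-- rejects any currency code outside the allowed list:
INSERT INTO products (name, price, currency)
VALUES ('widget', 19.99, 'USD'),
       ('gadget', 24.50, 'EUR');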
</code></pre><p>If you’re working with currency in many formats there’s a lot to consider. In many cases, the conversion work happens at the time of the transaction. Say a price is set in the database in USD but displayed to a user in GBP. You’d have a table like the one above plus a separate table for the GBP exchange rate. Perhaps that table updates via API as currency values fluctuate throughout the day. You may also have prices set in one currency and the price paid in a different one, entered with the amount paid at the time of purchase.<h2 id=functions-for-money><a href=#functions-for-money>Functions for money</a></h2><ul><li><strong>Averages</strong></ul><p>and rounding to the nearest cent<pre><code class=language-pgsql>SELECT ROUND(AVG(price), 2) AS rounded_average_price
FROM products;
</code></pre><ul><li><strong>Rounding up with ceiling</strong></ul><p>totaling and rounding up to the nearest integer<pre><code class=language-pgsql>SELECT CEIL(SUM(price)) AS rounded_total_price
FROM products;
</code></pre><ul><li><strong>Rounding down with floor</strong></ul><p>totaling and rounding down to the nearest integer<pre><code class=language-pgsql>SELECT FLOOR(SUM(price)) AS rounded_total_price
FROM products;
</code></pre><ul><li><strong>Medians</strong></ul><p>Calculating the median can be a bit more involved because PostgreSQL doesn't have a built-in median function, but you can use window functions (or the <code>percentile_cont(0.5)</code> ordered-set aggregate) to calculate this:<pre><code class=language-pgsql>WITH sorted_prices AS (
    SELECT price,
           ROW_NUMBER() OVER (ORDER BY price) as r,
           COUNT(*) OVER () as total_count
    FROM products
)
SELECT ROUND(AVG(price), 2) AS median_price
FROM sorted_prices
-- pick the middle row (or the two middle rows when the count is even)
WHERE r IN ((total_count + 1) / 2, (total_count + 2) / 2);
</code></pre><ul><li><strong>Casting to the money type</strong></ul><p>If you’d like the result formatted with a currency sign, commas, and periods:<pre><code class=language-pgsql>SELECT CEIL(SUM(price))::money AS rounded_total_price_money
FROM products;
</code></pre><p>Note that the currency sign will appear based on your locale settings; <code>show lc_monetary;</code> will tell you what that is, and you can update it to a different currency.<h2 id=summary><a href=#summary>Summary</a></h2><ul><li>Use <code>int</code> or <code>bigint</code> if you can work with whole numbers of cents and you don’t need fractional cents. This saves space and offers better performance. Store your money in cents and convert to a decimal on output. This is also really the preferred method if all currency is the same type. If you’re changing currency often and dealing with fractional cents, move on to <code>numeric</code>.<li>Use <code>decimal</code> / <code>numeric</code> for storing money in fractional cents and even out to many decimal places. If you need to support lots of precision in money, this is the best method, but there’s a storage and performance cost.<li>Store currency separately from the actual monetary values, so you can run calculations on currency conversions.</ul> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">74aa5a5c5d53ba804475d38cb7e69ded765d90378d75a035bf121ab3cff2a5c5</guid>
<pubDate>Wed, 11 Oct 2023 09:00:00 EDT</pubDate>
<dc:date>2023-10-11T13:00:00.000Z</dc:date>
<atom:updated>2023-10-11T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Top 10 Postgres Management Tasks ]]></title>
<link>https://www.crunchydata.com/blog/top-10-postgres-management-tasks</link>
<description><![CDATA[ The must know Postgres management tasks to look at for any scale. Plus a bonus image included showing the recent and upcoming Postgres release schedule. ]]></description>
<content:encoded><![CDATA[ <h2 id=1-add-a-statement-timeout><a href=#1-add-a-statement-timeout>1. Add a statement timeout</a></h2><p>Postgres databases are very compliant: they do what you tell them until you tell them to stop. It is really common for a runaway process, query, or even something a co-worker runs to accidentally start a <a href=https://www.crunchydata.com/blog/control-runaway-postgres-queries-with-statement-timeout>never-ending transaction</a> in your database. This potentially uses up memory, I/O, or other resources.<p>Postgres has no preset default for this. To find out your current setting:<pre><code class=language-pgsql>SHOW statement_timeout;
</code></pre><p>A good rule of thumb is a minute or two:<pre><code class=language-pgsql>ALTER DATABASE mydatabase 
SET statement_timeout = '60s';
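
-- Or override it for just the current session:
SET statement_timeout = '2min';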
</code></pre><p>This is a connection-specific setting, so you (and your application) will need to reconnect for this to take effect on an ongoing basis.<h2 id=2-confirm-you-have-enough-memory><a href=#2-confirm-you-have-enough-memory>2. Confirm you have enough memory</a></h2><p>For application workloads you want your most frequently accessed <a href=https://www.crunchydata.com/blog/postgres-data-flow>Postgres data</a> to be accessible in memory/cache. You can check your cache hit ratio to see how often Postgres is using the cache. Ideally, 98-99% of reads come from the cache. If you see your cache hit ratio below that, you probably need to look at your memory configuration or move to an instance with larger memory.<pre><code class=language-pgsql>SELECT 
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) +  sum(heap_blks_read)) as ratio
FROM 
  pg_statio_user_tables;
</code></pre><p><em>Note: For warehouse or analytical workloads, you will probably have a much lower cache hit ratio.</em><h2 id=3-check-shared-buffers><a href=#3-check-shared-buffers>3. Check shared buffers</a></h2><p>Shared buffers is another key memory check. The default shared_buffers is 128MB. Find your current setting with:<pre><code class=language-pgsql>SHOW shared_buffers;
</code></pre><p>The value should be set to 15% to 25% of the machine’s total RAM. So if you have an 8GB machine, a quarter of that would be 2GB. Since <code>shared_buffers</code> can’t be changed with a plain <code>SET</code>, use <code>ALTER SYSTEM</code> (or edit postgresql.conf):<pre><code class=language-pgsql>ALTER SYSTEM SET shared_buffers = '2GB';
</code></pre><p>Shared buffers is a parameter that requires a restart to take effect.<h2 id=4-use-ssltls-for-data-in-transit><a href=#4-use-ssltls-for-data-in-transit>4. Use SSL/TLS for data in transit</a></h2><p>To find out if you’re currently using ssl:<pre><code class=language-pgsql>SHOW ssl;
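
-- Per-connection detail is in the pg_stat_ssl view:
SELECT pid, ssl, version FROM pg_stat_ssl;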
</code></pre><p>Hopefully you’ll see <code>ssl | on</code>.<p>If not, you’ll need to do some work on the database and application servers to make sure connections are encrypted. See the <a href=https://www.postgresql.org/docs/current/ssl-tcp.html>docs</a> for more.<h2 id=5-set-up-backups><a href=#5-set-up-backups>5. Set up backups</a></h2><p><a href=https://www.crunchydata.com/blog/introduction-to-postgres-backups>Backups</a> are a must-have in database management. There are a few ways to get backup data from Postgres, but here’s the essential info:<ul><li><strong>pg_dump</strong> generates backup files, but it shouldn’t be used as a real backup; it is more of a data manipulation tool<li><strong>pg_basebackup</strong> generates a full binary copy of the database including WAL files, but by itself it is not a complete backup system<li><strong>pgBackRest</strong> is a complete WAL archive and backup tool which can be used for disaster recovery and point-in-time recovery</ul><p>You should be using a full disaster recovery data backup tool or working with a vendor that does it for you.<h2 id=6-stay-on-top-of-postgres-releases-and-upgrade-frequently><a href=#6-stay-on-top-of-postgres-releases-and-upgrade-frequently>6. Stay on top of Postgres releases and upgrade frequently</a></h2><p>The PostgreSQL development community releases about 4 minor versions a year and 1 major version a year.<p>You should be planning to patch your database roughly in line with this schedule. Staying on top of security patches and the most recent versions will make sure you’re running on the most up to date and most efficient software. Here’s a graphic of where we are now and what is coming later this year. Make sure you have plans to upgrade minor versions frequently and major versions annually.<p><img alt="Postgres 14-17 schedule" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/b1c4390d-211c-4126-7cca-77ef3ff8f800/public><h2 id=7-use-pg_stat_statements><a href=#7-use-pg_stat_statements>7. 
Use pg_stat_statements</a></h2><p><a href=https://www.crunchydata.com/blog/tentative-smarter-query-optimization-in-postgres-starts-with-pg_stat_statements>pg_stat_statements</a> has to be the most valuable Postgres tool that’s not part of the out of the box software. I mentioned to some committers at a conference recently that we should get it in core Postgres and they assured me I could have a patch in and rejected before the day was over. <em>To be fair</em> it is a contrib module that generally ships with Postgres so you don’t have to go searching for it.<p>Since pg_stat_statements comes with the Postgres contrib libraries, it’s really easy to add with <code>CREATE EXTENSION pg_stat_statements</code>. You also have to add it to <code>shared_preload_libraries</code> since it uses shared memory. Adding it also requires a restart.<p>Here’s a quick query for checking on your 10 slowest queries. Always a good idea to peek in on these and see if there are any easy fixes to make things work a little faster.<pre><code class=language-pgsql>SELECT
  (total_exec_time / 1000 / 60) as total_min,
  mean_exec_time as avg_ms,
  calls,
  query
FROM pg_stat_statements
ORDER BY 1 DESC
LIMIT 10;
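
-- After tuning, reset the collected stats to measure fresh:
SELECT pg_stat_statements_reset();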
</code></pre><h2 id=8-add-indexes><a href=#8-add-indexes>8. Add indexes</a></h2><p>Indexes are really the foundational key to query performance for Postgres. Without indexes, your database is doing full sequential scans each time you query data, which uses up a lot of memory and precious query time. Adding indexes gives Postgres an easy way to find and sort your data. Using the handy pg_stat_statements view above, you already know which queries are slowest.<p>The pg_indexes view will show you what you’ve got at the moment:<pre><code class=language-pgsql>SELECT * FROM pg_indexes;
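
-- Narrow it to a single table (the table name here is illustrative):
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'products';

-- Adding one is a single statement (hypothetical table and column):
CREATE INDEX products_name_idx ON products (name);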
</code></pre><p>Check out <a href=https://www.crunchydata.com/blog/postgres-indexes-for-newbies>Postgres Indexes for Newbies</a> if you’re just getting started.<h2 id=9-check-for-unused-indexes><a href=#9-check-for-unused-indexes>9. Check for unused indexes</a></h2><p>Indexes are incredibly helpful, but sometimes folks go too far adding indexes for everything. Indexes can take up a fair amount of storage space, and all new writes have to be written to them, so keeping them around if they’re not being used can be bad for performance. The pg_stat_user_indexes view has all the information you need on this, so you can look at index usage with a <code>select * from pg_stat_user_indexes</code>. Here’s a more sophisticated query that excludes unique indexes and primary keys and shows unused indexes ordered by size:<pre><code class=language-pgsql>SELECT schemaname || '.' || relname AS table,
       indexrelname AS index,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS "index size",
       idx_scan as "index scans"
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique
  AND idx_scan &lt; 50
  AND pg_relation_size(relid) > 5 * 8192
ORDER BY
  pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
  pg_relation_size(i.indexrelid) DESC;
</code></pre><p>If you’re using read replicas, don’t forget to check those too before you delete unused indexes. An unused index on the primary might be used on the replica.<h2 id=10-review-your-connection-settings><a href=#10-review-your-connection-settings>10. Review your connection settings</a></h2><p>Postgres has a max_connections setting that defaults at 100. This will show you how many connections your instance is currently configured for:<pre><code class=language-pgsql>SHOW max_connections;
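
-- Changing it requires a restart; as a sketch:
ALTER SYSTEM SET max_connections = 220;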
</code></pre><p>For tuning the max_connections setting in Postgres, you’ll need to know how your application is connecting and how many connections are allowed. You’ll also want to leave a little headroom, say 10%, for other processes or people to connect to the database as well. For example, if you have 4 servers that can use 50 connections each, plus 10%, you’d want to set max_connections to 220.<p>You may also want to look at a <a href=https://www.crunchydata.com/blog/your-guide-to-connection-management-in-postgres>connection pooler</a>. You can check for idle and active connections in your database with the query below.<pre><code class=language-pgsql>SELECT count(*), state
FROM pg_stat_activity
GROUP BY 2;
</code></pre><p>If your connection counts are in the high tens, or you have more idle than active connections, pooling might be a good option.<h2 id=need-more-postgres-tips><a href=#need-more-postgres-tips>Need more Postgres tips?</a></h2><p>We have an <a href=https://www.crunchydata.com/postgres-tips>awesome tips page</a> we’ve been building out. We also just started a new <a href=https://discord.gg/cKdBbJfq>Discord</a> channel to chat about Postgres. Stop by, say hi, and let me know what your Top 10 list for Postgres is. ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">c2df72bf2bc1dd935afb5d9e4af2cca3ce3dce58594178c1b1800b3dae087572</guid>
<pubDate>Tue, 29 Aug 2023 09:00:00 EDT</pubDate>
<dc:date>2023-08-29T13:00:00.000Z</dc:date>
<atom:updated>2023-08-29T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres Subquery Powertools: CTEs, Materialized Views, Window Functions, and LATERAL Join ]]></title>
<link>https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral</link>
<description><![CDATA[ Wondering when to use a Materialized View or a CTE? Elizabeth has summaries, example queries, and comparisons for the most popular subquery tools. ]]></description>
<content:encoded><![CDATA[ <p>Beyond a basic query with a join or two, many queries require extracting subsets of data for comparison, conditionals, or aggregation. Postgres’ use of the SQL language is standards compliant and SQL has a world of tools for subqueries. This post will look at many of the different subquery tools. We’ll talk about the advantages and use cases of each, and provide further reading and tutorials to dig in more.<p>I’ll take a broad definition of “subquery”. Why am I calling all of these subqueries? These are all queries that work on subsets of data. Having read the article title, you might have come here to say that a subquery is a specific thing <em>vs</em> all these other SQL tools I’m talking about. And you’d be right! If you have a better name for this group of tools, let me know.<h2 id=what-is-a-subselect><a href=#what-is-a-subselect>What is a subselect?</a></h2><p>A subquery extracts values from a group of something else. It’s a subset of a set. A basic subquery is a nested select statement inside of another query. The most basic subselects are typically found in WHERE statements.<p>In this example we want to summarize a quantity of SKUs sold that have a certain sale price. The subquery returns the SKU of all products that have a sale price less than 75% of the price. Then the top-level query sums the quantity of each product_order.<pre><code class=language-pgsql>SELECT
   sum(qty) as total_qty,
   sku
FROM
   product_orders
WHERE
   sku in
   (
      SELECT
         sku
      FROM
         products
      WHERE
         sale_price &lt;= price * .75
   )
GROUP BY
   sku;
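
-- The same result, written as a join instead of a subquery:
SELECT
   sum(po.qty) as total_qty,
   po.sku
FROM
   product_orders po
   JOIN
      products p
      ON p.sku = po.sku
WHERE
   p.sale_price &lt;= p.price * .75
GROUP BY
   po.sku;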
</code></pre><p>As with most things SQL, you could build this query a few ways. Most queries that you execute with a join you could also execute with a subquery. Why would you use a subquery instead of a join? Mostly, it depends on your syntax preference and what you want to do with it. So the above could also be written as a plain select statement joining products and product_orders by SKU. See our blog post on <a href=https://www.crunchydata.com/blog/joins-or-subquery-in-postgresql-lessons-learned>choosing joins vs subqueries</a> for more.<h3 id=when-to-use-a-basic-subselect><a href=#when-to-use-a-basic-subselect>When to use a basic subselect</a></h3><ul><li>Your subquery is simple and can go in the WHERE clause</ul><h2 id=what-is-a-postgres-view><a href=#what-is-a-postgres-view>What is a Postgres view?</a></h2><p>A view is a stored query that you access as you would a table. View functionality is quite common across relational database systems. Since a view is a query, it can be data from one table or consolidated data from multiple tables. When called, a view will execute a query or it can be called as a subquery. Commonly, a view is used to save a query that you might be running often from inside your database. Views can be used as a join or a subquery from inside another query.<p>Views are a little more advanced than just a query, they can have separate user settings. You could specify views for certain individuals or applications if you want to show parts of certain tables. Some developers have their applications query a view instead of the base table, so if changes are made to the underlying tables, fewer changes will impact the application code.<p>Using the example we started with above, let’s say we often need to call the SKUs of sale items in other queries, so we want to create a view for that. Here’s sample syntax for a Postgres view. 
We name this view <code>skus_on_sale</code>, which selects SKUs from the product table that have a sale price less than 75% of their original price.<pre><code class=language-pgsql>CREATE VIEW skus_on_sale AS
SELECT
   sku
FROM
   products
WHERE
   sale_price &lt;= price * .75;
</code></pre><p>Previously, we nested a full subquery, this time, we join this view in another query. Logically, this will return the same values as the prior query:<pre><code class=language-pgsql>SELECT
   sum(po.qty) as total_qty,
   sk.sku
FROM
   product_orders po
   JOIN
      skus_on_sale sk
      ON sk.sku = po.sku
GROUP BY
   sk.sku;
</code></pre><h3 id=when-to-use-a-view><a href=#when-to-use-a-view>When to use a view?</a></h3><ul><li>When you want to save a specific query for use later or in other queries<li>You have a security issue or need to show a user or application only the view and not the entire table or tables involved</ul><h2 id=what-is-a-materialized-view><a href=#what-is-a-materialized-view>What is a Materialized View?</a></h2><p>Materialized views are saved queries that you store in the database like you would store a table. Unlike the regular view, a materialized view is stored on disk and information does not need to be re-computed each time it is used. Materialized views can be queried like any other table. Typically materialized views are used for situations where you want to save yourself, or the database, from intensive queries, or for data that is frequently used.<p>The big upside to materialized views is performance. Since the data has been precomputed, materialized views often have better response times than other subquery methods. No matter how complex the query or how many tables are involved, Postgres stores the results as a simple table. Querying it becomes a simple scan or join, and the materialized view hides the complexity of the subquery’s heavy lifting.<p>Here’s an example of a materialized view that will get my SKUs and the shipped quantity by SKU. This shows the most frequently sold SKUs at the top since I’m ordering by qty in descending order.<pre><code class=language-pgsql>CREATE MATERIALIZED VIEW recent_product_sales AS
SELECT
   p.sku,
   SUM(po.qty) AS total_quantity
FROM
   products p
   JOIN
      product_orders po
      ON p.sku = po.sku
   JOIN
      orders o
      ON po.order_id = o.order_id
WHERE
   o.status = 'Shipped'
GROUP BY
   p.sku
ORDER BY
   2 DESC;
</code></pre><p>To improve query performance on materialized views, we can also create indexes on their fields. Here’s an example that indexes the quantity column:<pre><code class=language-pgsql>CREATE INDEX sku_qty ON recent_product_sales(total_quantity);
</code></pre><p>Just like the view, we can call the materialized view in a query. So for example, we can quickly review the top 10 products sold without having to write a subquery to sum or rank.<pre><code class=language-pgsql>SELECT
   sku
FROM
   recent_product_sales LIMIT 10;
</code></pre><p>To update the data held on disk, run a refresh command, <code>REFRESH MATERIALIZED VIEW CONCURRENTLY recent_product_sales;</code>. Use CONCURRENTLY to allow queries to keep reading the existing data while the new data is computed; note that CONCURRENTLY requires a unique index on the materialized view.<p>See our <a href=https://www.crunchydata.com/developers/playground/materialized-views>tutorial on materialized views</a> if you want to see it in action.<h3 id=when-to-use-a-materialized-view><a href=#when-to-use-a-materialized-view>When to use a materialized view?</a></h3><ul><li>Your subquery is intensive, so storing the generated results rather than computing them each time will help overall performance<li>Your data doesn’t need to be updated in real time</ul><h2 id=what-is-a-common-table-expression-cte><a href=#what-is-a-common-table-expression-cte>What is a common table expression (CTE)?</a></h2><p>A <abbr>CTE</abbr>, a <dfn>common table expression</dfn>, allows you to split a complex query into different named parts and reference those parts later in the query.<ul><li>CTEs always start with a WITH statement that creates a subquery first<li>The WITH statement is followed by a select statement that references the CTE; the CTE cannot exist alone</ul><p>Similar to the view statement above, here is a sample CTE that creates a subselect called <code>huge_savings</code>, then uses this in a select statement.<pre><code class=language-pgsql>WITH huge_savings AS
(
   SELECT
      sku
   FROM
      products
   WHERE
      sale_price &lt;= price * .75
)
SELECT
   sum(qty) as total_qty,
   sku
FROM
   product_orders
   JOIN
      huge_savings USING (sku)
GROUP BY
   sku;
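
-- And a minimal recursive CTE, which references itself
-- until a terminating condition (here, n reaching 5):
WITH RECURSIVE counter AS
(
   SELECT 1 AS n
   UNION ALL
   SELECT n + 1 FROM counter WHERE n &lt; 5
)
SELECT n FROM counter;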
</code></pre><p>Often as queries become more and more complex, CTEs are a great way to make queries easier to understand by breaking the data manipulation into sensible parts.<h3 id=what-is-a-recursive-cte><a href=#what-is-a-recursive-cte>What is a recursive CTE?</a></h3><p>A recursive CTE is a CTE that selects against itself. You’ll define an initial condition and then append rows as part of the query. This goes on until a terminating condition is met. We have some <a href=https://www.crunchydata.com/blog/solving-advent-of-code-2022-using-postgres-day-7>awesome examples of recursive CTEs</a> in our Advent of Code series. Recursive CTEs start with <code>WITH RECURSIVE name AS</code>.<h3 id=when-to-use-a-cte><a href=#when-to-use-a-cte>When to use a CTE?</a></h3><ul><li>To separate and define a complicated subquery<li>You have multiple subqueries to include in a larger query<li>Your subquery needs to select against itself, so you need a recursive CTE</ul><h2 id=what-is-a-window-function><a href=#what-is-a-window-function>What is a window function?</a></h2><p>A window function is an aggregate function that looks at a certain set of data, i.e. the window. The aggregate function typically comes first, and the OVER operator defines the group / partition of data you’re looking at. Window functions are often used in subqueries to do averages, summations, max/min, ranks, lead (next row), or lag (previous row).<p>For example, you could write a simple window function to sum product orders by sku. SUM is the aggregation and OVER (PARTITION BY sku) defines the sku set.<pre><code class=language-pgsql>SELECT
   sku,
   SUM(qty) OVER (PARTITION BY sku)
FROM
   product_orders LIMIT 10;
</code></pre><p>We have a nice <a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>tutorial on window functions with CTEs</a> for the birth data set. Here’s one SQL example using the window function lag. We use a CTE to create a count of births per week. Then we use the lag function to return this week’s birth count alongside the birth count for the prior week.<pre><code class=language-pgsql>WITH weekly_births AS
(
   SELECT
      date_trunc('week', day) week,
      sum(births) births
   FROM
      births
   GROUP BY
      1
)
SELECT
   week,
   births,
   lag(births, 1) OVER (
ORDER BY
   week DESC ) prev_births
FROM
   weekly_births;
</code></pre><p>It is worth calling out here that, similar to a window function, the FILTER clause on GROUP BY aggregations is also a powerful SQL tool. I won’t include more here because it’s not a subquery so much as a filter. For more information, Crunchy Data has a <a href=https://www.crunchydata.com/blog/using-postgres-filter>walkthrough on using FILTER with GROUP BY</a>.<h3 id=when-to-use-a-window-function><a href=#when-to-use-a-window-function>When to use a window function?</a></h3><ul><li>If you have a subquery that’s an aggregation, like a sum, rank, or average<li>The subquery applies to a limited set of the overall data to be returned</ul><h2 id=what-is-a-lateral-join><a href=#what-is-a-lateral-join>What is a LATERAL join?</a></h2><p>LATERAL lets you use values from the top-level query in the subquery. So, if you are querying on accounts in the top-level query, you can then reference that in the subquery. When run, LATERAL is kind of like running a subquery for each individual row. LATERAL is commonly used for querying against an array or JSON data, as well as a replacement for the DISTINCT ON syntax. Check out our <a href=https://www.crunchydata.com/developers/playground/lateral-join>LATERAL tutorial</a> to see if you get any ideas about where to add it to your query tools. I would also double-check performance when using LATERAL; in our internal testing, it’s generally not as good as other join options.<p>Below, we use <code>LATERAL</code> to find the last purchase for every account:<pre><code class=language-pgsql>SELECT
   accounts.id,
   last_purchase.*
FROM
   accounts
   INNER JOIN
      LATERAL (
      SELECT
         *
      FROM
         purchases
      WHERE
         account_id = accounts.id
      ORDER BY
         created_at DESC LIMIT 1 ) AS last_purchase
         ON true;
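</code></pre><p>For comparison, this <code>LATERAL</code> join is doing the job that <code>DISTINCT ON</code> is often used for. Here’s a sketch of the equivalent <code>DISTINCT ON</code> query against the same accounts/purchases tables, assuming the same columns as above:<pre><code class=language-pgsql>-- Last purchase per account: keep the first row per
-- account_id after sorting newest-first
SELECT DISTINCT ON (account_id) *
FROM purchases
ORDER BY account_id, created_at DESC;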
</code></pre><p>When to use a LATERAL join?<ul><li>You want to look up data for each row<li>You’re using an array or JSON data in a join</ul><h2 id=summary><a href=#summary>Summary</a></h2><p>Here’s my reference guide for the tools I discussed above:<table><thead><tr><th>what<th>details<th>example<tbody><tr><td>subselect<td>select inside a select<td>SELECT<br /> sum(qty) as total_qty,sku<br />FROM product_orders<br />WHERE<br /> sku in (SELECT sku FROM products WHERE sale_price &#60= price * .75)<br />GROUP BY sku;<tr><td>CTE<td>subqueries with named parts<td>WITH huge_savings AS (<br /> SELECT sku <br /> FROM products <br /> WHERE<br />sale_price &#60= price * .75)<br />SELECT sum(qty) as total_qty, sku <br />FROM product_orders <br />JOIN huge_savings<br />USING (sku)<br />GROUP BY sku;<tr><td>materialized view<td>saved query to a table<td>CREATE MATERIALIZED VIEW recent_product_sales AS <br />SELECT p.sku, SUM(po.qty) AS total_quantity <br />FROM products p <br />JOIN product_orders po ON p.sku = po.sku <br />JOIN orders o ON po.order_id = o.order_id <br />WHERE o.status = 'Shipped' <br />GROUP BY p.sku<br />ORDER BY 2 DESC;<tr><td>window functions<td>aggregations on subsets of data<td>SELECT<br /> sku, SUM(qty) OVER (PARTITION BY sku)<br />FROM<br />product_orders<br />LIMIT 10;<tr><td>lateral join<td>correlated subquery<td>SELECT p.sku, latest.qty<br />FROM products p<br />JOIN LATERAL (<br /> SELECT qty FROM product_orders po<br /> WHERE po.sku = p.sku<br /> ORDER BY po.order_id DESC LIMIT 1) latest<br />ON true;</table><p>If you’re wondering which to use when, you can just get in there and test.
Don’t forget to use your query planning best friend <a href=https://www.crunchydata.com/blog/get-started-with-explain-analyze>EXPLAIN ANALYZE</a> to test query efficiency and plans.<p>Links to our web based Postgres tutorials for more on these topics:<p><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs and Window functions</a><p><a href=https://www.crunchydata.com/developers/playground/materialized-views>Materialized views</a><p><a href=https://www.crunchydata.com/developers/playground/lateral-join>Lateral joins</a> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">fe5f1f1609da04cf1a0237f96a93006d48acef801e6950a5adce4056c0e2d0ca</guid>
<pubDate>Thu, 17 Aug 2023 09:00:00 EDT</pubDate>
<dc:date>2023-08-17T13:00:00.000Z</dc:date>
<atom:updated>2023-08-17T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Tags and Postgres Arrays, a Purrrfect Combination ]]></title>
<link>https://www.crunchydata.com/blog/tags-aand-postgres-arrays-a-purrfect-combination</link>
<description><![CDATA[ Are you using tags in your database with some of your main database properties? Paul reviews some of the ways to store tags in a database from basic relational models to text arrays. He provides some performance tests, sample queries, and guidance on choosing the best path. ]]></description>
<content:encoded><![CDATA[ <p>In a previous life, I worked on a CRM system that really loved the idea of tags. Everything could be tagged, users could create new tags, tags were a key organizing principle of searching and filtering.<p>The trouble was, modeled traditionally, tags can really make for some ugly tables and equally ugly queries. Fortunately, and as usual, Postgres has an answer.<p>Today I’m going to walk through working with tags in Postgres with a sample database of 🐈 cats and their attributes:<ul><li>First, I’ll look at a traditional relational model<li>Second, I’ll look at using an integer array to store tags<li>Lastly, I’ll test text arrays directly embedding the tags alongside the feline information</ul><p>This post is also available as an interactive tutorial in our <a href=https://www.crunchydata.com/developers/playground/tags-and-postgres-arrays>Postgres playground</a>.<h2 id=tags-in-a-relational-model><a href=#tags-in-a-relational-model>Tags in a relational model</a></h2><p>For these tests, we will use a very simple table of 🐈 <code>cats</code>, our entity of interest, and <code>tags</code>, a short table of eleven tags for the cats. In between the two tables, the relationship between tags and cats is stored in the <code>cat_tags</code> table.<p><img alt="diagram of 3 tables called cats, cat_tags, and tags with cat_tags being the join table"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/18c51a25-8dcf-4cb4-259e-bae8b7ef7600/public></p><details><summary>Table Creation SQL</summary><pre><code class=language-pgsql>CREATE TABLE cats (
    cat_id serial primary key,
    cat_name text not null
);

CREATE TABLE cat_tags (
    cat_id integer not null,
    tag_id integer not null,
    unique(cat_id, tag_id)
);

CREATE TABLE tags (
    tag_id serial primary key,
    tag_name text not null,
    unique(tag_name)
);
</code></pre></details><p>I filled the tables with over 1.7M entries for the <code>cats</code>, eleven entries for the <code>tags</code>, and 4.7M entries for the cat/tag relationship.</p><details><summary>Data Generation SQL</summary><pre><code class=language-pgsql>-- Generate random cat names
INSERT INTO cats (cat_name)
WITH
hon AS (
    SELECT *
    FROM unnest(ARRAY['mr', 'ms', 'miss', 'doctor', 'frau', 'fraulein', 'missus', 'governer']) WITH ORDINALITY AS hon(n, i)
),
fn AS (
    SELECT *
    FROM unnest(ARRAY['flopsy', 'mopsey', 'whisper', 'fluffer', 'tigger', 'softly']) WITH ORDINALITY AS fn(n, i)
),
mn AS (
    SELECT *
    FROM unnest(ARRAY['biggles', 'wiggly', 'mossturn', 'leaflittle', 'flower', 'nonsuch']) WITH ORDINALITY AS mn(n, i)
),
ln AS (
    SELECT *
    FROM unnest(ARRAY['smithe-higgens', 'maclarter', 'ipswich', 'essex-howe', 'glumfort', 'pigeod']) WITH ORDINALITY AS ln(n, i)
)
SELECT initcap(concat_ws(' ', hon.n, fn.n, mn.n, ln.n)) AS name
FROM hon, fn, mn, ln, generate_series(1,1000)
ORDER BY random();

-- Fill in the tag names
INSERT INTO tags (tag_name) VALUES
    ('soft'), ('cuddly'), ('brown'), ('red'), ('scratches'), ('hisses'), ('friendly'), ('aloof'), ('hungry'), ('birder'), ('mouser');

-- Generate random tagging. Every cat has 25% chance of getting each tag.
INSERT INTO cat_tags
WITH tag_ids AS (
    SELECT DISTINCT tag_id FROM tags
),
tag_count AS (
    SELECT Count(*) AS c FROM tags
)
SELECT cat_id, tag_id
FROM cats, tag_ids, tag_count
WHERE random() &#60 0.25;

CREATE INDEX cat_tags_x ON cat_tags (tag_id);
</code></pre></details><p>In total, the relational model needs <strong>446MB</strong> for the <code>cats</code>, <code>tags</code> and the tag relationships.<pre><code class=language-pgsql>SELECT pg_size_pretty(
    pg_total_relation_size('cats') +
    pg_total_relation_size('cat_tags') +
    pg_total_relation_size('tags'));
</code></pre><h3 id=performance-of-relational-queries><a href=#performance-of-relational-queries>Performance of relational queries</a></h3><p>There are two standard directions of tag queries:<ul><li>"What are the tags for this particular cat?"<li>"What cats have this particular tag or set of tags?"</ul><h3 id=what-tags-does-this-cat-have><a href=#what-tags-does-this-cat-have>What tags does this cat have?</a></h3><p><img alt="diagram of three tables with a join table between the two"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/4949fc8e-c81d-4655-1fc6-093d0544dd00/public><p>The query is simple, and the performance is very good (<strong>under 1 ms</strong>).<pre><code class=language-pgsql>SELECT tag_name
FROM tags
JOIN cat_tags USING (tag_id)
WHERE cat_id = 444;
</code></pre><h3 id=what-cats-have-this-tag><a href=#what-cats-have-this-tag>What cats have this tag?</a></h3><p><img alt="diagram of 3 tables with a join table between the two"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a98c716c-9a28-4fb9-5524-3ef16d9e9700/public><p>The query is still simple, and the performance is not unexpected (<strong>about 500ms</strong>) for the number of records (431K) that meet the filter criterion.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN tags ta ON (a.tag_id = ta.tag_id)
WHERE ta.tag_name = 'brown';
</code></pre><h3 id=what-cats-have-these-two-tags><a href=#what-cats-have-these-two-tags>What cats have these two tags?</a></h3><p>This is where things start to come off the rails for the relational model, because finding just the records that have <strong>two</strong> particular tags involves quite complicated SQL.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats

JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN cat_tags b ON (a.cat_id = b.cat_id)
JOIN tags ta    ON (a.tag_id = ta.tag_id)
JOIN tags tb    ON (b.tag_id = tb.tag_id)
WHERE ta.tag_name = 'brown' AND tb.tag_name = 'aloof';
</code></pre><p>This query takes around <strong>900ms</strong> to find the 108K cats that are both "brown" and "aloof".<h3 id=what-cats-have-these-three-tags><a href=#what-cats-have-these-three-tags>What cats have these three tags?</a></h3><p>Just so you can see the pattern, here's the three tag version.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN cat_tags b ON (a.cat_id = b.cat_id)
JOIN cat_tags c ON (b.cat_id = c.cat_id)
JOIN tags ta    ON (a.tag_id = ta.tag_id)
JOIN tags tb    ON (b.tag_id = tb.tag_id)
JOIN tags tc    ON (c.tag_id = tc.tag_id)
WHERE ta.tag_name = 'brown'
AND tb.tag_name = 'aloof' AND tc.tag_name = 'red';
</code></pre><p>At this point the decreasing number of records in the result set (27K) is balancing out the growing complexity of the multi-join and query time has only grown to <strong>950ms</strong>.<p>But imagine doing this for five, six or seven tags?<h2 id=tags-in-an-integer-array-model><a href=#tags-in-an-integer-array-model>Tags in an integer array model</a></h2><p>What if we changed our model, and instead of modeling the cat/tag relationship with a correlation table, we model it with an integer array?<pre><code class=language-pgsql>CREATE TABLE cats_array (
    cat_id serial primary key,
    cat_name text not null,
    cat_tags integer[]
);
</code></pre><p>Now our model has just <strong>two</strong> tables, <code>cats_array</code> and <code>tags</code>.<p>We can populate the array-based table from the relational data, so we can compare answers between models.</p><details><summary>Data Generation SQL</summary><pre><code class=language-pgsql>INSERT INTO cats_array
SELECT cat_id, cat_name, array_agg(tag_id) AS cat_tags
FROM cats
JOIN cat_tags USING (cat_id)
GROUP BY cat_id, cat_name;
</code></pre>
</details><p>With this new data model, the size of the required tables has gone down, and we are using only <strong>199MB</strong>.<pre><code class=language-pgsql>SELECT pg_size_pretty(
    pg_total_relation_size('cats_array') +
    pg_total_relation_size('tags'));
</code></pre><p>Once the data are loaded, we need the <strong>most important part</strong> -- an index on the <code>cat_tags</code> integer array.<pre><code class=language-pgsql>CREATE INDEX cats_array_x ON cats_array USING GIN (cat_tags);
</code></pre><p>This <a href=https://www.postgresql.org/docs/current/gin-intro.html>GIN index</a> is a perfect fit for indexing collections (like our array) where there is a fixed and finite number of values in the collection (like our eleven tags). While Postgres ships with an <a href=https://www.postgresql.org/docs/current/intarray.html>intarray</a> extension, the core <a href=https://www.postgresql.org/docs/current/functions-array.html>support for arrays and array indexes</a> has caught up with and rendered much of the extension redundant.<p>As before, we will test common tag-based use cases.<h3 id=what-tags-does-this-cat-have-1><a href=#what-tags-does-this-cat-have-1>What tags does this cat have?</a></h3><p><img alt="diagram of two tables using an array for tags instead of a join table"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/6490dec2-971f-4ff3-5c4d-41b7466c6900/public><p>The query is much less pretty! First we have to look up the <code>tag_id</code> values in <code>cat_tags</code> and use <code>unnest()</code> to expand them out into a relation. Then we're ready to join that relation to the <code>tags</code> table to find the <code>tag_name</code> that corresponds to the <code>tag_id</code>.<pre><code class=language-pgsql>SELECT c.cat_id, c.cat_name, t.tag_name
FROM cats_array c
CROSS JOIN unnest(cat_tags) AS tag_id
JOIN tags t USING (tag_id)
WHERE cat_id = 779;
</code></pre><p>The query hits the <code>cats_array</code> primary key index and returns in the <strong>1ms</strong> range. Great performance!<h3 id=what-cats-have-these-three-tags-1><a href=#what-cats-have-these-three-tags-1>What cats have these (three) tags?</a></h3><p>This is the query that flummoxed our relational model. Let's jump straight to the hardest case and try to find all the cats that are "red", "brown" and "aloof".<pre><code class=language-pgsql>WITH tags AS MATERIALIZED (
    SELECT array_agg(tag_id) AS tag_ids
    FROM tags
    WHERE tag_name IN ('red', 'brown', 'aloof')
    )
SELECT Count(*)
FROM cats_array
CROSS JOIN tags
WHERE cat_tags @> tags.tag_ids;
</code></pre><p>First we have to go into the <code>tags</code> table to make an array of <code>tag_id</code> entries that correspond to our tags. Then we can use the <code>@></code> Postgres <a href=https://www.postgresql.org/docs/current/functions-array.html>array operator</a> to test to see which cats have <code>cat_tags</code> arrays that contain the query array.<p>The query hits the GIN index on <code>cat_tags</code> and returns the count of <strong>26,905</strong> cats in around <strong>120ms</strong>. About <strong>seven times</strong> faster than the same query on the relational model!<h2 id=tags-in-a-text-array-model><a href=#tags-in-a-text-array-model>Tags in a text array model</a></h2><p>So if partially de-normalizing our data from a <code>cats -> cat_tags -> tags</code> model to a <code>cats -> tags</code> model makes things faster... what if we went all the way to the simplest model of all -- just <code>cats</code>?<pre><code class=language-pgsql>CREATE TABLE cats_array_text (
    cat_id serial primary key,
    cat_name text not null,
    cat_tags text[] not null
);
</code></pre><p>Again we can populate this new model directly from the relational model.</p><details><summary>Data Generation SQL</summary><pre><code class=language-pgsql>INSERT INTO cats_array_text
SELECT cat_id, cat_name, array_agg(tag_name) AS cat_tags
FROM cats
JOIN cat_tags USING (cat_id)
JOIN tags USING (tag_id)
GROUP BY cat_id, cat_name;
</code></pre></details><p>The result is <strong>234MB</strong>, about <strong>17%</strong> larger than the integer array version.<pre><code class=language-pgsql>SELECT pg_size_pretty(
    pg_total_relation_size('cats_array_text') +
    pg_total_relation_size('tags'));
</code></pre><p>Now every cat has the tag names right in the record.<h3 id=what-tags-does-this-cat-have-2><a href=#what-tags-does-this-cat-have-2>What tags does this cat have?</a></h3><p><img alt="a table showing a cat&#39s name and the associated tags"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/be47732c-0232-4c16-20f6-82447a67ab00/public><p>Since there's only one table with all the data, finding the tags of a cat is ridiculously easy.<pre><code class=language-pgsql>SELECT cat_name, cat_tags
FROM cats_array_text
WHERE cat_id = 888;
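</code></pre><p>The flip side of this simplicity is that there is no longer a separate table listing the tags themselves. A sketch of how you could recover the full tag list by unnesting the arrays — note this scans the whole table:<pre><code class=language-pgsql>-- List every distinct tag currently in use
SELECT DISTINCT unnest(cat_tags) AS tag_name
FROM cats_array_text
ORDER BY 1;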
</code></pre><h3 id=what-cats-have-these-three-tags-2><a href=#what-cats-have-these-three-tags-2>What cats have these (three) tags?</a></h3><p>Once again, in order to get good performance we need a GIN index on the array we will be searching.<pre><code class=language-pgsql>CREATE INDEX cats_array_text_x ON cats_array_text USING GIN (cat_tags);
</code></pre><p>The query to find the cats are "red", "brown" and "aloof" is also wonderfully simple.<pre><code class=language-pgsql>SELECT Count(*)
FROM cats_array_text
WHERE cat_tags @> ARRAY['red', 'brown', 'aloof'];
</code></pre><p>This query takes about the same amount of time as the integer array based query, <strong>120ms</strong>.<h2 id=wins-and-losses><a href=#wins-and-losses>Wins and losses</a></h2><p>So are array-based models the final answer for tag-oriented query patterns in Postgres? On the plus side, the array models are:<ul><li>Faster to query;<li>Smaller to store; and,<li>Simpler to query!</ul><p>Those are really compelling positives!<p>However, there are some caveats to keep in mind when working with these models:<ul><li>For the text-array model, there's no general place to look up all tags. For a list of all tags, you will have to scan the entire <code>cats_array_text</code> table.<li>For the integer-array model, there's no way to create a simple constraint that guarantees that integers used in the <code>cat_tags</code> integer array exist in the <code>tags</code> table. You can work around this with a <code>TRIGGER</code>, but it's not as clean as a relational foreign key constraint.<li>For both array models, the SQL can get a little crufty when the tags have to be un-nested to work with relational querying.</ul> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">7b3e6ca22c35ed59da64ba51490a49589fc646dc2c1156543342d4fdff98ed7a</guid>
<pubDate>Mon, 22 May 2023 09:00:00 EDT</pubDate>
<dc:date>2023-05-22T13:00:00.000Z</dc:date>
<atom:updated>2023-05-22T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Working with Time in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/working-with-time-in-postgres</link>
<description><![CDATA[ A primer on working with time in Postgres. Covers data types, query formats, intervals, overlaps, range types, indexing, and roll ups. ]]></description>
<content:encoded><![CDATA[ <p>Since humans first started recording data, they’ve been keeping track of time. Time management is one of those absolutely crucial database tasks and Postgres does a great job of it. Postgres has a lot of options for storing and querying time so I wanted to provide an overview of some of the most common needs for storing and retrieving time data.<p>This blog is also available as a <a href=https://www.crunchydata.com/developers/playground/working-with-time-in-postgres>hands on tutorial</a> running in your local browser via our Postgres playground.<p>If you ask Postgres what time it is,<pre><code class=language-pgsql>SELECT now();
</code></pre><p>You’ll get something like<pre><code class=language-text>now
-----------------------------
 2023-05-15 18:23:58.5603+00
</code></pre><p>The default time representation here is a full timestamp string, containing the date, time, and a reference to the timezone. In this case, the +00 offset means the time is equal to UTC. UTC has long been the standard time reference, succeeding Greenwich Mean Time (if you’re as old as I am).<p>If I want to know the time in my local timezone:<pre><code class=language-pgsql>SELECT now() AT TIME ZONE 'America/Chicago';
</code></pre><p>The full list of timezone names you can use is stored in a system table and can be retrieved with <code>select * from pg_timezone_names;</code><h2 id=data-types-for-time><a href=#data-types-for-time>Data types for time</a></h2><p>Postgres has a <code>TIME</code> data type, with and without a time zone, if you want to store time separately from a date. This is generally not recommended since in most cases time requires an accompanying date. There’s a <code>TIMESTAMP</code> datatype. Adding a timezone to TIMESTAMP gives you <code>TIMESTAMP WITH TIME ZONE</code>, aliased as <code>TIMESTAMPTZ</code>. Without a doubt <strong>TIMESTAMPTZ</strong> is going to be the MVP of Postgres time storage. If you store data with the full date, time, and timezone you’ll never have to worry about the server time, what time the user entered the data, what time it is where you’re querying data, or any of those crazy calculations. And you or your application can pull out the time and display it in whatever local user timezone you need.<p>When working with Postgres, you’ll also see the epoch, which is how seconds are represented. This is not a timestamp, it’s a number (a double precision floating-point value, 64 bits) representing the number of seconds since January 1st, 1970. This can be used if you need a specific comparison or need time in that format. Postgres can easily convert back and forth between timestamps and epochs. To find the current epoch:<pre><code class=language-pgsql>SELECT EXTRACT (EPOCH FROM now());
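</code></pre><p>Going the other direction, <code>to_timestamp()</code> turns an epoch value back into a <code>timestamptz</code>:<pre><code class=language-pgsql>-- Convert epoch seconds back to a timestamp with time zone
SELECT to_timestamp(1684167778);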
</code></pre><h2 id=time-formats--functions><a href=#time-formats--functions>Time formats &#38 functions</a></h2><p>I’m an American midwesterner so of course, I would write Bastille Day like - July 14th, 1789 or 7-14-1789. Of course all my French friends would write it 14 July 1789 or 14-07-1789. And while I’d love to debate with you all over beers about the best way to do this, ISO has some standards for time formats, namely ISO 8601, which states that dates will be read like this: 1789-07-14 17:30:00.000, year-month-day-time. This date format is what’s used in TIMESTAMP and what you’ll see most often in the database and engineering world.<p>Time storage has the ISO 8601 best practice; however, depending on your end users or business needs, you may want to change the time format in your queries when they’re output. To change the time format of a query you can use the <code>TO_CHAR</code> function, which formats a timestamp as text.<pre><code class=language-pgsql>SELECT TO_CHAR(NOW(), 'DY, Mon dd, yyyy HH24:MI:SS OF');
</code></pre><p><code>TO_CHAR</code> lets you convert a timestamp to text. Then using some <a href=https://www.postgresql.org/docs/current/functions-formatting.html>formatting functions</a>, I can pull out the day of the week, an American date format, and UTC time. The result of that query would be:<pre><code class=language-text> MON, May 15, 2023 14:22:28 +00
</code></pre><h2 id=time-intervals><a href=#time-intervals>Time intervals</a></h2><p>Now that we’re fancy and can get dates in any format we want, how about calculating intervals and lapsed time in different formats?<p>We’ve loaded in a sample table with some train schedule data, take a peek<pre><code class=language-pgsql>SELECT * FROM train_schedule LIMIT 3;
</code></pre><p>and it looks like this<pre><code class=language-text> trip_id | track_number | train_number |  scheduled_departure   |   scheduled_arrival    |    actual_departure    |     actual_arrival
---------+--------------+--------------+------------------------+------------------------+------------------------+------------------------
       1 |            1 |          683 | 2023-04-29 11:15:00+00 | 2023-04-29 12:35:00+00 | 2023-04-29 11:21:00+00 | 2023-04-29 12:52:00+00
       2 |            1 |          953 | 2023-04-29 13:49:00+00 | 2023-04-29 15:10:00+00 | 2023-04-29 13:50:00+00 | 2023-04-29 15:17:00+00
       3 |            1 |          140 | 2023-04-29 15:06:00+00 | 2023-04-29 15:23:00+00 | 2023-04-29 15:06:00+00 | 2023-04-29 15:22:00+00
(3 rows)
</code></pre><p>Let’s say you are storing timestamp fields like these arrival times. To find the lower and upper bounds of arrival times in your data set you would do:<pre><code class=language-pgsql>SELECT min(actual_arrival) FROM train_schedule;
</code></pre><p>and<pre><code class=language-pgsql> SELECT max(actual_arrival) FROM train_schedule;
</code></pre><p>To find the interval between them:<pre><code class=language-pgsql>SELECT
(SELECT max(actual_arrival) FROM train_schedule)
- (SELECT min(actual_arrival)
FROM train_schedule);
</code></pre><p>Ok, so we have about 10 days of train schedule information in here.<p>Taking this a step further, if I want to look at the intervals between scheduled and actual arrival times, I can create an arrival_delta in a subquery that computes actual arrival minus scheduled arrival.<pre><code class=language-pgsql>SELECT avg(arrival_delta)
FROM (SELECT scheduled_arrival, actual_arrival,
	actual_arrival - scheduled_arrival AS arrival_delta
FROM train_schedule)q;
</code></pre><p>You can also add a filter to find interval sizes. If we build on the above query but look only at arrivals that were more than 10 minutes later than their originally scheduled time, we can add the filter <code>> INTERVAL '10 minutes'</code>.<pre><code class=language-pgsql>SELECT avg(arrival_delta)
FROM (select scheduled_arrival, actual_arrival,
actual_arrival - scheduled_arrival AS arrival_delta
FROM train_schedule WHERE (actual_arrival - scheduled_arrival)
> INTERVAL '10 minutes')q;
</code></pre><h3 id=overlapping--intersecting-time><a href=#overlapping--intersecting-time>Overlapping / intersecting time</a></h3><p>What if I wanted to find all of the trains that were running at a specific time - or right now? You can use the OVERLAPS operator together with an INTERVAL.<pre><code class=language-pgsql>SELECT count(*) FROM train_schedule
WHERE (actual_departure, actual_arrival)
OVERLAPS (now(), now() - INTERVAL '2 hours');
</code></pre><h3 id=time-range-types><a href=#time-range-types>Time Range Types</a></h3><p>Postgres also supports working with time ranges, both as a single range and even as <a href=https://www.crunchydata.com/blog/better-range-types-in-postgres-14-turning-100-lines-of-sql-into-3>multiple ranges</a>. The single range type over timestamptz is called <code>tstzrange</code> and the multirange version is <code>tstzmultirange</code>.<p>For example, if we wanted to create a table in our train database that has some peak travel season fares, we could do:<pre><code class=language-pgsql>CREATE TABLE fares
(peak_id int,
peak_name text,
peak_times tstzmultirange,
fare_change numeric);

INSERT INTO fares(peak_id, peak_name, peak_times, fare_change)
VALUES (1, 'holiday', '{[2023-12-24 00:00, 2023-12-27 00:00],[2023-12-31 00:00, 2024-01-02 00:00]}', 50),
(2, 'peak_summer', '{[2023-05-27 00:00, 2023-05-30 00:00],[2023-07-03 00:00, 2023-08-30 00:00]}', 30);
</code></pre><p>And now to query the multirange, Postgres has a special containment operator for this, <code>@></code>. Let’s see if travel today is during peak time.<pre><code class=language-pgsql>SELECT * from fares WHERE peak_times @> now();
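</code></pre><p>The single-range <code>tstzrange</code> type works with the same <code>@></code> containment operator. A quick sketch checking one timestamp against a single range literal:<pre><code class=language-pgsql>-- Is Christmas morning inside the holiday range?
SELECT tstzrange('2023-12-24 00:00', '2023-12-27 00:00')
       @> '2023-12-25 10:00'::timestamptz;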
</code></pre><h2 id=indexing-time-columns><a href=#indexing-time-columns>Indexing time columns</a></h2><p>Anytime you’re querying time a lot, you’ll want to add an index so that time lookups are faster. Timestamp column indexes work well with the traditional B-tree index as well as <a href=https://www.crunchydata.com/blog/postgresql-brin-indexes-big-data-performance-with-minimal-storage>BRIN</a>. In general, if you have tons of data entered sequentially, a <a href=https://www.crunchydata.com/blog/postgres-indexing-when-does-brin-win>BRIN index is probably recommended</a>.<p>A B-tree would be created like this:<pre><code class=language-pgsql>CREATE INDEX btree_actual_departure ON train_schedule (actual_departure);
</code></pre><p>And a BRIN<pre><code class=language-pgsql>CREATE INDEX brin_sequential ON train_schedule USING BRIN (actual_departure);
</code></pre><h2 id=roll-ups><a href=#roll-ups>Roll ups</a></h2><p>So let’s say you have quite a bit of time data. Using the <code>date_trunc</code> function you can easily truncate timestamp data to the day or month and then use a query to count by that truncated date.<p>If I want to find a count of train trips per day in my train data, that would look like this:<pre><code class=language-pgsql>SELECT
date_trunc('day', train_schedule.actual_departure) d,
COUNT (actual_departure)
FROM
train_schedule
GROUP BY
d
ORDER BY
d;
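</code></pre><p>The same roll up works at any granularity <code>date_trunc</code> supports; a per-month count is a one-word change:<pre><code class=language-pgsql>SELECT
date_trunc('month', train_schedule.actual_departure) m,
COUNT (actual_departure)
FROM
train_schedule
GROUP BY
m
ORDER BY
m;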
</code></pre><p>Roll ups won’t be the only way to deal with lots and lots of time data. <a href=https://www.crunchydata.com/blog/native-partitioning-with-postgres>Partitioning</a> can be really helpful once you have lots of time data that can be easily sectioned off. If you’re getting into <a href=https://www.crunchydata.com/blog/thinking-fast-vs-slow-with-your-data-in-postgres>measuring analytics or metrics</a>, there are some options for that as well, like hyperloglog.<h2 id=summary><a href=#summary>Summary</a></h2><p>Thanks for spending your <em>time</em> learning about <em>time</em> ;) Some takeaways:<ul><li>store time in UTC +/- values<li><code>timestamptz</code> is your bff<li><code>to_char</code> and all of the formatting functions let you query time however you want<li>Postgres has lots of functions for <code>interval</code> and <code>overlap</code> so you can look at data that intersects<li><code>date_trunc</code> can be really helpful if you want to roll up time fields and count by day or month</ul> ]]></content:encoded>
<category><![CDATA[ Postgres Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">3d28b35bfa0321dfa971affa24e98de6a0bb0c102c0dab39ad00a7d90d02b8d1</guid>
<pubDate>Mon, 15 May 2023 12:00:00 EDT</pubDate>
<dc:date>2023-05-15T16:00:00.000Z</dc:date>
<atom:updated>2023-05-15T16:00:00.000Z</atom:updated></item></channel></rss>