<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/tutorials/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/tutorials</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/tutorials</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Mon, 16 Sep 2024 10:00:00 EDT</pubDate>
<dc:date>2024-09-16T14:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Window Functions for Data Analysis with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/window-functions-for-data-analysis-with-postgres</link>
<description><![CDATA[ Elizabeth has some sample queries and explanations for window functions like running totals, lag/lead, rolling averages, and more. ]]></description>
<content:encoded><![CDATA[ <p>SQL makes sense when it's working on a single row, or even when it's aggregating across multiple rows. But what happens when you want to compare between rows of something you've already calculated? Or make groups of data and query those? Enter window functions.<p>Window functions tend to confuse people - but they’re a pretty awesome tool in SQL for data analytics. The best part is that you don’t need charts, fancy BI tools or AI to get some actionable and useful data for your stakeholders. Window functions let you quickly:<ul><li>Calculate running totals<li>Provide summary statistics for groups/partitions of data<li>Create rankings<li>Perform lag/lead analysis, i.e. comparing each row with a row before or after it<li>Compute moving/rolling averages</ul><!--more--><p>In this post, I will show various types of window functions and how they apply to common situations. I’m using a super simple e-commerce schema so you can follow along with the kinds of queries I’m going to run with window functions.<blockquote><p>This post is available as an <a href=https://www.crunchydata.com/developers/playground/postgres-window-functions-for-data-analysis>interactive tutorial</a> as well in our Postgres playground.</blockquote><p><strong>The <code>OVER</code> clause</strong><p>The <code>OVER</code> clause of a window function is what creates the window. Annoyingly the word window appears nowhere in any of the functions. 😂 Typically <code>OVER</code> is preceded by another function, either an aggregate or a dedicated window function like <code>RANK</code> or <code>LAG</code>. There’s also often a frame clause to specify which rows you’re looking at, like <code>ROWS BETWEEN 6 PRECEDING AND CURRENT ROW</code>.<p><strong>Window functions vs where clauses</strong><p>Window functions kind of feel like a where clause at first, since they’re looking at a set of data. But they’re really different. Window functions are more for times when you need to look across sets of data or across groups of data. 
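<p>For example, here is the contrast in miniature, sketched against the same <code>orders</code> table used throughout this post. A <code>GROUP BY</code> collapses the detail rows away, while a window function keeps every row and adds the aggregate alongside it:<pre><code class=language-sql>-- GROUP BY: one row per customer, the individual orders are gone
SELECT customer_id, SUM(total_amount) AS customer_total
FROM orders
GROUP BY customer_id;

-- Window function: every order row survives, with the per-customer
-- total repeated alongside each one
SELECT order_id, customer_id, total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;
</code></pre>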
There are cases where you could use either. In general:<ul><li>Use <strong><code>WHERE</code> clause</strong> when you need to filter rows based on a condition.<li>Use <strong>window functions</strong> when you need to perform calculations across rows that remain after filtering, without removing any rows from the result set.</ul><h2 id=running-totals><a href=#running-totals>Running totals</a></h2><p>Here’s a simple place to get started. Let’s ask for orders, customer data, order totals, and then a running total of orders. This will show us our total orders across a date range.<pre><code class=language-sql>SELECT
    SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
    order_date,
    order_id,
    customer_id,
    total_amount
FROM
    orders
ORDER BY
    order_date;
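
-- A variant of the same idea: adding PARTITION BY restarts the running
-- total for each customer instead of accumulating across all orders:
--   SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date)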
</code></pre><pre><code class=language-sql> running_total |     order_date      | order_id | customer_id | total_amount
---------------+---------------------+----------+-------------+--------------
        349.98 | 2024-08-21 10:00:00 |       21 |           1 |       349.98
       1249.96 | 2024-08-22 11:30:00 |       22 |           2 |       899.98
       1284.94 | 2024-08-23 09:15:00 |       23 |           3 |        34.98
       1374.93 | 2024-08-24 14:45:00 |       24 |           4 |        89.99
       1524.92 | 2024-08-25 08:25:00 |       25 |           5 |       149.99
       1589.90 | 2024-08-26 12:05:00 |       26 |           6 |        64.98
</code></pre><p>What's happening here is that each frame of data is the existing row plus the rows before it. This sort of does the calculation on one slice at a time, which you might see in the docs called a virtual table. Here's a diagram to get the general idea of how each frame of data is a set of rows, aggregated by the function with the <code>SUM OVER</code>.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/fbe5bd99-2a10-4325-9399-27950f854900/public><h2 id=first-and-last-values><a href=#first-and-last-values>First and last values</a></h2><p>Window functions can look at groups of data, partitioned by something like a customer ID, and give you each customer's first and last order dates alongside every individual order.<pre><code class=language-sql>SELECT
    FIRST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS first_order_date,
    LAST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS last_order_date,
    o.order_id,
    o.customer_id,
    o.order_date,
    o.total_amount
FROM
    orders o
ORDER BY
    o.order_date DESC;
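
-- Gotcha: with the default window frame, LAST_VALUE() only sees rows up
-- to the current row, so last_order_date above just echoes each row's own
-- order_date. To get the partition's true last value, widen the frame:
--   LAST_VALUE(o.order_date) OVER (
--       PARTITION BY o.customer_id ORDER BY o.order_date
--       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--   ) AS last_order_date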
</code></pre><pre><code class=language-sql>  first_order_date   |   last_order_date   | order_id | customer_id |     order_date      | total_amount
---------------------+---------------------+----------+-------------+---------------------+--------------
 2024-08-30 17:50:00 | 2024-09-19 18:50:00 |       50 |          10 | 2024-09-19 18:50:00 |       149.98
 2024-08-29 13:10:00 | 2024-09-18 14:10:00 |       49 |           9 | 2024-09-18 14:10:00 |       199.98
 2024-08-28 10:20:00 | 2024-09-17 11:20:00 |       48 |           8 | 2024-09-17 11:20:00 |       139.99
 2024-08-27 16:35:00 | 2024-09-16 17:35:00 |       47 |           7 | 2024-09-16 17:35:00 |       249.98
 2024-08-26 12:05:00 | 2024-09-15 13:05:00 |       46 |           6 | 2024-09-15 13:05:00 |        89.98
</code></pre><h2 id=using-date_trunc-group-by-ctes-with-window-functions><a href=#using-date_trunc-group-by-ctes-with-window-functions>Using date_trunc GROUP BY CTEs with Window Functions</a></h2><p><code>date_trunc</code> is an incredibly handy Postgres function that truncates timestamps to a unit of time: hour, day, week, month. When combined with a <code>GROUP BY</code> in a CTE, you can create really easy summary statistics by day, month, week, year, etc.<p>When you combine the date_trunc GROUP BY partitions with window functions, some pretty magical stuff happens, and you can get ready-made summary statistics straight out of your database. In my opinion this is one of the most powerful features of Postgres window functions that really gets you to the next level.<p>Here’s an example query that starts with a CTE, using date_trunc to sum orders into daily totals. The second part of the query, the window function, ranks the daily totals in descending order, putting the best day of sales first.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    RANK() OVER (
        ORDER BY daily_total_sales DESC
    ) AS sales_rank
FROM
    DailySales
ORDER BY
    sales_rank;
</code></pre><pre><code class=language-sql>     sales_date      | daily_total_sales | sales_rank
---------------------+-------------------+------------
 2024-09-02 00:00:00 |           2419.97 |          1
 2024-09-01 00:00:00 |           1679.94 |          2
 2024-08-22 00:00:00 |            899.98 |          3
 2024-09-07 00:00:00 |            699.95 |          4
 2024-09-10 00:00:00 |            659.96 |          5
 2024-09-09 00:00:00 |            499.94 |          6
 2024-09-06 00:00:00 |            409.94 |          7
 2024-08-30 00:00:00 |            349.99 |          8
</code></pre><p>I should note that this uses <code>RANK</code>, one of the really helpful ranking functions, inside a window function.<h3 id=lag-analysis><a href=#lag-analysis><code>LAG</code> Analysis</a></h3><p>Let’s do more with our date_trunc CTEs. Now that we have our data grouped by day, we can use a window function to calculate changes between these groups. For example, we could look at the difference in sales from the prior day. In this example, <code>LAG</code> looks at the sales date and creates a comparison with the previous day.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    LAG(daily_total_sales) OVER (
        ORDER BY sales_date
    ) AS previous_day_sales,
    daily_total_sales - LAG(daily_total_sales) OVER (
        ORDER BY sales_date
    ) AS sales_difference
FROM
    DailySales
ORDER BY
    sales_date;
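
-- LEAD() is the mirror image of LAG(), looking forward instead of back:
--   LEAD(daily_total_sales) OVER (ORDER BY sales_date) AS next_day_sales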
</code></pre><pre><code class=language-sql>
     sales_date      | daily_total_sales | previous_day_sales | sales_difference
---------------------+-------------------+--------------------+------------------
 2024-08-21 00:00:00 |            349.98 |                    |
 2024-08-22 00:00:00 |            899.98 |             349.98 |           550.00
 2024-08-23 00:00:00 |             34.98 |             899.98 |          -865.00
 2024-08-24 00:00:00 |             89.99 |              34.98 |            55.01
 2024-08-25 00:00:00 |            149.99 |              89.99 |            60.00
 2024-08-26 00:00:00 |             64.98 |             149.99 |           -85.01
</code></pre><p><code>LEAD</code> works the same way, looking forward in the data set.<h2 id=rolling-averages><a href=#rolling-averages>Rolling averages</a></h2><p>Using our same daily groups, we can also make a rolling average. The <code>AVG</code> function takes a frame clause, <code>ROWS BETWEEN 6 PRECEDING AND CURRENT ROW</code>, for a rolling 7-day sales average.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    AVG(daily_total_sales) OVER (
        ORDER BY sales_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_average_7_days
FROM
    DailySales
ORDER BY
    sales_date
LIMIT 10;
</code></pre><pre><code class=language-sql>     sales_date      | daily_total_sales | rolling_average_7_days
---------------------+-------------------+------------------------
 2024-08-21 00:00:00 |            349.98 |   349.9800000000000000
 2024-08-22 00:00:00 |            899.98 |   624.9800000000000000
 2024-08-23 00:00:00 |             34.98 |   428.3133333333333333
 2024-08-24 00:00:00 |             89.99 |   343.7325000000000000
 2024-08-25 00:00:00 |            149.99 |   304.9840000000000000
 2024-08-26 00:00:00 |             64.98 |   264.9833333333333333
 2024-08-27 00:00:00 |            249.98 |   262.8400000000000000
 2024-08-28 00:00:00 |            129.99 |   231.4128571428571429
 2024-08-29 00:00:00 |            179.98 |   128.5557142857142857
 2024-08-30 00:00:00 |            349.99 |   173.5571428571428571
(10 rows)
</code></pre><h2 id=n-tiles-with-window-functions><a href=#n-tiles-with-window-functions>N-tiles with Window Functions</a></h2><p>The <code>NTILE</code> function is a window function in SQL that is used to divide a result set into a specified number of roughly equal parts, known as tiles or buckets. By assigning a unique tile number to each row, n-tiles helps categorize and analyze data distribution within a dataset. This function is particularly useful in statistical and financial analysis for understanding how data is distributed across different segments, identifying trends, and making comparisons between groups with different characteristics.<p>For example, using <code>NTILE(4)</code> divides the data into four quartiles, ranking each row into one of four groups. The descending part here makes sure that quartile 1 is the top quarter and so on.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    NTILE(4) OVER (
        ORDER BY daily_total_sales DESC
    ) AS sales_quartile
FROM
    DailySales
ORDER BY
    sales_date;
</code></pre><pre><code class=language-sql>     sales_date      | daily_total_sales | sales_quartile
---------------------+-------------------+----------------
 2024-09-06 00:00:00 |            409.94 |              1
 2024-08-30 00:00:00 |            349.99 |              1
 2024-08-21 00:00:00 |            349.98 |              2
 2024-09-08 00:00:00 |            349.96 |              2
</code></pre><h2 id=window-functions-resources><a href=#window-functions-resources>Window functions resources</a></h2><p>We’re big fans of Postgres functions at Crunchy Data, so check out our resources on Window functions in our <a href=https://www.crunchydata.com/developers/tutorials>Postgres Tutorials</a>:<ul><li><p><a href=https://www.crunchydata.com/developers/playground/postgres-window-functions-for-data-analysis>Postgres Window Functions for Data Analysis</a> (this post)<li><p><a href=https://www.crunchydata.com/blog/percentage-calculations-using-postgres-window-functions>Percentage Calculations with Window Functions</a><li><p><a href=https://www.crunchydata.com/developers/playground/summaries-with-aggregate-filters-and-windows>Summaries with Aggregate Filters and Windows</a><li><p><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs and Window Functions with US Birth Data</a></ul> ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermaLink="false">22577ce775ab3d802aaf8515f9aa43e311bf0a7160f1411e97401aaf26ecf3da</guid>
<pubDate>Mon, 16 Sep 2024 10:00:00 EDT</pubDate>
<dc:date>2024-09-16T14:00:00.000Z</dc:date>
<atom:updated>2024-09-16T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Six Degrees of Kevin Bacon - Postgres Style ]]></title>
<link>https://www.crunchydata.com/blog/six-degrees-of-kevin-bacon-postgres-style</link>
<description><![CDATA[ Paul Ramsey has some great examples of Postgres network analysis and graph theory in this sample code for playing the Kevin Bacon game. Both pgRouting and recursive CTE are used to solve graphing relationships. ]]></description>
<content:encoded><![CDATA[ <p>Back in the 1990s, before anything was cool (or so my children tell me) and at the dawn of the Age of the Meme, a couple of college students invented a game they called the "<a href=https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon>Six Degrees of Kevin Bacon</a>".<p>The conceit behind the <em>Six Degrees of Kevin Bacon</em> was that actor <a href=https://en.wikipedia.org/wiki/Kevin_Bacon>Kevin Bacon</a> could be connected to any other actor, via a chain of association of no more than six steps.<p>Why Kevin Bacon? The choice was more or less arbitrary, but the students had noted that Bacon said in an interview that "he had worked with everybody in Hollywood or someone who's worked with them" and took that statement as a challenge.<h2 id=bacon-number><a href=#bacon-number>Bacon Number</a></h2><p>The number of steps necessary to get from some actor to Kevin Bacon is their "Bacon Number".<p>For example, comedy legend <a href=https://en.wikipedia.org/wiki/Steve_Martin>Steve Martin</a> has a Bacon Number of 1, since Kevin Bacon appeared with him in the 1987 road trip comedy <a href=https://en.wikipedia.org/wiki/Planes,_Trains_and_Automobiles>Planes, Trains and Automobiles</a>.<p><img alt=Bacon-Martin loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/5b6ba53e-d929-4705-5c68-c0ac37050d00/public><p>Zendaya has a Bacon Number of 2. 
In 2017 she appeared with <a href=https://en.wikipedia.org/wiki/Marisa_Tomei>Marisa Tomei</a> in <a href=https://en.wikipedia.org/wiki/Spider-Man:_Homecoming>Spider-Man: Homecoming</a>, and in 2005 Tomei appeared with Bacon in <a href=https://en.wikipedia.org/wiki/Loverboy_(2005_film)>Loverboy</a> (which Bacon also directed).<p><img alt=Bacon-Zendaya loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/c7994862-fe37-44ea-a287-77228a585600/public><h2 id=imdb-data><a href=#imdb-data>IMDB Data</a></h2><p>The challenge of the original '90s <em>Six Degrees of Kevin Bacon</em> was to link up two actors using only the knowledge in your head. This seems improbably difficult to me, but perhaps people were smarter in the '90s.<p>In our modern age we don't need to be smart, we can attack the Bacon Number problem by combining data and algorithms.<p>The data half of the problem is relatively straightforward -- the <a href=https://www.imdb.com/>Internet Movie Database</a> (aka IMDB) allows <a href=https://datasets.imdbws.com/>direct download</a> of the information we need.<p>In particular, the<ul><li><code>name.basics.tsv.gz</code> file for actor names and jobs<li><code>title.basics.tsv.gz</code> file for movie names and dates<li><code>title.principals.tsv.gz</code> file for relationships between actors and movies</ul><p>The IMDB database files actually include information about every job on a film (writers, directors, producers, casting, etc, etc) and we are only interested in <strong>actors</strong> for the Kevin Bacon game.</p><details><summary>ETL process to download and process raw data</summary><pre><code class=language-sql>
CREATE SCHEMA imdb;

------------------------------------------------------------------------
-- Load the names from the raw file
--

CREATE TABLE imdb.name_basics (
    nconst  text,
    primaryName  text,
    birthYear integer,
    deathYear integer,
    primaryProfession text,
    knownForTitles text
);

COPY imdb.name_basics
    FROM program 'curl https://datasets.imdbws.com/name.basics.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );
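
-- Note: COPY ... FROM PROGRAM runs on the database server, and requires
-- superuser or membership in the pg_execute_server_program role.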

CREATE INDEX name_basics_pk ON imdb.name_basics (nconst);

------------------------------------------------------------------------
-- Strip down the raw to just actors and actresses in an 'actors' table
--

CREATE TABLE actors AS
    SELECT int4(substring(nconst, 3,15)) AS actor_id,
        nconst,
        primaryname AS name,
        int2(birthyear) AS birthyear,
        int2(deathyear) AS deathyear
    FROM imdb.name_basics
    WHERE (primaryProfession ~ '^act' OR primaryProfession ~ ',act')
    AND birthyear IS NOT NULL;

CREATE UNIQUE INDEX actors_pk ON actors (actor_id);
CREATE INDEX actors_nconst_x ON actors (nconst);

------------------------------------------------------------------------
-- Load the movie titles from the raw file
--

CREATE TABLE imdb.title_basics (
    tconst text,
    titleType text,
    primaryTitle text,
    originalTitle text,
    isAdult boolean,
    startYear integer,
    endYear integer,
    runtimeMinutes integer,
    genres text
    );

COPY imdb.title_basics
    FROM program 'curl https://datasets.imdbws.com/title.basics.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

------------------------------------------------------------------------
-- Strip down the raw table to just movies, no tv shows, etc.
--

CREATE TABLE movies AS
    SELECT int4(substring(tconst, 3,15)) AS movie_id,
        tconst,
        primaryTitle AS title,
        int2(startyear) as startyear, int2(endyear) as endyear,
        runtimeminutes as runtimeminutes
    FROM imdb.title_basics
    WHERE titleType = 'movie'
      AND NOT isadult;

CREATE UNIQUE INDEX movies_pk ON movies (movie_id);
CREATE INDEX movies_tconst_x ON movies (tconst);

------------------------------------------------------------------------
-- Load the raw table of movie/job relationships
--

CREATE TABLE imdb.title_principals (
    tconst text,
    ordering integer,
    nconst text,
    category text,
    job text,
    characters text
);

COPY imdb.title_principals
    FROM program 'curl https://datasets.imdbws.com/title.principals.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

CREATE INDEX title_principals_tx ON imdb.title_principals (tconst);
CREATE INDEX title_principals_nx ON imdb.title_principals (nconst);

------------------------------------------------------------------------
-- Strip down the raw table to just the ids defining the relationship
--

CREATE TABLE movie_actors AS
    SELECT m.movie_id,
        a.actor_id
    FROM imdb.title_principals i
    JOIN actors a ON a.nconst = i.nconst
    JOIN movies m ON m.tconst = i.tconst
    WHERE i.category ~ '^act';

CREATE INDEX movie_actors_ax ON movie_actors (actor_id);
CREATE INDEX movie_actors_mx ON movie_actors (movie_id);

</code></pre></details><p>In order to make the tables smaller and functions faster, the raw files are stripped down during the ETL process, and the result is three smaller tables.<ul><li><code>actors</code> has 371,557 rows<li><code>movies</code> has 678,204 rows, and<li><code>movie_actors</code> has 1,866,533 rows.</ul><p><img alt=ERD loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/89928faa-188d-4790-ece0-7bdc7ed47000/public><h2 id=graph-solver><a href=#graph-solver>Graph Solver</a></h2><p>So, what is the algorithm we need to calculate the Bacon Number for any given actor? Look at the example above, for Steve Martin and Zendaya. Actors are joined together by movies. Taken together, the actors and movies form a <a href=https://en.wikipedia.org/wiki/Graph_theory>graph</a>! The actors are <strong>nodes</strong> of the graph and the movies are <strong>edges</strong> of the graph.<p><img alt=Graph loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/78ee614c-5a20-401f-0c9b-3ee532668000/public><p>And fortunately PostgreSQL already has a <strong>graph solver</strong> available, <a href=https://pgrouting.org/>pgRouting</a>!<p>(Wait, isn't pgRouting for solving transportation optimization problems? 
Well, that is what it is <strong>mostly</strong> used for, but it is built as a <strong>completely generic graph solver</strong>, suitable for all kinds of graph algorithms, including the <a href=https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm>Dijkstra shortest path</a> algorithm we need to calculate a Bacon Number.)<p>Alternatively, we could approach the problem directly within core PostgreSQL using a "<a href=https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-RECURSIVE>recursive CTE</a>" to walk the graph.<p>Let's look at both approaches.<h2 id=build-the-graph><a href=#build-the-graph>Build the Graph</a></h2><p>For both approaches we need to expand our table of movie/actor relationships into a table of <strong>graph edges</strong> where each edge is one pairing of actors in the same movie.<pre><code class=language-sql>CREATE TABLE actor_graph AS
SELECT
  row_number() OVER () AS edge_id,
  a.actor_id AS actor_id,
  a.movie_id AS movie_id,
  b.actor_id AS other_actor_id
FROM movie_actors a
JOIN movie_actors b ON a.movie_id = b.movie_id
WHERE a.actor_id != b.actor_id;
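
-- Note that the self-join keeps both orderings of each pair, (a, b) and
-- (b, a), so every edge appears in both directions and the graph is
-- effectively undirected.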

CREATE INDEX actor_graph_id_x ON actor_graph (actor_id);
CREATE INDEX actor_graph_other_id_x ON actor_graph (other_actor_id);
CREATE INDEX actor_graph_edge_id_x ON actor_graph (edge_id);
</code></pre><p>Self-joining the <code>movie_actors</code> table gets us a table of 11M edges that form the <code>actor_graph</code>.<pre><code class=language-sql>SELECT * FROM actor_graph LIMIT 5;
</code></pre><pre><code> edge_id | actor_id | movie_id | other_actor_id
---------+----------+----------+----------------
       1 |   951773 |  1861414 |         764895
       2 |   951773 |  1861414 |         618628
       3 |   951773 |  1861414 |         244428
       4 |   951773 |  1861414 |         258574
       5 |   951773 |  1861414 |         147923
</code></pre><h2 id=pgrouting><a href=#pgrouting>pgRouting</a></h2><p><a href=https://pgrouting.org/>pgRouting</a> is a unique solver that allows you to dynamically create the graph you will be solving on. This makes a lot of sense for a solver built into the database, since the data in a database is presumed to be fairly <strong>dynamic</strong>.<p>Every algorithm in pgRouting takes in a SQL query that generates the graph to be worked on, and parameters suitable for the algorithm chosen.<p>We are using the <a href=https://docs.pgrouting.org/latest/en/pgr_dijkstra.html>Dijkstra algorithm</a>, so the parameters are just the graph <strong>SQL</strong>, the <strong>start node</strong> and the <strong>end node</strong>.<p>Everything works off integer keys, so we start by finding the keys for two actors, <strong>Kevin Bacon</strong> and <strong>Timothée Chalamet</strong>.<pre><code class=language-sql>CREATE EXTENSION pgrouting;

SELECT actor_id FROM actors WHERE name = 'Kevin Bacon';
SELECT actor_id FROM actors WHERE name = 'Timothée Chalamet';

SELECT seq, node, edge FROM pgr_dijkstra(
    'SELECT
        a.movie_id AS id,
        a.actor_id AS source,
        a.other_actor_id AS target,
        1.0 AS cost
        FROM actor_graph a',
    3154303, -- Timothée Chalamet
    102      -- Kevin Bacon
    );
</code></pre><p>What comes back (in about 5 seconds) is the list of edges that forms the shortest path.<pre><code> seq |  node   |   edge
-----+---------+----------
   1 | 3154303 | 11286314
   2 | 2225369 |  1270798
   3 |     102 |       -1
</code></pre><p>This example is one that works against the strengths of pgRouting. We are asking for a route through a static graph, not a dynamic graph, and we are routing through a very large graph (11M edges).<p>Each run of the function will pull the whole graph out of the database, form the graph using the node keys, and then finally run the Dijkstra algorithm.<p>For our 11M record table, this takes about 5 seconds.<h2 id=recursive-cte><a href=#recursive-cte>Recursive CTE</a></h2><p>For this particular problem, it turns out that a recursive CTE is a great fit. The graph is static, and the number of steps needed to form a shortest path is quite small -- no more than 6, according to our rules.<p>The downside of a <a href=https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-RECURSIVE>recursive CTE</a> should be apparent from this example and the documentation. In a world of confusing SQL, recursive CTE SQL is the most confusing of all.<p>Here's a bare query to run the same search as we ran in pgRouting:<pre><code class=language-sql>WITH RECURSIVE bacon_numbers AS (
-- Starting nodes
SELECT
  ag.actor_id,
  ag.movie_id,
  ag.other_actor_id,
  1 AS bacon_number,
  ARRAY[ag.edge_id] AS path,
  false AS is_cycle
FROM actor_graph AS ag
WHERE actor_id = 102 -- Kevin Bacon

UNION ALL

-- Recursive set
SELECT
  bn.actor_id,
  ag.movie_id,
  ag.other_actor_id,
  bn.bacon_number + 1 AS bacon_number,
  path || ag.edge_id AS path,
  ag.edge_id = ANY(path) AS is_cycle
FROM actor_graph AS ag
JOIN bacon_numbers AS bn
  ON bn.other_actor_id = ag.actor_id
WHERE bn.bacon_number &lt;= 5
  AND NOT is_cycle
)
SELECT path FROM bacon_numbers
WHERE other_actor_id = 3154303 -- Timothée Chalamet
LIMIT 1;
</code></pre><p>That's a lot more complex! Because we are writing the traversal by hand, with a relatively blunt instrument, the query is far more verbose than the pgRouting solution. On the other hand, this solution runs in just a few hundred milliseconds, so for the Bacon problem it's clearly superior.<p>The output <code>path</code> array is an ordered list of edges that take us from the start node (Bacon) to the end node (Chalamet).<pre><code>       path
-------------------
 {2016551,4962882}
</code></pre><p>Join the edges and nodes back up to their human readable movies and actors.<pre><code class=language-sql>SELECT a.name, m.title, m.startyear, o.name
FROM unnest('{2016551,4962882}'::integer[]) p
JOIN actor_graph ag ON ag.edge_id = p
JOIN actors a USING (actor_id)
JOIN actors o ON ag.other_actor_id = o.actor_id
JOIN movies m USING (movie_id);
</code></pre><p>And we see the second order path from Kevin Bacon to Timothée Chalamet.<pre><code>      name       |        title         | startyear |       name
-----------------+----------------------+-----------+-------------------
 Kevin Bacon     | You Should Have Left |      2020 | Amanda Seyfried
 Amanda Seyfried | Love the Coopers     |      2015 | Timothée Chalamet
</code></pre><p><img alt=Bacon-Seyfried loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7ef42d9d-4377-49ae-90ab-3925125b3000/public><p>If you want to play around and try to find higher numbers, here is a complete PL/PgSQL function to simplify the process.</p><details><summary>PL/PgSQL Function for RCTE Bacon search</summary><pre><code class=language-sql>
CREATE OR REPLACE FUNCTION bacon(query_name text)
RETURNS TABLE(name text, title text, year smallint, othername text, n bigint) AS $$
DECLARE
    bacon_id INTEGER := 102;
    query_id INTEGER;
    row_count INTEGER;
BEGIN
    SELECT actor_id INTO query_id
    FROM actors
    WHERE actors.name = query_name;
    GET DIAGNOSTICS row_count = ROW_COUNT;

    IF (row_count != 1) THEN
        RAISE EXCEPTION 'Found % entries for actor %', row_count, query_name;
    END IF;

    RETURN QUERY
    WITH RECURSIVE bacon_numbers AS (
    SELECT
      ag.actor_id,
      ag.movie_id,
      ag.other_actor_id,
      1 AS bacon_number,
      ARRAY[ag.edge_id] AS path,
      false AS is_cycle
    FROM actor_graph AS ag
    WHERE actor_id = 102 -- Kevin Bacon
    UNION ALL
    SELECT
      bn.actor_id,
      ag.movie_id,
      ag.other_actor_id,
      bn.bacon_number + 1 AS bacon_number,
      path || ag.edge_id AS path,
      ag.edge_id = ANY(path) AS is_cycle
    FROM actor_graph AS ag
    JOIN bacon_numbers AS bn
      ON bn.other_actor_id = ag.actor_id
    WHERE bn.bacon_number &lt;= 5
      AND NOT is_cycle
    ),
    bacon_path AS (
        SELECT path, bacon_number FROM bacon_numbers
        WHERE other_actor_id = query_id
        LIMIT 1
    )
    SELECT a.name, m.title, m.startyear, o.name, e.n
    FROM bacon_path, unnest(path) WITH ORDINALITY e(edge_id, n)
    JOIN actor_graph ag ON ag.edge_id = e.edge_id
    JOIN actors a ON ag.actor_id = a.actor_id
    JOIN actors o ON ag.other_actor_id = o.actor_id
    JOIN movies m ON ag.movie_id = m.movie_id
    ORDER BY e.n;

END;
$$ LANGUAGE plpgsql;
</code></pre></details><p>The highest number I have found so far is a <em>3</em> for Gates McFadden, who played <em>Doctor Crusher</em> in <em>Star Trek: The Next Generation</em>.<pre><code class=language-sql>SELECT * FROM bacon('Gates McFadden');
</code></pre><pre><code>      name       |           title           | year |    othername    | n
-----------------+---------------------------+------+-----------------+---
 Kevin Bacon     | Beverly Hills Cop: Axel F | 2024 | Bronson Pinchot | 1
 Bronson Pinchot | Babes in Toyland          | 1997 | Jim Belushi     | 2
 Jim Belushi     | Taking Care of Business   | 1990 | Gates McFadden  | 3
</code></pre><p><img alt=Bacon-McFadden loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/034b01c1-c4dd-4125-517a-50c63e5b3f00/public><h2 id=conclusions><a href=#conclusions>Conclusions</a></h2><ul><li>Graph-solving problems do not require special external software; you can attack them in PostgreSQL using pgRouting or a custom recursive CTE.<li>pgRouting is good for smaller graphs and especially powerful for graphs that are generated dynamically, to reflect changes in data or changes in the desired graph.<li>Recursive CTEs can handle much larger graphs, but not large traversals of those graphs, as they tend to pile up very large intermediate results that grow with each recursion.</ul><h2 id=pictures><a href=#pictures>Pictures</a></h2><ul><li>Kevin Bacon by <a href=https://www.flickr.com/photos/gageskidmore/14781004366/>Gage Skidmore</a><li>Bronson Pinchot by Rob DiCaterino<li>Jim Belushi by COD Newsroom<li>Gates McFadden by <a href="https://commons.wikimedia.org/w/index.php?curid=130047788">Super Festivals</a><li>Marisa Tomei by Elena Ternovaja<li>Amanda Seyfried by <a href=https://commons.wikimedia.org/wiki/User:Toglenn>Glenn Francis</a><li>Steve Martin by <a href="https://commons.wikimedia.org/w/index.php?curid=137841186">David W Baker</a><li>Zendaya by <a href=https://commons.wikimedia.org/wiki/User:Toglenn>Glenn Francis</a><li>Timothee Chalamet by <a href=https://www.flickr.com/photos/terras/30755014688/>Somewhere In Toronto</a></ul> ]]></content:encoded>
<category><![CDATA[ Tutorials ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">6c7a27614732d8dbfa891c279d1bf451b4088ac2e76b9a4db2de25a724253a40</guid>
<pubDate>Mon, 05 Aug 2024 08:00:00 EDT</pubDate>
<dc:date>2024-08-05T12:00:00.000Z</dc:date>
<atom:updated>2024-08-05T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Magic Tricks for Postgres psql: Settings, Presets, Echo, and Saved Queries ]]></title>
<link>https://www.crunchydata.com/blog/magic-tricks-for-postgres-psql-settings-presets-cho-and-saved-queries</link>
<description><![CDATA[ Elizabeth has a set of tips for making your psql environment easier to work with. Find out how to save queries, echo back psql commands, and some of the psql settings to make your environment friendlier. ]]></description>
<content:encoded><![CDATA[ <p>As I’ve been working with the Postgres psql CLI, I’ve picked up a few good habits from my Crunchy Data co-workers that make my terminal database environment easier to work with. I wanted to share a couple of my favorite things I’ve found that make getting around Postgres better. If you’re just getting started with psql, or haven’t ventured too far out of the defaults, this is the post for you. I’ll walk you through some of the friendliest psql settings and how to create your own preset settings file.<h2 id=some-of-the-most-helpful-psql-commands><a href=#some-of-the-most-helpful-psql-commands>Some of the most helpful psql commands</a></h2><h3 id=formatting-psql-output><a href=#formatting-psql-output>Formatting psql output</a></h3><p>Postgres has an expanded display mode, which will display your query results as batches of columns and data, instead of a huge wide list of columns that expands the display to the right.<p>A sample expanded display looks like this:<pre><code>-[ RECORD 1 ]------------------------------
id         | 1
name       | Alice Johnson
position   | Manager
department | Sales
salary     | 75000.00
-[ RECORD 2 ]------------------------------
id         | 2
name       | Bob Smith
position   | Developer
department | Engineering
salary     | 65000.00
</code></pre><pre><code>-- Automatically format expanded display for wide columns
\x auto
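-- You can also set expanded mode explicitly
\x on
\x off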
</code></pre><p>I have a tutorial up about <a href=https://www.crunchydata.com/developers/playground/psql-basics>using basic psql</a> if you’re just getting started and want to try these commands out.<h3 id=table-column-borders-in-psql-output><a href=#table-column-borders-in-psql-output>Table column borders in psql output</a></h3><p>If you’re not using the expanded display, you can have psql do some fancy column outlines with the <code>\pset linestyle</code> setting.<pre><code>-- Outline table borders and separators using Unicode characters
\pset linestyle unicode
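-- The full outer frame in the output below also needs a higher border level
\pset border 2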
</code></pre><p>That will get you query output that looks like this:<pre><code>┌────┬───────┬─────┐
│ id │ name  │ age │
├────┼───────┼─────┤
│  1 │ Alice │  30 │
│  2 │ Bob   │  25 │
└────┴───────┴─────┘
</code></pre><h3 id=show-query-run-times-in-psql><a href=#show-query-run-times-in-psql>Show query run times in psql</a></h3><p>This will print the time each query took to run, in milliseconds, at the bottom of its results:<pre><code>-- Always show query time
\timing
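-- \timing also accepts an explicit argument instead of toggling
\timing on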
</code></pre><h3 id=create-a-preset-for-your-null-values-in-psql><a href=#create-a-preset-for-your-null-values-in-psql>Create a preset for your null values in psql</a></h3><p>This will work with emojis or really anything utf-8 compatible:<pre><code>-- Set Null char output to differentiate it from empty string
\pset null '☘️'
</code></pre><h3 id=your-psql-history><a href=#your-psql-history>Your psql history</a></h3><p>You can create a history file for your psql command sessions like this:<pre><code>-- Creates a history file for each database in your config directory CHECK IF THIS IS RIGHT
\set HISTFILE ~/.config/psql/psql_history-:DBNAME

-- Number of commands to save in history
\set HISTSIZE 2000
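
-- Make sure the history directory exists first (e.g. mkdir -p ~/.config/psql)

-- Skip storing consecutive duplicate commands
\set HISTCONTROL ignoredups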
</code></pre><h3 id=echo-psql-commands-as-sql><a href=#echo-psql-commands-as-sql>Echo PSQL commands as SQL</a></h3><p>Any psql slash command (such as <code>\d</code>) runs against Postgres’ system tables. You can use the psql echo command to display the queries used for a given command, which can give you insight about Postgres’ internal tables, catalog, and other naming conventions.<pre><code>-- output any SQL run by psql slash commands
\set ECHO_HIDDEN on
</code></pre><pre><code>-- Or start psql with the equivalent command-line flag
psql -E
</code></pre><p>Now let’s have the echo show us something. Do a table lookup with:<pre><code>\dt+
</code></pre><p>Now you’ll see that it echoes back the query it used to get this data, plus, at the bottom, the normal results of <code>\dt+</code>.<pre><code class=language-sql>SELECT n.nspname as "Schema",
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
  CASE c.relpersistence WHEN 'p' THEN 'permanent' WHEN 't' THEN 'temporary' WHEN 'u' THEN 'unlogged' END as "Persistence",
  am.amname as "Access method",
  pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid)) as "Size",
  pg_catalog.obj_description(c.oid, 'pg_class') as "Description"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam
WHERE c.relkind IN ('r','p','')
      AND n.nspname &#60> 'pg_catalog'
      AND n.nspname !~ '^pg_toast'
      AND n.nspname &#60> 'information_schema'
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;

**************************

                                    List of relations
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size  | Description
--------+---------+-------+----------+-------------+---------------+--------+-------------
 public | weather | table | postgres | permanent   | heap          | 856 kB |
(1 row)
</code></pre><h3 id=echo-all-postgres-psql-queries><a href=#echo-all-postgres-psql-queries>Echo all Postgres psql queries</a></h3><p>You can also have psql echo all queries that it runs:<pre><code>-- Have psql echo back queries
\set ECHO queries
</code></pre><pre><code>-- Or start psql with the equivalent command-line flag
psql -e
</code></pre><p>This can be useful if you’re running queries from a file, or from presets in your psqlrc, and want the query text echoed with its output for record keeping.<p>I have a web based tutorial for <a href=https://www.crunchydata.com/developers/playground/psql-echo-commands>ECHO HIDDEN and ECHO queries</a> if you want to dig into either of these more.<h2 id=set-up-your-default-psql-experience-with-psqlrc><a href=#set-up-your-default-psql-experience-with-psqlrc>Set up your default psql experience with <code>.psqlrc</code></a></h2><p>You can set up all of the things I’ve listed above to happen automatically every time you use your local psql. When psql starts, it looks for a <code>.psqlrc</code> file and if one exists, it will execute the commands within it. This allows you to customize prompts and other psql settings.<p>You can see if you have a <code>.psqlrc</code> file yet with:<pre><code>ls -l ~/.psqlrc
</code></pre><p>If you want to try adding one:<pre><code>touch ~/.psqlrc
</code></pre><p>Or edit your current file with:<pre><code>open -e ~/.psqlrc
</code></pre><p>If you want to skip the logging of commands when you start psql, you can add these to the beginning and end of your file:<pre><code>-- Don't log these commands at the beginning of the file
\set QUIET 1

-- Reset command logging at the end of the file
\set QUIET 0
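
-- Putting it together, a minimal ~/.psqlrc using the settings from
-- this post might look like:
--
--   \set QUIET 1
--   \x auto
--   \pset linestyle unicode
--   \timing
--   \pset null '☘️'
--   \set HISTFILE ~/.config/psql/psql_history-:DBNAME
--   \set QUIET 0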
</code></pre><h3 id=customizing-your-prompt-line><a href=#customizing-your-prompt-line>Customizing your prompt line</a></h3><p>The default prompt for <code>psql</code> shows your database name and not much else. In your psqlrc file, you can change the psql prompt line to use a different combination of information about the database host and session. I personally like using the date and time in here, since I’m saving sessions to refer back to later.<pre><code>-- Create a prompt with host, database name, date, and time
\set PROMPT1 '%m@%/ %`date "+%Y-%m-%d %H:%M:%S"` '
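
-- %m is the server host and %/ the current database; a simpler
-- variant (%n is the session user, %# marks superusers) might be:
-- \set PROMPT1 '%n@%m:%/%# '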
</code></pre><p>For me, this looks like:<pre><code>[local]@crunchy-dev-db 2024-07-19 15:06:37
</code></pre><h3 id=saved-queries-in-your-psqlrc-file><a href=#saved-queries-in-your-psqlrc-file>Saved queries in your psqlrc file</a></h3><p>This <code>.psqlrc</code> file is looking pretty cool, right? But wait … there’s more! You can add queries to this file so that you can just run them with a super simple psql input.<p>Add these sample queries to psqlrc for long running queries, cache hit ratio, unused_indexes, and table sizes.<pre><code class=language-sql>\set long_running 'SELECT pid, now() - pg_stat_activity.xact_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.xact_start) > interval ''5 minutes'' ORDER by 2 DESC;'
</code></pre><pre><code class=language-sql>\set cache_hit 'SELECT ''index hit rate'' AS name, (sum(idx_blks_hit)) / nullif(sum(idx_blks_hit + idx_blks_read),0) AS ratio FROM pg_statio_user_indexes UNION ALL SELECT ''table hit rate'' AS name, sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read),0) AS ratio FROM pg_statio_user_tables;'
</code></pre><pre><code class=language-sql>\set unused_indexes 'SELECT schemaname || ''.'' || relname AS table, indexrelname AS index, pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size, idx_scan as index_scans FROM pg_stat_user_indexes ui JOIN pg_index i ON ui.indexrelid = i.indexrelid WHERE NOT indisunique AND idx_scan &#60 50 AND pg_relation_size(relid) > 5 * 8192 ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST, pg_relation_size(i.indexrelid) DESC;'
</code></pre><pre><code class=language-sql>\set table_sizes 'SELECT c.relname AS name, pg_size_pretty(pg_table_size(c.oid)) AS size FROM pg_class c LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace) WHERE n.nspname NOT IN (''pg_catalog'', ''information_schema'') AND n.nspname !~ ''^pg_toast'' AND c.relkind=''r'' ORDER BY pg_table_size(c.oid) DESC;'
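
-- Another preset you might add: tables with the most sequential scans
\set seq_scans 'SELECT relname AS name, seq_scan AS scans FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;'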
</code></pre><p>Then, to execute one inside psql, use a colon and the name of the query, for example <code>:long_running</code>. If you’re using our <a href=https://www.crunchydata.com/products/crunchy-bridge>managed Postgres</a>, Crunchy Bridge, we built in a bunch of this for you with our <a href=https://www.crunchydata.com/blog/crunchy-bridge-annoucing-postgres-insights-in-your-cli>CLI insights</a>.<h2 id=experiment-with-your-psql-environment><a href=#experiment-with-your-psql-environment>Experiment with your psql environment</a></h2><p>I hope some of these things give you a few ideas about experimenting with your psql environments. It's pretty easy and fun! My tips for success:<ul><li>Prioritize helping yourself with things you use every day that take time. Is there a query you run once a week on your database? Put that in your psqlrc file so it’s right there next time.<li>Don’t go overboard if you mostly work on remote databases. If you remote in directly instead of connecting from a local psql, a heavily customized local setup won’t follow you, and switching between environments can be painful.<li>Check out our tutorials for <a href=https://www.crunchydata.com/developers/playground/psql-basics>basic psql</a> and <a href=https://www.crunchydata.com/developers/playground/psql-echo-commands>ECHO HIDDEN and ECHO queries</a> to experiment with these in a web browser</ul><p>We have a ton of other handy psql tricks in our <a href=https://www.crunchydata.com/postgres-tips>Postgres tips</a> page.<p><br><br> Thanks so much to Craig Kerstiens, David Christensen, Greg Mullane, and Reid Thompson for sharing all their psqlrc file samples and ideas with me! ]]></content:encoded>
<category><![CDATA[ Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">244095b4654cb09985ccdaa507acd02a1917c974b42266c7b32b834ab53f1ecb</guid>
<pubDate>Fri, 19 Jul 2024 08:00:00 EDT</pubDate>
<dc:date>2024-07-19T12:00:00.000Z</dc:date>
<atom:updated>2024-07-19T12:00:00.000Z</atom:updated></item></channel></rss>