<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/tutorials/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/tutorials</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/tutorials</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Mon, 16 Sep 2024 10:00:00 EDT</pubDate>
<dc:date>2024-09-16T14:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Window Functions for Data Analysis with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/window-functions-for-data-analysis-with-postgres</link>
<description><![CDATA[ Elizabeth has some sample queries and explanations for window functions like running totals, lag/lead, rolling averages, and more. ]]></description>
<content:encoded><![CDATA[ <p>SQL makes sense when it's working on a single row, or even when it's aggregating across multiple rows. But what happens when you want to compare between rows of something you've already calculated? Or make groups of data and query those? Enter window functions.<p>Window functions tend to confuse people - but they’re a pretty awesome tool in SQL for data analytics. The best part is that you don’t need charts, fancy BI tools or AI to get some actionable and useful data for your stakeholders. Window functions let you quickly:<ul><li>Calculate running totals<li>Provide summary statistics for groups/partitions of data<li>Create rankings<li>Perform lag/lead analysis, i.e. comparing each row with a row before or after it<li>Compute moving/rolling averages</ul><!--more--><p>In this post, I will show various types of window functions and how they apply to common situations. I’m using a super simple e-commerce schema so you can follow along with the kinds of queries I’m going to run with window functions.<blockquote><p>This post is available as an <a href=https://www.crunchydata.com/developers/playground/postgres-window-functions-for-data-analysis>interactive tutorial</a> as well in our Postgres playground.</blockquote><p><strong>The <code>OVER</code> clause</strong><p>The <code>OVER</code> clause of a window function is what creates the window. Annoyingly the word window appears nowhere in any of the functions. 😂 Typically <code>OVER</code> is preceded by another function, either an aggregate or a dedicated window function like <code>RANK</code> or <code>LAG</code>. There’s also often a frame clause to specify which rows you’re looking at, like <code>ROWS BETWEEN 6 PRECEDING AND CURRENT ROW</code>.<p><strong>Window functions vs where clauses</strong><p>Window functions kind of feel like a where clause at first, since they’re looking at a set of data. But they’re really different. Window functions are more for times when you need to look across sets of data or across groups of data. 
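<p>For example, here is the contrast in miniature, sketched against the same <code>orders</code> table used throughout this post. A <code>GROUP BY</code> collapses the detail rows away, while a window function keeps every row and adds the aggregate alongside it:<pre><code class=language-sql>-- GROUP BY: one row per customer, the individual orders are gone
SELECT customer_id, SUM(total_amount) AS customer_total
FROM orders
GROUP BY customer_id;

-- Window function: every order row survives, with the per-customer
-- total repeated alongside each one
SELECT order_id, customer_id, total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;
</code></pre>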
There are cases where you could use either. In general:<ul><li>Use <strong><code>WHERE</code> clause</strong> when you need to filter rows based on a condition.<li>Use <strong>window functions</strong> when you need to perform calculations across rows that remain after filtering, without removing any rows from the result set.</ul><h2 id=running-totals><a href=#running-totals>Running totals</a></h2><p>Here’s a simple place to get started. Let’s ask for orders, customer data, order totals, and then a running total of orders. This will show us our total orders across a date range.<pre><code class=language-sql>SELECT
    SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
    order_date,
    order_id,
    customer_id,
    total_amount
FROM
    orders
ORDER BY
    order_date;
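
-- A variant of the same idea: adding PARTITION BY restarts the running
-- total for each customer instead of accumulating across all orders:
--   SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date)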
</code></pre><pre><code class=language-sql> running_total |     order_date      | order_id | customer_id | total_amount
---------------+---------------------+----------+-------------+--------------
        349.98 | 2024-08-21 10:00:00 |       21 |           1 |       349.98
       1249.96 | 2024-08-22 11:30:00 |       22 |           2 |       899.98
       1284.94 | 2024-08-23 09:15:00 |       23 |           3 |        34.98
       1374.93 | 2024-08-24 14:45:00 |       24 |           4 |        89.99
       1524.92 | 2024-08-25 08:25:00 |       25 |           5 |       149.99
       1589.90 | 2024-08-26 12:05:00 |       26 |           6 |        64.98
</code></pre><p>What's happening here is that each frame of data is the existing row plus the rows before it. This sort of does the calculation on one slice at a time, which you might see in the docs called a virtual table. Here's a diagram to get the general idea of how each frame of data is a set of rows, aggregated by the function with the <code>SUM OVER</code>.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/fbe5bd99-2a10-4325-9399-27950f854900/public><h2 id=first-and-last-values><a href=#first-and-last-values>First and last values</a></h2><p>Window functions can look at groups of data, partitioned by something like a customer ID, and give you each customer's first and last order dates alongside every individual order.<pre><code class=language-sql>SELECT
    FIRST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS first_order_date,
    LAST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS last_order_date,
    o.order_id,
    o.customer_id,
    o.order_date,
    o.total_amount
FROM
    orders o
ORDER BY
    o.order_date DESC;
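
-- Gotcha: with the default window frame, LAST_VALUE() only sees rows up
-- to the current row, so last_order_date above just echoes each row's own
-- order_date. To get the partition's true last value, widen the frame:
--   LAST_VALUE(o.order_date) OVER (
--       PARTITION BY o.customer_id ORDER BY o.order_date
--       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--   ) AS last_order_date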
</code></pre><pre><code class=language-sql>  first_order_date   |   last_order_date   | order_id | customer_id |     order_date      | total_amount
---------------------+---------------------+----------+-------------+---------------------+--------------
 2024-08-30 17:50:00 | 2024-09-19 18:50:00 |       50 |          10 | 2024-09-19 18:50:00 |       149.98
 2024-08-29 13:10:00 | 2024-09-18 14:10:00 |       49 |           9 | 2024-09-18 14:10:00 |       199.98
 2024-08-28 10:20:00 | 2024-09-17 11:20:00 |       48 |           8 | 2024-09-17 11:20:00 |       139.99
 2024-08-27 16:35:00 | 2024-09-16 17:35:00 |       47 |           7 | 2024-09-16 17:35:00 |       249.98
 2024-08-26 12:05:00 | 2024-09-15 13:05:00 |       46 |           6 | 2024-09-15 13:05:00 |        89.98
</code></pre><h2 id=using-date_trunc-group-by-ctes-with-window-functions><a href=#using-date_trunc-group-by-ctes-with-window-functions>Using date_trunc GROUP BY CTEs with Window Functions</a></h2><p><code>date_trunc</code> is an incredibly handy Postgres function that truncates timestamps to a unit of time: hour, day, week, month. When combined with a <code>GROUP BY</code> in a CTE, you can create really easy summary statistics by day, month, week, year, etc.<p>When you combine the date_trunc GROUP BY partitions with window functions, some pretty magical stuff happens, and you can get ready-made summary statistics straight out of your database. In my opinion this is one of the most powerful features of Postgres window functions that really gets you to the next level.<p>Here’s an example query that starts with a CTE, using date_trunc to sum orders into daily totals. The second part of the query, the window function, ranks the daily totals in descending order, putting the best day of sales first.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    RANK() OVER (
        ORDER BY daily_total_sales DESC
    ) AS sales_rank
FROM
    DailySales
ORDER BY
    sales_rank;
</code></pre><pre><code class=language-sql>     sales_date      | daily_total_sales | sales_rank
---------------------+-------------------+------------
 2024-09-02 00:00:00 |           2419.97 |          1
 2024-09-01 00:00:00 |           1679.94 |          2
 2024-08-22 00:00:00 |            899.98 |          3
 2024-09-07 00:00:00 |            699.95 |          4
 2024-09-10 00:00:00 |            659.96 |          5
 2024-09-09 00:00:00 |            499.94 |          6
 2024-09-06 00:00:00 |            409.94 |          7
 2024-08-30 00:00:00 |            349.99 |          8
</code></pre><p>I should note that this uses <code>RANK</code>, one of the really helpful ranking functions, inside a window function.<h3 id=lag-analysis><a href=#lag-analysis><code>LAG</code> Analysis</a></h3><p>Let’s do more with our date_trunc CTEs. Now that we have our data grouped by day, we can use a window function to calculate changes between these groups. For example, we could look at the difference in sales from the prior day. In this example, <code>LAG</code> looks at the sales date and creates a comparison with the previous day.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    LAG(daily_total_sales) OVER (
        ORDER BY sales_date
    ) AS previous_day_sales,
    daily_total_sales - LAG(daily_total_sales) OVER (
        ORDER BY sales_date
    ) AS sales_difference
FROM
    DailySales
ORDER BY
    sales_date;
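
-- LEAD() is the mirror image of LAG(), looking forward instead of back:
--   LEAD(daily_total_sales) OVER (ORDER BY sales_date) AS next_day_sales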
</code></pre><pre><code class=language-sql>
     sales_date      | daily_total_sales | previous_day_sales | sales_difference
---------------------+-------------------+--------------------+------------------
 2024-08-21 00:00:00 |            349.98 |                    |
 2024-08-22 00:00:00 |            899.98 |             349.98 |           550.00
 2024-08-23 00:00:00 |             34.98 |             899.98 |          -865.00
 2024-08-24 00:00:00 |             89.99 |              34.98 |            55.01
 2024-08-25 00:00:00 |            149.99 |              89.99 |            60.00
 2024-08-26 00:00:00 |             64.98 |             149.99 |           -85.01
</code></pre><p><code>LEAD</code> works the same way, looking forward in the data set.<h2 id=rolling-averages><a href=#rolling-averages>Rolling averages</a></h2><p>Using our same daily groups, we can also make a rolling average. The <code>AVG</code> function takes a frame clause, <code>ROWS BETWEEN 6 PRECEDING AND CURRENT ROW</code>, for a rolling 7-day sales average.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    AVG(daily_total_sales) OVER (
        ORDER BY sales_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_average_7_days
FROM
    DailySales
ORDER BY
    sales_date
LIMIT 10;
</code></pre><pre><code class=language-sql>     sales_date      | daily_total_sales | rolling_average_7_days
---------------------+-------------------+------------------------
 2024-08-21 00:00:00 |            349.98 |   349.9800000000000000
 2024-08-22 00:00:00 |            899.98 |   624.9800000000000000
 2024-08-23 00:00:00 |             34.98 |   428.3133333333333333
 2024-08-24 00:00:00 |             89.99 |   343.7325000000000000
 2024-08-25 00:00:00 |            149.99 |   304.9840000000000000
 2024-08-26 00:00:00 |             64.98 |   264.9833333333333333
 2024-08-27 00:00:00 |            249.98 |   262.8400000000000000
 2024-08-28 00:00:00 |            129.99 |   231.4128571428571429
 2024-08-29 00:00:00 |            179.98 |   128.5557142857142857
 2024-08-30 00:00:00 |            349.99 |   173.5571428571428571
(10 rows)
</code></pre><h2 id=n-tiles-with-window-functions><a href=#n-tiles-with-window-functions>N-tiles with Window Functions</a></h2><p>The <code>NTILE</code> function is a window function in SQL that is used to divide a result set into a specified number of roughly equal parts, known as tiles or buckets. By assigning a unique tile number to each row, n-tiles helps categorize and analyze data distribution within a dataset. This function is particularly useful in statistical and financial analysis for understanding how data is distributed across different segments, identifying trends, and making comparisons between groups with different characteristics.<p>For example, using <code>NTILE(4)</code> divides the data into four quartiles, ranking each row into one of four groups. The descending part here makes sure that quartile 1 is the top quarter and so on.<pre><code class=language-sql>WITH DailySales AS (
    SELECT
        date_trunc('day', o.order_date) AS sales_date,
        SUM(o.total_amount) AS daily_total_sales
    FROM
        orders o
    GROUP BY
        date_trunc('day', o.order_date)
)
SELECT
    sales_date,
    daily_total_sales,
    NTILE(4) OVER (
        ORDER BY daily_total_sales DESC
    ) AS sales_quartile
FROM
    DailySales
ORDER BY
    sales_date;
</code></pre><pre><code class=language-sql>     sales_date      | daily_total_sales | sales_quartile
---------------------+-------------------+----------------
 2024-09-06 00:00:00 |            409.94 |              1
 2024-08-30 00:00:00 |            349.99 |              1
 2024-08-21 00:00:00 |            349.98 |              2
 2024-09-08 00:00:00 |            349.96 |              2
</code></pre><h2 id=window-functions-resources><a href=#window-functions-resources>Window functions resources</a></h2><p>We’re big fans of Postgres functions at Crunchy Data, so check out our resources on Window functions in our <a href=https://www.crunchydata.com/developers/tutorials>Postgres Tutorials</a>:<ul><li><p><a href=https://www.crunchydata.com/developers/playground/postgres-window-functions-for-data-analysis>Postgres Window Functions for Data Analysis</a> (this post)<li><p><a href=https://www.crunchydata.com/blog/percentage-calculations-using-postgres-window-functions>Percentage Calculations with Window Functions</a><li><p><a href=https://www.crunchydata.com/developers/playground/summaries-with-aggregate-filters-and-windows>Summaries with Aggregate Filters and Windows</a><li><p><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs and Window Functions with US Birth Data</a></ul> ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermaLink="false">22577ce775ab3d802aaf8515f9aa43e311bf0a7160f1411e97401aaf26ecf3da</guid>
<pubDate>Mon, 16 Sep 2024 10:00:00 EDT</pubDate>
<dc:date>2024-09-16T14:00:00.000Z</dc:date>
<atom:updated>2024-09-16T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Six Degrees of Kevin Bacon - Postgres Style ]]></title>
<link>https://www.crunchydata.com/blog/six-degrees-of-kevin-bacon-postgres-style</link>
<description><![CDATA[ Paul Ramsey has some great examples of Postgres network analysis and graph theory in this sample code for playing the Kevin Bacon game. Both pgRouting and recursive CTE are used to solve graphing relationships. ]]></description>
<content:encoded><![CDATA[ <p>Back in the 1990s, before anything was cool (or so my children tell me) and at the dawn of the Age of the Meme, a couple of college students invented a game they called the "<a href=https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon>Six Degrees of Kevin Bacon</a>".<p>The conceit behind the <em>Six Degrees of Kevin Bacon</em> was that actor <a href=https://en.wikipedia.org/wiki/Kevin_Bacon>Kevin Bacon</a> could be connected to any other actor, via a chain of association of no more than six steps.<p>Why Kevin Bacon? The choice was more or less arbitrary, but the students had noted that Bacon said in an interview that "he had worked with everybody in Hollywood or someone who's worked with them" and took that statement as a challenge.<h2 id=bacon-number><a href=#bacon-number>Bacon Number</a></h2><p>The number of steps necessary to get from some actor to Kevin Bacon is their "Bacon Number".<p>For example, comedy legend <a href=https://en.wikipedia.org/wiki/Steve_Martin>Steve Martin</a> has a Bacon Number of 1, since Kevin Bacon appeared with him in the 1987 road trip comedy <a href=https://en.wikipedia.org/wiki/Planes,_Trains_and_Automobiles>Planes, Trains and Automobiles</a>.<p><img alt=Bacon-Martin loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/5b6ba53e-d929-4705-5c68-c0ac37050d00/public><p>Zendaya has a Bacon Number of 2. 
In 2017 she appeared with <a href=https://en.wikipedia.org/wiki/Marisa_Tomei>Marisa Tomei</a> in <a href=https://en.wikipedia.org/wiki/Spider-Man:_Homecoming>Spider-Man: Homecoming</a>, and in 2005 Tomei appeared with Bacon in <a href=https://en.wikipedia.org/wiki/Loverboy_(2005_film)>Loverboy</a> (which Bacon also directed).<p><img alt=Bacon-Zendaya loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/c7994862-fe37-44ea-a287-77228a585600/public><h2 id=imdb-data><a href=#imdb-data>IMDB Data</a></h2><p>The challenge of the original '90s <em>Six Degrees of Kevin Bacon</em> was to link up two actors using only the knowledge in your head. This seems improbably difficult to me, but perhaps people were smarter in the '90s.<p>In our modern age we don't need to be smart, we can attack the Bacon Number problem by combining data and algorithms.<p>The data half of the problem is relatively straightforward -- the <a href=https://www.imdb.com/>Internet Movie Database</a> (aka IMDB) allows <a href=https://datasets.imdbws.com/>direct download</a> of the information we need.<p>In particular, the<ul><li><code>name.basics.tsv.gz</code> file for actor names and jobs<li><code>title.basics.tsv.gz</code> file for movie names and dates<li><code>title.principals.tsv.gz</code> file for relationships between actors and movies</ul><p>The IMDB database files actually include information about every job on a film (writers, directors, producers, casting, etc, etc) and we are only interested in <strong>actors</strong> for the Kevin Bacon game.</p><details><summary>ETL process to download and process raw data</summary><pre><code class=language-sql>
CREATE SCHEMA imdb;

------------------------------------------------------------------------
-- Load the names from the raw file
--

CREATE TABLE imdb.name_basics (
    nconst  text,
    primaryName  text,
    birthYear integer,
    deathYear integer,
    primaryProfession text,
    knownForTitles text
);

COPY imdb.name_basics
    FROM program 'curl https://datasets.imdbws.com/name.basics.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );
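
-- Note: COPY ... FROM PROGRAM runs on the database server, and requires
-- superuser or membership in the pg_execute_server_program role.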

CREATE INDEX name_basics_pk ON imdb.name_basics (nconst);

------------------------------------------------------------------------
-- Strip down the raw to just actors and actresses in an 'actors' table
--

CREATE TABLE actors AS
    SELECT int4(substring(nconst, 3,15)) AS actor_id,
        nconst,
        primaryname AS name,
        int2(birthyear) AS birthyear,
        int2(deathyear) AS deathyear
    FROM imdb.name_basics
    WHERE (primaryProfession ~ '^act' OR primaryProfession ~ ',act')
    AND birthyear IS NOT NULL;

CREATE UNIQUE INDEX actors_pk ON actors (actor_id);
CREATE INDEX actors_nconst_x ON actors (nconst);

------------------------------------------------------------------------
-- Load the movie titles from the raw file
--

CREATE TABLE imdb.title_basics (
    tconst text,
    titleType text,
    primaryTitle text,
    originalTitle text,
    isAdult boolean,
    startYear integer,
    endYear integer,
    runtimeMinutes integer,
    genres text
    );

COPY imdb.title_basics
    FROM program 'curl https://datasets.imdbws.com/title.basics.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

------------------------------------------------------------------------
-- Strip down the raw table to just movies, no tv shows, etc.
--

CREATE TABLE movies AS
    SELECT int4(substring(tconst, 3,15)) AS movie_id,
        tconst,
        primaryTitle AS title,
        int2(startyear) as startyear, int2(endyear) as endyear,
        runtimeminutes as runtimeminutes
    FROM imdb.title_basics
    WHERE titleType = 'movie'
      AND NOT isadult;

CREATE UNIQUE INDEX movies_pk ON movies (movie_id);
CREATE INDEX movies_tconst_x ON movies (tconst);

------------------------------------------------------------------------
-- Load the raw table of movie/job relationships
--

CREATE TABLE imdb.title_principals (
    tconst text,
    ordering integer,
    nconst text,
    category text,
    job text,
    characters text
);

COPY imdb.title_principals
    FROM program 'curl https://datasets.imdbws.com/title.principals.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

CREATE INDEX title_principals_tx ON imdb.title_principals (tconst);
CREATE INDEX title_principals_nx ON imdb.title_principals (nconst);

------------------------------------------------------------------------
-- Strip down the raw table to just the ids defining the relationship
--

CREATE TABLE movie_actors AS
    SELECT m.movie_id,
        a.actor_id
    FROM imdb.title_principals i
    JOIN actors a ON a.nconst = i.nconst
    JOIN movies m ON m.tconst = i.tconst
    WHERE i.category ~ '^act';

CREATE INDEX movie_actors_ax ON movie_actors (actor_id);
CREATE INDEX movie_actors_mx ON movie_actors (movie_id);

</code></pre></details><p>In order to make the tables smaller and functions faster, the raw files are stripped down during the ETL process, and the result is three smaller tables.<ul><li><code>actors</code> has 371,557 rows<li><code>movies</code> has 678,204 rows, and<li><code>movie_actors</code> has 1,866,533 rows.</ul><p><img alt=ERD loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/89928faa-188d-4790-ece0-7bdc7ed47000/public><h2 id=graph-solver><a href=#graph-solver>Graph Solver</a></h2><p>So, what is the algorithm we need to calculate the Bacon Number for any given actor? Look at the example above, for Steve Martin and Zendaya. Actors are joined together by movies. Taken together, the actors and movies form a <a href=https://en.wikipedia.org/wiki/Graph_theory>graph</a>! The actors are <strong>nodes</strong> of the graph and the movies are <strong>edges</strong> of the graph.<p><img alt=Graph loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/78ee614c-5a20-401f-0c9b-3ee532668000/public><p>And fortunately PostgreSQL already has a <strong>graph solver</strong> available, <a href=https://pgrouting.org/>pgRouting</a>!<p>(Wait, isn't pgRouting for solving transportation optimization problems? 
Well, that is what it is <strong>mostly</strong> used for, but it is built as a <strong>completely generic graph solver</strong>, suitable for all kinds of graph algorithms, including the <a href=https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm>Dijkstra shortest path</a> algorithm we need to calculate a Bacon Number.)<p>Alternatively, we could approach the problem directly within core PostgreSQL using a "<a href=https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-RECURSIVE>recursive CTE</a>" to walk the graph.<p>Let's look at both approaches.<h2 id=build-the-graph><a href=#build-the-graph>Build the Graph</a></h2><p>For both approaches we need to expand our table of movie/actor relationships into a table of <strong>graph edges</strong> where each edge is one pairing of actors in the same movie.<pre><code class=language-sql>CREATE TABLE actor_graph AS
SELECT
  row_number() OVER () AS edge_id,
  a.actor_id AS actor_id,
  a.movie_id AS movie_id,
  b.actor_id AS other_actor_id
FROM movie_actors a
JOIN movie_actors b ON a.movie_id = b.movie_id
WHERE a.actor_id != b.actor_id;
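
-- Note that the self-join keeps both orderings of each pair, (a, b) and
-- (b, a), so every edge appears in both directions and the graph is
-- effectively undirected.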

CREATE INDEX actor_graph_id_x ON actor_graph (actor_id);
CREATE INDEX actor_graph_other_id_x ON actor_graph (other_actor_id);
CREATE INDEX actor_graph_edge_id_x ON actor_graph (edge_id);
</code></pre><p>Self-joining the <code>movie_actors</code> table gets us a table of 11M edges that form the <code>actor_graph</code>.<pre><code class=language-sql>SELECT * FROM actor_graph LIMIT 5;
</code></pre><pre><code> edge_id | actor_id | movie_id | other_actor_id
---------+----------+----------+----------------
       1 |   951773 |  1861414 |         764895
       2 |   951773 |  1861414 |         618628
       3 |   951773 |  1861414 |         244428
       4 |   951773 |  1861414 |         258574
       5 |   951773 |  1861414 |         147923
</code></pre><h2 id=pgrouting><a href=#pgrouting>pgRouting</a></h2><p><a href=https://pgrouting.org/>pgRouting</a> is a unique solver that allows you to dynamically create the graph you will be solving on. This makes a lot of sense for a solver built into the database, since the data in a database is presumed to be fairly <strong>dynamic</strong>.<p>Every algorithm in pgRouting takes in a SQL query that generates the graph to be worked on, and parameters suitable for the algorithm chosen.<p>We are using the <a href=https://docs.pgrouting.org/latest/en/pgr_dijkstra.html>Dijkstra algorithm</a>, so the parameters are just the graph <strong>SQL</strong>, the <strong>start node</strong> and the <strong>end node</strong>.<p>Everything works off integer keys, so we start by finding the keys for two actors, <strong>Kevin Bacon</strong> and <strong>Timothée Chalamet</strong>.<pre><code class=language-sql>CREATE EXTENSION pgrouting;

SELECT actor_id FROM actors WHERE name = 'Kevin Bacon';
SELECT actor_id FROM actors WHERE name = 'Timothée Chalamet';

SELECT seq, node, edge FROM pgr_dijkstra(
    'SELECT
        a.movie_id AS id,
        a.actor_id AS source,
        a.other_actor_id AS target,
        1.0 AS cost
        FROM actor_graph a',
    3154303, -- Timothée Chalamet
    102      -- Kevin Bacon
    );
</code></pre><p>What comes back (in about 5 seconds) is the list of edges that forms the shortest path.<pre><code> seq |  node   |   edge
-----+---------+----------
   1 | 3154303 | 11286314
   2 | 2225369 |  1270798
   3 |     102 |       -1
</code></pre><p>This example is one that works against the strengths of pgRouting. We are asking for a route through a static graph, not a dynamic graph, and we are routing through a very large graph (11M edges).<p>Each run of the function will pull the whole graph out of the database, form the graph using the node keys, and then finally run the Dijkstra algorithm.<p>For our 11M record table, this takes about 5 seconds.<h2 id=recursive-cte><a href=#recursive-cte>Recursive CTE</a></h2><p>For this particular problem, it turns out that a recursive CTE is a great fit. The graph is static, and the number of steps needed to form a shortest path is quite small -- no more than 6, according to our rules.<p>The downside of a <a href=https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-RECURSIVE>recursive CTE</a> should be apparent from this example and the documentation. In a world of confusing SQL, recursive CTE SQL is the most confusing of all.<p>Here's a bare query to run the same search as we ran in pgRouting:<pre><code class=language-sql>WITH RECURSIVE bacon_numbers AS (
-- Starting nodes
SELECT
  ag.actor_id,
  ag.movie_id,
  ag.other_actor_id,
  1 AS bacon_number,
  ARRAY[ag.edge_id] AS path,
  false AS is_cycle
FROM actor_graph AS ag
WHERE actor_id = 102 -- Kevin Bacon

UNION ALL

-- Recursive set
SELECT
  bn.actor_id,
  ag.movie_id,
  ag.other_actor_id,
  bn.bacon_number + 1 AS bacon_number,
  path || ag.edge_id AS path,
  ag.edge_id = ANY(path) AS is_cycle
FROM actor_graph AS ag
JOIN bacon_numbers AS bn
  ON bn.other_actor_id = ag.actor_id
WHERE bn.bacon_number &lt;= 5
  AND NOT is_cycle
)
SELECT path FROM bacon_numbers
WHERE other_actor_id = 3154303 -- Timothée Chalamet
LIMIT 1;
</code></pre><p>That's a lot more complex! Because we are writing the traversal by hand, with a relatively blunt instrument, the query is far more verbose than the pgRouting solution. On the other hand, this solution runs in just a few hundred milliseconds, so for the Bacon problem it's clearly superior.<p>The output <code>path</code> array is an ordered list of edges that take us from the start node (Bacon) to the end node (Chalamet).<pre><code>       path
-------------------
 {2016551,4962882}
</code></pre><p>Join the edges and nodes back up to their human readable movies and actors.<pre><code class=language-sql>SELECT a.name, m.title, m.startyear, o.name
FROM unnest('{2016551,4962882}'::integer[]) p
JOIN actor_graph ag ON ag.edge_id = p
JOIN actors a USING (actor_id)
JOIN actors o ON ag.other_actor_id = o.actor_id
JOIN movies m USING (movie_id);
</code></pre><p>And we see the second order path from Kevin Bacon to Timothée Chalamet.<pre><code>      name       |        title         | startyear |       name
-----------------+----------------------+-----------+-------------------
 Kevin Bacon     | You Should Have Left |      2020 | Amanda Seyfried
 Amanda Seyfried | Love the Coopers     |      2015 | Timothée Chalamet
</code></pre><p><img alt=Bacon-Seyfried loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7ef42d9d-4377-49ae-90ab-3925125b3000/public><p>If you want to play around and try to find higher numbers, here is a complete PL/PgSQL function to simplify the process.</p><details><summary>PL/PgSQL Function for RCTE Bacon search</summary><pre><code class=language-sql>
CREATE OR REPLACE FUNCTION bacon(query_name text)
RETURNS TABLE(name text, title text, year smallint, othername text, n bigint) AS $$
DECLARE
    bacon_id INTEGER := 102;
    query_id INTEGER;
    row_count INTEGER;
BEGIN
    SELECT actor_id INTO query_id
    FROM actors
    WHERE actors.name = query_name;
    GET DIAGNOSTICS row_count = ROW_COUNT;

    IF (row_count != 1) THEN
        RAISE EXCEPTION 'Found % entries for actor %', row_count, query_name;
    END IF;

    RETURN QUERY
    WITH RECURSIVE bacon_numbers AS (
    SELECT
      ag.actor_id,
      ag.movie_id,
      ag.other_actor_id,
      1 AS bacon_number,
      ARRAY[ag.edge_id] AS path,
      false AS is_cycle
    FROM actor_graph AS ag
    WHERE actor_id = 102 -- Kevin Bacon
    UNION ALL
    SELECT
      bn.actor_id,
      ag.movie_id,
      ag.other_actor_id,
      bn.bacon_number + 1 AS bacon_number,
      path || ag.edge_id AS path,
      ag.edge_id = ANY(path) AS is_cycle
    FROM actor_graph AS ag
    JOIN bacon_numbers AS bn
      ON bn.other_actor_id = ag.actor_id
    WHERE bn.bacon_number &lt;= 5
      AND NOT is_cycle
    ),
    bacon_path AS (
        SELECT path, bacon_number FROM bacon_numbers
        WHERE other_actor_id = query_id
        LIMIT 1
    )
    SELECT a.name, m.title, m.startyear, o.name, e.n
    FROM bacon_path, unnest(path) WITH ORDINALITY e(edge_id, n)
    JOIN actor_graph ag ON ag.edge_id = e.edge_id
    JOIN actors a ON ag.actor_id = a.actor_id
    JOIN actors o ON ag.other_actor_id = o.actor_id
    JOIN movies m ON ag.movie_id = m.movie_id
    ORDER BY e.n;

END;
$$ LANGUAGE plpgsql;
</code></pre></details><p>The highest number I have found so far is a <em>3</em> for Gates McFadden, who played <em>Doctor Crusher</em> in <em>Star Trek: The Next Generation</em>.<pre><code class=language-sql>SELECT * FROM bacon('Gates McFadden');
</code></pre><pre><code>      name       |           title           | year |    othername    | n
-----------------+---------------------------+------+-----------------+---
 Kevin Bacon     | Beverly Hills Cop: Axel F | 2024 | Bronson Pinchot | 1
 Bronson Pinchot | Babes in Toyland          | 1997 | Jim Belushi     | 2
 Jim Belushi     | Taking Care of Business   | 1990 | Gates McFadden  | 3
</code></pre><p><img alt=Bacon-McFadden loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/034b01c1-c4dd-4125-517a-50c63e5b3f00/public><h2 id=conclusions><a href=#conclusions>Conclusions</a></h2><ul><li>Graph-solving problems do not require special external software; you can attack them in PostgreSQL using pgRouting or a custom recursive CTE.<li>pgRouting is good for smaller graphs and especially powerful for graphs that are generated dynamically, to reflect changes in data or changes in the desired graph.<li>Recursive CTEs can handle much larger graphs, but not large traversals of those graphs, as they tend to pile up very large intermediate results that grow with each recursion.</ul><h2 id=pictures><a href=#pictures>Pictures</a></h2><ul><li>Kevin Bacon by <a href=https://www.flickr.com/photos/gageskidmore/14781004366/>Gage Skidmore</a><li>Bronson Pinchot by Rob DiCaterino<li>Jim Belushi by COD Newsroom<li>Gates McFadden by <a href="https://commons.wikimedia.org/w/index.php?curid=130047788">Super Festivals</a><li>Marisa Tomei by Elena Ternovaja<li>Amanda Seyfried by <a href=https://commons.wikimedia.org/wiki/User:Toglenn>Glenn Francis</a><li>Steve Martin by <a href="https://commons.wikimedia.org/w/index.php?curid=137841186">David W Baker</a><li>Zendaya by <a href=https://commons.wikimedia.org/wiki/User:Toglenn>Glenn Francis</a><li>Timothee Chalamet by <a href=https://www.flickr.com/photos/terras/30755014688/>Somewhere In Toronto</a></ul> ]]></content:encoded>
<category><![CDATA[ Tutorials ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">6c7a27614732d8dbfa891c279d1bf451b4088ac2e76b9a4db2de25a724253a40</guid>
<pubDate>Mon, 05 Aug 2024 08:00:00 EDT</pubDate>
<dc:date>2024-08-05T12:00:00.000Z</dc:date>
<atom:updated>2024-08-05T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Magic Tricks for Postgres psql: Settings, Presets, Echo, and Saved Queries ]]></title>
<link>https://www.crunchydata.com/blog/magic-tricks-for-postgres-psql-settings-presets-cho-and-saved-queries</link>
<description><![CDATA[ Elizabeth has a set of tips for making your psql environment easier to work with. Find out how to save queries, echo back psql commands, and some of the psql settings to make your environment friendlier. ]]></description>
<content:encoded><![CDATA[ <p>As I’ve been working with the Postgres psql CLI, I’ve picked up a few good habits from my Crunchy Data co-workers that make my terminal database environment easier to work with. I wanted to share a couple of my favorite things I’ve found that make getting around Postgres better. If you’re just getting started with psql, or haven’t ventured too far out of the defaults, this is the post for you. I’ll walk you through some of the friendliest psql settings and how to create your own preset settings file.<h2 id=some-of-the-most-helpful-psql-commands><a href=#some-of-the-most-helpful-psql-commands>Some of the most helpful psql commands</a></h2><h3 id=formatting-psql-output><a href=#formatting-psql-output>Formatting psql output</a></h3><p>Postgres has an expanded display mode, which will display your query results as batches of columns and data, instead of a huge wide list of columns that expands the display to the right.<p>A sample expanded display looks like this:<pre><code>-[ RECORD 1 ]------------------------------
id         | 1
name       | Alice Johnson
position   | Manager
department | Sales
salary     | 75000.00
-[ RECORD 2 ]------------------------------
id         | 2
name       | Bob Smith
position   | Developer
department | Engineering
salary     | 65000.00
</code></pre><pre><code>-- Automatically format expanded display for wide columns
\x auto
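-- You can also set expanded mode explicitly
\x on
\x off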
</code></pre><p>I have a tutorial up about <a href=https://www.crunchydata.com/developers/playground/psql-basics>using basic psql</a> if you’re just getting started and want to try these commands out.<h3 id=table-column-borders-in-psql-output><a href=#table-column-borders-in-psql-output>Table column borders in psql output</a></h3><p>If you’re not using the expanded display, you can have psql do some fancy column outlines with the <code>\pset linestyle</code> setting.<pre><code>-- Outline table borders and separators using Unicode characters
\pset linestyle unicode
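-- The full outer frame in the output below also needs a higher border level
\pset border 2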
</code></pre><p>That will get you query output that looks like this:<pre><code>┌────┬───────┬─────┐
│ id │ name  │ age │
├────┼───────┼─────┤
│  1 │ Alice │  30 │
│  2 │ Bob   │  25 │
└────┴───────┴─────┘
</code></pre><h3 id=show-query-run-times-in-psql><a href=#show-query-run-times-in-psql>Show query run times in psql</a></h3><p>This will print the time each query took to run, in milliseconds, at the bottom of its results:<pre><code>-- Always show query time
\timing
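-- \timing also accepts an explicit argument instead of toggling
\timing on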
</code></pre><h3 id=create-a-preset-for-your-null-values-in-psql><a href=#create-a-preset-for-your-null-values-in-psql>Create a preset for your null values in psql</a></h3><p>This will work with emojis or really anything utf-8 compatible:<pre><code>-- Set Null char output to differentiate it from empty string
\pset null '☘️'
</code></pre><h3 id=your-psql-history><a href=#your-psql-history>Your psql history</a></h3><p>You can create a history file for your psql command sessions like this:<pre><code>-- Creates a history file for each database in your config directory CHECK IF THIS IS RIGHT
\set HISTFILE ~/.config/psql/psql_history-:DBNAME

-- Number of commands to save in history
\set HISTSIZE 2000
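
-- Make sure the history directory exists first (e.g. mkdir -p ~/.config/psql)

-- Skip storing consecutive duplicate commands
\set HISTCONTROL ignoredups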
</code></pre><h3 id=echo-psql-commands-as-sql><a href=#echo-psql-commands-as-sql>Echo PSQL commands as SQL</a></h3><p>Any psql slash command (such as <code>\d</code>) runs against Postgres’ system tables. You can use the psql echo command to display the queries used for a given command, which can give you insight about Postgres’ internal tables, catalog, and other naming conventions.<pre><code>-- output any SQL run by psql slash commands
\set ECHO_HIDDEN on
</code></pre><pre><code>-- Or start psql with the equivalent command-line flag
psql -E
</code></pre><p>Now let’s have the echo show us something. Do a table lookup with:<pre><code>\dt+
</code></pre><p>Now you’ll see that it echoes back the query it used to get this data, plus, at the bottom, the normal results of <code>\dt+</code>.<pre><code class=language-sql>SELECT n.nspname as "Schema",
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
  CASE c.relpersistence WHEN 'p' THEN 'permanent' WHEN 't' THEN 'temporary' WHEN 'u' THEN 'unlogged' END as "Persistence",
  am.amname as "Access method",
  pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid)) as "Size",
  pg_catalog.obj_description(c.oid, 'pg_class') as "Description"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam
WHERE c.relkind IN ('r','p','')
      AND n.nspname &#60> 'pg_catalog'
      AND n.nspname !~ '^pg_toast'
      AND n.nspname &#60> 'information_schema'
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;

**************************

                                    List of relations
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size  | Description
--------+---------+-------+----------+-------------+---------------+--------+-------------
 public | weather | table | postgres | permanent   | heap          | 856 kB |
(1 row)
</code></pre><h3 id=echo-all-postgres-psql-queries><a href=#echo-all-postgres-psql-queries>Echo all Postgres psql queries</a></h3><p>You can also have psql echo all queries that it runs:<pre><code>-- Have psql echo back queries
\set ECHO queries
</code></pre><pre><code>-- Or start psql with the equivalent command-line flag
psql -e
</code></pre><p>This can be useful if you’re running queries from a file, or from presets in your psqlrc, and want the query text echoed with its output for record keeping.<p>I have a web based tutorial for <a href=https://www.crunchydata.com/developers/playground/psql-echo-commands>ECHO HIDDEN and ECHO queries</a> if you want to dig into either of these more.<h2 id=set-up-your-default-psql-experience-with-psqlrc><a href=#set-up-your-default-psql-experience-with-psqlrc>Set up your default psql experience with <code>.psqlrc</code></a></h2><p>You can set up all of the things I’ve listed above to happen automatically every time you use your local psql. When psql starts, it looks for a <code>.psqlrc</code> file and if one exists, it will execute the commands within it. This allows you to customize prompts and other psql settings.<p>You can see if you have a <code>.psqlrc</code> file yet with:<pre><code>ls -l ~/.psqlrc
</code></pre><p>If you want to try adding one:<pre><code>touch ~/.psqlrc
</code></pre><p>Or edit your current file with:<pre><code>open -e ~/.psqlrc
</code></pre><p>If you want to skip the logging of commands when you start psql, you can add these to the beginning and end of your file:<pre><code>-- Don't log these commands at the beginning of the file
\set QUIET 1

-- Reset command logging at the end of the file
\set QUIET 0
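
-- Putting it together, a minimal ~/.psqlrc using the settings from
-- this post might look like:
--
--   \set QUIET 1
--   \x auto
--   \pset linestyle unicode
--   \timing
--   \pset null '☘️'
--   \set HISTFILE ~/.config/psql/psql_history-:DBNAME
--   \set QUIET 0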
</code></pre><h3 id=customizing-your-prompt-line><a href=#customizing-your-prompt-line>Customizing your prompt line</a></h3><p>The default prompt for <code>psql</code> shows your database name and not much else. In your psqlrc file, you can change the psql prompt line to use a different combination of information about the database host and session. I personally like using the date and time in here, since I’m saving sessions to refer back to later.<pre><code>-- Create a prompt with host, database name, date, and time
\set PROMPT1 '%m@%/ %`date "+%Y-%m-%d %H:%M:%S"` '
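
-- %m is the server host and %/ the current database; a simpler
-- variant (%n is the session user, %# marks superusers) might be:
-- \set PROMPT1 '%n@%m:%/%# '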
</code></pre><p>For me, this looks like:<pre><code>[local]@crunchy-dev-db 2024-07-19 15:06:37
</code></pre><h3 id=saved-queries-in-your-psqlrc-file><a href=#saved-queries-in-your-psqlrc-file>Saved queries in your psqlrc file</a></h3><p>This <code>.psqlrc</code> file is looking pretty cool, right? But wait … there’s more! You can add queries to this file so that you can just run them with a super simple psql input.<p>Add these sample queries to psqlrc for long running queries, cache hit ratio, unused_indexes, and table sizes.<pre><code class=language-sql>\set long_running 'SELECT pid, now() - pg_stat_activity.xact_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.xact_start) > interval ''5 minutes'' ORDER by 2 DESC;'
</code></pre><pre><code class=language-sql>\set cache_hit 'SELECT ''index hit rate'' AS name, (sum(idx_blks_hit)) / nullif(sum(idx_blks_hit + idx_blks_read),0) AS ratio FROM pg_statio_user_indexes UNION ALL SELECT ''table hit rate'' AS name, sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read),0) AS ratio FROM pg_statio_user_tables;'
</code></pre><pre><code class=language-sql>\set unused_indexes 'SELECT schemaname || ''.'' || relname AS table, indexrelname AS index, pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size, idx_scan as index_scans FROM pg_stat_user_indexes ui JOIN pg_index i ON ui.indexrelid = i.indexrelid WHERE NOT indisunique AND idx_scan &#60 50 AND pg_relation_size(relid) > 5 * 8192 ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST, pg_relation_size(i.indexrelid) DESC;'
</code></pre><pre><code class=language-sql>\set table_sizes 'SELECT c.relname AS name, pg_size_pretty(pg_table_size(c.oid)) AS size FROM pg_class c LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace) WHERE n.nspname NOT IN (''pg_catalog'', ''information_schema'') AND n.nspname !~ ''^pg_toast'' AND c.relkind=''r'' ORDER BY pg_table_size(c.oid) DESC;'
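
-- Another preset you might add: tables with the most sequential scans
\set seq_scans 'SELECT relname AS name, seq_scan AS scans FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;'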
</code></pre><p>Then, to execute one inside psql, use a colon and the name of the query, for example <code>:long_running</code>. If you’re using our <a href=https://www.crunchydata.com/products/crunchy-bridge>managed Postgres</a>, Crunchy Bridge, we built in a bunch of this for you with our <a href=https://www.crunchydata.com/blog/crunchy-bridge-annoucing-postgres-insights-in-your-cli>CLI insights</a>.<h2 id=experiment-with-your-psql-environment><a href=#experiment-with-your-psql-environment>Experiment with your psql environment</a></h2><p>I hope some of these things give you a few ideas about experimenting with your psql environments. It's pretty easy and fun! My tips for success:<ul><li>Prioritize helping yourself with things you use every day that take time. Is there a query you run once a week on your database? Put that in your psqlrc file so it’s right there next time.<li>Don’t go overboard if you mostly work on remote databases. If you remote in directly instead of connecting from a local psql, a heavily customized local setup won’t follow you, and switching between environments can be painful.<li>Check out our tutorials for <a href=https://www.crunchydata.com/developers/playground/psql-basics>basic psql</a> and <a href=https://www.crunchydata.com/developers/playground/psql-echo-commands>ECHO HIDDEN and ECHO queries</a> to experiment with these in a web browser</ul><p>We have a ton of other handy psql tricks in our <a href=https://www.crunchydata.com/postgres-tips>Postgres tips</a> page.<p><br><br> Thanks so much to Craig Kerstiens, David Christensen, Greg Mullane, and Reid Thompson for sharing all their psqlrc file samples and ideas with me! ]]></content:encoded>
<category><![CDATA[ Tutorials ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">244095b4654cb09985ccdaa507acd02a1917c974b42266c7b32b834ab53f1ecb</guid>
<pubDate>Fri, 19 Jul 2024 08:00:00 EDT</pubDate>
<dc:date>2024-07-19T12:00:00.000Z</dc:date>
<atom:updated>2024-07-19T12:00:00.000Z</atom:updated></item></channel></rss>