<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/spatial/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/spatial</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/spatial</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 14 Mar 2025 10:00:00 EDT</pubDate>
<dc:date>2025-03-14T14:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Pi Day PostGIS Circles ]]></title>
<link>https://www.crunchydata.com/blog/postgis-pi-circlelinestring</link>
<description><![CDATA[ For a proper Pi Day celebration in Postgres, Paul shows off a proof for CIRCULARSTRING. ]]></description>
<content:encoded><![CDATA[ <p>What's your favourite infinite sequence of non-repeating digits? There are some people who make a case for <em>e</em>, but to my mind nothing beats the transcendental and curvy utility of π, the ratio of a circle's circumference to its diameter.<p>Drawing circles is a simple thing to do in PostGIS -- take a point, and buffer it. The result is circular, and we can calculate an estimate of <em>pi</em> just by measuring the perimeter of the unit circle.<pre><code class=language-sql>SELECT ST_Buffer('POINT(0 0)', 1.0);
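-- ST_NPoints counts the vertices in the result. By default the buffer
-- builds each quarter-circle from 8 straight segments:
SELECT ST_NPoints(ST_Buffer('POINT(0 0)', 1.0));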
</code></pre><p><img alt="buffer default PostGIS"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/76c87569-4fe5-486b-4b73-3d0b072acc00/public><p>Except, look a little more closely -- this "circle" seems to be made up of short straight lines. What is the ratio of its circumference to its diameter?<pre><code class=language-sql>SELECT ST_Perimeter(ST_Buffer('POINT(0 0)', 1.0)) / 2;
</code></pre><pre><code>3.1365484905459406
</code></pre><p>That's <strong>close</strong> to <em>pi</em>, but it's <strong>not</strong> pi. Can we generate a better approximation? What if we make the edges even shorter? The third parameter to <code>ST_Buffer()</code> is the "quadsegs", the number of segments to build each quadrant of the circle.<pre><code class=language-sql>SELECT ST_Perimeter(ST_Buffer('POINT(0 0)', 1.0, quadsegs => 128)) / 2;
</code></pre><pre><code>3.1415729403671087
</code></pre><p>Much closer!<p>We can crank this process up a lot more, keep adding edges, but at some point the process becomes silly. We should just be able to say "this edge is a portion of a circle, not a straight line", and get an actual circular arc.<p>Good news, we can do exactly that! The <code>CIRCULARSTRING</code> is the curvy analogue to a <code>LINESTRING</code> wherein every connection is between three points that define a portion of a circle.<p><img alt="circular arc"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/af3fa1ce-4e4c-4a07-10ef-5e1b78974800/public><p>The circular arc above is the arc that starts at A and ends at C, passing through B. Any three points define a unique circular arc. A <code>CIRCULARSTRING</code> is a connected sequence of these arcs, just as a <code>LINESTRING</code> is a connected sequence of linear edges.<p>How does this help us get to <em>pi</em> though? Well, PostGIS has a moderate amount of support for circular arc geometry, so if we construct a circle using "natively curved" objects, we should get an exact representation of a circle rather than an approximation.<p><img alt=circle loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/6a1e46ba-4c95-47e7-30b9-27b919036200/public><p>So, what is an arc that starts and ends at the same point? A circle! This is the unit circle -- a circle of radius one centered on the origin -- expressed as a <code>CIRCULARSTRING</code>.<pre><code class=language-sql>SELECT ST_Length('CIRCULARSTRING(1 0, -1 0, 1 0)') / 2;
</code></pre><pre><code>3.141592653589793
</code></pre><p>That looks a lot like <em>pi</em>!<p>Let's bust out the built-in <code>pi()</code> function from PostgreSQL and check to be sure.<pre><code class=language-sql>SELECT pi() - ST_Length('CIRCULARSTRING(1 0, -1 0, 1 0)') / 2;
</code></pre><pre><code>0
</code></pre><p>Yep, a perfect π to celebrate "Pi Day" with! ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">8a1f454e1e702b91f52b2e6239ca07d79bcc89dd7b7c899084673a8240b30eee</guid>
<pubDate>Fri, 14 Mar 2025 10:00:00 EDT</pubDate>
<dc:date>2025-03-14T14:00:00.000Z</dc:date>
<atom:updated>2025-03-14T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Using Cloud Rasters with PostGIS ]]></title>
<link>https://www.crunchydata.com/blog/using-cloud-rasters-with-postgis</link>
<description><![CDATA[ Paul shows you how to access raster data stored in the cloud or object storage for PostGIS using cloud optimized GeoTIFF (aka COG) files. He also includes some functions for working with raster elevation. ]]></description>
<content:encoded><![CDATA[ <p>With the <code>postgis_raster</code> extension, it is possible to access gigabytes of raster data from the cloud, <strong>without ever downloading the data</strong>.<p>How? The venerable <code>postgis_raster</code> extension (released <a href=https://www.postgresql.org/about/news/postgis-200-released-1387/>13 years ago</a>) already has the critical core support built-in!<p>Rasters can be stored inside the database, or outside the database, on a local file system <strong>or</strong> anywhere it can be accessed by the underlying <a href=https://gdal.org>GDAL</a> raster support library. The <a href=https://gdal.org/en/stable/user/virtual_file_systems.html#network-based-file-systems>storage options</a> include S3, Azure, Google, Alibaba, and any HTTP server that supports <a href=https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests>RANGE requests</a>.<p>As long as the rasters are in the <a href=https://cogeo.org>cloud optimized GeoTIFF</a> (aka "COG") format, the network access to the data will be optimized and provide access performance limited mostly by the speed of connection between your database server and the cloud storage.<h2 id=tldr-it-works><a href=#tldr-it-works>TL;DR It Works</a></h2><h3 id=prepare-the-database><a href=#prepare-the-database>Prepare the Database</a></h3><p>Set up a database named <code>raster</code> with the <code>postgis</code> and <code>postgis_raster</code> extensions.<pre><code class=language-sql>CREATE EXTENSION postgis;
CREATE EXTENSION postgis_raster;

ALTER DATABASE raster
  SET postgis.gdal_enabled_drivers TO 'GTiff';

ALTER DATABASE raster
  SET postgis.enable_outdb_rasters TO true;
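
-- The ALTER DATABASE settings apply to new connections; to pick them
-- up in the current session you can also SET them directly:
SET postgis.gdal_enabled_drivers TO 'GTiff';
SET postgis.enable_outdb_rasters TO true;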
</code></pre><h3 id=investigate-the-data><a href=#investigate-the-data>Investigate The Data</a></h3><p>COG is still a new format for public agencies, so finding a large public example can be tricky. Here is a <a href=https://open.canada.ca/data/en/dataset/18752265-bda3-498c-a4ba-9dfe68cb98da>56GB COG of medium resolution (30m) elevation data for Canada</a>. <strong>Don't try and download it, it's 56GB!</strong><p><img alt="MrDEM for Canada"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/65ac139f-67b5-4239-a7f4-21db3aef3c00/public><p>You can see some metadata about the file using the <code>gdalinfo</code> utility to read the headers.<pre><code>gdalinfo /vsicurl/https://datacube-prod-data-public.s3.amazonaws.com/store/elevation/mrdem/mrdem-30/mrdem-30-dsm.tif
</code></pre><p>Note that we prefix the URL to the image with <code>/vsicurl/</code> to tell GDAL to use <a href=https://gdal.org/en/stable/user/virtual_file_systems.html>virtual file system</a> access rather than direct download.<p>There is a lot of metadata!</p><details><summary>Metadata from gdalinfo</summary><pre><code>Driver: GTiff/GeoTIFF
Files: /vsicurl/https://datacube-prod-data-public.s3.amazonaws.com/store/elevation/mrdem/mrdem-30/mrdem-30-dsm.tif
Size is 183687, 159655
Coordinate System is:
PROJCRS["NAD83(CSRS) / Canada Atlas Lambert",
    BASEGEOGCRS["NAD83(CSRS)",
        DATUM["NAD83 Canadian Spatial Reference System",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4617]],
    CONVERSION["Canada Atlas Lambert",
        METHOD["Lambert Conic Conformal (2SP)",
            ID["EPSG",9802]],
        PARAMETER["Latitude of false origin",49,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-95,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",49,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",77,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8824]],
        PARAMETER["Easting at false origin",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8826]],
        PARAMETER["Northing at false origin",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8827]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Transformation of coordinates at 5m level of accuracy."],
        AREA["Canada - onshore and offshore - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon."],
        BBOX[38.21,-141.01,86.46,-40.73]],
    ID["EPSG",3979]]
Data axis to CRS axis mapping: 1,2
Origin = (-2454000.000000000000000,3887400.000000000000000)
Pixel Size = (30.000000000000000,-30.000000000000000)
Metadata:
  TIFFTAG_DATETIME=2024:05:08 12:00:00
  AREA_OR_POINT=Area
Image Structure Metadata:
  LAYOUT=COG
  COMPRESSION=LZW
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (-2454000.000, 3887400.000) (175d38'57.51"W, 68d 7'32.00"N)
Lower Left  (-2454000.000, -902250.000) (121d27'11.17"W, 36d35'36.71"N)
Upper Right ( 3056610.000, 3887400.000) ( 10d43'16.37"W, 62d45'36.29"N)
Lower Right ( 3056610.000, -902250.000) ( 63d 0'39.68"W, 34d21' 6.31"N)
Center      (  301305.000, 1492575.000) ( 88d57'23.39"W, 62d31'56.78"N)
Band 1 Block=512x512 Type=Float32, ColorInterp=Gray
  NoData Value=-32767
  Overviews: 91843x79827, 45921x39913, 22960x19956, 11480x9978, 5740x4989, 2870x2494, 1435x1247, 717x623, 358x311
</code></pre></details><p>The key things we need to take from the metadata are that:<ul><li>the spatial reference system is "NAD83(CSRS) / Canada Atlas Lambert", "EPSG:3979"; and,<li>the blocking (tiling) is 512x512 pixels.</ul><h3 id=load-the-database-table><a href=#load-the-database-table>Load The Database Table</a></h3><p>With this metadata in hand, we are ready to load a <strong>reference</strong> to the remote data into our database, using the <code>raster2pgsql</code> utility that comes with PostGIS.<pre><code>./raster2pgsql \
  -R \
  -k \
  -s 3979 \
  -t 512x512 \
  -Y 1000 \
  /vsicurl/https://datacube-prod-data-public.s3.amazonaws.com/store/elevation/mrdem/mrdem-30/mrdem-30-dsm.tif \
  mrdem30 \
  | psql raster
</code></pre><p>That is a lot of flags! What do they mean?<ul><li><strong>-R</strong> means store references, so the pixel data is not copied into the database.<li><strong>-k</strong> means do not skip tiles that are all NODATA values. While it would be nice to skip NODATA tiles, doing so involves reading <strong>all</strong> the pixel data, which is exactly what we are trying to avoid.<li><strong>-s 3979</strong> means that the projection of our data is <a href=https://epsg.io/3979>EPSG:3979</a>, the value we got from the metadata.<li><strong>-t 512x512</strong> means to create tiles with 512x512 pixels, so that the blocking of the tiles in our database matches the blocking of the remote file. This should help lower the number of network reads any given data request requires.<li><strong>-Y 1000</strong> means to use <code>COPY</code> mode when writing out the tile definitions, and to write out batches of 1000 rows in each <code>COPY</code> block.<li>Then the URL to the cloud GeoTIFF we are referencing, with <code>/vsicurl/</code> at the front to indicate using the "curl <a href=https://gdal.org/en/stable/user/virtual_file_systems.html>virtual file system</a>".<li>Then the table name (<code>mrdem30</code>) we want to use in the database.<li>Finally we pipe the result of the command (which is just SQL text) to <code>psql</code> to load it into the <code>raster</code> database.</ul><p>When we are done, we have a table of raster tiles that looks like this in the database.<pre><code>                     Table "public.mrdem30"
 Column |  Type   | Collation | Nullable |               Default
--------+---------+-----------+----------+--------------------------------------
 rid    | integer |           | not null | nextval('mrdem30_rid_seq'::regclass)
 rast   | raster  |           |          |
Indexes:
    "mrdem30_pkey" PRIMARY KEY, btree (rid)
</code></pre><p>We should add a <code>geometry</code> index on the raster column, specifically on the bounds of each tile.<pre><code class=language-sql>CREATE INDEX mrdem30_st_convexhull_idx
  ON mrdem30 USING GIST (ST_ConvexHull(rast));
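
-- Refresh the statistics so the planner knows about the new
-- functional index:
ANALYZE mrdem30;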
</code></pre><p>This index will speed up the raster tile lookup needed when we are spatially querying.<h3 id=query-the-data><a href=#query-the-data>Query The Data</a></h3><p>The single MrDEM GeoTIFF data set is now represented in the database as a table of raster tiles.<pre><code class=language-sql>SELECT count(*) FROM mrdem30;
</code></pre><p>There are <strong>112008</strong> tiles in the collection.<p>Each tile is pretty big, spatially (512 pixels on a side, 30 meters per pixel, means a 15km tile).<p><img alt="MrDEM Tiles"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/e64368a2-8d19-4be5-454e-7fe322b69900/public><p>Each tile knows what file it references, where it is on the globe and what projection it is in.<pre><code class=language-sql>SELECT (ST_BandMetadata(rast)).*
  FROM mrdem30 OFFSET 50000 LIMIT 1;
</code></pre><pre><code>pixeltype     | 32BF
nodatavalue   | -32767
isoutdb       | t
path          | /vsicurl/https://datacube-prod-data-public.s3.amazonaws.com/store/elevation/mrdem/mrdem-30/mrdem-30-dsm.tif
outdbbandnum  | 1
filesize      | 59659542216
filetimestamp | 1718629812
</code></pre><p>The <a href=https://postgis.net/docs/RT_ST_ConvexHull.html>ST_ConvexHull()</a> function can be used to get a polygon geometry of the raster bounds.<pre><code class=language-sql>SELECT ST_AsText(ST_ConvexHull(rast))
  FROM mrdem30 OFFSET 50000 LIMIT 1;
</code></pre><pre><code>POLYGON((-2054640 -367320,-2039280 -367320,-2039280 -382680,-2054640 -382680,-2054640 -367320))
</code></pre><p>Just like geometries, raster tiles have a spatial reference id associated with them, in this case a projection that makes sense for a Canada-wide raster.<pre><code class=language-sql>SELECT ST_SRID(rast)
  FROM mrdem30 OFFSET 50000 LIMIT 1;
</code></pre><pre><code>3979
</code></pre><h3 id=query-elevation><a href=#query-elevation>Query Elevation</a></h3><p>So how do we get an elevation value from this collection of reference tiles? Easy! For any given point, we pull the tile that point falls inside, and then read off the elevation at that point.<pre><code class=language-sql>-- Make point for Toronto
-- Transform to raster coordinate system
WITH pt AS (
  SELECT ST_Transform(
    ST_Point(-79.3832, 43.6532, 4326),
    3979) AS toronto
)
-- Find the raster tile of interest,
-- and read the value of band one (there is only one band)
-- at that point.
SELECT
  ST_Value(rast, 1, toronto, resample => 'bilinear') AS elevation,
  toronto AS geom
FROM
  mrdem30, pt
WHERE ST_Intersects(ST_ConvexHull(rast), toronto);
</code></pre><p>Note that we are using "<a href=https://en.wikipedia.org/wiki/Bilinear_interpolation>bilinear interpolation</a>" in <a href=https://postgis.net/docs/RT_ST_Value.html>ST_Value()</a>, so if our point falls between pixel values, the value we get is interpolated in between the pixel values.<h3 id=query-a-larger-geometry><a href=#query-a-larger-geometry>Query a Larger Geometry</a></h3><p>What about something bigger? How about the flight line of a plane going from Victoria (YYJ) to Calgary (YYC) over the Rocky Mountains?<ul><li>Generate the points<li>Make a flight route to join them<li>Transform that route into the coordinate system of the raster<li>Pull all the rasters that touch the line and merge them into one giant raster in memory<li>Copy the values off the raster into the Z coordinate of the line<li>Dump the line into points to make a pretty picture</ul><pre><code class=language-sql>-- Create start and end points of route
-- YYJ = Victoria, YYC = Calgary
CREATE TABLE flight AS
WITH
end_pts AS (
    SELECT ST_Point(-123.3656, 48.4284, 4326) AS yyj,
           ST_Point(-114.0719, 51.0447, 4326) AS yyc
),
-- Construct line and add vertex every 10KM along great circle
-- Reproject to coordinate system of rasters
ln AS (
    SELECT ST_Transform(ST_Segmentize(
        ST_MakeLine(end_pts.yyj, end_pts.yyc)::geography,
        10000)::geometry, 3979) AS geom
    FROM end_pts
),
rast AS (
    SELECT ST_Union(rast) AS r
    FROM mrdem30, ln
    WHERE ST_Intersects(ST_ConvexHull(rast), ln.geom)
),
-- Add Z values to that line
zln AS (
    SELECT ST_SetZ(rast.r, ln.geom) AS geom
    FROM rast, ln
),
-- Dump the points of the line for the graph
zpts AS (
    SELECT (ST_DumpPoints(geom)).*
    FROM zln
)
SELECT geom, ST_Z(geom) AS elevation
FROM zpts;
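
-- As a quick sanity check, the highest elevation sampled along
-- the route:
SELECT max(elevation) FROM flight;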
</code></pre><p>From the elevated points, we can make a map showing the flight line, and the elevations along the way.<p><img alt="Elevation Profile" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/461454b9-ab9a-45ba-cf9d-d56be3be3f00/public><h2 id=why-does-it-work><a href=#why-does-it-work>Why does it work?</a></h2><p>How is it possible to read the values off of a 56GB GeoTIFF file without ever downloading the file?<h3 id=cloud-optimized-geotiff><a href=#cloud-optimized-geotiff>Cloud Optimized GeoTIFF</a></h3><p>The difference between a "cloud GeoTIFF" and a "local GeoTIFF" is mostly a difference in how software accesses the data.<ul><li><p>A local GeoTIFF probably resides on an SSD or some other storage that has fast random access. Small random reads will be fast, and so will large sequential reads. Local access is fast!<li><p>A cloud GeoTIFF resides on an "object store", a remote API that allows clients to read all of a file (with an HTTP "<a href=https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET>GET</a>") or part of a file (with an HTTP "<a href=https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests>RANGE</a>"). Each random read is quite slow, because the read involves setting up an HTTP connection (slow) and then transmitting the data over an internetwork (slow). The more reads you do, the worse performance gets. 
So the core goal of a "cloud format" is to reduce the number of reads required to access a subset of the data.</ul><p>Reading multi-gigabyte raster files from object storage is a relatively new idea, formalized only a couple years ago in the <a href=https://cogeo.org>cloud optimized GeoTIFF</a> (aka <strong>COG</strong>) specification.<p>The "cloud optimization" takes the form of just a <a href=https://cogeo.org/in-depth.html>few restrictions</a> on the ordinary <a href=https://en.wikipedia.org/wiki/GeoTIFF>GeoTIFF</a>:<ul><li>Pixel data are tiled<li>Overviews are also tiled</ul><p>Forcing tiling means that pixels that are near each other in space are also near each other in the file. Pixels that are near each other in the file can be read in a <strong>single read</strong>, which is great when you are reading from cloud object storage.<p>(Another "cloud format" shaking up the industry is <a href=https://parquet.apache.org>Parquet</a>, and <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a> can do direct access and query on Parquet for precisely the same reasons that <code>postgis_raster</code> can query COG files -- the format is structured to reduce the number of reads needed to carry out common queries.)<h3 id=gdal-virtual-file-systems><a href=#gdal-virtual-file-systems>GDAL Virtual File Systems</a></h3><p>While a "cloud optimized" format like COG or GeoParquet is cool, it is not going to be a useful cloud format without a client library that knows how to efficiently read the file. 
The client needs to be native to the application, and it needs to be parsimonious in the number of file accesses it makes.<p>For a web application, that means that COG access requires a JavaScript library that understands the GeoTIFF format.<p>For a database written in C, like PostgreSQL/PostGIS, that means that access requires a C/C++ library that understands GeoTIFF and abstracts file system operations, so that the GeoTIFF reader can support both local file system access and remote cloud access.<p>For PostGIS raster, that library is <a href=https://gdal.org>GDAL</a>. Every build of <code>postgis_raster</code> is linked to GDAL and allows us to take advantage of the library capabilities.<p>GDAL allows direct access to COG files on <a href=https://gdal.org/en/stable/user/virtual_file_systems.html#network-based-file-systems>remote cloud storage services</a>.<ul><li>Any HTTP server that supports Range requests<li>AWS S3<li>Google Cloud Storage<li>Azure Blob Storage<li>and others!</ul><p>The specific cloud service support allows things like access keys to be used for reading private objects. There is more information about accessing secure buckets with PostGIS raster in this <a href=https://www.crunchydata.com/blog/waiting-for-postgis-3.2-secure-cloud-raster-access#security-and-gdal-network-virtual-file-systems>blog post</a>.<p>Under the covers GDAL not only reads COG format files, it also maintains a <a href=https://gdal.org/en/stable/user/configoptions.html#how-to-set-configuration-options>modest in-memory data cache</a>. This means there's a performance premium to making sure your raster queries are spatially coherent (each query point is near the previous one) because this maximizes the use of cached data. ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">c1d2d638d18e65df1b0d2089b4807cd6750094d6ed3d7f2b733cfc9dfa183f82</guid>
<pubDate>Fri, 07 Feb 2025 10:30:00 EST</pubDate>
<dc:date>2025-02-07T15:30:00.000Z</dc:date>
<atom:updated>2025-02-07T15:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Running an Async Web Query Queue with Procedures and pg_cron ]]></title>
<link>https://www.crunchydata.com/blog/running-an-async-web-query-queue-with-procedures-and-pg_cron</link>
<description><![CDATA[ Paul explains the best way to run the http extension in production. ]]></description>
<content:encoded><![CDATA[ <p>The number of cool things you can do with the <a href=https://github.com/pramsey/pgsql-http>http extension</a> is large, but putting those things into production raises an important problem.<p><strong>The amount of time an HTTP request takes, 100s of milliseconds, is 10 to 20 times longer than the amount of time a normal database query takes.</strong><p>This means that an HTTP call could potentially jam up a query for a long time. I recently ran an HTTP function in an update against a relatively small 1000 record table.<p>The query took 5 minutes to run, and during that time the table was locked to other access, since the update touched every row.<p>This was fine for me on my developer database on my laptop. In a production system, it would <strong>not be fine</strong>.<h2 id=geocoding-for-example><a href=#geocoding-for-example>Geocoding, For Example</a></h2><p>A really common table layout in a spatially enabled enterprise system is a table of addresses with an associated location for each address.<pre><code class=language-sql>CREATE EXTENSION postgis;

CREATE TABLE addresses (
  pk serial PRIMARY KEY,
  address text,
  city text,
  geom geometry(Point, 4326),
  geocode jsonb
);

CREATE INDEX addresses_geom_x
  ON addresses USING GIST (geom);

INSERT INTO addresses (address, city)
  VALUES ('1650 Chandler Avenue', 'Victoria'),
         ('122 Simcoe Street', 'Victoria');
</code></pre><p>New addresses get inserted without known locations. The system needs to call an external geocoding service to get locations.<pre><code class=language-sql>SELECT * FROM addresses;
</code></pre><pre><code> pk |       address        |   city   | geom | geocode
----+----------------------+----------+------+---------
  8 | 1650 Chandler Avenue | Victoria |      |
  9 | 122 Simcoe Street    | Victoria |      |
</code></pre><p>When a new address is inserted into the system, it would be great to geocode it. A trigger would make a lot of sense, but a trigger will run in the same transaction as the insert. So the insert will block until the geocode call is complete. <strong>That could take a while.</strong> If the system is under load, inserts will pile up, all waiting for their geocodes.<h2 id=procedures-to-the-rescue><a href=#procedures-to-the-rescue>Procedures to the Rescue</a></h2><p>A better-performing approach would be to insert the address right away, and then <strong>come back later and geocode any rows that have a NULL geometry</strong>.<p>The key to such a system is being able to work through all the rows that need to be geocoded, <strong>without locking</strong> those rows for the duration. Fortunately, there is a PostgreSQL feature that does what we want, the <a href=https://www.postgresql.org/docs/current/sql-createprocedure.html>PROCEDURE</a>.<p>Unlike <strong>functions</strong>, which wrap their contents in a single, atomic transaction, <strong>procedures</strong> allow you to apply multiple commits while the procedure runs. This makes them perfect for long-running batch jobs, like our geocoding problem.<pre><code class=language-sql>CREATE PROCEDURE process_address_geocodes()
LANGUAGE plpgsql
AS $$
DECLARE
  pk_list BIGINT[];
  pk BIGINT;
BEGIN
  --
  -- Find all rows that need geocoding
  --
  SELECT array_agg(addresses.pk)
    INTO pk_list
    FROM addresses
    WHERE geocode IS NULL;

  --
  -- Geocode those rows one at a time,
  -- one transaction per row
  --
  IF pk_list IS NOT NULL THEN
    FOREACH pk IN ARRAY pk_list LOOP
      PERFORM addresses_geocode(pk);
      COMMIT;
    END LOOP;
  END IF;

END;
$$;
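
-- The procedure can be run by hand for testing. Note that
-- procedures are invoked with CALL, not SELECT:
CALL process_address_geocodes();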
</code></pre><p>The important thing is to break the work up so it is done one row at a time. Rather than running a single <code>UPDATE</code> to the table, we find all the rows that need geocoding, and loop through them, one row at a time, committing our work after each row.<h2 id=geocoding-function><a href=#geocoding-function>Geocoding Function</a></h2><p>The <code>addresses_geocode(pk)</code> function takes in a row primary key and then geocodes the address using the <a href=https://github.com/pramsey/pgsql-http>http extension</a> to call the <a href=https://developers.google.com/maps/documentation/geocoding/overview>Google Maps Geocoding API</a>. Taking in the primary key, instead of the address string, allows us to call the function one-at-a-time on each row in our working set of rows.<p>The function:<ul><li>reads the Google API key from the environment;<li>reads the address string for the row;<li>sends the geocode request to Google using the <a href=https://github.com/pramsey/pgsql-http>http</a> extension;<li>checks the validity of the response; and<li>updates the row.</ul><p>Each time through the function is atomic, so the controlling procedure can commit the result as soon as the function is complete.</p><details><summary>Geocoding function addresses_geocode(pk)</summary><pre><code class=language-sql>--
-- Take a primary key for a row, get the address string
-- for that row, geocode it, and update the geometry
-- and geocode columns with the results.
--
CREATE FUNCTION addresses_geocode(geocode_pk bigint)
RETURNS boolean
LANGUAGE 'plpgsql'
AS $$
DECLARE
  js jsonb;
  full_address text;
  res http_response;
  api_key text;
  api_uri text;
  uri text := 'https://maps.googleapis.com/maps/api/geocode/json';
  lat float8;
  lng float8;

BEGIN

  -- Fetch API key from environment
  api_key := current_setting('gmaps.api_key', true);

  IF api_key IS NULL THEN
      RAISE EXCEPTION 'addresses_geocode: the ''gmaps.api_key'' is not currently set';
  END IF;

  -- Read the address string to geocode
  SELECT concat_ws(', ', address, city)
    INTO full_address
    FROM addresses
    WHERE pk = geocode_pk
    LIMIT 1;

  -- No row, no work to do
  IF NOT FOUND THEN
    RETURN false;
  END IF;

  -- Prepare query URI
  js := jsonb_build_object(
          'address', full_address,
          'key', api_key
        );
  uri := uri || '?' || urlencode(js);

  -- Execute the HTTP request
  RAISE DEBUG 'addresses_geocode: uri [pk=%] %', geocode_pk, uri;
  res := http_get(uri);

  -- For any bad response, exit here, leaving all
  -- entries NULL
  IF res.status != 200 THEN
    RETURN false;
  END IF;

  -- Parse the geocode
  js := res.content::jsonb;

  -- Save the json geocode response
  RAISE DEBUG 'addresses_geocode: saved geocode result [pk=%]', geocode_pk;
  UPDATE addresses
    SET geocode = js
    WHERE pk = geocode_pk;

  -- For any non-usable geocode, exit here,
  -- leaving the geometry NULL
  IF js->>'status' != 'OK' OR js->'results'->>0 IS NULL THEN
    RETURN false;
  END IF;

  -- For any non-usable coordinates, exit here
  lat := js->'results'->0->'geometry'->'location'->>'lat';
  lng := js->'results'->0->'geometry'->'location'->>'lng';
  IF lat IS NULL OR lng IS NULL THEN
    RETURN false;
  END IF;

  -- Save the geocode result as a geometry
  RAISE DEBUG 'addresses_geocode: got POINT(%, %) [pk=%]', lng, lat, geocode_pk;
  UPDATE addresses
    SET geom = ST_Point(lng, lat, 4326)
    WHERE pk = geocode_pk;

  -- Done
  RETURN true;

END;
$$;
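
-- For interactive testing, the API key can be supplied per session
-- (the key value here is a placeholder) and a single row geocoded:
SET gmaps.api_key = 'your-google-api-key';
SELECT addresses_geocode(8);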
</code></pre></details><h2 id=deploy-with-pg_cron><a href=#deploy-with-pg_cron>Deploy with pg_cron</a></h2><p>We now have all the parts of a geocoding engine:<ul><li>a <strong>function</strong> to geocode a row; and,<li>a <strong>procedure</strong> that finds rows that need geocoding.</ul><p>What we need is a way to <strong>run that procedure</strong> regularly, and fortunately there is a very standard way to do that in PostgreSQL — <a href=https://github.com/citusdata/pg_cron>pg_cron</a>.<p>If you install and enable <code>pg_cron</code> in the usual way, in the <code>postgres</code> database, new jobs must be added from inside the <code>postgres</code> database, using the <code>cron.schedule_in_database()</code> function to target other databases.<pre><code class=language-sql>--
-- Schedule our procedure in the "geocode_example_db" database
--
SELECT cron.schedule_in_database(
  'geocode-process',                 -- job name
  '15 seconds',                      -- job frequency
  'CALL process_address_geocodes()', -- sql to run
  'geocode_example_db'               -- database to run in
);
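
-- Confirm the job is registered (the cron.job catalog lives in the
-- database where pg_cron is installed; jobname requires pg_cron 1.4+):
SELECT jobid, jobname, schedule, command FROM cron.job;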
</code></pre><p>Wait, <strong>15 seconds</strong> frequency? What if a process takes more than 15 seconds, won't we end up with a stampeding herd of procedure calls? Fortunately no, <code>pg_cron</code> is smart enough to check and defer if a job is already in process. So there's no major downside to calling the procedure fairly frequently.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><ul><li>HTTP and AI and BI rollup calls can run for a "long time" relative to desired database query run-times.<li>PostgreSQL <code>PROCEDURE</code> calls can be used to wrap up a collection of long running functions, putting each into an individual transaction to lower locking issues.<li><code>pg_cron</code> can be used to deploy those long running procedures, to keep the database up-to-date while keeping load and locking levels reasonable.</ul> ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<category><![CDATA[ Spatial ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">87cb83ee7ae7f80a85e4e9997d0a1e45fca8d5dee03fb478d29798def4c2303f</guid>
<pubDate>Mon, 06 Jan 2025 09:30:00 EST</pubDate>
<dc:date>2025-01-06T14:30:00.000Z</dc:date>
<atom:updated>2025-01-06T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Name Collision of the Year: Vector ]]></title>
<link>https://www.crunchydata.com/blog/name-collision-of-the-year-vector</link>
<description><![CDATA[ Elizabeth digs into the history and various uses of the vector. ]]></description>
<content:encoded><![CDATA[ <p>I can’t get through a Zoom call, a conference talk, or an afternoon scroll through LinkedIn without hearing about vectors. Do you feel like the term vector is everywhere this year? It is. <strong>Vector</strong> actually means several different things, and it's confusing. Vector means AI data, GIS locations, digital graphics, a type of query optimization, and more. The terms and uses are related, sure. They all stem from the same original concept. However, their practical applications are quite different. So “Vector” is my choice for this year’s name collision of the year.<p>In this post I want to break down the vector: its history, how vectors were used in the past, and how they evolved into what they are today (with examples!).<h2 id=the-original-vector><a href=#the-original-vector>The original vector</a></h2><p>The idea that vectors are based on goes back to the early 1600s, when René Descartes developed the Cartesian XY coordinate system to represent points in space. Descartes didn't use the word vector, but he did develop a numerical representation of location and direction. Numerical location is the foundational concept of the vector - used for measuring spatial relationships.<p>The first use of the term vector was in the 1840s by an Irish mathematician named William Rowan Hamilton. Hamilton defined a vector as a quantity with both magnitude and direction in three-dimensional space. He used it to describe geometric directions and distances, like arrows in 3D space. Hamilton combined his vectors with other mathematical constructs to solve problems involving rotation in three dimensions.<p><img alt=image.png loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/0d864a5e-a64f-4c85-36ea-8a5938420900/public><p>The word Hamilton chose, vector, comes from the Latin word <strong>vehere</strong>, meaning ‘to carry’ or ‘to convey’ (yes, the same origin as the word vehicle). 
We assume Hamilton chose this Latin word origin to emphasize the idea of a vector carrying a point from one location to another.<p>There’s a <a href=https://www.amazon.com/Vector-Surprising-Story-Mathematical-Transformation/dp/0226821102>book about the history of vectors</a> published just this year, and a <a href=https://www.siam.org/publications/siam-news/articles/the-curious-history-of-vectors-and-tensors/>nice summary here</a>. I’ve already let Santa know this is on my list this year.<h2 id=mathematical-vectors><a href=#mathematical-vectors>Mathematical vectors</a></h2><p>Building upon Hamilton’s work, vectors have been used extensively in linear algebra, both before and after the arrival of computational math. If it has been 20 years since you took a math class, here’s a quick refresher.<p>Linear algebra is a branch of mathematics that focuses on vectors, matrices, and arrays of numbers. Here’s a super simple mathematical vector equation. We have two points on an XY coordinate system, point A at (1, 2) and point B at (4, 6). The vector from A to B, shown in the diagram below, is B - A = (4 - 1, 6 - 2) = (3, 4).<p><img alt="basic math vector" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/42e42598-d0a7-44d7-5126-b0d67de19c00/public><p>Linear algebra of much more complicated forms is used in solving systems of linear differential equations. Vector equations have practical use cases in physics and engineering for things we use every day like heat conduction, fluids, and electrical circuits.<h2 id=computer-science-vectors><a href=#computer-science-vectors>Computer science vectors</a></h2><p>Early computer scientists made heavy use of the vector in a variety of ways. A computational vector can be similar to the example above, or even just a simple numeric array of fixed size where the numbers have related values. 
In early computer programming, simple operations like addition or subtraction would be applied to a set of vectors.<p>A basic example of this could be financial portfolio analysis where you have two vectors: portfolio weights, v1, showing the proportion of investment in different stocks, and market impact adjustments, v2, adjusting those weights based on current market conditions. This C code sample calculates the adjusted weight for each stock in the portfolio by adding the two vectors.<pre><code class=language-C>#include &#60stdio.h>

#define STOCKS 8

typedef float Portfolio[STOCKS];

int main() {
    // Portfolio weights (in percentages, out of 100)
    Portfolio portfolioWeights = {10.0, 20.0, 15.0, 25.0, 5.0, 10.0, 10.0, 5.0};
    // Market impact adjustments (positive or negative percentages)
    Portfolio marketAdjustments = {0.5, -0.3, 1.0, -0.5, 0.2, -0.1, 0.0, 0.7};
    Portfolio adjustedWeights;

    // Perform vector addition
    for (int i = 0; i &#60 STOCKS; i++) {
        adjustedWeights[i] = portfolioWeights[i] + marketAdjustments[i];
    }

    // Print adjusted weights
    printf("Adjusted Portfolio Weights: &#60");
    for (int i = 0; i &#60 STOCKS; i++) {
        printf("%s%.1f%%", i > 0 ? ", " : "", adjustedWeights[i]);
    }
    printf(">\n");

    return 0;
}
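
// Compiled and run, this prints:
// Adjusted Portfolio Weights: &#6010.5%, 19.7%, 16.0%, 24.5%, 5.2%, 9.9%, 10.0%, 5.7%>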
</code></pre><p>Modern computer science builds on similar concepts of organizing and processing collections. The <code>std::vector</code> in C++ and <code>Vec&#60T></code> in Rust are general-purpose dynamic arrays. They can hold virtually any data type, to help manage or compute collections of elements.<h2 id=graphics-and-vectors><a href=#graphics-and-vectors>Graphics and vectors</a></h2><p>Vector graphics were used in early arcade and video game development. Think of something like Spacewar! or Asteroids. Vectors could be used to draw lines and shapes like ships and stars.<p>Here’s a super simple example of how vectors could be used to draw a triangle.<pre><code class=language-C>// Stub macro; a real renderer would rasterize the line segment on screen
#define DrawLine(pt1, pt2)

typedef struct Point {
    int x, y;
} Point;

typedef struct Line {
    Point start;
    Point end;
} Line;

Line lines[3] = {
    {{0, 0}, {100, 100}},  // Line 1
    {{100, 100}, {200, 50}}, // Line 2
    {{200, 50}, {0, 0}}    // Line 3
};

// Loop through these points to draw our triangle on the screen.
int main()
{
    for (int i = 0; i &#60 3; i++)
    {
        DrawLine(lines[i].start, lines[i].end);
    }
    return 0;
}
</code></pre><p>These early xy arrays and computerized graphics paved the way for modern computer graphics, which make use of vectors in even more advanced ways. When you play a modern 3D video game, many of the characters, objects, and movements you see on the screen are powered by linear algebra vectors.<p>The <strong>Graphics Processing Unit (GPU)</strong> was a specialized processor developed in the 1990s and then improved on in the decades since. GPUs handle the millions of vector operations required to create 3D graphics in real time. GPUs are now used for far more than 3D graphics. Vector-based assembly operations can operate on a contiguous block of memory, doing the same operation across different chunks of memory.<p><strong>Scalable vector graphics (SVG)</strong><p>SVGs are 2D vector graphics that have become a de-facto image format in web design and development. The SVG standard allows graphics to be described with a series of numbers representing shapes and paths, working consistently across devices and web browsers. SVG graphics display logos, icons, charts, and animations. Their popularity took off in the mid-2010s and continues to grow thanks to their performance and lightweight nature.<p>SVGs use a series of numbers to describe the object they represent. A simple SVG with a few shapes might use dozens of numbers; a more complex SVG, like one for a detailed icon or map, might include thousands.<p>Here’s what the SVG of the <a href=https://www.crunchydata.com/>Crunchy Data</a> hippo logo looks like:<pre><code class=language-jsx>&#60svg
	id="aad9811e-aeeb-4dae-a064-7d889077489a"
	data-name="Layer 6"
	xmlns="http://www.w3.org/2000/svg"
	viewBox="0 0 1407.15 1158.38"
>
	&#60path
		d="M553.21,651l124.3,122.4-154.9-89Zm-304.5-496.6-54.6,148.9L35.71,415.19,6.81,523.49l-6.5,67.9,83.1,65.2h0l208.7-10.3,114.1-155.7,3.6-166,199.3-200.5-104.7-41.9Zm0,0,360.4-30.3m-104.7-41.9-114.1,61.4-130.7,213.5-105.5,150.5-70.8,149m322.9-166-145.9-135.4-222.5,62.1M294.21,642l-140.1-135.1L1,586.39m36.1-171.2,116.3,91,190.8-73.1m-95.5-278.7L259.61,357m150.1-32.4-19.4-181m218.8-19.5,14.7,196.7-59.5,137.4-49.1,104-92.7,47.2-128.8,35.9,139.8,39.3L621.21,632l62.4-196.3,16.7-174.4-92.4-136.9M621.21,632l-215-141.5,26.7,194-349.6-28m617-395.2-294.1,229.3,215,141.5m-217.1,50.2,8.6,306.7-17.5,35.7,6.1,52.8,101.7-4.8,63.5-63.9,6-47.9L588.41,792h0l89.2-18.4,97.2,23.4,84.2,19.7-2.1,46.5,10.5,30.4-19,28.9,28.1,1.9,1.6-.8,6,105.5-15.1,40.1,25.3,88.7,132.1-33-6.1-50.6,65.5-306.8,49.5-12.2,57-43,29,41.1,2.4,88.3,5.8,61.8-18.6,46.2,23.5,38.7,96.5-12.4,44.3-43.5-21.1-28.8,13.8-216.9,4-65.5,34.6-116.4-23.4-120.4-332.8-215.1L842,135l-151.2,47.5m119.9,84.8-202.4-143.1m202.4,143.1L849,552.39l134.2-214.2ZM1164,453.09l-180.8-115-42.6,277Zm-486.5,320.4,263-158.4L849,552.39Zm133.2-506.2-110.6-4-4.6,48.5,115-42.3m-133,504-154.9-89,65.7,107.4Zm170.3-25.9,35.1,87,57.6-219.4Zm117.7,83.3-25-215.8-57.6,219.4Zm-24.9-215.8,25,215.8,120.2-63.5Zm12.7,418.8,94-83.9-81.9-119.1Zm-105.5-285.6-170.3,25.2,200,47.7ZM1164,453.09l-70.6,270.3,141.1-114Zm70.5,156.3,77.8-132.8L1195,262.89Zm-251.3-271.3,180.8,115,31.1-190.2Zm67.1-168.8-67.1,168.8,211.9-75.2ZM842,135l-151.2,47.5,359.5-13.9Zm244.2,633.2,7.2-44.8m167.2-63.1,51.8-183.7-77.9,132.8Zm0,0-26.1-50.9-99.3,145.8Zm0,0,84.1-88.7-32.4-95Zm84.1-88.7-84.1,88.7,42.4-7.6Zm-22.6-226.7-9.8,131.7,32.4,95Zm0,0,22.6,226.7,62-69Zm46.3,339.3-65.3-30.2,56.7,161.5Zm-114.7,122.3,77.3-31.9-28.1-121.8Zm49.2-153.7,28.1,121.8,28.9,40.9Zm69.3-32.3-27.5-48.9,23.7,112.6ZM1331,774.59l-4.7,123.7,33.6-82.7Zm-93.9,213.3,94.5-12.7-5.4-78.4Zm16.6-181.4-30,35.1,13.4,139.9,63.4-138.2Zm0,0-33.1-115.9,3.1,150.6Zm-32.8-115.2,82.2-37.2m-73.5,249.3,7.6,84.6m94.5-12.8,43.7-42.9-49.1-35.5Zm-5.8
-79.2,29.1,7.3m-942.3,85.6-11.4,88.5,63.4-55.8Zm51.2,31.9,38.7,52.5,63.8-64.5Zm556,53.9-66.6-40.8-59.2,123.9Zm-431.6-282.8-112.2,70.4-11.4,159.3Zm-178.6,89.3,2.9,107.7,63.5-126.6Zm238-729.1,40.7-57.4L702,45.29l-13.6-32L650.11.49l-13.6,2.6-31.2,41.3-10.3,73,14.1,6.7ZM650,.49l-48.6,74.7,81.4-45.9Zm32.7,28.4L702,45.19m-19.1-15.3,5.5,64.8L647.31,110l-38.2,14.1m0,0-7.7-48.9m87-61.9-5.5,16.6L650,.59m-269.3,116-4.1-59.1-45-22.9-43.7,26.8,2.7,42.8,11.5,35.3M346.21,81l-14.6-46.5-41,69.7L346.21,81l-43.8,58.5m74.2-82.1L346.21,81l34.5,35.6m486.4,777.9,10.9,29m4.9-90.7-15.6,60.6,10.7,30.1Zm-407,32,46.7-180.3-112.9,196.7m23.2-196.6,89.7-.1,30.6-33.4M744.81,394l-10.6,113.9L849,552.39Zm-75.5,84.8L621.21,632l113.1-124.1Zm64.9,29.1-56.7,265.6m0,0,27.2-133.3-83.6-8.1Zm68.1-380.1-59.2,18m9-99.7,49.4,82.3,65.7-124.6Zm-289.2,178.9,277.3-54.9m200.3,594.7,31-31.4,50.7-168.1m-82.6,1.9,31.9,166.1,38.5,34.9M1331,774.59l-30.4,68.7,25.8,53.5M287.91,61.39l23.9,6.7"
		fill="none"
		stroke="currentColor"
		stroke-linejoin="bevel"
	/>
&#60/svg>
</code></pre><h2 id=gis-vector-data><a href=#gis-vector-data>GIS vector data</a></h2><p>In modern computational GIS, vectors are used to represent geometric data types like points, line-strings, and polygons. Like any other x,y,z coordinate system, the vectors refer to specific global points or objects. There are quite a few different spatial reference systems that can be used. The vectors are typically stored in <a href=https://www.crunchydata.com/solutions/postgis>PostGIS</a> using Well-Known Binary (WKB), a standardized binary encoding for geometries. Vectorization also powers many of the key functions in modern geospatial data processing, like intersections, distance calculations, joins, and proximity analysis.<p>Here’s the vector binary for (imho) the best BBQ restaurant in the world:<pre><code class=language-bash> restaurant_name |                        geom
-----------------+----------------------------------------------------
Gates Bar B Q    | 0101000020E610000082E673EE76A557C007B47405DB884340
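
# To turn the binary back into something readable, PostGIS's ST_AsText
# decodes the WKB (the "restaurants" table name here is hypothetical;
# the column names are the ones shown above):
#
#   SELECT restaurant_name, ST_AsText(geom) FROM restaurants;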
</code></pre><h2 id=ai-vectors><a href=#ai-vectors>AI Vectors</a></h2><p>AI vectors emerged from the mathematical and computational foundations of vectors that I covered above. Through advancements in hardware and in machine learning algorithms, vectors can be used as a system to describe virtually anything. Large Language Models (LLMs) convert data like text, images, or other inputs into vectors through a process called embedding. LLMs use layers of neural networks to process the embeddings in a specific context. So the vectors numerically represent relationships between objects within the context they were created with.<p>You’ve probably heard of the <code>pgvector</code> extension that is used for storing and querying AI-related embedding data. <a href=https://www.crunchydata.com/blog/topic/ai>pgvector</a> adds a custom data type <code>vector</code> for storing fixed-length arrays of floating-point numbers, with up to 16,000 dimensions.<p>My colleague Karen Jex gives a great talk about AI embeddings called “<a href="https://www.youtube.com/watch?v=XUMVumOzA3M">What’s the Opposite of a Corn Dog</a>”. The vector embedding for a corn dog from an OpenAI menu dataset is an array of a staggering 1536 numbers. Here’s a snippet.<pre><code class=language-sql>-- vector of a Corn Dog
[0.0045576594,-0.00088141876,-0.014024569,-0.011641564,0.0038251784,0.010306821,-0.01265076,-0.013672978,-0.01582159,-0.041670028,0.0044274405,.........0.040185533,-0.010463083,0.004326521,-0.019571891,0.01853014,0.025770308,-0.017787892,0.0018572462]
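
-- Embeddings are compared by distance. With pgvector, the &#60=> operator
-- computes cosine distance, so the "most similar" items sort first
-- (the menu table and embedding column here are hypothetical):
--
--   SELECT name FROM menu
--   ORDER BY embedding &#60=> (SELECT embedding FROM menu WHERE name = 'Corn Dog')
--   LIMIT 5;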
</code></pre><p>In AI and machine learning, a vector is an ordered list of numbers that represents data for literally anything. Really what “AI” is doing is turning anything and everything into a vector and then comparing that vector with other vectors in the same matrix.<h2 id=vectorized-queries><a href=#vectorized-queries>Vectorized queries</a></h2><p>As the use of computational vectors has become so popular along with machine learning, the underlying methods and CPU hardware for processing vector data are now used to process other kinds of data.<p>There are several databases on the market now like <a href=https://www.crunchydata.com/solutions/postgres-with-duckdb>DuckDB</a>, BigQuery, Snowflake, and <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a> that make use of vectorized query execution to speed up analytics queries. Vectorized database queries break query execution into operations over batches of same-typed values, applying each operation to a whole chunk of data at once. In a way, they’re treating columns of data like mathematical vectors. This can be much more powerful than reading data row by row. The power here also comes from the parallelization and effective CPU and IO usage.<p><img alt="vectorized queries.png" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/63909705-bb59-4405-9514-cf792eed9600/public><p>The values processed with vectorized execution are typically treated as vectors in the sense that they’re contiguous batches of data elements. Surprisingly, they do not need to represent mathematical vectors—they can be any kind of data that fits the processing model.<h2 id=vectors-are-everywhere><a href=#vectors-are-everywhere>Vectors are everywhere!</a></h2><p>Vectors are everywhere, and in a computerized context they can mean virtually anything - especially now with AI, when everything is or can be a vector.<p>Vectors and their uses are one of the main characters in the story of modern computing. 
It’s an evolution from pen-and-ink math to modern ML algorithms. The beauty of the vector is its infinite use of numeric representation: from simple concepts like a point on the globe, to computerized graphics and animation, to AI embeddings for any text or image. <br><br><h3 id=vector-use-summary><a href=#vector-use-summary>Vector use summary:</a></h3><p><img alt="vector uses.png" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/44a5573e-6d89-4285-2259-546f8a1c4900/public><p>Attributions<p><a href=https://old.maa.org/press/periodicals/convergence/mathematical-treasure-hamilton-s-lectures-on-quaternions>Hamilton’s Lecture on Vectors</a> ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<category><![CDATA[ AI ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermalink="false">21260185d81ce54e8d5d72f33634008d11788d1ac6743c61c403369394487b5b</guid>
<pubDate>Thu, 26 Dec 2024 08:30:00 EST</pubDate>
<dc:date>2024-12-26T13:30:00.000Z</dc:date>
<atom:updated>2024-12-26T13:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ PostGIS Day 2024 Summary ]]></title>
<link>https://www.crunchydata.com/blog/postgis-day-2024-summary</link>
<description><![CDATA[ Crunchy Data hosted an online event for PostGIS on November 21st, 2024. Paul has a wrap up post discussing the highlights and themes throughout the day.  ]]></description>
<content:encoded><![CDATA[ <p>In late November, on the day after GIS Day, we hosted the annual PostGIS Day online event. We had 22 speakers from around the world, in an agenda that ran from mid-afternoon in Europe to mid-afternoon on the Pacific coast.<p>It was an amazing collection of speakers, exploring all aspects of PostGIS, from highly technical specifics, to big picture culture and history. A <a href="https://youtube.com/playlist?list=PLesw5jpZchudlDbCzKtZwr5eCbvyT_FKW&#38si=BVWBmTvJ1-iy-Jd1">full playlist</a> of PostGIS Day 2024 is available on the <a href=https://www.youtube.com/@CrunchyDataPostgres>Crunchy Data YouTube channel</a>. Here’s a highlight reel of the talks and themes throughout the day.<h2 id=the-old-and-the-new><a href=#the-old-and-the-new>The Old and the New</a></h2><p>My contribution to the day was a historical look back at the <a href="https://youtu.be/aHB9labpBmk?feature=shared">history of databases and spatial databases</a>. The roots of PostGIS are the roots of PostgreSQL, and the roots of PostgreSQL in turn go back to the dawn of databases. The history of software involves a lot of coincidences, and turns on particular characters sometimes, but it’s never (too) dull!<p>Joshua Carlson delivered one of the stand-out talks of the day, exploring how he built a very old-style cartographic product–a street map with a grid-based index for finding street names–using a very new-style approach–spatial SQL to generate the grid and find the grid numbers for each street to fill in the index. Put <a href="https://youtu.be/O45Zy5zKkm8?feature=shared">Making a Dynamic Street Map Index with ST_SquareGrid</a> at the top of your video playlist.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7e54bcfa-fb9c-443b-7b79-de4d14b03a00/public><p>For the past ten years, Brian Timoney has been warning geospatial practitioners about the complexity of the systems they are delivering to end users. 
In <a href="https://youtu.be/pwtoh7IVoCk?feature=shared">Simplify, simplify, simplify</a>, Timoney both walks the walk and talks the talk, delivering denunciations of GIS dashboard mania, while building out a minimalist mapping solution using just PostGIS, SVG, and (yes!) Excel. It turns out that SVG is an excellent medium for delivering cartographic products, and you can generate them entirely in PostgreSQL/PostGIS.<p>And then, for example, you can work with them directly in MS Word! (This is, as Brian says, what customers are looking for, not a dashboard.)<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/17a94998-a327-4305-20d2-838e7a8d0200/public><p>Steve Pousty brought the <a href="https://youtu.be/QXUr-Ia7OE8?feature=shared">mandatory AI-centric talk</a>, but avoided the hype and stuck to the practicalities of the new era: what do the terms mean, what are the models for, what tools are there in PostgreSQL to make use of them, and in particular what makes sense for spatial practitioners.<h4 id=parquet-and-postgis><a href=#parquet-and-postgis>Parquet and PostGIS</a></h4><p>Our own Rekha Khandhadia showed off the power of our latest product, <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>, when combined with the massive map data available from Overture, and the analytical tools of PostGIS.<p>In <a href="https://youtu.be/1KhWJHKuNCY?feature=shared">Geospatial Analytics with GeoParquet</a>, using only SQL, she addressed the 300GB of Overture data, and ran a spatial analysis on the fly over the state of Michigan.<p>GeoParquet is the new kid on the block, with lots of folks in the researching phase.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/27e31527-df7d-408a-cf21-23d3c474bc00/public><p>Brian Loomis of Nikola Motor shared how he is <a href="https://youtu.be/ppel5KO9d7s?si=p0SBE3MKHZXfT4QP">using PostGIS/PostgreSQL to quantify</a> how much time their trucks are spending in various 
impacted communities, for reporting to the California Air Resources Board (CARB). Loomis also shares his use case for Crunchy Data Warehouse. In working with 4 billion points a day, they're using s3 to store partitioned data in Parquet. Loomis has some useful notes on Parquet file sizes and structure optimization if you're new to that topic.<h2 id=the-larger-world><a href=#the-larger-world>The Larger World</a></h2><p>PostGIS doesn’t exist in a vacuum, it’s part of a larger open ecosystem of data and other software and organizations trying to solve problems. Bonny McClain returned to PostGIS day with an update on her work on urban climate issues and using <a href="https://youtu.be/4Qw-jbzN5bc?feature=shared">SQL as an engine for public policy analysis</a>.<p>At Overture Maps, a collaboration of industry members is synthesizing a public world base map from multiple sources, and Dana Bauer and Jake Wasserman got us <a href="https://youtu.be/i1jVvVG_Y48?feature=shared">Started With Overture Maps</a>, how PostGIS can make use of the data and what is being built. At the other end of the spectrum, Felt is building end-user facing tools for spatial collaboration, and Michal Migurski walked us through a <a href="https://youtu.be/yyNMBI0bRss?feature=shared">demo of pulling climate data from a PostGIS service</a>, visualizing and story telling with the data.<p>Meanwhile, in the daily grind of GIS operations, Kurt Menke is seeing a wave of <a href="https://youtu.be/O4sJFcngk3A?feature=shared">open source adoption in Danish municipalities</a>, as QGIS and PostGIS take over and old MapInfo installations are phased out. The pattern of adoption across the nation is very interesting and Kurt provides lots of maps.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/68bfb737-57bd-4c04-391e-b6531710c300/public><p>This poll from the webinar shows a lot of QGIS use in our PostGIS Day audience! 
Not surprising, really: QGIS is the easiest desktop GIS to integrate with PostGIS.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f86af75f-6913-4af2-d44a-abfb0e6a3c00/public><p>Finally, we got to hear from Pekka Sarkola on <a href="https://youtu.be/5x5cZYu7iok?feature=shared">How to Connect PostGIS to ArcGIS</a> and the answer is “it depends”. There’s a lot of complexity in the Esri environment, lots of products, and lots of history, so the precise way you want to connect will depend on your needs. But you can do it, just remember to read the docs carefully.<p>Regina shared <a href="https://youtu.be/HHOUqztMFdQ?feature=shared">PostGIS Surprise, the Sequel</a>, a pure SQL exploration of PostGIS-related extensions.<h2 id=the-nitty-gritty><a href=#the-nitty-gritty>The Nitty Gritty</a></h2><p>Using PostGIS often means accessing and using it from another language, and Tom Payne provided a great deep dive into using <a href="https://youtu.be/KA-Z50MH3ic?si=dj4TFpFuhlxIyTuY">PostGIS from within the Go language</a>. Tom’s work on 3D geospatial is built into flight devices to warn aviators of hazards in the Swiss Alps. Also in the world of 3D, Loïc Bartoletti explained <a href="https://youtu.be/82czClBqFos?feature=shared">SFCGAL and PostGIS</a>, bringing new algorithms into PostGIS – in particular algorithms working with volumetric types and 3D data.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/320884bc-cfcc-468b-f7ab-927f04a28d00/public><p>Finally, Maxime Schoemans introduced us to the power of <a href="https://youtu.be/LQjDVnymvuA?feature=shared">Multi-entry Generalized Search Trees</a> – imagine the current PostGIS spatial indexes, but with each spatial object potentially represented with multiple index keys. 
The potential for performance improvements, as Maxime demonstrated, is very high, particularly for data involving large and complex shapes.<p>All these speakers crossed the threshold of true nitty-gritty – they talked about C and core code bindings!<h2 id=routing-and-driving><a href=#routing-and-driving>Routing and Driving</a></h2><p>Route finding and fleet management continue to be evergreen topics in the world of geospatial, as the world keeps spinning faster on more and more wheels. While it is tempting to reach for pgRouting to solve any routing problem, both Ibrahim Saricicek and Dennis Boachie Boateng counseled making sure your routing solution matches your routing problem.<p>Everyone has a favourite cost for routing, and this poll shows the PostGIS day audience pretty divided on the right one.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/2f893a8b-808b-40d8-90ef-531b2464ec00/public><p>Ibrahim provided a good comparison of different open source routing options, in a <a href="https://youtu.be/ihXHy2cWNpY?feature=shared">Survey of pgRouting and Other Open Source Routing Tools</a>.<p>And Dennis went all-in on the bespoke routing path, describing the core principles of routing, and demonstrating his own <a href="https://youtu.be/MUkA9NvvdUU?feature=shared">Custom Routing Solutions with PostGIS</a>, in particular a live example of his own mobile way-finding application.<h2 id=you-get-an-api-you-get-an-api-you-all-get-apis><a href=#you-get-an-api-you-get-an-api-you-all-get-apis>You get an API, you get an API, you all get APIs!</a></h2><p>Web APIs to PostGIS are always a rich topic, because there are a lot of them, and everyone has a favorite specification or implementation language. 
Michael Keller shared his incredibly well fleshed out <a href="https://youtu.be/iVYcJFVcZUA?feature=shared">FastCollection API</a>, a Python state-of-the-art implementation of the Open Geospatial Consortium standards, with a few extra API end points for easier web application building. We are looking forward to seeing Michael in future years, as he builds out a complete example application on top of this API.<p>Elizabeth Christensen showed off our favourite API tools, the lightweight services we use for building <a href="https://youtu.be/mR0WshjWfVY?feature=shared">Web maps from PostGIS – pg_featureserv and pg_tileserv</a>. Simplicity of deployment and interface are what distinguish these Go language services, just download and run, no dependencies, no fuss.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/8f65eb7a-2d27-479e-3f7d-fc8c3d11eb00/public><p>Martin Davis also showed off our microservices, but in the context of the Uber global hexagonal grid system. He built a live dashboard specifically to show <a href="https://youtu.be/BsFCVTBzTvY?feature=shared">Summarizing Data in H3 with PostGIS and pg_tileserv</a>. All the summary maps were generated on-the-fly, which is particularly impressive given the data on the backend.<h2 id=topological-data-models><a href=#topological-data-models>Topological Data Models</a></h2><p>Two approaches to managing data with shared boundaries were demonstrated at PostGIS day this year. The “traditional” approach was explained by Felipe Matas in <a href="https://youtu.be/mo-FKxqQ7zU?feature=shared">Simplify Space Relations like Country/State Divisions with Postgis Topology</a>. 
PostGIS comes with a built-in topology model, but understanding the moving parts can be hard, and Felipe provided a great talk with (importantly) a lot of pictures about how a topological model represents something like administrative boundaries.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/05bd398a-84f3-4dfc-9671-93629e4c9b00/public><p>Yao Cui from the British Columbia Geological Survey showed off the data model he developed 20 years ago to handle the difficult problem of keeping geological data clean while still supporting a robust data update cycle. Cui’s approach uses <a href="https://youtu.be/6gUZ46mhpZg?feature=shared">PostGIS to Facilitate Polygonal Map Integration Without Edge Matching</a>. He keeps the topology implicit, and just manages the boundaries between areas, with a little careful work in identifying the boundaries of edit areas to allow long-term data checkout and clean data check-in.<h2 id=the-curtain-closes><a href=#the-curtain-closes>The curtain closes</a></h2><p>It was an honor to once again host PostGIS Day, and we are indebted to all the great speakers who gave their time to participate. Thanks to everyone who participated in the chat and Q&#38A sessions, it was a lively experience, all 11 hours of it! ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">73a35034aff0594e9b5aca12600c22d35bf28d5f9aa2a522931e82c0ea334098</guid>
<pubDate>Wed, 27 Nov 2024 11:30:00 EST</pubDate>
<dc:date>2024-11-27T16:30:00.000Z</dc:date>
<atom:updated>2024-11-27T16:30:00.000Z</atom:updated></item></channel></rss>