<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Joe Conway | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/joe-conway/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/joe-conway</link>
<image><url>https://www.crunchydata.com/build/_assets/joe-conway.png-UY65ZZQ2.webp</url>
<title>Joe Conway | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/joe-conway</link>
<width>834</width>
<height>834</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 07 Jan 2022 04:00:00 EST</pubDate>
<dc:date>2022-01-07T09:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Kubernetes + Postgres Cluster From Scratch on Rocky 8 ]]></title>
<link>https://www.crunchydata.com/blog/kube-cluster-from-scratch-on-rocky-8</link>
<description><![CDATA[ How to create a Kubernetes cluster from scratch, including getting it to run with cgroup v2 and swap turned on. ]]></description>
<content:encoded><![CDATA[ <p><em>Co-authored by <a href=https://www.linkedin.com/in/cbrianpace/>Brian Pace</a></em><p>I was excited to hear that Kubernetes 1.22 was recently released with better support for <a href=https://github.com/kubernetes/website/pull/28106/files>cgroup-v2</a> and new support for <a href=https://github.com/kubernetes/enhancements/issues/2400>Linux swap</a>. These changes potentially resolve <a href=/blog/deep-postgresql-thoughts-the-linux-assassin>two of my chief complaints</a> about running Postgres under Kubernetes. It will obviously take some time before these features see uptake in the wild, but I wanted to become familiar with them.<p>For what it's worth, I also want to eventually play with the new alpha <a href=https://kubernetes.io/blog/2021/08/25/seccomp-default/>seccomp support</a> in Kubernetes v1.22, but will have to save that for another day. In the meantime, if you are interested in seccomp with PostgreSQL, you can see my presentation <a href=https://archive.fosdem.org/2020/schedule/event/seccomp/>from FOSDEM</a> that discusses a Postgres extension I wrote, unsurprisingly called <a href=https://github.com/CrunchyData/pgseccomp>pgseccomp</a>.<h2 id=disclaimer><a href=#disclaimer>Disclaimer</a></h2><p>I am not a Kubernetes admin by profession, but rather <a href=/blog/deep-postgresql-thoughts-resistance-to-containers-is-futile>an old database-curmudgeon(tm)</a>, so please be careful if you try this at home, and don't flame me if I don't approach everything in the canonical way.<h2 id=the-goal><a href=#the-goal>The Goal</a></h2><p>Specifically, I set out to build a three-node v1.22 Kubernetes cluster with one main control-plane node and two worker-only nodes. However, I found that virtually every "recipe" I came across for doing this (using equivalent distributions, e.g. 
CentOS 8) would result in a non-working cluster, even when not trying to enable swap and/or cgroup v2.<p>And of course, once the cluster was up and running, my next desire was to install the <a href=https://github.com/CrunchyData/postgres-operator>Crunchy Data Operator v5</a> and deploy PostgreSQL starting from the <a href=https://github.com/CrunchyData/postgres-operator-examples>examples</a>.<p>So I enlisted some help from my friend and colleague Brian Pace, and documented my own successful recipe below.<h2 id=the-journey><a href=#the-journey>The Journey</a></h2><p>First, I started with a mostly vanilla virtual machine image created with <a href=https://rockylinux.org/download/>Rocky Linux 8.4 installed from ISO</a>.<ul><li>Mostly defaults<li>Server install, with desktop<li>Enable networking<li>Enable setting time from network<li>Create local user as an admin</ul><p>After initial setup and rebooting into a working Rocky 8 base instance, I shut down the VM and made three copies of the qcow2 image. From there I created three VMs, each using one of the base-image copies.<p>On each of my three kube node VMs, I noted the MAC address for the network interface, and set up DHCP static mappings for the kube nodes by MAC address. I also set the desired hostnames -- kube01, kube02, and kube03.<h2 id=left-todo><a href=#left-todo>Left TODO</a></h2><p>Note that, in addition to seccomp, I have also punted on enabling the firewall, and on running with SELinux in enforcing mode. I hope to tackle both of those later.<h2 id=the-recipe><a href=#the-recipe>The Recipe</a></h2><p>Without further ado, what follows is my recipe.<p>Caution: steps below with the preceding comment "... on main control-plane node only" should only be run on the control-plane node (in my case, kube01), and the ones with the preceding comment "... from worker-only nodes" should only be run on the worker-only nodes. 
Also note that this setup is in a lab; a single-node control-plane configuration is never recommended for production or other critical environments.<h3 id=basic-node-setup><a href=#basic-node-setup>Basic Node Setup</a></h3><p>Unless otherwise stated, each of the steps should be performed on each host with the necessary hostname, IP, etc. modifications. Start from a fresh Rocky Linux 8 server install as outlined above, and note the MAC address for the network interface. Set up DHCP static mappings for the kube nodes by MAC address, and then change the following variable values to suit your setup:<pre><code class=language-shell>### variables setup ###

# IP Address for main control-plane node
MC_IP=&#60your-node-IP-here>

# POD network subnet
POD_NETWORK_CIDR="10.244.0.0/16"

# Hostname for the current node
MYHOSTNAME=kube01
#MYHOSTNAME=kube02
#MYHOSTNAME=kube03

# My username
LCLUSER=jconway
</code></pre><h3 id=local-user-setup><a href=#local-user-setup>Local user setup</a></h3><p>Next install ssh public key for my local username:<pre><code class=language-bash>mkdir /home/${LCLUSER}/.ssh
vi /home/${LCLUSER}/.ssh/authorized_keys
# paste desired ssh public key and save

# ssh will not be happy if permissions are not correct
chmod 700 /home/${LCLUSER}/.ssh
chmod 600 /home/${LCLUSER}/.ssh/authorized_keys
</code></pre><h3 id=node-setup><a href=#node-setup>Node setup</a></h3><p>Update the system to get the latest fixes, and reset the hostname if needed:<pre><code class=language-bash>sudo dnf update

# If needed, reset the desired hostname
# This may be required, for example, if the current
# host VM was cloned from a base image
sudo hostnamectl set-hostname ${MYHOSTNAME}
</code></pre><h3 id=kubernetes-specific-setup><a href=#kubernetes-specific-setup>Kubernetes-specific setup</a></h3><p>Allow the kubelet to run with swap on:<pre><code class=language-shell>OUTFILE=/etc/sysconfig/kubelet
sudo out=$OUTFILE bash -c 'cat &#60&#60 EOF >> $out
KUBELET_EXTRA_ARGS="--fail-swap-on=false"
EOF'
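# Aside: the env-var indirection above matters. The single quotes stop
# your shell from expanding $out, and sudo does not pass ordinary shell
# variables along; the "out=..." prefix carries the value into the root
# shell's environment. A self-contained illustration (no sudo needed,
# writes to a throwaway file):
demo=$(mktemp)
out=$demo bash -c 'echo KUBELET_EXTRA_ARGS=\"--fail-swap-on=false\" >> $out'
grep -c fail-swap-on "$demo"   # prints 1
rm -f "$demo"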
</code></pre><p>Put SELinux in permissive mode for now (but this should be fixed later!)<pre><code class=language-shell>sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
</code></pre><p>Set up the required sysctl params -- these persist across reboots<pre><code class=language-shell>OUTFILE=/etc/sysctl.d/k8s.conf
sudo out=$OUTFILE bash -c 'cat &#60&#60 EOF >> $out
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables  = 1
net.ipv4.ip_forward                 = 1
EOF'
sudo sysctl --system
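# Note: the net.bridge.* keys only exist once the br_netfilter module
# is loaded; if sysctl complains about them, run
# 'sudo modprobe br_netfilter' first.
# Spot-check that forwarding is live (reads the same knob sysctl sets):
cat /proc/sys/net/ipv4/ip_forward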
</code></pre><p>systemd doesn't use cgroup v2 by default; configure the system to use it by adding <code>systemd.unified_cgroup_hierarchy=1</code> to the kernel command line<pre><code class=language-shell>sudo dnf install -y grubby &#38&#38 \
  sudo grubby \
  --update-kernel=ALL \
  --args="systemd.unified_cgroup_hierarchy=1"
</code></pre><p>Turn on accounting controllers that are off by default; the cpu controller, at least, seems to be required for the kubelet service to function<pre><code class=language-shell>OUTFILE=/etc/systemd/system.conf
sudo out=$OUTFILE bash -c 'cat &#60&#60 EOF >> $out
DefaultCPUAccounting=yes
DefaultIOAccounting=yes
DefaultIPAccounting=yes
DefaultBlockIOAccounting=yes
EOF'
</code></pre><p>A reboot seems to be the only way to make the <code>/etc/systemd/system.conf</code> changes take effect, and a reboot is of course needed for the kernel command line change anyway<pre><code class=language-shell>sudo reboot
</code></pre><p>After the reboot we need to redo the variables setup<pre><code class=language-bash># IP Address for main control-plane node
MC_IP=&#60your-node-IP-here>

# POD network subnet
POD_NETWORK_CIDR="10.244.0.0/16"
</code></pre><p>Verify setup<pre><code class=language-bash># show swap is on
swapon --show

# check for type cgroup2
mount -l|grep cgroup

# check for cpu controller
cat /sys/fs/cgroup/cgroup.subtree_control
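# Also confirm the kernel knows about cgroup v2 at all; this prints a
# line containing "cgroup2" on any kernel recent enough for this setup
grep -w cgroup2 /proc/filesystems || echo "cgroup2 not supported"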
</code></pre><p>Install docker-ce<pre><code class=language-bash>sudo dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
sudo dnf -y remove runc
sudo dnf -y install docker-ce --nobest
</code></pre><p>Tell docker to use systemd for cgroup control<pre><code class=language-bash>sudo mkdir /etc/docker
OUTFILE=/etc/docker/daemon.json
sudo out=$OUTFILE bash -c 'cat &#60&#60 EOF >> $out
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF'
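# A stray comma or quote in daemon.json will prevent dockerd from
# starting, so it is worth validating the JSON. One quick way is to
# pipe it through python3's json.tool, shown here on an inline snippet:
printf '%s' '{"exec-opts": ["native.cgroupdriver=systemd"]}' | python3 -m json.tool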
</code></pre><p>Enable and start the docker service<pre><code class=language-bash>sudo systemctl enable --now docker
sudo systemctl status docker
</code></pre><p>Disable firewall for now (but fix later!): ports needed can be seen <a href=https://swapnasagarpradhan.medium.com/install-a-kubernetes-cluster-on-rhel8-with-conatinerd-b48b9257877a>here</a> and <a href=https://upcloud.com/community/tutorials/install-kubernetes-cluster-centos-8/>here</a>.<pre><code class=language-bash>sudo systemctl stop firewalld
sudo systemctl disable firewalld
</code></pre><p>Create the Kubernetes repo file (note that the "el7" is intentional)<pre><code class=language-bash>OUTFILE=/etc/yum.repos.d/kubernetes.repo
sudo out=$OUTFILE bash -c 'cat &#60&#60 EOF >> $out
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF'
</code></pre><p>Install Kubernetes<pre><code class=language-bash>sudo dnf -y install kubelet kubeadm kubectl --disableexcludes=kubernetes
</code></pre><p>Make systemd the kubelet cgroup driver<pre><code class=language-bash>sudo mkdir -p /var/lib/kubelet
OUTFILE=/var/lib/kubelet/config.yaml
sudo out=$OUTFILE bash -c 'cat &#60&#60 EOF >> $out
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
EOF'
</code></pre><p>Enable and start the kubelet service (note that the kubelet service fails until the init/join step is run)<pre><code class=language-bash>sudo systemctl enable --now kubelet
sudo systemctl status kubelet
</code></pre><p>Init Kubernetes on main control-plane node only. Don't forget to capture the "kubeadm join ..." output<pre><code class=language-bash>sudo kubeadm init --pod-network-cidr=${POD_NETWORK_CIDR} --apiserver-advertise-address=${MC_IP} --kubernetes-version stable-1.22 --ignore-preflight-errors="Swap"
</code></pre><p>Enable root to run kubectl on main control-plane node only<pre><code class=language-bash>sudo bash -c 'mkdir -p $HOME/.kube'
sudo bash -c 'cp -i /etc/kubernetes/admin.conf $HOME/.kube/config'
</code></pre><p>Install networking on main control-plane node only<pre><code class=language-bash>sudo kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
</code></pre><p>Join kube cluster from worker-only nodes. Note that you should use the actual token emitted during the init on the main control-plane node. If you forgot to record it, run the following command on the main control-plane node: <code>kubeadm token create --print-join-command</code><pre><code class=language-bash>sudo kubeadm join ${MC_IP}:6443 --token wze441.4v1fexq9ak1ew8eq \
        --discovery-token-ca-cert-hash sha256:44bda0ab055d721ae00a1ab8f0d45b0bf5690501209c26a810bf251688891f84 \
        --ignore-preflight-errors="Swap"
</code></pre><p>View state of play on main control-plane node<pre><code class=language-bash>sudo kubectl get nodes
sudo kubectl get deployment,svc,pods,pvc,rc,rs --all-namespaces
sudo kubectl get deployment,svc,pods,pvc,rc,rs --all-namespaces -o wide|less
</code></pre><h3 id=installing-the-crunchy-pgo-operator><a href=#installing-the-crunchy-pgo-operator>Installing the Crunchy pgo Operator</a></h3><p>At this point, you should have a fully functional Kubernetes cluster ready for you to enjoy. The next step we want to take is to install something useful, so I am going to start with the Crunchy pgo operator.<p>Install the kube cluster configuration on your local machine (in my case, my desktop)<pre><code class=language-bash>scp ${MC_IP}:/home/jconway/admin.conf $HOME/.kube/kube01.config
export KUBECONFIG=$HOME/.kube/kube01.config
kubectl get nodes
</code></pre><p>The output of the last command there should look something like this<pre><code class=language-text>NAME     STATUS   ROLES                  AGE   VERSION
kube01   Ready    control-plane,master   88d   v1.22.2
kube02   Ready    &#60none>                 88d   v1.22.2
kube03   Ready    &#60none>                 88d   v1.22.2
</code></pre><p>Grab the pgo operator examples repo from GitHub<pre><code class=language-bash>cd ${HOME}
git clone git@github.com:CrunchyData/postgres-operator-examples.git
cd postgres-operator-examples
kubectl apply -k kustomize/install
</code></pre><p>The output of the last command there should look something like this<pre><code class=language-text>namespace/postgres-operator unchanged
customresourcedefinition.apiextensions.k8s.io/postgresclusters.postgres-operator.crunchydata.com configured
serviceaccount/pgo configured
clusterrole.rbac.authorization.k8s.io/postgres-operator configured
clusterrolebinding.rbac.authorization.k8s.io/postgres-operator configured
deployment.apps/pgo configured
</code></pre><p>Install an appropriate storage class<pre><code class=language-bash>kubectl apply -f https://openebs.github.io/charts/openebs-operator.yaml
kubectl patch storageclass openebs-hostpath -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
</code></pre><p>Deploy a Crunchy PostgreSQL Pod<pre><code class=language-bash>kubectl apply -k kustomize/postgres
kubectl get pods -n postgres-operator
</code></pre><p>The output of the last command there should look something like this<pre><code class=language-text>NAME                     READY   STATUS    RESTARTS   AGE
hippo-instance1-pd9w-0   3/3     Running   0          10m
hippo-repo-host-0        1/1     Running   0          10m
pgo-69949584b9-65bqw     1/1     Running   0          10m
</code></pre><p>Exec into the Postgres pod to explore as desired<pre><code class=language-bash>kubectl exec -it -n postgres-operator -c database hippo-instance1-pd9w-0 -- bash
</code></pre><p>The operator creates a default user with the same name as the cluster (hippo in this case). To get that password you can execute the following<pre><code class=language-bash>export PGPASSWORD=$(kubectl get secret hippo-pguser-hippo -n postgres-operator -o jsonpath={.data.password} | base64 --decode)
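# Secret values are stored base64-encoded, hence the decode above.
# A self-contained round-trip demo (the password value is made up):
enc=$(printf 's3cret-pw' | base64)
printf '%s' "$enc" | base64 --decode   # prints s3cret-pw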
</code></pre><p>Create a NodePort service to allow for connectivity from outside of the Kubernetes cluster.<pre><code class=language-yaml>cat hippo-np.yaml

apiVersion: v1
kind: Service
metadata:
  name: hippo-np
spec:
  type: NodePort
  selector:
    postgres-operator.crunchydata.com/cluster: hippo
    postgres-operator.crunchydata.com/role: master
  ports:
    - protocol: TCP
      port: 5432
      targetPort: 5432
      nodePort: 30032

kubectl apply -f hippo-np.yaml -n postgres-operator
</code></pre><p>Finally connect to PostgreSQL<pre><code class=language-shell>$ export PGSSLMODE=require
$ psql -h kube02 -p 30032 -U hippo
psql (13.3, server 13.4)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

hippo=> select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)
</code></pre><p>Clean up the postgres instance only (CAUTION!)<pre><code class=language-bash>kubectl delete -k kustomize/postgres
</code></pre><p>Complete cleanup (CAUTION!)<p>Note: Before removing the operator, all postgres clusters that were created by the operator should be deleted first. Failing to do so will leave finalizers in place that cause issues during deletion.<pre><code class=language-bash>kubectl get postgrescluster --all-namespaces

kubectl delete -k kustomize/install
</code></pre><h2 id=summary><a href=#summary>Summary</a></h2><p>In this article I showed you how to create a Kubernetes cluster from scratch, including getting it to run with cgroup v2 and swap turned on.<p>I also showed the basic deployment of the <a href=https://github.com/CrunchyData/postgres-operator>Crunchy Data pgo operator</a> and a PostgreSQL pod.<p>I hope in the future to run some tests to understand how this setup behaves versus the more common (but in my opinion likely less resilient) environment with cgroup v1 and swap turned off. Stay tuned! ]]></content:encoded>
<category><![CDATA[ Kubernetes ]]></category>
<author><![CDATA[ Joe.Conway@crunchydata.com (Joe Conway) ]]></author>
<dc:creator><![CDATA[ Joe Conway ]]></dc:creator>
<guid isPermaLink="false">https://blog.crunchydata.com/blog/kube-cluster-from-scratch-on-rocky-8</guid>
<pubDate>Fri, 07 Jan 2022 04:00:00 EST</pubDate>
<dc:date>2022-01-07T09:00:00.000Z</dc:date>
<atom:updated>2022-01-07T09:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Musings of a PostgreSQL Data Pontiff Episode 2 - Hot Magick; Analytics and Graphical Output with PL/R ]]></title>
<link>https://www.crunchydata.com/blog/episode-2-hot-magick-analytics-and-graphical-output-using-plr-with-magick</link>
<description><![CDATA[ Welcome to Episode 2 of the "Musings of a PostgreSQL Data Pontiff" series! In this installment I’m aiming to achieve three objectives. ]]></description>
<content:encoded><![CDATA[ <p>Welcome to Episode 2 of the "Musings of a PostgreSQL Data Pontiff" series! In this installment I’m aiming to achieve three objectives. First, you should see how the SQL language, as implemented <a href=/blog/postgres-the-batteries-included-database>by PostgreSQL</a>, can perform interesting data analysis through the built-in aggregates and other capabilities such as <a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>Common Table Expressions</a> (CTEs) and <a href=https://access.crunchydata.com/documentation/postgresql13/13.2/tutorial-window.html>Window Functions</a>. Second, you will get to see how native SQL combines with R code in PL/R in useful ways. And finally, I’ll show how to use PL/R to tap into the R language's ability to generate visual graphics, which facilitates understanding the calculated values. So let's get started.<h2 id=background><a href=#background>Background</a></h2><p>By now you may be wondering, what's with the title of the episode? Why "Hot Magick"? Well, to begin with I thought it was catchy ;-). But seriously, it has a purpose. The example data that I will use for the rest of the episode is historical daily maximum temperature data captured by NOAA weather stations in various locations around the USA. That explains the "hot" part. The reference to "Magick" is a bow to an R package called "magick", which uses the ImageMagick library to generate and capture plotted data. More on that later.<p>One disclaimer is pertinent here, and for the rest of the episodes of this series. While I like to have real data to illustrate the capabilities of PostgreSQL and R when used together, I am not a proper "Data Scientist". In particular to the examples herein, I am additionally not a "Climate Scientist". 
So please understand that my point is not to draw definitive conclusions about climate change, it is to show off the cool capabilities of the tools I am using.<h2 id=source-data><a href=#source-data>Source Data</a></h2><p>The data can be downloaded from <a href=https://www.ncdc.noaa.gov/cdo-web/datatools/findstation>NOAA</a>. The process involves the following:<ol><li>Enter a location<li>Select "Daily Summaries" + date range + "Air Temperature"<li>Click on a specific weather station on the map<li>Add the request to the cart<li>Repeat for other weather stations as desired<li>Check out using "Custom GHCN-Daily CSV"<li>Submit order<li>Get data from download link once ready</ol><p>The downloaded data will fit neatly into a table with the following structure:<pre><code class=language-pgsql>CREATE TABLE temp_obs
(
  station text,
  name text,
  obs_date date,
  tmax int,
  tmin int,
  tobs int,
  PRIMARY KEY (obs_date, station)
);
</code></pre><p>Most of those field names should be pretty self-explanatory. The <code>tobs</code> column represents the observed temperature at a specific time of day. In at least one case that appeared to be 1600 (4 PM), but it was tedious to find that information and I did not try to look it up for each station since I was intending to base the analysis on <code>tmax</code> (i.e. the daily maximum temperature).<p>The downloaded CSV file can be loaded into the table using a command such as:<pre><code class=language-pgsql>COPY temp_obs FROM '&#60path>/&#60filename>' CSV HEADER;
</code></pre><p>where <code>'&#60path>/&#60filename>'</code> points to the downloaded CSV file. That file needs to be in a location on the PostgreSQL server readable by the postgres user in order for this to work. If your workstation (and therefore CSV file location) is different from your database server, you can also use the psql <code>\copy</code> command to load the file. Please see the PostgreSQL documentation for the specifics, but the command will look a whole lot like the one above.<h2 id=sidebar-about-capturing-graphics><a href=#sidebar-about-capturing-graphics>Sidebar: About Capturing Graphics</a></h2><p>For many years the topic of capturing graphical output from PL/R was a bit of a sticky point. The options were not great. You could either write the plot image to a file on the storage system or you could use convoluted code such as the following, to capture the image and return it to the SQL client:<pre><code class=language-r> library(cairoDevice)
  library(RGtk2)

  pixmap &#60- gdkPixmapNew(w=500, h=500, depth=24)
  asCairoDevice(pixmap)

  &#60generate a plot>

  plot_pixbuf &#60- gdkPixbufGetFromDrawable(NULL, pixmap,
                  pixmap$getColormap(), 0, 0, 0, 0, 500, 500)
  buffer &#60- gdkPixbufSaveToBufferv(plot_pixbuf, "jpeg",
                  character(0), character(0))$buffer
  return(buffer)
</code></pre><p>In this example, everything except <code>&#60generate a plot></code> is the code required to capture a 500 x 500 pixel image from memory and return it. This works well enough but has two downsides:<ul><li>The code itself is obtuse<li>This (and for that matter the file capture method) requires an X Window System</ul><p>Once you know the pattern, it’s easy enough to replicate. But often, your database is going to be running on a "headless" system (i.e. without an X Window System). This latter issue can be worked around by using a virtual frame buffer (VFB), but not every DBA has the access required to ensure that a VFB is running and configured correctly.<p>I have been looking for a better answer. The relatively new (at least compared to PL/R which has been around since 2003) "magick" package is just what is needed. The code snippet above can be rewritten like so:<pre><code class=language-r>library(magick)
fig &#60- image_graph(width = 500, height = 500, res = 96)

&#60generate a plot>

dev.off()
image_write(fig, format = "png")
</code></pre><p>This is simpler, more readable, has fewer dependencies, and importantly needs neither an X Window System nor a VFB.<h2 id=first-level-analysis><a href=#first-level-analysis>First Level Analysis</a></h2><p>OK, so now that the preliminaries are over and we have data, let's take our first look at it.<p>First we will define a PL/R function called <code>rawplot()</code>. The function will grab <code>tmax</code> ordered by observation date for a specifically requested weather station. Then it will plot that data and return the resulting graphical output as a binary stream to the SQL-level caller.<h2 id=plr-function-rawplot><a href=#plr-function-rawplot>PL/R Function <code>rawplot()</code></a></h2><pre><code class=language-pgsql>CREATE OR REPLACE FUNCTION rawplot
(
  stanum text,
  staname text,
  startdate date,
  enddate date
)
returns bytea AS $$

  library(RPostgreSQL)
  library(magick)
  library(ggplot2)

  fig &#60- image_graph(width = 1600, height = 900, res = 96)

  sqltmpl = "
     SELECT obs_date, tmax
     FROM temp_obs
     WHERE tmax is not null
     AND station = '%s'
     AND obs_date >= '%s'
     AND obs_date &#60 '%s'
     ORDER BY 1
  "

  sql = sprintf(sqltmpl, stanum, startdate, enddate)
  drv &#60- dbDriver("PostgreSQL")
  conn &#60- dbConnect(drv, user="postgres", dbname="demo",
                    port = "55610", host = "/tmp")
  data &#60- dbGetQuery(conn, sql)
  dbDisconnect(conn)
  data$obs_date &#60- as.Date(data$obs_date)
  data$mean_tmax &#60- rep(mean(data$tmax),length(data$tmax))

  print(ggplot(data, aes(x = obs_date)) +
   scale_color_manual(values=c("darkorchid4", "darkgreen")) +
   geom_point(aes(y = tmax, color = "tmax")) +
   geom_line(aes(y = mean_tmax, color = "mean"), size=2) +
   labs(title = sprintf("Max Temperature by Day - %s", staname),
        y = "Temp(F)", x = "Date") +
   theme_bw(base_size = 15) +
   geom_smooth(aes(y = tmax)))

  dev.off()
  image_write(fig, format = "png")
$$ LANGUAGE plr;
</code></pre><p>There is a lot there to absorb, so let's break it down a few lines at a time.<pre><code class=language-pgsql>CREATE OR REPLACE FUNCTION rawplot
(
  stanum text,
  staname text,
  startdate date,
  enddate date
)
returns bytea AS $$
  [...]
$$ LANGUAGE plr;
</code></pre><p>The SQL function declaration code is about the same as for any SQL function defined in PostgreSQL. It starts with CREATE OR REPLACE which allows us to redefine the guts, but not the SQL interface, of the function without having to drop it first. Here we are passing four arguments into the function which serve to select the weather station of interest. These provide the verbose station name for the plot, and limit the rows to the desired date range.<p>Note that the function returns the data type bytea, which means "byte array", or in other words binary. PL/R has special functionality with respect to functions which return bytea. Instead of converting the R data type to its corresponding PostgreSQL data type, PL/R will  return the entire R object in binary form. When returning a plot object from R, the image binary is wrapped by some identifying information which must be stripped off  to get at the image itself. But more on that later.<p>The function body (the part written in R code) starts by loading three libraries.<pre><code class=language-r> library(RPostgreSQL)
  library(magick)
  library(ggplot2)
</code></pre><p>The magick package mentioned earlier  will be used to capture the plot graphic. ggplot2 is used to create nice looking plots with a great deal of power and flexibility. We could spend several episodes of this blog series on ggplot alone, but there is plenty of information about using ggplot available so we will gloss over its usage for the most part.<p>The RPostgreSQL package is, strictly speaking, not required in a PL/R function. The reason for its use here is that it is required when testing the code from an R client. As is often the case, it was convenient to develop the R code directly in R first, and then paste it into a PL/R function. PL/R includes compatibility functions which allow RPostgreSQL syntax to be used when querying data from PostgreSQL.<p>As also mentioned earlier, a few lines of the R code use the magick package to capture and return the plot graphic. Those are the following:<pre><code class=language-r> fig &#60- image_graph(width = 1600, height = 900, res = 96)

  [...the "get data" and "generate plot" code goes here...]

 dev.off()

 image_write(fig, format = "png")
</code></pre><p>The next section of code queries the data of interest and formats it as required for plotting:<pre><code class=language-r> sqltmpl = "
     SELECT obs_date, tmax
     FROM temp_obs
     WHERE tmax is not null
     AND station = '%s'
     AND obs_date >= '%s'
     AND obs_date &#60 '%s'
     ORDER BY 1
  "
  sql = sprintf(sqltmpl, stanum, startdate, enddate)
  drv &#60- dbDriver("PostgreSQL")
  conn &#60- dbConnect(drv, user="postgres", dbname="demo",
                    port = "55610", host = "/tmp")
  data &#60- dbGetQuery(conn, sql)
  dbDisconnect(conn)
  data$obs_date &#60- as.Date(data$obs_date)
  data$mean_tmax &#60- rep(mean(data$tmax),length(data$tmax))
</code></pre><p>First we see <code>sqltmpl</code> which is a template for the main data access SQL query. There are replaceable parameters for station, first date, and end date to pull back. The next line assigns the parameters into the template to create our fully formed SQL stored in the <code>sql</code> variable.<p>The next four lines connect to PostgreSQL and execute the query. Only the <code>dbGetQuery()</code> line does anything in PL/R, but the other three lines are needed if we are querying from an R client.<p>The last two lines in that stanza ensure our date column is in an R native form and generates/populates a mean column for us.<p>Finally we have the code that actually creates the plot itself:<pre><code class=language-r> print(ggplot(data, aes(x = obs_date)) +
   scale_color_manual(values=c("darkorchid4", "darkgreen")) +
   geom_point(aes(y = tmax, color = "tmax")) +
   geom_line(aes(y = mean_tmax, color = "mean"), size=2) +
   labs(title = sprintf("Max Temperature by Day - %s", staname),
        y = "Temp(F)", x = "Date") +
   theme_bw(base_size = 15) +
   geom_smooth(aes(y = tmax)))
</code></pre><p>As I said above, the details of using ggplot are left to the reader to decipher. Yet, one bit worth mentioning, in part because it took me quite some time to figure out, is that the <code>ggplot()</code> call must be wrapped in the <code>print()</code> call as shown. Otherwise the returned plot will be empty. I finally found a reason for this fact buried in the ggplot documentation which said "Call <code>print()</code> explicitly if you want to draw a plot inside a function or for loop." When you create a PL/R function, the R code body is placed inside a named R function, thus this rule applies.<h2 id=executing-rawplot><a href=#executing-rawplot>Executing <code>rawplot()</code></a></h2><p>Now let's see a couple of examples which call our <code>rawplot()</code> function:<pre><code class=language-pgsql>SELECT octet_length(plr_get_raw(rawplot('USC00042706',
                                        'EL CAJON, CA US',
                                        '1979-10-01',
                                        '2020-10-01')));
 octet_length 
--------------
       369521
(1 row)

DO
$$
 DECLARE   
  stanum text = 'USC00042706';
  staname text = 'EL CAJON, CA US';
  startdate date = '1979-10-01';
  enddate date = '2020-10-01';
  l_lob_id OID;
  r record;
 BEGIN
  for r in
   SELECT
    plr_get_raw(rawplot(stanum, staname, startdate, enddate)) as img
  LOOP
    l_lob_id:=lo_from_bytea(0,r.img);
    PERFORM lo_export(l_lob_id,'/tmp/rawplot.png');
    PERFORM lo_unlink(l_lob_id);
  END LOOP;
 END;
$$;
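
-- Note: lo_export() runs inside the server, so /tmp/rawplot.png lands on
-- the database server's filesystem (written as the server's OS user),
-- not on the client machine. From a remote client, psql's \lo_export
-- meta-command is the client-side equivalent.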
</code></pre><p>The first example shows that the total <code>octet_length</code> (i.e. size in bytes) of the returned image is 369521. The second query is a somewhat convoluted method of taking the streamed binary from the query and directing it to a file on disk from psql. Why not create the file directly in the PL/R function you say? Well, it is an easy way to grab the images for the purposes of this blog. If I were doing this work for some "real world" purpose, presumably I would be streaming the image binary to someone's browser or something similar.<p>The resulting image looks like this:<p><img alt=rawplot loading=lazy src=https://f.hubspotusercontent00.net/hubfs/2283855/rawplot.png><h2 id=second-level-analysis><a href=#second-level-analysis>Second Level Analysis</a></h2><p>We have seen what the raw data looks like, which is a good start. But now we will dive a bit deeper using mostly good old SQL, although still wrapped by a PL/R function so that we can visualize the result. This function gives us a count of the days on which the maximum temperature was 100 degrees F or greater per year.<pre><code class=language-pgsql>CREATE OR REPLACE FUNCTION count100plus
(
  stanum text,
  staname text,
  startdate date,
  enddate date
)
returns bytea AS $$
  library(RPostgreSQL)
  library(magick)
  library(ggplot2)

  fig &#60- image_graph(width = 1600, height = 900, res = 96)
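  # image_graph() opens an in-memory magick graphics device: everything
  # plotted before the dev.off() below is captured into fig, and the
  # final image_write() with no path returns the image as a raw vector
  # of PNG bytes rather than writing a file.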
  sqltmpl = "
     SELECT
       extract(isoyear from obs_date) AS year,
       count(tmax) as tcnt
     FROM temp_obs
     WHERE tmax is not null
     AND station = '%s'
     AND obs_date >= '%s'
     AND obs_date &#60 '%s'
     AND tmax >= 100
     GROUP BY 1
     ORDER BY 1
  "
  sql = sprintf(sqltmpl, stanum, startdate, enddate)
  drv &#60- dbDriver("PostgreSQL")
  conn &#60- dbConnect(drv, user="postgres", dbname="demo",
                    port = "55610", host = "/tmp")
  data &#60- dbGetQuery(conn, sql)
  dbDisconnect(conn)
  data$mean_tcnt &#60- rep(mean(data$tcnt),length(data$tcnt))

  print(ggplot(data, aes(x = year)) +
   scale_color_manual(values=c("darkorchid4", "darkgreen")) +
   geom_point(aes(y = tcnt, color = "tcnt")) +
   geom_line(aes(y = mean_tcnt, color = "mean"), size=2) +
   labs(title = sprintf("Count of 100+ Degree(F) days by Year - %s", staname),
        y = "Count", x = "Year") +
   theme_bw(base_size = 15) +
   geom_smooth(aes(y = tcnt)))

  dev.off()
  image_write(fig, format = "png")
$$ LANGUAGE plr;

DO
$$
 DECLARE   
  stanum text = 'USC00042706';
  staname text = 'EL CAJON, CA US';
  startdate date = '1979-10-01';
  enddate date = '2020-10-01';
  l_lob_id OID;
  r record;
 BEGIN
  for r in
   SELECT
    plr_get_raw(count100plus(stanum, staname, startdate, enddate)) as img
  LOOP
    l_lob_id:=lo_from_bytea(0,r.img);
    PERFORM lo_export(l_lob_id,'/tmp/count100plus.png');
    PERFORM lo_unlink(l_lob_id);
  END LOOP;
 END;
$$;
</code></pre><p>Everything here is essentially identical to <code>rawplot()</code> except for:<pre><code class=language-r> sqltmpl = "
     SELECT
       extract(isoyear from obs_date) AS year,
       count(tmax) as tcnt
     FROM temp_obs
     WHERE tmax is not null
     AND station = '%s'
     AND obs_date >= '%s'
     AND obs_date &#60 '%s'
     AND tmax >= 100
     GROUP BY 1
     ORDER BY 1
  "
</code></pre><p>This SQL statement does the core work of aggregating, by year, the count of days with a maximum temperature of 100 F or greater.<p>The resulting image looks like this:<p><img alt=count100plus loading=lazy src=https://f.hubspotusercontent00.net/hubfs/2283855/count100plus.png><h2 id=third-level-analysis><a href=#third-level-analysis>Third Level Analysis</a></h2><p>Finally we will dive even deeper. This will need a bit of background.<h3 id=statistical-process-control><a href=#statistical-process-control>Statistical Process Control</a></h3><p>Years before electronic computers were available (back when "computers" were people or mechanical devices), Walter Shewhart at <a href=https://en.wikipedia.org/wiki/Bell_Labs>Bell Labs</a> pioneered <dfn>Statistical Process Control</dfn> (<abbr>SPC</abbr>). It was later promoted and further developed by W. Edwards Deming. See more about the history of SPC <a href=https://en.wikipedia.org/wiki/Statistical_process_control#History>here</a>. We won't go into any depth about all that, but let's just say that the following analysis is derived from SPC techniques.<h3 id=the-central-limit-theorem><a href=#the-central-limit-theorem>The Central Limit Theorem</a></h3><p>One of the premises of this type of analysis is that the data follows a normal distribution. To ensure at least an approximate normal distribution, SPC typically relies on the <a href=https://statisticsbyjim.com/basics/central-limit-theorem/>Central Limit Theorem</a>. In other words, grouping the raw data into samples makes the resulting sample statistics tend toward the desired normal form.<h3 id=standard-scores><a href=#standard-scores>Standard Scores</a></h3><p>Another problem we need to solve with temperature data is that it changes seasonally. We cannot very well expect the maximum temperature in week 1 (early January) to be the same as in week 32 (mid-August). 
The approach taken to deal with that is the <a href=https://en.wikipedia.org/wiki/Standard_score>Standard Score</a>, or Z score.<h3 id=overall-approach><a href=#overall-approach>Overall Approach</a></h3><p>Combining these things, the overall approach taken is something like this:<ul><li>For each week of all years in the dataset, by <a href=https://en.wikipedia.org/wiki/ISO_week_date>ISO week number</a>, determine the average <code>tmax</code> (<code>xb</code> or "x-bar") and the range of <code>tmax</code> values (<code>r</code>).<li>For each week number (1-53), determine the overall average (sometimes called the "grand average", or <code>xbb</code>, or "x-bar-bar"), the average range (<code>rb</code> or "r-bar"), and the standard deviation (<code>sd</code>).<li>Standardize the weekly values using the per-week-number statistics.<li>Combine all the weekly group data onto a single plot across all the years.</ul><h3 id=custom-auto-loaded-r-code><a href=#custom-auto-loaded-r-code>Custom Auto-loaded R Code</a></h3><p>Please permit me another digression before getting into the "final" solution. Sometimes there is R code that ideally would be common to, and reused by, multiple PL/R functions. Fortunately PL/R provides a convenient way to do that. A special table, <code>plr_modules</code>, if it exists, is presumed to contain R functions. These functions are fetched and loaded into the R interpreter on initialization. The <code>plr_modules</code> table is defined as follows:<pre><code class=language-pgsql>CREATE TABLE plr_modules
(
  modseq int4,
  modsrc text
);
</code></pre><p>Here <code>modseq</code> is used to control the order of installation and <code>modsrc</code> contains the text of the R code to be executed. <code>plr_modules</code> must be readable by all, but it is wise to make it owned and writable only by the database administrator. Note that you can use <code>reload_plr_modules()</code> to force re-loading of the <code>plr_modules</code> table rows into the current session's R interpreter.<p>Getting back to our problem at hand, the following will create an R function which can summarize and mutate our raw data in the way described in the previous section.<pre><code class=language-pgsql>INSERT INTO plr_modules VALUES (0, $m$
obsdata &#60- function(stanum, startdate, enddate)
{
  library(RPostgreSQL)
  library(reshape2)

  sqltmpl = "
   WITH
   g (year, week, xb, r) AS
   (
     SELECT
       extract(isoyear from obs_date) AS year,
       extract(week from obs_date) AS week,
       avg(tmax) as xb,
       max(tmax) - min(tmax) as r
     FROM temp_obs
     WHERE tmax is not null
     AND station = '%s'
     AND obs_date >= '%s'
     AND obs_date &#60 '%s'
     GROUP BY 1, 2
   ),
   s (week, xbb, rb, sd) AS
   (
     SELECT
       week,
       avg(xb) AS xbb,
       avg(r) AS rb,
       stddev_samp(xb) AS sd
     FROM g
     GROUP BY week
   ),
   z (year, week, zxb, zr, xbb, rb, sd) AS
   (
     SELECT
       g.year,
       g.week,  
       (g.xb - s.xbb) / s.sd AS zxb,
       (g.r - s.rb) / s.sd AS zr,
       s.xbb,
       s.rb,
       s.sd
     FROM g JOIN s ON g.week = s.week
   )
   SELECT
    year,
    week,
    CASE WHEN week &#60 10 THEN
     year::text || '-W0' || week::text || '-1'
    ELSE
     year::text || '-W' || week::text || '-1'
    END AS idate,
    zxb,
    0.0 AS zxbb,
    3.0 AS ucl,
    -3.0 AS lcl,
    zr,
    xbb,
    rb,
    sd
   FROM z
   ORDER BY 1, 2
  "

  sql = sprintf(sqltmpl, stanum, startdate, enddate)

  drv &#60- dbDriver("PostgreSQL")
  conn &#60- dbConnect(drv, user="postgres", dbname="demo", port = "55610", host = "/tmp")
  data &#60- dbGetQuery(conn, sql)
  dbDisconnect(conn)

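  # ISOweek2date() converts strings like "2020-W05-1" (ISO year, week,
  # weekday) into real R Date values (the Monday of each ISO week),
  # giving ggplot a plottable x axis.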
  data$idate &#60- ISOweek::ISOweek2date(data$idate)

  return(data)
}
$m$);
SELECT reload_plr_modules();
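
-- A quick aside on quoting: the $m$ ... $m$ pair above is a dollar-quoted
-- string literal, so the R code between the delimiters may contain single
-- quotes without any doubling/escaping. For illustration, these two
-- (hypothetical) literals are equivalent:
--   SELECT 'It''s quoted';
--   SELECT $q$It's quoted$q$;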
</code></pre><p>There is a lot going on in that R function, but almost all the interesting bits are in the templatized SQL. The SQL statement builds up incrementally with a series of <a href=https://www.postgresql.org/docs/current/queries-with.html><dfn>Common Table Expressions</dfn> (<abbr>CTEs</abbr>)</a>. This is a good example of how powerful native PostgreSQL functionality is.<p>Before examining the SQL statement, another quick side-bar. Note the use of <code>$m$</code> around the R code. The encapsulated R code is a <a href=https://www.postgresql.org/docs/12/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING>Dollar-Quoted String Constant</a>. Dollar quoting is particularly useful when dealing with long string constants that might have embedded quotes which have meaning when the string gets evaluated. Rather than doubling the quotes (or doubling the doubling, etc.), two dollar signs with a "tag" of length zero or more (in this case "m") in between are used to delimit the string. If you were paying attention, you might have already noticed the <code>$$</code> delimiters used in the previous PL/R function definitions and even in the <code>DO</code> statements, for the same reason. This is one of the coolest unsung features of PostgreSQL in my humble opinion.<p>Anyway, the first stanza<pre><code class=language-pgsql> g (year, week, xb, r) AS
  (
    SELECT
      extract(isoyear from obs_date) AS year,
      extract(week from obs_date) AS week,
      avg(tmax) as xb,
      max(tmax) - min(tmax) as r
    FROM temp_obs
    WHERE tmax is not null
    AND station = '%s'
    AND obs_date >= '%s'
     AND obs_date &#60 '%s'
    GROUP BY 1, 2
   ),
</code></pre><p>finds our weekly average and range of the daily maximum temperatures across all the years of the selected date range for the selected weather station.<p>The second stanza<pre><code class=language-pgsql> s (week, xbb, rb, sd) AS
  (
    SELECT
      week,
      avg(xb) AS xbb,
      avg(r) AS rb,
      stddev_samp(xb) AS sd
    FROM g
    GROUP BY week
   ),
</code></pre><p>takes the weekly summarized data and further summarizes it by week number across all the years. In other words, for week N we wind up with a grand average of the daily maximum temperatures, an average of the weekly maximum temperature ranges, and the standard deviation of the weekly average maximum temperatures. Whew, that was a mouthful; hopefully the explanation was clear enough. The assumption here is that for a given week of the year we can reasonably expect the temperature to be consistent from year to year, and so these statistics will help us see trends that are non-random across the years.<p>The third stanza<pre><code class=language-pgsql>  z (year, week, zxb, zr, xbb, rb, sd) AS
   (
     SELECT
       g.year,
       g.week,  
       (g.xb - s.xbb) / s.sd AS zxb,
       (g.r - s.rb) / s.sd AS zr,
       s.xbb,
       s.rb,
       s.sd
     FROM g JOIN s ON g.week = s.week
   )
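   -- Worked example with made-up numbers: if week 32's grand average is
   -- xbb = 88.0 F with sd = 3.5, and one year's weekly mean is xb = 95.0,
   -- then zxb = (95.0 - 88.0) / 3.5 = 2.0, i.e. that week ran two
   -- standard deviations hotter than is typical for week 32.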
</code></pre><p>calculates standard score values for the per-week data. In other words, the data is rescaled based on its distance, in standard deviations, from the grand average. This, in theory at least, allows us to meaningfully compare data from week 1 to week 32, for example.<p>The final stanza<pre><code class=language-pgsql>  SELECT
    year,
    week,
    CASE WHEN week &#60 10 THEN
     year::text || '-W0' || week::text || '-1'
    ELSE
     year::text || '-W' || week::text || '-1'
    END AS idate,
    zxb,
    0.0 AS zxbb,
    3.0 AS ucl,
    -3.0 AS lcl,
    zr,
    xbb,
    rb,
    sd
   FROM z
   ORDER BY 1, 2
</code></pre><p>pulls it all together and adds a few calculated columns for our later convenience when we plot the output.<h2 id=the-final-plot-functions><a href=#the-final-plot-functions>The Final Plot Functions</a></h2><p>Now we can finally create the functions which will produce pretty plots for the third level analytics.<pre><code class=language-pgsql>CREATE OR REPLACE FUNCTION plot_xb
(
  stanum text,
  staname text,
  startdate date,
  enddate date
)
returns bytea AS $$
  library(magick)
  library(ggplot2)

  data &#60- obsdata(stanum, startdate, enddate)

  fig &#60- image_graph(width = 1600, height = 900, res = 96)

  print(ggplot(data, aes(x = idate)) +
   scale_color_manual(values=c("red", "red", "darkorchid4", "darkgreen")) +
   geom_point(aes(y = zxb, color = "zxb")) +
   geom_line(aes(y = zxbb, color = "zxbb")) +
   geom_line(aes(y = ucl, color = "ucl")) +
   geom_line(aes(y = lcl, color = "lcl")) +
   labs(title = sprintf("Standardized Max Temp by Week - %s", staname),
        y = "Z Score", x = "Week") +
   theme_bw(base_size = 15) +
   theme(legend.title=element_blank()) +
   geom_smooth(aes(y = zxb)))

  dev.off()
  image_write(fig, format = "png")
$$ LANGUAGE plr;

DO
$$
 DECLARE   
  stanum text = 'USC00042706';
  staname text = 'EL CAJON, CA US';
  startdate date = '1979-10-01';
  enddate date = '2020-10-01';
  l_lob_id OID;
  r record;
 BEGIN
  for r in
   SELECT
    plr_get_raw(plot_xb(stanum, staname, startdate, enddate)) as img
  LOOP
    l_lob_id:=lo_from_bytea(0,r.img);
    PERFORM lo_export(l_lob_id,'/tmp/plot_xb.png');
    PERFORM lo_unlink(l_lob_id);
  END LOOP;
 END;
$$;

CREATE OR REPLACE FUNCTION plot_r
(
  stanum text,
  staname text,
  startdate date,
  enddate date
)
returns bytea AS $$
  library(magick)
  library(ggplot2)

  data &#60- obsdata(stanum, startdate, enddate)

  fig &#60- image_graph(width = 1600, height = 900, res = 96)

  print(ggplot(data, aes(x = idate)) +
   geom_point(aes(y = zr, color = "zr")) +
   labs(title = sprintf("Standardized Max Temp Range by Week - %s", staname), y = "Z Score", x = "Week") +
   theme_bw(base_size = 15) +
   theme(legend.title=element_blank()) +
   geom_smooth(aes(y = zr)))
  dev.off()
  image_write(fig, format = "png")
$$ LANGUAGE plr;

DO
$$
 DECLARE   
  stanum text = 'USC00042706';
  staname text = 'EL CAJON, CA US';
  startdate date = '1979-10-01';
  enddate date = '2020-10-01';
  l_lob_id OID;
  r record;
 BEGIN
  for r in
   SELECT
    plr_get_raw(plot_r(stanum, staname, startdate, enddate)) as img
  LOOP
    l_lob_id:=lo_from_bytea(0,r.img);
    PERFORM lo_export(l_lob_id,'/tmp/plot_r.png');
    PERFORM lo_unlink(l_lob_id);
  END LOOP;
 END;
$$;
</code></pre><p>Compared to some of the preceding examples, this code is relatively simple. That is in large part thanks to our use of the <code>plr_modules</code> table to auto-load common R code.<p>The resulting images look like this:<p><img alt=plot_xb loading=lazy src=https://f.hubspotusercontent00.net/hubfs/2283855/plot_xb.png><p><img alt=plot_r loading=lazy src=https://f.hubspotusercontent00.net/hubfs/2283855/plot_r.png><h2 id=summary><a href=#summary>Summary</a></h2><p>This episode turned out longer than I  envisioned, but I wanted to be sure to get into enough detail to help you understand the code and the thinking behind it. Hopefully you persevered and are glad that you did so. My aim was to introduce you to some of the ways PostgreSQL and its ecosystem can be useful for Data Science. If you want to try this out for yourself, you can do so using <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a>. I am planning to do several more installments as part of this series, so stay tuned for more! ]]></content:encoded>
<author><![CDATA[ Joe.Conway@crunchydata.com (Joe Conway) ]]></author>
<dc:creator><![CDATA[ Joe Conway ]]></dc:creator>
<guid isPermalink="false">https://blog.crunchydata.com/blog/episode-2-hot-magick-analytics-and-graphical-output-using-plr-with-magick</guid>
<pubDate>Thu, 18 Mar 2021 05:00:00 EDT</pubDate>
<dc:date>2021-03-18T09:00:00.000Z</dc:date>
<atom:updated>2021-03-18T09:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Musings of a PostgreSQL Data Pontiff Episode 1 ]]></title>
<link>https://www.crunchydata.com/blog/musings-of-a-postgresql-data-pontiff</link>
<description><![CDATA[ This is the first in a series of blogs on the topic of using PostgreSQL for "data science". I put that in quotes because I would not consider myself to be a practicing "data scientist", per se. Of course I'm not sure there is a universally accepted definition of "data scientist". This article provides a nice illustration of my point. ]]></description>
<content:encoded><![CDATA[ <h2 id=introduction-to-a-postgresql-data-science-blog-series><a href=#introduction-to-a-postgresql-data-science-blog-series>Introduction to a PostgreSQL "Data Science" Blog Series</a></h2><p>This is the first in a series of blogs on the topic of using PostgreSQL for "data science". I put that in quotes because I would not consider myself to be a practicing data scientist, per se. Of course I'm not sure there is a universally accepted definition of data scientist. This <a href=https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html>article</a> provides a nice illustration of my point.<p>I do believe my credentials are such that no one can accuse me of term appropriation. Toward establishment of that end, this first installment is a walk down memory lane.<p>Sometime around the end of the 1990's I had a boss (hi Roger!) that dubbed me "The Data Pontiff". I always liked that moniker because data collection, storage, and analysis has always been my thing.<p>In fact, ever since grammar school I have been interested in math and science. In high school and college I additionally became fascinated with computers. My college degree was Mechanical Engineering, and I still remember in the early 1980's one of my projects involving the use of a Commodore 64 to do tensor calculations (force and stress) of a crane at incremental positions as it loaded and unloaded a ship. I thought that was the coolest thing since sliced bread at the time.<p>During the latter half of the 1980's while I was in the Navy, one of my first duty stations was that of Diving Officer. Among other things, it was my responsibility to calculate the required water level for various tanks such that when the submarine submerged, it would stay on the prescribed depth with neutral buoyancy and a zero "bubble". That is, stay on depth with zero angle. 
The standard method for doing this would get you a reasonable approximation, but I always wanted to get as close to perfect as I could. So I would take other available information into account, such as the water temperature and salinity gradients versus depth, if they were known (i.e. we had recently been down already, or had approximations based on historical data).<p>When I landed my first civilian job, I continued, and doubled down on, my thirst for data and analysis. That company produced components for commercial nuclear reactors. As you might imagine, the requirements for testing and measuring were stringent, and a significant amount of data was collected to prove compliance within the specified tolerances. However, when I initially arrived, almost nothing was done with all that data beyond proving compliance. I set out over the course of a few years to change that, ensuring the data was retained in accessible form, and analysis was done which was used to iteratively improve the manufacturing processes. That paid off since it saved us money in scrapped components and allowed us to win contracts with newly tightened specs that our competitors could not meet. The tighter specs allowed greater operating margin for our customers and saved them expensive uranium, so it was a win all around.<p>It was the late 1990's at my second civilian job where I finally earned my title of "The Data Pontiff". That company produced very expensive, large, complex, and cutting-edge industrial excimer lasers used by semiconductor manufacturers to make integrated circuits (a.k.a. chips). These lasers track a large number of configuration parameters and operational metrics. But for all their sophistication, the data was essentially restricted to an on-board snapshot. I started two important initiatives there. One was to comprehensively store test and measurement data collected during manufacturing (see a trend here?). 
We called that <abbr>POD</abbr>, for <dfn>Parametric Online Data</dfn>.<p>The second project (which actually came first chronologically) involved attaching a device to the RS-232 diagnostic port of the lasers and periodically polling and storing the snapshots centrally. That project was called COL. The result was comprehensive information about each laser frame, and excellent aggregate data about the entire fleet. Our more advanced users of this data were able to create predictive models for when major components were failing or nearing end of life. In the semiconductor industry, downtime is measured in millions of dollars per hour, so coordinating maintenance in a predictable and scheduled way was a huge benefit. As was reducing downtime by having historical diagnostics to consult when things went awry. The aggregate data was useful for our executive team to keep the pulse of the entire industry. Our lasers made up something like 75% of the installed base in the free world at the time, and with the aggregate data we collected we could see in almost real time when the industry was ramping up or ramping down.<p>Finally, this data allowed the creation of an entirely new business model where the lasers were essentially leased and charged based on usage. You might think of it like <dfn>Excimer Laser as a Service</dfn> (<abbr>ELaaS</abbr>). By the way, the data underpinning all of this was stored in PostgreSQL, and as of about a year ago I was told that POD was still in service!<p>Sometime around 2003 I wrote PL/R, which is a procedural language handler for PostgreSQL that allows the use of R functions directly from PostgreSQL. It is essentially the same as PL/Python or PL/Perl in that the R interpreter gets fired up directly inside the PostgreSQL backend process associated with your database connection. As such, the embedded R function has direct access to data stored in tables and can call SQL statements making use of any other functions as well. 
PL/R was initially written specifically because I wanted to be able to use it to analyze data stored in POD and COL.<p>Anyway, there is much more to the story of each of those experiences but I have already risked boring you with my tales. In the years since I left civilian job #2, I have mainly focused on helping others use PostgreSQL in the most productive and secure way possible. But I have also tried to keep up on the side with trendy forms of data analysis including various statistical methods, machine learning, AI, etc. My goal in sharing all of the above is to illustrate some examples of using data and analysis to produce real world positive results. For me, that was always the allure.<p>In this blog series I hope to explore the possibilities for analysis presented by PostgreSQL through procedural languages such as PL/R and PL/Python as well as perhaps built-in capabilities of PostgreSQL itself. I hope you will find them as useful to read as I find them fun to write! ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Joe.Conway@crunchydata.com (Joe Conway) ]]></author>
<dc:creator><![CDATA[ Joe Conway ]]></dc:creator>
<guid isPermalink="false">https://blog.crunchydata.com/blog/musings-of-a-postgresql-data-pontiff</guid>
<pubDate>Thu, 18 Mar 2021 05:00:00 EDT</pubDate>
<dc:date>2021-03-18T09:00:00.000Z</dc:date>
<atom:updated>2021-03-18T09:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Deep PostgreSQL Thoughts: Resistance to Containers is Futile ]]></title>
<link>https://www.crunchydata.com/blog/deep-postgresql-thoughts-resistance-to-containers-is-futile</link>
<description><![CDATA[ Recently I ran across grand sweeping statements that suggest containers are not ready for prime time as a vehicle for deploying your databases. The definition of "futile" is something like "serving no useful purpose; completely ineffective". See why I say this below, but in short, you probably are already, for all intents and purposes, running your database in a "container". Therefore, your resistance is futile. ]]></description>
<content:encoded><![CDATA[ <p>Recently I ran across grand sweeping statements that suggest containers are not ready for prime time as a vehicle for deploying your databases. The definition of "futile" is something like "serving no useful purpose; completely ineffective". See why I say this below, but in short, you probably are already, for all intents and purposes, running your database in a "container". Therefore, your resistance is futile.<p>And I'm here to tell you that, at least in so far as PostgreSQL is concerned, those sweeping statements are patently false. At Crunchy Data we have many customers that are very successfully running large numbers of PostgreSQL clusters in containers. Examples of our success in this area can be found with <a href=https://www.crunchydata.com/case-studies/ibm>IBM</a> and <a href=https://www.crunchydata.com/case-studies/sas>SAS</a>.<p>However, just as you better have a special license and skills if you want to drive an 18 wheeler down the highway at 70 MPH, you must ensure that you have the skills and knowledge (either yourself or on your team) to properly operate your infrastructure, whether it be on-prem or in the cloud. This has always been true, but the requisite knowledge and skills have changed a bit.<h2 id=what-is-a-container><a href=#what-is-a-container>What is a Container?</a></h2><p>Let's start by reviewing exactly what a container is, and what it is not. According to someone who ought to know, Jérôme Petazzoni (formerly of Docker fame), containers are made of "namespaces, cgroups, and a little bit of copy-on-write storage". Here is a slightly dated (in particular, it is cgroup v1 specific) but still very good <a href=https://youtu.be/sK5i-N34im8>video</a> in which Jérôme explains the details. 
Among other quotes from that talk, there is this gem:<blockquote><p>There is this high level approach where we say, well a container is a little bit like a lightweight virtual machine, and then we also say, well but a container is not a lightweight virtual machine, stop thinking that because that puts you in the wrong mindset...</blockquote><p>That statement is important because it implies that the degree of "virtualization" of containers is actually <em>less</em> than that of VMs, which of course are completely virtualized environments.<p>The processes in a container are running directly under the auspices of the host kernel, in particular cgroups and with their own namespaces. The cgroups provide accounting and control of the use of host resources, and the namespaces provide a perceived degree of isolation, but the abstraction is much more transparent than that of a virtual machine.<p>In fact, to tie back to my "resistance is futile" statement above, on modern Linux systems <strong>everything</strong> is running under cgroups and namespaces, even if not running in what you think of as a "container".<p>For example, on a recently provisioned RHEL 8 machine running PostgreSQL I see the following:<pre><code class=language-shell>$ sudo -i
# ls -la /sys/fs/cgroup/*/system.slice/postgresql-12.service/tasks
-rw-r--r--. 1 root root 0 Jan 29 23:58 /sys/fs/cgroup/blkio/system.slice/postgresql-12.service/tasks
-rw-r--r--. 1 root root 0 Feb  1 17:41 /sys/fs/cgroup/devices/system.slice/postgresql-12.service/tasks
-rw-r--r--. 1 root root 0 Feb  1 13:52 /sys/fs/cgroup/memory/system.slice/postgresql-12.service/tasks
-rw-r--r--. 1 root root 0 Feb  1 17:41 /sys/fs/cgroup/pids/system.slice/postgresql-12.service/tasks
-rw-r--r--. 1 root root 0 Feb  1 17:41 /sys/fs/cgroup/systemd/system.slice/postgresql-12.service/tasks

# cat /sys/fs/cgroup/memory/system.slice/postgresql-12.service/tasks
6827
6829
6831
6832
6833
6834
6835
6836

# ps -fu postgres
UID          PID    PPID  C STIME TTY          TIME CMD
postgres    6827       1  0 Jan29 ?        00:00:02 /usr/pgsql-12/bin/postgres -D /var/lib/pgsql/12/data/
postgres    6829    6827  0 Jan29 ?        00:00:00 postgres: logger
postgres    6831    6827  0 Jan29 ?        00:00:00 postgres: checkpointer
postgres    6832    6827  0 Jan29 ?        00:00:02 postgres: background writer
postgres    6833    6827  0 Jan29 ?        00:00:02 postgres: walwriter
postgres    6834    6827  0 Jan29 ?        00:00:01 postgres: autovacuum launcher
postgres    6835    6827  0 Jan29 ?        00:00:02 postgres: stats collector
postgres    6836    6827  0 Jan29 ?        00:00:00 postgres: logical replication launcher
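
# (Aside) The memory cgroup above can also enforce a hard limit; to check:
# cat /sys/fs/cgroup/memory/system.slice/postgresql-12.service/memory.limit_in_bytes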

</code></pre><p>This is not "PostgreSQL running in a container", yet PostgreSQL is nonetheless running in several cgroups. Further:<pre><code class=language-shell># ll /proc/6827/ns/
total 0
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 net -> 'net:[4026531992]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 postgres postgres 0 Feb  1 17:45 uts -> 'uts:[4026531838]'

# lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026531835 cgroup     95     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026531836 pid        95     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026531837 user       95     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026531838 uts        95     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026531839 ipc        95     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026531840 mnt        89     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026531860 mnt         1    15 root            kdevtmpfs
4026531992 net        95     1 root            /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026532216 mnt         1   888 root            /usr/lib/systemd/systemd-udevd
4026532217 mnt         1   891 root            /sbin/auditd
4026532218 mnt         1   946 chrony          /usr/sbin/chronyd
4026532219 mnt         1  1015 root            /usr/sbin/NetworkManager --no-daemon
4026532287 mnt         1  1256 systemd-resolve /usr/lib/systemd/systemd-resolved
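
# Comparing a postgres backend's namespace links with PID 1's shows the
# same IDs, i.e. postgres shares the host's namespaces:
# readlink /proc/1/ns/mnt /proc/6827/ns/mnt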
</code></pre><p>From this we can see that the PostgreSQL processes are also running in several namespaces, again despite the "fact" that PostgreSQL is "not running in a container".<p>So hopefully you see that any statement insinuating that you should not run PostgreSQL "in a container" flies in the face of reality.<h2 id=considerations-when-using-containers><a href=#considerations-when-using-containers>Considerations When Using Containers</a></h2><p>In my experience, the key issues you run into when running something like PostgreSQL in containers can be generally categorized as:<ol><li>OOM killer<li>Storage<li>Restarts and "in motion"<li>Custom Deployments</ol><p>As alluded to above, none of these issues are unique to containers, although they may be exacerbated by the expectations many organizations have about how the universe works once they switch to containers.<p>As an old database-curmudgeon(™) myself, I clearly remember the day when a small number of monolithic databases served up most of the critical data held by an organization. The hardware was expensive, and the teams of people catering to these systems were even more expensive. Careful thought, planning, testing, and processes were applied to deployment of the hardware and databases. Failing over in a crisis was an "all hands on deck" and very manual evolution. So was disaster recovery from backup.<p>Today the expectation is to "automate all the things". A relatively inexperienced application developer should be able to go to some kind of portal, push an "easy button", and have themselves a database complete with automatic failover, healing, monitoring, and backups, with disaster recovery not too many steps away.<p>Containerization and container orchestration have gone a long way toward making that expectation possible, and Crunchy Data has brought together considerable expertise in PostgreSQL, containers, Kubernetes, and Kubernetes Operators in order to make it a reality. 
But the existence of opinionated automation does not mean that your organization can abdicate all responsibility. These are very complex distributed systems, and they deserve well trained and experienced people to watch over them. In other words, your team still needs to know what they are doing if you want all this automation to be reliable.<p>Without further ado, let's address these issue categories one at a time.<h3 id=oom-killer><a href=#oom-killer>OOM killer</a></h3><p>The OOM killer is nothing new -- it has been an issue for PostgreSQL users to worry about for at least 17 years now<a href=https://www.postgresql.org/message-id/flat/23593.1055363062%40sss.pgh.pa.us#db8d33f6dab42d5842f4ac8b53aa50aa>1</a>. However, there are some modern considerations to be aware of. Specifically, when operating in a container, it is common to set cgroup memory controller limits. This could also apply when running on bare metal if such limits were set, but under containers it is much more common for that to be the case. Overall this is a very complex topic and deserves its own blog post: please see my previous post, <a href=/blog/deep-postgresql-thoughts-the-linux-assassin>Deep PostgreSQL Thoughts: The Linux Assassin.</a><h3 id=storage><a href=#storage>Storage</a></h3><p>Storage issues are also not new and not container specific. Yes, pretty much all containerized environments run on network attached storage, but so do VMs and many bare metal installations. The issues with storage are typically related to being network attached, not to being "in a container".<p>A big missing piece in this brave new world is proper testing. Referring back to the days when databases were huge monolithic things attended by groups of people, deploying a new database on new hardware typically involved significant end-to-end testing. Like literally pulling the plug on the power while writing database records under heavy load. 
Or yanking the Fibre Channel connection between the server hardware and the storage array under similar conditions. These kinds of tests would find weak links in the chain between PostgreSQL and the spinning disks of rust used for persistent storage. If everything was properly configured the tests would yield a database that recovered perfectly. On the other hand, if any layer was lying about getting the data stored persistently, the database would be corrupted reliably enough that the configuration errors would be spotted and fixed prior to going into production.<p>Today's containerized environments have more layers that need to be tested and properly configured. But the fundamental issue is no different.<h3 id=restarts-and-in-motion><a href=#restarts-and-in-motion>Restarts and "in motion"</a></h3><p>Restarts and "in motion" issues are usually related to container orchestration layers, not the containers themselves. Avoiding these types of issues comes down to "knowing what you are doing" with Kubernetes or whatever you are using. And to some extent the same issues exist with VMs when they are being orchestrated. It is possible to avoid these issues if you so choose.<h3 id=custom-deployments><a href=#custom-deployments>Custom Deployments</a></h3><p>As mentioned above, many organizations seem to have an implicit assumption that, once they switch to containers, the move should come with an "easy button" that is nonetheless customizable exactly to their needs. They take a carefully crafted distributed system and overlay their own changes. Then when they have operational or upgrade troubles, they wonder why it is hard to diagnose and fix. The situation reminds me of a commonly used adage among the PostgreSQL community when someone is doing something that is generally not recommended and/or unsupported: "You break it, you get to keep both halves." 
With paying customers we don't usually get to take quite such a hard line, but this is a common pain point, and we continue to add flexibility to our solution in order to mitigate the pain.<h2 id=summary><a href=#summary>Summary</a></h2><p>The world of computing is inexorably moving toward automating everything and distributing all the bits in containers. Don't fear it, embrace it. But make sure your team is up to the task, and partner with a good bodyguard -- like Crunchy Data -- to ensure reliability and success. ]]></content:encoded>
<category><![CDATA[ Kubernetes ]]></category>
<author><![CDATA[ Joe.Conway@crunchydata.com (Joe Conway) ]]></author>
<dc:creator><![CDATA[ Joe Conway ]]></dc:creator>
<guid isPermalink="false">https://blog.crunchydata.com/blog/deep-postgresql-thoughts-resistance-to-containers-is-futile</guid>
<pubDate>Thu, 18 Feb 2021 04:00:00 EST</pubDate>
<dc:date>2021-02-18T09:00:00.000Z</dc:date>
<atom:updated>2021-02-18T09:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Deep PostgreSQL Thoughts: The Linux Assassin ]]></title>
<link>https://www.crunchydata.com/blog/deep-postgresql-thoughts-the-linux-assassin</link>
<description><![CDATA[ When Linux detects that the system is using too much memory, it will identify processes for termination and, well, assassinate them. The OOM killer has a noble role in ensuring a system does not run out of memory, but this can lead to unintended consequences. ]]></description>
<content:encoded><![CDATA[ <p>If you run Linux in production for any significant amount of time, you have likely run into the "Linux Assassin", that is, the <abbr>OOM</abbr> (<dfn>out-of-memory</dfn>) killer. When Linux detects that the system is using too much memory, it will identify processes for termination and, well, assassinate them. The OOM killer has a noble role in ensuring a system does not run out of memory, but this can lead to unintended consequences.<p>For years the PostgreSQL community has made <a href=https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT>recommendations</a> on how to set up Linux systems to keep the Linux Assassin away from PostgreSQL processes, which I will describe below. These recommendations carried forward from bare metal machines to virtual machines, but what about containers and Kubernetes?<p>Below is an explanation of experiments and observations I've made on how the Linux Assassin works in conjunction with containers and Kubernetes, and methods to keep it away from PostgreSQL clusters in your environment.<h2 id=community-guidance><a href=#community-guidance>Community Guidance</a></h2><p>The first PostgreSQL community mailing list thread on the topic is <a href=https://www.postgresql.org/message-id/flat/23593.1055363062%40sss.pgh.pa.us#db8d33f6dab42d5842f4ac8b53aa50aa>circa 2003</a>, and the first <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=doc/src/sgml/runtime.sgml;h=02befb8480c572a8ff6fb6618fc453b268fe5b11;hp=27feffde8b620cc6c01eab7d98aa0309b01b4871;hb=8d2d92c5f0169190e4963fb541340bc8c630b02f;hpb=b4117d8b1b426b9f033a97af328ffffd0ba418d1">commit</a> is right about the same time. The exact method suggested to skirt the Linux OOM Killer has changed slightly since that time, but it was, and currently still is, to <a href=https://www.kernel.org/doc/Documentation/vm/overcommit-accounting>avoid memory overcommit</a>, i.e. 
in recent years by setting <code>vm.overcommit_memory=2</code>.<p>Avoidance of memory overcommit means that when a PostgreSQL backend process requests memory and the request cannot be met, the kernel returns an error, which PostgreSQL handles appropriately. Therefore, although the offending client then receives an error from PostgreSQL, importantly the client connection is not killed, nor are any other PostgreSQL child processes (see below).<p>In addition, or when that is not possible, the <a href=https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT>guidance specifies</a> changing <code>oom_score_adj=-1000</code> for the parent "postmaster" process via the privileged startup mechanism (e.g. service script or systemd unit file), and making <code>oom_score_adj=0</code> for all child processes via two environment variables that are read during child process startup. This ensures that, should the OOM killer need to reap one or more processes, the postmaster <a href=https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/oom.h#L9>will be protected</a>, and the most likely candidate to get killed will be a client backend. That way the damage can be minimized.<h2 id=host-level-oom-killer-mechanics><a href=#host-level-oom-killer-mechanics>Host level OOM killer mechanics</a></h2><p>It is worth a small detour to cover the OOM Killer in a bit more detail in order to understand what <code>oom_score_adj</code> does. However, the true details are complex, with a long sordid history (certainly not all-inclusive, but for a nice summary of articles on the OOM killer see <a href=https://lwn.net/Kernel/Index/#OOM_killer>LWN</a>), so this description is still very superficial.<p>At the host OS level, when the system becomes too short of memory, the OOM killer kicks in. In a nutshell, it will determine which process has the highest value for <code>oom_score</code>, and kill it with a SIGKILL signal. 
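<p>These scores are not hidden kernel state; they can be read for any running process under <code>/proc</code>. A minimal read-only sketch (on a real system you would substitute the postmaster's PID for <code>self</code>):

```shell
# Read the OOM killer inputs for the current process.
# oom_score is the final value the OOM killer compares across processes;
# oom_score_adj is the administrator-supplied adjustment (-1000 to 1000).
cat /proc/self/oom_score
cat /proc/self/oom_score_adj
```

<p>A process whose <code>oom_score_adj</code> reads -1000 will never be reaped, which is exactly what the community guidance above arranges for the postmaster.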
The value of <code>oom_score</code> for a process is essentially "percentage of host memory consumed by this process" times 10 (let's call that "memory score"), plus <code>oom_score_adj</code>.<p>The value of <code>oom_score_adj</code> may be set to any value in the range -1000 to +1000, inclusive. As mentioned above, note that <code>oom_score_adj=-1000</code> is a <em>magic</em> value in that the OOM killer will never reap a process with this setting.<p>Combining these two bits of kernel trivia results in the value of <code>oom_score</code> ranging from 0 to 2000. For example, a process with <code>oom_score_adj=-998</code> that uses 100% of host memory (i.e. a "memory score" of 1000) has an <code>oom_score</code> equal to 2 (1000 + -998), and a process with <code>oom_score_adj=500</code> that uses 50% of host memory (i.e. a "memory score" of 500) has an <code>oom_score</code> equal to 1000 (500 + 500). Obviously this means that a process consuming a large portion of system memory with a high <code>oom_score_adj</code> is at or near the top of the list for the OOM killer.<h2 id=cgroup-level-oom-killer-mechanics><a href=#cgroup-level-oom-killer-mechanics>CGroup Level OOM Killer Mechanics</a></h2><p>The OOM killer works pretty much the same at the CGroup level, except for a couple of small but important differences.<p>First of all, the OOM killer is triggered when the sum of memory consumed by the cgroup processes exceeds the assigned cgroup memory limit. While running a shell in a container, the former can be read from <code>/sys/fs/cgroup/memory/memory.usage_in_bytes</code> and the latter from <code>/sys/fs/cgroup/memory/memory.limit_in_bytes</code>.<p>Secondly, only processes within the offending cgroup are targeted. 
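<p>At either level, the comparison rests on the same arithmetic. It can be sketched as a tiny shell function -- a deliberate simplification, since the kernel's real calculation has more inputs -- that reproduces the worked examples above:

```shell
# oom_score is roughly (percent of memory consumed * 10) + oom_score_adj.
# $1 = percent of host memory used, $2 = oom_score_adj
oom_score() { echo $(( $1 * 10 + $2 )); }

oom_score 100 -998   # whole-host consumer with a protective adjust: prints 2
oom_score 50 500     # half of memory plus a high adjust: prints 1000
```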
But the cgroup process with the highest <code>oom_score</code> is still the first one to go.<h2 id=why-oom-killer-avoidance-is-important-for-postgresql><a href=#why-oom-killer-avoidance-is-important-for-postgresql>Why OOM killer avoidance is important for PostgreSQL</a></h2><p>Some of the reasons for this emphasis on OOM Killer avoidance are:<ul><li>Lost committed transactions: if the postmaster (or, in HA setups, the controlling Patroni process) is killed, and replication is asynchronous (which is usually the case), transactions that have been committed on the primary database may be lost entirely when the database cluster fails over to a replica.<li>Lost active connections: if a client backend process is killed, the postmaster assumes shared memory may have been corrupted, and as a result it kills all active database connections and goes into crash recovery (rolling forward through transaction logs since the last checkpoint).<li>Lost inflight transactions: when client backend processes are killed, transactions that have been started but not committed will be lost entirely. At that point the client application is the only source for the inflight data.<li>Downtime: A PostgreSQL cluster has only a single writable primary node. If it goes down, at least some application downtime is incurred.<li>Reset statistics: the crash recovery process causes collected statistics to be reset (i.e. zeroed out). This affects maintenance operations such as autovacuum and autoanalyze, which in turn will cause performance degradation, or, in severe cases, outages (e.g. due to out of disk space). 
It also affects the integrity of monitoring data collected on PostgreSQL, potentially causing lost alerts.</ul><p>Undoubtedly there are others neglected here.<h2 id=issues-related-to-kubernetes><a href=#issues-related-to-kubernetes>Issues related to Kubernetes</a></h2><p>There are several problems related to the OOM killer when PostgreSQL is run under Kubernetes that are noteworthy:<h3 id=overcommit><a href=#overcommit>Overcommit</a></h3><p>Kubernetes actively sets <code>vm.overcommit_memory=1</code>. This leads to promiscuous overcommit behavior and is in direct contrast with PostgreSQL best practice. It greatly increases the probability that OOM Killer reaping will be necessary.<h3 id=cgroup-oom-behavior><a href=#cgroup-oom-behavior>cgroup OOM behavior</a></h3><p>Even worse, an OOM kill can happen even when the host node does not have any memory pressure. When the memory usage of a cgroup (pod) exceeds its memory limit, the OOM killer will reap one or more processes in the cgroup.<h3 id=oom-score-adjust><a href=#oom-score-adjust>OOM Score adjust</a></h3><p>oom_score_adj values are almost completely out of the control of the PostgreSQL pods, preventing any attempt at following the long-established best practices described above. I have created an <a href=https://github.com/kubernetes/kubernetes/issues/90973>issue on the Kubernetes GitHub</a> for this, but unfortunately it has not gotten much traction.<h3 id=swap><a href=#swap>Swap</a></h3><p>Kubernetes defaults to enforcing that swap be disabled. This is directly in opposition to the recommendation of Linux kernel developers. For example, see Chris Down's <a href=https://chrisdown.name/2018/01/02/in-defence-of-swap.html>excellent blog on why swap should not be disabled</a>. In particular, I have observed dysfunctional behaviors in memory-constrained cgroups when switching from I/O dominant workloads to anonymous memory intensive ones. 
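<p>Both of these node-level settings are easy to verify on a given host with a read-only check (the paths are standard Linux procfs locations):

```shell
# Overcommit policy: Kubernetes sets this to 1 (promiscuous overcommit);
# the long-standing PostgreSQL recommendation is 2.
cat /proc/sys/vm/overcommit_memory

# Swap status: if /proc/swaps shows only its header line, swap is
# disabled, as Kubernetes has historically required.
cat /proc/swaps
```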
Evidence of other folks who have run into this issue can be seen in this <a href=https://storpool.com/blog/the-weird-interactions-of-cgroups-and-linux-page-cache-in-hypervisor-environments>article discussing the need for swap</a>:<p><em>"There is also a known issue with memory cgroups, buffer cache and the OOM killer. If you don’t use cgroups and you’re short on memory, the kernel is able to start flushing dirty and clean cache, reclaim some of that memory and give it to whoever needs it. In the case of cgroups, for some reason, there is no such reclaim logic for the clean cache, and the kernel prefers to trigger the OOM killer, who then gets rid of some useful process."</em><p>There is also an <a href=https://github.com/kubernetes/kubernetes/issues/53533>issue on the Kubernetes GitHub</a> for this problem, which is still being debated more than three years later.<h2 id=kubernetes-qos-and-side-effects><a href=#kubernetes-qos-and-side-effects>Kubernetes QoS and Side Effects</a></h2><p>Kubernetes defines three <dfn>Quality of Service</dfn> (<abbr>QoS</abbr>) levels. They impact more than just OOM killer behavior, but for the purposes of this post only the OOM killer behavior will be addressed. The levels are:<ul><li>Guaranteed: the memory limit and request are both set and equal for all containers in the pod.<li>Burstable: no memory limit, but with a memory request for all containers in the pod.<li>Best Effort: everything else.</ul><p>With a Guaranteed QoS pod the values for <code>oom_score_adj</code> are almost as desired; PostgreSQL might not be targeted in a host memory pressure scenario. But the cgroup "kill if memory limit exceeded" behavior is undesirable. 
Relevant characteristics are as follows:<ul><li><code>oom_score_adj=-998</code>: this is good, but not the recommended -1000 (OOM killer disabled).<li>The documented environment variables are able to successfully reset <code>oom_score_adj=0</code> for the postmaster children, which is also good.</ul><p>With a Burstable QoS pod, <code>oom_score_adj</code> values are set very high, and with surprising semantics (smaller requested memory leads to higher <code>oom_score_adj</code>). This makes PostgreSQL a prime target if/when the host node is under memory pressure. If the host node had <code>vm.overcommit_memory=2</code>, this situation would be tolerable because OOM kills would be unlikely if not impossible. However, as noted above, Kubernetes recommends/sets <code>vm.overcommit_memory=1</code>. Relevant characteristics are as follows:<ul><li>The cgroup memory constraint OOM killer behavior does not apply -- this is good<li><code>oom_score_adj=(1000 - 10 * (percent avail mem requested))</code> (this is a slight simplification -- there is also <a href=https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior>an enforced minimum value of 2, and maximum value of 999</a>): this leads to a very small pod getting a higher score adjustment than a very large one. E.g. a pod requesting 1% of available memory will get <code>oom_score_adj=990</code> while one requesting 50% of available memory will get <code>oom_score_adj=500</code>. This in turn means that if the smaller pod is idle, using essentially no resources, it might, for example, have <code>oom_score=(0.1*10)+990=991</code> while the larger pod might be using 40% of system memory and get <code>oom_score=(40*10)+500=900</code>.</ul><h2 id=desired-behavior><a href=#desired-behavior>Desired behavior</a></h2><ul><li>The ideal solution would be if the kernel would provide a mechanism to allow equivalent behavior to <code>vm.overcommit_memory=2</code>, except acting at the cgroup level. 
In other words, allow a process making an excessive memory request within a cgroup to receive an "out of memory" error instead of using the OOM Killer to enforce the constraint. This would be the ideal solution because most users seem to want Guaranteed QoS pods, but currently the memory limit enforcement via OOM killer is a problem.<li>Another desired change is for Kubernetes to provide a mechanism to allow certain pods (with suitable RBAC controls on which ones) to override the <code>oom_score_adj</code> values that are currently set based on QoS heuristics. This would allow PostgreSQL pods to actively set <code>oom_score_adj</code> to recommended values. Hence the PostgreSQL postmaster process could have the recommended <code>oom_score_adj=-1000</code>, the PostgreSQL child processes could be set to <code>oom_score_adj=0</code>, and Burstable QoS pods would be a more reasonable alternative.<li>Finally, running Kubernetes with swap enabled should not be such a no-no. It took some digging, and I have not personally tested it, but a <a href=https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138>workaround</a> is mentioned in the very long GitHub issue discussed earlier.</ul><h2 id=impact-and-mitigation><a href=#impact-and-mitigation>Impact and mitigation</a></h2><p>In typical production scenarios the OOM killer semantics described above may never be an issue. 
Essentially, if your pods are sized well, hopefully based on testing and experience, and you do not allow execution of arbitrary SQL, the OOM killer will probably never strike.<p>On development systems, OOM killer action might be more likely to occur, but probably not so often as to be a real problem.<p>However, if the OOM killer has caused distress or consternation in your environment, here are some suggested workarounds.<h3 id=option-1><a href=#option-1>Option 1</a></h3><ul><li>Ensure your pod is Guaranteed QoS (memory limit and memory request sizes set the same).<li>Monitor cgroup memory usage and alert on a fairly conservative threshold, e.g. 50% of the memory limit setting.<li>Monitor and alert on OOM Killer events.<li>Adjust memory limit/request for the actual maximum memory use based on production experience.</ul><h3 id=option-2><a href=#option-2>Option 2</a></h3><ul><li>Ensure your pod is Burstable QoS (with a memory request, but without a memory limit).<li>Monitor Kubernetes host memory usage and alert on a fairly conservative threshold, e.g. 50% of physical memory.<li>Monitor and alert on OOM Killer events.<li>Adjust Kubernetes host settings to ensure OOM killer is never invoked.</ul><h3 id=option-3><a href=#option-3>Option 3</a></h3><ul><li>Accept the fact that some OOM Killer events will occur. 
Monitoring history will inform the statistical likelihood and expected frequency of occurrence.<li>Ensure your application is prepared to retry transactions for lost connections.<li>Run a <a href=https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes>High Availability cluster.</a><li>Depending on actual workload and usage patterns, the OOM killer event frequency may be equal or nearly equal to zero.</ul><h3 id=future-work><a href=#future-work>Future work</a></h3><p><a href=https://www.crunchydata.com/>Crunchy Data</a> is actively working with the PostgreSQL, Kubernetes, and Linux Kernel communities to improve the OOM killer behavior. Some possible longer-term solutions include:<ul><li>Linux kernel: cgroup level <code>overcommit_memory</code> control<li>Kubernetes: <code>oom_score_adj</code> override control, swap enablement normalized<li>Crunchy: Explore possible benefits from using cgroup v2 under kube 1.19+</ul><h2 id=summary><a href=#summary>Summary</a></h2><p>The dreaded Linux Assassin has been around for many years and shows no signs of retiring soon. But you can avoid being targeted through careful planning, configuration, monitoring, and alerting. The world of containers and Kubernetes brings new challenges, but the requirements for diligent system administration remain very much the same. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ Joe.Conway@crunchydata.com (Joe Conway) ]]></author>
<dc:creator><![CDATA[ Joe Conway ]]></dc:creator>
<guid isPermalink="false">https://blog.crunchydata.com/blog/deep-postgresql-thoughts-the-linux-assassin</guid>
<pubDate>Tue, 09 Feb 2021 04:00:00 EST</pubDate>
<dc:date>2021-02-09T09:00:00.000Z</dc:date>
<atom:updated>2021-02-09T09:00:00.000Z</atom:updated></item></channel></rss>