A new era for Postgres in Kubernetes has just begun. Version 1.21 of CloudNativePG introduces declarative support for Kubernetes’ standard API for Volume Snapshots, enabling, among other things, incremental and differential copy for both backup and recovery operations to improve RPO and RTO.
The benchmarks on EKS that I present in this article highlight that backup and - most importantly - recovery times of Very Large Databases (VLDB) are now reduced by a few orders of magnitude when compared with the existing object store counterpart. As an example, I was able to fully recover a 4.5 TB Postgres database from a volume snapshot in 2 minutes. This is just the beginning, as we are planning to natively support more of the features Kubernetes provides at the storage level.
About Kubernetes volume snapshots
First of all, volume snapshots have been around for many years. When I was a maintainer and developer of Barman, a popular backup and recovery open source tool for Postgres, we regularly received requests from customers to integrate it with their snapshot-capable storage solutions. The major blocker was the lack of a standard interface through which we could control the snapshotting capabilities of the storage.
Kubernetes fixed this. In December 2020, Kubernetes 1.20 promoted volume snapshotting to general availability, enriching the API with the VolumeSnapshot, VolumeSnapshotContent, and VolumeSnapshotClass custom resource definitions. Volume snapshotting is now in every supported Kubernetes version, providing a generic and standard interface for:
- Creating a new volume snapshot from a PVC
- Creating a new volume from a volume snapshot
- Deleting an existing snapshot
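For reference, this is roughly what the standard API looks like. The following is a minimal sketch using illustrative names (a PVC called pgdata, a snapshot class called my-snapclass, and a storage class called my-storage-class are all assumptions, not part of any specific environment):

```yaml
# Create a snapshot of an existing PVC (names are illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pgdata-snap
spec:
  volumeSnapshotClassName: my-snapclass
  source:
    persistentVolumeClaimName: pgdata
---
# Create a new volume (PVC) starting from that snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata-restored
spec:
  storageClassName: my-storage-class
  dataSource:
    name: pgdata-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Deleting the snapshot is just a matter of deleting the VolumeSnapshot resource; the CSI driver takes care of the rest.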
The implementation is delegated to the underlying CSI drivers, and storage classes can offer a variety of capabilities depending on the storage: these might include incremental block-level copy, differential block-level copy, replication to a secondary or n-ary location in another region, and so on.
The main advantage is that the interface abstracts the complexity and the management of storage from the application, in our case a Postgres workload. From a database perspective, incremental and differential backup and recovery are the most desired features that volume snapshotting brings along.
All major cloud providers have CSI drivers and storage classes supporting volume snapshots (for details, see GKE, EKS, or AKS). On premises, you can use OpenShift Data Foundation (ODF) and LVM with Red Hat, Longhorn with Rancher, and Portworx by Pure Storage - just to cite a few. You can find a detailed list of available drivers in the official documentation of the Kubernetes Container Storage Interface (CSI) project.
Before CloudNativePG 1.21
Prior to version 1.21, CloudNativePG supported backup and recovery only on object stores.
Object stores are very practical in many contexts, especially in cloud environments and with small/medium size databases - I’d say below 500GB, although it really depends on several factors and there’s no clear demarcation. One of those factors is the time it takes to back up a database and store it in an object store; the most important one - at least for the scope of this article - is the time to restore from a backup that’s safely secured in an object store: this metric represents the Recovery Time Objective in your Business Continuity Plan for that specific database. My advice is to measure both times, and then decide whether they are acceptable. Based on my tests and past experience, for a 45GB database, backup time might be in the order of 60-100 minutes, while recovery time falls in the 30-60 minute range (these might change for the better or worse depending on the actual object store technology underneath). Without incremental and/or differential copy, these times increase linearly with database size, proving inadequate for VLDB use cases.
For this reason, following some conversations with members of the CNCF TAG Storage at KubeCon Europe 2023 in Amsterdam, in April 2023 we decided to introduce imperative support for backup and recovery with Kubernetes volume snapshots, through the cnpg plugin for kubectl (CloudNativePG 1.20). This allowed us to quickly prototype the feature, and then enrich it with a declarative API.
Disclaimer: for the sake of honesty, other Postgres operators for Kubernetes provided information on how to use volume snapshots for both backup and recovery. Given that these instructions are imperative, and that our operator is built with a fully declarative model, I don’t cover them in this article. Other operators rely on Postgres level backup tools for incremental backup/recovery; having conceived Barman many years ago, we could have gone down that path, but our vision is to rely on the Kubernetes way of doing incremental backup/recovery in order to facilitate the integration of Postgres in that ecosystem. Nonetheless, my advice is that you evaluate those alternative solutions and compare CloudNativePG with all available operators for Postgres before you make your own decision.
With CloudNativePG 1.21
CloudNativePG allows you to use both object store and volume snapshot strategies with your Postgres clusters. While the WAL archive - containing the transactional logs - still needs to reside in an object store, physical base backups (copies of the PostgreSQL data files) can now be stored as a tarball in an object store, or as volume snapshots.
The WAL archive is a requirement for both online backup and, most importantly, Point In Time Recovery (PITR).
The first implementation of volume snapshots in CloudNativePG has one limitation that’s worth mentioning: it supports Cold Backup only. In database technology, a cold (physical) backup is a copy of the data files taken when the DBMS is shut down - as bad as this may sound, you shouldn’t worry: a production cluster normally has at least a replica, and the current Cold Backup implementation takes a full backup from a standby, without impacting your write operations on the primary. As explained further down, this limitation will be removed in version 1.22 with support of PostgreSQL’s low level API for Hot Physical Base Backups.
In any case, a very important outcome of Cold Backups is that they are a statically consistent physical representation of the entire database cluster at a single point in time (a database snapshot, not to be confused with volume snapshots) and, as a result, they are sufficient to restore a Postgres cluster. So, for example, if your Recovery Point Objective is 1 hour for the last week of data, you can fulfill it with a volume snapshot backup every hour, retaining the last 7 days.
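As a sketch of what such a policy could look like, an hourly volume snapshot backup can be declared with CloudNativePG’s ScheduledBackup resource (the cluster name hendrix and the resource name are illustrative; CloudNativePG uses a six-field cron syntax, with seconds first):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: hendrix-hourly-vs
spec:
  # Six-field cron expression: at second 0, minute 0 of every hour
  schedule: '0 0 * * * *'
  cluster:
    name: hendrix
  method: volumeSnapshot
```

Note that, in this first implementation, pruning snapshots older than your retention window (7 days in the example above) is left to you, for instance via the labels discussed below.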
Volume snapshot backup
Before you proceed, make sure you have the name of the storage class and the related volume snapshot class. Given that they vary from environment to environment, I will be using generic placeholders throughout this article: <MY_STORAGE_CLASS> for the storage class and <MY_VOLUMESNAPSHOT_CLASS> for the volume snapshot class.
IMPORTANT: I won’t be covering any specific storage class or environment in this article. However, you can apply these examples in every environment, just by making sure you use the correct storage class and volume snapshot class.
You can now enable volume snapshotting for physical base backups just by adding the volumeSnapshot stanza in the backup section of a PostgreSQL Cluster resource. An example might help, especially if you are familiar with how CloudNativePG works.
Suppose you want to create a Postgres cluster called hendrix with two replicas, reserving a 10GB volume for PGDATA and a 10GB volume for WAL files. Suppose also that you have already set up backups on the object store so that you can archive the WAL files there (that’s the barmanObjectStore section, which we leave empty in this article as it is not relevant).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hendrix
spec:
  instances: 3
  storage:
    storageClass: <MY_STORAGE_CLASS>
    size: 10Gi
  walStorage:
    storageClass: <MY_STORAGE_CLASS>
    size: 10Gi
  backup:
    # Volume snapshot backups
    volumeSnapshot:
      className: <MY_VOLUMESNAPSHOT_CLASS>
    # For the WAL archive and object store backups
    barmanObjectStore:
      …
Although you can directly create a Backup resource, my advice is to either:
- Use the ScheduledBackup object to organize your volume snapshot backups for the Postgres cluster on a daily or even hourly basis
- Use the backup -m volumeSnapshot command of the cnpg plugin for kubectl to get the on-demand Backup resource created for you
In this case I will use the plugin:
kubectl cnpg backup -m volumeSnapshot hendrix
Under the hood, the plugin will create a Backup resource following the hendrix-<YYYYMMDDHHMMSS> naming pattern, where YYYYMMDDHHMMSS is the time the backup was requested.
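The resource the plugin creates can also be written by hand. A minimal sketch, based on the fields used throughout this article (the timestamped name is illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: hendrix-20231017150434
spec:
  cluster:
    name: hendrix
  method: volumeSnapshot
```

Applying this manifest with kubectl apply is equivalent to running the plugin command above.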
The operator then initiates the Cold Backup procedure, by:
- Shutting down the Postgres server for the selected replica (fencing)
- Creating a VolumeSnapshot resource for each volume defined for the cluster, in our case hendrix-YYYYMMDDHHMMSS for the PGDATA and hendrix-YYYYMMDDHHMMSS-wal for the WAL files
- Waiting for the CSI external snapshotter to create, for each VolumeSnapshot, the related VolumeSnapshotContent
- Removing, when completed, the fence on the selected replica for the Cold Backup operation
You can list the available backups and restrict them to the hendrix cluster by running:
kubectl get backup --selector=cnpg.io/cluster=hendrix
As you can see, the METHOD column reports whether a backup has been taken via volume snapshots or object stores:
NAME                     AGE    CLUSTER   METHOD              PHASE       ERROR
hendrix-20231017150434   81s    hendrix   volumeSnapshot      completed
hendrix-20231017125847   7m8s   hendrix   barmanObjectStore   completed
Similarly, you can list the current volume snapshots for the hendrix cluster with:
kubectl get volumesnapshot --selector=cnpg.io/cluster=hendrix
Both backups and volume snapshots carry important annotations and labels that allow you to browse these objects and remove them according to their month or date of execution, for example. Make sure you spend some time exploring them through the kubectl describe command.
You can then schedule daily backups at 5AM every morning by creating a ScheduledBackup as follows:
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: hendrix-vs
spec:
  schedule: '0 0 5 * * *'
  cluster:
    name: hendrix
  backupOwnerReference: self
  method: volumeSnapshot
For more detailed information, please refer to the documentation on volume snapshot backup for CloudNativePG.
Volume snapshot recovery
Recovery is what makes a backup useful: make sure that you test the procedure before you adopt it in production - and also on a regular basis, preferably automated (the declarative approach opens up so many interesting scenarios in the areas of data warehousing and sandboxing for reporting and analysis).
Recovery from volume snapshots is achieved in the same way as CloudNativePG recovers from object stores: by bootstrapping a new cluster. The only difference here is that, instead of just pointing to an object store, you can now request to create the new PVCs starting from a set of consistent and related volume snapshots (PGDATA, WALs, and soon also tablespaces).
All you need to do is create a new cluster resource (for example hendrix-recovery), with identical settings to the hendrix one, except the bootstrap section. Here is an excerpt:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hendrix-recovery
spec:
  # <snip>
  bootstrap:
    recovery:
      volumeSnapshots:
        storage:
          name: hendrix-YYYYMMDDHHMMSS
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
        walStorage:
          name: hendrix-YYYYMMDDHHMMSS-wal
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
When you create this resource, the recovery job will provision the underlying PVCs starting from the snapshots specified in .spec.bootstrap.recovery.volumeSnapshots. Once completed, PostgreSQL is started.
You might have noticed that in the above example I didn’t define any WAL archive. This is because the above volume snapshots were taken using a Cold Backup strategy (the only one available for now in CloudNativePG). As mentioned earlier, these are consistent database snapshots and are sufficient to restore to a specific point in time: the time of the backup.
However, if you want to take advantage of volume snapshots for lower RPO with PITR, or for better global RTO through a replica cluster in a different region (provided the underlying storage class supports relaying volume snapshots across multiple Kubernetes clusters), you need to specify the location of the WAL archive by defining a source through an external cluster. For example, you can add the following to your cluster manifest:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hendrix-recovery
spec:
  # <snip>
  bootstrap:
    recovery:
      source: hendrix
      volumeSnapshots:
        # <snip>
  replica:
    enabled: true
    source: hendrix
  externalClusters:
    - name: hendrix
      barmanObjectStore:
        # <snip>
The above manifest requests a new Postgres cluster, hendrix-recovery, which is bootstrapped using the given volume snapshots, then placed in continuous replication by fetching WAL files from the hendrix object store, and started in read-only mode.
These are just a few examples. Don’t let yourself be overwhelmed by the flexibility, freedom, and creativity you can unleash with both Postgres and the operator in terms of architectures: read this article on recommended architectures for PostgreSQL in Kubernetes from the CNCF blog for more ideas.
Let’s now talk about some initial benchmarks I have performed on volume snapshots using 3 r5.4xlarge nodes on AWS EKS with the gp3 storage class. I have defined 4 different database size categories (tiny, small, medium, and large), as follows:
| Cluster name | Database size | pgbench init scale | PGDATA volume size | WAL volume size | pgbench init duration |
|---|---|---|---|---|---|
| tiny | … | 300 | … | … | … |
| small | … | … | … | … | … |
| medium | … | … | … | … | 3h 15m 34s |
| large | 4.4 TB | 300000 | … | … | 32h 47m 47s |
These databases have been created by running the pgbench initialization process with different scaling factors, ranging from 300 for the smallest (taking a little over a minute) to 300000 for the largest (taking approximately 33 hours to complete and producing a 4.4 TB database). The table above also shows the size of the PGDATA and WAL volumes used in the tests.
The experiment consisted of taking a first backup on volume snapshots, and then a second one after running pgbench for 1 hour. It’s important to note that the first backup needs to store the entire content of the volume, while subsequent ones only store the delta from the previous snapshot. Each cluster was destroyed and then recreated starting from the last snapshot. In order not to taint the results of the test and introduce variability, I decided not to replay any WAL files, so as to measure the bare duration of the restore operation, from the recovery of the snapshot until Postgres starts accepting connections (in Kubernetes terms, until the readiness probe succeeds and the pod is ready).
The table below shows the results of both backup and recovery for each of them.
| Cluster name | 1st backup duration | 2nd backup duration after 1hr of pgbench | Full recovery time |
|---|---|---|---|
| … | 3h 54m 6s | … | … |
All the databases were able to restart within 2 minutes (yes, minutes), including the largest instance of roughly 4.4 TB. This is an optimistic figure: the actual time depends on several factors, including how the CSI external snapshotter stores deltas, though in most cases it should remain well below the time taken for the first full backup. Our advice, as usual, is to test it yourself, as every organizational environment (which includes not just technology, but people too!) is unique.
After CloudNativePG 1.21
As I said at the start of this article, this first implementation is just the beginning of volume snapshots support in CloudNativePG.
We are already working on adding Hot Backup support in version 1.22, honoring the pg_start_backup() and pg_stop_backup() interfaces so as to avoid shutting down an instance. This will require the presence of a WAL archive, currently available only in object stores.
The implementation of this feature will open up even more interesting scenarios. Snapshotting is the foundation of PVC Cloning, the possibility of creating a new PVC by using an existing one as a source. As a result, we will be able to overcome an existing limitation: scale-up and replica cloning are currently implemented with pg_basebackup only. If you have a very large database, that can imply hours or days to complete the process (not always critical, but often undesired). PVC Cloning will make this process faster.
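For context, PVC Cloning is part of the standard Kubernetes CSI API: a new PVC simply references an existing one as its dataSource, with no intermediate snapshot. A minimal sketch with illustrative names:

```yaml
# Clone an existing PVC directly (names are illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hendrix-2
spec:
  storageClassName: <MY_STORAGE_CLASS>
  dataSource:
    name: hendrix-1
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

The clone must live in the same namespace as the source and use a CSI driver that supports the cloning capability.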
Another area where PVC Cloning will help is in-place upgrades of PostgreSQL (let’s say from version 13 to 16). We have not yet introduced pg_upgrade support in CloudNativePG for a simple reason: there’s no way to automatically roll back in case of an issue with any of the 15+ steps required by this critical operation, and rolling back using a backup in an object store might not always be a good strategy (definitely not in the case of a VLDB). Our idea is to use PVC Cloning to create new PVCs on which to run pg_upgrade and, if everything goes as planned, swap the old PVCs with the new ones; in case of failure, abort the upgrade and resume the existing cluster (with the untouched PVCs). Major PostgreSQL upgrades are currently possible with CloudNativePG: please read “The Current State of Major PostgreSQL Upgrades with CloudNativePG” for more information.
Snapshotting will be even more interesting when we introduce support for another global object in PostgreSQL: tablespaces, expected for 1.22 as well. Tablespaces enable you to place indexes or data partitions in separate I/O volumes, for better performance and vertical scalability. Among the benefits of tablespaces, spreading your data in multiple volumes might decrease the time for backup and recovery, as snapshots can be taken in parallel.
We are also following the progress of the Kubernetes VolumeGroupSnapshot feature, currently in alpha state, to achieve atomic snapshots across the different volumes (PGDATA, WALs, and tablespaces) of a Postgres database.
Declarative support for Kubernetes’ Volume Snapshot API is another milestone in the evolution of CloudNativePG as an open and standard way to run PostgreSQL in a cloud native environment.
Although 1.22 and subsequent versions will make it even more evident, this version already takes the PostgreSQL VLDB experience in Kubernetes to another level, whether you are in the public cloud or on premises, on VMs or bare metal.
In an AI-driven world, volume snapshots also change the way you approach data warehousing with Postgres and how you create sandbox environments for analysis and data exploration.
In terms of business continuity, support for volume snapshots gives you, among other things:
- Better Recovery Time Objectives through faster restores from volume snapshots following a disaster
- More flexibility on the Recovery Point Objective side, by adopting Cold Backup only solutions, or implementing hybrid backup strategies based on object stores too (with different scheduling)
- Finer control on where to relay your data once the volume snapshot is completed, by relying on the storage class to clone your data in different Kubernetes clusters or regions
Join the CloudNativePG Community now if you want to help improve the project, at any level.
If you are an organization that is interested in moving your PostgreSQL databases to Kubernetes, don’t hesitate to contact us for professional support, especially if you are at an early stage of the process. EDB provides 24/7 support on CloudNativePG and PostgreSQL under the Community 360 plan (soon available for OpenShift too!). If you are looking for longer support periods, integration with Kasten K10, as well as the possibility to run Postgres Extended with TDE or Postgres Advanced to facilitate migrations from Oracle, you should look into EDB Postgres for Kubernetes, which is a product based on CloudNativePG and available under the EDB Standard and Enterprise plans.