A new era for Postgres in Kubernetes has just begun. Version 1.21 of CloudNativePG introduces declarative support for Kubernetes’ standard API for Volume Snapshots, enabling, among other things, incremental and differential copy for both backup and recovery operations to improve RPO and RTO.
The benchmarks on EKS that I present in this article highlight that backup and - most importantly - recovery times of Very Large Databases (VLDB) are now reduced by a few orders of magnitude when compared with the existing object store counterpart. As an example, I was able to fully recover a 4.5 TB Postgres database from a volume snapshot in 2 minutes. This is just the beginning, as we are planning to natively support more of the features Kubernetes provides at the storage level.
About Kubernetes volume snapshots
First of all, volume snapshots have been around for many years. When I was a maintainer and developer of Barman, a popular backup and recovery open source tool for Postgres, we regularly received requests from customers to integrate it with their snapshot-capable storage solutions. The major blocker was the lack of a standard interface through which we could control the snapshotting capabilities of the storage.
Kubernetes fixed this. In December 2020, Kubernetes 1.20 promoted volume snapshotting to general availability, enriching the API with the VolumeSnapshot, VolumeSnapshotContent, and VolumeSnapshotClass custom resource definitions. Volume snapshotting is now in every supported Kubernetes version, providing a generic and standard interface for:
- Creating a new volume snapshot from a PVC
- Creating a new volume from a volume snapshot
- Deleting an existing snapshot
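For reference, this is roughly what the standard API looks like. The following is a minimal sketch using illustrative names (a PVC called pgdata, a snapshot class called my-snapclass, and a storage class called my-storage-class are all assumptions, not part of any specific environment):

```yaml
# Create a snapshot of an existing PVC (names are illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pgdata-snap
spec:
  volumeSnapshotClassName: my-snapclass
  source:
    persistentVolumeClaimName: pgdata
---
# Create a new volume (PVC) starting from that snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata-restored
spec:
  storageClassName: my-storage-class
  dataSource:
    name: pgdata-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Deleting the snapshot is just a matter of deleting the VolumeSnapshot resource; the CSI driver takes care of the rest.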
The implementation is delegated to the underlying CSI drivers, and storage classes can offer a variety of capabilities depending on the storage: these might include incremental block-level copy, differential block-level copy, replication to a secondary or n-ary location in another region, and so on.
The main advantage is that the interface abstracts the complexity and the management of storage from the application, in our case a Postgres workload. From a database perspective, incremental and differential backup and recovery are the most desired features that volume snapshotting brings along.
All major cloud providers have CSI drivers and storage classes supporting volume snapshots (for details, see GKE, EKS, or AKS). On premises, you can use OpenShift Data Foundation (ODF) and LVM with Red Hat, Longhorn with Rancher, and Portworx by Pure Storage - just to cite a few. You can find a detailed list of available drivers in the official documentation of the Kubernetes Container Storage Interface (CSI) project.
Before CloudNativePG 1.21
Prior to version 1.21, CloudNativePG supported backup and recovery only on object stores.
Object stores are very practical in many contexts, especially in cloud environments and with small/medium size databases - I’d say below 500GB, although it really depends on several factors and there’s no clear demarcation. One of those factors is the time it takes to back up a database and store it in an object store; the most important one - at least for the scope of this article - is the time to restore from a backup that’s safely secured in an object store: this metric represents the Recovery Time Objective in your Business Continuity Plan for that specific database. My advice is to measure both times, and then decide whether they are acceptable. Based on my tests and past experience, for a 45GB database, backup time might be in the order of 60-100 minutes, while recovery time falls in the 30-60 minute range (these might change for the better or worse depending on the actual object store technology underneath). Without incremental and/or differential copy, these times increase linearly with database size, proving inadequate for VLDB use cases.
For this reason, following some conversations with members of the CNCF TAG Storage at KubeCon Europe 2023 in Amsterdam, in April 2023 we decided to introduce imperative support for backup and recovery with Kubernetes volume snapshots, through the cnpg plugin for kubectl (CloudNativePG 1.20). This allowed us to quickly prototype the feature, and then enrich it with a declarative API.
Disclaimer: for the sake of honesty, other Postgres operators for Kubernetes provided information on how to use volume snapshots for both backup and recovery. Given that these instructions are imperative, and that our operator is built with a fully declarative model, I don’t cover them in this article. Other operators rely on Postgres level backup tools for incremental backup/recovery; having conceived Barman many years ago, we could have gone down that path, but our vision is to rely on the Kubernetes way of doing incremental backup/recovery in order to facilitate the integration of Postgres in that ecosystem. Nonetheless, my advice is that you evaluate those alternative solutions and compare CloudNativePG with all available operators for Postgres before you make your own decision.
With CloudNativePG 1.21
CloudNativePG allows you to use both object store and volume snapshot strategies with your Postgres clusters. While the WAL archive - containing the transactional logs - still needs to reside in an object store, physical base backups (copies of the PostgreSQL data files) can now be stored as a tarball in an object store, or as volume snapshots.
The WAL archive is a requirement for both online backup and, most importantly, Point In Time Recovery (PITR).
The first implementation of volume snapshots in CloudNativePG has one limitation that’s worth mentioning: it supports Cold Backup only. In database technology, a cold (physical) backup is a copy of the data files taken when the DBMS is shut down - as bad as this may sound, you shouldn’t worry: a production cluster normally has at least a replica, and the current Cold Backup implementation takes a full backup from a standby, without impacting your write operations on the primary. As explained further down, this limitation will be removed in version 1.22 with support of PostgreSQL’s low level API for Hot Physical Base Backups.
In any case, a very important outcome of Cold Backups is that they are a statically consistent physical representation of the entire database cluster at a single point in time (a database snapshot, not to be confused with volume snapshots) and, as a result, they are sufficient to restore a Postgres cluster. So, for example, if your Recovery Point Objective is 1 hour for the last week of data, you can fulfill it with a volume snapshot backup every hour, retaining the last 7 days.
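As a sketch of what such a policy could look like, an hourly volume snapshot backup can be declared with CloudNativePG’s ScheduledBackup resource (the cluster name hendrix and the resource name are illustrative; CloudNativePG uses a six-field cron syntax, with seconds first):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: hendrix-hourly-vs
spec:
  # Six-field cron expression: at second 0, minute 0 of every hour
  schedule: '0 0 * * * *'
  cluster:
    name: hendrix
  method: volumeSnapshot
```

Note that, in this first implementation, pruning snapshots older than your retention window (7 days in the example above) is left to you, for instance via the labels discussed below.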
Volume snapshot backup
Before you proceed, make sure you have the name of the storage class and the related volume snapshot class. Given that they vary from environment to environment, I will be using generic placeholders throughout this article: <MY_STORAGE_CLASS> for the storage class and <MY_VOLUMESNAPSHOT_CLASS> for the volume snapshot class.
IMPORTANT: I won’t be covering any specific storage class or environment in this article. However, you can apply these examples in every environment, just by making sure you use the correct storage class and volume snapshot class.
You can now enable volume snapshotting for physical base backups just by adding the volumeSnapshot stanza in the backup section of a PostgreSQL Cluster resource. An example might help, especially if you are familiar with how CloudNativePG works.
Suppose you want to create a Postgres cluster called hendrix with two replicas, reserving a 10GB volume for PGDATA and a 10GB volume for WAL files. Suppose also that you have already set up backups on the object store so that you can archive the WAL files there (that’s the barmanObjectStore section, which we leave empty in this article as it is not relevant).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hendrix
spec:
  instances: 3
  storage:
    storageClass: <MY_STORAGE_CLASS>
    size: 10Gi
  walStorage:
    storageClass: <MY_STORAGE_CLASS>
    size: 10Gi
  backup:
    # Volume snapshot backups
    volumeSnapshot:
      className: <MY_VOLUMESNAPSHOT_CLASS>
    # For the WAL archive and object store backups
    barmanObjectStore:
      …
Although you can directly create a Backup resource, my advice is to either:
- Use the ScheduledBackup object to organize your volume snapshot backups for the Postgres cluster on a daily or even hourly basis
- Use the backup -m volumeSnapshot command of the cnpg plugin for kubectl to get the on-demand Backup resource created for you
In this case I will use the plugin:
kubectl cnpg backup -m volumeSnapshot hendrix
Under the hood, the plugin will create a Backup resource following the hendrix-<YYYYMMDDHHMMSS> naming pattern, where YYYYMMDDHHMMSS is the time the backup was requested.
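The resource the plugin creates can also be written by hand. A minimal sketch, based on the fields used throughout this article (the timestamped name is illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: hendrix-20231017150434
spec:
  cluster:
    name: hendrix
  method: volumeSnapshot
```

Applying this manifest with kubectl apply is equivalent to running the plugin command above.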
The operator then initiates the Cold Backup procedure, by:
- Shutting down the Postgres server for the selected replica (fencing)
- Creating a VolumeSnapshot resource for each volume defined for the cluster, in our case hendrix-YYYYMMDDHHMMSS for the PGDATA and hendrix-YYYYMMDDHHMMSS-wal for the WAL files
- Waiting for the CSI external snapshotter to create, for each VolumeSnapshot, the related VolumeSnapshotContent
- Removing, when completed, the fence on the selected replica for the Cold Backup operation
You can list the available backups and restrict them to the hendrix cluster by running:
kubectl get backup --selector=cnpg.io/cluster=hendrix
As you can see, the METHOD column reports whether a backup has been taken via volume snapshots or object stores:
NAME                     AGE    CLUSTER   METHOD              PHASE       ERROR
hendrix-20231017150434   81s    hendrix   volumeSnapshot      completed
hendrix-20231017125847   7m8s   hendrix   barmanObjectStore   completed
Similarly, you can list the current volume snapshots for the hendrix cluster with:
kubectl get volumesnapshot --selector=cnpg.io/cluster=hendrix
Both backups and volume snapshots carry important annotations and labels that allow you to browse these objects and remove them according to their month or date of execution, for example. Make sure you spend some time exploring them through the kubectl describe command.
You can then schedule daily backups at 5AM every morning by creating a ScheduledBackup as follows:
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: hendrix-vs
spec:
  schedule: '0 0 5 * * *'
  cluster:
    name: hendrix
  backupOwnerReference: self
  method: volumeSnapshot
For more detailed information, please refer to the documentation on volume snapshot backup for CloudNativePG.
Volume snapshot recovery
Recovery is what makes a backup useful: make sure that you test the procedure before you adopt it in production - and also on a regular basis, preferably automated (the declarative approach opens up so many interesting scenarios in the areas of data warehousing and sandboxing for reporting and analysis).
Recovery from volume snapshots is achieved in the same way as CloudNativePG recovers from object stores: by bootstrapping a new cluster. The only difference here is that, instead of just pointing to an object store, you can now request to create the new PVCs starting from a set of consistent and related volume snapshots (PGDATA, WALs, and soon also tablespaces).
All you need to do is create a new cluster resource (for example hendrix-recovery), with identical settings to the hendrix one, except the bootstrap section. Here is an excerpt:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hendrix-recovery
spec:
  # <snip>
  bootstrap:
    recovery:
      volumeSnapshots:
        storage:
          name: hendrix-YYYYMMDDHHMMSS
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
        walStorage:
          name: hendrix-YYYYMMDDHHMMSS-wal
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
When you create this resource, the recovery job will provision the underlying PVCs starting from the snapshots specified in .spec.bootstrap.recovery.volumeSnapshots. Once completed, PostgreSQL is started.
You might have noticed that in the above example I didn’t define any WAL archive. This is because the above volume snapshots were taken using a Cold Backup strategy (the only one available for now in CloudNativePG). As mentioned earlier, these are consistent database snapshots and are sufficient to restore to a specific point in time: the time of the backup.
However, if you want to take advantage of volume snapshots for lower RPO with PITR, or for better global RTO through a replica cluster in a different region (provided the underlying storage class supports relaying volume snapshots across multiple Kubernetes clusters), you need to specify the location of the WAL archive by defining a source through an external cluster. For example, you can add the following to your cluster manifest:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hendrix-recovery
spec:
  # <snip>
  bootstrap:
    recovery:
      source: hendrix
      volumeSnapshots:
        # <snip>
  replica:
    enabled: true
    source: hendrix
  externalClusters:
    - name: hendrix
      barmanObjectStore:
        # <snip>
The above manifest requests a new Postgres cluster, hendrix-recovery, which is bootstrapped using the given volume snapshots, then placed in continuous replication by fetching WAL files from the hendrix object store, and started in read-only mode.
These are just a few examples. Don’t let yourself be overwhelmed by the flexibility, freedom, and creativity you can unleash with both Postgres and the operator in terms of architectures: read this article on recommended architectures for PostgreSQL in Kubernetes from the CNCF blog for more ideas.
Let’s now talk about some initial benchmarks I have performed on volume snapshots using 3 r5.4xlarge nodes on AWS EKS with the gp3 storage class. I have defined 4 different database size categories (tiny, small, medium, and large), as follows:
| Cluster name | Database size | pgbench init scale | PGDATA volume size | WAL volume size | pgbench init duration |
|---|---|---|---|---|---|
| tiny | … | 300 | … | … | … |
| small | … | … | … | … | … |
| medium | … | … | … | … | 3h 15m 34s |
| large | 4.4 TB | 300000 | … | … | 32h 47m 47s |
These databases have been created by running the pgbench initialization process with different scaling factors, ranging from 300 for the smallest (taking a little over a minute) to 300000 for the largest (taking approximately 33 hours to complete and producing a 4.4 TB database). The table above also shows the size of the PGDATA and WAL volumes used in the tests.
The experiment consisted of taking a first backup on volume snapshots, and then a second one after running pgbench for 1 hour. It’s important to note that the first backup needs to store the entire content of the volume, while subsequent ones only store the delta from the previous snapshot. Each cluster was destroyed and then recreated starting from the last snapshot. In order not to taint the results of the test and introduce variability, I decided not to replay any WAL files, so as to measure the bare duration of the restore operation, from the recovery of the snapshot until Postgres starts accepting connections (in Kubernetes terms, until the readiness probe succeeds and the pod is ready).
The table below shows the results of both backup and recovery for each of them.
| Cluster name | 1st backup duration | 2nd backup duration after 1hr of pgbench | Full recovery time |
|---|---|---|---|
| … | 3h 54m 6s | … | … |
All the databases were able to restart within 2 minutes (yes, minutes), including the largest instance of roughly 4.4 TB. This is an optimistic figure: the actual time depends on several factors, including how the CSI external snapshotter stores deltas, though in most cases it should remain well below the time taken for the first full backup. Our advice, as usual, is to test it yourself, as every organizational environment (which includes not just technology, but people too!) is unique.
After CloudNativePG 1.21
As I said at the start of this article, this first implementation is just the beginning of volume snapshots support in CloudNativePG.
We are already working on adding Hot Backup support in version 1.22, honoring the pg_start_backup() and pg_stop_backup() interfaces so as to avoid shutting down an instance. This will require the presence of a WAL archive, currently available only in object stores.
The implementation of this feature will open up even more interesting scenarios. Snapshotting is the foundation of PVC Cloning, the possibility of creating a new PVC by using an existing one as a source. As a result, we will be able to overcome an existing limitation: scale-up and replica cloning are currently implemented with pg_basebackup only. If you have a very large database, that can imply hours or days to complete the process (not always critical, but often undesired). PVC Cloning will make this process faster.
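For context, PVC Cloning is part of the standard Kubernetes CSI API: a new PVC simply references an existing one as its dataSource, with no intermediate snapshot. A minimal sketch with illustrative names:

```yaml
# Clone an existing PVC directly (names are illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hendrix-2
spec:
  storageClassName: <MY_STORAGE_CLASS>
  dataSource:
    name: hendrix-1
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

The clone must live in the same namespace as the source and use a CSI driver that supports the cloning capability.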
Another area where PVC Cloning will help is in-place upgrades of PostgreSQL (let’s say from version 13 to 16). We have not yet introduced pg_upgrade support in CloudNativePG for a simple reason: there’s no way to automatically roll back in case of an issue with any of the 15+ steps required by this critical operation, and rolling back using a backup in an object store might not always be a good strategy (definitely not in the case of a VLDB). Our idea is to use PVC Cloning to create new PVCs on which to run pg_upgrade and, if everything goes as planned, swap the old PVCs with the new ones; in case of failure, abort the upgrade and resume the existing cluster (with the untouched PVCs). Major PostgreSQL upgrades are currently possible with CloudNativePG: please read “The Current State of Major PostgreSQL Upgrades with CloudNativePG” for more information.
Snapshotting will be even more interesting when we introduce support for another global object in PostgreSQL: tablespaces, expected for 1.22 as well. Tablespaces enable you to place indexes or data partitions in separate I/O volumes, for better performance and vertical scalability. Among the benefits of tablespaces, spreading your data in multiple volumes might decrease the time for backup and recovery, as snapshots can be taken in parallel.
We are also following the progress of the Kubernetes VolumeGroupSnapshot feature, currently in alpha state, to achieve atomic snapshots across the different volumes (PGDATA, WALs, and tablespaces) of a Postgres database.
Declarative support for Kubernetes’ Volume Snapshot API is another milestone in the evolution of CloudNativePG as an open and standard way to run PostgreSQL in a cloud native environment.
Although 1.22 and subsequent versions will make it even more evident, this version already takes the PostgreSQL VLDB experience in Kubernetes to another level, whether you are in the public cloud or on premises, on VMs or bare metal.
In an AI-driven world, volume snapshots also change the way you approach data warehousing with Postgres and how you create sandbox environments for analysis and data exploration.
In terms of business continuity, support for volume snapshots gives you, among other things:
- Better Recovery Time Objectives through faster restores from volume snapshots following a disaster
- More flexibility on the Recovery Point Objective side, by adopting Cold Backup only solutions, or implementing hybrid backup strategies based on object stores too (with different scheduling)
- Finer control on where to relay your data once the volume snapshot is completed, by relying on the storage class to clone your data in different Kubernetes clusters or regions
Join the CloudNativePG Community now if you want to help improve the project, at any level.
If you are an organization that is interested in moving your PostgreSQL databases to Kubernetes, don’t hesitate to contact us for professional support, especially if you are at an early stage of the process. EDB provides 24/7 support on CloudNativePG and PostgreSQL under the Community 360 plan (soon available for OpenShift too!). If you are looking for longer support periods, integration with Kasten K10, as well as the possibility to run Postgres Extended with TDE or Postgres Advanced to facilitate migrations from Oracle, you should look into EDB Postgres for Kubernetes, which is a product based on CloudNativePG and available under the EDB Standard and Enterprise plans.