May 5, 2020
This blog is part of a series that will discuss best practices for High Availability (HA) for Postgres databases. In a prior blog, I explained what High Availability means (What Does "Database High Availability" Really Mean?). Future blogs will look at configuration recommendations for streaming replication, optimal parameters for EDB Failover Manager (EFM), and how EFM can be used to quickly and reliably detect and manage database server failures.
This blog is focused on minimizing downtime associated with database server software maintenance, a.k.a. patching. The techniques outlined below leverage EDB Failover Manager (EFM) to help with minimizing downtime when patching a cluster of Postgres databases. All recommendations apply to PostgreSQL and EDB Postgres Advanced Server.
I use the term “DBA” to designate the person who does the patching, knowing that in some companies DBAs do this task, in others a SysAdmin is in charge of patching, the DevOps team may do it, or it is fully automated using Ansible, Chef, Puppet or Salt.
Software Maintenance: Patching & Schedules
We commonly see three reasons for software maintenance:
- A specific patch has to be installed to remedy known software issues
- A new minor version improves general stability and reliability of the software
- A security problem has been identified, and a patch needs to be installed to address the problem.
Software maintenance almost always implies downtime and could seriously impact the SLA for an application if it is not well planned and executed with maximum efficiency.
Most IT shops carefully plan maintenance activities, bundle the patches for the database, the OS, and other software, and execute the maintenance on a carefully planned schedule. This may not always be possible, because some security patches may require immediate attention. The PostgreSQL community and EDB release minor updates quarterly, except if there are urgent security fixes. I highly recommend applying quarterly updates and high-CVE security patches ASAP.
How to Plan Ideal Maintenance for the PostgreSQL Database
Two aspects need to be considered:
- Identifying an optimal maintenance window
- Make the window as short as possible
Ideal maintenance windows correspond with low system usage. This requires a detailed understanding of the usage patterns. For example in a global retail solution, when do we have the smallest number of customers online? Maybe it is at 1:00 AM EDT on Sunday morning. Only a detailed analysis will show that. DBAs need to have a very good understanding of the usage patterns, as they cannot always wait for the next planned downtime.
Many different strategies can be applied to minimize downtime. For standalone servers, that are not clustered in a HA configuration, patching EDB Postgres, alone, is as easy as stopping the database service and starting Postgres service with new binaries in a fraction of seconds. However, a common practice bundles multiple software patches together. This leverages a common maintenance window but also makes it longer.
An oft proposed solution swaps out the Postgres binaries on disk and then does a quick restart. This sounds appealing, but has two drawbacks:
- Security software may be scanning the disk and the memory, and raise alerts immediately when the on-disk and the in-memory images differ, as that is usually the sign of a malware attack.
- Postgres may crash if a new session loads an Extension Library dynamically from the disk, thereby causing incompatibility with the old Postgres code (in-memory). Learn more here.
This solution is not applicable during many planned maintenance events, as those often include OS or agent upgrades, which require restarts.
For mission-critical systems consisting of a master and several replicas, we recommend other approaches. The techniques described below will help minimize downtime of the database service, even if other maintenance operations at the OS level take individual servers down for many minutes.
What are the Strategies for the PostgreSQL Database Patching?
DBAs have three strategies:
- All Nodes Patching - In the All-Nodes-Patching strategy, the DBA applies the patches on all the nodes simultaneously and then brings back all the machines at the same time. One can use Ansible or other scripting solutions to parallelize this operation and try to finish the process in a shorter time window.
- Rolling Patching - In this strategy, DBAs apply the patches on one server at a time, starting with the standbys. After patching the standbys, the DBA applies the patches on the master. No switchover between master and replica is required.
- Minimum Downtime Patching - The DBA first patches the replicas. Instead of patching the master next, the DBAs first performs a switchover from the master to one of the updates Replicas and redirects the connections to the new master. After the switchover, update the old master and add it back into the cluster as a replica.
Note: If you have synchronous standby(s) in your streaming replication cluster, then for approaches (2) and (3), you would like to change the synchronous standby(s) to asynchronous by modifying the following parameter and reloading in EDB Postgres:
synchronous_standby_names = ‘’
After applying the patches, you can convert BACK your standby(s) from asynchronous to synchronous by changing the parameter mentioned above and reloading in EDB Postgres.
Patching a Postgres Server that is part of a HA Cluster
To patch a node that is part of a HA cluster, I recommend the following sequence of steps:
- Make sure EDB Postgres Failover Manager has the following parameters set to false in efm.properties file. master.shutdown.as.failure=false
- Graceful shutdown of the services in the following order:
- Shutdown efm agent - e.g systemctl stop edb-efm-3.9
- Shutdown EDB Postgres service - e.g. systemctl stop edb-as-12
- Apply the patches, either using yum or apt or Zypper command.
- Bring the services in the following order
- Start the EDB Postgres database service - e.g. systemctl start edb-as-12
- Start the EDB Failover manager service - e.g systemctl start edb-efm-3.9
In conclusion, using any one of the mentioned approaches, DBAs can bring their system up-to-date. Hopefully, my recommendations will help you maintain your SLAs!
Want to learn more about how to manage failover and replication for Postgres high availability? Click here.