EnterpriseDB Failover Manager (EFM) is a great tool for automating failover and switchover if you use Postgres' streaming replication feature. Not only do you get High Availability (HA), you get it with just a few simple commands, and it all happens very quickly. We recently had a customer who wanted to improve the wall-clock performance of a master/standby switchover in EFM. In the process, we started discussing what EFM actually does to perform a switchover, which in some scenarios takes as little as 5 seconds. We thought we'd share that knowledge with everyone (note that, unless otherwise specified, "Agent" is the agent on the node where the command was run):
1. The 'efm' command process (the CLI) first checks that a master exists and that it and all standbys are in sync. You can promote a standby at any time, even if there's no master or if things are out of sync, but a switchover will not happen unless there's a master and everything is in sync. The CLI then signals the local agent to start the promotion/switchover.
2. Agent retrieves the recovery.conf text from a standby.
3. Agent sends the text to the original master. This signals to the master that it should become a standby after the normal manual promotion steps occur (steps 3-10 below).
4. Master agent drops the VIP and writes out the recovery.conf file (note that the host address in it is incorrect at this point).
5. Master agent stops monitoring its local database, stops the database, and becomes IDLE.
6. A standby is chosen for promotion (replay is paused, and the standby is chosen based on xlog location and priority) and enters the PROMOTING state.
7. Standby agent runs the fencing script, if one exists.
8. Standby agent writes the trigger file and resumes replay.
9. The thread started in step 6 above monitors the database until it comes out of recovery and becomes master. This is where the recovery.check.period property is used.
10. Other standby agents reconfigure their recovery.conf files to point to the new master.
11. Other standby agents stop monitoring their local databases, restart them, and resume monitoring.
12. Original master agent reconfigures its recovery.conf file to point to the new master.
13. Original master agent starts its database and resumes monitoring.
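To make the "reconfigure recovery.conf to point to the new master" steps above more concrete, here is a minimal Python sketch of what such a rewrite could look like. It is not EFM's actual code: the function name `repoint_primary_conninfo` is hypothetical, and the sketch assumes the file uses the standard `primary_conninfo` setting with an unquoted `host=` keyword inside the connection string.

```python
import re


def repoint_primary_conninfo(recovery_conf_text: str, new_master_host: str) -> str:
    """Return recovery.conf text with the host in primary_conninfo replaced.

    Hypothetical helper for illustration; assumes an unquoted host= value.
    """
    out_lines = []
    for line in recovery_conf_text.splitlines():
        # Only touch the primary_conninfo setting; every other line
        # (standby_mode, trigger_file, comments) is copied through unchanged.
        if re.match(r"\s*primary_conninfo\s*=", line):
            # Swap the value of the host keyword. The \b keeps us from
            # matching a longer keyword that merely ends in "host".
            line = re.sub(
                r"\bhost=[^\s']+",
                lambda _match: "host=" + new_master_host,
                line,
            )
        out_lines.append(line)
    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    # Example text as it might be copied from a standby: the host still
    # names the old master, which is why it has to be corrected later.
    sample = (
        "standby_mode = 'on'\n"
        "primary_conninfo = 'host=10.0.0.1 port=5432 user=repl'\n"
        "trigger_file = '/tmp/promote.trigger'\n"
    )
    print(repoint_primary_conninfo(sample, "10.0.0.2"))
```

This also illustrates why the file written on the original master early in the process has an incorrect host address: the text comes from a standby, so it still points at the old master until the new master is known.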
That's it! Note that the new master is promoted before the old one is restarted as a standby, so you actually have a new master sooner than the total switchover time suggests.
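The moment the new master becomes available is the moment the promoted database comes out of recovery, which the agent detects by polling. Below is a rough Python sketch of such a polling loop, not EFM's implementation. The function `wait_until_promoted` and its `timeout_s` parameter are hypothetical, and `check_period_s` merely stands in for the recovery.check.period property (treated as seconds here, which is an assumption). In practice, `is_in_recovery` could be a callable that runs `SELECT pg_is_in_recovery()` against the standby.

```python
import time
from typing import Callable


def wait_until_promoted(
    is_in_recovery: Callable[[], bool],
    check_period_s: float,
    timeout_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll until the database leaves recovery; return the elapsed seconds.

    Hypothetical sketch: check_period_s plays the role of
    recovery.check.period, and timeout_s is an illustrative safety limit.
    Raises TimeoutError if the database is still in recovery at the limit.
    """
    start = clock()
    while True:
        # A False answer means the database has left recovery and now
        # accepts writes, i.e. it is the new master.
        if not is_in_recovery():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError("database still in recovery after timeout")
        # A shorter period detects promotion sooner at the cost of
        # checking the database more often.
        sleep(check_period_s)
```

Because the function returns as soon as the database leaves recovery, everything that follows (repointing the other standbys and restarting the old master as a standby) adds to the total switchover time but not to the time spent without a master.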