What Happens When I Press Enter? -- EDB Failover Manager Switchover

Richard Yen

EnterpriseDB Failover Manager (EFM) is a great tool for automating failover and switchover if you use Postgres' streaming replication feature. Not only do you get High Availability (HA), but you also get it with just a few simple commands, and it all happens very quickly. We recently had an issue in which a customer sought to improve the wall-clock performance of a master/standby switchover in EFM. In the process, we started discussing what it takes for EFM to actually perform the switchover, which in some scenarios takes as little as 5 seconds. We thought we'd share that knowledge with everyone (note that, unless otherwise specified, "Agent" is the agent on the node where the command was run):

  1. The 'efm' command process (the CLI) first checks that a master exists and that it and all standbys are in sync. You can promote a standby at any time, even if there's no master or if things are out of sync, but a switchover will not happen unless there's a master and everything is in sync. The CLI then signals the local agent to start the promotion/switchover.
  2. Agent retrieves the recovery.conf text from a standby.
  3. Agent sends the text to the original master. This signals to the master that it should become a standby after the normal manual promotion steps occur (steps 3-10 below).
  4. Master agent drops the VIP and writes out the recovery.conf file (note that the host address in it is incorrect at this point).
  5. Master agent stops monitoring its local database, stops the database, and becomes IDLE.
  6. A standby is chosen for promotion (replay is paused, and the standby is chosen based on xlog location and priority) and enters the PROMOTING state.
  7. Standby agent runs the fencing script, if it exists.
  8. Standby agent writes the trigger file and resumes replay.
  9. Thread started in #6 above monitors the database until it comes out of recovery and becomes master. This is where the recovery.check.period property is used.
  10. Other standby agents reconfigure their recovery.conf files to point to the new master.
  11. Other standby agents stop monitoring the local database, restart the database, and resume monitoring.
  12. Original master agent reconfigures its recovery.conf file to point to the new master.
  13. Original master agent starts the database and resumes monitoring.
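
Two of the steps above, choosing the standby and waiting for it to leave recovery, can be sketched in Python. This is only an illustration, not EFM's actual code: the `Standby` class, `parse_lsn`, `choose_standby`, and `wait_for_promotion` are hypothetical names, and the rule that xlog location is compared first with a lower priority number breaking ties is an assumption. Only the inputs (xlog location and priority) and the role of recovery.check.period come from the description above.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Standby:
    # Hypothetical record for illustration; not EFM's internal representation.
    address: str
    xlog_location: int  # replayed WAL position as a byte offset
    priority: int       # assumed: lower number = preferred


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN string such as '0/3000060' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def choose_standby(standbys: List[Standby]) -> Standby:
    """Pick the most advanced standby; break ties by priority (assumed rule)."""
    if not standbys:
        raise ValueError("no standby available for promotion")
    return min(standbys, key=lambda s: (-s.xlog_location, s.priority))


def wait_for_promotion(in_recovery: Callable[[], bool],
                       check_period: float,
                       timeout: float) -> bool:
    """Poll until the database leaves recovery.

    check_period plays the role that recovery.check.period plays in the
    steps above: the pause, in seconds, between successive checks.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not in_recovery():
            return True
        time.sleep(check_period)
    return False


if __name__ == "__main__":
    standbys = [
        Standby("10.0.0.2", parse_lsn("0/3000060"), priority=2),
        Standby("10.0.0.3", parse_lsn("0/3000060"), priority=1),
    ]
    # Both standbys are at the same location, so priority decides.
    print(choose_standby(standbys).address)
```

In a real deployment, the `in_recovery` callable would typically run `SELECT pg_is_in_recovery()` against the promoted standby. A smaller check period notices the promotion sooner, at the cost of more frequent queries.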

That's it! Note that the new master is promoted before the old one is restarted as a standby, so a new master is actually available sooner than the total switchover time suggests.
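
To make the timing point concrete, here is a small Python sketch. The step names and durations are invented for illustration and are not measurements from a real EFM cluster; the only thing taken from the text is the ordering, with promotion finishing before the old master is restarted as a standby.

```python
# Hypothetical durations in seconds; NOT measured from a real EFM cluster.
steps = [
    ("stop old master", 1.5),
    ("promote standby", 2.0),  # a new master is available after this step
    ("repoint other standbys", 1.0),
    ("restart old master as standby", 1.5),
]


def elapsed_until(steps, last_step):
    """Sum step durations up to and including last_step."""
    total = 0.0
    for name, seconds in steps:
        total += seconds
        if name == last_step:
            return total
    raise KeyError(last_step)


time_to_new_master = elapsed_until(steps, "promote standby")
total_time = sum(seconds for _, seconds in steps)
print(f"new master after {time_to_new_master:.1f}s of {total_time:.1f}s total")
```

When tuning switchover performance, the time until the promoted standby accepts writes is usually the number that matters to applications, and it is shorter than the time until the whole cluster is back in its final shape.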