Using Failover Manager¶
Failover Manager offers support for monitoring and failover of clusters with one or more Standby servers. You can add or remove nodes from the cluster as your demand for resources grows or shrinks.
If a Master node reboots, Failover Manager may detect the database is
down on the Master node and promote a Standby node to the role of
Master. If this happens, the Failover Manager agent on the (rebooted)
Master node will not get a chance to write the recovery.conf file
(for server version 11 or prior) or the standby.signal file (for server
version 12 or later); the rebooted Master node will return to the cluster as a
second Master node.
To prevent this, start the Failover Manager agent before starting the
database server. The agent will start in idle mode, and check to see if
there is already a master in the cluster. If there is a master node, the
agent will verify that a recovery.conf or standby.signal file exists, and the database will not start as a second master.
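For example, on a RHEL/CentOS 7.x or 8.x host, you might start the services in this order at boot; the PostgreSQL unit name below is illustrative and varies by distribution and server version:
systemctl start edb-efm-3.10
systemctl start postgresql-12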
Managing a Failover Manager Cluster¶
Once configured, a Failover Manager cluster requires no regular maintenance. The following sections provide information about performing the management tasks that may occasionally be required by a Failover Manager cluster.
By default, some of the efm commands must be invoked by efm or an OS superuser; an administrator can selectively permit users to invoke these commands by adding the user to the efm group. The commands that require this access include efm allow-node, efm disallow-node, efm promote, efm set-priority, and efm stop-cluster.
Starting the Failover Manager Cluster¶
You can start the nodes of a Failover Manager cluster in any order.
To start the Failover Manager cluster on RHEL 6.x or CentOS 6.x, assume superuser privileges, and invoke the command:
service edb-efm-3.10 start
To start the Failover Manager cluster on RHEL/CentOS 7.x or RHEL/CentOS 8.x, assume superuser privileges, and invoke the command:
systemctl start edb-efm-3.10
If the cluster properties file for the node specifies that is.witness is set to true, the node will start as a Witness node.
If the node is not a dedicated Witness node, Failover Manager will
connect to the local database and invoke the pg_is_in_recovery() function. If the server responds
false, the agent assumes the node is a
Master node, and assigns a virtual IP address to the node (if
applicable). If the server responds
true, the Failover Manager agent
assumes that the node is a Standby server. If the server does not
respond, the agent will start in an idle state.
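You can run the same check manually to see how an agent will classify a node; the connection options here are illustrative:
psql -h localhost -d postgres -c "SELECT pg_is_in_recovery();"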
After joining the cluster, the Failover Manager agent checks the supplied database credentials to ensure that it can connect to all of the databases within the cluster. If the agent cannot connect, the agent will shut down.
If a new master or standby node joins a cluster, all of the existing nodes will also confirm that they can connect to the database on the new node.
Adding Nodes to a Cluster¶
You can add a node to a Failover Manager cluster at any time. When you add a node to a cluster, you must modify the cluster to allow the new node, and then tell the new node how to find the cluster. The following steps detail adding a node to a cluster:
Unless auto.allow.hosts is set to true, use the efm allow-node command to add the IP address of the new node to the Failover Manager allowed node host list. When invoking the command, specify the cluster name and the IP address of the new node:
efm allow-node <cluster_name> <ip_address>
For more information about using the efm allow-node command or controlling a Failover Manager service, see Using the EFM Utility.
Install a Failover Manager agent and configure the cluster properties file on the new node. For more information about modifying the properties file, see The Cluster Properties File.
Configure the cluster members file on the new node, adding an entry for the Membership Coordinator. For more information about modifying the cluster members file, see The Cluster Members File.
Assume superuser privileges on the new node, and start the Failover Manager agent.
When the new node joins the cluster, Failover Manager will send a notification to the administrator email provided in the user.email property, and/or will invoke the specified notification script.
Please note: To be a useful Standby for the current node, the node must be a standby in the PostgreSQL Streaming Replication scenario.
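For example, to add a node at 10.0.1.12 (an illustrative address) to a cluster named acctg, invoke the following on an existing node:
efm allow-node acctg 10.0.1.12
Then, after configuring the properties and members files on the new node, start its agent:
systemctl start edb-efm-3.10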
Changing the Priority of a Standby¶
If your Failover Manager cluster includes more than one Standby server,
you can use the
efm set-priority command to influence the promotion
priority of a Standby node. Invoke the command on any existing member of
the Failover Manager cluster, and specify a priority value after the IP
address of the member.
For example, the following command instructs Failover Manager that the acctg cluster member that is monitoring 10.0.1.9 is the primary Standby (priority value 1):
efm set-priority acctg 10.0.1.9 1
You can set the priority of a standby to 0 to make the standby non-promotable. Setting the priority of a standby to a value greater than 0 overrides a property value of promotable=false.
For example, if the properties file on node 10.0.1.10 includes a setting of promotable=false and you use efm set-priority to set the promotion priority of 10.0.1.10 to be the standby used in the event of a failover, the value designated by the efm set-priority command will override the value in the property file:
efm set-priority acctg 10.0.1.10 1
In the event of a failover, Failover Manager will first retrieve
information from Postgres streaming replication to confirm which Standby
node has the most recent data, and promote the node with the least
chance of data loss. If two Standby nodes contain equally up-to-date
data, the node with a higher user-specified priority value will be
promoted to Master unless use.replay.tiebreaker is set to false. To check the priority value of your Standby nodes, use the command:
efm cluster-status <cluster_name>
Please note: The promotion priority may change if a node becomes isolated from the cluster, and later re-joins the cluster.
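For example, to make the standby monitored at 10.0.1.11 (an illustrative address) non-promotable in the acctg cluster:
efm set-priority acctg 10.0.1.11 0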
Promoting a Failover Manager Node¶
You can invoke
efm promote on any node of a Failover Manager cluster to
start a manual promotion of a Standby database to Master database.
Manual promotion should only be performed during a maintenance window for your database cluster. If you do not have an up-to-date Standby database available, you will be prompted before continuing. To start a manual promotion, assume the identity of efm or the OS superuser, and invoke the command:
efm promote <cluster_name> [-switchover] [-sourcenode <address>] [-quiet] [-noscripts]
<cluster_name> is the name of the Failover Manager cluster.
Include the -switchover option to reconfigure the original Master as a Standby. If you include the -switchover keyword, the cluster must include a master node and at least one standby, and the nodes must be in sync.
Include the -sourcenode keyword to specify the node from which the recovery settings will be copied to the master.
Include the -quiet keyword to suppress notifications during switchover.
Include the -noscripts keyword to instruct Failover Manager not to invoke fencing and post-promotion scripts.
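For example, a switchover in the acctg cluster that copies the recovery settings from the standby at 10.0.1.9 (the cluster name and address are illustrative) might look like:
efm promote acctg -switchover -sourcenode 10.0.1.9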
During a switchover, the following steps occur:
For server versions 11 and prior, the recovery.conf file is copied from an existing standby to the master node. For server version 12 and later, the primary_conninfo and restore_command parameters are copied and stored in memory.
The master database is stopped.
If you are using a VIP, the address is released from the master node.
A standby is promoted to replace the master node, and acquires the VIP.
The address of the new master node is added to the recovery.conf file or the primary_conninfo details are stored in memory.
If the application.name property is set for this node, the application_name property will be added to the recovery.conf file or the primary_conninfo information will be stored in memory.
If you are using server version 12 or later, the recovery settings that have been stored in memory are written to the postgresql.auto.conf file.
The old master is started; the agent will resume monitoring it as a standby.
During a manual promotion, the Master agent releases the virtual IP
address before creating a
recovery.conf file in the directory specified
by the db.data.dir property. The
recovery.conf file is created on
all server versions, and is used to prevent the old master database from starting
until the file is removed, preventing the node from starting as a second master
in the cluster.
The Master agent remains running, and assumes a status of Idle.
The Standby agent confirms that the virtual IP address is no longer in use before pinging a well-known address to ensure that the agent is not isolated from the network. The Standby agent runs the fencing script and promotes the Standby database to Master. The Standby agent then assigns the virtual IP address to the Standby node, and runs the post-promotion script (if applicable).
Please note that this command instructs the service to ignore the value
specified in the
auto.failover parameter in the cluster properties file.
To return a node to the role of master, place the node first in the promotion list:
efm set-priority <cluster_name> <ip_address> <priority>
Then, perform a manual promotion:
efm promote <cluster_name> -switchover
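For example, to return the node at 10.0.1.9 (an illustrative address) to the master role in the acctg cluster:
efm set-priority acctg 10.0.1.9 1
efm promote acctg -switchover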
For more information about the efm utility, see Using the EFM Utility.
Stopping a Failover Manager Agent¶
When you stop an agent, Failover Manager will remove the node’s address from the cluster members list on all of the running nodes of the cluster, but will not remove the address from the Failover Manager Allowed node host list.
To stop the Failover Manager agent on RHEL 6.x or CentOS 6.x, assume superuser privileges, and invoke the command:
service edb-efm-3.10 stop
To stop the Failover Manager agent on RHEL/CentOS 7.x or RHEL/CentOS 8.x, assume superuser privileges, and invoke the command:
systemctl stop edb-efm-3.10
Until you invoke the efm disallow-node command (removing the node's
address from the Allowed node host list), you can use the
service edb-efm-3.10 start command to restart the node at a later date
without first running the
efm allow-node command again.
Please note that stopping an agent does not signal the cluster that the
agent has failed unless the master.shutdown.as.failure
property is set to true.
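A minimal properties-file sketch of this setting (shown in isolation; all other properties are omitted):
master.shutdown.as.failure=true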
Stopping a Failover Manager Cluster¶
To stop a Failover Manager cluster, connect to any node of a Failover
Manager cluster, assume the identity of
efm or the OS superuser, and
invoke the command:
efm stop-cluster <cluster_name>
The command will cause all Failover Manager agents to exit. Terminating the Failover Manager agents completely disables all failover functionality.
Please note: When you invoke the
efm stop-cluster command, all
authorized node information is lost from the Allowed node host list.
Removing a Node from a Cluster¶
The efm disallow-node command removes the IP address of a node from the
Failover Manager Allowed Node host list. Assume the identity of efm or
the OS superuser on any existing node (that is currently part of the
running cluster), and invoke the
efm disallow-node command, specifying
the cluster name and the IP address of the node:
efm disallow-node <cluster_name> <ip_address>
The efm disallow-node command will not stop a running agent; the service
will continue to run on the node until you stop the agent.
If the agent or cluster is subsequently stopped, the node will not be allowed to
rejoin the cluster, and will be removed from the failover priority list (and
will be ineligible for promotion).
After invoking the
efm disallow-node command, you must use the efm
allow-node command to add the node to the cluster again.
Running Multiple Agents on a Single Node¶
You can monitor multiple database clusters that reside on the same host by running multiple Master or Standby agents on that Failover Manager node. You may also run multiple Witness agents on a single node. To configure Failover Manager to monitor more than one database cluster, while ensuring that Failover Manager agents from different clusters do not interfere with each other, you must:
Create a cluster properties file for each member of each cluster that defines a unique set of properties and the role of the node within the cluster.
Create a cluster members file for each member of each cluster that lists the members of the cluster.
Customize the service script (on a RHEL or CentOS 6.x system) or the unit file (on a RHEL/CentOS 7.x or RHEL/CentOS 8.x system) for each cluster to specify the names of the cluster properties and the cluster members files.
Start the services for each cluster.
The examples that follow use two database clusters (acctg and sales) running on the same node:
The acctg data resides in /opt/pgdata1; its server is monitoring port 5444 (the port numbers in this example are illustrative).
The sales data resides in /opt/pgdata2; its server is monitoring port 5445.
To run a Failover Manager agent for both of these database clusters, use the efm.properties.in template to create two properties files. Each cluster properties file must have a unique name. For this example, we create acctg.properties and sales.properties to match the acctg and sales database clusters.
Several parameters must be unique in each cluster properties file. Within each cluster properties file, the db.port parameter should specify a unique value for each cluster, while the db.database parameter may have the same value or a unique value. For example, the acctg.properties file may specify db.database=acctg, while the sales.properties file may specify db.database=sales; a minimal sketch of both files follows.
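In this sketch only the distinguishing settings are shown; the port values are illustrative, and all other properties are omitted:
acctg.properties:
db.database=acctg
db.port=5444
sales.properties:
db.database=sales
db.port=5445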
Some parameters require special attention when setting up more than one Failover Manager cluster agent on the same node. If multiple agents reside on the same node, each agent's port must be unique. Any two ports will work, but it may be easier to keep track of the configuration if the ports are not close to each other.
When creating the cluster properties file for each cluster, the db.data.dir parameter must also specify a value that is unique for each respective database cluster.
The virtual IP parameters are used when assigning a virtual IP address to a node. If your Failover Manager cluster does not use a virtual IP address, leave these parameters blank. The virtual IP address values are determined by the addresses being used and may or may not be the same for both clusters.
After creating the acctg.properties and sales.properties files, create a
service script or unit file for each cluster that points to the
respective property files; this step is platform specific. If you are
using RHEL 6.x or CentOS 6.x, see RHEL 6.x or CentOS 6.x; if you are using RHEL/CentOS 7.x or RHEL/CentOS 8.x, see RHEL/CentOS 7.x or RHEL/CentOS 8.x.
Please note: If you are using a custom service script or unit file, you must manually update the file to reflect the new service name when you upgrade Failover Manager.
RHEL 6.x or CentOS 6.x¶
If you are using RHEL 6.x or CentOS 6.x, you should copy the edb-efm-3.10 service script to a new file with a name that is unique for each cluster. For example:
# cp /etc/init.d/edb-efm-3.10 /etc/init.d/efm-acctg
# cp /etc/init.d/edb-efm-3.10 /etc/init.d/efm-sales
Then edit the CLUSTER variable in each script, modifying the cluster name from efm to acctg or sales.
After creating the service scripts, run:
# chkconfig efm-acctg on
# chkconfig efm-sales on
Then, use the new service scripts to start the agents. For example, you
can start the
acctg agent with the command:
# service efm-acctg start
RHEL/CentOS 7.x or RHEL/CentOS 8.x¶
If you are using RHEL/CentOS 7.x or RHEL/CentOS 8.x, you should copy the edb-efm-3.10 unit file to a new file with a name that is unique for each cluster. For example, if you have two clusters (named acctg and sales), the unit file names might be efm-acctg.service and efm-sales.service.
Then, edit the CLUSTER variable within each unit file, changing the specified cluster name from efm to the new cluster name. For example, for a cluster named acctg, the value would specify:
Environment=CLUSTER=acctg
You must also update the value of the PIDFile parameter to specify the new cluster name. For example:
PIDFile=/var/run/efm-3.10/acctg.pid
After copying the unit files, use the following commands to enable the services:
# systemctl enable efm-acctg.service
# systemctl enable efm-sales.service
Then, use the new unit files to start the agents. For example, you
can start the
acctg agent with the command:
# systemctl start efm-acctg
For information about customizing a unit file, please visit the systemd documentation.