Notifications

Failover Manager will send email notifications and/or invoke a notification script when a notable event that affects the cluster occurs. If you have configured Failover Manager to send email notifications, you must have an SMTP server running on port 25 on each node of the cluster. Use the following parameters to configure notification behavior for Failover Manager; a sample configuration follows the list:

user.email
script.notification
from.email
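
For example, a minimal notification configuration in the cluster properties file might look like the following. The addresses and script path shown here are illustrative placeholders, not defaults:

```
# Email address that receives notifications.
user.email=admin@example.com

# Optional "from" address used for notification email.
from.email=efm@example.com

# Optional user-supplied script invoked for each notification,
# in addition to (or instead of) email.
script.notification=/usr/local/bin/efm_notify.sh
```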

For more information about editing the configuration properties, see Specifying Cluster Properties.
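
If you use script.notification, the script is invoked each time a notification is generated. The following is a minimal sketch, assuming the script receives the notification subject and body as its two arguments (verify the exact interface against the comments in your cluster properties file); it simply appends each notification to a local file:

```bash
#!/bin/bash
# Hypothetical handler for script.notification.
# Assumption: EFM passes the message subject as $1 and the body as $2;
# confirm this against the comments in your cluster properties file.
subject="$1"
body="$2"

# Append each notification to a local audit file.
printf '%s [%s] %s\n' "$(date)" "$subject" "$body" >> /tmp/efm-notifications.log
```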

The body of the notification contains details about the event that triggered the notification, and about the current state of the cluster. For example:

```
EFM node: 10.0.1.11
Cluster name: acctg
Database name: postgres
VIP: ip_address (Active|Inactive)
Database health is not being monitored.
```

The VIP field displays the IP address and state of the virtual IP if implemented for the node.

Failover Manager assigns a severity level to each notification. The following levels are listed in order of increasing attention required:

INFO indicates an informational message about the agent and does not require any manual intervention (for example, Failover Manager has started or stopped).

WARNING indicates that an event has happened that requires the administrator to check on the system (for example, failover has occurred).

SEVERE indicates that a serious event has happened and requires the immediate attention of the administrator (for example, failover was attempted, but was unable to complete).

The severity level designates the urgency of the notification. A notification with a severity level of SEVERE requires immediate user attention, while a notification with a severity level of INFO calls your attention to operational information about your cluster that does not require user action. Notification severity levels are not related to logging levels; all notifications are sent regardless of the log level detail specified in the configuration file.

You can use the notification.level property to specify the minimum severity level that will trigger a notification.
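
For example, to suppress INFO messages and receive only WARNING and SEVERE notifications, you could set the property as shown below (assuming, as the levels above suggest, that INFO, WARNING, and SEVERE are the accepted values):

```
# Send notifications at WARNING severity and above.
notification.level=WARNING
```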

Please note: In addition to sending notices to the administrative email address, all notifications are recorded in the cluster log file (/var/log/efm-3.6/cluster_name.log).
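
For example, to watch notifications as they are logged for the acctg cluster shown earlier:

```
tail -f /var/log/efm-3.6/acctg.log
```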

The conditions listed in the table below will trigger an INFO level notification:

| Subject | Description |
| ------- | ----------- |
| Executed fencing script | Executed fencing script script_name Results: script_results |
| Executed post-promotion script | Executed post-promotion script script_name Results: script_results |
| Executed remote pre-promotion script | Executed remote pre-promotion script script_name Results: script_results |
| Executed remote post-promotion script | Executed remote post-promotion script script_name Results: script_results |
| Executed post-database failure script | Executed post-database failure script script_name Results: script_results |
| Executed master isolation script | Executed master isolation script script_name Results: script_results |
| Witness agent running on node_address for cluster cluster_name | Witness agent is running. |
| Master agent running on node_address for cluster cluster_name | Master agent is running and database health is being monitored. |
| Standby agent running on node_address for cluster cluster_name | Standby agent is running and database health is being monitored. |
| Idle agent running on node node_address for cluster cluster_name | Idle agent is running. After starting the local database, the agent can be resumed. |
| Assigning VIP to node node_address | Assigning VIP VIP_address to node node_address Results: script_results |
| Releasing VIP from node node_address | Releasing VIP VIP_address from node node_address Results: script_results |
| Starting auto resume check for cluster cluster_name | The agent on this node will check every auto.resume.period seconds to see if it can resume monitoring the failed database. The cluster should be checked during this time and the agent stopped if the database will not be started again. See the agent log for more details. |
| Executed agent resumed script | Executed agent resumed script script_name Results: script_results |

The conditions listed in the table below will trigger a WARNING level notification:

| Subject | Description |
| ------- | ----------- |
| Witness agent exited on node_address for cluster cluster_name | Witness agent has exited. |
| Master agent exited on node_address for cluster cluster_name | Database health is not being monitored. |
| Cluster cluster_name notified that master node has left | Failover is disabled for the cluster until the master agent is restarted. |
| Standby agent exited on node_address for cluster cluster_name | Database health is not being monitored. |
| Agent exited during promotion on node_address for cluster cluster_name | Database health is not being monitored. |
| Agent exited on node_address for cluster cluster_name | The agent has exited. This is generated by an agent in the Idle state. |
| Agent exited for cluster cluster_name | The agent has exited. This notification is usually generated during startup when an agent exits before startup has completed. |
| Virtual IP address assigned to non-master node | The virtual IP address appears to be assigned to a non-master node. To avoid any conflicts, Failover Manager will release the VIP. You should confirm that the VIP is assigned to your master node and manually reassign the address if it is not. |
| Virtual IP address not assigned to master node | The virtual IP address appears not to be assigned to the master node. EDB Postgres Failover Manager will attempt to reacquire the VIP. |
| No standby agent in cluster for cluster cluster_name | The standbys on cluster_name have left the cluster. |
| Standby agent failed for cluster cluster_name | A standby agent on cluster_name has left the cluster, but the coordinator has detected that the standby database is still running. |
| Standby database failed for cluster cluster_name | A standby agent has signaled that its database has failed. The other nodes also cannot reach the standby database. |
| Standby agent cannot reach database for cluster cluster_name | A standby agent has signaled database failure, but the other nodes have detected that the standby database is still running. |
| Cluster cluster_name has dropped below three nodes | At least three nodes are required for full failover protection. Please add a witness or agent node to the cluster. |
| Subset of cluster cluster_name disconnected from master | This node is no longer connected to the majority of the cluster cluster_name. Because this node is part of a subset of the cluster, failover will not be attempted. Currently visible nodes: node_address |
| Promotion has started on cluster cluster_name | The promotion of a standby has started on cluster cluster_name. |
| Witness failure for cluster cluster_name | Witness running at node_address has left the cluster. |
| Idle agent failure for cluster cluster_name | Idle agent running at node_address has left the cluster. |
| One or more nodes isolated from network for cluster cluster_name | This node appears to be isolated from the network. Other members seen in the cluster are: node_name |
| Node no longer isolated from network for cluster cluster_name | This node is no longer isolated from the network. |
| Standby agent tried to promote, but master DB is still running | The standby EFM agent tried to promote itself, but detected that the master DB is still running on node_address. This usually indicates that the master EFM agent has exited. Failover has NOT occurred. |
| Standby agent started to promote, but master has rejoined | The standby EFM agent started to promote itself, but found that a master agent has rejoined the cluster. Failover has NOT occurred. |
| Standby agent tried to promote, but could not verify master DB | The standby EFM agent tried to promote itself, but could not detect whether the master DB is still running on node_address. Failover has NOT occurred. |
| Standby agent tried to promote, but VIP appears to still be assigned | The standby EFM agent tried to promote itself, but could not because the virtual IP address (VIP_address) appears to still be assigned to another node. Promoting under these circumstances could cause data corruption. Failover has NOT occurred. |
| Standby agent tried to promote, but appears to be orphaned | The standby EFM agent tried to promote itself, but could not because the well-known server (server_address) could not be reached. This usually indicates a network issue that has separated the standby agent from the other agents. Failover has NOT occurred. |
| Failover has not occurred | An agent has detected that the master database is no longer available in cluster cluster_name, but there are no standby nodes available for failover. |
| Potential manual failover required on cluster cluster_name | A potential failover situation was detected for cluster cluster_name. Automatic failover has been disabled for this cluster, so manual intervention is required. |
| Failover has completed on cluster cluster_name | Failover has completed on cluster cluster_name. |
| Lock file for cluster cluster_name has been removed | The lock file for cluster cluster_name has been removed from: path_name on node node_address. This lock prevents multiple agents from monitoring the same cluster on the same node. Please restore this file to prevent accidentally starting another agent for the cluster. |
| recovery.conf file for cluster cluster_name has been found | A recovery.conf file for cluster cluster_name has been found at: path_name on master node node_address. This may be problematic should you attempt to restart the DB on this node. |
| recovery_target_timeline is not set to latest in recovery.conf | The recovery_target_timeline parameter is not set to latest in the recovery.conf file. The standby server will not be able to follow a timeline change that occurs when a new master is promoted. |
| trigger_file path given in recovery.conf is not writable | The path provided for the trigger_file parameter in the recovery.conf file is not writable by the db_service_owner user. Failover Manager will not be able to promote the database if needed. |
| Promotion has not occurred for cluster cluster_name | A promotion was attempted, but there is already a node being promoted: ip_address. |
| Standby not reconfigured after failover in cluster cluster_name | The auto.reconfigure property has been set to false for this node. The node has not been reconfigured to follow the new master node after a failover. |
| Could not resume replay for cluster cluster_name | Could not resume replay for the standby being promoted. Manual intervention may be required. Error: error_description This error is returned if the server encounters an error when invoking replay during the promotion of a standby. |
| Could not resume replay for standby standby_id | Could not resume replay for the standby. Manual intervention may be required. Error: error_message |
| Possible problem with database timeout values | Your remote.timeout value (value) is higher than your local.timeout value (value). If the local database takes too long to respond, the local agent could assume that the database has failed even though other agents can connect. While this will not cause a failover, it could force the local agent to stop monitoring, leaving you without failover protection. |
| No standbys available for promotion in cluster cluster_name | The current number of standby nodes in the cluster has dropped to the minimum number: number. There cannot be a failover unless another standby node is added or made promotable. |
| Custom monitor timeout for cluster cluster_name | The following custom monitoring script has timed out: script_name |
| Custom monitor "safe mode" failure for cluster cluster_name | The following custom monitor script has failed, but is being run in "safe mode": script_name. Output: script_results |
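
Two of the warnings above concern recovery.conf settings on a standby. A minimal fragment that avoids both, shown in the recovery.conf format referenced throughout this section (the trigger file path is a placeholder and must be writable by the database service owner):

```
# recovery.conf on a standby node
standby_mode = 'on'
recovery_target_timeline = 'latest'
trigger_file = '/tmp/postgresql.trigger'
```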

The conditions listed in the table below will trigger a SEVERE level notification:

| Subject | Description |
| ------- | ----------- |
| Standby database restarted but EFM cannot connect | The start or restart command for the database ran successfully, but the database is not accepting connections. EFM will keep trying to connect for up to restart.connection.timeout seconds. |
| Unable to connect to DB on node_address | The maximum connections limit has been reached. |
| Unable to connect to DB on node_address | Invalid password for db.user=user_name. |
| Unable to connect to DB on node_address | Invalid authorization specification. |
| Master cannot ping local database for cluster cluster_name | The master agent can no longer reach the local database running at node_address. Other nodes are able to access the database remotely, so the master will not release the VIP and/or create a recovery.conf file. The master agent will become idle until the resume command is run to resume monitoring the database. |
| Fencing script error | Fencing script script_name failed to execute successfully. Exit Value: exit_code Results: script_results Failover has NOT occurred. |
| Post-promotion script failed | Post-promotion script script_name failed to execute successfully. Exit Value: exit_code Results: script_results |
| Remote-post-promotion script failed | Remote-post-promotion script script_name failed to execute successfully. Exit Value: exit_code Results: script_results Node: node_address |
| Remote-pre-promotion script failed | Remote-pre-promotion script script_name failed to execute successfully. Exit Value: exit_code Results: script_results Node: node_address |
| Post-database failure script error | Post-database failure script script_name failed to execute successfully. Exit Value: exit_code Results: script_results |
| Agent resumed script error | Agent resumed script script_name failed to execute successfully. Results: script_results |
| Master isolation script failed | Master isolation script script_name failed to execute successfully. Exit Value: exit_code Results: script_results |
| Could not promote standby | The trigger file file_name could not be created on the node. Could not promote standby. Error details: message_details |
| Error creating recovery.conf file on node_address for cluster cluster_name | There was an error creating the recovery.conf file on master node node_address during promotion. Promotion has continued, but requires manual intervention to ensure that the old master node cannot be restarted. Error details: message_details |
| An unexpected error has occurred for cluster cluster_name | An unexpected error has occurred on this node. Please check the agent log for more information. Error: error_details |
| Master database being fenced off for cluster cluster_name | The master database has been isolated from the majority of the cluster. The cluster is telling the master agent at ip_address to fence off the master database to prevent two masters when the rest of the Failover Manager cluster promotes a standby. |
| Isolated master database shutdown | The isolated master database has been shut down by Failover Manager. |
| Master database being fenced off for cluster cluster_name | The master database has been isolated from the majority of the cluster. Before the master could finish detecting isolation, a standby was promoted and has rejoined this node in the cluster. This node is isolating itself to avoid more than one master database. |
| Could not assign VIP to node node_address | Failover Manager could not assign the VIP address to the node. |
| master_or_standby database failure for cluster cluster_name | The database has failed on the specified node. |
| Agent is timing out for cluster cluster_name | This agent has timed out trying to reach the local database. After the timeout, the agent could successfully ping the database and has resumed monitoring. However, the node should be checked to make sure it is performing normally to prevent a possible database or agent failure. |
| Resume timed out for cluster cluster_name | This agent could not resume monitoring after reconfiguring and restarting the local database. See the agent log for details. |
| Internal state mismatch for cluster cluster_name | The Failover Manager cluster's internal state did not match the actual state of the cluster members. This is rare and can be caused by a timing issue of nodes joining the cluster and/or changing their state. The problem should be resolved, but you should check the cluster status as well to verify. Details of the mismatch can be found in the agent log file. |
| Failover has not occurred | An agent has detected that the master database is no longer available in cluster cluster_name, but there are not enough standby nodes available for failover. |
| Database in wrong state on node_address | The standby agent has detected that the local database is no longer in recovery. The agent will now become idle. Manual intervention is required. |
| Database in wrong state on node_address | The master agent has detected that the local database is in recovery. The agent will now become idle. Manual intervention is required. |
| Database connection failure for cluster cluster_name | This node is unable to connect to the database running on: node_address. Until this is fixed, failover may not work properly because this node will not be able to check whether the database is running. |
| Standby custom monitor failure for cluster cluster_name | The following custom monitor script has failed on a standby node. The agent will stop monitoring the local database. Script location: script_name Script output: script_results |
| Master custom monitor failure for cluster cluster_name | The following custom monitor script has failed on a master node. EFM will attempt to promote a standby. Script location: script_name Script output: script_results |
| property_name set to true for master node | The property_name property has been set to true for this cluster. Stopping the master agent without stopping the entire cluster will be treated by the rest of the cluster as an immediate master agent failure. If maintenance is required on the master database, shut down the master agent and wait for a notification from the remaining nodes that failover will not happen. |
| Load balancer attach script error | Load balancer attach script script_name failed to execute successfully. Exit Value: exit_code Results: script_results |
| Load balancer detach script error | Load balancer detach script script_name failed to execute successfully. Exit Value: exit_code Results: script_results |
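
Several of the notifications above report a script's Exit Value; as the Fencing script error entry shows, a fencing script that fails to execute successfully is reported with its exit code, and failover does NOT occur. As a minimal sketch of such a script (some_stonith_command and the address are placeholders for whatever mechanism your environment uses to isolate the old master):

```bash
#!/bin/bash
# Hypothetical fencing script: isolate the old master host before promotion.
# A failure here is reported in the "Fencing script error" notification
# above, and failover does NOT occur.
OLD_MASTER=10.0.1.11   # placeholder; taken from the example notification body

# some_stonith_command is a placeholder for your actual fencing mechanism.
if ! some_stonith_command "$OLD_MASTER"; then
    echo "Unable to fence $OLD_MASTER" >&2
    exit 1   # nonzero status is reported as the Exit Value
fi
exit 0
```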