Troubleshooting v4

The Failover Manager agent fails to start

If an agent fails to start, see the startup log /var/log/efm-<version>/startup-<cluster>.log for more information.
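To pull up the most recent startup messages quickly, a minimal Python sketch like the following builds the log path from the pattern above and returns the last lines of the file. It assumes only the path layout shown here. You supply the version and cluster name for your installation.

```python
from collections import deque
from pathlib import Path
from typing import List


def startup_log_path(version: str, cluster: str) -> Path:
    """Build /var/log/efm-<version>/startup-<cluster>.log from its two parts."""
    return Path(f"/var/log/efm-{version}") / f"startup-{cluster}.log"


def tail(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a text file without reading it all into memory."""
    with path.open(encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

For example, `tail(startup_log_path("4.7", "efm"))` returns the last 20 lines for a hypothetical cluster named efm on a hypothetical version 4.7.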

Authorization file not found. Is the local agent running?

If you invoke a Failover Manager cluster management command and Failover Manager isn't running on the node, the efm command displays an error:

Authorization file not found. Is the local agent running?

Not authorized to run this command. User '<os user>' is not a member of the `efm` group.

You must have special privileges to invoke some of the efm commands documented in Using the efm utility. If these commands are invoked by a user who isn't authorized to run them, the efm command displays an error:

Not authorized to run this command. User '<os user>' is not a member of the `efm` group.
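To confirm whether a given OS user is in the `efm` group before retrying a command, you can query the local user and group databases. This Python sketch uses the Unix-only `grp` and `pwd` modules and checks both supplementary membership and the user's primary group. It mirrors the condition named in the error message. It isn't necessarily how the efm utility performs its own check.

```python
import grp
import pwd


def in_group(user: str, group: str) -> bool:
    """Return True if `user` belongs to `group` as a supplementary member
    or through the primary group recorded in the password database."""
    try:
        gr = grp.getgrnam(group)
    except KeyError:
        return False  # the group doesn't exist on this host
    if user in gr.gr_mem:
        return True  # listed as a supplementary member of the group
    try:
        # Primary-group membership isn't listed in gr_mem, so compare GIDs.
        return pwd.getpwnam(user).pw_gid == gr.gr_gid
    except KeyError:
        return False  # the user doesn't exist on this host
```

For example, `in_group("alice", "efm")` checks a hypothetical user alice. Because this reads the group database rather than the current process's credentials, a newly added membership shows up here right away, while the user's existing login sessions pick it up only after a fresh login.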

Notification: Unexpected error message

If you receive a notification about an unexpected error, check the Failover Manager log file for an OutOfMemory message. Failover Manager runs with the default memory value set by this property:

# Extra information that will be passed to the JVM when starting the agent.
jvm.options=-Xmx128m

If you're running with less than 128 megabytes allocated, increase the value and restart the Failover Manager agent.
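Checking the configured heap limit comes down to finding the -Xmx flag in the JVM options string and converting its size to megabytes. The Python sketch below assumes you pass in the value of the `jvm.options` property from your cluster properties file. It follows the standard HotSpot -Xmx size syntax: a number with an optional k, m, g, or t suffix, where no suffix means bytes.

```python
import re
from typing import Optional

# -Xmx followed by a number and an optional unit suffix (any case).
_XMX = re.compile(r"-Xmx(\d+)([kKmMgGtT]?)(?=\s|$)")
_FACTOR = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def max_heap_mb(jvm_options: str) -> Optional[float]:
    """Return the -Xmx heap limit in megabytes, or None if no -Xmx is set
    (in which case the JVM chooses its own default).

    If several -Xmx flags appear, the last one wins, as later JVM options
    override earlier ones.
    """
    matches = _XMX.findall(jvm_options)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _FACTOR[unit.lower()] / 1024**2


def needs_more_heap(jvm_options: str, minimum_mb: int = 128) -> bool:
    """True if an explicit -Xmx is below minimum_mb (128 MB by default)."""
    heap = max_heap_mb(jvm_options)
    return heap is not None and heap < minimum_mb
```

For example, `needs_more_heap("-Xmx64m")` returns True, and `max_heap_mb("-Xmx1g")` returns 1024.0.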

Confirming the OpenJDK version

Failover Manager is tested with OpenJDK, and we strongly recommend using it. To check which Java implementation is installed, run the following command:

# java -version
openjdk version "11.0.20" 2023-07-18 LTS
OpenJDK Runtime Environment (Red_Hat- (build 11.0.20+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat- (build 11.0.20+8-LTS, mixed mode, sharing)
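If you want to script this check across several nodes, a Python sketch like the one below runs `java -version` and parses the result. The JVM writes this output to standard error. The OpenJDK test is a simple text match on output like the sample above. The major-version parsing handles both the current scheme (`"11.0.20"`) and the legacy `1.x` scheme, where `"1.8.0_191"` means Java 8.

```python
import re
import subprocess
from typing import Optional, Tuple


def java_version_text(java: str = "java") -> str:
    """Run `java -version` and return its output; the JVM prints it on stderr."""
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=True
    )
    return result.stderr or result.stdout


def parse_java_version(text: str) -> Tuple[bool, Optional[int]]:
    """Return (is_openjdk, major_version) parsed from `java -version` output."""
    is_openjdk = "openjdk" in text.lower()
    match = re.search(r'version "(\d+)(?:\.(\d+))?', text)
    if match is None:
        return is_openjdk, None
    major = int(match.group(1))
    if major == 1 and match.group(2):
        # Legacy scheme: "1.8.0_191" is Java 8.
        major = int(match.group(2))
    return is_openjdk, major
```

On a node with Java on the PATH, `parse_java_version(java_version_text())` returns `(True, 11)` for output like the sample above.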

There's a temporary issue with OpenJDK version 11 on RHEL and its derivatives. When starting Failover Manager, you might see an error like the following:

java.lang.Error: /usr/lib/jvm/java-11-openjdk- (No such file or directory)

If you see this message, work around it by manually installing the missing package with `sudo dnf install tzdata-java`.

Unexpected connection attempts from outside the cluster

If an external process tries to connect to an agent on the bind.address port, Failover Manager logs a warning that contains the source of the connection attempt. These warnings don't affect the Failover Manager cluster. However, you can use the source address to identify the outside process and then stop it or reconfigure it so that it no longer tries to connect to a Failover Manager agent. The following is an example of the message that appears when a process outside the cluster attempts to connect to the agent from <source_address>:

org.jgroups.protocols.TCP warn WARN: JGRP000006: failed accepting connection from peer Socket[addr=/<source_address>,port=56046,localport=7800]: Read timed out
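To find which outside hosts are responsible, you can scan the Failover Manager log for these warnings and tally them by source address. This Python sketch matches the JGRP000006 message format shown above. It takes any iterable of log lines, such as an open log file, so it assumes nothing about where your log lives.

```python
import re
from collections import Counter
from typing import Iterable

# Peer address and local port from JGroups JGRP000006 warnings.
_REJECTED = re.compile(
    r"JGRP000006: failed accepting connection from peer "
    r"Socket\[addr=([^,\]]*),port=\d+,localport=(\d+)\]"
)


def outside_sources(lines: Iterable[str]) -> Counter:
    """Count rejected connection attempts per (source address, local port)."""
    counts: Counter = Counter()
    for line in lines:
        m = _REJECTED.search(line)
        if m:
            # Java prints the address as "/ip" or "hostname/ip"; keep the IP part.
            address = m.group(1).rsplit("/", 1)[-1]
            counts[(address, int(m.group(2)))] += 1
    return counts
```

The most frequent entries from `outside_sources(open(log_file)).most_common()` point to the processes worth stopping or reconfiguring first.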

If you're running an agent with an address that used to be part of a different cluster, the original cluster might still be trying to connect to this address to re-form the cluster. In this example, the cluster oldcluster is still trying to connect to an address that's now part of newcluster:

org.jgroups.protocols.TCP warn WARN: JGRP000012: discarded message from different cluster oldcluster (our cluster is newcluster). Sender was 93cb99c7-bf3f-4243-b582-faf25aced49e(<source_address>)

You can use the cluster name and <source_address> to find the original cluster. Running the efm reset-members command on that cluster should clear the address from its cache.
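When several old clusters might be involved, you can collect the foreign cluster names and sender addresses from the log in one pass. This Python sketch matches the JGRP000012 message format shown above. It groups sender addresses by the cluster name they claim, which tells you which cluster to run efm reset-members on.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Old cluster name, our cluster name, and sender address from JGRP000012 warnings.
_FOREIGN = re.compile(
    r"JGRP000012: discarded message from different cluster (\S+) "
    r"\(our cluster is (\S+?)\)\. Sender was \S*?\(([^)]+)\)"
)


def stale_clusters(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each foreign cluster name to the sender addresses seen for it."""
    found: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        m = _FOREIGN.search(line)
        if m:
            old_cluster, _our_cluster, sender = m.groups()
            found[old_cluster].add(sender)
    return dict(found)
```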