This article discusses the open source tool repmgr (Replication Manager) and how to set up and configure it for automatic failover in PostgreSQL.

1. Install PostgreSQL

2. Install repmgr

3. Configure PostgreSQL

4. Create users

5. Configure pg_hba.conf

6. Configure the repmgr file

7. Register the primary server

8. Build/clone the standby server

9. Register the standby server

10. Start repmgrd daemon process

 

repmgr (Replication Manager) is an open source tool used for managing the replication and failover of PostgreSQL clusters. In this post we will learn to set up and configure the cluster for automatic failover. 

Prerequisites

The following software must be installed on both master and standby servers:

  • PostgreSQL
  • repmgr (matching the installed PostgreSQL major version)
  • At the network level, connections with the PostgreSQL port (default: 5432) must be possible in both directions.

Step 1: Install PostgreSQL 

Create two clusters/servers with the PostgreSQL installation. You can follow the PostgreSQL instructions at the link below for installation using PostgreSQL’s PGDG repo package. For the sake of naming conventions, we will consider master and standby as two servers.

https://wiki.postgresql.org/wiki/YUM_Installation

yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm



yum -y install epel-release yum-utils

yum-config-manager --enable pgdg12

yum install postgresql12-server postgresql12



/usr/pgsql-12/bin/postgresql-12-setup initdb

 

Note: The above step of initialization of the cluster is not needed on the standby server.

systemctl enable --now postgresql-12

systemctl status postgresql-12

 

Step 2: Install repmgr 

You will need to install repmgr on the master as well as standby.

yum -y  install repmgr12*

 

Step 3: Configure PostgreSQL

On the primary server, a PostgreSQL instance must be initialized and running. The following replication settings may need to be adjusted:

max_wal_senders = 10

max_replication_slots = 10

wal_level = 'hot_standby' or 'replica' or 'logical'

hot_standby = on

archive_mode = on

archive_command = '/bin/true'

shared_preload_libraries = 'repmgr'

                    

Step 4: Create users 

Create a dedicated PostgreSQL superuser account and a database for the repmgr metadata:

create user repmgr;

create database repmgr with owner repmgr;

 

Step 5: Configure pg_hba.conf

Ensure the repmgr user has appropriate permissions in pg_hba.conf and can connect in replication mode; pg_hba.conf should contain entries similar to the following:

    local      replication      repmgr                                        trust

    host       replication      repmgr        127.0.0.1/32            trust

    host       replication      repmgr        192.168.1.0/24        trust



    local       repmgr           repmgr                                       trust

    host        repmgr           repmgr        127.0.0.1/32           trust

    host        repmgr           repmgr        192.168.1.0/24       trust

Note: Adjust above settings according to your network configurations.

 

Step 6: Configure the repmgr file 

Create a repmgr.conf on the master server with the following entries:

cluster='failovertest'

node_id=1

node_name=node1

conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'

data_directory='/var/lib/pgsql/12/data/'

failover=automatic

promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'

follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

Note: The location and file must be accessible for the user we are using for repmgr.

 

Step 7: Register the primary server

Register the primary server with repmgr:

-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register

WARNING: the following problems were found in the configuration file:

  parameter "cluster" is deprecated and will be ignored

INFO: connecting to primary database...

NOTICE: attempting to install extension "repmgr"

NOTICE: "repmgr" extension successfully installed

NOTICE: primary node record (ID: 1) registered

-bash-4.2$

 

Then check the status of the cluster:

-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf  cluster show

WARNING: the following problems were found in the configuration file:

  parameter "cluster" is deprecated and will be ignored

 ID | Name  | Role | Status | Upstream | Location | Priority | Timeline | Connection string                                          

----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------

 1  | node1 | primary | * running |      | default  | 100  | 1    | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2

 

 

Step 8: Build/clone the standby server

Create the repmgr.conf file on standby server:

-bash-4.2$ cat repmgr.conf

node_id=2

node_name=node2

conninfo='host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2'

data_directory='/var/lib/pgsql/12/data'

failover=automatic

promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'

follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

Note: In the above commands, for the host IP info we need to specify the IP of the standby server. In this example, 172.16.140.137 is my standby server. 

We can now perform the dry run and test if our configuration is correct:

 

-bash-4.2$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run

NOTICE: destination directory "/var/lib/pgsql/12/data" provided

INFO: connecting to source node

DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr

DETAIL: current installation size is 32 MB

INFO: "repmgr" extension is installed in database "repmgr"

INFO: parameter "max_wal_senders" set to 10

NOTICE: checking for available walsenders on the source node (2 required)

INFO: sufficient walsenders available on the source node

DETAIL: 2 required, 10 available

NOTICE: checking replication connections can be made to the source server (2 required)

INFO: required number of replication connections could be made to the source server

DETAIL: 2 replication connections required

NOTICE: standby will attach to upstream node 1

HINT: consider using the -c/--fast-checkpoint option

INFO: all prerequisites for "standby clone" are met

 

If there is no problem, start cloning:

-bash-4.2$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone

NOTICE: destination directory "/var/lib/pgsql/12/data" provided

INFO: connecting to source node

DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr

DETAIL: current installation size is 32 MB

NOTICE: checking for available walsenders on the source node (2 required)

NOTICE: checking replication connections can be made to the source server (2 required)

INFO: checking and correcting permissions on existing directory "/var/lib/pgsql/12/data"

NOTICE: starting backup (using pg_basebackup)...

HINT: this may take some time; consider using the -c/--fast-checkpoint option

INFO: executing:

  pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/12/data -h 172.16.140.135 -p 5432 -U repmgr -X stream

NOTICE: standby clone (using pg_basebackup) complete

NOTICE: you can now start your PostgreSQL server

HINT: for example: pg_ctl -D /var/lib/pgsql/12/data start

HINT: after starting the server, you need to register this standby with "repmgr standby register"

 

Step 9: Register the standby server

Register the standby server with repmgr:

-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register

INFO: connecting to local node "node2" (ID: 2)

INFO: connecting to primary database

WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)

INFO: standby registration complete

NOTICE: standby node "node2" (ID: 2) successfully registered



-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show

 ID | Name  | Role | Status | Upstream | Location | Priority | Timeline | Connection string                                          

----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------

 1  | node1 | primary | * running |      | default  | 100  | 1    | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2

 2  | node2 | standby |   running | node1 | default  | 100  | 1    | host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2

-bash-4.2$

 

 

Step 10: Start repmgrd daemon process

To enable the automatic failover, we now need to start the repmgrd daemon process on master slave and witness:

For example:

-bash-4.2$ /usr/pgsql-12/bin/repmgrd -f /var/lib/pgsql/repmgr.conf

[2020-02-23 20:44:43] [NOTICE] repmgrd (repmgrd 5.0.0) starting up

[2020-02-23 20:44:43] [INFO] connecting to database "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"

-bash-4.2$ INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid

[2020-02-23 20:44:43] [NOTICE] starting monitoring of node "node1" (ID: 1)

[2020-02-23 20:44:43] [INFO] "connection_check_type" set to "ping"

[2020-02-23 20:44:43] [NOTICE] monitoring cluster primary "node1" (ID: 1)

[2020-02-23 20:44:43] [INFO] child node "node2" (ID: 2) is attached

 

We can also check the events for the cluster:

-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster event

 Node ID | Name  | Event          | OK | Timestamp       | Details                                                                           

---------+-------+--------------------+----+---------------------+----------------------------------------------------------------------------------------

 2   | node2 | repmgrd_start  | t  | 2020-02-23 20:46:26 | monitoring connection to upstream node "node1" (ID: 1)                            

 1   | node1 | repmgrd_start  | t  | 2020-02-23 20:44:43 | monitoring cluster primary "node1" (ID: 1)                                        

 2   | node2 | standby_register   | t  | 2020-02-23 20:39:24 | standby registration succeeded; upstream node ID is 1 (-F/--force option was used)    

 1   | node1 | primary_register   | t  | 2020-02-23 20:39:11 | existing primary record updated                                                   

 2   | node2 | standby_register   | t  | 2020-02-23 20:38:25 | standby registration succeeded; upstream node ID is 1                             

 2   | node2 | standby_unregister | t  | 2020-02-23 20:37:56 |                                                                                   

 2   | node2 | standby_register   | t  | 2020-02-23 20:12:23 | standby registration succeeded; upstream node ID is 1                             

 2   | node2 | standby_clone  | t  | 2020-02-23 20:09:25 | cloned from host "172.16.140.135", port 5432; backup method: pg_basebackup; --force: N

 1   | node1 | primary_register   | t  | 2020-02-23 19:57:11 |                                                                                   

 1   | node1 | cluster_created | t  | 2020-02-23 19:57:11 |          

                                                                        

Now, if the master server fails, repmgrd will detect that the master is not reachable and then promote the next available server and perform the automatic failover.

The log messages will be as follows:

-bash-4.2$ [2020-02-23 20:51:28] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state

[2020-02-23 20:52:40] [WARNING] unable to ping "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"

[2020-02-23 20:52:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

.

.

.

[2020-02-23 20:53:30] [INFO] checking state of node 1, 6 of 6 attempts

[2020-02-23 20:53:30] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.140.135 fallback_application_name=repmgr"

[2020-02-23 20:53:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2020-02-23 20:53:30] [WARNING] unable to reconnect to node 1 after 6 attempts

[2020-02-23 20:53:30] [INFO] 0 active sibling nodes registered

[2020-02-23 20:53:30] [INFO] primary and this node have the same location ("default")

[2020-02-23 20:53:30] [INFO] no other sibling nodes - we win by default

[2020-02-23 20:53:30] [NOTICE] this node is the only available candidate and will now promote itself

[2020-02-23 20:53:30] [INFO] promote_command is:

  "/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file"

[2020-02-23 20:53:30] [NOTICE] promoting standby to primary

[2020-02-23 20:53:30] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()

[2020-02-23 20:53:30] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete

[2020-02-23 20:53:31] [NOTICE] STANDBY PROMOTE successful

[2020-02-23 20:53:31] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary

[2020-02-23 20:53:31] [INFO] 0 followers to notify

[2020-02-23 20:53:31] [INFO] switching to primary monitoring mode

[2020-02-23 20:53:31] [NOTICE] monitoring cluster primary "node2" (ID: 2)

 

 

Reference Links:

https://repmgr.org/docs/5.0/index.html

https://repmgr.org/docs/4.4/repmgrd-automatic-failover.html

https://repmgr.org/docs/4.4/repmgrd-basic-configuration.html#REPMGRD-AUTOMATIC-FAILOVER-CONFIGURATION