Continuous archiving can be used to create a high
availability (HA) cluster configuration with one or more
standby servers ready to take over operations if the
primary server fails. This capability is widely referred to as
warm standby or log shipping.
The primary and standby server work together to provide this capability,
though the servers are only loosely coupled. The primary server operates
in continuous archiving mode, while each standby server operates in
continuous recovery mode, reading the WAL files from the primary. No
changes to the database tables are required to enable this capability,
so it offers low administration overhead in comparison with some other
replication approaches. This configuration also has relatively low
performance impact on the primary server.
Directly moving WAL records from one database server to another
is typically described as log shipping. PostgreSQL
implements file-based log shipping, which means that WAL records are
transferred one file (WAL segment) at a time. WAL files (16MB) can be
shipped easily and cheaply over any distance, whether it be to an
adjacent system, another system on the same site or another system on
the far side of the globe. The bandwidth required for this technique
varies according to the transaction rate of the primary server.
Record-based log shipping is also possible with custom-developed
procedures, as discussed in Section 24.4.4.
It should be noted that the log shipping is asynchronous, i.e., the WAL
records are shipped after transaction commit. As a result there is a
window for data loss should the primary server suffer a catastrophic
failure: transactions not yet shipped will be lost. The length of the
window of data loss can be limited by use of the
archive_timeout parameter, which can be set as low
as a few seconds if required. However such low settings will
substantially increase the bandwidth requirements for file shipping.
If you need a window of less than a minute or so, it's probably better
to look into record-based log shipping.
The standby server is not available for access, since it is continually
performing recovery processing. Recovery performance is sufficiently
good that the standby will typically be only moments away from full
availability once it has been activated. As a result, we refer to this
capability as a warm standby configuration that offers high
availability. Restoring a server from an archived base backup and
rollforward will take considerably longer, so that technique only
offers a solution for disaster recovery, not high availability.
It is usually wise to create the primary and standby servers
so that they are as similar as possible, at least from the
perspective of the database server. In particular, the path names
associated with tablespaces will be passed across as-is, so both
primary and standby servers must have the same mount paths for
tablespaces if that feature is used. Keep in mind that if
is executed on the primary, any new mount point needed for it must
be created on both the primary and all standby servers before the command
is executed. Hardware need not be exactly the same, but experience shows
that maintaining two identical systems is easier than maintaining two
dissimilar ones over the lifetime of the application and system.
In any case the hardware architecture must be the same — shipping
from, say, a 32-bit to a 64-bit system will not work.
In general, log shipping between servers running different major
levels will not be possible. It is the policy of the PostgreSQL Global
Development Group not to make changes to disk formats during minor release
upgrades, so it is likely that running different minor release levels
on primary and standby servers will work successfully. However, no
formal support for that is offered and you are advised to keep primary
and standby servers at the same release level as much as possible.
When updating to a new minor release, the safest policy is to update
the standby servers first — a new minor release is more likely
to be able to read WAL files from a previous minor release than vice
There is no special mode required to enable a standby server. The
operations that occur on both primary and standby servers are entirely
normal continuous archiving and recovery tasks. The only point of
contact between the two database servers is the archive of WAL files
that both share: primary writing to the archive, standby reading from
the archive. Care must be taken to ensure that WAL archives for separate
primary servers do not become mixed together or confused. The archive
need not be large, if it is only required for the standby operation.
The magic that makes the two loosely coupled servers work together is
simply a restore_command used on the standby that,
when asked for the next WAL file, waits for it to become available from
the primary. The restore_command is specified in the
recovery.conf file on the standby server. Normal recovery
processing would request a file from the WAL archive, reporting failure
if the file was unavailable. For standby processing it is normal for
the next WAL file to be unavailable, so we must be patient and wait for
it to appear. For files ending in .backup or
.history there is no need to wait, and a non-zero return
code must be returned. A waiting restore_command can be
written as a custom script that loops after polling for the existence of
the next WAL file. There must also be some way to trigger failover, which
should interrupt the restore_command, break the loop and
return a file-not-found error to the standby server. This ends recovery
and the standby will then come up as a normal server.
Pseudocode for a suitable restore_command is:
triggered = false;
while (!NextWALFileReady() && !triggered)
sleep(100000L); /* wait for ~0.1 sec */
triggered = true;
A working example of a waiting restore_command is provided
as a contrib module named pg_standby. It
should be used as a reference on how to correctly implement the logic
described above. It can also be extended as needed to support specific
configurations or environments.
PostgreSQL does not provide the system
software required to identify a failure on the primary and notify
the standby system and then the standby database server. Many such
tools exist and are well integrated with other aspects required for
successful failover, such as IP address migration.
The means for triggering failover is an important part of planning and
design. The restore_command is executed in full once
for each WAL file. The process running the restore_command
is therefore created and dies for each file, so there is no daemon
or server process and so we cannot use signals and a signal
handler. A more permanent notification is required to trigger the
failover. It is possible to use a simple timeout facility,
especially if used in conjunction with a known
archive_timeout setting on the primary. This is
somewhat error prone since a network problem or busy primary server might
be sufficient to initiate failover. A notification mechanism such
as the explicit creation of a trigger file is less error prone, if
this can be arranged.
The size of the WAL archive can be minimized by using the %r
option of the restore_command. This option specifies the
last archive file name that needs to be kept to allow the recovery to
restart correctly. This can be used to truncate the archive once
files are no longer required, if the archive is writable from the
The short procedure for configuring a standby server is as follows. For
full details of each step, refer to previous sections as noted.
Set up primary and standby systems as near identically as
possible, including two identical copies of
PostgreSQL at the same release level.
Set up continuous archiving from the primary to a WAL archive located
in a directory on the standby server. Ensure that
are set appropriately on the primary
(see Section 24.3.1).
Make a base backup of the primary server (see Section 24.3.2), and load this data onto the standby.
Begin recovery on the standby server from the local WAL
archive, using a recovery.conf that specifies a
restore_command that waits as described
previously (see Section 24.3.3).
Recovery treats the WAL archive as read-only, so once a WAL file has
been copied to the standby system it can be copied to tape at the same
time as it is being read by the standby database server.
Thus, running a standby server for high availability can be performed at
the same time as files are stored for longer term disaster recovery
For testing purposes, it is possible to run both primary and standby
servers on the same system. This does not provide any worthwhile
improvement in server robustness, nor would it be described as HA.
If the primary server fails then the standby server should begin
If the standby server fails then no failover need take place. If the
standby server can be restarted, even some time later, then the recovery
process can also be immediately restarted, taking advantage of
restartable recovery. If the standby server cannot be restarted, then a
full new standby server instance should be created.
If the primary server fails and then immediately restarts, you must have
a mechanism for informing it that it is no longer the primary. This is
sometimes known as STONITH (Shoot the Other Node In The Head), which is
necessary to avoid situations where both systems think they are the
primary, which will lead to confusion and ultimately data loss.
Many failover systems use just two systems, the primary and the standby,
connected by some kind of heartbeat mechanism to continually verify the
connectivity between the two and the viability of the primary. It is
also possible to use a third system (called a witness server) to prevent
some cases of inappropriate failover, but the additional complexity
might not be worthwhile unless it is set up with sufficient care and
Once failover to the standby occurs, we have only a
single server in operation. This is known as a degenerate state.
The former standby is now the primary, but the former primary is down
and might stay down. To return to normal operation we must
fully recreate a standby server,
either on the former primary system when it comes up, or on a third,
possibly new, system. Once complete the primary and standby can be
considered to have switched roles. Some people choose to use a third
server to provide backup to the new primary until the new standby
server is recreated,
though clearly this complicates the system configuration and
So, switching from primary to standby server can be fast but requires
some time to re-prepare the failover cluster. Regular switching from
primary to standby is useful, since it allows regular downtime on
each system for maintenance. This also serves as a test of the
failover mechanism to ensure that it will really work when you need it.
Written administration procedures are advised.
PostgreSQL directly supports file-based
log shipping as described above. It is also possible to implement
record-based log shipping, though this requires custom development.
An external program can call the
function (see Section 9.24)
to find out the file name and the exact byte offset within it of
the current end of WAL. It can then access the WAL file directly
and copy the data from the last known end of WAL through the current end
over to the standby server(s). With this approach, the window for data
loss is the polling cycle time of the copying program, which can be very
small, but there is no wasted bandwidth from forcing partially-used
segment files to be archived. Note that the standby servers'
restore_command scripts still deal in whole WAL files,
so the incrementally copied data is not ordinarily made available to
the standby servers. It is of use only when the primary dies —
then the last partial WAL file is fed to the standby before allowing
it to come up. So correct implementation of this process requires
cooperation of the restore_command script with the data