Continuous archiving can be used to create a high
availability (HA) cluster configuration with one or more
standby servers ready to take
over operations if the primary server fails. This
capability is widely referred to as warm standby
or log shipping.
The primary and standby server work together to provide this capability,
though the servers are only loosely coupled. The primary server operates
in continuous archiving mode, while each standby server operates in
continuous recovery mode, reading the WAL files from the primary. No
changes to the database tables are required to enable this capability,
so it offers low administration overhead in comparison with some other
replication approaches. This configuration also has relatively low
performance impact on the primary server.
Directly moving WAL or "log" records from one database server to another
is typically described as log shipping. EnterpriseDB
implements file-based log shipping, which means that WAL records are
transferred one file (WAL segment) at a time. WAL
files can be shipped easily and cheaply over any distance, whether it be
to an adjacent system, another system on the same site or another system
on the far side of the globe. The bandwidth required for this technique
varies according to the transaction rate of the primary server.
Record-based log shipping is also possible with custom-developed
procedures, as discussed in Section 36.4.4.
It should be noted that the log shipping is asynchronous, i.e. the
WAL records are shipped after transaction commit. As a result there
is a window for data loss should the primary server
suffer a catastrophic failure: transactions not yet shipped will be lost.
The length of the window of data loss
can be limited by use of the archive_timeout parameter,
which can be set as low as a few seconds if required. However such low
settings will substantially increase the bandwidth requirements for file
shipping. If you need a window of less than a minute or so, it's probably
better to look into record-based log shipping.
The standby server is not available for access, since it is continually
performing recovery processing. Recovery performance is sufficiently
good that the standby will typically be only moments away from full
availability once it has been activated. As a result, we refer to this
capability as a warm standby configuration that offers high
availability. Restoring a server from an archived base backup and
rollforward will take considerably longer, so that technique only
really offers a solution for disaster recovery, not HA.
It is usually wise to create the primary and standby servers
so that they are as similar as possible, at least from the
perspective of the database server. In particular, the path names
associated with tablespaces will be passed across as-is, so both
primary and standby servers must have the same mount paths for
tablespaces if that feature is used. Keep in mind that if
CREATE TABLESPACE
is executed on the primary, any new mount point needed for it must
be created on both the primary and all standby servers before the command
is executed. Hardware need not be exactly the same, but experience shows
that maintaining two identical systems is easier than maintaining two
dissimilar ones over the lifetime of the application and system.
In any case the hardware architecture must be the same — shipping
from, say, a 32-bit to a 64-bit system will not work.
In general, log shipping between servers running different major release
levels will not be possible. It is the policy of the PostgreSQL Global
Development Group not to make changes to disk formats during minor release
upgrades, so it is likely that running different minor release levels
on primary and standby servers will work successfully. However, no
formal support for that is offered and you are advised to keep primary
and standby servers at the same release level as much as possible.
When updating to a new minor release, the safest policy is to update
the standby servers first — a new minor release is more likely
to be able to read WAL files from a previous minor release than vice
versa.
There is no special mode required to enable a standby server. The
operations that occur on both primary and standby servers are entirely
normal continuous archiving and recovery tasks. The only point of
contact between the two database servers is the archive of WAL files
that both share: primary writing to the archive, standby reading from
the archive. Care must be taken to ensure that WAL archives for separate
primary servers do not become mixed together or confused.
The magic that makes the two loosely coupled servers work together
is simply a restore_command used on the standby that waits for
the next WAL file to become available from the primary. The
restore_command is specified in the recovery.conf
file on the standby
server. Normal recovery processing would request a file from the
WAL archive, reporting failure if the file was unavailable. For
standby processing it is normal for the next file to be
unavailable, so we must be patient and wait for it to appear. A
waiting restore_command can be written as a custom
script that loops after polling for the existence of the next WAL
file. There must also be some way to trigger failover, which
should interrupt the restore_command, break the loop
and return a file-not-found error to the standby server. This
ends recovery and the standby will then come up as a normal
server.
Pseudocode for a suitable restore_command is:
triggered = false;
while (!NextWALFileReady() && !triggered)
{
sleep(100000L); /* wait for ~0.1 sec */
if (CheckForExternalTrigger())
triggered = true;
}
if (!triggered)
CopyWALFileForRecovery();
EnterpriseDB does not provide the system
software required to identify a failure on the primary and notify
the standby system and then the standby database server. Many such
tools exist and are well integrated with other aspects required for
successful failover, such as IP address migration.
The means for triggering failover is an important part of planning and
design. The restore_command is executed in full once
for each WAL file. The process running the restore_command
is therefore created and dies for each file, so there is no daemon
or server process and so we cannot use signals and a signal
handler. A more permanent notification is required to trigger the
failover. It is possible to use a simple timeout facility,
especially if used in conjunction with a known
archive_timeout setting on the primary. This is
somewhat error prone since a network problem or busy primary server might
be sufficient to initiate failover. A notification mechanism such
as the explicit creation of a trigger file is less error prone, if
this can be arranged.
The short procedure for configuring a standby server is as follows. For
full details of each step, refer to previous sections as noted.
Set up primary and standby systems as near identically as
possible, including two identical copies of
EnterpriseDB at the same release level.
Set up continuous archiving from the primary to a WAL archive located
in a directory on the standby server. Ensure that archive_command
is set appropriately on the primary
(see Section 36.3.1).
Make a base backup of the primary server (see Section 36.3.2), and load this data onto the standby.
Begin recovery on the standby server from the local WAL
archive, using a recovery.conf that specifies a
restore_command that waits as described
previously (see Section 36.3.3).
Recovery treats the WAL archive as read-only, so once a WAL file has
been copied to the standby system it can be copied to tape at the same
time as it is being read by the standby database server.
Thus, running a standby server for high availability can be performed at
the same time as files are stored for longer term disaster recovery
purposes.
For testing purposes, it is possible to run both primary and standby
servers on the same system. This does not provide any worthwhile
improvement in server robustness, nor would it be described as HA.
If the primary server fails then the standby server should begin
failover procedures.
If the standby server fails then no failover need take place. If the
standby server can be restarted, even some time later, then the recovery
process can also be immediately restarted, taking advantage of
restartable recovery. If the standby server cannot be restarted, then a
full new standby server should be created.
If the primary server fails and then immediately restarts, you must have
a mechanism for informing it that it is no longer the primary. This is
sometimes known as STONITH (Shoot the Other Node In The Head), which is
necessary to avoid situations where both systems think they are the
primary, which can lead to confusion and ultimately data loss.
Many failover systems use just two systems, the primary and the standby,
connected by some kind of heartbeat mechanism to continually verify the
connectivity between the two and the viability of the primary. It is
also possible to use a third system (called a witness server) to avoid
some problems of inappropriate failover, but the additional complexity
may not be worthwhile unless it is set-up with sufficient care and
rigorous testing.
Once failover to the standby occurs, we have only a
single server in operation. This is known as a degenerate state.
The former standby is now the primary, but the former primary is down
and may stay down. To return to normal operation we must
fully recreate a standby server,
either on the former primary system when it comes up, or on a third,
possibly new, system. Once complete the primary and standby can be
considered to have switched roles. Some people choose to use a third
server to provide backup to the new primary until the new standby
server is recreated,
though clearly this complicates the system configuration and
operational processes.
So, switching from primary to standby server can be fast but requires
some time to re-prepare the failover cluster. Regular switching from
primary to standby is encouraged, since it allows regular downtime on
each system for maintenance. This also acts as a test of the
failover mechanism to ensure that it will really work when you need it.
Written administration procedures are advised.
EnterpriseDB directly supports file-based
log shipping as described above. It is also possible to implement
record-based log shipping, though this requires custom development.
An external program can call the pg_xlogfile_name_offset()
function (see Section 8.17)
to find out the file name and the exact byte offset within it of
the current end of WAL. It can then access the WAL file directly
and copy the data from the last known end of WAL through the current end
over to the standby server(s). With this approach, the window for data
loss is the polling cycle time of the copying program, which can be very
small, but there is no wasted bandwidth from forcing partially-used
segment files to be archived. Note that the standby servers'
restore_command scripts still deal in whole WAL files,
so the incrementally copied data is not ordinarily made available to
the standby servers. It is of use only when the primary dies —
then the last partial WAL file is fed to the standby before allowing
it to come up. So correct implementation of this process requires
cooperation of the restore_command script with the data
copying program.
In a warm standby configuration, it is possible to offload the expense of
taking periodic base backups from the primary server; instead base backups
can be made by backing
up a standby server's files. This concept is generally known as
incrementally updated backups, log change accumulation or more simply,
change accumulation.
If we take a backup of the standby server's files while it is following
logs shipped from the primary, we will be able to reload that data and
restart the standby's recovery process from the last restart point.
We no longer need to keep WAL files from before the restart point.
If we need to recover, it will be faster to recover from the incrementally
updated backup than from the original base backup.
Since the standby server is not "live", it is not possible to
use pg_start_backup() and pg_stop_backup()
to manage the backup process; it will be up to you to determine how
far back you need to keep WAL segment files to have a recoverable
backup. You can do this by running pg_controldata
on the standby server to inspect the control file and determine the
current checkpoint WAL location.