Health Status

Suggest edits

The Health Status dashboard provides real-time insight into the topology and health of Postgres high-availability clusters. It supports both primary/standby and distributed high-availability clusters.

The Health Status dashboard displays:

A set of cluster-wide health indicators that helps to draw your attention immediately to the critical issues.
A schematic view of all the nodes organized in a cluster distributed across regions displaying the health and roles of each node.
A replication status view displaying the status of each replication slot and the associated replication lags.

Viewing Health Status dashboard

To view the Health Status dashboard from the BigAnimal portal:

In the left navigation of the BigAnimal portal, go to Clusters.
Select any ready high-availability or PGD cluster.
On the cluster detail page, select the Health Status tab.

The Health Status tab displays the dashboard with health status categorized in sections:

Global cluster health

The global cluster health section displays the cluster-wide view, including the following metrics:

Raft Status (PGD only) indicates whether the Raft consensus algorithm is running correctly in the cluster. It verifies that one node is elected as the global leader and the Raft roles such as RAFT_LEADER and RAFT_FOLLOWER are defined correctly across all the nodes.
Replication Slot Status (PGD only) indicates whether all the replication slots are in streaming state.
Clock Skew indicates whether the node's clock is in sync and doesn't exceed a threshold of 60 seconds.
Proxy Status (PGD only) provides the number of PGD proxies up and running compared to the available proxies.
Node Status provides the number of nodes that are up and running.
Transaction rate provides the total number of transactions, including committed and rolled back transactions per second in the cluster.

Regional nodes health and roles

The regional nodes health and roles section displays fine-grained health status at the regional and node level. It's structured as an accordion, with each element representing a group of nodes deployed in the same region. Each item displays basic information including:

Proxy Status (PGD only) indicates the number of active proxies compared to the available proxies in the specified regions.
Node Status indicates the number of nodes up and running compared to the available nodes in the specified region in a text chart. It provides the status of all nodes in the specified region using a Boolean indicator (green (OK)/red (KO)) in a text chart.

Expanding each item provides a list of nodes with information like:

Total number of active connections compared to the maximum number of configured connections for each node.
A Node Ko tag for each node if it's down.
Memory usage percentage on a progress bar.
Storage usage percentage on a progress bar.

For PGD, it provides the tags below the node name:

Raft Leader indicates that the node is a Raft leader locally in the region.
Raft Follower indicates that the node is a Raft follower locally in the region.
Global Raft Leader indicates that the nodes is a Raft leader globally in the cluster.
Global Raft Follower indicates that the node is a Raft follower globally in the cluster.
Witness indicates that the node is a witness in the PGD cluster. See Witness nodes for more information.

For high availability, it provides the tags below the node name:

Primary indicates if the node role is primary.

Replication status

The replication Status section has a matrix displaying the replication lag across all the nodes of the cluster. The matrix provides different types of replication lags for Write, Replay, Flush, and Sent. It provides the lag in both bytes and time for Write, Replay, and Flush. It provides the lag only in bytes for Sent.

Note

In high-availability clusters, replication occurs only from the primary (source) to the replicas (target). So the matrix displays only one row for the source and multiple columns for the targets.
In PGD clusters, the replications are bidirectional, so the matrix displays an equal number of rows and columns. Every node is both a source and a target.

Note

The data on the Health Status dashboard is dynamic and is updated continuously. However, the cluster architecture is based on a snapshot. If a new node is added or an existing node is removed, you must reload the Health Status dashboard by selecting the tab again or reloading the browser page.

← Prev

Viewing metrics and logs from AWS

↑ Up

Monitoring and logging

Monitoring using BigAnimal Observability

Health Status

Viewing Health Status dashboard

Global cluster health

Regional nodes health and roles

Replication status

Note

Note

← Prev

↑ Up

Next →