Using WarehousePG Observability

Suggest edits

WarehousePG (WHPG) Observability uses Grafana to visualize your WHPG cluster. EDB maintains a GitHub repository with seven pre-defined dashboards in JSON format, along with instructions for easy setup.

In Grafana, a dashboard is the main canvas for data. Dashboards are composed by rows, which are logical dividers that group related visualizations, called panels. Panels are the individual charts or widgets displaying metrics.

The following is a description of the different dashboards provided for WHPG Observability.

WHPG Cluster Dashboard

This dashboard provides a comprehensive operational overview of your WHPG cluster, covering cluster status, host resource utilization, query activity, and database maintenance metrics.

Summary

This top row provides immediate, high-level cluster status checks and key version and connection details for the cluster.

DB Connection Status: Connection status of the database.
Total Connections: The total number active and idle connections to the database.
WarehousePG Version: The version of your WHPG cluster.
Database Uptime: How long the database has been running since the last start time.
Observability Version: The version of WHPG Observability.

Cluster details

This row focuses on the health of the WHPG cluster's segments, coordinator, and the host load.

Coordinator Status: The status for the coordinator, whether there is a standby coordinator, and if the cluster is working in failed over state.
System Load (15 min avg): The average system load over the last 15 minutes for each host instance.
CPU Utilization per Host (15 min Avg): The percentage of CPU currently in use over the last 15 minutes for each host instance.
Primary Segment Status: Segment count for primary segments, including Total, Up, Down, Promoted Mirror, and Primary Not In Sync.
Mirror Segment Status: Segment counts for mirror segments, including Total, Up, Down, Promoted Mirror, and Mirror Not In Sync.
Queries Overview (last 24 hours): Historical trend of active queries and idle queries over the past 24 hours.
Log Severity Count: Number of critical log messages (FATAL, PANIC, ERROR) over time.

Query Status

This row tracks database workload, transaction rate, and query performance.

Query Status Summary: Query counts, including Total, Active, Idle, Long Running (>120s), Queries in Wait, and Blocked Queries.
Query Per Second (5 min rate): Rate of total queries executed (commits + rollbacks) per second over the last 5 minutes.
Queries per user: Number of queries executed over time, segmented and labeled by user.

Disk Usage

This row provides a deep dive into filesystem and disk space utilization across the cluster hosts.

Free Disk Space (Host View) For Mount Point: Free disk space for specific mount points, including the size, host, and filesystem type.
Free Disk Space (Device View) for Host: Free disk space for all devices on the selected host.

WHPG Cluster Details Dashboard

This dashboard focuses on the structural and configuration details of your WHPG cluster, including object sizing, segment configurations, and key database settings.

Summary

This top row provides immediate, high-level cluster status checks and key version and connection details for the cluster.

DB Connection Status: Connection status of the database.
Total Connections: The total number active and idle connections to the database.
WarehousePG Version: The version of your WHPG cluster.
Database Uptime: How long the database has been running since the last start time.
Observability Version: The version of WHPG Observability.

Size Details

This row provides a visual breakdown of the data contained within the database instances, often used for data loading and capacity planning.

Database Size: Displays the size of each database within the cluster.
Object Details: Distribution of objects across schemas, providing insight into the structural complexity of the databases.

Configurations

This row provides critical live configuration and health details, allowing administrators to inspect the structure and security rules of the running cluster.

WHPG Segment Configuration (Live): Hostname, segment ID, role, preferred role, synchronization mode, and status for all segments.
pg_hba: Contents of the access rules defined in the pg_hba.conf file, showing the connection type, databases, users, and authentication method.
pg_settings: Current running values for PostgreSQL and WHPG database settings.

WHPG Log Details Dashboard

This dashboard is dedicated to analyzing the log files collected from your WHPG cluster. It allows you to filter, quantify, and visualize log activity by host, segment, and severity level.

Summary

This top row provides immediate, high-level cluster status checks and key version and connection details for the cluster.

DB Connection Status: Connection status of the database.
Total Connections: The total number active and idle connections to the database.
WarehousePG Version: The version of your WHPG cluster.
Database Uptime: How long the database has been running since the last start time.
Observability Version: The version of WHPG Observability.

Overview

This row provides high-level summaries of log volume across the selected time range, broken down by filters.

Cumulative Log Severity Count: Historical count of critical log messages (FATAL, PANIC, and ERROR) over the selected time range, showing overall cluster stability trends.
PANIC Logs per Host: Trend of PANIC level logs generated per host over a 5 minute range.
FATAL Logs per Host: Trend of FATAL level logs generated per host over a 5 minute range.
ERROR Logs per Host: Trend of ERROR level logs generated per host over a 5 minute range.

Host View

This row provides a dedicated log console and activity trend analysis, organized and filtered by the physical hostnames.

PANIC, FATAL, ERROR Logs for Host (): Shows all individual log lines retrieved from the selected hosts that match the critical severity levels (PANIC, FATAL, ERROR), presented as a log browser.
Log Volume (Rolling 5-Minute Window) per Host (): Trend of log entries generated per host within a 5 minute window, broken down by the severity levels (FATAL, ERROR, PANIC). This helps detect sudden spikes in logging activity.

WHPG Query Details Dashboard

This dashboard is designed for in-depth analysis and troubleshooting of the active workload, focusing on identifying slow, blocked, or resource-intensive queries.

Summary

This top row provides immediate, high-level cluster status checks and key version and connection details for the cluster.

DB Connection Status: Connection status of the database.
Total Connections: The total number active and idle connections to the database.
WarehousePG Version: The version of your WHPG cluster.
Database Uptime: How long the database has been running since the last start time.
Observability Version: The version of WHPG Observability.

Query Status

This row provides a summary of the cluster's current query state and immediate workload statistics.

Query Status Summary: Real-time status check of total queries, active queries, idle queries, long running queries (>120s), queries in wait, and blocked queries.
Connections Per User: Breakdown of active and idle connections, grouped by user.
Queries per user: Historical number of queries executed over time, segmented and labeled by user.
Queries Overview: Historical trend of active queries and idle queries over the selected time range, showing overall workload behavior.
Query Per Second (5 min rate): The rate of total transactions executed per second over the last 5 minutes.

Query Details

This section provides critical table views of current database activity.

Current Running Queries: Active and idle queries currently running on the cluster.
Long Running Queries: Queries that have been running for over 120 seconds.
Blocked Queries: Queries that are currently waiting for a lock to be released.

WHPG Recommendations Dashboard

This dashboard is designed to provide actionable insights for database maintenance and performance tuning.

Summary

This top row provides immediate, high-level cluster status checks and key version and connection details for the cluster.

DB Connection Status: Connection status of the database.
Total Connections: The total number active and idle connections to the database.
WarehousePG Version: The version of your WHPG cluster.
Database Uptime: How long the database has been running since the last start time.
Observability Version: The version of WHPG Observability.

Recommendations Details

This row lists the top tables or queries that require immediate attention related to bloat, missing statistics, or inefficient indexing.

Top 20 Bloated Tables: Tables with the highest ratio of dead rows to total rows. This indicates severe table bloat that requires vacuuming to reclaim space and improve performance.
Top 20 Tables With the Most Dead Rows: Tables containing the highest raw count of dead rows. These dead rows accumulate from updates and deletes and contribute to table bloat.
Top 20 Tables Not Vacuumed: Tables that have never been vacuumed or auto-vacuumed and still contain dead rows. These tables are at high risk of severe bloat and should be prioritized for manual VACUUM operations.
Top 20 Tables Missing Statistics: Tables that are missing ANALYZE statistics, based on the number of columns without collected statistics. Tables lacking statistics can cause the query optimizer to choose suboptimal execution plans, leading to slower query performance.
Top 20 Largest Tables by Disk Usage: Tables consuming the most disk space in the database. This helps identify the biggest contributors to storage usage for capacity planning.
Tables Likely Needing Indexes: Tables where sequential scans are performed more significantly than index scans. This suggests missing or inefficient indexes and points to opportunities for query performance improvement.

WHPG Resource Group Details Dashboard

This dashboard is focused on monitoring and analyzing the performance of WHPG's Resource Groups.

Memory Utilization: Historical trend of memory utilization over time, broken down by hostname, for the selected resource group.
CPU Utilization Historical trend of CPU utilization for the hosts belonging to the resource group, alongside the configured Max CPU Limit for that group.
Concurrency Status: Historical trends of the resource group's query limits, including the configured Concurrency Limit, the number of Running queries, and the number of Queued queries.
Resource Group Configurations: Resource group's static configuration parameters, such as groupname, concurrency, and CPU/Memory limits.

WHPG Segment Details Dashboard

This dashboard provides an in-depth view of the health, status, resource utilization, and spill file activity at the individual segment level.

Summary

This top row provides immediate, high-level cluster status checks and key version and connection details for the cluster.

DB Connection Status: Connection status of the database.
Total Connections: The total number active and idle connections to the database.
WarehousePG Version: The version of your WHPG cluster.
Database Uptime: How long the database has been running since the last start time.
Observability Version: The version of WHPG Observability.

Segment Details

This row focuses on the core status and configuration health of the segments, showing a current count and a historical view of segment state.

Segment Status: Displays the values of Total Primary Segments, Primary Segments Up/Down, Preferred Primary UP, Primary Not In Sync, Total Mirror Segments, Mirror Segments Up/Down, Promoted Mirror, and Mirror Not In Sync.
Segment Status Timeline: Historical visualization of the status of each segment over time, allowing for quick identification of recent state changes.
Database Size: Size of each database within the cluster.
Segment Status: Current status of each individual segment in a detailed view, showing the segment description and its role and status.

System Load Details

This row provides granular time-series data on the load averages for all host instances in the cluster.

System Load 1: 1-minute oad average for each host instance over time.
System Load 5: 5-minute load average for each host instance over time.
System Load 15: 15-minute load average for each host instance over time.

CPU Details

This row provides detailed time-series analysis of CPU utilization across the hosts.

CPU Utilization for every Host (Current): Percentage of CPU actively used on each host, calculated as a 5-minute average.
CPU Utilization per Host (15-min Avg): Historical trend of the percentage of CPU currently in use for each host instance, averaged over 15 minutes.

Disk info

This row focuses on disk I/O, error rates, and available disk space across the cluster hosts.

Disk Write Utilization (%): Percentage of time the disk spent doing write operations, averaged over the last 5 minutes.
Disk Read Utilization (%): Percentage of time the disk spent doing read operations, averaged over the last 5 minutes.
File System Device Error: Historical trend of the total number of filesystem or device errors across all monitored disks for each host.
Current Disk I/O for Each Host: Instantaneous rate of disk I/O for each host.
Free Disk Space (Host View) For Mount Point: Free disk space for specific mount points.
Free Disk Space (Device View) for Host: Free disk space for all devices on the host.

Memory Usage

This row provides a dedicated view of how memory is being utilized by the host operating systems.

Memory Utilization: Percentage of memory (RAM) currently in use across all hosts.
Swap Memory Utilization: Percentage of swap memory currently in use across all hosts. This helps monitor if the system is running low on available physical RAM and relying on slower swap space.

Spill Details

This row provides critical metrics for diagnosing queries that are utilizing excessive temporary disk space (spill files).

Spill File Usage per Segment: Total number of workfile spill files and the total spill size (in bytes) for each segment in the cluster.
Top Queries by Spill Usage: Top 20 queries currently generating the most workfile spills, including spill size, number of files created, and the query text.

Could this page be better? Report a problem or suggest an addition!