Observing system events

Verify cluster health, maintain operational awareness, and respond to system events in real time through the following core actions:

  • Verifying the cluster health: Maintain a real-time view of the cluster’s topology and health. Ensure that coordinator, standby, and segment nodes are online and correctly configured.

  • Visualizing hardware performance: Track the physical health of your infrastructure. Identify OS-level bottlenecks, such as CPU spikes, memory exhaustion, or network latency across specific hosts.

  • Validating database responsiveness: Ensure the database engine is actively processing requests. Review automated canary checks, which are synthetic SQL probes that verify connectivity and execution speed.

  • Auditing system logs: Investigate the unified stream of system and database telemetry. Use the integrated Loki search to pinpoint the root cause of query failures or administrative changes.

  • Managing alerts: Integrate with Prometheus Alertmanager to govern the incident lifecycle through real-time notifications.

Verifying the cluster health

Monitor real-time WarehousePG cluster health, verify node availability, and track critical connectivity metrics to ensure high availability.
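As a rough illustration of what verifying node availability can involve, the sketch below summarizes rows shaped like the Greenplum-style `gp_segment_configuration` catalog. It assumes that WarehousePG exposes this catalog with the usual `content`, `role`, `preferred_role`, `status`, and `hostname` columns. The `SegmentRow` type and `summarize_segments` helper are hypothetical names introduced only for this example. They are not part of the product.

```python
from typing import Iterable, NamedTuple


class SegmentRow(NamedTuple):
    """One row shaped like gp_segment_configuration (assumed column subset)."""
    content: int          # -1 is the coordinator, 0..N are segments
    role: str             # current role: 'p' (primary) or 'm' (mirror)
    preferred_role: str   # role assigned at initialization
    status: str           # 'u' (up) or 'd' (down)
    hostname: str


def summarize_segments(rows: Iterable[SegmentRow]) -> dict:
    """Flag nodes that are down or running outside their preferred role."""
    rows = list(rows)
    down = [r for r in rows if r.status != "u"]
    switched = [r for r in rows if r.role != r.preferred_role]
    return {
        "total": len(rows),
        "down": [(r.content, r.hostname) for r in down],
        "not_in_preferred_role": [(r.content, r.hostname) for r in switched],
        "healthy": not down and not switched,
    }


# Hypothetical sample data: one mirror is down on host sdw2.
sample = [
    SegmentRow(-1, "p", "p", "u", "cdw"),
    SegmentRow(0, "p", "p", "u", "sdw1"),
    SegmentRow(0, "m", "m", "d", "sdw2"),
]
report = summarize_segments(sample)
```

In practice the rows would come from a query against the catalog rather than from a hard-coded list.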

Visualizing hardware performance

Track physical host metrics, identify resource bottlenecks, and correlate hardware spikes with database activity.

Validating database responsiveness

Track proactive health indicators and automated canary check results to ensure database availability.
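To make the idea of a canary check concrete, here is a minimal sketch of a synthetic probe that checks both connectivity and execution speed. It does not show how the product implements its checks. The `run_canary` function, the `CanaryResult` type, and the 500 ms default threshold are assumptions made for illustration, and the in-memory SQLite connection is only a stand-in for a real database connection.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CanaryResult:
    ok: bool
    latency_ms: float
    detail: str


def run_canary(probe: Callable[[], object], max_latency_ms: float = 500.0) -> CanaryResult:
    """Run one synthetic probe and classify it by success and latency."""
    start = time.perf_counter()
    try:
        probe()
    except Exception as exc:  # any failure counts as an unhealthy probe
        latency_ms = (time.perf_counter() - start) * 1000.0
        return CanaryResult(False, latency_ms, f"probe failed: {exc}")
    latency_ms = (time.perf_counter() - start) * 1000.0
    if latency_ms > max_latency_ms:
        return CanaryResult(False, latency_ms, "probe exceeded latency threshold")
    return CanaryResult(True, latency_ms, "ok")


def sqlite_select_one():
    """Stand-in probe: connect, run SELECT 1, and close the connection."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


result = run_canary(sqlite_select_one, max_latency_ms=5000.0)
```

Passing the probe in as a callable keeps the timing and classification logic independent of any particular database driver.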

Auditing system logs

Access, filter, and analyze system and database telemetry through integrated log viewers.
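For readers who want to see what a filtered log search can look like outside the integrated viewer, the sketch below builds a request URL for Loki's `query_range` HTTP endpoint with a LogQL line filter. The base URL, the `job="warehousepg"` label, and the `build_loki_query_url` helper are hypothetical. The labels available in a real deployment may differ.

```python
from urllib.parse import urlencode


def build_loki_query_url(base_url: str, selector: str, contains: str,
                         start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki query_range URL that filters a stream for a substring."""
    # Escape backslashes and double quotes so the filter stays a valid LogQL string.
    escaped = contains.replace("\\", "\\\\").replace('"', '\\"')
    logql = f'{selector} |= "{escaped}"'
    params = {
        "query": logql,
        "start": str(start_ns),   # Unix epoch in nanoseconds
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical host and label; adjust both to the actual deployment.
url = build_loki_query_url(
    "http://loki.example:3100/",
    '{job="warehousepg"}',
    "ERROR",
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
    limit=50,
)
```

The resulting URL can be fetched with any HTTP client. The response format is described in the Loki HTTP API documentation.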

Managing alerts

Use the Prometheus Alertmanager integration to receive real-time notifications and manage alerts throughout the incident lifecycle.
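When validating a notification path, it can help to send Alertmanager a hand-built test alert. The sketch below only constructs the JSON body in the shape accepted by the Alertmanager v2 `POST /api/v2/alerts` endpoint. The alert name, severity label, summary text, and the `build_test_alert` helper are hypothetical and do not correspond to alerts shipped with the product.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_test_alert(alertname: str, severity: str, summary: str,
                     starts_at: Optional[datetime] = None) -> list:
    """Return a one-element alert list for the Alertmanager v2 alerts endpoint."""
    starts_at = starts_at or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": summary},
        "startsAt": starts_at.isoformat(),  # RFC 3339 timestamp with UTC offset
    }]


# Hypothetical alert values, with a fixed timestamp for reproducibility.
payload = build_test_alert(
    "SegmentDownTest",
    "critical",
    "Synthetic alert to verify notification routing",
    starts_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
body = json.dumps(payload)
```

The serialized `body` would be posted with a `Content-Type: application/json` header. Routing and silencing behavior depend on the Alertmanager configuration in use.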

