Observing system events

Verify cluster health, maintain operational awareness, and respond to system events in real time through the following core actions:

Verifying cluster health: Maintain a real-time view of the cluster’s topology and health. Ensure that coordinator, standby, and segment nodes are online and correctly configured.
Visualizing hardware performance: Track the physical health of your infrastructure. Identify OS-level bottlenecks, such as CPU spikes, memory exhaustion, or network latency across specific hosts.
Validating database responsiveness: Ensure the database engine is actively processing requests. Review automated canary checks, which are synthetic SQL probes that verify connectivity and execution speed.
Auditing system logs: Investigate the unified stream of system and database telemetry. Use the integrated Loki search to pinpoint the root cause of query failures or administrative changes.
Managing alerts: Integrate with Prometheus Alertmanager to govern the incident lifecycle through real-time notifications.

Monitor real-time WarehousePG cluster health, verify node availability, and track critical connectivity metrics to ensure high availability.
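As a rough illustration of what a node-availability check looks at, the sketch below classifies segment rows by hand. It assumes the Greenplum-lineage catalog table gp_segment_configuration is available in WarehousePG; the Segment type, the find_problems helper, the host names, and the sample rows are hypothetical and are not output from a real cluster.

```python
from typing import Iterable, NamedTuple

# Query a DBA might run by hand. Assumption: the Greenplum-lineage
# catalog table gp_segment_configuration is present in the cluster.
SEGMENT_QUERY = """
SELECT content, role, preferred_role, status, hostname
FROM gp_segment_configuration
ORDER BY content, role
"""


class Segment(NamedTuple):
    content: int         # -1 identifies the coordinator/standby pair
    role: str            # 'p' = primary, 'm' = mirror
    preferred_role: str  # role assigned at initialization
    status: str          # 'u' = up, 'd' = down
    hostname: str


def find_problems(rows: Iterable[Segment]) -> list[str]:
    """Return one message per node that is down or has switched roles."""
    problems = []
    for seg in rows:
        if seg.status != "u":
            problems.append(f"content {seg.content} on {seg.hostname}: down")
        elif seg.role != seg.preferred_role:
            problems.append(
                f"content {seg.content} on {seg.hostname}: not in preferred role"
            )
    return problems


# Illustrative rows only (hypothetical host names).
sample_segments = [
    Segment(-1, "p", "p", "u", "cdw"),
    Segment(0, "m", "p", "d", "sdw1"),
    Segment(0, "p", "m", "u", "sdw2"),
]

for message in find_problems(sample_segments):
    print(message)
```

In this sample, the original primary for content 0 is down and its mirror has taken over as acting primary, so both rows are reported.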

Track physical host metrics, identify resource bottlenecks, and correlate hardware spikes with database activity.

Track proactive health indicators and automated canary check results to ensure database availability.
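To make the idea of a canary check concrete, here is a minimal sketch of a synthetic SQL probe that measures connectivity and execution time. It is not the product's implementation: run_canary, ProbeResult, and the 500 ms threshold are hypothetical, and sqlite3 stands in for a real database driver so that the example runs anywhere. Any DB-API 2.0 connection factory could be passed instead.

```python
import sqlite3
import time
from typing import Any, Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool           # True if the probe ran and met the latency budget
    latency_ms: float  # wall-clock time for connect + query
    error: str         # empty unless the probe raised an exception


def run_canary(
    connect: Callable[[], Any],
    sql: str = "SELECT 1",
    max_latency_ms: float = 500.0,  # hypothetical threshold
) -> ProbeResult:
    """Open a connection, run one trivial query, and time the round trip."""
    start = time.perf_counter()
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(sql)
            cur.fetchone()
        finally:
            conn.close()
    except Exception as exc:
        elapsed = (time.perf_counter() - start) * 1000
        return ProbeResult(False, elapsed, str(exc))
    elapsed = (time.perf_counter() - start) * 1000
    return ProbeResult(elapsed <= max_latency_ms, elapsed, "")


if __name__ == "__main__":
    # sqlite3 is only a stand-in for the real database driver.
    result = run_canary(lambda: sqlite3.connect(":memory:"))
    print(result)
```

A probe can fail in two ways under this design: the query raises an error (connectivity), or it succeeds but exceeds the latency budget (execution speed).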

Access, filter, and analyze system and database telemetry through integrated log viewers.
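For readers who prefer to query logs outside the integrated viewer, the sketch below builds a request URL for Loki's HTTP query_range endpoint. The base URL, the job="warehousepg" label, and the loki_query_range_url helper are assumptions for illustration; the labels attached to log streams in an actual deployment may differ.

```python
import time
from urllib.parse import urlencode


def loki_query_range_url(
    base_url: str,
    logql: str,
    start_ns: int,
    end_ns: int,
    limit: int = 100,
) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": start_ns,         # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
        "direction": "backward",   # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    now_ns = time.time_ns()
    one_hour_ns = 3600 * 1_000_000_000
    # Hypothetical label and filter: lines containing "ERROR" in the last hour.
    print(
        loki_query_range_url(
            "http://localhost:3100",
            '{job="warehousepg"} |= "ERROR"',
            now_ns - one_hour_ns,
            now_ns,
        )
    )
```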

Manage alerts and govern the incident lifecycle, using the Prometheus Alertmanager integration to receive real-time notifications.
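As one possible way to triage alerts programmatically, the sketch below summarizes a list of alerts in the shape returned by Alertmanager's GET /api/v2/alerts endpoint. The active_by_severity helper, the alert names, and the sample payload are hypothetical; the severity label is a common convention, not something Alertmanager requires.

```python
from collections import Counter


def active_by_severity(alerts: list[dict]) -> Counter:
    """Count alerts whose status.state is 'active', grouped by severity label."""
    counts: Counter = Counter()
    for alert in alerts:
        if alert.get("status", {}).get("state") == "active":
            severity = alert.get("labels", {}).get("severity", "unknown")
            counts[severity] += 1
    return counts


# Illustrative payload only; alert names and labels are made up.
sample_alerts = [
    {"labels": {"alertname": "SegmentDown", "severity": "critical"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "HighCPU", "severity": "warning"},
     "status": {"state": "suppressed"}},
    {"labels": {"alertname": "CanaryFailed"},
     "status": {"state": "active"}},
]

for severity, count in active_by_severity(sample_alerts).items():
    print(severity, count)
```

Suppressed alerts (silenced or inhibited) are skipped so that the summary reflects only what currently needs attention.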