Observing system events
Verify cluster health, maintain operational awareness, and respond to system events in real time through the following core actions:
Verifying the cluster health: Maintain a real-time view of the cluster’s topology and health. Ensure that coordinator, standby, and segment nodes are online and correctly configured.
Visualizing hardware performance: Track the physical health of your infrastructure. Identify OS-level bottlenecks, such as CPU spikes, memory exhaustion, or network latency across specific hosts.
Validating database responsiveness: Ensure the database engine is actively processing requests. Review the automated canary checks, which are synthetic SQL probes that verify connectivity and execution speed.
Auditing system logs: Investigate the unified stream of system and database telemetry. Use the integrated Loki search to pinpoint the root cause of query failures or administrative changes.
Managing alerts: Integrate with Prometheus Alertmanager to govern the incident lifecycle through real-time notifications.
Verifying the cluster health
Monitor real-time WarehousePG cluster health, verify node availability, and track critical connectivity metrics to ensure high availability.
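As a rough illustration of what "online and correctly configured" can mean for a node, the sketch below flags nodes that are down or not running in their preferred role. It assumes node rows shaped like the Greenplum-style `gp_segment_configuration` catalog (`hostname`, `content`, `role`, `preferred_role`, `status`). The `Node` type, the `find_problems` helper, and the host names are hypothetical and not part of the product.

```python
from typing import Iterable, List, NamedTuple


class Node(NamedTuple):
    hostname: str
    content: int         # assumed: -1 for coordinator/standby, >= 0 for segments
    role: str            # current role: 'p' (primary) or 'm' (mirror)
    preferred_role: str  # role assigned when the cluster was initialized
    status: str          # 'u' (up) or 'd' (down)


def find_problems(nodes: Iterable[Node]) -> List[str]:
    """Return one human-readable line per node that needs attention."""
    problems = []
    for n in nodes:
        if n.status != "u":
            problems.append(f"{n.hostname} (content {n.content}) is down")
        elif n.role != n.preferred_role:
            # The node is up, but a failover has swapped primary and mirror.
            problems.append(
                f"{n.hostname} (content {n.content}) is not in its preferred role"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical sample rows; real values would come from the catalog.
    sample = [
        Node("cdw", -1, "p", "p", "u"),
        Node("sdw1", 0, "m", "p", "d"),
        Node("sdw2", 0, "p", "m", "u"),
    ]
    for line in find_problems(sample):
        print(line)
```

In this sample, `sdw1` is reported as down and `sdw2` as running outside its preferred role, which is the pattern left behind by a mirror promotion.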
Visualizing hardware performance
Track physical host metrics, identify resource bottlenecks, and correlate hardware spikes with database activity.
Validating database responsiveness
Track proactive health indicators and automated canary check results to ensure database availability.
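To make the idea of a canary check concrete, here is a minimal sketch of a synthetic SQL probe that tests both connectivity and execution speed. It is not the product's implementation: `run_canary`, `CanaryResult`, the `SELECT 1` probe, and the one-second budget are illustrative assumptions, and `sqlite3` stands in for the real database driver so the example runs anywhere.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CanaryResult:
    ok: bool                # True when the probe finished within the budget
    elapsed_seconds: float  # wall-clock time for connect + execute + fetch
    error: Optional[str]    # failure description, or None on success


def run_canary(connect: Callable[[], object], sql: str = "SELECT 1",
               max_seconds: float = 1.0) -> CanaryResult:
    """Run one probe through any DB-API 2.0 connection factory."""
    start = time.monotonic()
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(sql)
            cur.fetchall()
        finally:
            conn.close()
    except Exception as exc:  # connectivity or execution failure
        return CanaryResult(False, time.monotonic() - start, str(exc))
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        # The query worked, but too slowly to count as responsive.
        return CanaryResult(
            False, elapsed, f"slow: {elapsed:.3f}s > {max_seconds:.3f}s"
        )
    return CanaryResult(True, elapsed, None)


if __name__ == "__main__":
    # sqlite3 is only a stand-in for the real driver in this sketch.
    print(run_canary(lambda: sqlite3.connect(":memory:")))
```

Passing a connection factory rather than a connection means each probe also exercises the connect step, so a database that accepts no new sessions fails the check even when existing sessions still work.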
Auditing system logs
Access, filter, and analyze system and database telemetry through integrated log viewers.
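For readers who want to see what a log filter looks like underneath, the sketch below builds a LogQL query and the matching URL for Loki's `query_range` HTTP endpoint. Whether the integrated log viewer exposes this endpoint directly is an assumption. The label names (`job`, `host`), the base URL, and the helper functions are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(labels: dict, contains: str) -> str:
    """Build a stream selector plus a 'line contains' filter."""
    selector = ",".join(f"{k}={_quote(v)}" for k, v in sorted(labels.items()))
    return "{" + selector + "} |= " + _quote(contains)


def query_range_url(base_url: str, query: str, start_ns: int, end_ns: int,
                    limit: int = 100) -> str:
    """Build a query_range URL; start and end are Unix epochs in nanoseconds."""
    params = {
        "query": query,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return base_url.rstrip("/") + "/loki/api/v1/query_range?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical labels and host; adjust to the labels your logs carry.
    q = build_logql({"job": "warehousepg", "host": "sdw1"}, "ERROR")
    print(q)
    print(query_range_url("http://loki.example:3100", q, 1, 2))
```

Narrowing the stream selector by host or job before applying the text filter keeps the search small, which matters when tracing a single failed query across many segment hosts.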
Managing alerts
Use the Prometheus Alertmanager integration to receive real-time notifications and govern the incident lifecycle.
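As an illustration of how a notification can be consumed, the sketch below turns an Alertmanager webhook payload into one summary line per alert. It relies only on fields from Alertmanager's webhook format (`alerts`, `status`, `labels`, `annotations`). The `summarize_webhook` helper, the `SegmentDown` alert name, and the sample values are hypothetical.

```python
import json
from typing import List


def summarize_webhook(body: str) -> List[str]:
    """Return one summary line per alert in an Alertmanager webhook body."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} severity={sev}: {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                sev=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


if __name__ == "__main__":
    # Hypothetical payload trimmed to the fields this sketch reads.
    sample = json.dumps({
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "SegmentDown", "severity": "critical"},
            "annotations": {"summary": "Segment on sdw1 is down"},
        }],
    })
    for line in summarize_webhook(sample):
        print(line)
```

Each alert carries its own `status`, so a single notification can mix firing and resolved alerts; reading the per-alert value rather than the top-level one keeps the incident state accurate.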