Managing alerts
Maintain high availability and resolve incidents quickly by using real-time infrastructure signals to drive your operations. By integrating with Prometheus Alertmanager, you can detect performance degradation or system failures before they result in downtime.
Alertmanager required
You must set the ALERTMANAGER_URL environment variable on the host running WEM to enable the Alertmanager integration. See Configuring WEM for details.
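For example, you might export the variable before starting the WEM service. The address below is a placeholder; substitute your own Alertmanager endpoint:

```shell
# Placeholder address: substitute the host and port of your Alertmanager deployment.
export ALERTMANAGER_URL="http://alertmanager.example.com:9093"
```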
Responding to cluster incidents
Manage the incident lifecycle from the Alerts panel in the left sidebar, where you can identify, mute, and audit notifications. This proactive approach ensures that critical failures are addressed before they impact database availability.
The Total Active, Critical, Warning, and Alertmanager header cards give an instant cluster-wide snapshot.
- Identify and prioritize threats using the Active Alerts tab. The Active Alerts table shows each alert's Severity, Alert name, Summary, Status, and Duration. Address Critical failures first, such as service outages or segment failures, before investigating Warning or Info events. Use the Actions column to silence an alert directly from the table.
- If you are performing a scheduled recovery or hardware upgrade, use the Silences tab to mute specific alerts. This functionality prevents notification fatigue and ensures your team stays focused on unexpected issues.
- Verify notification delivery to key stakeholders by checking the Notifications tab. Review the exact timestamp and destination (such as Slack, Email, or PagerDuty) for every alert dispatched.
- If an alert triggers too often or fails to trigger when expected, review the Alert Rules tab. The Configured Alert Rules table shows each rule's State, Severity, Alert Rule name, For duration, Query, Group, and Last Evaluated time.
- Identify recurring patterns after an incident is resolved with the Alert History tab. Filter by time range (Last 1 day through Last 14 days). The table shows each alert's Severity, Alert name, Status, Started, Ended, and Duration, helping you isolate intermittent failures or capacity constraints that require long-term planning.
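The Silences workflow above can also be driven from the command line using Alertmanager's `amtool` and its HTTP API. In the sketch below, the Alertmanager address and alert name are placeholders, so the actual network calls are shown commented out:

```shell
# Hypothetical Alertmanager address; substitute your own deployment's URL.
AM_URL="http://alertmanager.example.com:9093"

# List the alerts that are currently firing:
#   curl -s "$AM_URL/api/v2/alerts"

# Silence a specific alert for a two-hour maintenance window:
#   amtool silence add alertname="SegmentDown" \
#     --duration="2h" --comment="scheduled segment recovery" \
#     --alertmanager.url="$AM_URL"

echo "silencing against: $AM_URL"
```

Silences created this way appear in the Silences tab alongside those created in the UI, because both act on the same Alertmanager instance.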
Understanding alert sources and severity
To respond effectively, you must understand where alerts originate and how they're categorized.
Supported alert sources
- Canary check failures: Triggered when automated SQL probes fail or exceed latency thresholds.
- Segment down events: Triggered if a segment becomes unreachable or enters a recovery state.
- Resource threshold breaches: Fired when CPU, Memory, or Disk Usage cross predefined limits.
- System errors: Critical database engine events captured from the WarehousePG (WHPG) log stream.
- WEM outages: Triggered if Prometheus is unable to reach the WEM service.
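To illustrate, a resource threshold breach like the Disk Usage case above is typically expressed as a Prometheus alerting rule. In this sketch the metric name, threshold, and group name are illustrative placeholders, not rules WEM ships with; the rule's fields map onto the columns of the Configured Alert Rules table:

```yaml
groups:
  - name: host-resources            # corresponds to the Group column
    rules:
      - alert: HighDiskUsage        # Alert Rule name
        expr: disk_used_percent > 85   # Query (placeholder metric and threshold)
        for: 10m                    # For duration: condition must hold before firing
        labels:
          severity: warning         # drives the Severity column
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }}"
```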
Severity levels
- Critical: Indicates a severe failure or a total loss of service. These require immediate attention.
- Warning: Highlights performance degradation or resource pressure. Investigate these to prevent escalation.
- Info: Routine informational notices regarding system changes or successful task completions.
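How alerts of each severity reach stakeholders is governed by Alertmanager's routing tree. A minimal sketch, assuming hypothetical Slack and PagerDuty receivers (the names, channel, and key are placeholders), pages the on-call engineer for Critical alerts and sends everything else to a shared channel:

```yaml
route:
  receiver: slack-default           # Warning and Info alerts land here
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall    # page immediately on Critical
receivers:
  - name: slack-default
    slack_configs:
      - channel: "#alerts"          # placeholder channel
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-routing-key>"
```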