Managing alerts

Use real-time infrastructure signals to maintain high availability and resolve incidents quickly. Integration with Prometheus Alertmanager helps you detect performance degradation or system failures early, ideally before they cause downtime.

Alertmanager required

You must set the ALERTMANAGER_URL environment variable to use this functionality. See Configuring WEM for details.
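As a quick pre-flight check, the following minimal Python sketch reads ALERTMANAGER_URL from the environment and probes Alertmanager's standard /-/healthy endpoint. The function names are hypothetical and are not part of WEM. The sketch also assumes the variable holds a base URL such as http://host:9093, which this page doesn't specify.

```python
import os
import urllib.request


def alertmanager_health_url(env=os.environ):
    """Build the health-check URL from ALERTMANAGER_URL, or raise if it is unset."""
    # Assumption: the variable holds a base URL such as http://host:9093.
    base = env.get("ALERTMANAGER_URL", "").strip()
    if not base:
        raise RuntimeError("ALERTMANAGER_URL is not set")
    # Drop any trailing slash so the joined path has exactly one separator.
    return base.rstrip("/") + "/-/healthy"


def alertmanager_is_healthy(env=os.environ, timeout=5.0):
    """Return True if Alertmanager answers its health endpoint with HTTP 200."""
    try:
        url = alertmanager_health_url(env)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False
```

Calling alertmanager_is_healthy() before starting the service can catch an unset variable early, because alertmanager_health_url() raises a RuntimeError in that case. An unreachable Alertmanager makes the function return False.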

Responding to cluster incidents

Use the Alerts panel on the left sidebar to identify, mute, and audit notifications throughout the incident lifecycle. Acting on alerts early helps you address critical failures before they affect database availability.

  • Use the Active Alerts tab to see current issues and prioritize them. Address Critical alerts first, such as service outages or segment failures, before investigating Warning or Info events.
  • If you are performing a scheduled recovery or hardware upgrade, use the Silences tab to mute specific alerts. Muting expected alerts reduces notification fatigue and keeps your team focused on unexpected issues.
  • Verify notification delivery to key stakeholders by checking the Notifications tab. Review the exact timestamp and destination (such as Slack, Email, or PagerDuty) for every alert dispatched.
  • If an alert triggers too often or fails to trigger when expected, browse the Alert Rules tab. Review the technical thresholds and durations defined in your configuration to ensure the detection logic matches your operational needs.
  • Identify recurring patterns after an incident is resolved with the Alert History tab. Auditing these logs helps you isolate intermittent hardware failures or capacity constraints that require long-term planning.
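The Silences tab is the documented way to mute alerts. To script a silence for a maintenance window instead, Alertmanager's own v2 HTTP API accepts a POST to /api/v2/silences. The sketch below only builds the request body. The helper name build_silence, the alert name SegmentDown, and the creator address are hypothetical, and this page doesn't say whether WEM uses this endpoint itself.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, start, duration, created_by, comment):
    """Return a silence body in the shape Alertmanager's v2 API accepts.

    `matchers` maps label names to exact values; `start` must be
    timezone-aware so the UTC conversion is unambiguous.
    """
    end = start + duration
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.astimezone(timezone.utc).isoformat(),
        "endsAt": end.astimezone(timezone.utc).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical two-hour window for a hardware upgrade.
window_start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
body = build_silence(
    {"alertname": "SegmentDown"},  # assumed alert name, for illustration only
    window_start,
    timedelta(hours=2),
    "ops@example.com",
    "Scheduled hardware upgrade",
)
print(json.dumps(body, indent=2))
```

Keeping the silence bounded by an explicit end time means the alert resumes on its own if the maintenance overruns, so an unexpected failure isn't hidden indefinitely.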

Understanding alert sources and severity

To respond effectively, you must understand where alerts originate and how they're categorized.

Supported alert sources

  • Canary check failures: Triggered when automated SQL probes fail or exceed latency thresholds.
  • Segment down events: Triggered if a segment becomes unreachable or enters a recovery state.
  • Resource threshold breaches: Triggered when CPU, memory, or disk usage crosses predefined limits.
  • System errors: Critical database engine events captured from the WarehousePG (WHPG) log stream.
  • WEM outages: Triggered if Prometheus is unable to reach the WEM service.

Severity levels

  • Critical: Indicates a severe failure or a total loss of service. These require immediate attention.
  • Warning: Highlights performance degradation or resource pressure. Investigate these to prevent escalation.
  • Info: Reports routine system changes or successful task completions. These require no immediate action.
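The severity ordering above can be applied programmatically, for example when triaging alerts exported from Alertmanager. This minimal sketch assumes each alert is a dictionary with a labels map and a startsAt timestamp, as Alertmanager's v2 API returns them. It also assumes the severity label uses the values critical, warning, and info, which depends on your alert rules. The alert names in the sample are hypothetical.

```python
# Lower rank means higher priority. Unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def triage_order(alerts):
    """Sort alerts by severity, then oldest first within each severity."""
    def key(alert):
        severity = alert.get("labels", {}).get("severity", "").lower()
        rank = SEVERITY_RANK.get(severity, len(SEVERITY_RANK))
        # Timestamps compare as strings; this assumes they share one format.
        return (rank, alert.get("startsAt", ""))
    return sorted(alerts, key=key)


sample = [
    {"labels": {"alertname": "DiskUsageHigh", "severity": "warning"},
     "startsAt": "2024-01-01T10:00:00Z"},
    {"labels": {"alertname": "BackupFinished", "severity": "info"},
     "startsAt": "2024-01-01T09:00:00Z"},
    {"labels": {"alertname": "SegmentDown", "severity": "critical"},
     "startsAt": "2024-01-01T11:00:00Z"},
]

for alert in triage_order(sample):
    print(alert["labels"]["severity"], alert["labels"]["alertname"])
```

With the sample data, the Critical alert is listed first even though it started last, which matches the guidance to address Critical failures before Warning or Info events.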
