Alerting

Alerts should be actionable, low-noise, and tied to user impact.

Principles

Alert on symptoms, not on every error
Use severity levels consistently
Include context in the alert payload, such as the affected step and the severity
Separate SLO alerts from operational alerts
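
As an illustration of the principles above, an alert payload might carry a fixed severity level, a marker separating SLO alerts from operational ones, and a context dictionary. This is a minimal sketch, not a prescribed schema: the `Alert`, `Severity`, and `AlertKind` names, the field names, and the example step name are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    # A small fixed set keeps severity usage consistent across alerts.
    WARNING = "warning"
    CRITICAL = "critical"


class AlertKind(Enum):
    # Keeps SLO alerts distinguishable from operational alerts.
    SLO = "slo"
    OPERATIONAL = "operational"


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    name: str
    severity: Severity
    kind: AlertKind
    summary: str
    context: dict = field(default_factory=dict)

    def to_payload(self) -> dict:
        # Flatten enums to strings so the payload is JSON-serializable.
        return {
            "name": self.name,
            "severity": self.severity.value,
            "kind": self.kind.value,
            "summary": self.summary,
            "context": dict(self.context),
        }


# Example: a latency alert that names the affected step (hypothetical name).
alert = Alert(
    name="step_latency_above_slo",
    severity=Severity.WARNING,
    kind=AlertKind.SLO,
    summary="p95 step latency is above the SLO",
    context={"step": "example-step", "window_minutes": 10},
)
payload = alert.to_payload()
```

Routing on `kind` would let SLO alerts and operational alerts go to different channels, if that separation is wanted.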

Dashboards

Pair alerts with dashboards that show step latency, throughput, and error rates.

Common Alerts

Sustained step error rate above threshold
Step latency above SLO
Dead-letter queue growth
Orchestrator runtime failure or restart loops

Practical Defaults

Start with:

Error rate > 2% for 5 minutes (warning)
p95 latency > 2x baseline for 10 minutes (warning)
Dead-letter queue (DLQ) depth > 0 with sustained growth (critical)
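
The defaults above could be evaluated roughly as follows. This sketch assumes one metric sample per minute, treats "for N minutes" as "every one of the last N samples breaches the threshold", and interprets "sustained growth" as a strictly increasing DLQ depth over the last three samples. The function names and the three-sample growth window are assumptions, not part of any particular monitoring system.

```python
from typing import Callable, Optional, Sequence

ERROR_RATE_THRESHOLD = 0.02   # 2%
ERROR_RATE_WINDOW = 5         # minutes, at one sample per minute
LATENCY_MULTIPLIER = 2.0      # 2x baseline
LATENCY_WINDOW = 10           # minutes, at one sample per minute
DLQ_GROWTH_WINDOW = 3         # assumed meaning of "sustained growth"


def sustained(samples: Sequence[float], window: int,
              breached: Callable[[float], bool]) -> bool:
    """True if each of the last `window` samples breaches the threshold."""
    if len(samples) < window:
        return False
    return all(breached(s) for s in samples[-window:])


def error_rate_alert(error_rates: Sequence[float]) -> Optional[str]:
    # Rates are fractions, so 0.02 means 2%.
    if sustained(error_rates, ERROR_RATE_WINDOW,
                 lambda r: r > ERROR_RATE_THRESHOLD):
        return "warning"
    return None


def latency_alert(p95_ms: Sequence[float], baseline_ms: float) -> Optional[str]:
    limit = LATENCY_MULTIPLIER * baseline_ms
    if sustained(p95_ms, LATENCY_WINDOW, lambda v: v > limit):
        return "warning"
    return None


def dlq_alert(depths: Sequence[int]) -> Optional[str]:
    if len(depths) < DLQ_GROWTH_WINDOW:
        return None
    recent = depths[-DLQ_GROWTH_WINDOW:]
    # Strictly increasing depth across the window counts as growth.
    growing = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    if recent[-1] > 0 and growing:
        return "critical"
    return None
```

Requiring every sample in the window to breach the threshold is one way to keep noise down; a single healthy sample resets the condition.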