Alerting
Alerts should be actionable, low-noise, and tied to user impact.
Principles
- Alert on symptoms that users would notice, not on every individual error
- Use severity levels consistently
- Include context in the alert payload
- Separate SLO alerts from operational alerts
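As an illustration of the last three principles, here is a minimal sketch of an alert payload that carries a fixed severity level, a kind that separates SLO alerts from operational ones, and free-form context. The names `Alert`, `Severity`, `AlertKind`, and the context keys are hypothetical and not part of any particular alerting system.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict


class Severity(str, Enum):
    # A closed set of levels keeps severity usage consistent across alerts.
    WARNING = "warning"
    CRITICAL = "critical"


class AlertKind(str, Enum):
    # Lets routing treat SLO alerts and operational alerts differently.
    SLO = "slo"
    OPERATIONAL = "operational"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    kind: AlertKind
    summary: str
    # Context the responder needs: which workflow, which step, and so on.
    context: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        payload = asdict(self)
        payload["severity"] = self.severity.value
        payload["kind"] = self.kind.value
        return json.dumps(payload, sort_keys=True)
```

A caller might build `Alert("step_latency_slo", Severity.WARNING, AlertKind.SLO, "p95 above SLO", {"workflow": "checkout", "step": "charge"})` and send `to_json()` to whatever notification channel is in use.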
Dashboards
Pair alerts with dashboards that show step latency, throughput, and error rates.
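The three dashboard signals can be derived from raw step records. The sketch below assumes a hypothetical `StepRun` record with a duration and a success flag, and uses the nearest-rank method for the percentile; a real metrics backend may compute percentiles differently (for example from histogram buckets).

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class StepRun:
    duration_s: float  # wall-clock time of one step execution
    ok: bool           # False if the step ended in an error


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def summarize(runs: Sequence[StepRun], window_s: float) -> Dict[str, float]:
    """Compute throughput, error rate, and p95 latency for one time window."""
    if not runs:
        return {"throughput_per_s": 0.0, "error_rate": 0.0, "p95_latency_s": 0.0}
    failures = sum(1 for r in runs if not r.ok)
    return {
        "throughput_per_s": len(runs) / window_s,
        "error_rate": failures / len(runs),
        "p95_latency_s": percentile([r.duration_s for r in runs], 95),
    }
```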
Common Alerts
- Sustained step error rate above threshold
- Step latency above SLO
- Dead-letter queue growth
- Orchestrator runtime failure or restart loops
Practical Defaults
Start with these thresholds and tune them once you have observed baselines:
- Error rate > 2% for 5 minutes (warning)
- p95 latency > 2x baseline for 10 minutes (warning)
- Dead-letter queue (DLQ) depth > 0 with sustained growth (critical)
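The defaults above can be sketched as rules evaluated over recent samples. This is one possible reading, not a definitive implementation: "sustained" is taken to mean that every sample in the window breaches the threshold and that history covers the whole window, and "sustained growth" is taken to mean that the DLQ depth never shrinks within the window and ends higher than it started. The names `Rule`, `sustained`, and `dlq_growing` are hypothetical, and the DLQ window length is left to the caller because the defaults do not specify one.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str                      # "warning" or "critical"
    window_s: float                    # how long the condition must hold
    breached: Callable[[float], bool]  # True if a single sample is over threshold


def in_window(samples: Sequence[Sample], window_s: float, now: float) -> List[Sample]:
    """Samples from the last window_s seconds, or [] if history is too short."""
    if not samples or min(t for t, _ in samples) > now - window_s:
        return []  # not enough history to call anything "sustained"
    return [(t, v) for t, v in samples if now - window_s <= t <= now]


def sustained(samples: Sequence[Sample], rule: Rule, now: float) -> bool:
    """True if every sample in the rule's window breaches the threshold."""
    window = in_window(samples, rule.window_s, now)
    return bool(window) and all(rule.breached(v) for _, v in window)


def dlq_growing(samples: Sequence[Sample], window_s: float, now: float) -> bool:
    """True if DLQ depth is non-zero, never shrinks, and ends above its start."""
    values = [v for _, v in sorted(in_window(samples, window_s, now))]
    if len(values) < 2 or values[-1] <= 0:
        return False
    never_shrinks = all(b >= a for a, b in zip(values, values[1:]))
    return never_shrinks and values[-1] > values[0]


def default_rules(baseline_p95_s: float) -> List[Rule]:
    """Warning-level defaults; error rate is a fraction, latency is in seconds."""
    return [
        Rule("step_error_rate", "warning", 300, lambda v: v > 0.02),
        Rule("step_p95_latency", "warning", 600, lambda v: v > 2 * baseline_p95_s),
    ]
```

Requiring full window coverage means a freshly started collector stays quiet until it has enough history, which trades a slower first alert for fewer false positives.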