Hey everyone,
I’m a lead DevOps engineer at a game studio, running our infrastructure on DigitalOcean Kubernetes. I got tired of manually digging through logs and dashboards every time an alert fired, so I built an AI-powered alert assistant that does the initial investigation for me.
What it does:
When Alertmanager fires an alert, this workflow automatically investigates the incident and posts a structured diagnostic report as a thread reply in Mattermost — right under the original alert message.
The report includes:
- What happened (summary of the incident)
- Event timeline (what happened in the 10+ minutes leading up to the alert)
- Root cause (up to two hypotheses)
- Troubleshooting tips (step-by-step actions for each hypothesis)
How it works:
Receive & deduplicate — A webhook receives alerts from Alertmanager. A deduplication node (48-hour window tracked via $getWorkflowStaticData) prevents repeated analysis of the same alert; a sketch of that node is below.
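Roughly, the dedup Code node looks like this. It's a minimal sketch, not the exact node from the workflow: the `seenAlerts` key and the fingerprint fallback are illustrative, and it assumes the standard Alertmanager webhook payload under `$json.body`.

```typescript
// n8n Code node (run once for all items) — dedup sketch.
const staticData = $getWorkflowStaticData('global');
const seen = staticData.seenAlerts ?? {};
const WINDOW_MS = 48 * 60 * 60 * 1000; // 48h dedup window
const now = Date.now();

// Evict expired fingerprints so static data doesn't grow forever
for (const [key, ts] of Object.entries(seen)) {
  if (now - ts > WINDOW_MS) delete seen[key];
}

const fresh = [];
for (const alert of $json.body.alerts ?? []) {
  // Alertmanager sends a fingerprint per alert; fall back to the label set
  const key = alert.fingerprint ?? JSON.stringify(alert.labels);
  if (key in seen) continue; // analyzed within the last 48h — skip
  seen[key] = now;
  fresh.push({ json: alert });
}

staticData.seenAlerts = seen;
return fresh;
```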
Enrich with trigger condition — The workflow fetches the actual PromQL expression from the Prometheus Rules API, so the agent knows exactly which condition triggered the alert.
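In the workflow this is an HTTP Request node, but the lookup is essentially the following (a sketch; PROM_URL is a placeholder for your Prometheus endpoint):

```typescript
// Rules API lookup sketch — find the alerting rule's PromQL by alertname.
const PROM_URL = 'http://prometheus.example.internal:9090';

async function getTriggerQuery(alertname) {
  const res = await fetch(`${PROM_URL}/api/v1/rules?type=alert`);
  const body = await res.json();
  for (const group of body.data?.groups ?? []) {
    const rule = (group.rules ?? []).find(
      (r) => r.type === 'alerting' && r.name === alertname,
    );
    if (rule) return rule.query; // the exact PromQL expression behind the alert
  }
  return undefined; // rule not found — the agent proceeds without the expression
}
```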
AI Agent investigation — The agent has access to 5 MCP tool servers:
- Kubernetes MCP — pod status, logs, events, resource consumption
- Grafana MCP — PromQL/LogQL queries, dashboard lookups, Sift analysis
- DigitalOcean MCP — DOKS cluster info, App Platform, networking
- GitHub MCP — recent releases, PRs, commits (to check if a deploy caused the issue)
- Qdrant Vector Store — RAG over internal infrastructure documentation (service topology, naming conventions, traffic flow)
Thread matching — After analysis, the workflow searches the Mattermost channel for the original alert post (matched by alertname in the attachment title) and replies in the same thread.
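Outside n8n, the same logic looks roughly like this against the Mattermost REST API (a sketch: MM_URL, the MM_TOKEN env var, and the 60-post lookback are placeholder assumptions):

```typescript
// Thread-matching sketch — find the alert post and reply in its thread.
const MM_URL = 'https://mattermost.example.com';
const headers = {
  Authorization: `Bearer ${process.env.MM_TOKEN}`,
  'Content-Type': 'application/json',
};

async function replyUnderAlert(channelId, alertname, report) {
  // Scan recent channel posts for an attachment whose title contains the alertname
  const res = await fetch(`${MM_URL}/api/v4/channels/${channelId}/posts?per_page=60`, { headers });
  const { order, posts } = await res.json();
  const rootId = order.find((id) =>
    (posts[id].props?.attachments ?? []).some((a) => a.title?.includes(alertname)),
  );
  if (!rootId) return; // no matching alert post — could fall back to a standalone post

  // root_id ties the reply to the original alert's thread
  await fetch(`${MM_URL}/api/v4/posts`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ channel_id: channelId, root_id: rootId, message: report }),
  });
}
```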
Key design decisions:
- The agent is restricted to read-only operations — no changes to infrastructure through MCP
- Tool retries are limited (max 1 retry per failed call) to avoid loops
- Alert labels are pre-processed into a clean prompt format, stripping noise like job, instance, and metrics_path (see the sketch after this list)
- The PromQL trigger query is injected into the prompt, so the agent can re-evaluate the condition itself
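The pre-processing amounts to something like this (a sketch: the noise list and the prompt layout here are illustrative, not copied from the workflow):

```typescript
// Prompt-formatting sketch — strips noisy labels and injects the trigger PromQL.
const NOISE_LABELS = new Set(['job', 'instance', 'metrics_path']);

function formatAlertForPrompt(alert, triggerQuery) {
  const labels = Object.entries(alert.labels ?? {})
    .filter(([k]) => !NOISE_LABELS.has(k))
    .map(([k, v]) => `${k}=${v}`)
    .join(', ');
  return [
    `Alert: ${alert.labels.alertname} (severity: ${alert.labels.severity ?? 'unknown'})`,
    `Labels: ${labels}`,
    `Summary: ${alert.annotations?.summary ?? ''}`,
    `Trigger query (PromQL): ${triggerQuery}`, // lets the agent re-evaluate the condition
  ].join('\n');
}
```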
Stack:
Alertmanager, n8n (self-hosted), OpenAI API, Grafana + Prometheus + Loki, Kubernetes (DOKS), Qdrant, Mattermost, DigitalOcean
Workflow JSON attached below — you’ll need to configure your own MCP server endpoints, credentials, and SetVars values.
I’d love your input: What additional data sources or context would you feed into the agent to make the analysis more effective? I’m considering adding database query stats and deployment pipeline status, but curious what others have found valuable in incident investigation.