Sharing workflow for AI-based decision to suggest/automate fixing DevOps incidents

Tom_Cao · July 3, 2025, 9:10am

I built an AI-Agent based decision workflow that auto-fixes incidents using Prometheus + n8n + bash scripts (based on what you defined).

The visualization of the flow:

Overall how it works

CPU spike or Disk almost full detected
Analyzed the runaway process
Try to reload core service and clean un-necessary logs
Performance restored
Inform, report and postmortem

N8n components I’m using:

Prometheus - Grafana: For monitoring and metrics collection.
Alertmanager: To send alerts, and n8n to orchestrate a workflow
First bash script: To do a system check and analyze
First AI-agent node: To analyze the result from the system check scripts. The agent will evaluate how to interact with the issues (just notify or take immediate action).
If it’s medium and low cases, just send a notification to Discord. If it’s kind of urgent actions, go the critical flow
Second AI-agent node (in critical flow): Analyzes the actual system state and creates specific fix commands. The commands could be clean logs, try to restart some services to release the resources.
Second bash script: to run the commands under AI-agent suggestions.
Finally, send some post-mortem and reports through Discord.

𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀 and considerations of this flow:

Non-deterministic AI behavior: We won’t know would it will constantly fix the issues
Data privacy concerns: server data is sent to OpenAI to analyze

I’ve shared all materials in Github repo:

I hope this helps you to apply in your server or VPS, to help you self-fix your services just in case of an emergency. I would love to hear your feedback and collaboration to improve the flow.