Hi everyone,
I’m building an AI-heavy workflow in n8n that processes large documents and makes multiple LLM/API calls. The problem is that some executions take a long time, and I’m starting to hit timeout/reliability issues.
Current flow looks like:
Webhook → Download File → Extract Text → AI Processing → Save Result
Issues I’m seeing:
• Long executions sometimes fail midway
• If one AI call fails, the whole workflow retries
• Memory usage grows with large documents
• Webhook clients time out waiting for a response
I’ve considered:
• Splitting the workflow into smaller workflows
• Using queues/background processing
• Saving intermediate state/checkpoints
• Returning early from webhook and processing async
For people running AI workflows in production with n8n:
• What architecture works best for long-running jobs?
• How do you avoid execution timeouts and retries reprocessing everything?
• Do you split workflows by stage or keep one large workflow?
For long AI workflows, the best approach is usually to break the workflow into smaller stages instead of one huge execution.
Webhook → Create Job → Return Response
↓
Queue/Worker
↓
Extract → AI Process → Save Result
Why this works best:
• The webhook responds immediately, so the client doesn’t time out
• Each stage runs separately, which makes retries easier
• A failed AI step doesn’t restart everything
• Memory usage is lower
It also helps to:
• Save progress/checkpoints after major steps
• Process large files in chunks
• Retry only the failed stages
• Use queues/background workers for AI calls
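To make the "process in chunks, retry only what failed" idea concrete, here is a minimal Python sketch. It is only an illustration of the logic, not n8n code: `split_into_chunks`, `process_chunks`, and the `call_model` callback are hypothetical names, and chunking by character count is an assumption (in practice you would probably split on tokens or sections).

```python
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def process_chunks(
    chunks: List[str],
    call_model: Callable[[str], str],  # hypothetical wrapper around the LLM/API call
    results: Dict[int, str],           # chunk index -> output, kept between runs
    max_attempts: int = 3,
) -> List[int]:
    """Process every chunk not already in results; return the indexes that still failed."""
    failed: List[int] = []
    for index, chunk in enumerate(chunks):
        if index in results:
            continue  # finished in an earlier run, so no second AI call
        for attempt in range(max_attempts):
            try:
                results[index] = call_model(chunk)
                break
            except Exception:
                if attempt == max_attempts - 1:
                    failed.append(index)
    return failed
```

Because `results` survives between runs, calling `process_chunks` again only touches the chunks that failed last time.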
The part I’d be most careful with is retries, not just timeouts. For AI workflows, I’d make each expensive step idempotent by saving a status for every stage, such as extracted, chunked, summarized, completed, or failed. Before running an AI call, the workflow should check whether that stage was already completed, so a retry can continue from the last checkpoint instead of spending tokens on the same document again. This also makes failures easier to debug because you can see exactly which stage failed instead of only seeing one long failed execution.
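Roughly, the check-before-you-run logic looks like the Python sketch below. This is an illustration under assumptions, not an n8n feature: `run_pipeline`, the `job` dict layout, and the stage callables are all made up for the example, and in n8n each stage would be a node or sub-workflow instead.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[dict], None]]  # (status name, function that does the work)


def run_pipeline(job: dict, stages: List[Stage]) -> str:
    """Run stages in order, skipping any already recorded as done for this job."""
    done = job.setdefault("done", [])
    for name, step in stages:
        if name in done:
            continue  # checkpoint hit: do not spend tokens on this stage again
        try:
            step(job)
        except Exception as exc:
            # Record exactly which stage failed so it is easy to debug later.
            job["status"] = "failed"
            job["failed_stage"] = name
            job["error"] = str(exc)
            return job["status"]
        done.append(name)
        job["status"] = name
    job["status"] = "completed"
    return job["status"]
```

Calling `run_pipeline` a second time on the same `job` resumes at the stage that failed, since everything before it is already in `job["done"]`.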
The concrete n8n pattern for this: in your intake webhook, set the “Respond” option to “Immediately” so the HTTP client gets a 200 right away. Then use an “Execute Workflow” node (with “Wait for sub-workflow” unchecked) to fire off the heavy processing in the background - that sub-workflow runs independently with its own execution timeout clock.
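If it helps to see the shape of "respond right away, do the work in the background" outside of n8n, here is a small Python sketch of the same idea. Everything in it is an assumption for illustration: `handle_webhook`, `worker`, the in-memory `jobs` dict, and the use of a thread plus `queue.Queue` stand in for the webhook node, the sub-workflow, and whatever queue you actually run.

```python
import queue
import threading
import uuid
from typing import Callable, Dict

jobs: Dict[str, str] = {}                 # job_id -> status (hypothetical in-memory store)
job_queue: queue.Queue = queue.Queue()    # stands in for the real queue/worker setup


def handle_webhook(file_url: str) -> dict:
    """Register the job, enqueue it, and return at once without doing the heavy work."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = "queued"
    job_queue.put((job_id, file_url))
    return {"job_id": job_id, "status": "queued"}


def worker(process: Callable[[str], None]) -> None:
    """Drain the queue in the background; one failed job does not stop the others."""
    while True:
        item = job_queue.get()
        if item is None:  # sentinel used to shut the worker down
            job_queue.task_done()
            break
        job_id, file_url = item
        jobs[job_id] = "processing"
        try:
            process(file_url)
            jobs[job_id] = "completed"
        except Exception:
            jobs[job_id] = "failed"
        finally:
            job_queue.task_done()


# Usage: start the worker once, then every webhook call returns immediately.
# threading.Thread(target=worker, args=(my_process_fn,), daemon=True).start()
```

The client only ever waits for `handle_webhook`, and it can look up the job status later with the returned `job_id`.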
For checkpointing, store stage status in a DB (or even a simple Google Sheet row) keyed by job_id. Each sub-workflow checks the status on entry and skips already-completed stages. This way a retry picks up exactly where it left off instead of reprocessing from the start.
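As a sketch of what that status table could look like, here is a small Python example using SQLite. The table name `job_stages`, its columns, and the helper functions are assumptions for illustration; any database (or a sheet with the same columns) would work the same way.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint store and create the (hypothetical) job_stages table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_stages ("
        " job_id TEXT NOT NULL,"
        " stage TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " PRIMARY KEY (job_id, stage))"
    )
    return conn


def set_stage(conn: sqlite3.Connection, job_id: str, stage: str, status: str) -> None:
    """Insert or overwrite the status of one stage for one job."""
    conn.execute(
        "INSERT OR REPLACE INTO job_stages (job_id, stage, status) VALUES (?, ?, ?)",
        (job_id, stage, status),
    )
    conn.commit()


def is_completed(conn: sqlite3.Connection, job_id: str, stage: str) -> bool:
    """Return True only if this stage was already marked completed for this job."""
    row = conn.execute(
        "SELECT status FROM job_stages WHERE job_id = ? AND stage = ?",
        (job_id, stage),
    ).fetchone()
    return row is not None and row[0] == "completed"
```

Each stage would call `is_completed` on entry and `set_stage` on exit, so a retry skips straight past the stages that already finished.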