Workflow stops silently mid-execution — nested sub-workflows + AI loop over 1000+ items — memory/timeout issue?

Hi everyone,

I’m building a large-scale AI classification pipeline in n8n that processes thousands of records per run and assigns each one a label using an AI agent. The workflow runs on a schedule and is functional end-to-end, but stops randomly during execution — no error message, no crash log; it simply halts mid-run. I believe this is a memory or timeout issue, but I need expert input on the architecture before I refactor.


What the workflow does

The pipeline is organized in three nested workflow layers; end to end, it does the following:

  1. Fetches a list of top-level parent records (~5–16 items)

  2. Per parent → fetches associated child groups (~3–8 per parent)

  3. Per child group → fetches a context object (4 reference datasets that define what is and isn’t valid for that group) and a list of input records to classify (~100–300 per group)

  4. Each input record is sent individually to an AI agent together with the context object

  5. The AI returns a classification label + reasoning

  6. Results are written to Google Sheets (one tab per child group)

The context object must be attached per child group because the same input record can receive a different classification depending on which group it belongs to. The per-group context evaluation is intentional and architecturally necessary.
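To make step 4 concrete, here is a hedged sketch of what a single per-record request to the AI agent might look like. The function and field names (`buildClassificationPayload`, `validLabels`) are illustrative assumptions, not the OP's actual node output:

```javascript
// Sketch: one classification request = one input record + the merged
// context of the child group it belongs to. The same record can get a
// different label under a different group's context.
function buildClassificationPayload(record, groupContext) {
  return {
    prompt: [
      "Classify the following record using only the rules below.",
      `Valid labels: ${JSON.stringify(groupContext.validLabels)}`,
      `Record: ${JSON.stringify(record)}`,
      'Return JSON: { "label": string, "reasoning": string }',
    ].join("\n"),
  };
}

const context = { validLabels: ["A", "B", "C"] };
const payload = buildClassificationPayload({ id: 1, text: "example" }, context);
console.log(payload.prompt.split("\n")[0]);
```

This is only the payload shape; in the actual workflow the AI Agent node and Structured Output Parser handle the call and response validation.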


Current architecture — 3 nested workflow layers

Layer 1 — Master Workflow
  Schedule Trigger
  → Fetch parent records (~5–16 items)
  → Loop Over Items
    → Execute Sub-workflow: Group Processor

Layer 2 — Group Processor Sub-workflow
  → Fetch child groups for this parent
  → Fetch context dataset A (parent-level)
  → Fetch context dataset B (shared/global)
  → Merge context
  → Loop Over Items (all child groups)
    → Execute Sub-workflow: Item Classifier

Layer 3 — Item Classifier Sub-workflow
  → Fetch context dataset C (child group-level)
  → Fetch context dataset D (child group-level)
  → Fetch input records for this child group
  → Merge all 4 context datasets into each input record item
  → Loop Over Items (all input records)
    → Build AI prompt (includes full merged context)
    → AI Agent (OpenAI)
    → Structured Output Parser
    → Append result to Google Sheets

Scale for a typical run:

  • 5 parents × 5 child groups × 200 input records = 5,000 AI calls per run

  • Each input record item carries the full merged context (~50KB) duplicated across all items

  • Estimated total data held in memory during a single run: ~250MB
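The scale figures above can be sanity-checked with quick arithmetic (decimal units assumed for the ~250MB estimate):

```javascript
// Back-of-envelope check of the numbers quoted above.
const parents = 5, groupsPerParent = 5, recordsPerGroup = 200;
const aiCalls = parents * groupsPerParent * recordsPerGroup; // 5 × 5 × 200
const contextKB = 50;                                        // merged context per item
const totalMB = (aiCalls * contextKB) / 1000;                // duplicated across all items
console.log(aiCalls, totalMB); // 5000 250
```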


The symptoms

  • Workflow stops without any error message or failed node

  • Sometimes halts mid-loop after 10–20 iterations

  • Sometimes stops after 30–45 minutes of running

  • Consistently happens on larger runs with more records

  • Running via manual execution (Execute workflow button in the editor)

  • Hosted on n8n Cloud


What I think is causing it

Based on the n8n documentation, I suspect a combination of the following:

  1. Context duplication across all items — Each of the 5,000 input record items carries the full merged context object attached to it. The same reference data is duplicated 5,000 times in the items array instead of being held once and referenced. This alone likely causes significant heap pressure.

  2. Sub-workflow return data stacking — The docs state that Execute Workflow returns the output of the last node back to the parent. My Layer 3 sub-workflow’s last active node has 200 classified items flowing through it, so I assume all 200 items are being passed back to Layer 2. Layer 2 then returns results from all child groups back to Layer 1. This creates a data accumulation pyramid across the nesting levels.

  3. Manual execution timeout — The docs note that manual executions copy data to the frontend UI, increasing memory consumption. With 5,000 AI calls at ~2–3 seconds each, a full run takes roughly 3–4 hours. I suspect there is a Cloud execution timeout I’m silently hitting.

  4. No intermediate persistence — Results are written to Sheets only at the end of each Layer 3 loop. A crash mid-run means no output is saved and the run must restart from zero.
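Suspicion #1 can be demonstrated in isolation. The sketch below (plain JavaScript, not actual n8n node output — item shapes and the `contextKey` lookup are illustrative assumptions) compares attaching the full context to every item against passing only a small key per item and holding the context once:

```javascript
// ~50 KB of reference data, stand-in for the merged context object.
const fullContext = { rules: "x".repeat(50_000) };

// Naive: every item carries the full context. Once serialized between
// nodes, the context is duplicated N times in the items array.
const naive = Array.from({ length: 200 }, (_, i) => ({ id: i, context: fullContext }));

// Dedup: items carry only a key; the context lives in one shared slot
// and is looked up where it is actually needed (e.g. prompt building).
const contextStore = { "group-1": fullContext };
const dedup = Array.from({ length: 200 }, (_, i) => ({ id: i, contextKey: "group-1" }));

const naiveBytes = JSON.stringify(naive).length;
const dedupBytes = JSON.stringify(dedup).length + JSON.stringify(contextStore).length;
console.log(naiveBytes > 100 * dedupBytes); // duplicated form is >100× larger here
```

In n8n terms, the dedup pattern usually means fetching or referencing the context from a single upstream node inside the loop (via an expression) rather than merging it into every item.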


Specific questions

  1. Is 3-level nested Execute Workflow a known stability problem on n8n Cloud? Is there a recommended maximum nesting depth for sub-workflows?

  2. What is the actual execution timeout on n8n Cloud (Starter and Pro plans)? Does it differ between manual executions and production (scheduled) executions?

  3. Sub-workflow return data best practice — Should the last node in every sub-workflow always return a minimal summary object (e.g. { status: "done", processed: 45 }) rather than the full processed item set? If 200 items flow into the last node of a sub-workflow, do all 200 get passed back to the parent workflow through the Execute Workflow node?

  4. Context deduplication inside a loop — The context object is identical for all input records within a single child group. What is the recommended pattern for building that context once per group and making it available across 200 loop iterations without attaching the full object to every item?

  5. Loop Over Items batch size with Execute Workflow — I’m using the default batch size. Should this be set to 1 when each iteration triggers an Execute Workflow node, so that the memory from each sub-workflow execution is freed before the next iteration begins?

  6. Production vs manual execution for long-running workflows — Should long-running pipelines like this always be triggered via Schedule Trigger in production mode rather than the manual execute button? Is the execution timeout significantly higher (or absent) for scheduled production runs?

  7. Crash recovery and idempotency — If the workflow stops at item 150 of 200, is there a clean n8n-native pattern for resuming without reprocessing the first 149 items? I’m considering writing results inside the loop (not at the end) and checking for existing records at the start of each iteration, but I’d like to know if there’s a better approach.
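For questions 3 and 7, here is a hedged sketch of the two patterns being asked about: collapsing a sub-workflow's output into a minimal summary object, and skipping already-processed records on resume. All names (`summarize`, `pendingOnly`, the `id` field) are illustrative assumptions, not n8n built-ins:

```javascript
// Q3: a final Code node can return one small summary object instead of
// letting 200 classified items flow back to the parent workflow.
function summarize(classified) {
  return { status: "done", processed: classified.length };
}

// Q7: idempotent resume — filter out records whose ids already exist in
// the output sheet before entering the classification loop.
function pendingOnly(records, existingIds) {
  const done = new Set(existingIds);
  return records.filter((r) => !done.has(r.id));
}

const classified = Array.from({ length: 200 }, (_, i) => ({ id: i, label: "A" }));
console.log(summarize(classified)); // { status: 'done', processed: 200 }

// Simulate a crash at item 150: only 50 records remain to process.
const alreadyWritten = classified.slice(0, 150).map((r) => r.id);
console.log(pendingOnly(classified, alreadyWritten).length); // 50
```

The resume check assumes results are appended to Sheets inside the loop (per iteration), so the existing-id list reflects actual progress at restart time.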


Environment

  • n8n Cloud, latest version

  • External API calls via HTTP Request nodes

  • OpenAI via the AI Agent node with Structured Output Parser

  • Google Sheets for output storage

  • Approximately 5,000–15,000 AI calls per full run depending on input volume

This is a classic Automation Ops failure mode: long-running and high-volume workflows (especially with AI loops) will eventually fail silently unless you add specific guardrails and observability.

When you hit 1,000+ items, you’re likely fighting memory fragmentation or silent timeouts at the sub-workflow boundary. Here is a quick runbook to stabilize this:

  1. Add Heartbeat Markers: Use a simple ‘Log’ step or a lightweight DB write at the start of each loop iteration. Without a heartbeat, you’re flying blind when it stops.

  2. Batching & Queuing: Don’t process all 1,000 items in a single execution context. Cap your batch sizes and, if possible, move to a Queue/Worker pattern (Redis) to decouple the heavy lifting.

  3. Monitor Resource Pressure: Large AI payloads + nested loops are notorious for triggering OOM (Out of Memory). Track your memory growth—if it spikes at the same point every time, your sub-workflow is likely holding too much state.
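Runbook step 1 can be sketched as a tiny record emitted at the top of each loop iteration — in n8n, a Code node feeding a lightweight DB or Sheets write. The shape below is an illustrative assumption, not a prescribed schema:

```javascript
// Heartbeat marker: enough to pinpoint where a silent stop happened
// (which run, which group, which item) and when the last sign of life was.
function heartbeat(runId, groupId, itemIndex) {
  return {
    runId,
    groupId,
    itemIndex,
    at: new Date().toISOString(),
  };
}

const hb = heartbeat("run-42", "group-1", 17);
console.log(hb.runId, hb.groupId, hb.itemIndex);
```

When the workflow halts, the last heartbeat row tells you exactly which iteration to resume from.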

If you can share your workflow topology (JSON or screenshot) and your current execution mode, I’d be happy to suggest a more production-safe diagram for this setup.

Thank you all. I solved the problem!

All posts here are AI-generated and are breaking our forum rules. I’m locking this down. All forum authors: consider this a formal warning.
