My workflow keeps hanging when doing large-scale tests

Hi everyone,
I’ve been working on a workflow in n8n (Cloud) that processes a list of 99+ people. For each person, it geocodes their location, builds Ticketmaster API URLs, fetches events, flattens them, runs them through a custom “event picker” function, and then generates an AI-written email.
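
To give a concrete picture, the Flatten TM Events step boils down to a Code node along these lines. This is a heavily simplified sketch: it assumes each incoming item is one Ticketmaster Discovery API response (the `_embedded.events` path comes from that API's response shape), and everything else is illustrative.

```js
// Flatten TM Events (simplified sketch): Code node, "Run Once for All Items".
// Assumes each incoming item is one Ticketmaster Discovery API response.
const events = [];

for (const item of $input.all()) {
  // Discovery API responses put the events under _embedded.events
  const pageEvents = item.json?._embedded?.events ?? [];
  for (const ev of pageEvents) {
    events.push({ json: ev }); // one n8n item per event
  }
}

return events;
```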
The problem:
The workflow consistently crashes after a certain number of iterations (the people are run through a “Loop Over Items (Split in Batches)” node, so each iteration handles one person).
• At first, it would make it through 40–75 people before failing (running for multiple hours).
• Recently, after adding slimming/pagination changes, it sometimes fails after only 4–5 people.
• On the Executions page, the last “successful” node is usually the Flatten TM Events node or the Slim TM Events node. After that, I’ll see “Crashed in XXms” with no detailed error.
• When watching in real time, I sometimes see “Connection Lost” in the top-right corner, which makes me wonder if this is a payload/memory issue on the server.
What I’ve tried so far:
• Added probes before/after key nodes to track where the workflow stalls.
• Slimmed down the Ticketmaster HTTP response before flattening, dropping unnecessary nested fields (there’s a sketch of what I mean right after this list).
• Set the workflow to “don’t save data for successful nodes” to cut down on how much execution data gets stored.
• Limited the per-request page size (125–150 events) and spread the queries across 4 weekly date windows.
• Added deduplication and junk filtering.
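
Here’s roughly the kind of slimming I mean, as a simplified Code-node sketch. The field paths follow the Ticketmaster Discovery API event object; exactly which fields get kept is illustrative.

```js
// Slim TM Events (simplified sketch): keep only the fields later nodes need.
// Field paths follow the Ticketmaster Discovery API event object.
return $input.all().map((item) => {
  const ev = item.json;
  return {
    json: {
      id: ev.id,
      name: ev.name,
      url: ev.url,
      date: ev.dates?.start?.localDate,
      venue: ev._embedded?.venues?.[0]?.name,
      city: ev._embedded?.venues?.[0]?.city?.name,
      segment: ev.classifications?.[0]?.segment?.name,
    },
  };
});
```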
What I’ve observed:
• Flatten TM Events sometimes collects 300+ events (good) but other times only ~70.
• When it fails, it’s not always at the same spot (sometimes after 4 people, sometimes after 65).
• Failures are inconsistent, which makes me think payload size, pagination handling, or memory exhaustion is the root cause.
My questions:

  1. Does this sound like a payload size/memory issue? If so, is there a recommended way in n8n Cloud to handle large JSON outputs (hundreds of events per person) without crashing the execution?
  2. Could it also be related to pagination? I noticed sometimes only page 0 of Ticketmaster events is pulled, even though there are multiple pages (so maybe our HTTP node isn’t looping correctly; there’s a sketch of how I think the paging should work right after these questions).
  3. Is the Split In Batches loop setup correct for processing ~99 people sequentially, or could that be part of the problem?
  4. What are the best practices to avoid “Crashed in XXms” when workflows need to handle hundreds of items per person across nearly 100 people? Should I be offloading to a database mid-workflow instead of carrying large arrays between nodes?
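
For reference, this is roughly how I’d expect the paging to work if I moved it out of the HTTP Request node and into a Code node. It’s only a sketch: it assumes the per-person URL is on the item as `tmUrl` (that name is made up), uses `this.helpers.httpRequest`, which I understand the Code node exposes, and hard-codes the 150 page size from above.

```js
// Pagination sketch: Code node, "Run Once for All Items".
// Assumes each incoming item carries the Ticketmaster URL built earlier
// (called json.tmUrl here; that name is made up) with apikey/size/date
// params already on it, but no page parameter.
const out = [];

for (const item of $input.all()) {
  const baseUrl = item.json.tmUrl;
  const events = [];
  let pageNum = 0;
  let totalPages = 1;

  while (pageNum < totalPages) {
    // this.helpers.httpRequest is the HTTP helper the Code node exposes;
    // assuming the JSON body comes back already parsed.
    const res = await this.helpers.httpRequest({
      method: 'GET',
      url: `${baseUrl}&page=${pageNum}`,
    });

    events.push(...(res._embedded?.events ?? []));
    totalPages = res.page?.totalPages ?? 1;
    pageNum += 1;

    // The Discovery API caps deep paging (size * page has to stay under
    // 1000, as far as I know), so bail out before hitting that limit.
    if (pageNum * 150 >= 1000) break;
  }

  out.push({ json: { tmUrl: baseUrl, events } });
}

return out;
```

If something like that is needed and the HTTP node really is only pulling page 0, it would also explain why Flatten TM Events sometimes only sees ~70 events.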
Any insight into why it sometimes runs fine for dozens of people and other times dies almost immediately would be greatly appreciated. I’m trying to confirm whether this is something I need to fix in my node design (e.g., pagination or slimming) or a hard limit of n8n Cloud memory that requires restructuring the workflow.
Thanks in advance for any help!