Solving Concurrency & Execution Isolation in a Multi-Tenant n8n Architecture

I’m facing a major scalability issue in my multi-tenant n8n WhatsApp automation system.
Multiple webhook events trigger workflows simultaneously, causing:
• Race conditions
• Duplicate executions
• Queue congestion
• Database collisions
• Credential isolation risks
The challenge is maintaining execution isolation and preventing duplicate message processing while scaling workers horizontally in queue mode.
const lockKey = `msg:${messageId}`;
const lock = await redis.set(
  lockKey,
  "processing",
  "NX",
  "EX",
  60
);
if (!lock) {
  return "Duplicate execution blocked";
}

Hi @Keira_Becky, to solve this issue properly, I would use:
• Redis distributed locking to prevent duplicate executions
• Idempotency keys for webhook validation
• Queue mode with multiple workers for horizontal scaling
• Tenant-based execution context isolation to prevent credential crossover
• Retry-safe workflow design with dead-letter handling
• PostgreSQL transaction control for safe database operations

The main goal is ensuring that each webhook event is processed exactly once, even under high concurrent traffic.

Good morning @Keira_Becky,
I would move the ordering out of the workflow and into an external queue partitioned by tenantId + conversationId. This way, messages from the same conversation are processed in order, while different tenants/conversations keep running in parallel. I would still keep a unique constraint in the database on tenantId + messageId as a final layer of protection against duplicates.
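
As a rough illustration of the partitioning idea, here is a minimal in-process sketch. It is only a stand-in for a real external queue: the `PartitionedQueue` class and its `enqueue` method are hypothetical names, and a single Node.js process is assumed. Jobs sharing a tenantId + conversationId run one at a time in arrival order, while other partitions proceed in parallel.

```javascript
// Hypothetical in-process stand-in for an external partitioned queue.
class PartitionedQueue {
  constructor() {
    // partition key -> promise of the most recently queued job
    this.tails = new Map();
  }

  enqueue(tenantId, conversationId, job) {
    const key = `${tenantId}:${conversationId}`;
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Chain after the previous job; a failed job must not block the partition.
    const current = previous.catch(() => {}).then(() => job());
    this.tails.set(key, current);
    // Drop the entry once the partition is idle so the map does not grow forever.
    const cleanup = () => {
      if (this.tails.get(key) === current) this.tails.delete(key);
    };
    current.then(cleanup, cleanup);
    return current;
  }
}
```

With a real broker, the same effect usually comes from using the partition key as the queue's grouping or routing key, so one consumer at a time handles a given conversation.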

The Redis distributed lock approach you’re already using is the right instinct. For the credential isolation piece specifically - use a single “router” workflow that reads tenant config and passes credentials as parameters to sub-workflows via the Execute Workflow node. Each sub-workflow doesn’t store credentials in its own nodes; it receives the credential name or ID dynamically and uses the Credential node’s expression-based selection. This keeps the shared workflow code in one place while tenant state stays in your DB. For the concurrency cap per tenant, set QUEUE_WORKER_CONCURRENCY at the worker level and route high-volume tenants to dedicated worker instances using queue name labels if you need true isolation at scale.

Two bugs in the Redis lock. EX 60 is too short: if processing takes longer (Gemini hangs, OpenAI is slow, whatever), the lock expires and Meta’s next duplicate webhook acquires it and runs in parallel. Bump the TTL to ~900s and release explicitly on success.
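
A minimal sketch of the longer TTL plus explicit release, assuming an ioredis-style client is passed in as `redis`. The `withMessageLock` helper and the `{ ran, result }` return shape are hypothetical. The release uses a small Lua script so a worker only deletes the key if it still holds its own token.

```javascript
const { randomUUID } = require("node:crypto");

// Delete the key only if it still holds our token (atomic compare-and-delete).
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
end
return 0
`;

// `redis` is assumed to be an ioredis-style client exposing set() and eval().
async function withMessageLock(redis, lockKey, ttlSeconds, work) {
  const token = randomUUID(); // identifies this particular holder
  const acquired = await redis.set(lockKey, token, "EX", ttlSeconds, "NX");
  if (acquired !== "OK") {
    return { ran: false }; // another worker currently holds the lock
  }
  try {
    return { ran: true, result: await work() };
  } finally {
    // Release explicitly instead of waiting for the TTL to expire.
    await redis.eval(RELEASE_SCRIPT, 1, lockKey, token);
  }
}
```

A call might look like `await withMessageLock(redis, lockKey, 900, () => processMessage(msg))`, where `processMessage` is a placeholder. Note that this version also releases on failure so a retry can run, and once the lock is released it no longer blocks a late duplicate, so it only guards in-flight work.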

The bigger issue is the key. msg:{messageId} dedupes Meta’s retries fine, but doesn’t serialize per conversation: two messages from the same user 200ms apart still process in parallel and reply ordering breaks. You need both: an idempotency key msg:{tenantId}:{messageId} for retries, plus a serialization key conv:{tenantId}:{conversationId} for FIFO.
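
The two keys could be built with a pair of small helpers like these. The function names are hypothetical, and the sketch assumes the IDs themselves contain no ":" characters.

```javascript
// Dedup key: identical for every retry of the same webhook delivery.
function idempotencyKey(tenantId, messageId) {
  return `msg:${tenantId}:${messageId}`;
}

// Ordering key: identical for every message in the same conversation.
function serializationKey(tenantId, conversationId) {
  return `conv:${tenantId}:${conversationId}`;
}
```

Two different messages in one conversation then get different idempotency keys but the same serialization key, which is what lets retries be dropped while replies stay in order.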

Real “exactly once” comes from the DB side anyway: INSERT … ON CONFLICT (tenant_id, message_id) DO NOTHING on your outbound record. The lock is for performance; the DB constraint is for correctness.
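
A minimal sketch of that DB-side claim, assuming `db` exposes a node-postgres style `query(text, values)`. The `claimMessage` helper, the `outbound_messages` table, and its column names are hypothetical, and a UNIQUE constraint on (tenant_id, message_id) is assumed to exist.

```javascript
// `db` is assumed to behave like a node-postgres Pool or Client.
// Table and column names are placeholders for illustration only.
async function claimMessage(db, tenantId, messageId) {
  const result = await db.query(
    `INSERT INTO outbound_messages (tenant_id, message_id)
     VALUES ($1, $2)
     ON CONFLICT (tenant_id, message_id) DO NOTHING`,
    [tenantId, messageId]
  );
  // rowCount is 1 when the row was inserted, 0 when the conflict clause skipped it.
  return result.rowCount === 1;
}
```

A worker would only send the reply when `claimMessage` returns true; a false result means another execution already recorded this message.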

The DB-based credential pattern is the stronger approach - storing encrypted_credentials in a separate table means the workflow code has zero hardcoded tenant context, and a compromised workflow export can’t leak credentials. The INSERT … ON CONFLICT DO NOTHING point is also something I didn’t cover - that’s the right way to enforce exactly-once at the data layer rather than relying solely on the lock. Good additions.

Hi @Keira_Becky, your approach is already correct: Redis locking is the right direction.

What you’re solving:
• Prevent duplicate webhook runs
• Avoid race conditions in a multi-worker setup

Do this to make it production-safe:

  1. Add tenant isolation to the lock: const lockKey = `tenant:${tenantId}:msg:${messageId}`;
  2. Protect the database with a unique constraint, scoped to the tenant like the lock key: ON CONFLICT (tenant_id, message_id) DO NOTHING

Webhook → Redis Lock → Queue → Worker → DB

Redis lock stops duplicates early, but DB idempotency is your final safety net.

Thanks @Emmas @achamm @syed_noor @Niffzy @tamy.santos @nguyenthieutoan

I really appreciate the replies and learned a lot of approaches from them.

You’re welcome! Happy we were all able to help you out!

You’re welcome @Keira_Becky

The unique constraint on (tenant_id, message_id) is the part that actually makes this safe, and it’s worth pulling apart from the lock, because the two are doing different jobs. INSERT … ON CONFLICT DO NOTHING is your correctness boundary. The SET NX EX lock is a best-effort optimization so a worker doesn’t spend time on a message another worker already picked up. If you ever have to decide which one to trust, trust the constraint.

That separation matters because of the case achamm flagged: a slow Gemini or OpenAI call outliving the 60s TTL. Moving the TTL to 900s makes that rarer, it doesn’t make it safe. A SET NX EX lease stops being mutual exclusion the moment the critical section can outlive the lease, and a worker has no way to tell “I still hold the lock” apart from “my lease expired 200ms ago and someone else holds it now.” A GC pause, a stalled HTTP call, or a short Redis partition all land in the same place. The fix isn’t a longer lease, it’s a fencing token: bump a per-conversation counter on acquire, carry it through the run, and have the write side reject anything that arrives with a stale token. If the effect is already idempotent at the DB, which ON CONFLICT per message gives you, the lock can stay best-effort and you stop depending on its timing for correctness.
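
To make the fencing-token idea concrete, here is a small in-memory sketch. The `FencedConversationStore` class and its method names are hypothetical. In a real deployment the counter and the guarded write would live in Redis or Postgres so that each step is atomic; this version only shows the accept/reject rule.

```javascript
// In-memory illustration only; not safe across processes.
class FencedConversationStore {
  constructor() {
    this.counters = new Map(); // conversation key -> last issued token
    this.highest = new Map();  // conversation key -> highest token seen by a write
    this.replies = new Map();  // conversation key -> last accepted reply
  }

  // Called when a worker acquires the lock: issue a strictly increasing token.
  acquire(conversationKey) {
    const token = (this.counters.get(conversationKey) ?? 0) + 1;
    this.counters.set(conversationKey, token);
    return token;
  }

  // Called on the write side: reject anything older than the newest token seen.
  write(conversationKey, token, reply) {
    const highest = this.highest.get(conversationKey) ?? 0;
    if (token < highest) return false; // stale holder, its lease was superseded
    this.highest.set(conversationKey, token);
    this.replies.set(conversationKey, reply);
    return true;
  }
}
```

If worker A acquires token 1, stalls past its lease, and worker B acquires token 2 and writes, A's late write with token 1 is refused. A stale write that lands before any newer token has written is still accepted, which is one more reason to keep the effect itself idempotent.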

One n8n-specific landmine worth flagging: don’t keep any of this dedup state in $getWorkflowStaticData(). In queue mode it isn’t safe for cross-execution state. The engine reads the whole staticData blob into memory when the execution starts, mutates a copy, and writes the entire column back at the end with a plain UPDATE on the workflow row (saveStaticDataById in workflow-static-data.service.ts on 2.x). There’s no per-workflow serialization. The concurrency-control service is keyed globally on production/evaluation and is turned off entirely in queue mode, so two runs of the same workflow read the same baseline and the later write silently overwrites the earlier one. It’s the same last-write-wins the team acknowledged for concurrent workflow saves in #11638. Keep per-tenant cursors and counters in Redis or Postgres where you own the atomicity.
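
The lost update described above can be illustrated with a small simulation. This is not n8n code: the function names and the `processed` field are made up, and it only mimics "read the whole blob, mutate a copy, write the whole blob back" next to an atomic increment.

```javascript
// Simulation only: every execution reads the same baseline blob, then writes
// its whole copy back, so the last writer wins.
function runWithBlobOverwrite(row, executionCount) {
  const copies = Array.from({ length: executionCount }, () => ({ ...row.staticData }));
  for (const copy of copies) {
    copy.processed = (copy.processed ?? 0) + 1;
    row.staticData = copy; // whole-blob write replaces the earlier result
  }
  return row.staticData.processed;
}

// Stand-in for an atomic operation such as Redis INCR or
// UPDATE ... SET n = n + 1 in Postgres.
function runWithAtomicIncrement(counter, executionCount) {
  for (let i = 0; i < executionCount; i += 1) {
    counter.value += 1;
  }
  return counter.value;
}
```

With two overlapping executions, the blob version ends with a count of 1 while the atomic version ends with 2, which is the difference between owning the atomicity and not.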

The credential-isolation point is a separate axis from the concurrency one and probably deserves its own thread, since the dynamic-credential pattern through Execute Workflow has failure modes that have nothing to do with dedup.

Happy to share what the staticData read and write path looks like in the 2.x source if it helps you rule it in or out for your setup.