Hi @Keira_Becky. To solve this issue properly, I would use:
Redis distributed locking to prevent duplicate executions
Idempotency keys for webhook validation
Queue mode with multiple workers for horizontal scaling
Tenant-based execution context isolation to prevent credential crossover
Retry-safe workflow design with dead-letter handling
PostgreSQL transaction control for safe database operations
The main goal is ensuring that each webhook event is processed exactly once, even under high concurrent traffic.
Good morning @Keira_Becky,
I would remove the sorting from within the workflow and put it in an external queue partitioned by tenantId + conversationId. This way, messages from the same conversation are processed in order, while different tenants/conversations keep running in parallel. I would still maintain a unique constraint in the database with tenantId + messageId as a final layer of protection against duplicates.
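To make the partitioning idea concrete, here is a minimal sketch of how a (tenantId, conversationId) pair could be mapped to a stable partition, so messages from one conversation always land on the same ordered lane. The function names, the message dict shape, and the partition count are assumptions for illustration only; a real queue (Kafka, SQS FIFO, BullMQ groups, etc.) would do this routing for you through its own partition or group key.

```python
import hashlib
from collections import defaultdict


def partition_for(tenant_id: str, conversation_id: str, num_partitions: int) -> int:
    """Map a (tenant, conversation) pair to a stable partition index.

    A cryptographic hash is used only because it is stable across
    processes (Python's built-in hash() is salted per process).
    """
    key = f"{tenant_id}:{conversation_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(messages, num_partitions: int = 4):
    """Group messages by partition, preserving arrival order inside each one.

    `messages` is a hypothetical list of dicts with tenantId,
    conversationId and messageId fields.
    """
    partitions = defaultdict(list)
    for msg in messages:
        p = partition_for(msg["tenantId"], msg["conversationId"], num_partitions)
        partitions[p].append(msg)
    return partitions


if __name__ == "__main__":
    incoming = [
        {"tenantId": "t1", "conversationId": "c1", "messageId": "m1"},
        {"tenantId": "t2", "conversationId": "c9", "messageId": "m2"},
        {"tenantId": "t1", "conversationId": "c1", "messageId": "m3"},
    ]
    for partition, msgs in sorted(route(incoming).items()):
        print(partition, [m["messageId"] for m in msgs])
```

Each partition would then be drained by a single consumer at a time, which is what gives per-conversation FIFO while other partitions keep running in parallel.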
The Redis distributed lock approach you’re already using is the right instinct. For the credential isolation piece specifically - use a single “router” workflow that reads tenant config and passes credentials as parameters to sub-workflows via the Execute Workflow node. Each sub-workflow doesn’t store credentials in its own nodes; it receives the credential name or ID dynamically and uses the Credential node’s expression-based selection. This keeps the shared workflow code in one place while tenant state stays in your DB. For the concurrency cap per tenant, set QUEUE_WORKER_CONCURRENCY at the worker level and route high-volume tenants to dedicated worker instances using queue name labels if you need true isolation at scale.
Two bugs in the Redis lock. EX 60 is too short: if processing takes longer (Gemini hangs, OpenAI is slow, whatever), the lock expires and Meta’s next duplicate webhook acquires it and runs in parallel. Bump the TTL to ~900s and release explicitly on success.
The bigger issue is the key. msg:{messageId} dedupes Meta’s retries fine, but it doesn’t serialize per conversation: two messages from the same user 200ms apart still process in parallel and reply ordering breaks. You need both: idempotency_key msg:{tenantId}:{messageId} for retries, plus serialization_key conv:{tenantId}:{conversationId} for FIFO.
Real “exactly once” comes from the DB side anyway: INSERT … ON CONFLICT (tenant_id, message_id) DO NOTHING on your outbound record. The lock is for performance, the DB constraint is for correctness.
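A rough sketch of the DB-side dedup, in case it helps to see it run. It uses SQLite from the Python standard library as a stand-in for PostgreSQL so it runs without a server (the ON CONFLICT … DO NOTHING clause needs SQLite 3.24+). The table name outbound_messages, the body column, and the claim_message helper are made up for the example; in Postgres the placeholders and driver would differ, but the statement has the same shape.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical outbound table with the dedup constraint."""
    conn.execute(
        """
        CREATE TABLE outbound_messages (
            tenant_id  TEXT NOT NULL,
            message_id TEXT NOT NULL,
            body       TEXT,
            UNIQUE (tenant_id, message_id)
        )
        """
    )


def claim_message(conn: sqlite3.Connection, tenant_id: str, message_id: str, body: str) -> bool:
    """Try to record the message; return True only for the first caller.

    A duplicate (same tenant_id + message_id) hits the unique constraint,
    the insert becomes a no-op, and rowcount is 0.
    """
    cur = conn.execute(
        "INSERT INTO outbound_messages (tenant_id, message_id, body) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (tenant_id, message_id) DO NOTHING",
        (tenant_id, message_id, body),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    print(claim_message(conn, "t1", "wamid.1", "hello"))        # first delivery
    print(claim_message(conn, "t1", "wamid.1", "hello again"))  # Meta retry
```

The worker only sends the reply when claim_message returns True, so a retry that slips past the Redis lock still ends up as a no-op.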
The DB-based credential pattern is the stronger approach - storing encrypted_credentials in a separate table means the workflow code has zero hardcoded tenant context, and a compromised workflow export can’t leak credentials. The INSERT … ON CONFLICT DO NOTHING point is also something I didn’t cover - that’s the right way to enforce exactly-once at the data layer rather than relying solely on the lock. Good additions.
The unique constraint on (tenant_id, message_id) is the part that actually makes this safe, and it’s worth pulling apart from the lock, because the two are doing different jobs. INSERT … ON CONFLICT DO NOTHING is your correctness boundary. The SET NX EX lock is a best-effort optimization so a worker doesn’t spend time on a message another worker already picked up. If you ever have to decide which one to trust, trust the constraint.
That separation matters because of the case achamm flagged: a slow Gemini or OpenAI call outliving the 60s TTL. Moving the TTL to 900s makes that rarer, it doesn’t make it safe. A SET NX EX lease stops being mutual exclusion the moment the critical section can outlive the lease, and a worker has no way to tell “I still hold the lock” apart from “my lease expired 200ms ago and someone else holds it now.” A GC pause, a stalled HTTP call, or a short Redis partition all land in the same place. The fix isn’t a longer lease, it’s a fencing token: bump a per-conversation counter on acquire, carry it through the run, and have the write side reject anything that arrives with a stale token. If the effect is already idempotent at the DB, which ON CONFLICT per message gives you, the lock can stay best-effort and you stop depending on its timing for correctness.
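Here is a small in-memory sketch of the fencing-token idea, just to show the shape of it. TokenIssuer and FencedWriter are hypothetical names; in a real setup the counter would more likely be something like a Redis INCR on a per-conversation key, and the stale-token check would live in the database write itself (for example a conditional UPDATE), not in Python objects.

```python
class TokenIssuer:
    """Hands out a strictly increasing token per conversation on lock acquire."""

    def __init__(self):
        self._counters = {}

    def acquire(self, conv_key: str) -> int:
        token = self._counters.get(conv_key, 0) + 1
        self._counters[conv_key] = token
        return token


class FencedWriter:
    """Write side that rejects anything carrying a token older than one it has seen."""

    def __init__(self):
        self._highest_seen = {}
        self.accepted = []

    def write(self, conv_key: str, token: int, reply: str) -> bool:
        if token < self._highest_seen.get(conv_key, 0):
            # A newer lock holder has already written: this caller's lease
            # must have expired, so its write is dropped.
            return False
        self._highest_seen[conv_key] = token
        self.accepted.append((conv_key, token, reply))
        return True


if __name__ == "__main__":
    issuer, writer = TokenIssuer(), FencedWriter()
    key = "conv:t1:c1"

    token_a = issuer.acquire(key)  # worker A takes the lock, then stalls on the LLM call
    token_b = issuer.acquire(key)  # lease expired, worker B takes over

    print(writer.write(key, token_b, "reply from B"))       # accepted
    print(writer.write(key, token_a, "late reply from A"))  # rejected as stale
```

The point is that worker A never needs to know its lease expired; the write side makes that decision for it.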
One n8n-specific landmine worth flagging: don’t keep any of this dedup state in $getWorkflowStaticData(). In queue mode it isn’t safe for cross-execution state. The engine reads the whole staticData blob into memory when the execution starts, mutates a copy, and writes the entire column back at the end with a plain UPDATE on the workflow row (saveStaticDataById in workflow-static-data.service.ts on 2.x). There’s no per-workflow serialization. The concurrency-control service is keyed globally on production/evaluation and is turned off entirely in queue mode, so two runs of the same workflow read the same baseline and the later write silently overwrites the earlier one. It’s the same last-write-wins the team acknowledged for concurrent workflow saves in #11638. Keep per-tenant cursors and counters in Redis or Postgres where you own the atomicity.
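To illustrate why that read-copy-write-back pattern loses updates, here is a toy model in plain Python. It is not n8n code: WorkflowRow, load_static_data and save_static_data are invented stand-ins that only mimic the behaviour described above (read the whole blob at start, write the whole blob at the end), and AtomicCounter stands in for an atomic increment such as Redis INCR.

```python
import copy
import threading


class WorkflowRow:
    """Stand-in for a workflow row holding one static-data blob."""

    def __init__(self):
        self.static_data = {"processed": 0}


def load_static_data(row: WorkflowRow) -> dict:
    """Execution start: read the whole blob into a private copy."""
    return copy.deepcopy(row.static_data)


def save_static_data(row: WorkflowRow, data: dict) -> None:
    """Execution end: overwrite the whole blob, no version check."""
    row.static_data = data


def simulate_two_overlapping_runs() -> int:
    row = WorkflowRow()
    run_a = load_static_data(row)  # both runs start from the same baseline
    run_b = load_static_data(row)
    run_a["processed"] += 1
    run_b["processed"] += 1
    save_static_data(row, run_a)
    save_static_data(row, run_b)   # silently overwrites run A's write
    return row.static_data["processed"]


class AtomicCounter:
    """Stand-in for an atomic increment owned by the datastore."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def incr(self) -> int:
        with self._lock:
            self._value += 1
            return self._value


def simulate_atomic_counter() -> int:
    counter = AtomicCounter()
    counter.incr()  # run A
    return counter.incr()  # run B sees run A's increment


if __name__ == "__main__":
    print("blob write-back:", simulate_two_overlapping_runs())  # one update lost
    print("atomic counter:", simulate_atomic_counter())
```

Two runs did work, but the blob version only records one of them, which is exactly the failure you don't want in dedup or cursor state.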
The credential-isolation point is a separate axis from the concurrency one and probably deserves its own thread, since the dynamic-credential pattern through Execute Workflow has failure modes that have nothing to do with dedup.
Happy to share what the staticData read and write path looks like in the 2.x source if it helps you rule it in or out for your setup.