Hi everyone,
I’m running n8n in queue mode with multiple workers, and I’m seeing a strange issue after restarting workers or deployments.
My setup: Webhook → Queue → Worker → Process → Database
Sometimes after a worker restart:
Old jobs get processed again
Some jobs appear “stuck” and later retry unexpectedly
A few executions create duplicate DB writes/API calls
I already use retries and basic error handling, but I think the issue is related to how jobs are acknowledged or recovered after worker crashes.
Example processing logic:
if ($json.status !== "processed") {
  // continue processing
}
I’m trying to understand:
• How n8n queue mode handles unfinished jobs after restart
• Whether jobs are re-queued automatically
• Best way to make workflows safe against duplicate execution after crashes
For people using queue mode in production:
• What’s the recommended pattern for crash recovery and idempotent processing?
Hi there @Decoure_Ryan, what you’re seeing is usually normal behavior in queue mode. If a worker crashes or restarts before a job is fully completed and acknowledged, the queue treats that job as unfinished and reprocesses it later. That’s why you’re seeing old jobs run again.
Assume jobs can run more than once, and make processing idempotent.
For example, before processing:
if ($json.status === "processed") {
  return;
}
And use DB-level protection like:
ON CONFLICT DO NOTHING
or unique keys to prevent duplicate inserts.
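To make the ON CONFLICT idea concrete, here is a minimal sketch that builds a parameterized idempotent INSERT in the `{ text, values }` shape that node-postgres accepts. The table `processed_jobs` and its columns are hypothetical names, and the sketch assumes a unique constraint on `job_id`; it is one way to do it, not the only one.

```javascript
// Hypothetical table/column names; adjust to your own schema.
// Assumes a UNIQUE constraint (or primary key) on processed_jobs.job_id.
function buildIdempotentInsert(jobId, payload) {
  if (typeof jobId !== "string" || jobId.length === 0) {
    throw new Error("jobId must be a non-empty string");
  }
  return {
    text:
      "INSERT INTO processed_jobs (job_id, payload) " +
      "VALUES ($1, $2) " +
      "ON CONFLICT (job_id) DO NOTHING " +
      "RETURNING job_id",
    values: [jobId, JSON.stringify(payload)],
  };
}

// With ON CONFLICT DO NOTHING plus RETURNING, a duplicate yields zero rows,
// so exactly one returned row means this run is the first to claim the job.
function isFirstRun(result) {
  return Array.isArray(result.rows) && result.rows.length === 1;
}
```

Running the INSERT as the very first step and only continuing when `isFirstRun` is true lets the database decide which run wins, even if two workers pick up the same job at once.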
A common production split of responsibilities:
• Queue handles retries/recovery
• Database handles deduplication/idempotency
• Workers stay stateless
Welcome @Decoure_Ryan to our community! I’m Jay and I am an n8n verified creator.
To add to what Niffzy said: the root cause is Bull’s “stalled job” recovery mechanism. When a worker restarts without gracefully completing a job, the job’s lock expires, and Bull’s stalled-job check, which runs every QUEUE_WORKER_STALLED_INTERVAL milliseconds (default 30000), marks the job as stalled and re-queues it. You can set QUEUE_WORKER_MAX_STALLED_COUNT=1 to limit how many times a stalled job gets retried, and tune QUEUE_WORKER_STALLED_INTERVAL to control the detection window. For idempotency at the n8n level, use $getWorkflowStaticData or a DB status check at the very start of the workflow to short-circuit if the execution_id was already processed. Setting a unique constraint on execution_id in your DB is the most reliable safeguard.
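As a rough illustration of the static-data short-circuit, the sketch below keeps a bounded list of already-seen execution IDs. The function name `claimExecution` and the `processedIds` field are made up for this example, and the `staticData` parameter stands in for whatever $getWorkflowStaticData("global") returns inside a Code node. Static data is only persisted for production executions, so treat this as a best-effort guard and keep the DB unique constraint as the backstop.

```javascript
// `staticData` stands in for the object from $getWorkflowStaticData("global").
// `processedIds` is a hypothetical field name used only in this sketch.
function claimExecution(staticData, executionId, maxRemembered = 1000) {
  const seen = staticData.processedIds ?? (staticData.processedIds = []);
  if (seen.includes(executionId)) {
    return false; // already handled: caller should short-circuit
  }
  seen.push(executionId);
  // Keep the list bounded so static data does not grow forever.
  if (seen.length > maxRemembered) {
    seen.splice(0, seen.length - maxRemembered);
  }
  return true; // first time we see this execution: continue processing
}
```

Inside a Code node this would be called along the lines of `claimExecution($getWorkflowStaticData("global"), $execution.id)`, returning early when it yields false.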
Redis acts as the broker and workers execute the jobs, but I wouldn’t assume exactly-once guarantees. After a crash or restart, treat delivery as at-least-once, review N8N_GRACEFUL_SHUTDOWN_TIMEOUT so workers get time to finish in-flight jobs before they exit, and design your workflow to handle reprocessing.
(I’m not shouting, just emphasizing.)
ALWAYS USE A UNIQUE KEY
99% of issues could be solved with that
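If the incoming webhook does not carry a natural unique ID, one option is to derive a deterministic key from the payload itself. The sketch below is only an assumption about how that could look: `idempotencyKey` and `canonicalize` are hypothetical helpers, and it relies on Node's built-in crypto module, which a Code node may need to be explicitly allowed to load.

```javascript
const { createHash } = require("node:crypto");

// Serialize with sorted object keys so logically equal payloads
// produce the same string regardless of property order.
function canonicalize(value) {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const parts = Object.keys(value)
      .filter((k) => value[k] !== undefined)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(value[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// `source` is a free-form label (e.g. the webhook name) that keeps
// identical payloads from different sources from colliding.
function idempotencyKey(source, payload) {
  return createHash("sha256")
    .update(source + "\n" + canonicalize(payload))
    .digest("hex");
}
```

The resulting hex string can go straight into the column that carries the unique constraint.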
Great breakdown from @syed_noor. One thing I’d add: Bull also has a lockDuration setting (default 30s). The worker renews the lock while the job runs, but if it can’t renew in time (for example, because a CPU-heavy step blocks the event loop), the lock expires and the job gets marked stalled even while it is still running. Tune the stalled interval as mentioned, but also make sure lockDuration is set appropriately; in n8n that is QUEUE_WORKER_LOCK_DURATION.
Also worth noting: the Postgres idempotency key approach is the most reliable pattern I’ve seen in production. Combine it with n8n’s “Stop and Error” node after the INSERT check to cleanly exit duplicate runs without polluting your error logs.
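Good addition on the lockDuration distinction; I should have called that out separately. The stalled interval controls how often the checker runs, while lockDuration controls how long a job’s lock stays valid without being renewed before the job counts as stalled. The worker renews the lock while it is healthy, so lockDuration needs to exceed the longest stretch in which a worker might be unable to renew it (for example, a CPU-bound step), rather than the whole workflow execution time.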
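To keep the knobs apart, here is a small sketch using the field names from Bull's queue `settings` option with Bull's documented defaults. n8n builds this object internally from its environment variables, so the snippet is only meant to show which setting is which; `checkStallSettings` is a hypothetical helper, not part of Bull or n8n.

```javascript
// Field names follow Bull's `settings` queue option; values are Bull's defaults.
const stallSettings = {
  lockDuration: 30000,    // ms a job lock stays valid without renewal
  lockRenewTime: 15000,   // ms between lock renewals by a healthy worker
  stalledInterval: 30000, // ms between stalled-job checks
  maxStalledCount: 1,     // times a stalled job is re-queued before failing
};

// Hypothetical sanity check: returns a list of human-readable problems.
function checkStallSettings(s) {
  const problems = [];
  if (s.lockRenewTime >= s.lockDuration) {
    problems.push("lockRenewTime must be shorter than lockDuration");
  }
  if (s.stalledInterval < 0) {
    problems.push("stalledInterval must not be negative");
  }
  if (!Number.isInteger(s.maxStalledCount) || s.maxStalledCount < 0) {
    problems.push("maxStalledCount must be a non-negative integer");
  }
  return problems;
}
```

The first check captures the main trap: if renewals happen less often than the lock lasts, even a healthy worker will see its jobs flagged as stalled.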
The Stop and Error node tip is solid too. I use that after the idempotency INSERT with the message set to the job_id. That way, when you review executions in n8n, you can immediately see which ones were legitimate duplicates vs. actual failures. It keeps the execution list clean instead of showing false-positive errors.
For anyone implementing this pattern, I wrote a more detailed breakdown of all six production-readiness dimensions (idempotency is just one of them) here:
The job_id in the Stop and Error message is a smart touch; it makes triage much faster when you’re scanning executions. One more thing worth adding on top of this pattern: enable continueOnFail on the idempotency check node and route the “already processed” path to a No-op node with a clear name (e.g. “DUPLICATE - skipped”), rather than relying solely on the error path. That keeps the execution graph readable and separates expected skips from actual failures at a glance.
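One possible way to feed that routing is to tag each item with an explicit flag right after the idempotency check, so an IF node can branch on it. This is a sketch under assumptions: `tagDuplicate` and the `duplicate`/`route` fields are hypothetical, and `insertedRowCount` is whatever your DB node reports as the number of rows actually inserted.

```javascript
// Tags an n8n-style item ({ json: {...} }) based on the idempotency INSERT.
// insertedRowCount === 0 means the unique key already existed.
function tagDuplicate(item, insertedRowCount) {
  const duplicate = insertedRowCount === 0;
  return {
    json: {
      ...item.json,
      duplicate,
      route: duplicate ? "DUPLICATE - skipped" : "process",
    },
  };
}
```

An IF node checking `duplicate` can then send expected skips to the No-op branch and everything else on to the real processing path.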