[v2.2.4] "Worker failed to find data for execution" Race Condition in Queue Mode - Not Present in v1.x

EXECUTIONS_MODE=queue
DB_POSTGRESDB_HOST=pgbouncer.n8n.svc.cluster.local # PgBouncer in transaction mode (also tried session mode)
DB_POSTGRESDB_POOL_SIZE=60
N8N_CONCURRENCY_PRODUCTION_LIMIT=50
EXECUTIONS_DATA_SAVE_WAIT=true
EXECUTIONS_TIMEOUT=900000
QUEUE_BULL_STALLED_INTERVAL=60000
QUEUE_BULL_MAX_STALLED_COUNT=10


---

Our Hypothesis:

We believe this is a race condition between the Redis job queue and the PostgreSQL writes that was introduced or exacerbated in v2.x:


**Timeline we're seeing:**

0ms: n8n creates execution record → Writes to Postgres via PgBouncer
5ms: Job queued to Redis → Available to workers
10ms: Worker picks up job from Redis
15ms: Worker queries database for execution data
20ms: ERROR: “Worker failed to find data for execution”
50ms: Postgres write completes (too late)
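
To make the hypothesis concrete, here is a minimal sketch of the ordering we suspect, written against plain bullmq and pg (this is not n8n's actual code, and the queue/table names are made up): the job becomes visible in Redis before the execution row is committed, so a fast worker's SELECT comes back empty.

```typescript
import { Queue, Worker } from 'bullmq';
import { Client } from 'pg';

const connection = { host: 'redis', port: 6379 };
const queue = new Queue('executions-demo', { connection });

// Producer side: the ordering we suspect. The job is visible in Redis
// before the execution row is committed behind PgBouncer.
async function enqueueExecution(db: Client, executionId: string) {
  await queue.add('run', { executionId }); // job immediately visible to workers
  await db.query(
    'INSERT INTO execution_demo (id, payload) VALUES ($1, $2)', // commit still in flight
    [executionId, '{}'],
  );
}

// Worker side: picks the job up within milliseconds and queries before
// the insert above has committed, reproducing the error we see.
new Worker(
  'executions-demo',
  async (job) => {
    const db = new Client({ connectionString: process.env.DATABASE_URL });
    await db.connect();
    const { rows } = await db.query(
      'SELECT payload FROM execution_demo WHERE id = $1',
      [job.data.executionId],
    );
    await db.end();
    if (rows.length === 0) {
      throw new Error(`Worker failed to find data for execution ${job.data.executionId}`);
    }
    return rows[0];
  },
  { connection },
);
```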


**Why we think v2 changed this:**
- Same infrastructure worked flawlessly in v1.x
- Error appears to be timing-related
- Only started after v2 upgrade
- Affects random executions where not even the first node of the workflow executes (consistent with a race condition)

---

**What We've Tried:**

1.  Enabled `EXECUTIONS_DATA_SAVE_WAIT=true` - Still fails
2.  Increased `QUEUE_BULL_STALLED_INTERVAL` to 60 seconds - Still fails
3.  Increased `QUEUE_BULL_MAX_STALLED_COUNT` - Still fails
4.  Verified PgBouncer connectivity and performance - Working fine
5.  Checked database connection pool - Not exhausted
6.  Reduced worker concurrency from 50 to 10 - Still fails (less frequently)
7.  Cleared Redis queue and restarted all pods - Temporarily helps, then returns

---

Questions:

1. **Did n8n v2.x change how execution data is written to the database before jobs are queued?** We suspect v2 may queue jobs to Redis before or during the database write rather than after, unlike v1.

2. **Is PgBouncer's `pool_mode=transaction` incompatible with n8n v2?** We also tried `pool_mode=session`, but it didn't help either.

3. **Are there any new v2-specific environment variables we should set to handle this race condition?**

4. **Has anyone else experienced this issue after upgrading to v2.x?**

5. **Is there a recommended configuration for queue mode in v2 that differs from v1.x best practices?**

---

**Workarounds That Partially Help:**

- ⚠️ **Bypassing PgBouncer** (direct Postgres connection) reduces failures from 50% to ~20%
- Reducing concurrency to 3 reduces failures from 50% to ~10%
- Scaling to only 5 workers reduces failures from 50% to ~5%

However, these workarounds severely limit our capacity and aren't sustainable for production.
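
The only worker-side mitigation we can think of, if patching were an option, is to retry the lookup with a short backoff instead of failing on the first empty read. A rough sketch of the idea (not n8n code; `fetchExecution` is a hypothetical stand-in for the worker's "load execution data" step):

```typescript
// Retry the execution lookup a few times with exponential backoff so the
// worker tolerates commit lag instead of aborting on the first empty read.
async function fetchExecutionWithRetry(
  fetchExecution: (id: string) => Promise<object | null>,
  executionId: string,
  attempts = 5,
  baseBackoffMs = 200,
): Promise<object> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    const data = await fetchExecution(executionId);
    if (data !== null) return data;
    // Row not committed/visible yet: back off 200ms, 400ms, 800ms, ...
    await new Promise((resolve) => setTimeout(resolve, baseBackoffMs * 2 ** attempt));
  }
  throw new Error(`Worker failed to find data for execution ${executionId}`);
}
```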

---

Relevant Logs:

Worker logs when error occurs:

Error: Worker failed to find data for execution 846687 (job 371555)
Problem with execution 846687: Error: Worker failed to find data for execution 846687 (job 371555). Aborting.

We still see the same issue on version 2.1.0, even though everything worked normally on version 1.x.

Hi @Ahmed_Abozeid,

Please can you share how you are hosting n8n? If you are using Docker, please share your setup or Docker Compose files. Please remember to REDACT any sensitive information.

Hi @Wouter_Nigrini, thanks for responding!

I’m running n8n on Kubernetes (GKE). Here’s my setup and the issue:

Critical Context: This worked perfectly in v1.x

Everything was completely fine when we were running n8n v1.x for months in production. This race condition only appeared after upgrading to v2.x. We’re now on v2.1.0 (tried v2.2.4 first but it was even worse).

Setup:

  • Kubernetes on GKE: 1 main pod (nrx) + 20 worker pods

  • PostgreSQL: db-custom-4-8192 (4 vCPU, 8GB RAM, direct connection)

  • Redis: 2GB, standard deployment

  • n8n version: 2.1.0 (both main and workers)

Configuration:

EXECUTIONS_MODE=queue
EXECUTIONS_DATA_SAVE_WAIT=true
QUEUE_BULL_JOB_OPTS_DELAY=2000
N8N_CONCURRENCY_PRODUCTION_LIMIT=10
DB_POSTGRESDB_POOL_SIZE=20
N8N_WORKERS_COUNT=0 (on main pod)
N8N_WORKERS_ENABLED=true

The Issue:

Getting “Worker failed to find data for execution [ID] (job [JOB_ID])” errors on random requests, not just parallel ones:

  • Single requests: ~10% failure rate

  • Parallel requests: 30-60% failure rate

Workers are picking up jobs from Redis within 15-27ms instead of respecting the configured 2000ms delay, and even though I have also set EXECUTIONS_DATA_SAVE_WAIT=true, it is not working as intended.
The timing from debug logs shows:

Failed execution: Enqueued at 16:11:30.700 → Failed at 16:11:30.715 (15ms later)
Successful execution: Enqueued at 16:11:32.005 → Finished at 16:11:32.248 (243ms later)
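
For comparison, this is how a per-job delay behaves in plain BullMQ (how n8n maps QUEUE_BULL_JOB_OPTS_DELAY onto these job options is only our assumption): a delayed job should sit in the "delayed" set for roughly 2 seconds before workers can claim it. A quick script like this shows whether jobs ever enter that set or go straight to "waiting":

```typescript
import { Queue } from 'bullmq';

const queue = new Queue('delay-check', { connection: { host: 'redis', port: 6379 } });

async function main() {
  // Add a job with an explicit 2s delay, the way we assume
  // QUEUE_BULL_JOB_OPTS_DELAY is meant to be applied per job.
  await queue.add('probe', { executionId: 'test' }, { delay: 2000 });

  const delayed = await queue.getDelayed();
  const waiting = await queue.getWaiting();
  // With the delay applied, the job sits in "delayed" before workers can
  // see it; if real jobs appear straight in "waiting", the option never
  // made it onto the job.
  console.log('delayed job delays:', delayed.map((job) => job.opts.delay));
  console.log('waiting jobs:', waiting.length);

  await queue.close();
}

main();
```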

What We’ve Tried:

  1. Cleaned 1,247 zombie executions (finished=false) - reduced failures from 20% to 10% but didn’t fix it (cleanup query sketched after this list)

  2. Verified infrastructure is healthy (DB: 21/1000 connections, Redis: 49MB/2GB, all fast)

  3. Confirmed all worker pods are from same ReplicaSet (no mixed versions)

  4. Tested v2.2.4: even worse (60% failure rate)

  5. Downgraded to v2.1.0: still failing at 30%

  6. Verified QUEUE_BULL_JOB_OPTS_DELAY=2000 is in environment on both main and worker pods
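
For anyone who wants to reproduce step 1, a cleanup along these lines should work; the table and column names (execution_entity, finished, "startedAt") are assumptions about the n8n Postgres schema, so verify them against your own database and table prefix before running the DELETE:

```typescript
import { Client } from 'pg';

async function cleanZombieExecutions() {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();

  // Table/column names here are assumptions about the n8n schema;
  // count first and only delete once the number looks sane.
  const { rows } = await db.query(
    `SELECT count(*) AS zombies
       FROM execution_entity
      WHERE finished = false
        AND "startedAt" < now() - interval '1 hour'`,
  );
  console.log('zombie executions:', rows[0].zombies);

  await db.query(
    `DELETE FROM execution_entity
      WHERE finished = false
        AND "startedAt" < now() - interval '1 hour'`,
  );

  await db.end();
}

cleanZombieExecutions();
```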

The Problem:

QUEUE_BULL_JOB_OPTS_DELAY=2000 is set in the environment, but n8n appears to ignore it completely: workers pick up jobs immediately instead of waiting 2 seconds for the database writes to complete, and then hit the error. I only added this delay to work around the issue - the variable wasn’t set before, and everything worked fine without it on v1.x.
Same infrastructure, same config, same workflows - everything worked in v1.x, breaks in v2.x. Would really appreciate any guidance on what changed or how to fix this. Can provide full deployment YAMLs if needed.

Thanks!

Stupid question, but are your worker pods connecting to the same DB as the main instance?

No worries, I appreciate your help!!

Yes, absolutely confirmed - the main pod and all worker pods connect to the exact same PostgreSQL database.

Database Configuration (identical on both):

  • Host: 34.56.105.118

  • Database: n8n_staging

  • User:****

  • Port: 5432

Verification:

I compared the environment variables between main and worker pods using diff, and all database settings (DB_POSTGRESDB_HOST, DB_POSTGRESDB_DATABASE, DB_POSTGRESDB_USER, DB_POSTGRESDB_PASSWORD) are 100% identical. The only differences were N8N_USER and N8N_PASSWORD which are just the admin UI credentials.

I also checked pg_stat_activity and confirmed 48 active connections from the main pod and all 20 worker pods, all connecting to the same database (n8n_staging at 34.56.105.118).
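
For reference, the pg_stat_activity check can be scripted like this (pg_stat_activity is a standard Postgres view; grouping by client_addr just separates main-pod connections from worker-pod connections):

```typescript
import { Client } from 'pg';

async function checkConnections() {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();

  // Count connections to the n8n database, grouped by source address and state.
  const { rows } = await db.query(
    `SELECT client_addr, state, count(*) AS connections
       FROM pg_stat_activity
      WHERE datname = $1
      GROUP BY client_addr, state
      ORDER BY connections DESC`,
    ['n8n_staging'],
  );
  console.table(rows);

  await db.end();
}

checkConnections();
```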

The race condition occurs because workers query this shared database for execution data BEFORE the main pod’s asynchronous database writes complete - not because they’re using different databases. The workers successfully write execution results back to the same database, which I can see in the worker logs showing “Save execution data to database for execution ID 848xxx” entries.

So yes - same database, race condition is about timing of writes/reads within that shared database.