1.42.1 Executions Running on Webhook and Multiple Workers

I am running 7 n8n boxes in queue mode: 1 primary box, 2 webhook boxes, and 4 worker boxes. This configuration had been running successfully for over a year. Recently I upgraded from 1.25.1 to 1.42.1, and now any workflow that has a wait in it (any wait time, from 1 second to 2 minutes to 90 minutes) executes on every box that I have. For instance, I have a simple workflow that takes an SQS message as the trigger, waits for 1 second, and then calls a webhook running on my local machine. It calls the webhook 7 times, once for each of the machines. When I turned off one of the webhook servers, it only called it 6 times. This is a major issue; if anyone has any help, that would be appreciated.

It looks like your topic is missing some important information. Could you provide the following, if applicable?

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

n8n version: 1.42.1
database: postgres
EXECUTIONS_MODE: queue
Running via npm

Hello @Slosarek,

Can you provide the configuration for each type (primary, webhook, and worker)?

I am running them with pm2, with an ecosystem file like this:

const sharedEnvironment = {
  AWS_ACCESS_KEY_ID: '<redacted>',
  AWS_DEFAULT_REGION: 'us-west-2',
  AWS_SECRET_ACCESS_KEY: '<redacted>',
  DB_POSTGRESDB_DATABASE: 'postgres',
  DB_POSTGRESDB_HOST: '<redacted>',
  DB_POSTGRESDB_PASSWORD: '<redacted>',
  DB_POSTGRESDB_PORT: '5432',
  DB_POSTGRESDB_SCHEMA: 'public',
  DB_POSTGRESDB_USER: 'postgres',
  DB_TYPE: 'postgresdb',
  N8N_ENCRYPTION_KEY: '<redacted>',
  NODE_FUNCTION_ALLOW_EXTERNAL: 'aws-sdk,axios,underscore,lodash',
  NODE_FUNCTION_ALLOW_BUILTIN: 'url,crypto,fs',
  WEBHOOK_URL: '<redacted>',
  QUEUE_BULL_REDIS_HOST: '<redacted>',
  QUEUE_BULL_REDIS_PORT: '6379',
  QUEUE_BULL_REDIS_DB: '1',
  EXECUTIONS_MODE: 'queue',
  N8N_CUSTOM_EXTENSIONS: '<redacted>',
  EXECUTIONS_DATA_PRUNE: 'true',
  EXECUTIONS_DATA_MAX_AGE: '720',
};

module.exports = {
  apps: [
    {
      name: 'primary',
      script: 'node_modules/.bin/n8n',
      node_args: '--max-http-header-size=49152',
      watch: true,
      env: {
        ...sharedEnvironment,
        N8N_DISABLE_PRODUCTION_MAIN_PROCESS: true,
      },
    },
    {
      name: 'worker',
      script: 'node_modules/.bin/n8n',
      args: 'worker',
      node_args: '--max-http-header-size=49152',
      watch: true,
      env: {
        ...sharedEnvironment,
      },
    },
    {
      name: 'webhook',
      script: 'node_modules/.bin/n8n',
      args: 'webhook',
      node_args: '--max-http-header-size=49152',
      watch: true,
      env: {
        ...sharedEnvironment,
      },
    },
  ]
};

and launching each of them with:

pm2 restart ecosystem.config.js --only <primary|worker|webhook>

which basically runs n8n with the environment variables above:

node_modules/.bin/n8n
node_modules/.bin/n8n worker
node_modules/.bin/n8n webhook

I’m working with @Slosarek on this. Here’s an example of the test workflow we were playing with to confirm it was sending multiple times.

Here’s another interesting thing: I’m seeing 7 copies of the error come through in my Slack bot error workflow, all for the same execution ID. That’s the same number of n8n boxes we have running in total, but only one execution ID.
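
One way to sanity-check that from the database side is a quick script along these lines. This is only a rough sketch: it assumes the default public schema and the execution_entity table and column names that recent n8n versions use, so adjust the names if your schema differs.

// check-executions.js
// Counts execution rows per workflow over the last hour, so we can see whether
// the 7 webhook calls map to 7 execution rows or to a single one.
const { Client } = require('pg');

async function main() {
  const client = new Client({
    host: process.env.DB_POSTGRESDB_HOST,
    port: Number(process.env.DB_POSTGRESDB_PORT || 5432),
    database: process.env.DB_POSTGRESDB_DATABASE,
    user: process.env.DB_POSTGRESDB_USER,
    password: process.env.DB_POSTGRESDB_PASSWORD,
  });
  await client.connect();

  // Table and column names here are assumptions based on the default n8n schema.
  const { rows } = await client.query(`
    SELECT "workflowId", COUNT(*) AS execution_rows
    FROM execution_entity
    WHERE "startedAt" > NOW() - INTERVAL '1 hour'
    GROUP BY "workflowId"
    ORDER BY execution_rows DESC
  `);
  console.table(rows);

  await client.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

If the duplicate calls really do share one execution ID, this should show a single execution row per trigger message rather than seven.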

Hey @chase2963, is this a recurring problem, or did this happen only once when starting the services?

I can imagine this happening once, possibly due to a bad recovery process that might have started on all processes (still investigating this), but I cannot imagine a scenario where this would happen all the time.

Can you confirm whether this is still happening for new executions?

Also, it was brought to my attention that a fix that might be related to this issue is being released in today’s minor release.

This is happening on every execution that has a Wait node in it.

It’s been consistently happening since we upgraded to 1.42.1. Is that minor change going to be a stable ‘latest’ release?

We just upgraded to 1.45.0 and it appears to have fixed our issue so far. It looks like every box in the cluster was triggering the WaitTracker, and then the workers would pick those up and run them all, but with the same execution ID.
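
In other words, every box seemed to be running its own wait-checking loop against the same database, and each one handed the same due execution off to the queue. Roughly, something like this illustrative sketch is what appears to have been happening on all 7 boxes (this is not the actual n8n source; db and queue are just stand-ins for the Postgres client and the Bull queue):

// Illustration only, not n8n code: a wait-tracker-style loop with no claiming step.
// Each main/webhook process effectively runs this on a timer, e.g.
// setInterval(() => waitTrackerTick(db, queue), 60 * 1000);
async function waitTrackerTick(db, queue) {
  // Find executions whose wait period has elapsed.
  const due = await db.query(
    `SELECT id FROM execution_entity
     WHERE status = 'waiting' AND "waitTill" <= NOW()`
  );
  for (const row of due.rows) {
    // Nothing marks the row as claimed before the other boxes also see it,
    // so every box enqueues the same execution ID.
    await queue.add({ executionId: row.id });
  }
}

With 7 processes doing that and nothing claiming the row first, the workers end up running 7 copies of the same execution, which matches what we saw.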

Hey @chase2963, we have just released 1.42.2 with this fix; it can be seen here.

Since you have already upgraded to 1.45.0, it’s no longer possible to revert, as 1.44.0 contains an irreversible migration.

@Slosarek I’d recommend upgrading to 1.45.0 or 1.42.2 to get this fix.