Multi-worker, multi-runner production setup with Docker Compose: are we doing it correctly?

We are using n8n Enterprise at our company and have a large multi-worker/multi-runner setup. Most of the time everything works fine, but from time to time we get unreproducible hangs with no useful logs. Reloading the page and restarting the flow works around it. There is nothing useful at all in the container logs. This all started after switching to 2.0. Since I can't provide a useful log, I just want to share our Docker Compose setup, so that hopefully somebody can pinpoint whether we're doing something obviously wrong in our configuration.

The biggest change we made to the docker-compose file compared to v1.x is the addition of runners.

docker-compose.yml

volumes:
  db_storage:
  n8n_storage:
  redis_storage:

# Main specific
x-n8n-main-env: &n8n_main_env
  N8N_DISABLE_PRODUCTION_MAIN_PROCESS: false

# Worker specific
x-n8n-worker-env: &n8n_worker_env
  N8N_DISABLE_UI: true
  N8N_DISABLE_ACTIVE_WORKFLOWS: true

# Main & Worker common vars
x-n8n-common-env: &n8n_common_env
  # Execution & queue
  EXECUTIONS_MODE: queue
  EXECUTIONS_TIMEOUT: 600
  EXECUTIONS_TIMEOUT_MAX: 1200
  EXECUTIONS_DATA_MAX_AGE: 72
  EXECUTIONS_DATA_PRUNE_MAX_COUNT: 5000
  OFFLOAD_MANUAL_EXECUTIONS_TO_WORKERS: true
  QUEUE_BULL_REDIS_HOST: redis
  QUEUE_BULL_REDIS_PORT: 6379
  QUEUE_HEALTH_CHECK_ACTIVE: true
  QUEUE_WORKER_STALLED_INTERVAL: 1200000   # 20 minutes in ms
  QUEUE_WORKER_LOCK_DURATION: 1200000      # 20 minutes lock in ms
  QUEUE_WORKER_LOCK_RENEW_TIME: 300000     # Renew lock every 5 minutes

  # Runners (safe code execution)
  N8N_RUNNERS_ENABLED: true
  N8N_RUNNERS_MODE: external
  N8N_RUNNERS_AUTH_TOKEN: ${N8N_RUNNERS_AUTH_TOKEN}
  N8N_RUNNERS_BROKER_LISTEN_ADDRESS: 0.0.0.0
  N8N_NATIVE_PYTHON_RUNNER: true

  # Misc / runtime / security
  N8N_ENFORCE_SETTINGS_FILE_PERMISSIONS: true
  N8N_BLOCK_ENV_ACCESS_IN_NODE: true
  N8N_SKIP_AUTH_ON_OAUTH_CALLBACK: true
  N8N_GIT_NODE_DISABLE_BARE_REPOS: true
  N8N_DEFAULT_BINARY_DATA_MODE: database
  N8N_CONCURRENCY_PRODUCTION_LIMIT: 6 # Number of concurrent 'active workflow' executions
  N8N_HOST: ${N8N_HOST}
  N8N_PROXY_HOPS: 1
  N8N_VERSION_NOTIFICATIONS_ENABLED: false
  N8N_DATA_TABLES_MAX_SIZE_BYTES: 256000000
  WEBHOOK_URL: ${WEBHOOK_URL}
  GENERIC_TIMEZONE: Europe/Amsterdam

  # Logs
  N8N_LOG_OUTPUT: file,console
  N8N_LOG_FORMAT: json
  # N8N_LOG_LEVEL: debug

  # DB
  DB_TYPE: postgresdb
  DB_POSTGRESDB_HOST: postgres
  DB_POSTGRESDB_PORT: 5432
  DB_POSTGRESDB_DATABASE: ${POSTGRES_DB}
  DB_POSTGRESDB_USER: ${POSTGRES_NON_ROOT_USER}
  DB_POSTGRESDB_PASSWORD: ${POSTGRES_NON_ROOT_PASSWORD}
  DB_POSTGRESDB_POOL_SIZE: 10

  # SMTP
  N8N_SMTP_SENDER: ${N8N_SMTP_SENDER}
  N8N_SMTP_HOST: ${N8N_SMTP_HOST}
  N8N_SMTP_PORT: ${N8N_SMTP_PORT}
  N8N_SMTP_USER: ${N8N_SMTP_USER}
  N8N_SMTP_PASS: ${N8N_SMTP_PASS}
  N8N_SMTP_STARTTLS: false
  N8N_SMTP_SSL: true

  # License
  N8N_LICENSE_ACTIVATION_KEY: ${N8N_LICENSE_ACTIVATION_KEY}

  # Credentials overwrite (quoted to keep YAML happy)
  CREDENTIALS_OVERWRITE_DATA: "${CREDENTIALS_OVERWRITE_DATA}"

x-n8n-worker: &n8n_worker
  image: n8nio/n8n
  command: worker
  restart: unless-stopped
  volumes:
    - n8n_storage:/home/node/.n8n
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy

x-n8n-runner: &n8n_runner
  build:
    context: .
    dockerfile: runners.Dockerfile
  image: n8nio/runners
  restart: unless-stopped

x-n8n-runner-env: &n8n_runner_env
  N8N_RUNNERS_AUTH_TOKEN: ${N8N_RUNNERS_AUTH_TOKEN}
  NODE_FUNCTION_ALLOW_BUILTIN: "*"
  NODE_FUNCTION_ALLOW_EXTERNAL: "*"
  N8N_RUNNERS_STDLIB_ALLOW: "*"
  N8N_RUNNERS_EXTERNAL_ALLOW: "*"

services:
  n8n:
    image: n8nio/n8n
    restart: unless-stopped
    ports:
      - "80:5678"
    volumes:
      - n8n_storage:/home/node/.n8n
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: [*n8n_main_env, *n8n_common_env]

  postgres:
    image: postgres:16
    restart: unless-stopped
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
      - POSTGRES_NON_ROOT_USER=${POSTGRES_NON_ROOT_USER}
      - POSTGRES_NON_ROOT_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
    volumes:
      - db_storage:/var/lib/postgresql/data
      - ./init-data.sh:/docker-entrypoint-initdb.d/init-data.sh
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -h localhost -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 5s
      timeout: 5s
      retries: 10

  redis:
    image: redis:7
    restart: unless-stopped
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis_storage:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 5s
      retries: 10

  n8n-worker-1:
    <<: *n8n_worker
    environment:
      <<: [*n8n_worker_env, *n8n_common_env]

  n8n-worker-2:
    <<: *n8n_worker
    environment:
      <<: [*n8n_worker_env, *n8n_common_env]

  n8n-worker-3:
    <<: *n8n_worker
    environment:
      <<: [*n8n_worker_env, *n8n_common_env]

  n8n-worker-4:
    <<: *n8n_worker
    environment:
      <<: [*n8n_worker_env, *n8n_common_env]

  n8n-runner-1:
    <<: *n8n_runner
    environment:
      <<: *n8n_runner_env
      N8N_RUNNERS_TASK_BROKER_URI: "http://n8n-worker-1:5679"
      N8N_RUNNERS_LAUNCHER_HEALTH_CHECK_PORT: 5691
    depends_on:
      - n8n-worker-1
  n8n-runner-2:
    <<: *n8n_runner
    environment:
      <<: *n8n_runner_env
      N8N_RUNNERS_TASK_BROKER_URI: "http://n8n-worker-2:5679"
      N8N_RUNNERS_LAUNCHER_HEALTH_CHECK_PORT: 5692
    depends_on:
      - n8n-worker-2

  n8n-runner-3:
    <<: *n8n_runner
    environment:
      <<: *n8n_runner_env
      N8N_RUNNERS_TASK_BROKER_URI: "http://n8n-worker-3:5679"
      N8N_RUNNERS_LAUNCHER_HEALTH_CHECK_PORT: 5693
    depends_on:
      - n8n-worker-3

  n8n-runner-4:
    <<: *n8n_runner
    environment:
      <<: *n8n_runner_env
      N8N_RUNNERS_TASK_BROKER_URI: "http://n8n-worker-4:5679"
      N8N_RUNNERS_LAUNCHER_HEALTH_CHECK_PORT: 5694
    depends_on:
      - n8n-worker-4

runners.Dockerfile (for some reason the packages below are not accessible from n8n Code nodes, but that's another issue)

FROM n8nio/runners
USER root
RUN cd /opt/runners/task-runner-javascript && pnpm add nodejs-polars danfojs-node
RUN cd /opt/runners/task-runner-python && uv pip install numpy pandas
USER runner

From time to time we get unreproducible hangs with no useful logs.

I've seen this in multiple client setups, both self-hosted and cloud, since 2.0, including on the latest stable version.

So this is very likely a bug in n8n, and hopefully they’ll resolve it in a later version. I don’t think there’s anything wrong with your config.

Hi, @Cosku !

Given that your setup looks correct and closely follows the recommended queue + worker + runner architecture, and considering that similar intermittent hangs have been reported by others since the 2.0 release, this may point to an issue in the runners or queue execution layer rather than an obvious misconfiguration.

One simple and documented way to help isolate or mitigate the issue is to temporarily disable runners and run the same workload without them:

N8N_RUNNERS_ENABLED=false

Restart the stack and observe whether the hangs still occur.

We've just started seeing it consistently. Again, no errors in the logs. I tried turning runners off, but the problem persisted. It looks like a UI/backend communication issue, because sometimes we get errors like "Failed to get workflow list" (paraphrasing) and sometimes it fails to show the executions list.

Everything runs on quite beefy hardware, and metrics show less than 5% CPU and memory usage, so I'm pretty sure it's not a lack of resources.

There are no errors in the Redis or Postgres containers either.

Here is how it looks in the browser console when it hangs: all calls are just pending. We have tried different networks to rule that out, and it's the same.

We ran "truncate execution_data" on Postgres and it seems to have solved the problem. It's baffling to me that a couple of GBs of execution data can bring down the whole UI. Then again, it's no wonder, since the whole table is row after row of JSON data stored as text.
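For anyone hitting the same wall, this is roughly what we ran against the n8n database (a sketch, assuming the default table name `execution_data`; check the size first, and be aware that truncating destroys all stored execution payloads):

```sql
-- Check how large the execution data table has grown,
-- including its TOAST storage and indexes
SELECT pg_size_pretty(pg_total_relation_size('execution_data'));

-- Clear it. This drops the payloads of ALL past executions;
-- the execution list entries remain but have no data attached.
TRUNCATE execution_data;
```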

@Cosku

Thanks for the detailed follow-up. The fact that truncating execution_data resolved the issue actually explains the behavior very well.

What you were hitting is a known scaling limitation around execution data storage and UI queries. When execution_data grows large (especially with JSON-heavy AI workflows), PostgreSQL queries used by the UI can block, causing API endpoints to hang without producing errors. This affects the UI and backend communication, while executions may continue running normally.

This isn’t a misconfiguration of workers or runners. Your setup is broadly correct. The fix is to aggressively prune or limit execution data in production and avoid storing large execution payloads long-term.

Treat execution data as short-term debug data, not durable storage, especially in high-volume or AI-heavy setups.
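Concretely, in a compose setup like yours, more aggressive pruning could look like the snippet below (the variable names are from the n8n environment-variable reference; the values are illustrative, tighten them to taste):

```yaml
  # Aggressive execution-data pruning (illustrative values)
  EXECUTIONS_DATA_PRUNE: true            # enable automatic pruning
  EXECUTIONS_DATA_MAX_AGE: 24            # keep execution data for 24h instead of 72h
  EXECUTIONS_DATA_PRUNE_MAX_COUNT: 2000  # hard cap on stored executions
  EXECUTIONS_DATA_SAVE_ON_SUCCESS: none  # don't persist payloads of successful runs
  EXECUTIONS_DATA_SAVE_ON_ERROR: all     # keep failed runs for debugging
```

Not saving successful-run payloads at all is usually what makes the biggest difference in JSON-heavy setups, since those rows dominate the table.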

No, Tamy (I do assume you're an AI, but it's important for future reference that people understand this).

The fix is not to aggressively prune or limit execution data. There are multiple fixes, and all of them need to be done by n8n engineers.

1- If there is a chance that executions will generate data so big that a SELECT query takes 10 minutes, then the database is not the place to hold that data. I'm told the n8n team is working on an S3 solution, so that's already good.

2- You need to decouple and isolate the UI from that work. Fetching the execution list shouldn't select all the data; loading should be deferred until the actual execution data is needed, and even then a long-running database query shouldn't hang the whole backend.

Hi @Cosku !

Hey, just to clarify, I’m not a bot :blush: I’m a Brazilian user and English isn’t my first language, so sometimes my writing may sound a bit “technical”, but I’m a real person. I work a lot with n8n, study the documentation constantly, and try to share what I learn from real production usage.

I fully agree with your points. Long term, this absolutely needs product-level improvements like moving large execution payloads out of the main database and decoupling the UI queries from heavy data reads.

My intention with the previous message was only to point out the immediate, practical workaround that helps people get their instance stable right now, until those deeper fixes from the n8n team arrive.

So we’re aligned on the root cause and the ideal solution; I was just focusing on what users can do today to unblock themselves.