Long delay before webhooks trigger

Describe the problem/error/question

The delay between receiving the webhook and the workflow starting is too high: sometimes over 90 seconds. The webhook executions stay in “new” status for this long:

… and while our N8N_CONCURRENCY_PRODUCTION_LIMIT is 75, there are only 27 executions in the screenshot above.

This issue is blocking us from working with applications that require quick responses.

Information on your n8n setup

    • n8n version: 1.115.3

    • Database (default: SQLite): using RDS Postgres and EFS for storing n8n data

    • Running n8n via (Docker, npm, n8n cloud, desktop app): self-hosted on AWS with Fargate (Docker)

hello @theo

Check the resource usage. n8n may be overwhelmed and thus respond slowly.

hello @barn4k

Thanks for your answer.

We’re always under 40% CPU and memory utilization, so I guess it’s fine.

Any other idea?

Thank you

We’re now on v1.121.2 and the problem remains.

Perhaps @Jon, @ivov or @Gallo_AIA you would have a clue?

Thanks

This is just a limit.

How many workers have you set up and started? (Each of them needs its own concurrency setting as well.)

Your env file would be helpful for debugging first…

Thanks @Parintele_Damaskin here without the sensitive information:

{
  "name": "N8N_LOG_FILE_MAXSIZE",
  "value": "32"
},
{
  "name": "NODE_OPTIONS",
  "value": "--max-old-space-size=8192"
},
{
  "name": "N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN",
  "value": "true"
},
{
  "name": "EXECUTIONS_DATA_SAVE_MANUAL_EXECUTIONS",
  "value": "false"
},
{
  "name": "QUEUE_RECOVERY_INTERVAL",
  "value": "0"
},
{
  "name": "N8N_DISABLE_PRODUCTION_MAIN_PROCESS",
  "value": "false"
},
{
  "name": "EXECUTIONS_DATA_MAX_AGE",
  "value": "960"
},
{
  "name": "N8N_BASIC_AUTH_ACTIVE",
  "value": "true"
},
{
  "name": "GENERIC_TIMEZONE",
  "value": "Europe/Paris"
},
{
  "name": "EXECUTIONS_MODE",
  "value": "queue"
},
{
  "name": "QUEUE_BULL_REDIS_HOST",
  "value": "redacted"
},
{
  "name": "EXECUTIONS_DATA_PRUNE_MAX_COUNT",
  "value": "300000"
},
{
  "name": "N8N_CONCURRENCY_PRODUCTION_LIMIT",
  "value": "75"
}

Aside from

EXECUTIONS_DATA_PRUNE_MAX_COUNT=300000 and EXECUTIONS_DATA_MAX_AGE=960,

which can let a large number of execution records accumulate and slow down performance if not properly managed, the variables look OK.
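For illustration, tighter retention could be expressed in the same task-definition format. The values below are hypothetical and not tuned to your volume (note that EXECUTIONS_DATA_MAX_AGE is in hours):

```json
{
  "name": "EXECUTIONS_DATA_PRUNE",
  "value": "true"
},
{
  "name": "EXECUTIONS_DATA_MAX_AGE",
  "value": "168"
},
{
  "name": "EXECUTIONS_DATA_PRUNE_MAX_COUNT",
  "value": "50000"
}
```

Lower values mean a smaller execution table, which keeps the database queries around the execution list and pruning cheaper.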

Also set

QUEUE_HEALTH_CHECK_ACTIVE=true, then access the endpoints:

http://your_host:port/healthz

http://your_host:port/healthz/readiness

Another best practice would be a load balancer in front of the webhooks:

P.S. I am curious how you set up and start your workers, e.g.:

n8n worker --concurrency=5
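For reference, a minimal queue-mode layout might look like this. This is a Docker Compose sketch for illustration only; on Fargate the equivalent would be a separate task definition running the `worker` command, and the service names and Redis host here are assumptions:

```yaml
# Sketch: main instance plus one dedicated worker sharing the same Redis queue
services:
  n8n-main:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis

  n8n-worker:
    image: n8nio/n8n
    command: worker --concurrency=25   # each worker has its own concurrency
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis

  redis:
    image: redis:7
```

Scaling out then means adding more `n8n-worker` replicas rather than raising a single instance's concurrency.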

Thank you @Parintele_Damaskin!

We did set QUEUE_HEALTH_CHECK_ACTIVE to true and increased the concurrency up to 75!

We saw a positive impact on execution speed, but queues still accumulated up to around 70 executions, waiting up to 2 minutes 30 seconds before everything executed within a second! Very strange behavior IMO.

I can understand that EXECUTIONS_DATA_PRUNE_MAX_COUNT is quite high, but it corresponds to around 15 days of history at our volume :confused:

This issue is really annoying, we’re thinking of creating another specific instance dedicated to webhooks that need faster execution.

If you have any other thoughts and suggestions, it could be really helpful!

Before doing that, check this resource, which is actually recommended for scaling:

I would choose this path over reworking the workflows to process less data and return faster responses like a 200, etc…

Btw, my approach is the following: webhook processors + queue mode + workers.

Main is kept out of the webhook pool (optionally with production webhooks disabled on main).

This effectively gives webhook traffic “priority” in the form of dedicated capacity for intake.
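That layout could be sketched like this. Again a Docker Compose illustration, not the actual setup; the key pieces are the dedicated `webhook` process and N8N_DISABLE_PRODUCTION_MAIN_PROCESS, which keeps production webhooks off the main instance:

```yaml
# Sketch: main kept out of the webhook pool; dedicated intake and execution
services:
  n8n-main:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
      - N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true   # no production webhooks on main

  n8n-webhook:
    image: n8nio/n8n
    command: webhook        # only receives webhook calls and enqueues executions
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis

  n8n-worker:
    image: n8nio/n8n
    command: worker         # picks up enqueued executions from Redis
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis

  redis:
    image: redis:7
```

A load balancer would then route the webhook paths to `n8n-webhook`, while the editor UI stays on `n8n-main`.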

I don’t know if it’s the best approach, but in my case it does the job well.

Cheers!

We had multiple workers for quite a while, but they caused duplicated executions. (See this other discussion.)

Regarding webhook processors, this is something we don’t want, since we need not only a quick response but also a quick execution of the called process.

It really feels like there’s a blind spot here, an important bug that many should be facing, yet I don’t find many other discussions about it in the community. This is weird.

Ok, I got your point…

Can you share, for example, a workflow that you think takes too long to respond?

Maybe it needs restructuring so you don’t deal with longer waiting times, and maybe batching could be used where possible… etc.

Cheers!