Bug - Dozens of errors per second per workflow containing a webhook

Hey,

For a few hours now, all workflows containing a webhook have been displaying dozens and dozens of errors per second:

The error in the interface is “timeout exceeded when trying to connect”, and our main application (server) is logging “Failed to retrieve live execution rows in Postgres” on CloudWatch.

It seems these error executions “are not real executions”, meaning the webhook isn’t actually being called: when there is a real call, the execution succeeds.

Information on your n8n setup

  • n8n version: → 1.70.3 (we also had this error on 1.68.1)
  • Database (default: SQLite): → Postgres 14.2
  • n8n EXECUTIONS_PROCESS setting (default: own, main): → queue
  • Running n8n via (Docker, npm, n8n cloud, desktop app): → Docker on ECS Fargate
  • Operating system: → Linux

Any feedback here? I had the same issue lately … still searching for the reason.

Yes, and it’s still happening on our end.

New information I’ve gathered since my last post:

  • the problem doesn’t affect all workflows
  • the execution history for these errors doesn’t take new workflow edits into account: it displays the workflow as it was when the errors started
  • it doesn’t seem to impact new webhook workflows created since the problem began

=> So it seems we can fix it by recreating the impacted workflows. But there are quite a few of them in my case, and that doesn’t mean it won’t happen again, so understanding the root cause will definitely help :slight_smile:

Hi guys,

can you share your queue mode configuration (e.g. concurrency limit) and how many workers you have?
@theo - did you make any changes there?

Refer to our docs for more information.

Hi @ria, here are our ECS task parameters related to how our n8n instance runs.
We have 2 workers.
We did not make any changes when this issue occurred, though it happened just after our app restarted (main and workers).

[
  {
    "name": "EXECUTIONS_MODE",
    "value": "queue"
  },
  {
    "name": "N8N_DISABLE_PRODUCTION_MAIN_PROCESS",
    "value": "false"
  },
  {
    "name": "NODE_OPTIONS",
    "value": "--max-old-space-size=8192"
  },
  {
    "name": "N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN",
    "value": "true"
  },
  {
    "name": "EXECUTIONS_DATA_PRUNE_MAX_COUNT",
    "value": "300000"
  },
  {
    "name": "EXECUTIONS_DATA_MAX_AGE",
    "value": "960"
  },
  {
    "name": "N8N_LOG_FILE_MAXSIZE",
    "value": "32"
  },
  {
    "name": "EXECUTIONS_DATA_SAVE_MANUAL_EXECUTIONS",
    "value": "false"
  },
  {
    "name": "QUEUE_RECOVERY_INTERVAL",
    "value": "0"
  }
]
Thanks for sharing, @jc38!

Which version of n8n are you running on your instances?
You don’t seem to have a concurrency limit set. Can you try adding N8N_CONCURRENCY_PRODUCTION_LIMIT? Maybe start with something like 20.
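
For example (a sketch assuming the same environment-list format as the ECS task definition excerpt above), the variable would go alongside your other entries:

{
  "name": "N8N_CONCURRENCY_PRODUCTION_LIMIT",
  "value": "20"
}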

Also, a few other notes on your variables:

@ria thank you for your answers!

IIRC, we were using this parameter because of this issue shared here.

We had issues running webhook processors, so we stopped doing so. I’m removing those params and trying the concurrency limit.

We’re on 1.71.2.

Hey,

Same problem here on 1.71.3, still investigating the logs…

Hey,

We are also experiencing this problem. We are running in queue mode with 4 workers and concurrency set to 20.

We just updated to 1.72.1; we previously ran 1.71.3, where we had the problem. We will see if it continues to occur in this version.

I had the same error and solved it by removing some executions from the database.

More specifically, executions with status “error” and no “startedAt” value.

You can try this SQL query in your n8n database:

SELECT * FROM execution_entity WHERE "startedAt" IS NULL AND status = 'error'

These executions seem to be replayed by the workers over and over, which causes these errors on activated workflows.
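
If that query returns only the stuck rows, here is a cleanup sketch (assuming n8n’s default execution_entity table in Postgres; back up the database first):

-- Wrap in a transaction so the deletion can be checked before committing.
BEGIN;
DELETE FROM execution_entity
WHERE "startedAt" IS NULL
  AND status = 'error';
COMMIT;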

Thank you @Kent1, that fixed it for me! :raised_hands:

I hadn’t noticed that the execution IDs of these failing executions kept repeating within the same limited list, meaning they were being retried for some reason.

Deleting them fixed the problem :+1:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.