Bug - Dozens of errors per second per workflow containing a webhook

Hey,

For a few hours now, all workflows containing a webhook have been displaying dozens and dozens of errors per second:

The error in the interface is “timeout exceeded when trying to connect”, and our main application (server) is logging “Failed to retrieve live execution rows in Postgres” on CloudWatch.

It seems these error executions “are not real executions”, meaning the webhook isn’t actually being called: when there is a real call, the execution succeeds.

Information on your n8n setup

  • n8n version: → 1.70.3 (we also had this error on 1.68.1)
  • Database (default: SQLite): → Postgres 14.2
  • n8n EXECUTIONS_PROCESS setting (default: own, main): → queue
  • Running n8n via (Docker, npm, n8n cloud, desktop app): → Docker on ECS Fargate
  • Operating system: → Linux

Any feedback here? I had the same issue lately … still searching for the reason.

Yes, and it’s still happening on our end.

New information I’ve gathered since my last post:

  • the problem doesn’t affect all workflows
  • the execution history for these errors doesn’t take new workflow edits into account: it displays the workflow as it was when the errors started
  • it doesn’t seem to impact new webhook workflows created since the problem began

=> So it seems we can fix it by recreating the impacted workflows. But there are quite a few of them in my case, and that doesn’t mean it won’t happen again, so understanding the root cause will definitely help :slight_smile:

Hi guys,

can you share your queue mode configuration (e.g. concurrency limit) and how many workers you have?
@theo - did you make any changes there?

Refer to our docs for more information.

Hi @ria, here are our ECS task parameters related to how our n8n instance runs.
We have 2 workers.
We did not make any changes when this issue occurred, though it happened just after our app restarted (main and workers).

[
  {
    "name": "EXECUTIONS_MODE",
    "value": "queue"
  },
  {
    "name": "N8N_DISABLE_PRODUCTION_MAIN_PROCESS",
    "value": "false"
  },
  {
    "name": "NODE_OPTIONS",
    "value": "--max-old-space-size=8192"
  },
  {
    "name": "N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN",
    "value": "true"
  },
  {
    "name": "EXECUTIONS_DATA_PRUNE_MAX_COUNT",
    "value": "300000"
  },
  {
    "name": "EXECUTIONS_DATA_MAX_AGE",
    "value": "960"
  },
  {
    "name": "N8N_LOG_FILE_MAXSIZE",
    "value": "32"
  },
  {
    "name": "EXECUTIONS_DATA_SAVE_MANUAL_EXECUTIONS",
    "value": "false"
  },
  {
    "name": "QUEUE_RECOVERY_INTERVAL",
    "value": "0"
  }
]
Thanks for sharing, @jc38!

Which version of n8n are you running on your instances?
You don’t seem to have a concurrency limit set. Can you try adding N8N_CONCURRENCY_PRODUCTION_LIMIT? Maybe start with something like 20.
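
For example (a sketch assuming the same environment-list format as the ECS task definition excerpt above), the variable would go alongside your other entries:

{
  "name": "N8N_CONCURRENCY_PRODUCTION_LIMIT",
  "value": "20"
}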

Also, a few other notes on your variables:

@ria thank you for your answers!

IIRC, we were using this parameter because of this issue shared here.

We had issues running webhook processors, so we stopped doing so. I’m removing those params and trying the concurrency limit.

We’re on 1.71.2.

Hey,

Same problem here on 1.71.3, still investigating the logs…

Hey,

We are also experiencing this problem. We are running in queue mode with 4 workers and concurrency set to 20.

We just updated to 1.72.1; we previously ran 1.71.3, where we had the problem. We will see if it continues to occur in this version.

I had the same error and solved it by removing some executions from the database.

More specifically, executions with status “error” and no “startedAt” value.

You can try this SQL query in your n8n database:

SELECT * FROM execution_entity WHERE "startedAt" IS NULL AND status = 'error'

These executions seem to be replayed by the workers over and over, which causes these errors on activated workflows.
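
If that query returns only the stuck rows, here is a cleanup sketch (assuming n8n’s default execution_entity table in Postgres; back up the database first):

-- Wrap in a transaction so the deletion can be checked before committing.
BEGIN;
DELETE FROM execution_entity
WHERE "startedAt" IS NULL
  AND status = 'error';
COMMIT;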

Thank you @Kent1, that fixed it for me! :raised_hands:

I hadn’t noticed that the execution IDs of these failing executions kept repeating within the same limited list, meaning they were being retried for some reason.

Deleting them fixed the problem :+1:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.