Duplicate WebEx messages sent on only one trigger

Describe the issue/error/question

When the workflow is triggered by a single webhook event, two identical WebEx messages are fired off. There are no loops or data splits in my flow, only one webhook event and data body is passed through, there is only one authentication object associated with WebEx, and there is only one of each node.

What is the error message (if any)?

Please share the workflow

Share the output returned by the last node

[
  {
    "id": "redacted",
    "roomId": "redacted",
    "roomType": "group",
    "text": "redacted",
    "personId": "redacted",
    "personEmail": "redacted",
    "created": "2022-03-17T14:32:17.981Z"
  }
]

Information on your n8n setup

  • n8n version: 0.165.1
  • Database you’re using (default: SQLite): redis
  • Running n8n with the execution process [own(default), main]:
  • Running n8n via [Docker, npm, n8n.cloud, desktop app]: docker

EDIT:

Just updated to 0.168.2 and this whole flow is broken now. It does not execute but says it is executing, and cancelling the execution says “job not found”.


[screenshot]


Hey @dylan.moore, I am sorry to hear you’re running into this behaviour. How are you deploying n8n? Could it be that a previous execution simply is no longer available in your database after your upgrade?

Did you reload the n8n website in your browser after upgrading?

As for the duplicate, could it be that you are sending two webhooks, one to your test URL and one to your production URL? Only the test one would show up live on the canvas; the production one would only be visible in your list of previous executions if you are storing successful execution data.
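As an aside, whether successful production executions are stored at all is controlled by the execution data settings. A minimal sketch of how this can be set through an environment variable on the n8n service in a compose file (illustrative only, not taken from your configuration):

    environment:
      # keep execution data for successful production runs so they appear in the executions list
      - EXECUTIONS_DATA_SAVE_ON_SUCCESS=all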

  1. Docker Compose in Portainer.
  2. No, these are fresh executions via the test webhook. Production stops working after about an hour of being active anyway; that was the case even prior to this update.
  3. Yes. Private browsing, no cache.
  4. No, I have turned off the ‘active’ switch (indicating production use) and it still happens, unfortunately.

The error message sounds like the execution might have already finished, but this information didn’t make it through to the browser. In your workflow settings, could you configure manual executions to be saved?
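Alternatively, the same behaviour can be enabled instance-wide through an environment variable; a minimal sketch for a compose-based n8n service (illustrative only, adjust to your own file):

services:
  n8n:
    image: n8nio/n8n
    environment:
      # store data for manual (test) executions so they show up in the executions list
      - EXECUTIONS_DATA_SAVE_MANUAL_EXECUTIONS=true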

Then after your next test execution, could you check the status of said execution in the execution list?

Also, are you using a reverse proxy/load balancer/anything else sitting between your docker container and your browser by any chance?

Yes, I am using Traefik.

All of this worked before updating; before the newest n8n version, the only issue was the double webhook execution.

Oh, uh, the list magically appeared when I went back to the tab?

This looks like your docker container is unhealthy. Any chance you can take a look at its logs to see what’s wrong here?

Did this container perhaps restart after previously crashing? What’s the uptime like?
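If it helps to surface that state, a healthcheck could be added to the n8n service so Docker reports the container as healthy or unhealthy explicitly; a rough sketch (the endpoint and port are assumptions about a default setup, adjust as needed):

    healthcheck:
      # hypothetical probe of the editor/API port inside the container
      test: ["CMD-SHELL", "wget -qO- http://localhost:5678/healthz || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3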

[screenshot]
You may find this useful.

Hm, that’s odd indeed. Could you delete the existing Webhook node and create a fresh one from scratch, just to see if it might have an invalid timeout setting applied to it for any reason that could cause this?

[screenshot]
After deleting and recreating the webhook, setting it to POST again, and changing the Nexus webhook URL to the new one, I get the same error.

I tried executing your original webhook node on my end, but did not encounter the problem you have reported. Does this error come up for you as soon as you execute the workflow manually? Or do you need to send a webhook first?

I have to provoke the event; just starting the test flow doesn’t break anything. In the UI the execution goes from “waiting” to “executing”, and then the error from line 162 in packages/cli/src/TestWebhooks.ts fires off.

I’m checking the health of postgres* (not redis, sorry), but it’s hard to diagnose a problem when everything says it’s running fine as far as the daemon and persistent volume are concerned.

Could you share your docker compose file? Of course without any passwords/secrets, just to get an idea of what your configuration looks like and hopefully be able to reproduce the problem. TestWebhooks.ts hasn’t been touched since last year, so I am a bit puzzled where this error might come from at the moment.

version: '3.1'
services:
  postgres:
    image: postgres:11
    restart: always
    volumes:
      - /mnt/zfs-nce-qa/infswmnceunxd/n8n/data/init-data.sh:/docker-entrypoint-initdb.d/init-data.sh
    command: chown -R postgres:postgres /docker-entrypoint-initdb.d
  n8n:
    image: n8nio/n8n
    restart: always
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.n8n.entrypoints=web"
        - "traefik.http.routers.n8n.rule=Host(`redacted`)"
        - "traefik.http.services.n8n.loadbalancer.server.port=5678"
        - "traefik.http.services.n8n.loadbalancer.sticky.cookie=true"
    links:
      - postgres
    volumes:
      - /mnt/zfs-nce-qa/infswmnceunxd/n8n/data/.n8n:/root/.n8n
    command: /bin/sh -c "export WEBHOOK_URL=redacted && sleep 5; n8n start" 
networks:
  default:
    external:
      name: main_overlay


Redacting this would be exhausting but trust that I didn’t put emojis into any fields or anything like that.

Many thanks for confirming! To be honest, I am not sure what could be causing this behaviour, so I have reached out internally, hoping a person smarter than me might have an idea here.

I’ll keep banging my head against this too and will let you know if I find anything useful. Good luck and thanks for your time; sorry about this edge case.


Hey @dylan.moore, one of our engineers had a look at this but couldn’t see anything obvious here either unfortunately.

The one conceivable situation where timeout-related errors could occur is when a user clicks to manually execute the workflow (which starts a 2-minute timer) and the webhook request comes in just as that timeout kicks in after the 2 minutes (in which case the timeout fires and deletes the webhook).

When the webhook node then runs, it would try to delete the waiting webhook that had just been deleted by the aforementioned timeout.

So the assumption here is that either your instance is under super heavy load (and this is just a symptom of the problem) or that your database is a major bottleneck.

Perhaps you’d be able to test the behaviour on a fresh instance using another database, for example by simply running something like docker run -it --rm --name n8n -p 5678:5678 n8nio/n8n:0.168.2, which would spin up a Docker container using the default SQLite database?
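If compose is more convenient than docker run, a throwaway service along these lines would do the same thing (a sketch: with no DB_TYPE set, the instance falls back to the default SQLite database; the service name and port mapping are just examples):

version: '3.1'
services:
  n8n-test:
    image: n8nio/n8n:0.168.2
    ports:
      - "5678:5678"
    # no DB_TYPE configured, so this instance uses the default SQLite database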

As for the duplicate, this does sound like a misconfigured proxy or load balancer.
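One pattern worth checking on the proxy side is anything that retries requests when it thinks the first attempt failed, since that can deliver the same webhook twice. Purely as an example of what to look for (this is not taken from your configuration), a Traefik retry middleware attached to the n8n router would look roughly like this in the labels:

      labels:
        # hypothetical retry middleware; if something like this exists, it can re-send requests on connection errors
        - "traefik.http.middlewares.n8n-retry.retry.attempts=3"
        - "traefik.http.routers.n8n.middlewares=n8n-retry"

If nothing like that is attached to the n8n router, and only one router/service points at the container, the duplicate is more likely to originate upstream of Traefik.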

Fixed it by destroying the database and recreating its volumes. Something in the update must have gotten snagged.

Unfortunately I’m still getting two WebEx messages, and since Traefik is acting as little more than a convenient forwarding name service, I’m not confident how to proceed with diagnosing it.

And as we can see, it’s not duplicitous.