[HELP NEEDED] Webhooks Randomly Stop - Require Workflow Toggle to Resume (Not Sustainable)

Problem Description

I'm running into an n8n webhook reliability issue: webhooks randomly stop being processed and only resume after manually toggling the affected workflows off and on. This happens repeatedly and is not manageable in a production environment.

Important Context: I implemented a webhook forwarding architecture (Cloudflare + Upstash) specifically to try to solve n8n’s webhook reliability issues, but the core problem persists.

My Setup

  • Current Architecture: Zoom → Cloudflare Worker → Upstash Redis → n8n
  • Why This Architecture: Originally had direct webhooks (Zoom → n8n) but those were unreliable
  • Hosting: Self-hosted n8n via Coolify
  • Temporary Fix: Disabling and re-enabling workflows restores webhook flow
  • Problem: This workaround is not sustainable and doesn’t address the core n8n issue

The Pattern

What Happens:

  1. :white_check_mark: Webhooks work normally for hours/days
  2. :cross_mark: Suddenly stop coming through entirely (no new executions)
  3. :cross_mark: Complete silence - not even failed attempts in logs
  4. :white_check_mark: Toggling workflows off→on immediately fixes it
  5. :counterclockwise_arrows_button: Cycle repeats unpredictably

Evidence:

  • Upstash logs show: Successful executions, then complete silence
  • Cloudflare + Upstash are healthy: Forwarding infrastructure working correctly
  • n8n is responsive: UI works, other workflows function
  • Root cause confirmed: n8n stops processing webhooks (this happened before the forwarder too)

Current Workaround (Unsustainable)

When webhooks stop:
1. Disable affected workflows
2. Re-enable workflows  
3. Webhooks immediately resume

This works but requires constant monitoring and manual intervention!
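
In principle the toggle could probably be scripted against n8n's public REST API. Below is only a rough sketch of that idea, assuming the public API is enabled and an API key has been created; the env var names and the workflow id are placeholders for my setup:

```typescript
// Rough sketch: replicate the manual off→on toggle via n8n's public REST API.
// Assumes the public API is enabled and an API key exists; N8N_URL, N8N_API_KEY
// and WORKFLOW_ID are placeholders, not values from my actual deployment.
const N8N_URL = process.env.N8N_URL ?? "https://n8n.example.com";
const API_KEY = process.env.N8N_API_KEY ?? "";
const WORKFLOW_ID = process.env.WORKFLOW_ID ?? "";

async function call(path: string): Promise<void> {
  const res = await fetch(`${N8N_URL}/api/v1${path}`, {
    method: "POST",
    headers: { "X-N8N-API-KEY": API_KEY },
  });
  if (!res.ok) throw new Error(`POST ${path} failed with ${res.status}`);
}

async function toggleWorkflow(): Promise<void> {
  // Same effect as flipping the switch in the UI: deactivate, then reactivate.
  await call(`/workflows/${WORKFLOW_ID}/deactivate`);
  await call(`/workflows/${WORKFLOW_ID}/activate`);
  console.log(`Toggled workflow ${WORKFLOW_ID} at ${new Date().toISOString()}`);
}

toggleWorkflow().catch((err) => {
  console.error("Toggle failed:", err);
  process.exit(1);
});
```

Even if that works, it only automates the band-aid; it doesn't explain why n8n stops processing webhooks in the first place.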

Architecture Details

Current Flow:

Zoom Webhook → Cloudflare Worker → Upstash Redis Queue → n8n Polling

Why This Setup: Originally tried direct webhooks (Zoom → n8n) but n8n kept losing webhooks. Implemented the forwarder as a reliability buffer, but n8n is still the weak link.

Key Issue: The problem is specifically with n8n’s webhook processing - it stops consuming from the queue and requires workflow toggling to resume.
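
For reference, the Worker itself is a very thin forwarder. This is a simplified sketch of the idea rather than my exact code; the binding names (UPSTASH_URL, UPSTASH_TOKEN) and the zoom:webhooks list key are placeholders:

```typescript
// Simplified sketch of the Cloudflare Worker forwarder (not the exact production code).
// It pushes the raw Zoom payload onto an Upstash Redis list via Upstash's REST API;
// an n8n workflow then pops items off that list on a schedule.
export interface Env {
  UPSTASH_URL: string;   // e.g. https://<db>.upstash.io (placeholder binding)
  UPSTASH_TOKEN: string; // Upstash REST token (placeholder binding)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("Method not allowed", { status: 405 });
    }

    const payload = await request.text();

    // LPUSH the payload onto the queue key; the request body is used as the
    // final command argument by Upstash's REST API.
    const res = await fetch(`${env.UPSTASH_URL}/lpush/zoom:webhooks`, {
      method: "POST",
      headers: { Authorization: `Bearer ${env.UPSTASH_TOKEN}` },
      body: payload,
    });

    if (!res.ok) {
      return new Response("Queue write failed", { status: 502 });
    }
    // Acknowledge quickly so Zoom doesn't retry or disable the endpoint.
    return new Response("ok", { status: 200 });
  },
};
```

Because the queue buffers everything, nothing should be lost even if n8n is briefly down; the failure mode I'm seeing is that n8n simply stops consuming.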

Theories on Root Cause (n8n-Specific)

1. n8n Webhook/Polling Engine Issues

  • n8n stops polling Upstash Redis queue after time/load?
  • Webhook processing engine getting stuck internally?
  • Workflow execution threads dying but not restarting?

2. n8n Database Issues (SQLite)

  • SQLite locking causing webhook processing to halt?
  • Database connection issues preventing queue consumption?
  • Should I switch to PostgreSQL for better reliability?

3. n8n Memory/Resource Issues

  • Memory leaks causing webhook engine to fail?
  • Resource exhaustion stopping polling threads?
  • Multiple workflows causing internal conflicts?

4. n8n Internal State Problems

  • Workflow registration state getting corrupted?
  • Internal queues/buffers filling up and not clearing?
  • Thread pool exhaustion in webhook processor?

Note: Cloudflare and Upstash are confirmed working - this is specifically an n8n reliability issue.

What I Need Help With

  1. n8n root cause identification: Why does n8n’s webhook processing randomly stop?

  2. n8n monitoring strategies: How to detect when n8n stops processing webhooks?

  3. n8n configuration fixes: Settings/tweaks to prevent this issue?

  4. Database recommendations: Will PostgreSQL solve this vs SQLite?

  5. Automated n8n recovery: Can I safely script the workflow toggle workaround (along the lines of the sketch above)?

Environment Details

  • n8n: Self-hosted via Coolify, default SQLite database
  • Cloudflare: Worker with webhook forwarding logic
  • Upstash: Redis as webhook queue/buffer
  • Multiple workflows: Some sharing webhook endpoints
  • Load: ~200 webhooks per day

Questions for the Community

  • Has anyone seen n8n webhook processing randomly stop, requiring workflow restarts to recover?
  • Are there known n8n reliability issues with webhook/polling architectures?
  • Is SQLite the culprit? Should I switch to PostgreSQL for webhook reliability?
  • Any n8n configuration tweaks to prevent webhook processing from dying?
  • Can I monitor n8n’s internal state to detect when webhook processing stops?
  • Any way to auto-restart workflows when n8n stops processing webhooks?

Context: I already tried working around this with external infrastructure (Cloudflare + Upstash), but the issue is clearly within n8n itself.

This is becoming a critical reliability issue for production use - any insights would be hugely appreciated!

Tags: #webhooks #reliability #zoom #cloudflare #upstash #production #debugging


Okay, so nice implementation on the Cloudflare and Upstash side, but I'm wondering what's causing this too; I've not faced it. You could set up some alerts in Grafana: N8n + Grafana Full Node.js Metrics Dashboard (JSON Example Included!)

Say, if you notice idleness, I'm sure you can implement some other checking methods too, but hopefully the dashboard helps.
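
Something along those lines could double as a simple idleness watchdog next to the dashboard. Just a rough sketch, assuming the public API is enabled and you've created an API key; the env var names, the workflow id and the one-hour threshold are placeholders:

```typescript
// Rough idleness-watchdog sketch (placeholder names; assumes n8n's public API + an API key).
// If no execution has started for the workflow within MAX_SILENCE_MS, raise an alert
// (or chain into the toggle script from the original post).
const N8N_URL = process.env.N8N_URL ?? "https://n8n.example.com";
const API_KEY = process.env.N8N_API_KEY ?? "";
const WORKFLOW_ID = process.env.WORKFLOW_ID ?? "";
const MAX_SILENCE_MS = 60 * 60 * 1000; // 1 hour with zero executions = suspicious

async function lastExecutionAgeMs(): Promise<number | null> {
  const res = await fetch(
    `${N8N_URL}/api/v1/executions?workflowId=${WORKFLOW_ID}&limit=1`,
    { headers: { "X-N8N-API-KEY": API_KEY } },
  );
  if (!res.ok) throw new Error(`executions query failed: ${res.status}`);
  const body = (await res.json()) as { data: Array<{ startedAt: string }> };
  if (body.data.length === 0) return null; // no executions recorded at all
  return Date.now() - new Date(body.data[0].startedAt).getTime();
}

async function check(): Promise<void> {
  const age = await lastExecutionAgeMs();
  if (age === null || age > MAX_SILENCE_MS) {
    // Hook your alerting in here (Grafana, Slack, etc.).
    console.warn(`Workflow ${WORKFLOW_ID} looks idle (last run: ${age ?? "never"} ms ago)`);
  } else {
    console.log(`OK: last execution ${Math.round(age / 1000)}s ago`);
  }
}

check().catch((err) => console.error("Watchdog check failed:", err));
```

Run it on a cron every few minutes and you'd at least know when it goes quiet, instead of finding out from missed Zoom events.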

I'm wondering about your setup. I see you mentioned SQLite as the DB; it could be a bottleneck in the system, and yes, I would recommend switching to Postgres. My next few questions:

Do you have webhook nodes and worker nodes, or just a single-instance setup at the moment?

If your system is being overloaded it could bug out webhook processing, and separating the main node from the webhook load is possible.

Webhook link above

If you still see issues after sending the traffic to the webhook nodes, that would suggest it's not just a bottleneck from a single instance. But tbh, ~200 webhook calls a day suggests the error is elsewhere. Do you see any errors in the logs around the time it stops? That would help us dig deeper, as we may see stack traces or some error that points at the cause.

You could also try enabling debug logs.

I don't see this as a common issue on the forum; it could also be a network-side issue with the host. But hopefully the above helps you dig deeper into the issue.

Hope this helps,

Hey @Declup, this :point_up: line made me pause; can you expand on the use case here? Do multiple workflows share the exact same webhook path, i.e. do you expect multiple workflows to trigger from just one request?

If so, that might be the source of the issue, as webhook paths must be unique per workflow; otherwise only the last activated workflow will trigger. This was enforced with a fix in 1.91.0 here.

Hello my friend, the same is happening with me: it suddenly stops after working, without any error logging (see Webhook triggers not reliable after n8n restart).
If I try to deactivate and activate the workflow, is it going to work?

@AI_Blueprint can you please share more about your n8n setup? Are you self-hosting or using n8n cloud? What n8n version? Are you talking about a webhook-based trigger for an app or the n8n webhook trigger?

Have you tried to set up a workflow specifically for errors? And are you sure the server has enough memory and CPU to handle all your automations? Are you sure the entire workflow doesn't go into some strange loop behavior?

Yes, I'm sure it's not related to my setup.

Here’s my complete setup:

  • VPS: Hostinger cloud VPS
  • Processor: 2 vCPUs
  • OS: Ubuntu 22.04
  • Docker Compose: Running all services
  • n8n: Queue mode
      • 1 main (n8n-main)
      • 5 workers (n8n-worker)
      • 2 webhook workers (n8n-webhook-worker)
  • Redis + PostgreSQL
  • Reverse proxy: Traefik (Let’s Encrypt SSL)
  • Secrets: Managed via .env files
  • No external gateway/queue yet (considering for future scale)

I mean, have you set up a workflow specifically for errors, to capture errors from the workflow?

How much RAM do you have?

Have you activated task runners? Task runners | n8n Docs