Hi team,
We are running n8n in queue mode on AWS EKS (Kubernetes).
Our setup includes three containers: main, worker, and webhook.
The Main container uses a Persistent Volume Claim (PVC) for data storage.
By design, the Main container cannot run multiple replicas, since it handles the frontend UI, REST API, and internal orchestration.
Our chatbot widget is configured to use the frontend (Main) URL. The problem is:
Whenever the Main pod restarts or gets replaced (for example during deployment or node rotation), the chatbot widget starts returning errors because the Main service becomes temporarily unavailable.
Because the Main pod mounts a PVC, replacing it on EKS takes somewhat longer, which extends the downtime.
This effectively creates a single point of failure for chatbot traffic.
Questions / Help Needed
- Is there a recommended way to achieve zero downtime or high availability for the Main container in queue mode when running on Kubernetes?
- Can the chatbot widget be configured to use the Webhook service directly instead of the Main URL?
- Is there a best-practice architecture for HA setups on EKS where frontend-based integrations (like chatbot widgets) keep working even when the Main pod restarts?
- Is there any workaround to reduce pod replacement downtime when PVCs are involved?
Any guidance or recommendations would be greatly appreciated.