Hi n8n team,
I’m experiencing executions getting stuck in a permanent Waiting state when using the Chat Trigger with Response Mode = “responseNodes” and then sending a message via a Chat node after a timeout.
Environment
n8n version: 2.9.4
Deployment: self-hosted (Docker)
Database: Postgres
Redis: enabled (used for queue/buffer)
Reverse proxy: Caddy
Timezone: Europe/Madrid
Workflow setting Timeout Workflow enabled (e.g. ~1h45m), but it does not terminate these stuck executions.
Expected behavior
If the user closes the browser tab (disconnects) before replying to a Chat node (sendAndWait):
the limitWaitTime should resume the workflow,
the “timeout path” should run,
sending a “session expired” message should either fail fast or be skipped,
and the execution should finish (or error) instead of staying in Waiting forever.
Also, the workflow-level Timeout After should eventually stop the execution.
Actual behavior
When the user closes the tab before responding:
Chat Trigger receives the initial message.
A Chat node runs with operation = sendAndWait and limitWaitTime set (e.g. 30s).
After the limitWaitTime resumes, the workflow routes into a “Timeout” branch.
In that branch, a Chat node (intended as sendMessage) attempts to send “Session expired” text.
The execution becomes stuck and remains Waiting indefinitely. In the UI it looks like the node is waiting for the session to close / connection to end (e.g. “waiting for session to close”).
The workflow-level Timeout Workflow / Timeout After does not stop it.
Notes / reproduction hint
This happens specifically when the client closes the browser tab before responding to the approval buttons / response.
I can reproduce reliably by starting the chat, triggering the sendAndWait question, then closing the tab before answering.
If helpful, I can share a minimal exported workflow JSON and an execution ID showing it stuck.
Questions
Is this a known issue with chat response nodes / session lifecycle handling?
Should “sendMessage” fail quickly when the client is disconnected rather than keeping the execution in Waiting?
Is there a recommended pattern to avoid stuck executions in this scenario?
Yeah, this looks like a legit bug: the Chat node doesn’t detect that the client disconnected, so it just sits there trying to deliver the message to a session that no longer exists. Also worth knowing: the workflow-level Timeout After setting only counts active execution time, not time spent in a “waiting” state, which is why it never kills these stuck executions. I’d file this on GitHub if there isn’t an issue already, because n8n should really fail fast on sendMessage when the SSE/WebSocket connection is gone instead of hanging forever. In the meantime, the only way to clear those stuck executions is to delete them manually through the UI or the API. Restarting n8n will also clear them, but obviously that’s not ideal.
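For the "delete them through the API" route, here is a minimal Python sketch of what a cleanup helper could look like. It assumes the n8n public API is enabled and mounted at /api/v1, that it accepts an X-N8N-API-KEY header, and that GET /executions supports a status=waiting filter and returns items with id, status and startedAt fields. Please check all of that against the API reference for your n8n version. The function names (select_stale, cleanup) are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone


def parse_started_at(value: str) -> datetime:
    # Replace a trailing "Z" so this also parses on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_stale(executions, now, max_age):
    """Return ids of executions still 'waiting' that started more than max_age ago."""
    stale = []
    for item in executions:
        if item.get("status") != "waiting" or not item.get("startedAt"):
            continue
        if now - parse_started_at(item["startedAt"]) > max_age:
            stale.append(item["id"])
    return stale


def api_request(base_url, api_key, method, path):
    # Assumption: public API under /api/v1, authenticated via X-N8N-API-KEY.
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1{path}",
        method=method,
        headers={"X-N8N-API-KEY": api_key, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read() or b"{}")


def cleanup(base_url, api_key, max_age=timedelta(hours=2)):
    # Only the first page of results is handled here; pagination is left out.
    query = urllib.parse.urlencode({"status": "waiting", "limit": 100})
    page = api_request(base_url, api_key, "GET", f"/executions?{query}")
    now = datetime.now(timezone.utc)
    for execution_id in select_stale(page.get("data", []), now, max_age):
        api_request(base_url, api_key, "DELETE", f"/executions/{execution_id}")
```

The selection logic is kept separate from the HTTP calls so it can be tried on sample data before pointing cleanup() at a real instance. Pick max_age well above your longest legitimate wait so healthy executions are not deleted.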
Hi @ManyIntegrations Welcome to the community!
What I would suggest is to set a correct, working webhook URL in Caddy and make sure the WebSocket headers are passed through properly. Also, your workflow seems a bit off. What exactly is the goal of this workflow?
Hi @ManyIntegrations, welcome to the n8n community!
My recommendation is to redesign the workflow so that it does not depend on the active chat session after a timeout, explicitly ends the execution in the timeout branch, stores the session state externally, and separates the interactive chat step from the final processing logic.
Hi @tamy.santos and @Anshul_Namdev thank you for your reply. This is just a short example to illustrate the behavior I’m asking about. The real workflow works like a chatbot, where the user answers each question only by pressing buttons using the “send and wait for response” action.
Everything works fine as long as the user completes the whole flow (pressing all the buttons until the end). The timeout also works correctly as long as the user keeps the window open.
The problem appears when the user closes the window (before finishing all the process). Even after the timeout has passed, in the execution panel the workflow remains in a “waiting” state until I manually stop it.
I also tried a “watchdog with a Wait”, but the Timeout message node gets stuck first, so the Wait node never executes.
I want the timeout message because I need to end the execution and inform the user that the session has expired and that they need to open it again or refresh the page (F5). This only makes sense if the user is still there. But of course, if the user closes the window, there’s no point in sending a message — and there’s no way for me to know that.
This also happens with the chat window (where I create the workflow) in the same execution. Everything is fine if I do not close the window.
A very simple way to solve this is to use webhooks with custom HTML pages or n8n forms: when one form or input is submitted, it calls another webhook, which calls the next one, until the whole flow has ended. This might sound lengthy, but it is really a production-level solution for HITL nodes. There are ways to handle timeouts, but why wait for a user at all? Execute only when the user interacts; otherwise nothing happens.
Interesting approach @Anshul_Namdev
I found a very simple solution: use the “Send and wait for response” option in the Timeout Chat Node as well (the one that displays “Session expired for inactivity”).
This way, if the user closes the window before replying to the previous Send and wait for response node, the Timeout node will be triggered. And since that Timeout node also won’t receive a response, the flow will end after this final timeout.
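Why this ends the execution can be shown with a toy model in plain Python asyncio. This is not n8n code and does not reflect n8n internals. It only illustrates the idea that every step in the timeout branch is a bounded wait, so the flow finishes even if nobody ever answers. The names wait_for_reply and run_chat are hypothetical.

```python
import asyncio


async def wait_for_reply(reply: asyncio.Future, limit_s: float):
    """Model of 'send and wait' with a wait limit: the reply, or None on timeout."""
    try:
        return await asyncio.wait_for(reply, timeout=limit_s)
    except asyncio.TimeoutError:
        return None


async def run_chat(user_present: bool) -> str:
    loop = asyncio.get_running_loop()
    reply = loop.create_future()
    if user_present:
        # The user presses a button shortly after the question is shown.
        loop.call_later(0.01, reply.set_result, "approve")

    answer = await wait_for_reply(reply, limit_s=0.05)
    if answer is not None:
        return f"completed: {answer}"

    # Timeout branch: "Session expired" is also sent as a bounded wait,
    # so a closed tab cannot hold the flow open past this second limit.
    await wait_for_reply(loop.create_future(), limit_s=0.05)
    return "ended after final timeout"
```

With a user present the first wait returns the answer. With the tab closed both waits expire and the function still returns, which mirrors the behavior described above.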
@ManyIntegrations That is a nice one too, but I would still recommend the longer approach I suggested, since it ensures that n8n is not sitting in a wait (HITL) state and keeping the flow running for nothing. Your approach is good too!
Thank you @Anshul_Namdev
Do you have an example of this solution?
I imagine I’ll need to build a front end with buttons, textboxes, and several workflows that call each other.
Right now, my entire workflow relies on execution.id instead of session.id. Using the send and wait for response node, I can complete a full interaction (such as booking or appointment flows) within a single execution. I find this approach safer, and it also makes debugging and troubleshooting much easier.
Regarding performance: the maximum execution time for a workflow is around 2 minutes, which is the timeout I use when no answer is received. Even if thousands of workflows run simultaneously for those 2 minutes, a regular VPS can handle it without issues, since waiting does not consume CPU.
@ManyIntegrations I currently don’t have the one I made. Still, you should consider it: webhooks combined with HTML let you build your own system without keeping something in a loop for nothing. HITL is good when a response is certain, but in production these workflows often time out and error with HITL. Your approach is really fine and is good for most use cases. Using multiple webhooks to build a single flow sounds like a pain, but it works reliably without causing many errors.
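Since no example was shared, here is a rough Python sketch of the idea behind the webhook-chained approach: each user action is handled by one short call, and the conversation state lives in an external store keyed by session id instead of inside a waiting execution. Everything here is hypothetical (the FLOW steps, the Session class, handle_request), and the plain dict stands in for whatever external store you would actually use, such as Postgres or Redis.

```python
from dataclasses import dataclass, field

# Hypothetical flow: each entry is (answer key, question shown to the user).
FLOW = [
    ("service", "Which service do you need?"),
    ("date", "Which day works for you?"),
    ("confirm", "Confirm the booking?"),
]


@dataclass
class Session:
    step: int = 0
    answers: dict = field(default_factory=dict)


def handle_request(store: dict, session_id: str, answer=None) -> dict:
    """Handle one user action; nothing stays running between calls."""
    session = store.setdefault(session_id, Session())

    # Record the answer to the current question and advance one step.
    if answer is not None and session.step < len(FLOW):
        key, _question = FLOW[session.step]
        session.answers[key] = answer
        session.step += 1

    # All questions answered: drop the stored state and return the result.
    if session.step >= len(FLOW):
        store.pop(session_id, None)
        return {"done": True, "answers": session.answers}

    return {"done": False, "question": FLOW[session.step][1]}
```

If the user closes the tab halfway through, the only leftover is a stored session record, which can be expired by age. The trade-off is the one raised above: debugging spans several short executions tied together by a session id rather than a single execution id.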
The approach you settled on is actually the cleanest native solution for this: using a “Send and wait” node in the timeout branch effectively self-terminates the orphaned execution path.
One thing worth adding for anyone hitting this: the core reason these get permanently stuck is that n8n’s execution engine treats the “Waiting” state differently from “Running”. The workflow-level timeout setting only counts active execution time, not idle wait time, so a stuck HITL execution can sit there essentially forever.
The workaround that works at the infrastructure level (for self-hosted): add a cleanup script that queries your n8n database for executions in “waiting” status older than a threshold and marks them as stopped. Something like:
-- Postgres: find executions stuck in waiting > 2 hours
-- Column names assume n8n's camelCase schema (must be quoted in Postgres);
-- verify with \d execution_entity on your version before running.
SELECT id, "workflowId", "startedAt"
FROM execution_entity
WHERE status = 'waiting'
  AND "startedAt" < NOW() - INTERVAL '2 hours';
-- Mark them canceled (use carefully, test first)
UPDATE execution_entity
SET status = 'canceled', "stoppedAt" = NOW()
WHERE status = 'waiting'
  AND "startedAt" < NOW() - INTERVAL '2 hours';
You can wrap this in an n8n workflow itself: Schedule Trigger → Postgres node → done. Run it every hour as a maintenance task. For production chatbot flows where session drops are frequent, this prevents gradual accumulation of zombie executions that can eventually slow down the queue.
For proper tracking, the n8n GitHub issue tracker is the place to +1 the disconnect detection request so the team prioritizes it.