We have fallen foul of the bug that stops the internal healthcheck of the containerised version of n8n working well on “serverless” hosts such as Google Cloud Run or AWS Fargate.
The workaround (attached) helps, but it doesn’t fully fix it: the check can still time out after 5 seconds, which causes the “offline” message to appear. Does anyone else have this issue, or any tips? (We have also raised a ticket.)
Hi @Neil_Carmichael
I see this more as a mismatch between the serverless model and n8n. In production, I’d avoid scale-to-zero on the main instance and use queue mode with Redis and separate workers. This tends to be much more stable.
The 5-second timeout is almost certainly your ECS task definition healthcheck being too aggressive. I would bump the healthcheck timeout to at least 10-15 seconds and increase the interval to 30s. Also make sure you're hitting /healthz, not /healthz/readiness, since the readiness one checks the DB connection too and can be slower. Finally, set healthCheckGracePeriodSeconds on your ECS service to something like 120 so Fargate gives n8n time to actually boot before it starts killing tasks. That grace period specifically catches a lot of people on Fargate.
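A rough sketch of those settings, written as the Python dicts you might hand to boto3's `register_task_definition` (inside a container definition) and `update_service`. The port 5678, the `wget` probe command, and the `startPeriod` value are my assumptions, not something confirmed in this thread, so adjust them to your image.

```python
# Sketch only: values mirror the suggestions above.
# The port, the probe command, and startPeriod are assumptions.

def n8n_container_health_check(port: int = 5678, path: str = "/healthz") -> dict:
    """Build the `healthCheck` block of an ECS container definition."""
    return {
        # Assumes wget is available inside the container image.
        "command": [
            "CMD-SHELL",
            f"wget -q --spider http://localhost:{port}{path} || exit 1",
        ],
        "interval": 30,      # seconds between checks
        "timeout": 15,       # seconds before a single check counts as failed
        "retries": 3,        # consecutive failures before marking unhealthy
        "startPeriod": 120,  # boot window in which failures are not counted
    }


def n8n_service_settings(grace_seconds: int = 120) -> dict:
    """Service-level grace period so new tasks get time to boot."""
    return {"healthCheckGracePeriodSeconds": grace_seconds}
```

Both functions only build plain dicts, so you can print and review them before passing them to any AWS call.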
Interesting. The issue isn’t at startup but when it’s doing a slightly heavier step in a flow: the five seconds is exceeded and it shows users the dreaded “offline” message despite being up.
I am going to try EXECUTIONS_PROCESS to see if having the frontend GUI and backend tasks in separate processes helps the situation.
That’s because n8n sends health checks from the client (browser). So when the instance is under any significant load (an execution is running or the flow has dozens of nodes), the checks won’t complete within the 5-second window.
The healthcheck grace period is definitely key, but also double-check that you’re hitting /healthz, not /healthz/readiness. Readiness checks the DB connection, and under load that's what usually times out on Fargate. We switched to /healthz and bumped the timeout to 20 seconds, and the offline message stopped showing.
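If you want to see the difference between the two endpoints on your own instance, here is a small standard-library sketch that times one GET against each path. The function names and the 20-second default are mine, and the base URL is whatever your instance uses.

```python
import time
import urllib.error
import urllib.request


def time_endpoint(url: str, timeout: float = 20.0) -> tuple[bool, float]:
    """Return (ok, elapsed_seconds) for a single GET against url."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        ok = False
    return ok, time.monotonic() - start


def compare(base_url: str,
            paths=("/healthz", "/healthz/readiness"),
            timeout: float = 20.0) -> dict:
    """Time each health path once and return {path: (ok, seconds)}."""
    base = base_url.rstrip("/")
    return {path: time_endpoint(base + path, timeout) for path in paths}
```

Calling `compare("http://localhost:5678")` while a heavy workflow is running should show whether the readiness path is the slower one in your setup.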
Ah OK, so the issue is during execution, not startup. That changes things. Try setting N8N_SKIP_HEALTH_CHECKS=true and handling the healthcheck externally through your ALB target group instead. That way n8n’s internal check isn’t competing with your workflow executions for resources.
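For the ALB side, a minimal sketch of the health check settings as keyword arguments for boto3's `elbv2` `modify_target_group` call. The thresholds, the 20-second timeout, and the ARN you pass in are placeholders I picked for illustration. Keeping the timeout below the interval is a sanity check in this sketch so that checks do not overlap.

```python
# Sketch only: thresholds and timings are illustrative assumptions.

def alb_health_check_kwargs(target_group_arn: str,
                            path: str = "/healthz",
                            interval: int = 30,
                            timeout: int = 20) -> dict:
    """Build kwargs for elbv2.modify_target_group(**kwargs)."""
    if timeout >= interval:
        raise ValueError("timeout should be shorter than the interval")
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckPath": path,
        "HealthCheckIntervalSeconds": interval,
        "HealthCheckTimeoutSeconds": timeout,
        "HealthyThresholdCount": 2,    # successes needed to mark healthy
        "UnhealthyThresholdCount": 5,  # failures tolerated before unhealthy
        "Matcher": {"HttpCode": "200"},
    }
```

With boto3 installed, this would be applied with something like `boto3.client("elbv2").modify_target_group(**alb_health_check_kwargs(your_arn))`.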