Run n8n on stateless containers

I have been trying to deploy n8n on stateless containers (Google Cloud Run in particular) over the last few days, but without much success so far. (There is also a GitHub issue for it: Section about state in containers · Issue #609 · n8n-io/n8n-docs · GitHub, but I was forwarded to the forum.)

The setup looks like this:

  • Postgres database on Cloud SQL
  • The N8N_ENCRYPTION_KEY variable set to a fixed value, so encryption/decryption works the same across containers
  • Basic Auth
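
For reference, I deploy it roughly like this (a sketch, not my exact command; the image is a copy of n8nio/n8n pushed to my project registry, and the host, credentials, and key are placeholders):

  gcloud run deploy n8n \
    --image gcr.io/<project>/n8n \
    --region us-east1 \
    --set-env-vars "DB_TYPE=postgresdb,DB_POSTGRESDB_HOST=<cloud-sql-host>,DB_POSTGRESDB_DATABASE=n8n,DB_POSTGRESDB_USER=n8n,DB_POSTGRESDB_PASSWORD=<secret>,N8N_ENCRYPTION_KEY=<fixed-key>,N8N_BASIC_AUTH_ACTIVE=true,N8N_BASIC_AUTH_USER=<user>,N8N_BASIC_AUTH_PASSWORD=<secret>"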

If I run this setup with exactly one container, it works flawlessly, but as soon as I allow Cloud Run to scale the number of instances, I get the following problems:

  • Test mode doesn’t work reliably, because requests might be routed to a different container than the one serving the website, which is “waiting to receive request”. This problem, however, would be tolerable.
  • Of course cron jobs don’t work, since all containers are terminated when no more requests are incoming.
  • This error often occurs:
    Error: {"code":404,"message":"The requested webhook \"POST a32…\" is not registered.
    Maybe some problem with the activation of newly spun up instances?

So there must be some kind of problem with several n8n instances sharing the same Postgres database? If so, I don’t see why: all they should do with the database is load the workflows and write the execution logs, and that by itself shouldn’t cause any interference between the instances.

As @krynble says in the GitHub issue:

Now if your workflows are based only on webhooks (i.e. external http requests) then you should have no problem with multiple n8n instances sharing the same database. This is the only situation when you can say that n8n is “stateless”.

So why doesn’t it work then? :sweat_smile:

All right, let’s continue!

Ok, now I know what is happening.

If you have multiple instances, whenever any of them is shut down it will de-register the webhook endpoint.

In order to avoid this you must set N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true as an environment variable. You can see the description for this flag here: n8n/index.ts at 1f71e69ed881142c417b6e12533783ac24cc2e45 · n8n-io/n8n · GitHub
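
On an already-deployed Cloud Run service, something like this should do it (a sketch; --update-env-vars adds the variable without replacing the existing ones):

  gcloud run services update n8n \
    --update-env-vars "N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true"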

Still, I would recommend using only 1 persistent n8n instance (started via npm run start or n8n start) and multiple webhook and worker processes, as our documentation suggests.
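
A minimal sketch of that recommended setup (assuming a reachable Redis instance; all processes must share the same database and N8N_ENCRYPTION_KEY):

  # main instance: serves the UI and schedules executions
  EXECUTIONS_MODE=queue QUEUE_BULL_REDIS_HOST=redis n8n start

  # one or more webhook processes: only handle incoming webhook calls
  EXECUTIONS_MODE=queue QUEUE_BULL_REDIS_HOST=redis n8n webhook

  # one or more workers: pull executions from the queue
  EXECUTIONS_MODE=queue QUEUE_BULL_REDIS_HOST=redis n8n worker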

1 Like

In order to avoid this you must set N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true as an environment variable.

Awesome, that’s what I’m talking about :raised_hands::grin:. I’ll give it a try, thanks!

Still I would recommend you use only 1 persistent n8n instance

I’m especially interested in this setup to be able to scale down to 0, since then I could run n8n basically for free if there were only a few webhook calls per day. (And yet still be able to scale up to infinity … or at least to the limits of the Postgres database.)

I didn’t think n8n was officially supported on Stateless Containers :thinking:

1 Like

I see, that is a very interesting approach.

Please bear in mind that you cannot have workflows triggered by anything other than http requests in this case; otherwise you could have duplication of work, as I mentioned before.

Other than this, you should be on the safe side =)

Good luck and let us know of your results!

Hey @Jon, it is not officially supported, but with some hacks it is possible.

There are very strong constraints on this kind of usage, which is why we do not officially support it: it can cause some problems (mainly duplication of work), so we prefer the “official scaling methods”.

For webhook calls, with the above-mentioned flag you can run n8n as nearly stateless.

1 Like

That is interesting to know, I am not sure if I will ever have that requirement as I prefer the idea of always having a main instance running.

1 Like

Hey @krynble, I am using Cloud Run stateless containers with the OP’s setup. Even after setting N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true I still get the OP’s error:

This error often occurs:
Error: {"code":404,"message":"The requested webhook \"POST a32…\" is not registered.

I also keep one instance running in a VM. The error is still happening. I don’t know if @ad-si has solved this issue.

I read the official docs and watched your YouTube tutorial on scaling n8n. Unfortunately, Cloud Run only triggers containers based on HTTP requests, so I can’t set up workers in queue mode.

Let me know if there is any other information that you need me to clarify.

Thank you!

Hey @darlanm welcome to our community forums!

Very strange indeed.

What happens under the hood on n8n is the following:

  1. Once you activate a workflow that contains any webhook, a new entry is created in a database table called webhook_entity (with your table prefix, if you use that feature)
  2. Whenever an http request to /webhook/* is received, it gets parsed and checked against the above-mentioned table to determine which workflow it belongs to (you can also inspect this table directly; see the sketch after this list)
  3. If no entry is found in the webhook_entity table, you get the The requested webhook ... is not registered error
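
A quick way to see what is currently registered there (a sketch, assuming the default table name without a prefix; column names may differ between versions):

  psql "$DATABASE_URL" \
    -c 'SELECT "workflowId", "method", "webhookPath" FROM webhook_entity;'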

On a side note: the webhook_entity record is removed whenever n8n shuts down for any reason (which is why we set N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true, so that it remains there) and also when you disable the workflow manually.

So if your workflow worked for a while and then stopped working, one of these conditions happened:

  • The instance responding to the http request is not connected to the same database (maybe the environment variables are wrong and it is connecting to another database, perhaps even falling back to local SQLite?); see the sketch after this list for one way to double-check
  • N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true is not set on one or more of the instances, so whenever one of them shuts down it removes the record from the database
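
For the first case, one way to double-check which database a Cloud Run service actually points at (a sketch; service name and region are placeholders):

  gcloud run services describe n8n --region us-east1 --format yaml \
    | grep -A 1 'name: DB_'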

For the second case, you can enable logging (instructions here: Logging in n8n | Docs), set its level to debug or at least verbose, and watch for a line saying Call to remove all active workflows received (removeAll). If this ever happens, it means that n8n is somehow not following the N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN flag and we should investigate.
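
The relevant environment variables for that would be (console output is what Cloud Run’s log viewer picks up):

  N8N_LOG_LEVEL=debug
  N8N_LOG_OUTPUT=console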

Lastly, I would like to clarify one question: Do your workflows eventually start working again with no manual intervention or do they stop and only work again if you manually reactivate?

I’m also still having issues, but I can’t really pin them down. By now I’m running a single container with the N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true flag and I have a monitor which runs a dummy workflow every 5 minutes. The monitor works flawlessly, but I still see workflow executions displayed as Unknown and get not registered warnings every now and then.
(Cloud Run may restart the container every now and then, but there seems to be no correlation between container restarts and the stated problems.)
Looks a lot like a bug in n8n to me :sweat_smile:

Definitely, I will try running a similar setup and see how it works and if I can find a reason. In all the deployment types we have tested so far, everything was working fine.

I will ask you, if possible, to enable logging on n8n and set the logging level to verbose or debug.

This will help us identify if and maybe when the workflows are being deactivated.

Hey @krynble and @ad-si Thank you for the responses.

@ad-si I did use Cloud Scheduler to ping a webhook every minute (to keep an instance warm). But it stops working when many containers spin up, since the ping only hits a random one of them.

@krynble Okay, I will turn on the debug level and will let you know.

My initial suspicion went away when you mentioned 2).

Forgive my ignorance. Is it possible that n8n responds to the webhook as soon as the container is ready, before checking any webhook entity in the database? (Something like a race condition?)

Regarding your last question: about 40-60% start working again with no intervention; 100% work again after I manually reactivate.

This shouldn’t happen @darlanm

n8n only starts accepting http requests after the database connection has been initialized. This can be seen here where on line 191 we make sure the DB has been initialized and then on line 307 we start accepting http requests.

Let’s see if the logs help us with something!

Hey @krynble, hope this log excerpt is clear enough. Let me know if it’s not, and I will pull a longer one.

> ERROR: Workflow could not be activated:

2021-09-22T15:39:34.056106Z  duplicate key value violates unique constraint "PK_b21ace2e13596ccd87dc9bf4ea6"

2021-09-22T15:39:18.755396Z  => ERROR: Workflow could not be activated:
2021-09-22T15:39:18.755474Z  duplicate key value violates unique constraint "PK_b21ace2e13596ccd87dc9bf4ea6"

2021-09-22T15:45:04.516265Z  => ERROR: Workflow could not be activated:
2021-09-22T15:45:04.516294Z  duplicate key value violates unique constraint "PK_b21ace2e13596ccd87dc9bf4ea6"
2021-09-22T15:45:04.516Z | error | Unable to initialize workflow "Monitoring" (startup) {"workflowName":"Monitoring","workflowId":15,"file":"ActiveWorkflowRunner.js","function":"init"}
2021-09-22T15:45:04.716Z | verbose | Finished initializing active workflows (startup) {"file":"ActiveWorkflowRunner.js","function":"init"}

2021-09-22T15:53:49.033003Z  GET 200 1.7 KB 8 ms Chrome 92 https://n8n1--ue.a.run.app/favicon.ico
2021-09-22T15:53:50.038Z | debug | Received webhoook "GET" for path "prevent-cr-down" {"file":"ActiveWorkflowRunner.js","function":"executeWebhook"}
2021-09-22T15:53:50.048453Z  GET 404 1.13 KB 12 ms Chrome 92 https://n8n1--ue.a.run.app/webhook/prevent-cr-down
2021-09-22T15:53:50.521462Z  GET 200 1.7 KB 9 ms Chrome 92 https://n8n1--ue.a.run.app/favicon.ico
2021-09-22T15:53:51.902Z | debug | Received webhoook "GET" for path "prevent-cr-down" {"file":"ActiveWorkflowRunner.js","function":"executeWebhook"}
2021-09-22T15:53:51.911025Z  GET 404 1.13 KB 10 ms Chrome 92 https://n8n1-ue.a.run.app/webhook/prevent-cr-down
2021-09-22T15:53:52.392841Z  GET 200 1.7 KB 6 ms Chrome 92 https://n8n1-ue.a.run.app/favicon.ico

Aha, thank you for the information provided @darlanm

I was able to find where and how the issue happens.

I will check how this can be fixed and will return to you with more information.

I created a possible fix for the issue you are having. You can find it here: Fixed n8n's startup behavior for scaled mode by krynble · Pull Request #2242 · n8n-io/n8n · GitHub

It should be released on the next version update, I believe.

Thanks @krynble, for your speedy fix! I see the PR. Looking forward to when it’s merged.

Just to let you know that I am also trying queue mode. Your video tutorial also helped me set up n8n on Kubernetes. One pod still needs to be persistent (hence some cost I wanted to avoid), but I am pretty happy with it. Thank you!

2 Likes

Got released with [email protected]

1 Like

Hi @ad-si @darlanm, can you please share a tutorial on how you deploy it on Cloud Run?

My Dockerfile doesn’t seem to work:

FROM n8nio/n8n
# Cloud Run sends requests to the port in $PORT, which defaults to 8080
ENV N8N_PORT=8080
WORKDIR /data
CMD ["n8n", "start"]
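
For context, I build and deploy it roughly like this (a sketch; project, service name, and region are placeholders):

  gcloud builds submit --tag gcr.io/<project>/n8n
  gcloud run deploy n8n --image gcr.io/<project>/n8n --region us-east1 --allow-unauthenticated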

I found other tutorials here but it’s using terraform.

Thanks!

Welcome to the community @Rob_AI!

Can sadly not help with your question, but just to make it clear again: n8n does not officially support this, and you will run into issues that are expected. Meaning if you run into any issues, you are pretty much on your own.
So only do that if you know exactly what you are doing. If you want n8n to run properly, run it on a virtual machine. Those are available for $5 per month from services like DigitalOcean, Hetzner, and others.

2 Likes