AWS ECS Multi-main deployment strategy (avoiding a split-brain leader situation)

When architecting an AWS ECS multi-main setup, what should the deployment strategy for the leader be? Specifically, within AWS ECS, should the Service Deployment configuration be:

Option A: Guarantee only one instance of the leader container will run at any given point.
Min running tasks %: 0
Max running tasks %: 100

Option B: During deployments, there may momentarily be two leader containers.
Min running tasks %: 100
Max running tasks %: 200

What is the generally recommended strategy for updating in a multi-leader environment?
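
For reference, here is roughly how I understand the two options map onto an ECS service definition, sketched with AWS CDK in TypeScript (the stack wiring, image tag, and names are placeholders, not our actual setup):

```typescript
// Sketch only: Option A vs. Option B expressed as ECS deployment settings.
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as ecs from "aws-cdk-lib/aws-ecs";

const app = new cdk.App();
const stack = new cdk.Stack(app, "N8nStack");

const vpc = new ec2.Vpc(stack, "Vpc", { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, "Cluster", { vpc });

const mainTaskDef = new ecs.FargateTaskDefinition(stack, "MainTaskDef");
mainTaskDef.addContainer("n8n-main", {
  image: ecs.ContainerImage.fromRegistry("n8nio/n8n:1.34.2"),
});

// Option A: stop the old leader before starting the new one, so there is
// never more than one leader task at any given point.
new ecs.FargateService(stack, "LeaderService", {
  cluster,
  taskDefinition: mainTaskDef,
  desiredCount: 1,
  minHealthyPercent: 0,   // ECS may drop to zero running tasks during deploys
  maxHealthyPercent: 100, // but never runs a second copy in parallel
});

// Option B would instead be minHealthyPercent: 100 / maxHealthyPercent: 200,
// which briefly runs the old and new leader side by side during a deployment.
```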

  • n8n version: 1.34.2
  • Database (default: SQLite): PG
  • n8n EXECUTIONS_PROCESS setting (default: own, main): tbd
  • Running n8n via (Docker, npm, n8n cloud, desktop app): docker
  • Operating system: AWS ECS / linux

Hey @Spartak_Buniatyan,

From what I understand, when running in multi-main the other main instances can also be used to handle webhook requests, replacing the need for webhook workers, so it won't matter if you have multiple mains working.

The only time it would be a problem is if you are not using the actual multi-main configuration option and are just trying to run two main instances, which then causes issues.

When it comes to updating, at the moment it would be a case of scheduling downtime based on your internal SLA / update policy and then updating all of the instances. In normal queue mode I tend to start with my main instance and then move on to my workers, so I would apply the same process to multi-main as well.

Thank you @Jon

Just a couple of follow-up questions for additional clarity.

  1. Without the multi-main option, is there a way to update the main instance without either causing scheduling downtime or double scheduling?

  2. Building on the earlier question: if it's in queue mode with dedicated webhook processes, what will happen to incoming webhooks while the main is being updated and is therefore down?

  3. Is there a recommended approach for zero-downtime updates/deployments without the multi-main option?

Thank you again for your help

Hey @Spartak_Buniatyan,

  1. Without multi-main there is no way, so I think scheduling downtime would be the best option. In theory you could direct traffic to one of the main instances, update the other one, then swap over and update the first, but even for that I would still recommend scheduling downtime so you can revert any changes if needed.

  2. If you are running in queue mode with webhook workers, they should still receive the requests and add the jobs to Redis to be picked up, but I am not actually sure if those jobs would be processed while the main is down. I would assume they are, but maybe @krynble can provide some clarification there.

  3. I would recommend always planning for downtime even if there isn't any, as you don't know what could go wrong. One thing I always think about is what zero downtime actually looks like, as achieving 100% uptime is a tricky thing to do.

An update itself typically doesn't take that long, so you could be looking at only about 5 minutes if you pull the images down first and then do a stop / start so the new image is picked up. Done once a month that would give you an uptime of ~99.98%, or once a quarter for ~99.995%.

If you are dealing with a lot of incoming webhooks, one thing you can do to help is add an extra layer to your environment and use something like Convoy / Svix or Hookdeck, which could sit in front of your n8n instance to handle and cache the webhooks and then pass them on to the workers. This means that even if n8n is offline the data is still coming in and will be retried as needed.
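
Just to illustrate the idea (this is a toy sketch, not how Convoy / Svix / Hookdeck actually work; the n8n URL, port, and retry policy are all made up):

```typescript
// Toy in-memory buffering relay, just to illustrate the pattern.
// Real products (Convoy, Svix, Hookdeck) persist events and do proper backoff.
import * as http from "node:http";

const N8N_WEBHOOK_BASE = "http://n8n-webhook.internal:5678"; // hypothetical URL
const queue: { path: string; body: string; attempts: number }[] = [];

// Accept every incoming webhook immediately, so senders never see downtime.
http
  .createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      queue.push({ path: req.url ?? "/", body, attempts: 0 });
      res.writeHead(202).end(); // acknowledged; delivery happens asynchronously
    });
  })
  .listen(8080);

// Drain the queue, retrying while n8n is offline during a deployment.
// One attempt per second keeps the sketch simple; real relays use backoff.
setInterval(async () => {
  const job = queue.shift();
  if (!job) return;
  try {
    const resp = await fetch(N8N_WEBHOOK_BASE + job.path, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: job.body,
    });
    if (!resp.ok) throw new Error(`status ${resp.status}`);
  } catch {
    if (++job.attempts < 100) queue.push(job); // re-queue until it succeeds
  }
}, 1000);
```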

If you are thinking about taking the enterprise route, it could be worth getting in touch with the team; a meeting could be arranged to go over this in more detail.

Thanks @Jon, just to keep you updated: we tested a setup in dedicated workers + webhook instances mode, and when the main instance is taken offline, the incoming webhooks still get processed. So it does allow for a zero-downtime migration of the entire system, while ensuring there is no double scheduling. The ECS deployment strategy needs to be (see the sketch after this list):

  • main instance: min 0% / max 100% (to guarantee only one instance is ever running)
  • webhook instances: min 100% / max 200% (so new webhook instances are launched first, and then the existing ones are brought down)
  • worker instances: min 100% / max 200% (so new worker instances are launched first, and then the existing ones are brought down)
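
In CDK terms that strategy could look roughly like this (TypeScript; it assumes cluster and task definitions wired up as in the earlier sketch, and the names and desired counts are illustrative):

```typescript
import * as cdk from "aws-cdk-lib";
import * as ecs from "aws-cdk-lib/aws-ecs";

// Assumed to exist, wired up as in the earlier sketch.
declare const stack: cdk.Stack;
declare const cluster: ecs.Cluster;
declare const mainTaskDef: ecs.FargateTaskDefinition;
declare const webhookTaskDef: ecs.FargateTaskDefinition;
declare const workerTaskDef: ecs.FargateTaskDefinition;

// Main: stop-then-start, so two schedulers never run at the same time.
new ecs.FargateService(stack, "MainService", {
  cluster,
  taskDefinition: mainTaskDef,
  desiredCount: 1,
  minHealthyPercent: 0,
  maxHealthyPercent: 100,
});

// Webhook handlers: start-then-stop, so webhook capacity never drops.
new ecs.FargateService(stack, "WebhookService", {
  cluster,
  taskDefinition: webhookTaskDef,
  desiredCount: 2,
  minHealthyPercent: 100,
  maxHealthyPercent: 200,
});

// Workers: start-then-stop, so jobs in Redis keep being consumed.
new ecs.FargateService(stack, "WorkerService", {
  cluster,
  taskDefinition: workerTaskDef,
  desiredCount: 2,
  minHealthyPercent: 100,
  maxHealthyPercent: 200,
});
```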

One remaining question is what will happen to worker and webhook instances when there is a SIGKILL event sent to the Docker container. Will it wait to complete any work in flight, or start the shutdown of the Node.js server? In either case, I guess we can build in a shutdown grace period for existing containers.
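
For what it's worth, my understanding is that SIGKILL cannot be caught at all; on a normal task stop ECS sends SIGTERM first and only sends SIGKILL after the container's stop timeout elapses. So the grace period would map to the stopTimeout container setting, roughly like this (the worker command and the 120-second value, which I believe is the Fargate maximum, are assumptions):

```typescript
import * as cdk from "aws-cdk-lib";
import * as ecs from "aws-cdk-lib/aws-ecs";

declare const workerTaskDef: ecs.FargateTaskDefinition; // as sketched earlier

workerTaskDef.addContainer("n8n-worker", {
  image: ecs.ContainerImage.fromRegistry("n8nio/n8n:1.34.2"),
  command: ["worker"], // assumption: the image entrypoint accepts the n8n subcommand
  // Time ECS waits between SIGTERM and SIGKILL when stopping the task;
  // 120 seconds is, I believe, the Fargate maximum.
  stopTimeout: cdk.Duration.seconds(120),
});
```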

Hey @Spartak_Buniatyan,

That is good to know, although I would still recommend scheduling downtime to allow for anything unexpected. These are some very good notes that other community members will no doubt find useful; thanks for the work on them.

If Docker is restarting the container, in my experience I have had mixed results, but most of the time it just stops.
