Best pattern for a highly reliable AWS host?

I’ve been considering how to run a highly available, self-hosted version of n8n on AWS that takes advantage of autoscaling while still abstracting all of this away from our non-devs.

I’ve considered a few things, like having API Gateway handle a greedy route of /webhook/* and forward it to SQS to be handled. This feels useless, though, since the code that adds a job to the Redis queue isn’t exposed via an API, so I’d need to review the internal code and write a Lambda duplicating it.

Before I go through a long list of design patterns trying to fit the suggestions in the scaling docs, it’d be great to learn whether someone else has already established a good pattern for the webhook workers pool, the workers pool, and the main host.

I suppose using ECS with load balancing for everything might make more sense and be the least effort, but would love to hear what people have done.

Hey @Wittiest!

Welcome to the community :sparkling_heart:

I don’t have a lot of experience with this, but maybe @krynble can help here :slight_smile:

Thanks @harshil1712 :slight_smile:

If there is any documentation or tutorials for any service where someone has implemented all of the best practices outlined in n8n’s advanced scaling section, that’d also be super helpful.

I keep searching and all of the examples I’m finding are for the most basic implementation.

If nothing pans out, I think API Gateway, ECS, ElastiCache, and the correct load balancing config should be enough. It’s always nice to see what others have done though.

Hi @Wittiest

Your question is super interesting. I have a benchmark on my to-do list to really crunch the numbers on how n8n scales. Documenting the results is part of that task as well.

You’ve got the overall idea, but I’m not sure about ElastiCache. API Gateway is also not mandatory, as a simple application load balancer should be enough.

My recommendation is that you set up ECS with 3 different tasks and a few services:

  • Postgres 13+ as a shared database for all your workflows
  • A Redis cluster that will also be used by all your n8n instances
  • 1 task running the default n8n process in queue mode with webhooks disabled. This "main" process runs continuously and gets restarted if necessary. It should not be replicated, i.e. it should be a single instance of n8n
  • 1 task running worker processes, which can have multiple instances on the same machine. Each worker is a single process, so running several instances makes better use of each host's resources, and the pool can be scaled based on CPU / network
  • 1 task running webhook processes, which can also have multiple copies on the same host. Just like the workers, these are single processes, so running multiple instances makes better use of your resources
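To make the layout above a bit more concrete, here is a rough Python sketch of the three n8n services as plain data. It is not an ECS API call. The `Service` class, the service names, and the hostnames are made up for illustration. The commands (`n8n start`, `n8n worker`, `n8n webhook`) and environment variable names are how I recall them from n8n's queue mode documentation, so please verify them against the docs for your version.

```python
from dataclasses import dataclass, field

# Settings every n8n process shares. Hostnames are placeholders (assumptions).
# All processes also need the same N8N_ENCRYPTION_KEY, supplied via a secret.
SHARED_ENV = {
    "EXECUTIONS_MODE": "queue",
    "DB_TYPE": "postgresdb",
    "DB_POSTGRESDB_HOST": "postgres.internal.example",
    "QUEUE_BULL_REDIS_HOST": "redis.internal.example",
    "QUEUE_BULL_REDIS_PORT": "6379",
}


@dataclass
class Service:
    """Hypothetical description of one ECS service; not a real AWS type."""
    name: str
    command: list
    desired_count: int
    scalable: bool
    env: dict = field(default_factory=dict)


def build_services(workers: int = 2, webhooks: int = 2) -> list:
    """Return the main, worker, and webhook services for a queue mode setup."""
    main = Service(
        name="n8n-main",
        command=["n8n", "start"],
        desired_count=1,   # exactly one main process, never replicated
        scalable=False,
        # Keeps production webhook traffic off the main process.
        env={**SHARED_ENV, "N8N_DISABLE_PRODUCTION_MAIN_PROCESS": "true"},
    )
    worker = Service(
        name="n8n-worker",
        command=["n8n", "worker"],
        desired_count=workers,
        scalable=True,
        env=dict(SHARED_ENV),
    )
    webhook = Service(
        name="n8n-webhook",
        command=["n8n", "webhook"],
        desired_count=webhooks,
        scalable=True,
        env=dict(SHARED_ENV),
    )
    return [main, worker, webhook]


if __name__ == "__main__":
    for svc in build_services(workers=3):
        print(svc.name, " ".join(svc.command), svc.desired_count, svc.scalable)
```

Only the worker and webhook services get an autoscaling policy here; the main service stays pinned at one copy.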

So what happens in the end is that your main n8n instance will be responsible for triggering workflows that are not webhook based, like crons, polls, etc. Everything else will be run on workers or webhook nodes. Because the main instance owns those triggers, it cannot and should not be scaled out: a second copy would fire the same triggers and duplicate the work.

Your main instance is also your entry point to n8n, allowing you to edit workflows, view executions, etc.
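With an application load balancer in front, the split between the main instance and the webhook pool comes down to one path rule. The sketch below models that rule as a plain Python function. The target names `webhook-pool` and `main` are made up, and sending `/webhook-test/*` to the main instance reflects my understanding that test webhooks are registered by the editor process, which is an assumption worth checking for your n8n version.

```python
def pick_target(path: str) -> str:
    """Pick a (hypothetical) target group for an incoming request path."""
    # Production webhooks go to the scalable webhook pool.
    if path == "/webhook" or path.startswith("/webhook/"):
        return "webhook-pool"
    # Editor UI, REST API, and test webhooks stay on the single main instance.
    return "main"


if __name__ == "__main__":
    for p in ["/webhook/abc", "/webhook-test/abc", "/workflow/1"]:
        print(p, "->", pick_target(p))
```

On a real ALB this would be a single listener rule with a path pattern of `/webhook/*` and a default action that forwards everything else to the main service.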

If you have any more questions, feel free to ask. I believe this community post can become a great starting point for future documentation on deployment practices, as your questions will help shape that document.


Hey @krynble. I appreciate the thorough answer. My team is still evaluating a move from Zapier, primarily because it’s difficult to monitor and we’ve run into a lot of unexpected issues.

I’ll definitely report back if we go through with the implementation so that I can share any learnings we have.
