Right now, n8n supports scaling workers, but does not support scaling certain “core services” components, like the UI. As a result, you cannot load balance multiple instances, and, more importantly, the main process is a single point of failure.
n8n should support scaling web nodes to remove this single point of failure.
We want the service to remain available even during a partial failure (for example, an availability zone or regional outage at our cloud provider). If n8n is deployed across multiple AZs or regions and one goes offline, the service should remain available in the others.
We would also like to support zero-downtime deployments (e.g., blue/green deployment controllers), whether for configuration changes or upgrades. Unless n8n supports multiple instances, we have to first shut down the API and then deploy the new version, which causes downtime.
- removes a single point of failure that currently exists in the architecture
- improves the system's fault tolerance
- enables zero-downtime deployments
- allows higher degrees of horizontal scaling
I’m not familiar with all the technical details that make this unsupported right now, but I have seen at least two reasons in the docs and other threads:
1. Cron job synchronization
2. Webhook/workflow [de]registration
For 2: I know deregistration can be skipped, but it’s unclear to me whether that completely solves problem 2.
For 1: there is a lot of prior art for solving this kind of problem. One simple approach is storing schedules in the database and using a distributed locking mechanism for execution. In the wild: django-celery-beat does this for Celery (or node-celery). Consensus algorithms, such as Raft, might be another option, but would be wildly more complex.
I probably won’t contribute actual implementations, but I have a lot of experience developing distributed systems and can probably help with the architectural problems that will need to be solved along the way.