I am deploying the n8n embed version on my cloud instance. We have multi-main enabled on our license, and all our workflow triggers are webhook based.
What are general best-practice configurations for the main and worker nodes, queue/worker concurrency, and scaling?
Hi, this is a fairly generic question and the answer depends on a lot of factors. Summarizing the scaling docs (interleaved with some of my own thoughts):
Disable webhook processing on the main instance.
Run dedicated instances for the webhooks (number determined by load); you could even make that number dynamic based on load-balancer load.
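For reference, a minimal sketch of that split using the queue-mode environment variables from the n8n scaling docs. Hostnames are placeholders and variable names should be verified against your n8n version:

```shell
# All instances: queue mode backed by Redis (hostnames are placeholders)
export EXECUTIONS_MODE=queue
export QUEUE_BULL_REDIS_HOST=redis.internal
export QUEUE_BULL_REDIS_PORT=6379

# Main instance: keep UI/API, hand production webhook traffic off
export N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true
n8n start

# Dedicated webhook processors (scale these behind your load balancer)
n8n webhook

# Workers (concurrency is per worker; 10 is the documented default)
n8n worker --concurrency=10
```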
Scale the number of workers dynamically based on queue size (Redis tasks waiting would be a good metric).
If workers × concurrency < incoming tasks, the "waiting" Redis metric increases; when that metric exceeds a threshold (or is simply > 0), start a new worker. Cap the maximum number of workers spawned, so that in case of some issue you don't end up with hundreds of them.
Prefer small workers with limited concurrency, but allow them to scale out further (IMHO).
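The scale-up rule above can be sketched as a small decision function. This is illustrative only: the Redis key name in the comment and the threshold/cap values are assumptions, not anything n8n ships:

```shell
#!/bin/sh
# Sketch of the worker-scaling rule: if waiting jobs exceed current
# capacity (workers * concurrency), add a worker, capped at max_workers
# so a fault cannot spawn hundreds of them.

decide_scale() {
  waiting=$1      # e.g. from: redis-cli llen <queue-key>  (key name is an assumption)
  workers=$2      # workers currently running
  concurrency=$3  # concurrency per worker
  max_workers=$4  # hard cap on worker count
  if [ "$waiting" -gt $((workers * concurrency)) ] && [ "$workers" -lt "$max_workers" ]; then
    echo "scale-up"
  else
    echo "hold"
  fi
}

decide_scale 50 2 10 10    # 50 waiting > 2*10 capacity -> scale-up
decide_scale 5 2 10 10     # within capacity -> hold
decide_scale 500 10 10 10  # over capacity but at the cap -> hold
```

Your orchestrator (Kubernetes HPA/KEDA, an ASG, etc.) would normally implement this for you from the same metric; the function just makes the rule explicit.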
Make sure you manage your data (e.g. execution data retention/pruning).
Monitor (and monitor).
Make sure you have decent backups of everything you need, and a tested way to restore and get back up and running.
I was able to separate the deployment into main, webhook, and worker with scaling. What are you using for monitoring? I followed Observability on n8n | ANDREFFS.
Are there any other Grafana dashboards for a multi-node deployment? Any best practices for capturing the relevant metrics?
I would start by monitoring the central components (Postgres, Redis) that the whole queue setup depends on.
Each endpoint exposes /metrics, which you already know about.
That can give you things like event-loop lag and totals for successful/queued/errored workflows, etc.
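In case it helps, the metrics endpoint is switched on via environment variables. A sketch, with variable names taken from recent n8n docs; treat the optional toggles as assumptions and verify them for your version:

```shell
# Enable the Prometheus /metrics endpoint on each n8n instance
export N8N_METRICS=true
# Optional extra series/labels (names per recent n8n versions; verify)
export N8N_METRICS_INCLUDE_QUEUE_METRICS=true
export N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
```

Point Prometheus at /metrics on the main, webhook, and worker instances separately so you can compare them per role.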
I haven't got to it yet, but personally I would like to read n8n's PostgreSQL execution data table, which contains the execution times for the workflows.
You could expose per-workflow success/error counts, number of executions per timeframe, and so on.
There are many possibilities, but I guess it would be a custom-built solution.
This would be great for detecting problems and/or trends per workflow (which becomes more important when you have no idea which worker actually ran which task).
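As a rough sketch of what such a query could look like. The table and column names are assumptions based on n8n's `execution_entity` schema; check your actual schema and n8n version before relying on it:

```sql
-- Per-workflow execution counts and average duration over the last hour.
-- Column names ("startedAt", "stoppedAt", status, "workflowId") are
-- assumptions about n8n's execution_entity table; verify first.
SELECT "workflowId",
       status,
       count(*) AS executions,
       avg(extract(epoch FROM ("stoppedAt" - "startedAt"))) AS avg_seconds
FROM execution_entity
WHERE "startedAt" > now() - interval '1 hour'
GROUP BY "workflowId", status
ORDER BY executions DESC;
```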
There might be existing solutions for that; it's worth researching a bit.