n8n self-hosted - sequential execution with a third-party tool - minimal setup with task spooler

the problem:
you want n8n concurrency, but not for all workflows. you run self-hosted on a low-power machine to which you have CLI access. you want to minimize the number of containers you use.
example:
I have a RAG pipeline running on a low-power device that reads from a filesystem in sequence and fires sub-workflows. the main workflow (the pipeline) must not run in parallel. I tried scheduling it and got multiple running instances no matter what.
solution (already proposed by n8n personnel): third-party tools.
edge Linux solution (compliant with n8n recommendations, using a minimal approach - 10 min setup):

  1. write a shell script that calls the webhook and waits for the response; put the script somewhere on your host (a quick check that it really waits is shown right after the script).
#!/bin/bash

# production webhook URL of the workflow that must run sequentially
WEBHOOK_URL="http://localhost:5678/webhook/my-workflow-id"

# call the webhook and wait for the response
RESPONSE=$(curl -s -X GET "$WEBHOOK_URL")

echo "Response: $RESPONSE"
  2. install tsp (task spooler) from apt/deb. this lets you create execution queues on Linux (the few tsp commands this setup relies on are listed after the install command).
sudo apt install task-spooler
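for reference, these are the tsp commands used here. task spooler runs one job at a time by default, which is exactly the sequential behaviour we want:

tsp -S 1    # make sure only one slot is used, i.e. one job runs at a time (this is the default)
tsp -l      # list running / queued / finished jobs
tsp -c 0    # print the captured output of job 0
tsp -C      # clear the list of finished jobs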
  3. create another script that enqueues the webhook script in tsp, but only if no other instance of the same job is already queued (so at most 1 running and 1 waiting). this keeps the queue tidy. at the bottom of the script, add a section that clears the completed-job history to cut down on unnecessary logging.
#!/bin/sh

# wait some time so we do not fire the next run too soon; adjust to your liking or remove
sleep 3

JOB_NAME="n8n_queue"
WEBHOOK_SCRIPT="/home/ubuntarr/scripts/webhook_triggers/n8n_rag.sh"

# Check if there are any queued jobs with the specified name
queued_jobs_named=$(tsp -l | grep "$JOB_NAME" | grep -c queued)

if [ "$queued_jobs_named" -eq 0 ]; then
  echo "No queued '$JOB_NAME' jobs found. Adding webhook call to task-spooler queue."

  # Enqueue the job with a timeout of 2 hours, silence all output
  tsp -L "$JOB_NAME" timeout 2h sh -c "$WEBHOOK_SCRIPT > /dev/null 2>&1" > /dev/null 2>&1

  echo "Job added to task-spooler queue."
else
  echo "There is already a queued job named '$JOB_NAME'. Not adding a new job."
fi

# Cleanup: Remove completed jobs
echo "Cleaning up completed jobs..."
tsp -C
echo "Cleanup done."

  4. schedule calls to the tsp queuing script from crontab at the desired frequency.
# run every 5 mins
*/5 * * * * /path/to/your/tsp_queuing_script.sh > /dev/null

this approach has the added value that you no longer care whether health checks fail or the n8n UI is unresponsive (which happens to me very often): you will see from cron what failed and what did not.

  5. you can expand this to write to a log and send yourself a report via crontab (or n8n). handling the webhook reply so that a failed run makes the .sh exit with an error is not yet part of the script, nor are start/finish timestamps (see the sketch below).
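
a minimal sketch of that expansion (untested; URL and log path are placeholders): it logs start/finish timestamps and exits non-zero when the webhook does not return HTTP 200, so failures become visible to cron and tsp. a workflow that fails internally but still answers 200 will not be caught this way.

#!/bin/bash

# placeholders - adjust to your own workflow and log location
WEBHOOK_URL="http://localhost:5678/webhook/my-workflow-id"
LOG_FILE="/var/log/n8n_rag_webhook.log"

echo "$(date '+%F %T') START" >> "$LOG_FILE"

# call the webhook, discard the body, keep only the HTTP status code
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X GET "$WEBHOOK_URL")

if [ "$HTTP_CODE" -eq 200 ]; then
  echo "$(date '+%F %T') FINISH OK (HTTP $HTTP_CODE)" >> "$LOG_FILE"
  exit 0
else
  echo "$(date '+%F %T') FINISH FAILED (HTTP $HTTP_CODE)" >> "$LOG_FILE"
  exit 1
fi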

example test:

me@here:~/scripts/webhook_triggers$ sh tsp_n8n_rag.sh
No queued 'n8n_queue' jobs found. Adding webhook call to task-spooler queue.
Job added to task-spooler queue.
Cleaning up completed jobs...
Cleanup done.
me@here:~/scripts/webhook_triggers$ tsp -l
ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/1]
0    running    /tmp/ts-out.YvEklO                           [n8n_queue]timeout 2h sh -c /something/n8n_rag.sh > /dev/null 2>&1
me@here:~/scripts/webhook_triggers$ sh tsp_n8n_rag.sh
No queued 'n8n_queue' jobs found. Adding webhook call to task-spooler queue.
Job added to task-spooler queue.
Cleaning up completed jobs...
Cleanup done.
me@here:~/scripts/webhook_triggers$ tsp -l
ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/1]
0    running    /tmp/ts-out.YvEklO                           [n8n_queue]timeout 2h sh -c /something/n8n_rag.sh > /dev/null 2>&1
1    queued     (file)                                       [n8n_queue]timeout 2h sh -c /something/n8n_rag.sh > /dev/null 2>&1
me@here:~/scripts/webhook_triggers$ sh tsp_n8n_rag.sh
There is already a queued job named 'n8n_queue'. Not adding a new job.
Cleaning up completed jobs...
Cleanup done.

you can see that the script will queue at most one extra job and then refuse to add more, exiting quietly.

check what you are doing before setting up:
task spooler man page

Have you set up worker nodes? Or webhook nodes? It comes with Redis for the execution queue and distributed workload.

hi, yes!
but worker nodes are meant to handle concurrency. in my experience, if you set up a schedule trigger with two workers, you get two concurrent runs irrespective of what is configured, even using lockfiles and random waits of a few seconds.

just to explain better in case someone else reads this: this works both with a single “main” node and with scalable worker nodes - it has nothing to do with not leveraging queue mode. in my current setup I have 1 main and 5 workers. via tsp I run the workflow that needs to execute one at a time; due to its own internal design it triggers a number of sub-workflows, which trigger another set of sub-workflows, and so on, and those are all picked up by the workers correctly. to check how sub-workflows are picked up, the best tool in my view is docker stats from the host (example command below).
Instead, in my experience, if you take the same workflow and run it from a schedule trigger, no matter how you set it up you end up with a separate run on the same data with the same parameters triggered on each worker. I am not sure why this is the case; it looks to me like a design limitation on the n8n side (possibly because n8n was not originally built for queue mode) - the trigger lives inside the workers and is not standalone.
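
for example, something like this from the host gives a live per-container CPU/RAM view and makes it obvious which workers picked up sub-workflows (assuming the worker containers have "worker" in their name):

# live stats for all containers whose name contains "worker"
docker stats $(docker ps --format '{{.Names}}' | grep -i worker)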

the most basic use case for this is sweeping a filesystem for RAG: you need a parent workflow to trigger the sub-workflows for the individual files, you want those sub-workflows to be picked up by the workers, but there must only ever be one instance of the parent workflow.

CONTENT REMOVED

I had previously posted here a method to programmatically restart n8n when an execution hangs, using this task spooler setup as a monitor, as suggested elsewhere.

however, this had evidently not been tested with n8n before being suggested.

the task spooler method works, but in queue mode, when the containers are restarted a couple of times in quick succession, previous long-running or complex executions sometimes seem to be picked up again, which is completely at odds with the idea of “starting anew”.

this also happens if I prevent the system from saving successful executions.
CPU and RAM never peak, I/O goes to an M.2 NVMe (and most saves and loads hit a ramdisk anyway), and the network is never saturated with traffic.

simply put, n8n at this moment is not suited to pushing large amounts of data through a single thread over long periods.

UPDATE.
with massive refactoring I made my RAG pipeline work.

currently I use task spooler.
what I wrote above about restarting n8n stands - don't set up checks that kill n8n whenever it looks hung, because in a self-hosted environment this can generate a cascade of restarts, triggered by large files and OOM issues, exogenous factors, etc.

my RAG pipeline starts with one workflow that tsp calls. that flow sweeps the file list and the vector store, checking for elements that are missing, incomplete, or updated and therefore need to be re-embedded.

these are the changes I made to make it more stable (a config sketch follows the list).

  1. have the main instance NOT process workflow executions.
  2. increased max_old_space_size as much as I could (this was fundamental) - on the main (not very relevant) and on the workers (very relevant).
  3. since we cannot use MinIO in the community edition (external S3 binary storage is an enterprise-license feature), I simply made sure that binary data always stayed within the workflows, to be unloaded once execution completes. this is not super stable but is “as good as we can get”.
  4. I changed all workflows to webhooks. what I noticed using docker stats, and what is not very clear from the documentation, is that nested sub-workflows called from within a webhook run on the same worker, while nested webhook calls are handled by a worker assigned via Redis (or however that works). the bottom line: running docker stats I can now see my 5 workers get almost evenly loaded after a couple of seconds, which greatly reduced unresponsiveness.
  5. reduced concurrency to the minimum (5) for each worker node. this is because usable CPUs are capped at 1 per n8n container, so adding workers provides more threads and more RAM per thread and reduces the load on individual processes, which makes sense in my specific edge scenario.
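
as a reference for points 2 and 5, this is roughly what one worker container looks like - a sketch, not my exact setup; image name, Redis host, heap size and concurrency value are the standard n8n queue-mode pieces, but adapt them to your environment:

# one worker container (repeat with different names for more workers)
# NODE_OPTIONS raises the Node.js heap (point 2), --concurrency caps parallel jobs per worker (point 5)
# database, encryption key and network settings are omitted for brevity
docker run -d --name n8n-worker-1 \
  -e EXECUTIONS_MODE=queue \
  -e QUEUE_BULL_REDIS_HOST=redis \
  -e NODE_OPTIONS="--max-old-space-size=3072" \
  docker.n8n.io/n8nio/n8n \
  worker --concurrency=5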

this way tsp triggers the scheduled sequential jobs, which are assigned to a single worker at a time (at random). within those jobs I keep the webhooks, so the load is spread to the other workers.

I am now going to test this by letting tsp queue, every 5 minutes, a job that lasts approximately 5 minutes (with 4 parallel sub-workflows and typically between 10 and 60 sequential sub-sub-workflows for each of the 4), for 24 hours straight.

I also have an internally scheduled workflow that removes old successful executions to keep the database from growing above a limit.
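
for completeness, n8n can also prune old execution data by itself via environment variables, which overlaps with what the cleanup workflow does. a sketch for the main container (age is in hours, values are examples; other required settings omitted):

# built-in pruning of old execution data on the main instance
docker run -d --name n8n-main \
  -e EXECUTIONS_MODE=queue \
  -e EXECUTIONS_DATA_PRUNE=true \
  -e EXECUTIONS_DATA_MAX_AGE=168 \
  docker.n8n.io/n8nio/n8n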

if this setup proves to consistently reduce the random hangs, I will test a modified, less aggressive version of the docker restart helper I removed earlier.

Love it, thanks. btw, if you want to monitor further you can set up the N8n + Grafana Full Node.js Metrics Dashboard (JSON Example Included!)

Might be nice to watch the change in node metrics 🙂

Thanks for adding this reply, good read.

Samuel