This is my Day 1 on n8n. Before I spend too much time on it, I want to check with the community whether the use case I would like to solve with n8n is even feasible.
So, I have a requirement to perform a sanity check on data files that get dropped into various storage locations, such as Azure Blob or local storage accessible via a UNC path. I would like to check those files for dupes, nulls, bad characters, etc. before they enter the actual data processing pipelines (where bad data would cause failures). The file structure info is stored in a SQL table, which needs to be read to validate each file.
So, I’m looking for:
1. the ability to monitor storage for new files
2. reading the file and identifying bad data
3. moving the file to some failedfiles folder
Could this be very basic?
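For step 2, a minimal sketch of what such a sanity check could look like in plain Python (the function name, the "empty string means null" rule, and the "non-printable means bad character" rule are all illustrative assumptions, not n8n features):

```python
import csv
import io

def sanity_check(text, delimiter=","):
    """Flag duplicate rows, empty (null) fields, and non-printable
    characters in delimited text. Returns a list of (line, message)."""
    errors = []
    seen = set()
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for lineno, row in enumerate(reader, start=1):
        key = tuple(row)
        if key in seen:
            errors.append((lineno, "duplicate row"))
        seen.add(key)
        for col, value in enumerate(row):
            if value == "":
                errors.append((lineno, f"null in column {col}"))
            if any(not ch.isprintable() for ch in value):
                errors.append((lineno, f"bad character in column {col}"))
    return errors
```

Inside n8n this kind of logic could live in a Code node for small files, or in an external script for anything sizable.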
1+3. Possible, but depending on which storage you are planning to monitor, it can be easy or not.
2. Very much depends on the file sizes, the definition of "bad data", and how computationally expensive this is going to be.
File sizes are around 5 GB maximum, typically comma/tab/pipe separated. Depending on the landing zone, the file may have a different column structure and constraints (e.g. a column cannot be null), so I will need to read the definition from SQL Server based on the landing zone of the file.
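Validation against a per-landing-zone definition could be sketched like this. The table name, the `(column_name, nullable)` shape, and the hard-coded schema are illustrative assumptions; in practice the schema rows would be fetched from SQL Server:

```python
# Hypothetical: schema rows would come from SQL Server, e.g. via pyodbc:
#   SELECT column_name, is_nullable FROM dbo.FileDefinitions WHERE landing_zone = ?
# Hard-coded here for illustration.
SCHEMA = {"orders": [("order_id", False), ("note", True)]}

def validate_header_and_nulls(landing_zone, header, rows):
    """Check the header against the stored definition and flag empty
    values in columns declared NOT NULL."""
    defn = SCHEMA[landing_zone]
    expected = [name for name, _ in defn]
    if header != expected:
        return [f"header mismatch: expected {expected}, got {header}"]
    errors = []
    not_null = {i for i, (_, nullable) in enumerate(defn) if not nullable}
    for lineno, row in enumerate(rows, start=2):  # data starts on line 2
        for i in not_null:
            if i < len(row) and row[i] == "":
                errors.append(f"line {lineno}: {expected[i]} must not be null")
    return errors
```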
Dealing with files of this size in n8n almost never makes sense. If you try to process files of this size directly in n8n, it is almost certainly going to be a problem. n8n is not a data engine; it is more of an orchestration tool. For heavy computation or large files, you probably want to explore other options.
Thank you for the valuable insight. We do have some Python code that does the same thing, but we are trying to get onto the Agentic AI bandwagon. Maybe I can invoke that code from n8n?
Sure, you can totally invoke code from n8n. For that you will need to either host the code somewhere on a service that executes Python code as a function (AWS Lambda, GCP Cloud Functions), OR create a simple API wrapper around your function and self-host it anywhere you want.