Keeping a Vector Database Updated with New/Modified Files from Google Drive & Metadata Extraction Without JavaScript

Hi everyone,

I’m currently working on a workflow in n8n that keeps my vector database updated with the latest documents from Google Drive. Here’s how my workflow is structured:

  1. List files in Google Drive – I count the number of files in a specific folder.
  2. Loop through the files – The loop runs for the number of files.
  3. Download the first file – I retrieve the document.
  4. Route based on format & extract content – Depending on the file type, I extract relevant data.
  5. Store in a vector database – The extracted content is chunked, embedded, and stored.

Problem 1: Updating Only New or Modified Files

Currently, when I repeat the workflow, it downloads all files again. I want to ensure that only newly added or modified documents are updated in the vector database, rather than reprocessing everything.

  • Is there a way in n8n to efficiently detect new or updated files before downloading?
  • Should I rely on timestamps from Google Drive, or is there another method that would be more reliable?
  • How do you manage file deduplication or version tracking in a similar workflow?
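For context, the timestamp-based approach I'm considering would look roughly like this (just a sketch; `lastRun` would have to be persisted between executions somehow, which I've hand-waved here):

```javascript
// Sketch: keep only files added or changed since the last run.
// `files` mimics part of a Google Drive "list files" response;
// `lastRun` is hypothetical and would need to be persisted between runs.
const lastRun = new Date("2024-03-01T00:00:00Z");

const files = [
  { name: "old.pdf", modifiedTime: "2024-02-20T10:00:00Z" },
  { name: "new.pdf", modifiedTime: "2024-03-05T08:30:00Z" },
];

// Google Drive reports modifiedTime as an RFC 3339 string, so Date can parse it.
const changed = files.filter(f => new Date(f.modifiedTime) > lastRun);
console.log(changed.map(f => f.name)); // only "new.pdf" survives the filter
```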

Problem 2: Extracting Date from File Title Without JavaScript

Additionally, I want to enhance my metadata by extracting a date from the file name and adding it to my metadata inside the Enhanced Default Data Loader. However, I’d like to avoid using JavaScript.

  • I was thinking of calling an LLM inside n8n to recognize and extract the date from the filename. Has anyone implemented something similar?
  • What would be the best way to call an external LLM to process the file name and return structured metadata in n8n?

Here is my workflow

Any guidance, best practices, or creative solutions would be greatly appreciated! If you’ve tackled a similar challenge or have insights on optimizing this workflow, I’d love to hear your thoughts.

Thanks in advance for your help :rocket::blush:

Information on my n8n setup

  • n8n version: 1.81.4
  • Database (default: SQLite): default
  • n8n EXECUTIONS_PROCESS setting (default: own, main): default
  • Running n8n via (Docker, npm, n8n cloud, desktop app): n8n cloud
  • Operating system: Windows 10

Hey,

For your first problem, there is actually a Google Drive trigger that listens for new files and modifications (you can listen to any change in a folder). I think this workflow (Build an AI-Powered Tech Radar Advisor with SQL DB, RAG, and Routing Agents | n8n workflow template) does it the way you want.

For the metadata, you could indeed use an LLM if the filenames aren't consistent, but if the date formats are more or less consistent (say 2-3 variants), you could use a regex to extract them efficiently, with no risk of errors or hallucinations from the LLM. You could actually combine both methods: start with the regex, and if it doesn't find anything, fall back to an LLM call.
For the LLM call, prompting will be important, but you could use an output parser to add a layer that makes sure you get the right format.
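For illustration, a regex-first extractor with an LLM fallback hook could look roughly like this in a Code node (the filename formats are assumptions; adapt the patterns to yours):

```javascript
// Regex-first date extraction with an LLM fallback hook.
// The filename formats below are hypothetical examples.
const patterns = [
  /(\d{4})-(\d{2})-(\d{2})/, // ISO: 2024-03-15
  /(\d{2})-(\d{2})-(\d{4})/, // day-first: 15-03-2024
];

function extractDate(filename) {
  for (const re of patterns) {
    const m = filename.match(re);
    if (!m) continue;
    // Normalize both variants to ISO yyyy-mm-dd
    return m[1].length === 4
      ? `${m[1]}-${m[2]}-${m[3]}`
      : `${m[3]}-${m[2]}-${m[1]}`;
  }
  return null; // caller routes this item to the LLM branch instead
}

console.log(extractDate("report_2024-03-15.pdf"));  // "2024-03-15"
console.log(extractDate("meeting 15-03-2024.docx")); // "2024-03-15"
console.log(extractDate("untitled.pdf"));            // null -> fall back to LLM
```

An IF node checking for `null` is enough to decide whether the LLM call is needed at all.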

Let me know if this helps ! :slight_smile:

Hey, thanks a lot for your response!

Regarding the first issue: when I use the "watch for update in folder" or "watch for new file in folder" nodes, only the oldest file is fetched, not the entire set of documents (even though I intentionally deleted the previous files and added 10 new PDF documents to the folder).

So unfortunately, it doesn't seem to process all 10 files as I expected.
Here’s the workflow :

Any idea why this might be happening or how to make it pull everything?

Thanks again!

I didn't try it with the Google Drive trigger specifically, but I guess it works the same as the Gmail "on mail received" trigger. The trigger only starts listening from the moment you activate your workflow. So if you want to test with multiple created/updated files, you need to first activate your workflow, then add/update the files within one polling interval (one minute by default; you can change it to 5 minutes for the test). If everything goes well, you will see your multiple items in the execution!
Also, when you test the trigger manually, it will always fetch only the last file.

Tell me if this works !


Hi,

Thanks a lot for your answer! I managed to solve the first issue, so I really appreciate your help.

I was quite surprised to see that simply activating the workflow makes the “update file” trigger behave completely differently. Speaking of that, is there a way to test this production-like behavior without having to wait the minimum 1-minute delay every time? It slows down testing quite a bit.

Also, I had to add a node to delete old rows. That part was quite tricky: when duplicates occurred, all matching rows were being sent to the next node, which caused processing issues. I ended up adding a limit node to reduce it to just one entry. This finally solved the problem!

Now, about the second issue — I’m trying to merge two JSON objects within my workflow:

  • one comes from the “extract PDF” node, which parses the full PDF content,
  • the other is generated by an AI agent (LLM) which reads that content and extracts one or more metadata fields (in that case the date of the document, but it will be a lot more in the future).

Right now, the Supabase Vector Store doesn't receive the data properly. I think the merge isn't working as expected; I suspect the structure, or the way n8n handles input items, is causing the issue.
Do you have any advice on how to properly merge the two JSONs? Should I use a merge node in a particular mode, or write a function node to combine them manually?

Here is the workflow :slight_smile:

Thanks again for your time and support!


You are using the append mode in your Merge node, which outputs 2 separate items (your 2 JSON objects). If you want everything in 1 item, you can use the "combine" mode instead.

Here’s an example with both modes :
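In plain JavaScript terms (an illustration only, not the node's actual code, and with made-up field names), the difference looks roughly like this:

```javascript
// Conceptual sketch of the n8n Merge node's two modes.
const fromPdfExtract = { text: "Full PDF content..." };           // item from "extract PDF"
const fromLlm = { date: "2024-03-15", title: "Quarterly report" }; // item from the AI agent

// "append" mode: both items pass through as two separate outputs
const appendOutput = [fromPdfExtract, fromLlm];

// "combine" mode (merging by position): the fields of both items land in one object
const combineOutput = [{ ...fromPdfExtract, ...fromLlm }];

console.log(appendOutput.length);   // 2
console.log(combineOutput.length);  // 1
console.log(combineOutput[0].date); // "2024-03-15"
```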

For testing the trigger, you can use a custom cron expression to trigger every x seconds, but I don't know if the API accepts it!
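For instance, if your instance accepts sub-minute schedules, a 6-field cron expression with a leading seconds field would fire every 30 seconds (I haven't verified this on n8n cloud, so treat it as a suggestion to test):

```
*/30 * * * * *
```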

Hope this helps ! :slight_smile:


Hey, thanks a lot for your message – it works extremely well now!

Thanks to the “combine” mode, I’m now able to tag my metadata very precisely, identifying both the name of my documents and their date, which is super helpful.

However, I’ve noticed something a bit odd: when I query my vector database, even though everything seems properly tagged, the LLM doesn’t always make the connection.

For example, if I ask for a list of a specific type of documents that have already been tagged with that info, it doesn’t always return the full list, even though the metadata is correctly attached.

Here’s my workflow on the chatting part :slight_smile:

Do you think this means I should use an SQL database alongside the vector store for certain types of questions? Any optimization tips you’d suggest?

Thanks again for your help – really appreciate it! :pray:

Hey ! Happy to see that it works as intended !

The problem might be hiding in the model/prompt you're using; models like 4o-mini and "below" require more work on the prompting to give you the output you want.
Also, I see that you limit retrieval to 4 results in your vector store tool. Did you try increasing this limit?

Tell me if this works ! :slight_smile:


Hey! Thanks again for your message.

So I did some deeper testing, and while increasing the context window did help slightly, the root cause of the issue seemed to be how the data was being “formalized” right after using the “extract PDF” tool.

What I noticed is that this step was introducing a lot of noise—things like structural elements that weren’t really useful for retrieval or semantic search. These elements were cluttering the actual content and likely affecting how well the model could understand and respond with accurate references.

To fix that, I ended up writing a small JavaScript script that processes the raw output and strips out all the unnecessary information. That allowed me to keep only the clean text and the specific metadata I actually care about. Once that was done, the quality of the data going into the vector database improved significantly.
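For anyone facing the same issue, a simplified version of that kind of cleanup looks something like this (the noise patterns here are just examples; yours will depend on your documents):

```javascript
// Hypothetical cleanup of raw "extract PDF" output before chunking/embedding.
// Each replace targets one kind of structural noise.
function cleanPdfText(raw) {
  return raw
    .replace(/\r\n/g, "\n")                           // normalize line endings
    .replace(/-\n(?=[a-z])/g, "")                     // rejoin words hyphenated across lines
    .replace(/^\s*(Page \d+( of \d+)?)\s*$/gim, "")   // drop page-number lines
    .replace(/\n{3,}/g, "\n\n")                       // collapse runs of blank lines
    .trim();
}

const raw = "Intro-\nduction\n\n\n\nPage 1 of 3\nActual content here.";
console.log(cleanPdfText(raw)); // hyphen rejoined, page line and extra blanks gone
```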

Now everything is working really well. The system is able to tell me precisely where each piece of information is coming from, and even how many documents I have for a given date. That kind of granularity is exactly what I was aiming for, so I’m really happy with the result.

The next step for me is to set up a relational database alongside the vector one. I want to explore the best way to make both systems live and interact together. The idea is to imagine a real-world business use case where structured and unstructured data can be queried in a complementary way, using each system for what it does best.

Ultimately, I’m trying to design an architecture that combines both approaches efficiently, and hopefully define an optimal workflow that could scale in a corporate environment.

Thanks again for your help and your feedback—it really pushed me in the right direction. I’ll keep you updated with how the project evolves!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.