I’m facing a duplicate-data issue in one of my n8n workflows and would really appreciate your guidance.
Workflow overview:
The workflow scrapes job postings from the last 24 hours
It evaluates my CV against each job
If the CV is not a good fit, it generates an updated CV
All results are stored in a Google Sheets–based CRM
Each job has a unique Tracking ID
Problem:
Because the scraper always looks at the last 24 hours, sometimes the same job gets scraped again. When this happens:
The job is re-evaluated
The same Tracking ID is inserted again into the CRM
This creates duplicate records
I can clearly see duplicates by checking the Tracking ID column.
I already tried handling this inside Google Sheets using formulas to prevent duplicates, but the workflow still uploads the duplicate rows before the sheet logic can stop them. As a result, my CRM already contains duplicates.
Main question:
What is the best way inside n8n to compare new scraped jobs against existing Tracking IDs in the CRM before inserting them?
Considering that:
The CRM already contains many Tracking IDs
Each scrape can return many new job records
Doing a simple one-by-one comparison feels inefficient or impractical at scale.
What I’m looking for:
Best practice or recommended pattern in n8n
Node suggestions (e.g. Google Sheets, Code node, Merge, Data Store, etc.)
Any scalable approach to avoid inserting duplicates in this kind of workflow
Thanks a lot in advance
Any advice or real-world examples would be very helpful.
One option is to use the Google Sheets “Append or Update” operation and set the column to match on to the Tracking ID. This will prevent duplicate records from being inserted.
Thank you so much for your reply, I really appreciate it.
Using Google Sheets → Append or Update with the Tracking ID definitely makes sense within the same run. However, my main concern is across multiple days.
For example:
Today, I scrape jobs, evaluate them, and store them in Google Sheets with a Tracking ID.
Tomorrow, I run the same workflow again.
Some jobs from yesterday are scraped again (because they are still within the last 24 hours).
When those jobs re-enter the workflow, I’m not sure how n8n can efficiently check against all existing Tracking IDs already stored in the sheet before inserting or updating.
So my core question is:
How can I reliably match new scraped jobs against existing Tracking IDs from previous days inside n8n, and skip them before reaching Google Sheets?
That cross-day deduplication is the main issue I’m trying to solve.
I'm not sure you can do this efficiently for as long as you're using Google Sheets as your database. Generally you'd need to check whether the Tracking IDs are already present, which means a Get operation pulling data from Google Sheets so you can compare. The issue I'm currently seeing is that Google Sheets has a rate limit of 250 calls per minute (I think), which could cause you to hit the limit before you can even decide whether to insert or ignore a record. Depending on your current load, you're probably still better off just pushing the records to the sheet with Append or Update.
If you do want to check, you would loop over all the jobs you received for the day using a Loop Over Items node, try to find existing records with a Google Sheets Get Row(s) call filtered on the Tracking ID, then use an If node to check whether anything came back: insert the job if nothing was found, otherwise skip it and continue the loop.
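If a per-job lookup ends up making too many API calls, another pattern worth considering is to read all existing rows once at the start of the run and do the comparison in a Code node. Here is a minimal sketch of that comparison step; the field names and sample data are made up for illustration, and in a real workflow the two arrays would come from the Google Sheets Get Row(s) output and the scraper output:

```javascript
// Rows already in the CRM (fetched with ONE Google Sheets read per run)
const existingRows = [
  { trackingId: 'JOB-001', title: 'Backend Engineer' },
  { trackingId: 'JOB-002', title: 'Data Analyst' },
];

// Jobs from today's scrape
const scrapedJobs = [
  { trackingId: 'JOB-002', title: 'Data Analyst' },    // duplicate from yesterday
  { trackingId: 'JOB-003', title: 'DevOps Engineer' }, // genuinely new
];

// Build a Set of known IDs once: O(1) membership checks afterwards,
// instead of one Google Sheets API call per scraped job.
const knownIds = new Set(existingRows.map(row => row.trackingId));

// Keep only jobs whose Tracking ID is not already in the CRM
const newJobs = scrapedJobs.filter(job => !knownIds.has(job.trackingId));

console.log(newJobs); // only JOB-003 survives the filter
```

This keeps the workflow at one read call per run regardless of how many jobs were scraped, and only the filtered `newJobs` items would then flow on to the Append or Update node.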
@Wouter_Nigrini Wow, that is a mind-blowing solution, thank you so much. I couldn't get this answer from ChatGPT or Gemini, etc. Thank you so much for your help.