How to check for duplicate data compared to previous scheduled flow executions

Hi all,

Still new to n8n so this solution might be simple but any guidance would be appreciated.

Let's say I have this requirement:

I need to collect the latest news from a number of external sources via API and send it to a Google Sheet (for example), so I have the latest news from different sources consolidated in one place, updated once or several times a day.

So the steps would be like this I imagine, for a single source:

  1. n8n requests a list of the latest 20 items from an external API. Let's say a “latest news” API.
    This will occur on a schedule, multiple times a day or whatever feels appropriate for that source: often enough not to miss items, but not so often that we pull too many duplicates.

  2. Process this data into a cleaned/formatted list, or whatever format might be required, before posting it to the Google Sheet.

  3. Post that data to the Google Sheet as new entries (inserted at the top of the sheet), sorted by date descending.

Seems simple enough, but the problem is that duplicate items from previous fetches are almost certainly going to appear in each new fetch.

If we fetch 20 items from the API every day and only 5 of them are new since the last fetch, is it possible in n8n to check the incoming data, compare it to the previous fetch(es), and drop any duplicates before posting the new entries to the Google Sheet?

What is the easiest way to achieve this functionality?

I assume some scripting would be needed for this, and perhaps even a database that stores the raw incoming data from all previous fetches to check each new fetch against. Or would you fetch the new news, fetch the existing Google Sheet entries, compare them somehow, and then post only the new, unique ones to the sheet?
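For the second approach, a Function/Code node could compare the fresh API items against the rows already in the sheet. A minimal sketch of the comparison logic, assuming the existing rows come from a Google Sheets "read" node and that each item has a unique `url` field (both the field names and sample data are invented for illustration):

```javascript
// Keep only fetched items whose key field does not already appear
// in the rows read from the sheet.
function filterNewItems(fetched, existingRows, keyField = 'url') {
  const seen = new Set(existingRows.map((row) => row[keyField]));
  return fetched.filter((item) => !seen.has(item[keyField]));
}

// Example data standing in for sheet rows and a fresh API response.
const existingRows = [
  { url: 'https://example.com/a', title: 'Story A' },
  { url: 'https://example.com/b', title: 'Story B' },
];
const fetched = [
  { url: 'https://example.com/b', title: 'Story B' },
  { url: 'https://example.com/c', title: 'Story C' },
];

console.log(filterNewItems(fetched, existingRows)); // only Story C remains
```

The trade-off of this approach is an extra sheet read on every run, but it needs no extra storage and survives workflow restarts, since the sheet itself is the source of truth.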

Can you point me towards the easiest, simplest, least resource-intensive way to achieve this result?

Very much appreciated!

Hi @jay377, have you had a look at Creating triggers for n8n workflows using polling ⏲? Sounds like it explains what you have in mind 🙂

There’s no easy way to do this currently, but what you can do is use staticData in a Function node to store previously retrieved articles and ignore new ones that match them.
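In an n8n Function node that could look roughly like the following. `this.getWorkflowStaticData('global')` is n8n's actual call for workflow static data; here it is simulated with a plain object so the dedupe logic can be read and run standalone, and using `link` as the unique key is an assumption (pick whatever field is unique in your API's items):

```javascript
// Simulated workflow static data. In a real Function node you would use:
//   const staticData = this.getWorkflowStaticData('global');
const staticData = { seenLinks: [] };

// Filter out items whose link was seen in an earlier execution,
// then record the new links for the next scheduled run.
function filterUnseen(items, staticData) {
  const seen = new Set(staticData.seenLinks || []);
  const fresh = items.filter((item) => !seen.has(item.json.link));
  staticData.seenLinks = [...seen, ...fresh.map((item) => item.json.link)];
  return fresh;
}

// First run: both items are new.
const run1 = filterUnseen(
  [{ json: { link: 'a' } }, { json: { link: 'b' } }],
  staticData
);
// Second run: 'b' was already seen, only 'c' passes through.
const run2 = filterUnseen(
  [{ json: { link: 'b' } }, { json: { link: 'c' } }],
  staticData
);
console.log(run1.length, run2.length); // 2 1
```

One caveat worth knowing: n8n only persists static data between executions of an active, trigger-started workflow; manual test executions don't save it.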

There’s actually a pull request for a node that would do just this - ⚡ Add FilterNew node that reports new or unseen items by fahhem · Pull Request #2310 · n8n-io/n8n · GitHub

That's great, thanks guys. Polling looks like it might be exactly what I need, will investigate further.

Will check out staticData as well. It says the data should be very small; is there a reason for that?

Could I use it with a list of, say, 25 records with 8 fields each? Is that too much? What about two pages' worth of text, like a whole long article? What is too much?

Thanks.

Re. static data, that seems reasonable if it's just text. I assume that's recommended because it's loaded into, and perhaps permanently kept in, RAM on startup.
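Since static data is kept in memory as JSON, one quick way to sanity-check the "25 records with 8 fields" question is to serialize a representative payload and measure it. A Node.js sketch, with an invented record shape standing in for a news item:

```javascript
// Estimate how much memory a static-data payload would roughly occupy
// by measuring its JSON-serialized byte length.
const record = {
  title: 'Example headline',
  url: 'https://example.com/story',
  source: 'Example News',
  author: 'A. Writer',
  published: '2021-10-01T09:00:00Z',
  summary: 'A couple of sentences of summary text for the item.',
  category: 'tech',
  id: '12345',
};
const records = Array.from({ length: 25 }, () => ({ ...record }));
const bytes = Buffer.byteLength(JSON.stringify(records), 'utf8');
console.log(`${(bytes / 1024).toFixed(1)} KiB`); // prints the size in KiB
```

Records like these come out at a few kilobytes for the whole batch, which is comfortably "very small". For whole multi-page articles it would likely be better to store only a key per article (URL or a hash) rather than the full text.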

Ok, that sounds great, that could solve all my troubles. I guess the limitations are based on the server you have it installed on, the limitation likely being RAM rather than CPU?

In my case I am testing self-hosted on a 4 GB Vultr server, so I could probably process a few pages (or maybe even a lot of pages) of text content using staticData without issue, if it's just for personal workflows without many simultaneous executions. I guess the answer is just to test and see!