How do you handle getting only new items/deduplicating your APIs/services?

Hi everyone,

We’re currently working on a new feature in n8n that will allow you to only get new items from your APIs/services. This means that you won’t have to worry about duplicates or missing items when running your workflows on a schedule.

We’d love to get your feedback on this! Here are a few questions to get the conversation started:

  • How do you currently handle getting only new items from your APIs/services?
  • What are some of the challenges you’ve faced with this?
  • How would this new feature help you and your workflows?

I’m excited to hear your thoughts!

Thanks, Nik


Hi Nik,

Firstly, thank you for your work on this. Here are my 2 cents:

  • I’ve explored a few options (static data, grabbing the last run execution’s webhook input, etc.), but found my solution in the state-machine community node.
  • Needing to set up an external service (such as Redis) or any other method can incur additional costs.
  • Having a native, maintained solution to this problem, without any extra dependencies, would be amazing to say the least.

Currently I use the blunt-force method of storing seen items in staticData (a minimal sketch follows the list below). However, this has some drawbacks:

  • It doesn’t work if your workflow executes in parallel (staticData is only written/updated after the workflow ends).
  • You have to manage the duplicate list so it doesn’t grow unbounded and make all workflow-related DB requests laggy.
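
For reference, here is a minimal sketch of that blunt-force approach in an n8n Code node. The `id` field name and the 5,000-item cap are assumptions for illustration:

```javascript
// n8n Code node: dedupe against a list kept in workflow static data.
// Note: static data only persists in production (trigger-based) runs,
// and is written back only after the execution finishes.
const staticData = $getWorkflowStaticData('global');
staticData.seenIds = staticData.seenIds || [];

// Keep only items whose id we haven't seen before
const newItems = $input.all().filter(
  (item) => !staticData.seenIds.includes(item.json.id)
);

// Remember the new ids, capped so the list can't grow unbounded
staticData.seenIds.push(...newItems.map((item) => item.json.id));
staticData.seenIds = staticData.seenIds.slice(-5000);

return newItems;
```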

The other method, if running in queue mode, is to use Redis, as it’s already a requirement there. However, this introduces extra nodes and special handling for workflow failures, and the Redis node is missing some features, such as support for DECR and the NX option (a sketch of the NX pattern is below).
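Since the Redis node doesn’t expose NX, here is how that pattern looks as a standalone Node.js sketch using the `redis` npm client (v4); the key prefix and the 24-hour TTL are assumptions:

```javascript
// Dedupe via SET ... NX EX: the write only succeeds if the key is new,
// and the TTL keeps the dedupe set from growing unbounded.
const { createClient } = require('redis');

async function isNewItem(client, id) {
  const result = await client.set(`dedupe:${id}`, '1', { NX: true, EX: 86400 });
  return result === 'OK'; // null means the key already existed, i.e. a duplicate
}

async function main() {
  const client = createClient(); // assumes Redis on localhost:6379
  await client.connect();
  console.log(await isNewItem(client, 'order-42')); // true: first time seen
  console.log(await isNewItem(client, 'order-42')); // false: duplicate
  await client.quit();
}

main().catch(console.error);
```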

Ideally I’d like to see a node-settings-level feature (available in all nodes) that can either memoize (caching for speed/efficiency) or deduplicate based on one or more fields. The important part is to make sure the list doesn’t grow unbounded, so there should be a way to clean up the duplicate list by either number of items or by last seen, e.g. the last 5,000 items, or items last seen within 24 hours; a sketch of that pruning policy follows.
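To make that concrete, here is a small standalone sketch of the pruning policy (trim by last-seen age, then by size); the names and limits are illustrative:

```javascript
// Prune a { key: lastSeenTimestamp } map by age, then by entry count.
const MAX_ENTRIES = 5000;
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // 24 hours

function pruneSeen(seen, now = Date.now()) {
  // Drop entries not seen within the age window
  let entries = Object.entries(seen).filter(([, ts]) => now - ts <= MAX_AGE_MS);
  // Then keep only the MAX_ENTRIES most recently seen
  entries.sort((a, b) => b[1] - a[1]);
  return Object.fromEntries(entries.slice(0, MAX_ENTRIES));
}

// Usage: 'stale' falls out by age, 'fresh' survives
const seen = {
  stale: Date.now() - 25 * 60 * 60 * 1000,
  fresh: Date.now(),
};
console.log(Object.keys(pruneSeen(seen))); // ['fresh']
```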

Hello, I use APIs/services in every workflow, and this would be a great feature.

  • How do you currently handle getting only new items from your APIs/services?
    Checking the edit date, or saving IDs in an external database and checking them on every run (see the sketch after this list).
  • What are some of the challenges you’ve faced with this?
    The data doesn’t always have an edit date to check.
  • How would this new feature help you and your workflows?
    It would be really useful on my side, because I always have to use JavaScript or other nodes to check this.
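
A minimal sketch of the edit-date approach in an n8n Code node; the `updatedAt` field name is an assumption about the API’s payload:

```javascript
// Keep only items edited since the last run, using a watermark in static data.
const staticData = $getWorkflowStaticData('global');
const lastRun = staticData.lastRun ? new Date(staticData.lastRun) : new Date(0);

const newItems = $input.all().filter(
  (item) => new Date(item.json.updatedAt) > lastRun
);

// Advance the watermark to the newest edit date we actually saw,
// which is safer than "now" if the source's clock drifts from ours.
if (newItems.length > 0) {
  staticData.lastRun = newItems
    .map((item) => item.json.updatedAt) // ISO strings sort lexicographically
    .sort()
    .pop();
}

return newItems;
```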

I’ve just added an integration that had to handle (prevent) duplicates, and dug up a pretty ugly workaround: setting a flag on the remote data, e.g. n8n_status = 2, and then checking that status on the next run of the workflow (a rough sketch follows below).

Ugly, but effective.

At the time I thought it’d be a great feature if you could “tag” an object (received from an API/service), have that tag stored in your workflow history, and then be able to filter based on it, i.e. to save me having to create a custom flag on the data source side. 🙂
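
For anyone curious, the workaround looks roughly like this as a standalone Node.js sketch; the endpoint is hypothetical and the `n8n_status` field mirrors the post, but everything here is illustrative:

```javascript
// Fetch records, skip those already flagged, flag the rest after processing.
const BASE_URL = 'https://api.example.com/records'; // hypothetical endpoint

async function processNewRecords() {
  const records = await fetch(BASE_URL).then((res) => res.json());
  const fresh = records.filter((rec) => rec.n8n_status !== 2);

  for (const rec of fresh) {
    // ... handle the record in the workflow ...
    // then flag it so the next run ignores it
    await fetch(`${BASE_URL}/${rec.id}`, {
      method: 'PATCH',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ n8n_status: 2 }),
    });
  }
  return fresh;
}

processNewRecords().then((items) => console.log(`${items.length} new records`));
```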

Hey,

I use n8n as a SIEM connector & SOAR tool, and this feature is at the top of my list, as handling massive amounts of logs/events is not easy with any kind of static data.

  • How do you currently handle getting only new items from your APIs/services?
    Currently I plan the executions by time windows so that no duplicates arrive (see the sketch after this list).

  • What are some of the challenges you’ve faced with this?
    The current setup still lets some duplicates through, and there is currently no good way to handle them.

  • How would this new feature help you and your workflows?
    It would mean paying less attention to execution times, and it would remove duplicates (very important in the SIEM case).
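
As an illustration of the time-based planning, here is a standalone sketch of non-overlapping polling windows; each run fetches only [since, until), so consecutive runs cannot re-deliver the same events (all names are illustrative):

```javascript
// Each call returns a window [since, until); the next window starts exactly
// where the previous one ended, so scheduled runs never overlap.
function nextWindow(state, now = new Date()) {
  const since = state.windowEnd ?? new Date(now.getTime() - 3600_000); // first run: last hour
  state.windowEnd = now;
  return { since: since.toISOString(), until: now.toISOString() };
}

// Usage: simulate two consecutive scheduled runs
const state = {};
console.log(nextWindow(state, new Date('2024-01-01T10:00:00Z')));
// { since: '2024-01-01T09:00:00.000Z', until: '2024-01-01T10:00:00.000Z' }
console.log(nextWindow(state, new Date('2024-01-01T10:05:00Z')));
// { since: '2024-01-01T10:00:00.000Z', until: '2024-01-01T10:05:00.000Z' }
```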

I have some experience from the past working with Tines, and they had a nice way to handle this: Deduplicate mode | Docs | Tines. Since we already have a history of executions, why not read it and use it as the dedupe history?

Thanks

  • One workflow to receive webhook/API data and store it in Baserow.
  • The processing workflow uses an IF condition to continue only if it’s the earliest execution in flight; it checks the main database to see whether that data was already processed, then proceeds (a rough sketch of the check is below).
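
A rough sketch of that “already processed?” check against Baserow’s row-listing API; the table id, token, field name, and filter syntax are assumptions here:

```javascript
// Query the Baserow table for the incoming id; any match means a duplicate.
const BASEROW_URL = 'https://api.baserow.io/api/database/rows/table/123/';
const TOKEN = 'YOUR_DATABASE_TOKEN'; // placeholder

async function alreadyProcessed(externalId) {
  const url =
    `${BASEROW_URL}?user_field_names=true` +
    `&filter__ExternalId__equal=${encodeURIComponent(externalId)}`;
  const res = await fetch(url, { headers: { Authorization: `Token ${TOKEN}` } });
  const data = await res.json();
  return data.count > 0; // rows found: the payload was processed before
}

alreadyProcessed('evt_001').then((dup) => console.log(dup ? 'skip' : 'process'));
```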