It might be a simple automation, but I’d appreciate some help.
I’m trying to scrape trending posts on Reddit using n8n and store them in a Google Sheet. Each time the Reddit scraper node runs, I need to provide two inputs: Subreddit and Keyword.
The challenge is ensuring that each execution fetches only unique posts—avoiding duplicates from previous runs.
For example:
Subreddit: autotaskPSA
Keyword: Quotes
If the node runs today and retrieves Post X with URL Y, I want to ensure that when it runs again (e.g., a day later), it doesn’t scrape Post X again.
How can I achieve this? Any guidance would be greatly appreciated!
The Reddit API behind this node returns the posts you ask for, but it has no concept of "unique posts" across runs: there is no time-range parameter you could filter on, only a "Sort" option under "Additional Fields".
Therefore, we need to store the fetched items in a separate database (Google Sheet) and use this database to check for new posts before adding them to the final database.
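Here is a rough sketch of how that check could look in a Code node placed right after the Reddit node. The node name "Read Sheet A" and the field name "url" are placeholders for whatever your workflow actually uses, so adjust them to your own setup:

```javascript
// Code node, mode "Run Once for All Items".
// Assumes an earlier Google Sheets node named "Read Sheet A" has already
// loaded the rows of the tracking sheet, and that both sources keep the
// post URL in a field called "url" (rename to match your columns).

// Collect every URL that is already recorded in Sheet A
const seenUrls = new Set(
  $('Read Sheet A').all().map(row => row.json.url)
);

// Keep only the Reddit posts whose URL is not in the sheet yet
return $input.all().filter(post => !seenUrls.has(post.json.url));
```

Whatever comes out of that node is new and can safely be appended to the final sheet.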
I hope that makes sense.
Please don’t hesitate to ask for help if you need further explanations…
If this answers your question, please mark the reply as the solution
Let’s say I’ve figured out how to store the retrieved values from the Reddit node into a database (Google Sheets in my case). However, I’m unsure how to compare the unique identifier (Reddit Post URL) with my initial database before adding it to the final dataset.
Example Workflow:
The scraper fetches X (Reddit Post URL) and Y (Post Title) and stores them in Google Sheet A (used to check for duplicates).
Each time a post is scraped, the workflow needs to search the column containing X (Reddit Post URL) to check if it already exists.
Challenges:
If a duplicate is detected, how can I prevent it from being added again?
Since the subreddit name and keyword remain constant, how do I ensure the Reddit scraper fetches only new posts instead of retrieving the same ones repeatedly?
Wait, actually, I think we can simplify things. I was over-engineering the solution and forgot that you can just use the same Google Sheets node to map the fields and append only the new results.
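If I remember the current Google Sheets node correctly, the relevant settings look roughly like this (operation and option names are from memory and may differ slightly between node versions, and "URL"/"Title" stand in for whatever your columns are called):

```
Google Sheets node (sketch):
  Operation:           Append or Update Row
  Column to Match On:  URL
  Values to Send:      URL, Title (mapped from the Reddit node output)
```

With the matching column set, rows whose URL already exists are updated in place instead of being appended again, so only genuinely new posts end up as new rows.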
Hey, thanks for taking the time to reply! @mohamed3nan
I’m running a Reddit scraper node that fetches a set of posts based on a given keyword and subreddit. The first time it runs, it retrieves five unique posts and adds them to Google Sheets. However, when I re-run the node with the same keyword and subreddit, it doesn’t add any new data because the Google Sheets node flags the URLs as duplicates.
What I need is for each subsequent run to fetch new posts that are not already in the sheet.
For example:
Subreddit: r/ads
Keyword: conversion tracking
First Run (Limit: 3) → Returns posts A, B, C (added to the sheet)
Second Run (Limit: 3) → Expected output: D, E, F
But instead, it detects duplicates (A, B, C) and doesn’t add any new data.
Desired behavior: Each time the node runs, it should check the sheet’s URL field and fetch only new posts that aren’t already recorded.
Based on your description, no new data is being added because the API returns the same posts that have already been recorded. The Google Sheets node then identifies these posts as duplicates based on their URLs.
Your workflow is functioning as intended. For example:
First Run (Limit: 3): The API returns posts A, B, and C, which are added to the sheet.
Second Run (Limit: 3): You expected the API to return posts D, E, and F.
However, since the API still returns posts A, B, and C (as they are the most recent posts), the Google Sheets node does not add any new data because those URLs already exist in the sheet.
Maybe you’ll have to wait some time until there is actually new data available from the API.