It might be a simple automation, but I’d appreciate some help.
I’m trying to scrape trending posts on Reddit using n8n and store them in a Google Sheet. Each time the Reddit scraper node runs, I need to provide two inputs: Subreddit and Keyword.
The challenge is ensuring that each execution fetches only unique posts—avoiding duplicates from previous runs.
For example:
Subreddit: autotaskPSA
Keyword: Quotes
If the node runs today and retrieves Post X with URL Y, I want to ensure that when it runs again (e.g., a day later), it doesn’t scrape Post X again.
How can I achieve this? Any guidance would be greatly appreciated!
The Reddit API behind this node returns the posts you ask for, but it has no concept of "unique posts" across runs: there is no time-range parameter you could filter on, only a "Sort" option under "Additional Fields".
Therefore, we need to store the fetched items in a separate database (Google Sheet) and use this database to check for new posts before adding them to the final database.
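Here is a rough sketch of how that check could look in a Code node placed right after the Reddit node. The node name "Read Sheet A" and the field name "url" are placeholders for whatever your workflow actually uses, so adjust them to your own setup:

```javascript
// Code node, mode "Run Once for All Items".
// Assumes an earlier Google Sheets node named "Read Sheet A" has already
// loaded the rows of the tracking sheet, and that both sources keep the
// post URL in a field called "url" (rename to match your columns).

// Collect every URL that is already recorded in Sheet A
const seenUrls = new Set(
  $('Read Sheet A').all().map(row => row.json.url)
);

// Keep only the Reddit posts whose URL is not in the sheet yet
return $input.all().filter(post => !seenUrls.has(post.json.url));
```

Whatever comes out of that node is new and can safely be appended to the final sheet.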
I hope that makes sense.
Please don’t hesitate to ask for help if you need further explanations…
If this answers your question, please mark the reply as the solution
Let’s say I’ve figured out how to store the retrieved values from the Reddit node into a database (Google Sheets in my case). However, I’m unsure how to compare the unique identifier (Reddit Post URL) with my initial database before adding it to the final dataset.
Example Workflow:
The scraper fetches X (Reddit Post URL) and Y (Post Title) and stores them in Google Sheet A (used to check for duplicates).
Each time a post is scraped, the workflow needs to search the column containing X (Reddit Post URL) to check if it already exists.
Challenges:
If a duplicate is detected, how can I prevent it from being added again?
Since the subreddit name and keyword remain constant, how do I ensure the Reddit scraper fetches only new posts instead of retrieving the same ones repeatedly?
Wait, actually, I think we can simplify things. I was over-engineering the solution and forgot that you can just use the same Google Sheets node to map the fields and append only the new results.
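If I remember the current Google Sheets node correctly, the relevant settings look roughly like this (operation and option names are from memory and may differ slightly between node versions, and "URL"/"Title" stand in for whatever your columns are called):

```
Google Sheets node (sketch):
  Operation:           Append or Update Row
  Column to Match On:  URL
  Values to Send:      URL, Title (mapped from the Reddit node output)
```

With the matching column set, rows whose URL already exists are updated in place instead of being appended again, so only genuinely new posts end up as new rows.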
Hey, thanks for taking the time to reply! @mohamed3nan
I’m running a Reddit scraper node that fetches a set of posts based on a given keyword and subreddit. The first time it runs, it retrieves five unique posts and adds them to Google Sheets. However, when I re-run the node with the same keyword and subreddit, it doesn’t add any new data because the Google Sheets node flags the URLs as duplicates.
What I need is for each subsequent run to fetch new posts that are not already in the sheet.
For example:
Subreddit: r/ads
Keyword: conversion tracking
First Run (Limit: 3) → Returns posts A, B, C (added to the sheet)
Second Run (Limit: 3) → Expected output: D, E, F
But instead, it detects duplicates (A, B, C) and doesn’t add any new data.
Desired behavior: Each time the node runs, it should check the sheet’s URL field and fetch only new posts that aren’t already recorded.
Based on your description, no new data is being added because the API returns the same posts that have already been recorded. The Google Sheets node then identifies these posts as duplicates based on their URLs.
Your workflow is functioning as intended. For example:
First Run (Limit: 3): The API returns posts A, B, and C, which are added to the sheet.
Second Run (Limit: 3): You expected the API to return posts D, E, and F.
However, since the API still returns posts A, B, and C (as they are the most recent posts), the Google Sheets node does not add any new data because those URLs already exist in the sheet.
Maybe you’ll have to wait some time until there is actually new data available from the API.