LinkedIn URL Normalization and Deduplication Failures, Apify Scraper Duplicates & Limits, and Intermittent Email Validation Errors (n8n v1.97.1)

Bug Description

I’m encountering a series of interconnected and persistent issues with my n8n workflow designed for LinkedIn lead processing, specifically concerning LinkedIn URL normalization, deduplication, Apify scraping, and email validation. These problems have been present since I started building this complex workflow.

My overall workflow aims to:

  1. Get leads from Google Sheets.
  2. Normalize LinkedIn profile URLs for consistent identification.
  3. Deduplicate leads based on normalized LinkedIn URLs and email addresses.
  4. Scrape LinkedIn profile data using Apify.
  5. Validate email addresses using EmailGuard.io.
  6. Generate personalized outreach materials and save results to Outlook.

Here’s a detailed breakdown of the problems I’ve faced, chronologically where possible:


Phase 1: Initial Setup and Deduplication Challenges

Problem 1: Difficulty with Accurate Deduplication (Initial State)

  • Initial Goal: My primary goal from the beginning was to prevent processing duplicate leads. Leads often come from Google Sheets, where a single person might appear multiple times with slightly different data, or even multiple times with the same core LinkedIn URL but varied parameters (e.g., ?trk=).
  • Challenge: I needed a reliable way to identify unique individuals, and the LinkedIn Profile URL seemed the most robust identifier. However, direct comparison of raw LinkedIn URLs was failing due to variations.

Problem 2: “normalizedLinkedinUrl” Field Missing in “Remove Duplicates” Node (First Major Hurdle)

  • Introduction of Normalization: To address the deduplication challenge, I implemented a Code node (named Normalize LinkedIn URL) early in my workflow, directly after fetching leads from Google Sheets. Its purpose is to clean up LinkedIn URLs by stripping query strings and fragments (anything after ? or #) to produce a consistent, normalized URL (e.g., https://www.linkedin.com/in/pamelajgoodwin/). This normalized URL was intended to be stored in a new field called normalizedLinkedinUrl; a simplified sketch of that Code node follows this problem description.
  • First Error: When I then tried to use a Remove Duplicates node, configured to compare based on this new normalizedLinkedinUrl field, it would consistently fail with the error: "normalizedLinkedinUrl" field is missing from some input items.
  • Debugging Attempts:
    • I inspected the output of my Normalize LinkedIn URL node, and for most items, the normalizedLinkedinUrl field seemed to be correctly generated and present.
    • I tried adding an IF node (IF normalized linkedin Exists) before Remove Duplicates to filter out items without this field. However, the error persisted, suggesting that either the IF condition wasn’t catching all cases, or the data flow was more complex.
    • I was confused about why the Remove Duplicates node was still complaining if the IF node was supposed to ensure the field existed.
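
For reference, here is a simplified sketch of what the Normalize LinkedIn URL Code node does. The input field name personLinkedin is the one referenced later in the Apify request body; the real node may differ slightly in the details.

```javascript
// Simplified sketch of the "Normalize LinkedIn URL" Code node
// (mode: Run Once for All Items).
return $input.all().map(item => {
  const raw = item.json.personLinkedin || '';   // raw URL from Google Sheets
  let normalized = '';
  if (raw) {
    // strip query string and fragment, trim whitespace, force a trailing slash
    normalized = raw.split('?')[0].split('#')[0].trim();
    if (!normalized.endsWith('/')) normalized += '/';
  }
  return { json: { ...item.json, normalizedLinkedinUrl: normalized } };
});
```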

Problem 3: Data Stream Mismatch and Field Nesting Issues (Root Cause of Deduplication Failure)

  • Discovery: Through detailed debugging, I realized the core issue was that my workflow had branched: one path was normalizing LinkedIn URLs, and another path (my email validation branch, involving EmailGuard and mails.so Outlook) was processing emails.
  • Key Insight: The normalizedLinkedinUrl was being added at the top level of item.json in one branch, while the email validation data (especially email after processing by EmailGuard) was often nested under item.json.data.
  • Consequence: When the workflow paths reconverged, the Remove Duplicates node (and the IF normalized linkedin Exists node before it) was receiving items where either normalizedLinkedinUrl was missing or email was nested incorrectly, leading to the “field missing” error.
  • Solution Implemented:
    1. I added a Set node (prepare LinkedIn Data) after Normalize LinkedIn URL to explicitly ensure normalizedLinkedinUrl and email were at the root level of item.json.
    2. I added another Set node (prepare Email Data) after the True branch of my IF email = deliverable1 node (from the email validation path). This Set node brings the email field to the top level ($json.email) and also explicitly sets normalizedLinkedinUrl to an empty string ("") for items coming only through the email path, guaranteeing the field exists for all items (a Code-node equivalent of this flattening step is sketched after this list).
    3. Crucially, I then inserted a Merge node (in Append mode) to combine the outputs of prepare LinkedIn Data and prepare Email Data. This ensures all items passing to downstream nodes (like IF normalized linkedin Exists and Remove Duplicates) consistently have both normalizedLinkedinUrl and email at the top level.
    4. I updated the IF normalized linkedin Exists condition to {{ $json.normalizedLinkedinUrl }} is Exists, and the Remove Duplicates comparison fields to normalizedLinkedinUrl,email (removing any data. prefixes).
  • Current Status: This structural fix significantly improved the deduplication process, addressing the “missing field” errors.
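
For anyone trying to reproduce this, the two Set nodes are functionally equivalent to the following Code node. This is an illustrative sketch only: in the actual workflow these are plain Set nodes, and data.email is the nesting produced by the EmailGuard branch.

```javascript
// Illustrative equivalent of "prepare LinkedIn Data" / "prepare Email Data":
// ensure both normalizedLinkedinUrl and email exist at the top level of item.json.
return $input.all().map(item => {
  const j = item.json;
  return {
    json: {
      ...j,
      // lift a nested email (e.g. from the EmailGuard branch) to the root if needed
      email: j.email ?? j.data?.email ?? '',
      // guarantee the field Remove Duplicates compares on always exists
      normalizedLinkedinUrl: j.normalizedLinkedinUrl ?? '',
    },
  };
});
```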

Phase 2: Apify Scraper Problems

Problem 4: Apify Scraper Returns Duplicate Leads

  • Observation: Despite implementing the LinkedIn URL normalization and subsequent deduplication, I noticed that the apify - person LinkedIn Scrape node (my HTTP Request node calling the Apify LinkedIn Profile Scraper actor) was still consistently returning duplicate scraped data. For instance, with 10 unique LinkedIn URLs as input, 6 of the output items would be identical scraped profiles, even though they originated from distinct input URLs. My Apify console shows multiple successful runs that each returned “1 result” in the dataset, but these results often lead to duplicates in my workflow.
  • Details:
    • I confirmed that the Loop Over Items node was correctly passing unique personLinkedIn URLs to the Apify scraper for each iteration.
    • The Apify node’s JSON body is configured to send {"profileUrls": ["{{ $json.personLinkedin }}"]}, indicating a single URL per request.
    • I explicitly verified that the “Batching” setting on the Apify HTTP Request node was OFF.
    • I tried adding Wait nodes (e.g., a 22 sec wait! node after Apify) to mitigate potential rate limits, but the duplicates persisted.
    • When checking the Apify Console, I could see unique run IDs for each n8n execution, but the datasets retrieved from those runs would contain the same duplicate data.
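
Independent of the root cause, a possible safety net (not currently in the workflow) would be a Code node after the dataset retrieval that drops repeated profiles within one execution. The sketch below assumes the scraped result exposes a publicIdentifier or a profile URL field; the exact key would need to be checked against the real Apify response.

```javascript
// Possible safety net after retrieving the Apify dataset: drop items whose
// profile has already been seen in this execution.
// Assumes a publicIdentifier (or profile URL) field exists on the scraped result -
// adjust the key to the actual response shape.
const seen = new Set();
const unique = [];
for (const item of $input.all()) {
  const key =
    item.json.publicIdentifier ??
    item.json.linkedinUrl ??
    JSON.stringify(item.json);
  if (!seen.has(key)) {
    seen.add(key);
    unique.push(item);
  }
}
return unique;
```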

Problem 5: Apify Scraper Exhibiting “Limits” Errors with Webhook Triggers

  • Context: In earlier iterations of my workflow (specifically “Workflow A” and “Workflow B”), which were triggered by webhooks, the Apify scraper would frequently return “limits” related errors.
  • Observation: I have encountered “Payment required - perhaps check your payment details?” and “Problem in node ‘apify - person LinkedIn Scrape’ Payment required - perhaps check your payment details?” errors. This suggests hitting Apify’s API rate limits or usage limits more aggressively when the workflow was triggered externally via webhooks, compared to manual execution.
  • Interplay with other problems: I observed a strange behavior: when I simplified the workflow (by temporarily removing the entire “normalized LinkedIn path” branch), the Apify “limits” errors seemed to ease off. However, this simplification then caused the EmailGuard problem (Problem 6) to surface or become more prominent. This implies a complex and perhaps resource-intensive interaction between the different parts of my workflow.

Phase 3: EmailGuard Validation Problems

Problem 6: Intermittent “Email field is required” error on EmailGuard node (Prominent in Simplified Workflow)

  • Context: This error became particularly noticeable and problematic when I simplified the workflow (removing the LinkedIn normalization path) to troubleshoot the Apify limits issue.
  • Observation: The EmailGuard1 node (an HTTP Request node) would intermittently fail for specific items (like itemIndex: 2) with the error: "The email field is required.".
  • Debugging Efforts:
    • I verified that the input to EmailGuard1 for the failing item clearly showed a valid email field with a populated string value.
    • I confirmed that the EmailGuard1 node’s JSON body was correctly configured to dynamically reference the email: {"email": "{{ $json.email }}"}. (Initially, I had accidentally hardcoded this to {"email": ""}, which was identified and corrected, but the intermittent error persists even after this fix).
    • The error points to the field being “required” but the input data shows it is present. This is highly perplexing and suggests an underlying issue in how n8n is sending the request or how the EmailGuard API is interpreting it for certain specific email values.
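
One way to narrow this down would be a small guard Code node directly in front of EmailGuard1 that trims the email and fails loudly when the value is actually empty, so the “field is required” response can be tied to a concrete input item. This is a diagnostic sketch, not something already in the workflow:

```javascript
// Guard node before EmailGuard1: trim the email and surface which item (if any)
// actually reaches the HTTP Request with an empty value.
return $input.all().map((item, i) => {
  const email = (item.json.email ?? '').toString().trim();
  if (!email) {
    // Fail with the item index so the problem row can be traced back to the source sheet.
    throw new Error(`Item ${i} has an empty email field`);
  }
  return { json: { ...item.json, email } };
});
```

If a stray leading or trailing space in the spreadsheet cell is the culprit, the trim above would also take care of it.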

Phase 4: Other Observed Errors

Problem 7: “Cannot read properties of undefined (reading ‘publicIdentifier’)” error in an ‘add groupKey1’ node

  • Observation: I’ve also encountered an error in an ‘add groupKey1’ node, stating “Cannot read properties of undefined (reading ‘publicIdentifier’)”. This error means the object the node tries to read publicIdentifier from is undefined at that point, which again points to a data flow or data structure problem earlier in the branch. This specific error is visible in one of my comprehensive workflow screenshots, suggesting it’s part of a later stage.
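
A defensive version of that lookup would avoid the crash and make the missing data visible instead. In the sketch below, profile is a placeholder for whichever object the add groupKey1 node actually reads publicIdentifier from, and the fallback to normalizedLinkedinUrl is only an assumption about a reasonable grouping key:

```javascript
// Defensive lookup: "Cannot read properties of undefined" means the parent object
// is missing, so guard it before reading publicIdentifier.
// "profile" is a placeholder for the object the add groupKey1 node actually uses.
return $input.all().map(item => {
  const profile = item.json.profile ?? {};            // placeholder source object
  const groupKey =
    profile.publicIdentifier ??
    item.json.normalizedLinkedinUrl ??
    'unknown';
  return { json: { ...item.json, groupKey } };
});
```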

Current Status and Impact
These combined issues are severely hindering the reliability and efficiency of my lead processing workflow. The deduplication struggles, the Apify duplicates and limits, the EmailGuard intermittent failures, and other data-related errors mean I cannot trust the integrity or completeness of the processed leads.

Additional Context and Questions

My client has a SaaS product, and we are aiming to send 800 personalized emails daily.

Over the past 4 days my n8n instance has had lots of bugs. I use the same hosting setup as in the workshop: a Render server at $25/month.

Screenshot 1 shows the full workflow. With this structure the personalization sometimes runs smoothly, but sometimes it mixes up data from earlier leads, e.g. it uses the first name of the current lead while the personalization itself is built from the data of a lead 3 or 4 positions earlier.

Data for personalization comes from LinkedIn Apify scrapers (Profile scraper and Posts scraper for company and person LinkedIn pages).

I found that the “Get Data Sets - from Post Scraper” node somehow outputs mixed-up data, so I changed the beginning of the workflow.

Screenshot 2 shows the modified beginning of the workflow. For help I used ChatGPT, and its suggestion was to normalize all LinkedIn URLs. Now the data comes through in order and well structured, without mistakes, but I am facing an issue with the Apify Profile scraper, shown in Screenshot 3.

I have tried looping the items before the Apify Profile scrape node, adding a Wait node (with delays from 30 to 120 seconds), using a webhook to hand the items off to a new workflow, creating new API tokens, and trying new accounts, but all the leads still go to the scraper at the same time.

The video shows the additional part of the workflow with the “code nodes and GPT o4 mini LinkedIn data analyst” setup.

I have two questions.

  1. I would like to stick with the workflow shown in Screenshot 1. How can I fix the “Get data sets” node so it does not output mixed-up data?

  2. If option 1 is not possible, what do I need to do to stop the leads from going through together, and instead process them one lead at a time, in the modified workflow with normalized LinkedIn URLs?