Bug Description
I’m encountering a series of interconnected and persistent issues with my n8n workflow designed for LinkedIn lead processing, specifically concerning LinkedIn URL normalization, deduplication, Apify scraping, and email validation. These problems have been present since I started building this complex workflow.
My overall workflow aims to:
- Get leads from Google Sheets.
- Normalize LinkedIn profile URLs for consistent identification.
- Deduplicate leads based on normalized LinkedIn URLs and email addresses.
- Scrape LinkedIn profile data using Apify.
- Validate email addresses using EmailGuard.io.
- Generate personalized outreach materials and save results to Outlook.
Here’s a detailed breakdown of the problems I’ve faced, chronologically where possible:
Phase 1: Initial Setup and Deduplication Challenges
Problem 1: Difficulty with Accurate Deduplication (Initial State)
- Initial Goal: My primary goal from the beginning was to prevent processing duplicate leads. Leads often come from Google Sheets, where a single person might appear multiple times with slightly different data, or even multiple times with the same core LinkedIn URL but varied parameters (e.g., `?trk=`).
- Challenge: I needed a reliable way to identify unique individuals, and the LinkedIn Profile URL seemed the most robust identifier. However, direct comparison of raw LinkedIn URLs was failing due to variations.
Problem 2: “normalizedLinkedinUrl” Field Missing in “Remove Duplicates” Node (First Major Hurdle)
- Introduction of Normalization: To address the deduplication challenge, I implemented a `Code` node (named `Normalize LinkedIn URL`) early in my workflow, directly after fetching leads from Google Sheets. Its purpose was to clean up LinkedIn URLs by removing parameters (the `?` and `#` components) to create a consistent, normalized URL (e.g., `https://www.linkedin.com/in/pamelajgoodwin/`). This normalized URL was intended to be stored in a new field called `normalizedLinkedinUrl` (a sketch of this node follows after the debugging notes below).
- First Error: When I then tried to use a `Remove Duplicates` node, configured to compare based on this new `normalizedLinkedinUrl` field, it would consistently fail with the error: `"normalizedLinkedinUrl" field is missing from some input items`.
- Debugging Attempts:
  - I inspected the output of my `Normalize LinkedIn URL` node, and for most items, the `normalizedLinkedinUrl` field seemed to be correctly generated and present.
  - I tried adding an `IF` node (`IF normalized linkedin Exists`) before `Remove Duplicates` to filter out items without this field. However, the error persisted, suggesting that either the `IF` condition wasn’t catching all cases, or the data flow was more complex.
  - I was confused about why the `Remove Duplicates` node was still complaining if the `IF` node was supposed to ensure the field existed.
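For context, the normalization logic looks roughly like this. A minimal sketch, not my exact node, assuming the raw URL arrives in the `personLinkedin` field (the same field the Apify request body references later):

```javascript
// Sketch of a "Normalize LinkedIn URL" Code node (Run Once for All Items).
// Strips the query string (?trk=...) and fragment (#...), lowercases the
// URL, and ensures a trailing slash so identical profiles compare equal.
return $input.all().map((item) => {
  const raw = item.json.personLinkedin || '';
  let normalized = '';
  try {
    const url = new URL(raw);
    // Keep only origin + path, dropping the ? and # components.
    normalized = (url.origin + url.pathname).toLowerCase();
    if (!normalized.endsWith('/')) normalized += '/';
  } catch (e) {
    // Malformed or missing URL: leave normalized empty so a downstream
    // IF node can filter the item instead of the node throwing.
  }
  item.json.normalizedLinkedinUrl = normalized;
  return item;
});
```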
Problem 3: Data Stream Mismatch and Field Nesting Issues (Root Cause of Deduplication Failure)
- Discovery: Through detailed debugging, I realized the core issue was that my workflow had branched: one path was normalizing LinkedIn URLs, and another path (my email validation branch, involving `EmailGuard` and `mails.so Outlook`) was processing emails.
- Key Insight: The `normalizedLinkedinUrl` was being added at the top level of `item.json` in one branch, while the email validation data (especially `email` after processing by EmailGuard) was often nested under `item.json.data`.
- Consequence: When the workflow paths reconverged, the `Remove Duplicates` node (and the `IF normalized linkedin Exists` node before it) was receiving items where either `normalizedLinkedinUrl` was missing or `email` was nested incorrectly, leading to the “field missing” error.
- Solution Implemented:
  - I added a `Set` node (`prepare LinkedIn Data`) after `Normalize LinkedIn URL` to explicitly ensure `normalizedLinkedinUrl` and `email` were at the root level of `item.json`.
  - I added another `Set` node (`prepare Email Data`) after the `True` branch of my `IF email = deliverable1` node (from the email validation path). This `Set` node ensures the `email` field is brought to the top level (`$json.email`) and also explicitly sets `normalizedLinkedinUrl` to an empty string (`""`) for items coming only through the email path, guaranteeing the field exists for all items.
  - Crucially, I then inserted a `Merge` node (in `Append` mode) to combine the outputs of `prepare LinkedIn Data` and `prepare Email Data`. This ensures all items passing to downstream nodes (like `IF normalized linkedin Exists` and `Remove Duplicates`) consistently have both `normalizedLinkedinUrl` and `email` at the top level (a single `Code`-node sketch of this flattening follows below).
  - I updated the `IF normalized linkedin Exists` condition to `{{ $json.normalizedLinkedinUrl }}` with the `isExists` operator, and the `Remove Duplicates` comparison fields to `normalizedLinkedinUrl,email` (removing any `data.` prefixes).
- Current Status: This structural fix significantly improved the deduplication process, addressing the “missing field” errors.
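For reference, the flattening that the two `Set` nodes perform could equally live in a single `Code` node placed before the merge point. A minimal sketch, assuming the `item.json.data` nesting described above:

```javascript
// Sketch of a Code-node equivalent of the two Set nodes: hoist `email`
// and `normalizedLinkedinUrl` to the top level of item.json, whichever
// branch the item came from.
return $input.all().map((item) => {
  const nested = item.json.data || {};
  item.json.email = item.json.email || nested.email || '';
  // Guarantee the field exists (empty string on the email-only path)
  // so Remove Duplicates never sees a missing field.
  item.json.normalizedLinkedinUrl = item.json.normalizedLinkedinUrl || '';
  return item;
});
```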
Phase 2: Apify Scraper Problems
Problem 4: Apify Scraper Returns Duplicate Leads
- Observation: Despite implementing the LinkedIn URL normalization and subsequent deduplication, I noticed that the `apify - person LinkedIn Scrape` node (my `HTTP Request` node calling the Apify LinkedIn Profile Scraper actor) was still consistently returning duplicate scraped data. For instance, with 10 unique LinkedIn URLs as input, 6 of the output items would be identical scraped profiles, even though they originated from distinct input URLs. My Apify console shows multiple successful runs that each returned “1 result” in the dataset, but these results often lead to duplicates in my workflow.
- Details:
  - I confirmed that the `Loop Over Items` node was correctly passing unique `personLinkedIn` URLs to the Apify scraper for each iteration.
  - The Apify node’s JSON body is configured to send `{"profileUrls": ["{{ $json.personLinkedin }}"]}`, indicating a single URL per request.
  - I explicitly verified that the “Batching” setting on the Apify `HTTP Request` node was OFF.
  - I tried adding `Wait` nodes (e.g., a `22 sec wait!` node after Apify) to mitigate potential rate limits, but the duplicates persisted.
  - When checking the Apify Console, I could see unique run IDs for each n8n execution, but the datasets retrieved from those runs would contain the same duplicate data (a defensive post-scrape dedup sketch follows this list).
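As a stop-gap while the root cause is open, a `Code` node placed after the scraper (or after the loop) can drop repeated profiles defensively. A sketch, assuming scraped items expose `publicIdentifier` (the field Problem 7 below also references), with the raw JSON as a last-resort key:

```javascript
// Sketch of a defensive post-scrape dedup: keep only the first
// occurrence of each scraped profile.
const seen = new Set();
return $input.all().filter((item) => {
  // `publicIdentifier` and `url` are assumed Apify output fields;
  // fall back to the serialized item if neither is present.
  const key = item.json.publicIdentifier
    || item.json.url
    || JSON.stringify(item.json);
  if (seen.has(key)) return false; // drop duplicate scraped profile
  seen.add(key);
  return true;
});
```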
Problem 5: Apify Scraper Exhibiting “Limits” Errors with Webhook Triggers
- Context: In earlier iterations of my workflow (specifically “Workflow A” and “Workflow B”), which were triggered by webhooks, the Apify scraper would frequently return “limits”-related errors.
- Observation: I have encountered “Payment required - perhaps check your payment details?” errors on the `apify - person LinkedIn Scrape` node (n8n’s wording for an HTTP 402 response). This suggests hitting Apify’s usage or billing limits more aggressively when the workflow was triggered externally via webhooks, compared to manual execution.
- Interplay with other problems: I observed a strange behavior: when I simplified the workflow (by temporarily removing the entire “normalized LinkedIn path” branch), the Apify “limits” errors seemed to subside. However, this simplification then caused the EmailGuard problem (Problem 6) to surface or become more prominent. This implies a complex and perhaps resource-intensive interaction between the different parts of my workflow.
Phase 3: EmailGuard Validation Problems
Problem 6: Intermittent “Email field is required” error on EmailGuard node (Prominent in Simplified Workflow)
- Context: This error became particularly noticeable and problematic when I simplified the workflow (removing the LinkedIn normalization path) to troubleshoot the Apify limits issue.
- Observation: The `EmailGuard1` node (an HTTP Request node) would intermittently fail for specific items (like `itemIndex: 2`) with the error: `"The email field is required."`
- Debugging Efforts:
  - I verified that the input to `EmailGuard1` for the failing item clearly showed a valid `email` field with a populated string value.
  - I confirmed that the `EmailGuard1` node’s JSON body was correctly configured to dynamically reference the email: `{"email": "{{ $json.email }}"}`. (Initially, I had accidentally hardcoded this to `{"email": ""}`, which was identified and corrected, but the intermittent error persists even after this fix.)
  - The error points to the field being “required” while the input data shows it is present. This is highly perplexing and suggests an underlying issue in how n8n is sending the request or how the EmailGuard API is interpreting it for certain specific email values (a guard-node sketch follows this list).
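One pattern worth ruling out is invisible whitespace or a non-string value surviving the merge, which would make the rendered body effectively `{"email": ""}` even though the input panel shows a value. A sketch of a guard `Code` node placed directly before `EmailGuard1` (the `emailMissing` flag is an illustrative name):

```javascript
// Sketch of a pre-validation guard: coerce to string, strip ordinary and
// zero-width whitespace, and flag items whose email would serialize to
// an empty string in the JSON request body.
return $input.all().map((item) => {
  const email = String(item.json.email ?? '')
    .replace(/[\u200B-\u200D\uFEFF]/g, '') // zero-width chars survive .trim()
    .trim();
  item.json.email = email;
  // Route flagged items away with an IF node instead of failing mid-run.
  item.json.emailMissing = email.length === 0;
  return item;
});
```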
Phase 4: Other Observed Errors
Problem 7: “Cannot read properties of undefined (reading ‘publicIdentifier’)” error in an ‘add groupKey1’ node
- Observation: I’ve also encountered an error in an `add groupKey1` node, stating “Cannot read properties of undefined (reading ‘publicIdentifier’)”. In JavaScript this means the object the code reads `publicIdentifier` from is itself undefined, so it points to a data-flow or data-structure issue: the parent object that should carry `publicIdentifier` is not present or defined at that point in the workflow. This specific error is visible in one of my comprehensive workflow screenshots, suggesting it’s part of a later stage. A defensive rewrite is sketched below.
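Since the original `add groupKey1` code isn’t shown here, the following is only a sketch of the defensive pattern that avoids this class of crash; the `data.publicIdentifier` path is an assumption about where the field might be nested:

```javascript
// Sketch of a defensive rewrite for the `add groupKey1` logic:
// optional chaining returns undefined instead of throwing when the
// parent object is missing.
return $input.all().map((item) => {
  const id = item.json?.data?.publicIdentifier
    ?? item.json?.publicIdentifier
    ?? '';
  item.json.groupKey = id || 'unknown'; // fall back rather than throw
  return item;
});
```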
Current Status and Impact
These combined issues are severely hindering the reliability and efficiency of my lead processing workflow. The deduplication struggles, the Apify duplicates and limits, the EmailGuard intermittent failures, and other data-related errors mean I cannot trust the integrity or completeness of the processed leads.