Hi everyone,
I’m building an n8n workflow for a skincare Product Information Management (PIM) pipeline. The goal: given a product name (and optionally a brand name), the workflow should search the web, pull data from high-quality sources, validate and compare them, and output a strict JSON object with a fixed schema.
What I’m trying to achieve (scope)
Input:
- product_name (string), e.g. “Anua Heartleaf Pore Control Cleansing Oil Mild”
- brand_name (string, optional)
Output:
- A single JSON object (it must parse cleanly with JSON.parse) with a fixed structure (identity, variant, formula INCI list, claims, usage, warnings, sources), using null for unknowns, and strict types (booleans must be true/false/null, URLs must be raw URLs, arrays must be arrays).
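To illustrate the strictness I mean, here’s the shape as a TypeScript type (field names are placeholders, not my final schema; I can post the real one):

```ts
// Illustrative only: placeholder field names, not my final schema.
type ProductRecord = {
  identity: { brand: string | null; product_name: string; official_url: string | null };
  variant: { name: string | null; size_ml: number | null };
  formula: { inci: string[] };                // always an array, never a comma-joined string
  claims: { fragrance_free: boolean | null }; // true/false/null, never the string "true"
  usage: string | null;
  warnings: string[];
  sources: { url: string; what_it_supported: string }[]; // raw URLs, no markdown links
  notes: string | null;
};
```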
Critical constraints / pain points
- Source priority & validation
  - Prefer the official brand site / official regional site as the primary source of truth.
  - Then INCI databases (INCI Decoder, Skinsort, Incibeauty, CosDNA, etc.).
  - Retailers count only as confirmation, and only if they match the official listing.
  - Must resolve variant ambiguity (e.g., “Mild” vs. “Original”) by matching product title + INCI list + official URL. If they conflict, set official_url = null and explain in notes. (See the sketch after this list.)
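Roughly, this is the tiering and variant-matching logic I have in mind, sketched for an n8n Code node (the domain list and helper names are my own placeholders):

```ts
// Sketch of the tiering/variant logic (domain list and helper names are illustrative).
const INCI_DATABASES = ['incidecoder.com', 'skinsort.com', 'incibeauty.com', 'cosdna.com'];

// Tier 1 = official site, tier 2 = INCI database, tier 3 = retailer (confirmation only).
function sourceTier(url: string, officialDomains: string[]): 1 | 2 | 3 {
  const host = new URL(url).hostname.replace(/^www\./, '');
  if (officialDomains.some((d) => host.endsWith(d))) return 1;
  if (INCI_DATABASES.some((d) => host.endsWith(d))) return 2;
  return 3;
}

// A page "matches" the wanted variant only if the title contains the variant token
// AND its INCI list agrees with the reference list. Otherwise: official_url = null + a note.
function variantMatches(title: string, variant: string, inci: string[], referenceInci: string[]): boolean {
  const norm = (s: string) => s.trim().toLowerCase();
  return (
    norm(title).includes(norm(variant)) &&
    inci.length === referenceInci.length &&
    inci.every((ing, i) => norm(ing) === norm(referenceInci[i]))
  );
}
```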
- Web search + page reading
  - I can use SerpAPI (or another search tool) to find candidate URLs.
  - For page text extraction I’m considering:
    - the n8n “Extract Text” node (if it can extract from HTML reliably)
    - Jina Reader (fetch via https://r.jina.ai/https://…) for clean readable text
  - Unsure which is more reliable for product pages and INCI sections, and how many URLs to fetch (top 3? top 5?). A minimal Jina Reader sketch follows this item.
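For what it’s worth, the Jina Reader call itself is trivial; here’s a minimal sketch (assuming the n8n runtime exposes global fetch, i.e. Node 18+; written as TypeScript, but it pastes into a Code node as plain JS once the type annotations are dropped):

```ts
// Minimal Jina Reader fetch: prefix the target URL with https://r.jina.ai/
async function readPage(url: string): Promise<string> {
  const res = await fetch(`https://r.jina.ai/${url}`); // returns clean, readable text by default
  if (!res.ok) throw new Error(`Jina Reader failed with ${res.status} for ${url}`);
  return res.text();
}

// Fetch the top 3 candidates in parallel; bump to 5 if the official site isn't among them.
// const pages = await Promise.all(candidateUrls.slice(0, 3).map(readPage));
```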
- Strict JSON output
  - I want the model to always output strict JSON, with no markdown links and no type mistakes (e.g., booleans turning into strings or URLs).
  - I tried manual HTTP calls to Gemini generateContent, but formatting/expression issues made it error-prone in n8n.
  - I’m considering an AI Agent + Structured Output Parser to enforce the schema. (A native Gemini structured-output sketch follows.)
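If I do go back to direct HTTP calls, Gemini’s generateContent supports native structured output via generationConfig, which might make that route less fragile. A trimmed sketch (model name and schema are illustrative; extractionPrompt and apiKey would come from earlier nodes/credentials):

```ts
// Gemini generateContent with a response schema enforced at the API level.
// extractionPrompt and apiKey are assumed to come from earlier nodes / credentials.
const body = {
  contents: [{ role: 'user', parts: [{ text: extractionPrompt }] }],
  generationConfig: {
    responseMimeType: 'application/json',
    responseSchema: {
      // Trimmed for illustration; nullable covers the "null for unknowns" rule.
      type: 'OBJECT',
      properties: {
        official_url: { type: 'STRING', nullable: true },
        fragrance_free: { type: 'BOOLEAN', nullable: true }, // real boolean, never "true"
        inci: { type: 'ARRAY', items: { type: 'STRING' } },
      },
    },
  },
};

const res = await fetch(
  `https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=${apiKey}`,
  { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(body) },
);
const reply = await res.json(); // JSON text lives in reply.candidates[0].content.parts[0].text
```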
Current approach I’m considering
- Chat/Webhook trigger → normalize inputs
- Search tool (SerpAPI) to get candidate links (official + databases)
- Fetch and extract text for each URL (Jina Reader or Extract Text)
- Provide the extracted text to an LLM (Gemini/OpenAI) with strict instructions + the schema
- Structured Output Parser to enforce the final JSON
- Optional: a “repair JSON” step if parsing fails (sketched below)
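By “repair JSON” I mean something like this Code-node fallback (helper name is mine; it strips markdown fences and re-parses before giving up):

```ts
// Fallback parser for LLM output that failed strict JSON.parse.
function parseModelJson(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    // Common failure: the model wrapped the JSON in ```json fences or added prose around it.
    const stripped = raw.replace(/```(?:json)?/gi, '');
    const start = stripped.indexOf('{');
    const end = stripped.lastIndexOf('}');
    if (start === -1 || end <= start) throw new Error('No JSON object found in model output');
    return JSON.parse(stripped.slice(start, end + 1)); // if this still throws, re-prompt the model
  }
}
```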
What I’m asking the community
- What’s the best-practice workflow pattern in n8n for this “grounded extraction + validation + strict JSON” use case?
- Should I use an AI Agent, or keep it as a deterministic chain of nodes (Search → Fetch → Extract → LLM → Parser)?
- For reading pages, Extract Text vs. Jina Reader: which tends to be more reliable on real ecommerce/brand pages?
- Any tips to reduce hallucinations and improve consistency (e.g., always include citations per field, require sources.what_it_supported, etc.)?
- If you’ve built something similar, what node stack + settings worked best?
If helpful I can share my target JSON schema and a sample output.