Best n8n workflow pattern for “grounded product info extraction” (skincare PIM) — reliable sources + strict JSON output

Hi everyone,

I’m building an n8n workflow for a skincare Product Information Management (PIM) pipeline. The goal: given a product name (and optionally a brand), the workflow should search the web, pull data from high-quality sources, validate and cross-check it, and output a strict JSON object with a fixed schema.

What I’m trying to achieve (scope)

Input:

  • product_name (string), e.g. “Anua Heartleaf Pore Control Cleansing Oil Mild”

  • optional brand_name

Output:

  • A single JSON object (must survive JSON.parse) with a fixed structure (identity, variant, formula INCI list, claims, usage, warnings, sources), using null for unknowns and strict types (booleans must be true/false/null, URLs must be raw URLs, arrays must be arrays). A trimmed illustration follows below.

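For reference, here is a heavily trimmed illustration of the shape I mean (field names are hypothetical; the real schema is longer):

```json
{
  "identity": { "brand": "Anua", "product_name": "Heartleaf Pore Control Cleansing Oil", "variant": "Mild" },
  "formula": { "inci": ["…"] },
  "claims": { "fragrance_free": true, "vegan": null },
  "usage": null,
  "warnings": [],
  "official_url": null,
  "sources": [
    { "url": "https://example.com/product", "what_it_supported": ["identity", "formula.inci"] }
  ]
}
```
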
Critical constraints / pain points

  1. Source priority & validation

    • Prefer official brand site / official regional site as primary truth

    • Then INCI databases (INCI Decoder, Skinsort, Incibeauty, CosDNA, etc.)

    • Retailers only as confirmation if they match official

    • Must resolve variant ambiguity (e.g. “Mild” vs “Original”) by matching product title + INCI list + official URL; on conflict, set official_url = null and explain in notes. (A rough ranking sketch follows this list.)

  2. Web search + page reading

    • I can use SerpAPI (or other search tool) to find candidate URLs

    • For page text extraction I’m considering:

      • n8n “Extract Text” node (if it can extract from HTML reliably)

      • Jina Reader (fetch via https://r.jina.ai/https://…) for clean readable text

    • Unsure which is more reliable for product pages and INCI sections, and how many URLs to fetch (top 3? top 5?). (A URL-prefix sketch for the Jina option follows this list.)

  3. Strict JSON output

    • I want the model to always output strict JSON, with no markdown links and no type mistakes (e.g. booleans coming back as strings or URLs).

    • I tried manual HTTP calls to Gemini generateContent, but formatting and expression issues made them error-prone in n8n.

    • I’m considering using AI Agent + Structured Output Parser to enforce schema.

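For pain point 1, one idea (as referenced above) is a deterministic pre-ranking in a Code node before any LLM is involved. This is only a sketch: the domain lists and the variant token are placeholders, and the variant match is deliberately crude:

```javascript
// n8n Code node (Run Once for All Items), hypothetical source ranking + variant pre-check.
// Assumes each incoming item carries { url, title } from the SerpAPI step.
const OFFICIAL_DOMAINS = ['anua.com']; // placeholder: per-brand official domains
const DATABASE_DOMAINS = ['incidecoder.com', 'skinsort.com', 'incibeauty.com', 'cosdna.com'];

const variant = 'mild'; // placeholder: parse this from product_name upstream

// Lower tier = higher trust: official site, then INCI databases, then retailers.
function tier(url) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  if (OFFICIAL_DOMAINS.some(d => host === d || host.endsWith('.' + d))) return 0;
  if (DATABASE_DOMAINS.some(d => host === d || host.endsWith('.' + d))) return 1;
  return 2; // retailers: confirmation only
}

const ranked = $input.all()
  .map(item => item.json)
  .filter(c => (c.title || '').toLowerCase().includes(variant)) // crude variant match on the title
  .sort((a, b) => tier(a.url) - tier(b.url));

return ranked.map(c => ({ json: { ...c, tier: tier(c.url) } }));
```

The idea is that tier 2 results never act as primary truth; they only feed a later confirmation check against tier 0/1.
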
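For pain point 2, the Jina Reader side is just URL prefixing, so a small Code node can fan the top N candidates out to an HTTP Request node (N = 3 here is a guess, which is exactly my open question):

```javascript
// n8n Code node, builds Jina Reader fetch URLs for the top N ranked candidates.
// r.jina.ai returns a cleaned, readable text version of the target page.
const TOP_N = 3; // open question: 3 vs 5

return $input.all().slice(0, TOP_N).map(item => ({
  json: {
    ...item.json,
    reader_url: 'https://r.jina.ai/' + item.json.url, // prefix pattern from above
  },
}));
```

A plain HTTP Request node then GETs {{ $json.reader_url }} and the response body is the page text to pass to the LLM.
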
Current approach I’m considering

  • Chat/Webhook trigger → normalize inputs

  • Search tool (SerpAPI) to get candidate links (official + databases)

  • Fetch and extract text for each URL (Jina Reader or Extract Text)

  • Provide the extracted text to an LLM (Gemini/OpenAI) with strict instructions + schema

  • Structured Output Parser to enforce final JSON

  • Optional: a “repair JSON” step if parsing fails (sketch after this list)

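For the repair step (and the type strictness from pain point 3), I’m picturing a Code node like this after the LLM. It targets the two failure modes I keep seeing, markdown fences and stringified booleans; the field names and boolean paths are illustrative:

```javascript
// n8n Code node, strips markdown fences, parses, and coerces obvious type slips.
const item = $input.first().json;
const raw = String(item.output ?? item.text ?? ''); // wherever your LLM node puts its answer

// 1. Drop leading/trailing code fences the model sometimes adds despite instructions.
const cleaned = raw.replace(/^```(?:json)?\s*/i, '').replace(/\s*```\s*$/, '').trim();

let data;
try {
  data = JSON.parse(cleaned);
} catch (e) {
  // Hand off to an error branch / retry prompt instead of failing the whole run.
  return [{ json: { parse_error: e.message, raw } }];
}

// 2. Coerce "true"/"false"/"null" strings back into real JSON types on known boolean fields.
const BOOLEAN_PATHS = ['claims.fragrance_free']; // placeholder: list your schema's boolean paths
for (const path of BOOLEAN_PATHS) {
  const [parent, key] = path.split('.');
  const v = data?.[parent]?.[key];
  if (v === 'true') data[parent][key] = true;
  else if (v === 'false') data[parent][key] = false;
  else if (v === 'null') data[parent][key] = null;
}

return [{ json: data }];
```
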
What I’m asking the community

  1. What’s the best-practice workflow pattern in n8n for this “grounded extraction + validation + strict JSON” use case?

  2. Should I use an AI Agent, or keep it a deterministic chain of nodes (Search → Fetch → Extract → LLM → Parser)?

  3. For reading pages: Extract Text vs Jina Reader — which tends to be more reliable on real ecommerce/brand pages?

  4. Any tips to reduce hallucinations and improve consistency (e.g., always include citations per field, require sources.what_it_supported, etc.)?

  5. If you’ve built something similar, what node stack + settings worked best?

If it’s helpful, I can share my full target JSON schema and a sample output.