Best n8n workflow pattern for “grounded product info extraction” (skincare PIM) — reliable sources + strict JSON output

Hi everyone,

I’m building an n8n workflow for a skincare Product Information Management (PIM) pipeline. The goal: given a product name (and optionally a brand), the workflow should search the web, pull data from high-quality sources, validate and cross-check it, and output a strict JSON object with a fixed schema.

What I’m trying to achieve (scope)

Input:

  • product_name (string), e.g. “Anua Heartleaf Pore Control Cleansing Oil Mild”

  • optional brand_name

Output:

  • A single JSON object (must survive JSON.parse) with a fixed structure (identity, variant, formula INCI list, claims, usage, warnings, sources), using null for unknowns and strict types (booleans must be true/false/null, URLs must be raw URLs, arrays must be arrays). A trimmed illustration follows below.

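For reference, here is a heavily trimmed illustration of the shape I mean (field names are hypothetical; the real schema is longer):

```json
{
  "identity": { "brand": "Anua", "product_name": "Heartleaf Pore Control Cleansing Oil", "variant": "Mild" },
  "formula": { "inci": ["…"] },
  "claims": { "fragrance_free": true, "vegan": null },
  "usage": null,
  "warnings": [],
  "official_url": null,
  "sources": [
    { "url": "https://example.com/product", "what_it_supported": ["identity", "formula.inci"] }
  ]
}
```
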
Critical constraints / pain points

  1. Source priority & validation

    • Prefer official brand site / official regional site as primary truth

    • Then INCI databases (INCI Decoder, Skinsort, Incibeauty, CosDNA, etc.)

    • Retailers only as confirmation if they match official

    • Must resolve variant ambiguity (e.g. “Mild” vs “Original”) by matching product title + INCI list + official URL; on conflict, set official_url = null and explain in notes. (A rough ranking sketch follows this list.)

  2. Web search + page reading

    • I can use SerpAPI (or other search tool) to find candidate URLs

    • For page text extraction I’m considering:

      • n8n “Extract Text” node (if it can extract from HTML reliably)

      • Jina Reader (fetch via https://r.jina.ai/https://…) for clean readable text

    • Unsure which is more reliable for product pages and INCI sections, and how many URLs to fetch (top 3? top 5?). (A URL-prefix sketch for the Jina option follows this list.)

  3. Strict JSON output

    • I want the model to always output strict JSON, with no markdown links and no type mistakes (e.g. booleans coming back as strings or URLs).

    • I tried manual HTTP calls to Gemini generateContent, but formatting and expression issues made them error-prone in n8n.

    • I’m considering using AI Agent + Structured Output Parser to enforce schema.

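For pain point 1, one idea (as referenced above) is a deterministic pre-ranking in a Code node before any LLM is involved. This is only a sketch: the domain lists and the variant token are placeholders, and the variant match is deliberately crude:

```javascript
// n8n Code node (Run Once for All Items), hypothetical source ranking + variant pre-check.
// Assumes each incoming item carries { url, title } from the SerpAPI step.
const OFFICIAL_DOMAINS = ['anua.com']; // placeholder: per-brand official domains
const DATABASE_DOMAINS = ['incidecoder.com', 'skinsort.com', 'incibeauty.com', 'cosdna.com'];

const variant = 'mild'; // placeholder: parse this from product_name upstream

// Lower tier = higher trust: official site, then INCI databases, then retailers.
function tier(url) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  if (OFFICIAL_DOMAINS.some(d => host === d || host.endsWith('.' + d))) return 0;
  if (DATABASE_DOMAINS.some(d => host === d || host.endsWith('.' + d))) return 1;
  return 2; // retailers: confirmation only
}

const ranked = $input.all()
  .map(item => item.json)
  .filter(c => (c.title || '').toLowerCase().includes(variant)) // crude variant match on the title
  .sort((a, b) => tier(a.url) - tier(b.url));

return ranked.map(c => ({ json: { ...c, tier: tier(c.url) } }));
```

The idea is that tier 2 results never act as primary truth; they only feed a later confirmation check against tier 0/1.
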
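For pain point 2, the Jina Reader side is just URL prefixing, so a small Code node can fan the top N candidates out to an HTTP Request node (N = 3 here is a guess, which is exactly my open question):

```javascript
// n8n Code node, builds Jina Reader fetch URLs for the top N ranked candidates.
// r.jina.ai returns a cleaned, readable text version of the target page.
const TOP_N = 3; // open question: 3 vs 5

return $input.all().slice(0, TOP_N).map(item => ({
  json: {
    ...item.json,
    reader_url: 'https://r.jina.ai/' + item.json.url, // prefix pattern from above
  },
}));
```

A plain HTTP Request node then GETs {{ $json.reader_url }} and the response body is the page text to pass to the LLM.
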
Current approach I’m considering

  • Chat/Webhook trigger → normalize inputs

  • Search tool (SerpAPI) to get candidate links (official + databases)

  • Fetch and extract text for each URL (Jina Reader or Extract Text)

  • Provide the extracted text to an LLM (Gemini/OpenAI) with strict instructions + schema

  • Structured Output Parser to enforce final JSON

  • Optional: a “repair JSON” step if parsing fails (sketch after this list)

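For the repair step (and the type strictness from pain point 3), I’m picturing a Code node like this after the LLM. It targets the two failure modes I keep seeing, markdown fences and stringified booleans; the field names and boolean paths are illustrative:

```javascript
// n8n Code node, strips markdown fences, parses, and coerces obvious type slips.
const item = $input.first().json;
const raw = String(item.output ?? item.text ?? ''); // wherever your LLM node puts its answer

// 1. Drop leading/trailing code fences the model sometimes adds despite instructions.
const cleaned = raw.replace(/^```(?:json)?\s*/i, '').replace(/\s*```\s*$/, '').trim();

let data;
try {
  data = JSON.parse(cleaned);
} catch (e) {
  // Hand off to an error branch / retry prompt instead of failing the whole run.
  return [{ json: { parse_error: e.message, raw } }];
}

// 2. Coerce "true"/"false"/"null" strings back into real JSON types on known boolean fields.
const BOOLEAN_PATHS = ['claims.fragrance_free']; // placeholder: list your schema's boolean paths
for (const path of BOOLEAN_PATHS) {
  const [parent, key] = path.split('.');
  const v = data?.[parent]?.[key];
  if (v === 'true') data[parent][key] = true;
  else if (v === 'false') data[parent][key] = false;
  else if (v === 'null') data[parent][key] = null;
}

return [{ json: data }];
```
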
What I’m asking the community

  1. What’s the best-practice workflow pattern in n8n for this “grounded extraction + validation + strict JSON” use case?

  2. Should I use an AI Agent, or keep it a deterministic chain of nodes (Search → Fetch → Extract → LLM → Parser)?

  3. For reading pages: Extract Text vs Jina Reader — which tends to be more reliable on real ecommerce/brand pages?

  4. Any tips to reduce hallucinations and improve consistency (e.g., always include citations per field, require sources.what_it_supported, etc.)?

  5. If you’ve built something similar, what node stack + settings worked best?

If it’s helpful, I can share my full target JSON schema and a sample output.