TLDR: Scraped HTML pages as inputs often lead to bad AI outputs. Convert pages to Markdown first. Methods described in the post.
Long-time lurker, but made this account to say that I’ve been noticing a pattern in shared workflows lately:
HTTP Request node → AI Agent node → disappointing output. The fix is usually in WHAT you’re handing the agent, rarely in the prompt itself.
Raw HTML from websites is a bad input format for LLMs, and cleaned markdown is much better. It’s worth understanding why, even if you end up deciding to live with the HTML you’ve got.
Starting with token costs… A blog post with maybe 2,000 words of real content will often hit 40k+ tokens as raw HTML by the time you add nav, footer, scripts, inline styles, tracking pixels, and the long list of utility classes on every div. The markdown version (if done correctly) is usually 3-5k. You’re paying for scaffolding the model never uses.
Then there’s quality. Every nav bar, footer, cookie banner, related-articles carousel, and analytics snippet is something the model has to parse and then ignore. Sometimes it just doesn’t ignore it.
Last month, I watched my email personalization automation compliment a lead on their web design agency, a detail it had pulled from the footer credit on their site. It was funny the first time, but my smile quickly turned to frustration after coming across 50+ leads with the same kind of issue (not always the agency thing).
Granted, I could’ve prompted better, but the example shows the problem at hand.
Lastly, structure. Markdown headings, lists, and code blocks map cleanly to how LLMs interpret documents because they’ve seen enormous amounts of it during training. They’ve seen a lot of HTML too, but there the signal is buried under layout noise.
A practical but simple example
Scrape any average blog post. The raw HTML is typically 30-60KB. The cleaned markdown is 5-10KB of actual content. Feed both into the same “summarize this article” prompt with the same model, and the markdown version consistently produces a tighter summary at a fraction of the token spend. You can test this in about ten minutes.
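If you want a quick sanity check before running the real comparison, here’s a rough sketch that measures a page before and after stripping tags. The function names are mine, and the ~4 characters per token figure is only a rule of thumb, so treat the token numbers as ballpark and use your model’s tokenizer if you need exact counts.

```javascript
// Crude estimate: assumes ~4 characters per token (a rule of thumb, not a real tokenizer).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Remove script/style blocks, then every remaining tag, then collapse whitespace.
function stripTags(html) {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Report character counts and approximate token counts for both versions.
function compareSizes(html) {
  const text = stripTags(html);
  return {
    htmlChars: html.length,
    textChars: text.length,
    htmlTokensApprox: estimateTokens(html),
    textTokensApprox: estimateTokens(text),
  };
}

// Tiny made-up page, just to show the shape of the output.
const samplePage =
  '<html><head><style>p{color:red}</style></head>' +
  '<body><nav><a href="/">Home</a></nav><p>Hello world</p></body></html>';
console.log(compareSizes(samplePage));
```

Point it at a real scraped page instead of the toy sample and the gap gets a lot more dramatic.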
A more technical one
I ran a rough test last month on a 3-step agent: fetch page → extract key points → draft a follow-up email. On a typical SaaS pricing page:
Raw HTML:
- Input: ~38k tokens
- Occasionally hallucinated pricing tiers that weren’t on the page (hidden-by-CSS variants were still in the DOM)
- Cost: roughly $0.012 per run
Markdown version of the same page:
- Input: ~4k tokens
- Pricing extraction was consistent across 10 runs
- Cost: roughly $0.0015 per run
Not a formal benchmark, but the shape of this result holds across most pages I’ve tested. Order of magnitude cheaper and measurably more reliable.
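For anyone who wants to plug in their own numbers, the arithmetic is just input tokens times price. A minimal sketch follows; the price constant is a made-up placeholder (not the rate behind my figures above, which also include output tokens), so swap in your provider’s actual pricing.

```javascript
// Hypothetical input price in USD per million tokens. Replace with your model's real rate.
const ASSUMED_PRICE_PER_MILLION = 0.3;

// Input cost for one run: tokens / 1,000,000 * price per million tokens.
function inputCost(tokens, pricePerMillion = ASSUMED_PRICE_PER_MILLION) {
  return (tokens / 1e6) * pricePerMillion;
}

const htmlCost = inputCost(38000); // raw HTML input from the test above
const markdownCost = inputCost(4000); // markdown input from the test above

console.log({
  htmlCost: htmlCost.toFixed(4),
  markdownCost: markdownCost.toFixed(4),
  ratio: (htmlCost / markdownCost).toFixed(1),
});
```

The ratio depends only on the token counts, so whatever your model charges, 38k vs 4k input tokens is a roughly 9.5x difference on the input side.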
How to actually get markdown
If you can add an API dependency, that’s the cleanest route with the best outputs (and I’ll come back to it at the end). If you can’t, here’s what you can do with n8n’s built-in nodes.
The n8n HTML node is underrated. Try “Extract HTML Content” with a selector like article, main, or [role="main"]. Works on a fair number of blog posts, doc sites, and news pages, but breaks on JS-heavy SPAs that render content client-side. Also, not every site uses semantic tags like article or main to mark its main content, so the selector can come back empty.
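If you’d rather do the same “try article, then main, then give up gracefully” logic in a Code node, here’s a rough sketch. To be clear, this is a regex approximation I’m using for illustration, not how the HTML node works internally, and extractMain is a name I made up. It doesn’t handle [role="main"] or nested tags of the same name; a real parser would.

```javascript
// Try progressively broader containers and report which one matched.
// Regex-based, so nested <article> or <main> tags will cut the match short.
function extractMain(html) {
  const candidates = [
    ['article', /<article\b[^>]*>([\s\S]*?)<\/article>/i],
    ['main', /<main\b[^>]*>([\s\S]*?)<\/main>/i],
    ['body', /<body\b[^>]*>([\s\S]*?)<\/body>/i],
  ];
  for (const [source, pattern] of candidates) {
    const match = html.match(pattern);
    if (match && match[1].trim() !== '') {
      return { source, html: match[1].trim() };
    }
  }
  // Nothing matched: hand back the whole document so the workflow can still continue.
  return { source: 'document', html };
}

const page = '<body><nav>Menu</nav><main><h1>Pricing</h1><p>Two plans.</p></main></body>';
console.log(extractMain(page));
```

Returning the source alongside the content makes it easy to add an IF node later that routes the “body” and “document” fallbacks to a heavier cleanup path.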
To get more control, a Code node with a regex sweep gets you further:
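// Assumes the previous node put the page source in a field named "html"
let html = $input.first().json.html;
// Drop <script> and <style> blocks, including their contents
html = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
html = html.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
// Drop common boilerplate containers (non-greedy, so nested same-name tags can trip it up)
html = html.replace(/<(nav|footer|aside)\b[^>]*>[\s\S]*?<\/\1>/gi, '');
// Strip the remaining tags and collapse whitespace
html = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
// Return in n8n's item format: an array of objects with a json key
return [{ json: { content: html } }];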
You lose heading structure this way, but you also lose most of the token bloat. Good enough for summarization and classification tasks, but not great for anything that needs document hierarchy preserved.
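If you do need some hierarchy, you can convert headings and list items to markdown markers before stripping everything else. This is a rough sketch on top of the same regex approach, not a proper HTML-to-markdown converter: htmlToRoughMarkdown is my own name for it, and it ignores tables, links, nested lists, and HTML entities.

```javascript
// Keep headings, paragraphs, and list items as markdown-ish text; drop everything else.
function htmlToRoughMarkdown(html) {
  let out = html;
  // Remove blocks that never carry article content.
  out = out.replace(/<(script|style|nav|footer|aside)\b[\s\S]*?<\/\1>/gi, '');
  // <h1>..<h6> become "#".."######" lines.
  out = out.replace(
    /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
    (_, level, inner) => `\n\n${'#'.repeat(Number(level))} ${inner}\n\n`
  );
  // Each <li> starts a "- " bullet on its own line.
  out = out.replace(/<li\b[^>]*>/gi, '\n- ');
  // Block-level closers become paragraph breaks, <br> becomes a line break.
  out = out.replace(/<\/(p|div|ul|ol|section|table)>/gi, '\n\n');
  out = out.replace(/<br\s*\/?>/gi, '\n');
  // Strip whatever tags are left, then tidy the whitespace.
  out = out.replace(/<[^>]+>/g, ' ');
  out = out.replace(/[ \t]+/g, ' ').replace(/ *\n */g, '\n').replace(/\n{3,}/g, '\n\n');
  return out.trim();
}

const articleHtml =
  '<article><h1>Title</h1><p>First para.</p>' +
  '<ul><li>One</li><li>Two</li></ul><footer>Built by Acme</footer></article>';
console.log(htmlToRoughMarkdown(articleHtml));
```

In a Code node you’d call this on the incoming HTML field and return the result the same way as the sweep above.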
And if you’re stuck with raw HTML coming in from a legacy source you can’t preprocess, a two-step prompt works. Spend a cheap model call on cleanup first:
“Extract only the main article content from this HTML. Ignore navigation, footer, sidebar, ad, cookie banner, and script content. Return plain text preserving paragraph breaks and heading order.”
Then pipe the cleaned output to your actual model. Two calls instead of one.
However, this can hurt quality because it reintroduces the same problem: the cleanup model still has to sift through thousands of unnecessary tokens and decide what’s worth keeping.
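In code, the two-step version looks roughly like this. It’s a sketch only: callModel is a hypothetical function you’d supply (it stands in for whatever client or node you use to hit your provider), and the model names are placeholders.

```javascript
// Placeholder model names. Substitute whatever cheap and main models you actually use.
const CLEANUP_MODEL = 'cheap-model';
const TASK_MODEL = 'main-model';

const CLEANUP_PROMPT =
  'Extract only the main article content from this HTML. Ignore navigation, footer, ' +
  'sidebar, ad, cookie banner, and script content. Return plain text preserving ' +
  'paragraph breaks and heading order.';

// callModel(model, prompt) is assumed to return a Promise resolving to the model's text output.
async function cleanThenRun(html, taskPrompt, callModel) {
  // Call 1: cheap model strips the boilerplate.
  const cleaned = await callModel(CLEANUP_MODEL, `${CLEANUP_PROMPT}\n\n${html}`);
  // Call 2: the real task only ever sees the cleaned text.
  return callModel(TASK_MODEL, `${taskPrompt}\n\n${cleaned}`);
}
```

Passing callModel in as an argument keeps the sketch provider-agnostic and makes it easy to swap in a stub while testing the flow.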
As mentioned earlier, the cleanest option is a dedicated URL-to-markdown API. These handle JS-rendered pages, preserve heading structure, convert tables properly and strip boilerplate. The DIY options above won’t get all of that right consistently.
I rely heavily on dependable input for my flows, so I use my own API that I built out of frustration (not self-advertising here, let me know if you’d like to see it). There are plenty of others out there, though.
Firecrawl, Jina Reader (decent free tier), and Crawl4AI (self-hosted) are the serious options in this space, and any of them will do the job depending on what fits your stack.
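To give a feel for how little glue these need, here’s a sketch for Jina Reader, which as far as I know works by prefixing the target URL with https://r.jina.ai/. Treat that as my understanding rather than gospel and check their docs for API keys, headers, and rate limits. The function names are mine, and fetch assumes Node 18+ (in n8n you’d just put the prefixed URL into an HTTP Request node).

```javascript
// Build the reader URL by prefixing the page you want converted.
// The r.jina.ai prefix is my understanding of the public endpoint; verify against their docs.
function jinaReaderUrl(targetUrl) {
  return `https://r.jina.ai/${targetUrl}`;
}

// Fetch the converted page as text. Requires a runtime with global fetch (Node 18+).
async function fetchMarkdown(targetUrl) {
  const response = await fetch(jinaReaderUrl(targetUrl));
  if (!response.ok) {
    throw new Error(`Reader request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (not run here): fetchMarkdown('https://example.com/blog/post').then(console.log);
```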
One thing worth taking away, even if you never touch any of those APIs: when an AI workflow isn’t behaving, check what’s actually reaching the model before you spend an hour rewording the system prompt. Bad inputs overshadow even the most meticulous prompts.
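A cheap way to do that check is to drop a small inspection step right before the model call. Here’s a sketch, with the caveat that inspectModelInput is my own helper and the “more than 20 tags” threshold is an arbitrary guess, not a tested cutoff.

```javascript
// Summarize what is about to be sent to the model so you can eyeball it.
function inspectModelInput(text, previewChars = 300) {
  // Count things that look like HTML tags (opening, closing, comments, doctype).
  const tagCount = (text.match(/<[a-z!\/][^>]*>/gi) || []).length;
  return {
    chars: text.length,
    tagCount,
    looksLikeHtml: tagCount > 20, // arbitrary threshold, tune for your pages
    preview: text.slice(0, previewChars),
  };
}

console.log(inspectModelInput('# Pricing\n\nTwo plans: Starter and Pro.'));
console.log(inspectModelInput('<div class="wrap"><p>Two plans</p></div>'));
```

If the preview starts with a doctype or a wall of div tags, the prompt is probably not your problem.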
Happy to answer questions.