I am new to the world of n8n and web scraping, and quite confused about the best way to proceed. My hope is to scrape the contents from some news sites (as close to clean text as possible to ingest into a vector database).
What I’ve tried so far:
HTTP Request node, followed by a Code node to clean up the content, followed by a Tools Agent to output JSON in a specified format. Lots of noise in the output. (A rough sketch of that Code node is below.)
Set up Crawl4AI as a cloud API. It works, but I still see a lot of hyperlinks and unwanted parts in the output. It also fails at some point if I set it up to loop through many items.
Firecrawl MCP server. For some reason, it keeps responding that the target webpage returned “no content available”. It seems to work fine if I just prompt it to search for information.
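To give an idea of the kind of cleanup I mean, here is a rough sketch of that Code node (the field names like json.data and json.url are assumptions and depend on how the HTTP Request node is configured):

```js
// n8n Code node ("Run Once for All Items") - cleanup sketch only.
// Assumes the HTTP Request node put the raw HTML into item.json.data.
const items = $input.all();

return items.map(item => {
  let html = item.json.data || '';

  // Drop script/style/nav/header/footer/aside blocks before stripping tags.
  html = html.replace(/<(script|style|nav|header|footer|aside)[\s\S]*?<\/\1>/gi, '');

  // Strip the remaining tags and collapse whitespace into plain text.
  const text = html
    .replace(/<[^>]+>/g, ' ')
    .replace(/&nbsp;/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  return { json: { url: item.json.url, text } };
});
```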
Grateful for help getting over the learning curve. Is it impossible to do this well using n8n native tools? What’s the best approach?
Thanks! This workflow looks interesting, but unfortunately it doesn’t work for me.
For example, if I input ?url=https://cavendish.cet.uk/our-history/&method=simplified into the chat (not sure if it’s my syntax; I tried it with and without quotation marks), it returns: “It looks like there was an error with the URL provided. The URL https://cavendish.cet.uk/our-history/ seems to be invalid or not accessible at the moment. Please check the URL for any mistakes or try a different URL.”
Did you look into the other workflow templates linked above?
There is no single method that works 100% of the time for scraping websites, due to the peculiarities of how individual websites are designed.
You asked for help getting over the learning curve. One way is to identify the practical approach that works best for your use case and refine it until it matches your particular requirements. To accomplish this, you will probably also need to dig into data extraction and transformation methods to refine the results until they match your expectations.
You will need to do some research, since your objectives and requirements (including how you measure “as close to clean text as possible”) are only clear to you.
If you are looking for help with a practical implementation, you can try your luck in Help me Build my Workflow - n8n Community (make sure you read the pinned intro topic in that category).
Still, you’d need to show what you have actually tried towards your goal and the specific impediments you are struggling to overcome. I’m afraid “I tried one workflow that was shown to me and it didn’t work, while I ignored a whole lot of other examples” won’t qualify.
OK, thank you. As mentioned, I had tried three approaches before posting (actually many more), as well as the one you highlighted.
“Clean text” in the context of a news website means the text of the news article without all the links, blurbs, navigation elements, and markup that appear in the source.
I know that every website is different, but I’d hoped that in this age of AI agents and commercial scraping APIs, there would be some approach that works out of the box.
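For example, the kind of out-of-the-box result I’m after would be something like running the fetched HTML through Mozilla Readability inside a Code node. A rough sketch, assuming a self-hosted n8n instance with NODE_FUNCTION_ALLOW_EXTERNAL set so that jsdom and @mozilla/readability can be required (the json.data and json.url field names are assumptions):

```js
// n8n Code node ("Run Once for All Items") - sketch only.
// Assumes self-hosted n8n with
// NODE_FUNCTION_ALLOW_EXTERNAL=jsdom,@mozilla/readability
// so these modules can be required inside the Code node.
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

return $input.all().map(item => {
  const html = item.json.data || '';                    // raw HTML from the HTTP Request node
  const dom = new JSDOM(html, { url: item.json.url });  // base URL helps resolve relative links
  const article = new Readability(dom.window.document).parse();

  return {
    json: {
      url: item.json.url,
      title: article ? article.title : null,
      text: article ? article.textContent.trim() : '',
    },
  };
});
```

Readability is the same extraction heuristic that Firefox’s reader view uses, so it tends to drop navigation and boilerplate without needing site-specific rules.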