I am new to the world of n8n and web scraping, and quite confused about the best way to proceed. My hope is to scrape the contents from some news sites (as close to clean text as possible to ingest into a vector database).
What I’ve tried so far:
HTTP Request node, followed by a Code node to clean up the content, followed by a Tools Agent to output JSON in a specified format. Lots of noise in the output. (A rough sketch of that Code node is below.)
Set up Crawl4AI as a cloud API. It works, but I still see a lot of hyperlinks and unwanted parts in the output. It also fails at some point if I set it up to loop through many items.
Firecrawl MCP server. For some reason, it keeps responding that the target webpage returned “no content available”. It seems to work fine if I just prompt it to search for information.
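To give an idea of the kind of cleanup I mean, here is a rough sketch of that Code node (the field names like json.data and json.url are assumptions and depend on how the HTTP Request node is configured):

```js
// n8n Code node ("Run Once for All Items") - cleanup sketch only.
// Assumes the HTTP Request node put the raw HTML into item.json.data.
const items = $input.all();

return items.map(item => {
  let html = item.json.data || '';

  // Drop script/style/nav/header/footer/aside blocks before stripping tags.
  html = html.replace(/<(script|style|nav|header|footer|aside)[\s\S]*?<\/\1>/gi, '');

  // Strip the remaining tags and collapse whitespace into plain text.
  const text = html
    .replace(/<[^>]+>/g, ' ')
    .replace(/&nbsp;/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  return { json: { url: item.json.url, text } };
});
```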
Grateful for help getting over the learning curve. Is it impossible to do this well using n8n native tools? What’s the best approach?
Thanks! This workflow looks interesting, but unfortunately it doesn’t work for me.
For example, if I input ?url=https://cavendish.cet.uk/our-history/&method=simplified into the chat (not sure if it’s my syntax; I tried it with and without quotation marks), it returns: “It looks like there was an error with the URL provided. The URL https://cavendish.cet.uk/our-history/ seems to be invalid or not accessible at the moment. Please check the URL for any mistakes or try a different URL.”
Did you look into the other workflow templates linked above?
There is no single method that works 100% of the time for scraping websites, due to the peculiarities of how individual websites are designed.
You asked for help getting over the learning curve. One way is to identify the practical approach that works best for your use case and refine it until it matches your particular requirements. To accomplish this, you will probably also need to dig into data extraction and transformation methods to refine the results until they match your expectations.
You will need to do some research, since your objectives and requirements (including how you measure “as close to clean text as possible”) are only clear to you.
If you are looking for help with a practical implementation, you can try your luck in Help me Build my Workflow - n8n Community (make sure you read the pinned intro topic in that category).
Still, you’d need to show what you have actually tried towards your goal and the specific impediments you are struggling to overcome. I’m afraid “I tried one workflow that was shown to me and it didn’t work, while I ignored a whole lot of other examples” won’t qualify.
OK, thank you. As mentioned, I had tried three approaches before posting (actually many more), as well as the one you highlighted.
“Clean text” in the context of a news website means the text of the news article without all the links, blurbs, navigation elements, and markup that appear in the source.
I know that every website is different, but I’d hoped that in this age of AI agents and commercial scraping APIs, there would be some approach that works out of the box.
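For example, the kind of out-of-the-box result I’m after would be something like running the fetched HTML through Mozilla Readability inside a Code node. A rough sketch, assuming a self-hosted n8n instance with NODE_FUNCTION_ALLOW_EXTERNAL set so that jsdom and @mozilla/readability can be required (the json.data and json.url field names are assumptions):

```js
// n8n Code node ("Run Once for All Items") - sketch only.
// Assumes self-hosted n8n with
// NODE_FUNCTION_ALLOW_EXTERNAL=jsdom,@mozilla/readability
// so these modules can be required inside the Code node.
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

return $input.all().map(item => {
  const html = item.json.data || '';                    // raw HTML from the HTTP Request node
  const dom = new JSDOM(html, { url: item.json.url });  // base URL helps resolve relative links
  const article = new Readability(dom.window.document).parse();

  return {
    json: {
      url: item.json.url,
      title: article ? article.title : null,
      text: article ? article.textContent.trim() : '',
    },
  };
});
```

Readability is the same extraction heuristic that Firefox’s reader view uses, so it tends to drop navigation and boilerplate without needing site-specific rules.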