Integrate Scrapfly for scraping any web page as HTML, Text, or Markdown for training LLMs

mazen · May 30, 2024, 3:25pm

Current LLM capabilities open the door to many use cases. One of them is using scraped data for training, or for building RAG systems that make models context-aware.

Scrapfly is a web scraping API that extracts the data of any web page as Markdown or Text, formats that LLMs can readily consume. It also provides additional scraping utilities, such as proxies, anti-bot bypass, and headless browser execution. You can learn more via the official Scrapfly documentation.
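As a rough illustration of what such a request could look like, the sketch below builds a scrape URL that asks for Markdown output. The endpoint, the parameter names (`key`, `url`, `format`, `asp`, `render_js`), the accepted format values, and the helper `build_scrape_url` are all assumptions made for this example; please check the official Scrapfly documentation for the actual API.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the official Scrapfly documentation.
SCRAPE_ENDPOINT = "https://api.scrapfly.io/scrape"

# Assumed output formats: "raw" stands for the page HTML as returned by the site.
SUPPORTED_FORMATS = {"raw", "markdown", "text"}


def build_scrape_url(api_key, target_url, fmt="markdown", asp=False, render_js=False):
    """Build a GET URL for a single scrape request (hypothetical helper)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"key": api_key, "url": target_url}
    if fmt != "raw":
        params["format"] = fmt          # ask the API to convert the HTML
    if asp:
        params["asp"] = "true"          # assumed flag for anti-bot bypass
    if render_js:
        params["render_js"] = "true"    # assumed flag for headless browser rendering
    # urlencode percent-encodes the target URL so it survives as one query value.
    return f"{SCRAPE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_scrape_url("YOUR_API_KEY", "https://example.com/blog", asp=True))
```

The resulting URL can be fetched with any HTTP client, which is roughly what an n8n node would do behind its parameter form.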

I am willing to add an official integration for Scrapfly that enables the following actions:

Scrape a web page as HTML, Markdown, or Text
Crawl a full website as Markdown or Text
Take customized screenshots of a given web page

All the above features will include the available Scrapfly API parameters. Please feel free to upvote or suggest new features!

Kool_Baudrillard · May 30, 2024, 3:55pm

Hi,

No offense, but you can already do this with the HTML Extract node or the Puppeteer community node.

Just out of interest, because I'm leading a data extraction team: do you really get enough data with the website-scraping approach?

We're now scaling our data extraction to 2.5 million domains and revisiting them on a daily basis.

mazen · May 30, 2024, 4:26pm

Hey @Kool_Baudrillard. Yes, you can use headless browsers or even HTTP clients to obtain the HTML. But there's a major limitation: websites block scrapers. Getting around that requires rotating proxies, obfuscating the browser fingerprint, modifying the TLS handshake, mimicking browser-like headers, and many other anti-fingerprinting techniques.
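To give an idea of the effort involved, here is a minimal sketch of only two of those techniques, proxy rotation and browser-like headers, using the Python standard library. The proxy addresses, the User-Agent strings, and the helper names are hypothetical, and the sketch does nothing about TLS or browser fingerprinting, which are considerably harder to handle by hand.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints and User-Agent strings, for illustration only.
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

# Round-robin iterators so consecutive requests use different identities.
_proxy_cycle = itertools.cycle(PROXIES)
_agent_cycle = itertools.cycle(USER_AGENTS)


def next_request_profile():
    """Return the (proxy, headers) pair to use for the next request."""
    headers = {
        "User-Agent": next(_agent_cycle),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return next(_proxy_cycle), headers


def fetch(url, timeout=10):
    """Fetch a URL through the next proxy with browser-like headers."""
    proxy, headers = next_request_profile()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    request = urllib.request.Request(url, headers=headers)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```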

A web scraping API already provides the infrastructure required for bypassing such challenges; you only specify the required parameters. Implementing such a solution yourself can be challenging, especially with low-code tools.

That being said, if your target website has low protection or doesn't block your requests at all, then you don't have an issue! But if you want to extract the data for LLM training purposes, you will need an extra step of parsing the HTML into Markdown or Text, which the API also handles internally.
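For reference, doing that extra parsing step yourself could look something like the sketch below, which turns HTML into plain text with Python's standard `html.parser`. The class and function names are made up for this example, and it is a simplification: it drops scripts and styles and keeps one line per block element, but produces no Markdown and does no main-content detection.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping scripts/styles and breaking on block tags."""

    SKIPPED = {"script", "style", "noscript"}
    BLOCK = {"p", "div", "br", "li", "tr", "section", "article",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Convert an HTML string to plain text, one line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>"))
```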

Kool_Baudrillard · May 30, 2024, 7:31pm

I'm aware of the limitation; it's mostly a cat-and-mouse game. I use it a lot for specific projects, but nonetheless at scale it doesn't work (at least for us).

Putting the protection aside, at scale we ran into limitations such as extracting the main content, and the FTEs needed to build and maintain scrapers, etc. I'm not aware of any service that does Boilerpipe out of the box.
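To make the main-content problem concrete, here is a small sketch of a heuristic in the spirit of Boilerpipe's shallow text features: split the page into blocks, then keep blocks that have enough words and a low share of link text. It is not Boilerpipe's actual algorithm, and the thresholds, tag list, and names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the words inside links per block."""

    BLOCK_TAGS = {"p", "div", "li", "td", "ul", "section", "article",
                  "nav", "header", "footer", "h1", "h2", "h3"}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block, if it holds any words."""
        if self._words:
            self.blocks.append((" ".join(self._words), len(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_content(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # capture trailing text outside any block tag
    kept = [text for text, n, link_n in collector.blocks
            if n >= min_words and link_n / n <= max_link_density]
    return "\n\n".join(kept)
```

Navigation bars and footers tend to be short and link-heavy, so they fall below these thresholds, while article paragraphs pass. Tuning the thresholds per site is exactly the kind of maintenance that gets expensive at scale.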

The Scrapfly API looks nice and is surely a good service, and the more nodes n8n has, the more freedom there is to build whatever you want with the tools you are familiar with.