I want to share a custom node I just built: Webpage Content Extractor.
It takes an HTML document as input and extracts the main content from it, stripping sidebars, headers, and footers.
It’s based on Readability, the library behind Firefox’s Reader View.
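For anyone wondering what "extracting the main content" means in practice, here is a toy TypeScript sketch. It is not the node's code and not the Readability algorithm, which scores DOM elements using heuristics such as link density and class names. This version only drops a few obviously non-content tags and strips the remaining markup. The names `extractMainText` and `ExtractedPage`, and the tag list, are made up for illustration.

```typescript
// Toy illustration only: NOT the Readability algorithm and NOT this node's code.
// It removes a fixed set of boilerplate elements and strips the remaining tags.

// Hypothetical result shape for this sketch.
interface ExtractedPage {
  title: string;
  textContent: string;
}

// Elements treated as boilerplate here (an assumption, not Readability's rules).
const BOILERPLATE_TAGS = ["script", "style", "nav", "header", "footer", "aside"];

function extractMainText(html: string): ExtractedPage {
  // Take the page title from <title>, if there is one.
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const title = titleMatch?.[1]?.trim() ?? "";

  // Work on the <body> only when the document has one.
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  let body = bodyMatch?.[1] ?? html;

  // Drop each boilerplate element together with everything inside it.
  // A non-greedy regex does not handle nested elements of the same name.
  for (const tag of BOILERPLATE_TAGS) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    body = body.replace(pattern, " ");
  }

  // Strip the remaining tags and collapse whitespace.
  const textContent = body
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  return { title, textContent };
}
```

Regex-based stripping like this breaks on real-world pages quickly, which is exactly why a DOM-based library such as Readability is the better tool.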
To install it, follow the installation guide in the n8n community nodes documentation and add the n8n-nodes-webpage-content-extractor community node.
- Use the “HTTP Request” node to fetch the HTML document of a given URL (or get the HTML from another source)
- Connect the “HTTP Request” node to the Webpage Content Extractor node
- The Webpage Content Extractor parses the HTML document and returns a JSON document with useful attributes: