Custom node: Webpage content extractor

Hi everyone!

I want to share a custom node I just built: Webpage Content Extractor.

https://github.com/Savjee/n8n-nodes-webpage-content-extractor

It takes an HTML document as input and extracts the main content from it. Sidebars, headers, and footers are all stripped.

It’s based on the Readability library that powers Firefox’s Reader View.

How to install

Follow the installation guide in the n8n community nodes documentation, and add the n8n-nodes-webpage-content-extractor community node.

How to use it

  1. Use the “HTTP Request” node to fetch the HTML document of a given URL (or get the HTML from another source)

  2. Connect the “HTTP Request” node to the Webpage Content Extractor node

  3. The Webpage Content Extractor will parse the HTML document and return a JSON document with useful attributes.
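
To give a rough idea of what "HTML in, JSON out" means here, below is a minimal Python sketch that uses only the standard library. It is not the node's code and not Readability's algorithm (Readability scores elements with heuristics); it is a naive filter that simply drops a fixed set of tags. The field names `title` and `textContent`, the `SKIPPED_TAGS` set, and the sample page are assumptions made up for illustration, so the node's real output may look different.

```python
import json
from html.parser import HTMLParser

# Tags treated as page chrome and dropped. This fixed list is an assumption
# for the sketch; Readability uses scoring heuristics instead.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class NaiveExtractor(HTMLParser):
    """Collects the <title> text and the visible text outside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def extract(html: str) -> dict:
    """Return a dict with hypothetical 'title' and 'textContent' fields."""
    parser = NaiveExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "textContent": " ".join(parser.chunks)}


# Made-up sample page: header, nav, aside, and footer should be dropped.
sample = """<html><head><title>Hello</title></head><body>
<header>Site name</header>
<nav><a href="/">Home</a></nav>
<article><h1>Hello</h1><p>This is the main text.</p></article>
<aside>Related links</aside>
<footer>Copyright</footer>
</body></html>"""

result = extract(sample)
print(json.dumps(result, indent=2))
```

In an n8n workflow the HTML string would come from the “HTTP Request” node rather than from a hard-coded sample.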

Wow, that's really cool! It would be nice to have basic HTML tags available in the output.

Welcome to the community @xavierd !

Really cool! Thanks a lot for sharing that with the community.

Nice work!! It would also be nice if this node could extract tables when they are present on the page, perhaps as an option.