Custom node: Webpage content extractor

Hi everyone!

I want to share a custom node I just built: Webpage Content Extractor.

https://github.com/Savjee/n8n-nodes-webpage-content-extractor

It takes an HTML document as input and extracts the main content from it. Sidebars, headers, and footers are all stripped.

It’s based on the readability library that is used by Firefox’s Reader View.

How to install

Follow the installation guide in the n8n community nodes documentation, and add the n8n-nodes-webpage-content-extractor community node.

How to use it

  1. Use the “HTTP Request” node to fetch the HTML document of a given URL (or get it from another source).

  2. Connect the “HTTP Request” node to the Webpage Content Extractor node.

  3. The Webpage Content Extractor parses the HTML document and returns a JSON document with useful attributes.
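To give a rough feel for what step 3 does, here is a minimal, dependency-free TypeScript sketch. It is not the node's actual code and not the Readability algorithm, which scores content far more carefully. It is just a crude regex-based stand-in that drops common page chrome and returns a small JSON object. The function name `extractMainContent`, the `ExtractedPage` shape, and its field names are assumptions made for illustration; the real node's output fields may differ.

```typescript
// Hypothetical output shape for illustration; the real node's fields may differ.
interface ExtractedPage {
  title: string;
  textContent: string;
  length: number;
}

// Elements that usually hold page chrome rather than the article itself.
const BOILERPLATE_TAGS = ["head", "script", "style", "nav", "header", "footer", "aside"];

// Crude stand-in for a readability-style extractor (assumption: simple, non-nested markup).
function extractMainContent(html: string): ExtractedPage {
  // Grab the <title> before the <head> is removed below.
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const title = titleMatch ? titleMatch[1].trim() : "";

  // Remove each boilerplate element together with everything inside it.
  let body = html;
  for (const tag of BOILERPLATE_TAGS) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    body = body.replace(pattern, " ");
  }

  // Strip the remaining tags and collapse whitespace to get plain text.
  const textContent = body
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  return { title, textContent, length: textContent.length };
}

// Usage: pass in the HTML string you got from the "HTTP Request" step.
const sampleHtml =
  "<html><head><title>Demo</title></head><body><nav>Home | About</nav>" +
  "<article><h1>Hello</h1><p>Main text.</p></article><footer>Copyright</footer></body></html>";
console.log(JSON.stringify(extractMainContent(sampleHtml), null, 2));
```

A regex approach like this breaks on nested or malformed markup, and it would also drop a `<header>` that sits inside the article, which is exactly why a dedicated library is the better choice for real pages.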

Wow, that's really cool! It would be nice to have basic HTML tags available.

Welcome to the community @xavierd !

Really cool! Thanks a lot for sharing that with the community.

Nice work!! It would also be nice if this node could extract tables when they are present on the page, perhaps as an option.

Thanks for this, just sorted an issue I was having! Love it

dude this is golden - thank you so much! :heart::+1:

This is amazing! This helps save so many tokens and API costs when cleaning website contents.

One request I have to improve this is to preserve formatting, for example titles and inline links.

Reader View in Firefox does preserve these, so I am not sure whether this is a limitation of the library they provide or whether it is just not implemented in the node.

If possible, it would be nice to have this. Still very useful otherwise!