Custom node: Webpage content extractor

xavierd · January 7, 2024, 1:05pm

Hi everyone!

I want to share a custom node I just built: Webpage Content Extractor.

https://github.com/Savjee/n8n-nodes-webpage-content-extractor

It takes an HTML document as input and extracts the main content from it. Sidebars, headers, and footers are all stripped.

It’s based on the Readability library used by Firefox’s Reader View.

How to install

Follow the installation guide in the n8n community nodes documentation, and add the n8n-nodes-webpage-content-extractor community node.

How to use it

Use the “HTTP Request” node to fetch the HTML document of a given URL (or get it from other sources)

Connect the “HTTP Request” node to the Webpage Content Extractor node

The Webpage Content Extractor will parse the HTML document and return a JSON document with useful attributes.
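
For anyone curious what the extraction step does conceptually, here is a minimal TypeScript sketch. It is not the Readability algorithm and not this node's code: it is a crude regex stand-in that drops common boilerplate elements and returns a small JSON-like object. The function name `extractMainContent` and the fields `title`, `textContent`, and `length` are assumptions for illustration; the node's actual attribute names may differ.

```typescript
// Hypothetical result shape, for illustration only (not the node's real output).
interface ExtractedContent {
  title: string;
  textContent: string;
  length: number;
}

// Elements that rarely hold the main content of a page.
const BOILERPLATE_TAGS = ["script", "style", "head", "header", "footer", "nav", "aside"];

// Crude stand-in for a real content extractor: removes boilerplate elements,
// strips the remaining tags, and collapses whitespace. It does not handle
// nested elements of the same type, comments, or HTML entities.
function extractMainContent(html: string): ExtractedContent {
  // Read the <title> before the <head> element is removed.
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = titleMatch?.[1]?.trim() ?? "";

  let body = html;
  for (const tag of BOILERPLATE_TAGS) {
    // \b keeps "head" from matching "header".
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    body = body.replace(pattern, " ");
  }

  const textContent = body
    .replace(/<[^>]+>/g, " ") // drop any remaining tags
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim();

  return { title, textContent, length: textContent.length };
}

// Example input, standing in for the HTML returned by an "HTTP Request" step.
const sampleHtml = `
<html>
  <head><title>Example post</title></head>
  <body>
    <header>Site name</header>
    <nav><a href="/">Home</a></nav>
    <article><h1>Hello</h1><p>Main text of the page.</p></article>
    <aside>Related links</aside>
    <footer>Copyright notice</footer>
  </body>
</html>`;

console.log(JSON.stringify(extractMainContent(sampleHtml), null, 2));
```

A real extractor such as Readability scores DOM nodes rather than pattern-matching tags, so treat this only as a picture of the input and output, not of the method.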

Kool_Baudrillard · January 7, 2024, 2:48pm

Wow, that's really cool! It would be nice to have basic HTML tags available.

jan · January 7, 2024, 4:08pm

Welcome to the community @xavierd !

Really cool! Thanks a lot for sharing that with the community.

Ruslan_Yanyshyn · January 9, 2024, 6:41pm

Nice work!! It would also be nice if this node could extract tables when they are present on the page, perhaps as an option.

Vincent_Haywood · December 8, 2024, 10:59pm

Thanks for this, just sorted an issue I was having! Love it

Simon_Formanowski · April 18, 2025, 4:14pm

dude this is golden - thank you so much!

Gesture2867 · June 16, 2025, 1:35pm

This is amazing! This helps save so many tokens and API costs when cleaning website contents.

One improvement I would request is to preserve formatting, for example titles and inline links.

The Reader View feature in the Firefox browser does preserve these, so I am not sure whether this is a limitation of the library they provide or simply not implemented in the node.

If possible, it would be nice to have this. Still very useful otherwise!