Feature: HTML Extraction node [GOT CREATED]

gravitacion · December 22, 2019, 10:06am

Currently, there is a way to fetch HTML (the HTTP Request node) but no way to scrape data from it (XPath, CSS selectors, etc.).

This node would allow extracting data from HTML in a structured manner.

The node would be designed as follows:

Input:

String or File (let the user select the source type in the node configuration)
Fetching the page is left to the HTTP Request node.

Options:

The user enters pairs of an XPath expression and the value to extract.
Example (Google, mobile version):

name: google_result_link
xpath: //a[contains(@href, "/url?") and ./div]
value: @href

A single node could hold multiple XPath rules.
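
To make the option format concrete, here is a rough sketch of how one rule could be typed. The interface name `ExtractionRule` and the helper `findDuplicateNames` are made up for illustration, and the uniqueness check is only an assumption (each name would become a key in the output items, so clashes seem worth catching):

```typescript
// Hypothetical shape of one extraction rule; the field names mirror the example above.
interface ExtractionRule {
  name: string;  // key under which the extracted value appears in the output
  xpath: string; // expression selecting the matching nodes
  value: string; // what to read from each matched node, e.g. "@href"
}

const rules: ExtractionRule[] = [
  {
    name: "google_result_link",
    xpath: '//a[contains(@href, "/url?") and ./div]',
    value: "@href",
  },
];

// Assumption: rule names must be unique because they become output keys.
function findDuplicateNames(list: ExtractionRule[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const rule of list) {
    if (seen.has(rule.name)) {
      duplicates.add(rule.name);
    }
    seen.add(rule.name);
  }
  return Array.from(duplicates);
}

const duplicateNames = findDuplicateNames(rules);
```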

Output:

Resulting values from XPaths would be combined in a list of items.

For example, let’s say we had a “google_result_link” and “google_result_description”.
We would get two arrays:

descriptions

[
    "jardiland: un magasin de choix pour vos besoins de plantes",
    "le monde des plantes - trouvez la plante de vos rêves !",
    "la main verte est une boutique spécialisée dans la vente en ligne de plantes grasses..."
]

and

links

[
    "hxxps://jardiland.com",
    "hxxps://lemondedesplantes.fr",
    "hxxps://lamainverte.fr"
]

The resulting items would be:

var items = [
    {
        "google_result_description": "jardiland: un magasin de choix pour vos besoins de plantes",
        "google_result_link": "hxxps://jardiland.com"
    },
    ...
]
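
The combining step could look roughly like the sketch below. The function name `combineColumns` is hypothetical, and stopping at the shortest array is an assumption about how mismatched lengths should be handled, not something decided in this request:

```typescript
// Hypothetical helper: turn per-field arrays into a list of items.
type Item = Record<string, string>;

function combineColumns(columns: Record<string, string[]>): Item[] {
  const names = Object.keys(columns);
  // Assumption: stop at the shortest column so every item has all fields.
  const length =
    names.length === 0 ? 0 : Math.min(...names.map((n) => columns[n].length));

  const result: Item[] = [];
  for (let i = 0; i < length; i++) {
    const item: Item = {};
    for (const name of names) {
      item[name] = columns[name][i];
    }
    result.push(item);
  }
  return result;
}

const combined = combineColumns({
  google_result_description: [
    "jardiland: un magasin de choix pour vos besoins de plantes",
    "le monde des plantes - trouvez la plante de vos rêves !",
    "la main verte est une boutique spécialisée dans la vente en ligne de plantes grasses...",
  ],
  google_result_link: [
    "hxxps://jardiland.com",
    "hxxps://lemondedesplantes.fr",
    "hxxps://lamainverte.fr",
  ],
});
```

Each entry of `combined` then carries one description together with the link found at the same position.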

Note that this feature request is heavily inspired by huginn’s “Website Scraper” node.

jan · December 23, 2019, 10:11pm

Yes, I agree such a node would be very important, and I have wanted to create one for a very long time. However, I was thinking of doing it with CSS selectors instead of XPath. The reason is that there is an npm module called Cheerio for that, while nothing similar seems to exist for XPath. At least in my search, the modules I found do not seem to be maintained anymore: they have not been updated for years, or people report having issues with them. Are you aware of any good npm module?

gravitacion · December 24, 2019, 12:18am

I did some quick research and also found out that XPath wasn’t very popular around the Cheerio project.
Consequently, I am sadly not aware of any good npm module.

Anyway, CSS is way more user-friendly and should be enough for simple cases. Items could be modified afterward using a FunctionItem node to handle more complex problems.
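
As a rough idea of what such a post-processing step might do, here is a small sketch in plain TypeScript. It does not use the FunctionItem node's actual interface; `ScrapedItem`, `cleanItem`, the field names and the example base URL are all made up for illustration:

```typescript
// Hypothetical item shape; a real item would carry whatever fields were extracted.
type ScrapedItem = { title: string; link: string };

// Collapse whitespace in the title and resolve relative links against a base URL.
function cleanItem(item: ScrapedItem, baseUrl: string): ScrapedItem {
  return {
    title: item.title.replace(/\s+/g, " ").trim(),
    link: new URL(item.link, baseUrl).href,
  };
}

const cleaned = cleanItem(
  { title: "  Some   result \n title ", link: "/about" },
  "https://example.com"
);
```

The selector stays simple, and anything it cannot express (cleanup, link resolution, filtering) happens in a separate step on each item.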

I created a short list of libraries extracted from a quick google search:
- GitHub - goto100/xpath: DOM 3 Xpath implemention and helper for node.js (typescript version of the library below)
- GitHub - yaronn/xpath.js: An xpath module for node, written in pure javascript (nodejs)
- GitHub - libxmljs/libxmljs: NodeJS bindings for libxml2 written in Typescript (libxmljs bindings for XPath 1.0) (see benchmark: jsdom vs libxmljs · Issue #97 · blue-button/bluebutton.js · GitHub)

Also, CSS selectors seem to get better over time:

Up until recently the focus of CSS never really touched on selectors. Occasionally there would be incremental updates within the selectors specification, but never any real ground breaking improvements. Fortunately, more attention has been given to selectors as of late, taking a look at how to select different types of elements and elements in different states of use.
Complex Selectors - Learn to Code Advanced HTML & CSS

Considering all this information, it seems that the idiomatic choice in JavaScript is to use CSS selectors.

jan · December 24, 2019, 12:29am

Looks like the libraries you found are more or less the same ones I found. Another problem with them seems to be that they are built for XML rather than HTML, so an additional transformation step would be needed for them to work at all.

Great to hear that we came to the same conclusion. I will look into getting an HTML Scrape/Parse node implemented soon. Depending on my time, it should be available within 1 or 2 weeks.

jan · January 1, 2020, 4:57am

Just released a new version with the newly created HTML Extract node.
Happy new year 2020!

jan · January 1, 2020, 6:58pm

I have now also published a simple example here: