Currently, there is a way to get HTML (the HTTP Request node) but no way to scrape data from it (XPath, CSS selectors, etc.).
This node would make it possible to extract data from HTML in a structured manner.
The node would be designed as follows:
String or File (let the user select the source type in the node configuration)
We leave the HTTP request things to the HTTP Request node.
The user enters combinations of XPath + value to extract.
Example (Google, mobile version):
name: google_result_link
xpath: //a[contains(@href, "/url?") and ./div]
value: @href
There could be multiple XPath entries in the node.
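To make the configuration above a bit more concrete, here is a rough sketch of how the node parameters might look. Everything in it is hypothetical: the names `sourceType`, `extractors` and `validateParameters` do not exist in n8n, and the second extractor's XPath and value are placeholders I made up for illustration, not tested against Google.

```javascript
// Hypothetical parameter shape for the proposed node (not an existing n8n API).
const nodeParameters = {
  sourceType: "string", // assumed options: "string" or "file"
  extractors: [
    {
      name: "google_result_link",
      xpath: '//a[contains(@href, "/url?") and ./div]',
      value: "@href",
    },
    {
      // Placeholder entry: this XPath and value are illustrative only.
      name: "google_result_description",
      xpath: '//a[contains(@href, "/url?")]/div',
      value: "string(.)",
    },
  ],
};

// Returns a list of human-readable problems; an empty list means the config looks usable.
function validateParameters(params) {
  const errors = [];
  if (!["string", "file"].includes(params.sourceType)) {
    errors.push(`unknown sourceType: ${params.sourceType}`);
  }
  const seen = new Set();
  for (const { name, xpath, value } of params.extractors) {
    if (!name || !xpath || !value) {
      errors.push(`incomplete extractor: ${name || "(unnamed)"}`);
    }
    if (seen.has(name)) {
      errors.push(`duplicate extractor name: ${name}`);
    }
    seen.add(name);
  }
  return errors;
}

const problems = validateParameters(nodeParameters);
```

Requiring unique names is a design assumption here: each name becomes a key in the output items, so two extractors with the same name would overwrite each other.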
Resulting values from XPaths would be combined in a list of items.
For example, let’s say we had a “google_result_link” and “google_result_description”.
We would get two arrays:
[
  "jardiland: un magasin de choix pour vos besoins de plantes",
  "le monde des plantes - trouvez la plante de vos rêves !",
  "la main verte est une boutique spécialisée dans la vente en ligne de plantes grasses..."
]
The resulting items would be:
var items = [
  {
    "google_result_description": "jardiland: un magasin de choix pour vos besoins de plantes",
    "google_result_link": "hxxps://"
  }
];
Note that this feature request is heavily inspired by Huginn’s “Website Scraper” node.
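The step that combines the per-XPath arrays into a list of items could be sketched roughly like this. It is only an illustration under assumptions: `combineIntoItems` is a hypothetical helper name, the input is assumed to be an object mapping each extractor name to its array of matches, and padding shorter arrays with `null` is one possible choice (the node could just as well raise an error on uneven lengths). The links are left as the truncated placeholder from the example above.

```javascript
// Hypothetical helper: turns { name: [values...] } into [{ name: value }, ...].
function combineIntoItems(extracted) {
  const names = Object.keys(extracted);
  // The longest array decides how many items are produced.
  const length = Math.max(0, ...names.map((name) => extracted[name].length));
  const items = [];
  for (let i = 0; i < length; i++) {
    const item = {};
    for (const name of names) {
      const values = extracted[name];
      // Assumption: missing values are padded with null rather than rejected.
      item[name] = i < values.length ? values[i] : null;
    }
    items.push(item);
  }
  return items;
}

const extracted = {
  google_result_description: [
    "jardiland: un magasin de choix pour vos besoins de plantes",
    "le monde des plantes - trouvez la plante de vos rêves !",
    "la main verte est une boutique spécialisée dans la vente en ligne de plantes grasses...",
  ],
  google_result_link: ["hxxps://", "hxxps://", "hxxps://"],
};

const items = combineIntoItems(extracted);
```

Each entry of `items` then carries one value per extractor name, in the order the matches appeared in the page.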