Currently, there is a way to get HTML (the HTTP Request node) but no way to scrape data from it (XPath, CSS selectors, etc.).
This node would make it possible to extract data from HTML in a structured manner.
The node would be designed as follows:
Input:
String or File (let the user select the source type in the node configuration)
Fetching the page itself is left to the HTTP Request node.
Options:
The user enters combinations of an XPath expression and the value to extract.
Example (Google, mobile version):
name: google_result_link
xpath: //a[contains(@href, "/url?") and ./div]
value: @href
The node would accept multiple XPath expressions.
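As a rough sketch, the options could be stored as a list of rule objects. The field names below mirror the example above; `extractionRules` and `validateRules` are hypothetical names used only for illustration, and the checks (non-empty fields, unique names) are assumptions rather than part of the request.

```javascript
// Hypothetical shape for the node's options: one object per XPath rule.
const extractionRules = [
  {
    name: 'google_result_link',
    xpath: '//a[contains(@href, "/url?") and ./div]',
    value: '@href',
  },
];

// Hypothetical helper: returns a list of problems found in the rules
// (an empty list means the rules look usable).
function validateRules(rules) {
  const errors = [];
  const seen = new Set();
  rules.forEach((rule, index) => {
    // Assumption: every field is required and must be a non-empty string.
    for (const field of ['name', 'xpath', 'value']) {
      if (typeof rule[field] !== 'string' || rule[field].trim() === '') {
        errors.push(`rule ${index}: "${field}" must be a non-empty string`);
      }
    }
    // Assumption: names must be unique, since they become keys in the output items.
    if (seen.has(rule.name)) {
      errors.push(`rule ${index}: duplicate name "${rule.name}"`);
    }
    seen.add(rule.name);
  });
  return errors;
}
```

Unique names matter here because each name would end up as a key in the output items.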
Output:
The values returned by the XPath expressions would be combined into a list of items.
For example, let’s say we had a “google_result_link” and “google_result_description”.
We would get two arrays:
descriptions
[
"jardiland: un magasin de choix pour vos besoins de plantes",
"le monde des plantes - trouvez la plante de vos rêves !",
"la main verte est une boutique spécialisée dans la vente en ligne de plantes grasses..."
]
and
links
[
"hxxps://jardiland.com",
"hxxps://lemondedesplantes.fr",
"hxxps://lamainverte.fr"
]
The resulting items would be:
var items = [
{
"google_result_description": "jardiland: un magasin de choix pour vos besoins de plantes",
"google_result_link": "hxxps://jardiland.com"
},
...
]
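The combination step above could be sketched as follows. This is only an illustration: `combineResults` is a hypothetical helper, and pairing values by position (first description with first link, and so on) as well as filling missing values with `null` when the arrays differ in length are assumptions, not something the request specifies.

```javascript
// Hypothetical helper: turns { ruleName: [values...] } into a list of items,
// pairing the values of each rule by their position in the arrays.
function combineResults(resultsByName) {
  const names = Object.keys(resultsByName);
  // Use the longest array so that no extracted value is dropped.
  const length = Math.max(0, ...names.map((name) => resultsByName[name].length));
  const items = [];
  for (let i = 0; i < length; i++) {
    const item = {};
    for (const name of names) {
      const values = resultsByName[name];
      // Assumption: a rule with fewer matches yields null for the missing slots.
      item[name] = i < values.length ? values[i] : null;
    }
    items.push(item);
  }
  return items;
}

const items = combineResults({
  google_result_description: [
    'jardiland: un magasin de choix pour vos besoins de plantes',
    'le monde des plantes - trouvez la plante de vos rêves !',
  ],
  google_result_link: ['hxxps://jardiland.com', 'hxxps://lemondedesplantes.fr'],
});
```

Pairing by position only gives sensible items when every XPath matches once per result, so a real node might instead evaluate the rules relative to a common parent element.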
Note that this feature request is heavily inspired by Huginn’s “Website Scraper” node.