The idea is:
Allow the HTML Extract node to output the outer HTML of the target element.
My use case:
I want to scrape websites to get notified when they update. Some post lists may have fields missing for some entries, for example category, summary, etc.
<!-- HTML sample -->
<div>
<a href="/path1">
<h3>Title 1</h3>
</a>
<a href="/path2">
<h3>Title 2</h3>
<span>Category 2</span>
<p>Summary ...</p>
</a>
</div>
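If I extract the fields separately and then merge them, the field lists will have different lengths and will no longer line up with their entries.
So I want to extract the entry items first (first output below), then extract the fields inside each item (second output below).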
[
{
"data": [
"<a href=\"/path1\">\n <h3>Title 1</h3>\n </a>",
"<a href=\"/path2\">\n <h3>Title 2</h3>\n <span>Category 2</span>\n <p>Summary ...</p>\n </a>"
]
}
]
[
{
"title": "Title 1",
"url": "/path1",
"category": "",
"summary": ""
},
{
"title": "Title 2",
"url": "/path2",
"category": "Category 2",
"summary": "Summary ..."
}
]
But currently the HTML Extract node can only return the HTML contained in the selected element, that is, its inner HTML. I hope there can be different options here, such as inner HTML and outer HTML.
Actually, I think it is strange that for the given CSS selector, the extracted text, value and attribute are of the selected element, but HTML is of the children of the selected element.
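
To make the request concrete, here is a rough sketch of the two-step extraction done outside n8n, in Python. It is only an illustration of the idea, not how the HTML Extract node works internally. It assumes the sample above is well-formed enough to be parsed as XML, so the standard library's `xml.etree.ElementTree` can stand in for a real HTML parser; the helper names `outer_html`, `inner_html` and `extract_entries` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The HTML sample from the post (assumed well-formed, so an XML parser can read it).
SAMPLE = """<div>
<a href="/path1">
<h3>Title 1</h3>
</a>
<a href="/path2">
<h3>Title 2</h3>
<span>Category 2</span>
<p>Summary ...</p>
</a>
</div>"""


def outer_html(el):
    # The element itself plus everything inside it (what this request asks for).
    return ET.tostring(el, encoding="unicode").strip()


def inner_html(el):
    # Only the content inside the element, without its own tag.
    parts = [el.text or ""]
    parts += [ET.tostring(child, encoding="unicode") for child in el]
    return "".join(parts).strip()


def extract_entries(html):
    root = ET.fromstring(html)
    entries = []
    # Step 1: select one element per entry and keep its outer HTML.
    for item_html in (outer_html(a) for a in root.findall("a")):
        # Step 2: extract the fields relative to that single item,
        # so a missing field becomes "" instead of shifting the lists.
        node = ET.fromstring(item_html)
        entries.append({
            "title": node.findtext("h3", default=""),
            "url": node.get("href", ""),
            "category": node.findtext("span", default=""),
            "summary": node.findtext("p", default=""),
        })
    return entries


if __name__ == "__main__":
    for entry in extract_entries(SAMPLE):
        print(entry)
```

The important part is that step 2 receives the `<a>` tag itself, so the `href` attribute is still available. With inner HTML only, the item would start at `<h3>` and the URL would be lost. Real pages are rarely valid XML, so a proper HTML parser would be needed there.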