Output outer HTML when extracting HTML

wwww · February 4, 2025, 4:05pm

The idea is:

Allow the HTML extract node to output the outer HTML of the target element.

My use case:

I want to scrape websites so I get notified when they update. In some post lists, certain entries are missing fields such as category or summary.

<!-- HTML sample -->
<div>
  <a href="/path1">
    <h3>Title 1</h3>
  </a>
  <a href="/path2">
    <h3>Title 2</h3>
    <span>Category 2</span>
    <p>Summary ...</p>
  </a>
</div>

If I extract each field separately and then merge them, the field lists will have different lengths, so the values no longer line up with the right entries.
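
To illustrate the misalignment, here is a minimal Python sketch. It is not n8n code: it uses the standard library's `xml.etree.ElementTree` only because the sample above happens to be well-formed XML, and the variable names are mine.

```python
import xml.etree.ElementTree as ET

# The HTML sample from this post (well-formed, so an XML parser can read it).
SAMPLE = """<div>
  <a href="/path1">
    <h3>Title 1</h3>
  </a>
  <a href="/path2">
    <h3>Title 2</h3>
    <span>Category 2</span>
    <p>Summary ...</p>
  </a>
</div>"""

root = ET.fromstring(SAMPLE)

# Extract each field across the whole page, independently of the entries.
titles = [el.text for el in root.iter("h3")]        # two values
categories = [el.text for el in root.iter("span")]  # only one value

# Merging by position pairs "Title 1" with the category of entry 2.
merged = list(zip(titles, categories))
print(merged)  # [('Title 1', 'Category 2')]
```

The first entry ends up with the second entry's category, and the second entry is dropped from the merge entirely.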

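So, I want to extract the entry items first, then extract the fields inside each item. The first JSON block below is the intermediate output I would like (the outer HTML of each item), and the second is the final result.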

[
  {
    "data": [
      "<a href=\"/path1\">\n    <h3>Title 1</h3>\n  </a>",
      "<a href=\"/path2\">\n    <h3>Title 2</h3>\n    <span>Category 2</span>\n    <p>Summary ...</p>\n  </a>"
    ]
  }
]

[
  {
    "title": "Title 1",
    "url": "/path1",
    "category": "",
    "summary": ""
  },
  {
    "title": "Title 2",
    "url": "/path2",
    "category": "Category 2",
    "summary": "Summary ..."
  }
]
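
The two-step flow above can be sketched in Python as follows. This is only an illustration of the idea, not how the node is implemented: `outer_html` and `extract_fields` are hypothetical helper names, and the XML parser stands in for a real HTML parser because the sample is well-formed.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
  <a href="/path1">
    <h3>Title 1</h3>
  </a>
  <a href="/path2">
    <h3>Title 2</h3>
    <span>Category 2</span>
    <p>Summary ...</p>
  </a>
</div>"""


def outer_html(el):
    """Serialize the element itself plus its children (hypothetical helper)."""
    # ET.tostring also emits trailing whitespace after the element; strip it.
    return ET.tostring(el, encoding="unicode").strip()


def extract_fields(fragment):
    """Step 2: read the fields from one entry's outer HTML."""
    a = ET.fromstring(fragment)

    def text_of(tag):
        el = a.find(tag)
        # A missing field becomes an empty string instead of shifting the list.
        return (el.text or "").strip() if el is not None else ""

    return {
        "title": text_of("h3"),
        "url": a.get("href", ""),
        "category": text_of("span"),
        "summary": text_of("p"),
    }


root = ET.fromstring(SAMPLE)

# Step 1: one outer-HTML string per entry (the <a> children of the <div>).
items = [outer_html(a) for a in root.findall("a")]

# Step 2: extract the fields inside each entry separately.
rows = [extract_fields(item) for item in items]
```

Because each entry is processed on its own, a missing category or summary stays empty for that entry and cannot be filled by a value from another one.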

But currently the HTML extract node can only return the inner HTML, that is, the contents of the selected element without the element's own tag. I hope there can be an option here to choose between inner HTML and outer HTML.

Actually, I think it is strange that, for a given CSS selector, the extracted text, value and attribute come from the selected element itself, but the HTML comes from the children of the selected element …
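
For clarity, this is the difference I mean, as a small Python sketch. The helper names `inner_html` and `outer_html` are mine and only mirror the DOM's `innerHTML` / `outerHTML` properties; the XML parser is used just because the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

el = ET.fromstring('<a href="/path1"><h3>Title 1</h3></a>')


def inner_html(node):
    """Contents of the element only: leading text plus serialized children."""
    return (node.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in node
    )


def outer_html(node):
    """The element's own tag and attributes plus its contents."""
    return ET.tostring(node, encoding="unicode")


inner = inner_html(el)  # '<h3>Title 1</h3>'
outer = outer_html(el)  # '<a href="/path1"><h3>Title 1</h3></a>'
```

With only the inner HTML, the `href` on the selected `<a>` is lost, so the URL cannot be extracted in a second step.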