How can I extract all URLs using the HTML Extract node?

I am trying to extract all links from a webpage.
I know I could write a simple Function node to extract all the URLs, but I think it would be more helpful to understand how to do this with the HTML Extract node.

I spent some time looking over existing flows, but they extract one particular link from a site rather than all of them.
The best I got was taking all links that sit inside h1 or h2 headings… tested with https://n8n.io/workflows/434
While I think this is a really good start, I want to make sure I get all the URLs from the page and not just the URLs in the headings.

I feel there is something small I might be missing. Any insight on the matter would be great : )

{
  "nodes": [
    {
      "parameters": {},
      "name": "Start",
      "type": "n8n-nodes-base.start",
      "typeVersion": 1,
      "position": [
        1140,
        260
      ]
    },
    {
      "parameters": {
        "url": "https://www.decathlon.com/",
        "responseFormat": "string",
        "options": {}
      },
      "name": "HTTP Request2",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        1360,
        260
      ],
      "typeVersion": 1
    },
    {
      "parameters": {
        "extractionValues": {
          "values": [
            {
              "key": "item",
              "cssSelector": "h3",
              "returnValue": "html",
              "returnArray": true
            }
          ]
        },
        "options": {}
      },
      "name": "HTML Extract4",
      "type": "n8n-nodes-base.htmlExtract",
      "position": [
        1560,
        260
      ],
      "typeVersion": 1
    },
    {
      "parameters": {
        "dataPropertyName": "item",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "a"
            },
            {
              "key": "url",
              "cssSelector": "a",
              "returnValue": "attribute",
              "attribute": "href"
            }
          ]
        },
        "options": {}
      },
      "name": "HTML Extract13",
      "type": "n8n-nodes-base.htmlExtract",
      "position": [
        1760,
        260
      ],
      "typeVersion": 1
    }
  ],
  "connections": {
    "Start": {
      "main": [
        [
          {
            "node": "HTTP Request2",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTTP Request2": {
      "main": [
        [
          {
            "node": "HTML Extract4",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTML Extract4": {
      "main": [
        [
          {
            "node": "HTML Extract13",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}

Hey @David_Go,

Is there any particular reason you’re only extracting information for the h3 tags and not the a tags? If you want to get all the links, I think a tags would be a better option.
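
As a rough sketch of what that could look like, a single HTML Extract node might select every a tag and return its href attribute as an array. This only reuses parameter names that already appear in the workflow above (cssSelector, returnValue, attribute, returnArray); the node name and key are placeholders, and it assumes the HTTP Request node's output sits in the default property, the same way HTML Extract4 reads it:

```json
{
  "parameters": {
    "extractionValues": {
      "values": [
        {
          "key": "url",
          "cssSelector": "a",
          "returnValue": "attribute",
          "attribute": "href",
          "returnArray": true
        }
      ]
    },
    "options": {}
  },
  "name": "HTML Extract",
  "type": "n8n-nodes-base.htmlExtract",
  "position": [
    1560,
    260
  ],
  "typeVersion": 1
}
```

A node like this would stand in for both HTML Extract4 and HTML Extract13. Keep in mind that the href values come back exactly as written in the page, so some of them may be relative paths or anchors rather than full URLs.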


No, not at all! Wow, I was so far down the rabbit hole trying to reverse engineer a flow that I may have simply overcomplicated it.

Correct me if I am wrong… This simple extract seems to do the trick:


@harshil1712 Thanks once more for your help!
