How to Specify Meta Tags in HTML Extract?

Hi everyone! I’m trying to extract Open Graph meta tags from some HTML using the HTML Extract node, but it’s not working. I can extract all other tags, but these meta tags are giving me grief. Is there a special way to select them? Dot (class) notation works on elements such as <div class=""></div>, but not on <meta property="" content=""/>.

Hopefully there is an easier way with just the HTML Extract node, but I ended up just extracting the content and property attributes as separate arrays and then making key-value pairs out of them using the Function node and this snippet:

// Attribute arrays produced by the HTML Extract node
var props = $node["HTML Extract"].json["property"];
var values = $node["HTML Extract"].json["content"];

var result = {};

// Pair each property name with the content value at the same index
for (var i = 0; i < props.length; i++) {
    result[props[i]] = values[i];
}

return [
    { json: { result } }
];

The node uses cheerio behind the scenes, so you can use any selector it supports. Check the example below.


{
  "nodes": [
    {
      "parameters": {},
      "name": "Start",
      "type": "n8n-nodes-base.start",
      "typeVersion": 1,
      "position": [
        -270,
        180
      ]
    },
    {
      "parameters": {
        "url": "https://n8n.io",
        "responseFormat": "string",
        "options": {}
      },
      "name": "HTTP Request",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [
        30,
        180
      ]
    },
    {
      "parameters": {
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "meta[property=\"og:title\"]",
              "returnValue": "attribute",
              "attribute": "content"
            }
          ]
        },
        "options": {}
      },
      "name": "HTML Extract",
      "type": "n8n-nodes-base.htmlExtract",
      "typeVersion": 1,
      "position": [
        280,
        180
      ]
    }
  ],
  "connections": {
    "Start": {
      "main": [
        [
          {
            "node": "HTTP Request",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTTP Request": {
      "main": [
        [
          {
            "node": "HTML Extract",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
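
If you need more than one Open Graph tag, the same attribute selector pattern should extend to extra entries in the values list. Here is a rough sketch of just the HTML Extract node's parameters, assuming the page actually defines og:description and og:image. The key names "description" and "image" are arbitrary labels picked for illustration, not something the node requires.

```json
{
  "extractionValues": {
    "values": [
      {
        "key": "title",
        "cssSelector": "meta[property=\"og:title\"]",
        "returnValue": "attribute",
        "attribute": "content"
      },
      {
        "key": "description",
        "cssSelector": "meta[property=\"og:description\"]",
        "returnValue": "attribute",
        "attribute": "content"
      },
      {
        "key": "image",
        "cssSelector": "meta[property=\"og:image\"]",
        "returnValue": "attribute",
        "attribute": "content"
      }
    ]
  },
  "options": {}
}
```

Each entry should land in the output item under its own key, which would make the Function node workaround unnecessary for a known set of tags.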

Thanks! This is much better. I wasn’t sure how to format the CSS selector, so this helps a lot, and it’s much cleaner than my workaround.