Help requested to manipulate text returned by HTML Extract module

I am trying to extract a specific string of text from my HTML Extract module result.

The HTML Extract module has successfully extracted the following text:
Condition: Used: An item that has been previously used. See the seller’s listing for full details and description of any imperfections. See all condition definitions– opens in a new window or tab ... Read moreabout the condition Year: 2002 Mileage: 42050

Right now I am sending this entire description to Airtable in the subsequent module. Instead, I would like to add a step before Airtable (a JavaScript function or some other module) that performs operations on the text. Specifically:

I want to extract the year and mileage from this text and store those as data points that I send to their respective columns (also called “age” and “mileage” in Airtable) along with the other fields extracted from my HTML Extract module.

Any ideas how to do this?

Here is the code behind my HTML Extract Module:

    {
      "nodes": [
        {
          "parameters": {
            "extractionValues": {
              "values": [
                {
                  "key": "title",
                  "cssSelector": "title"
                },
                {
                  "key": "price",
                  "cssSelector": "span#prcIsum.notranslate"
                },
                {
                  "key": "descurl",
                  "cssSelector": "iframe",
                  "returnValue": "attribute",
                  "attribute": "src"
                },
                {
                  "key": "specifics",
                  "cssSelector": "div#viTabs_0_is.itemAttr tbody"
                },
                {
                  "key": "photourl",
                  "cssSelector": "img#icImg",
                  "returnValue": "value"
                },
                {
                  "key": "location",
                  "cssSelector": "span[itemprop=\"availableAtOrFrom\"]"
                }
              ]
            },
            "options": {
              "trimValues": "=true"
            }
          },
          "name": "HTML Extract",
          "type": "n8n-nodes-base.htmlExtract",
          "typeVersion": 1,
          "position": [
            650,
            450
          ]
        }
      ],
      "connections": {}
    }

Hey @automatron!

Below is an example Set node that might help. The important part is the snippet .match(/[0-9]\d{3}/).toString() for the year and .match(/[0-9]\d{4}/).toString() for the mileage (the backslash is doubled in the JSON below because JSON requires it to be escaped). The first pattern matches the first run of four digits in the text and the second matches the first run of five digits; .toString() then converts each result to a string. So when you reference the values in the Set node, use these snippets to extract the required values.

{
  "nodes": [
    {
      "parameters": {
        "keepOnlySet": true,
        "values": {
          "string": [
            {
              "name": "year",
              "value": "={{$node[\"Webhook\"].json[\"body\"][\"data\"].match(/[0-9]\\d{3}/).toString()}}"
            },
            {
              "name": "mileage",
              "value": "={{$node[\"Webhook\"].json[\"body\"][\"data\"].match(/[0-9]\\d{4}/).toString()}}"
            }
          ]
        },
        "options": {}
      },
      "name": "Set",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [
        861,
        290
      ]
    }
  ],
  "connections": {}
}

Hope this helps :slightly_smiling_face:

Hey Harshil, this is fantastic, thank you! I had to tweak it a little but got it working with the following (using {3} for the year and {4} for the mileage):
{{$node["HTML Extract"].json["specifics"].match(/[0-9]\d{3}/).toString()}}

I recognise .match() and .toString() as JavaScript functions that I will read more about on W3C. Is the “/[0-9]\d{3}/” (or {4}) part a regular expression, and is there any reading or tutorial you would recommend to understand it better?

Super impressed by n8n and the community. I have been reading for a while and seen your helpful replies on many posts - thank you :slight_smile:

I am happy that it works!

Yes, .match() and .toString() are JavaScript functions. .match() returns what it matched as an array, so we use .toString() to convert that array to a string.
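
To make the array point concrete, here is a minimal sketch in plain JavaScript. The sample string is an assumption modeled on the extracted text quoted in the question, and the variable names are only illustrative.

```javascript
// Sample text modeled on the HTML Extract output quoted in the question (assumed).
const specifics = "Read more about the condition Year: 2002 Mileage: 42050";

// Without the g flag, .match() returns an array holding the first match.
const yearMatch = specifics.match(/[0-9]\d{3}/);    // first run of four digits
const mileageMatch = specifics.match(/[0-9]\d{4}/); // first run of five digits

// .toString() joins the array elements with commas. With a single element
// and no capture groups, the result is just the matched text.
const year = yearMatch.toString();
const mileage = mileageMatch.toString();

console.log(Array.isArray(yearMatch)); // true
console.log(year, mileage);            // 2002 42050
```

Note that both patterns simply take the first run of digits of the right length, so they rely on the year appearing before any other four-digit number in the text.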

I use Regex 101 to build and test regular expressions. Apart from that, I refer to the MDN documentation on Regex.

Thank you for your kind words! :slightly_smiling_face:

Thanks, Regex101 looks very useful. I’ve put your reg-ex into that. Can you tell me why you are matching on “\d”? I thought you would be looking for the words “year” or “mileage” and grabbing the digits that come after them, but your code does not seem to be doing that?

If I had a different set of data where the mileage was just 100 (not 42050), would the same code work?

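The \d matches a single digit, and the number in braces says how many times to repeat it, so the mileage pattern only matches a run of five digits. In the case of Mileage this wouldn’t work if the length is not fixed: a value like 100 would not match. The solution you’re suggesting makes more sense.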
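
As a rough illustration of the difference, the sketch below compares the fixed-length pattern with a label-based one. Both sample strings are made up for the comparison, and the variable names are assumptions.

```javascript
const fixedLength = /[0-9]\d{4}/;    // needs five consecutive digits
const labelled = /Mileage:\s*(\d+)/; // any number of digits after the label

// Made-up sample inputs: one with a five-digit mileage, one with three digits.
const longText = "Year: 2002 Mileage: 42050";
const shortText = "Year: 2002 Mileage: 100";

const fixedOnLong = longText.match(fixedLength);   // matches "42050"
const fixedOnShort = shortText.match(fixedLength); // null: no five-digit run
const labelOnShort = shortText.match(labelled);    // captures "100"

console.log(fixedOnLong[0]);  // 42050
console.log(fixedOnShort);    // null
console.log(labelOnShort[1]); // 100
```

When .match() finds nothing it returns null, so chaining .toString() directly onto it would throw a TypeError in the short-mileage case.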

Thanks Harshil, this is all super helpful and I’m learning a lot. I’ve just completed some basic tutorials on regex and understood that I need the following expression to capture the mileage:
{{$node["HTML Extract"].json["specifics"].match(/Mileage:\s*(\d+)/).toString()}}

I am using the brackets around \d+ to “capture” only the numbers after the word “Mileage:”. This should work in theory but with the set module, it returns the following:

Mileage: 
											 
											
												42050,42050

So it seems to return the full match first and then just the captured part, as indicated by my parentheses. Any tips on returning only what is captured rather than the full match?

Yes, the .match() method returns everything the pattern matched. Since the pattern also includes the word Mileage, that text is returned as well, followed by the captured group. This is where you might want to use the .replace() method to remove everything except the data we want. I hope this documentation helps.
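
A small sketch of why the output looks that way, and two ways to get only the number. The sample string (including its whitespace) is an assumption modeled on the output shown in the question, and reading index 1 of the result is offered here as an alternative to .replace().

```javascript
// Sample text with a line break and tabs between the label and the value (assumed).
const specifics = "Year: 2002 Mileage: \n\t\t 42050";

// With a capture group and no g flag, the result array holds
// the full match at index 0 and the captured group at index 1.
const match = specifics.match(/Mileage:\s*(\d+)/);

// .toString() joins both elements with a comma, which explains
// the "Mileage: ... 42050,42050" output.
const joined = match.toString();

// Option A: read the captured group directly.
const viaIndex = match[1];

// Option B: strip the label and whitespace from the full match with .replace().
const viaReplace = match[0].replace(/Mileage:\s*/, "");

console.log(viaIndex, viaReplace); // 42050 42050
```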

For completeness, here is the final solution that worked for me. I had to use 2 nodes, but maybe it is possible to do it with just 1 for someone who is more proficient in JavaScript/n8n.

  1. Function node:
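// Read the extracted text once, then match each labelled field.
const specifics = $node["HTML Extract"].json["specifics"];
// Each .match() result is an array (or null if the label is missing):
// index 0 is the full match, index 1 is the captured value.
items[0].json.Mileage = specifics.match(/Mileage:\s*(\d+)/);
items[0].json.Year = specifics.match(/Year:\s*(\d+)/);
items[0].json.Colour = specifics.match(/Colour:\s*(\w*)/);
return items;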

The above RegEx works in my case to grab the numbers and words that come after a certain identifier like “Mileage:” or “Colour:”, skipping any whitespace in between.

  2. Set node. Create a new string value for each of the variables above, with an expression like this:
    {{$node["Function"].json["Colour"][1].toString()}}
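
On the question of doing this in a single step, one possible approach is to pick out the captured group inside the Function node itself, so that no second node is needed to read index 1. The sketch below is plain JavaScript rather than a tested n8n workflow: extractField is a hypothetical helper, the sample string (including "Colour: Red") is made up, and in a Function node the text would presumably come from $node["HTML Extract"].json["specifics"] with the results assigned onto items[0].json.

```javascript
// Hypothetical helper: returns the value captured after "<label>:",
// or null when the label is not present in the text.
// Labels are assumed to be plain words with no regex special characters.
function extractField(text, label, valuePattern = "\\w+") {
  const match = text.match(new RegExp(`${label}:\\s*(${valuePattern})`));
  return match ? match[1] : null;
}

// Made-up sample standing in for the extracted "specifics" text.
const specifics = "Year: 2002 Mileage: 42050 Colour: Red";

const fields = {
  Year: extractField(specifics, "Year", "\\d+"),
  Mileage: extractField(specifics, "Mileage", "\\d+"),
  Colour: extractField(specifics, "Colour"),
  Doors: extractField(specifics, "Doors"), // label absent, so null
};

console.log(fields);
```

Returning null for a missing label avoids the TypeError that comes from indexing into a failed match.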

Thank you for the pointers and tips along the way Harshil :smile:

This is great! Have fun!
