HTML Extraction / cheerio.extract

Looking for help with:

  1. Any suggestions on how I can get $.extract to work?
  2. Is this the correct way of thinking about HTML extraction with n8n? Any suggestions for other available approaches?

Context:
I am trying to extract data from an HTML page with a large number of links. The complication is that not all links carry all of the extra attributes; some links are missing a few of them.

Given the way the n8n HTML node works, I am able to get a separate array for each attribute. However, since the arrays have different lengths, I cannot map the values back to their links later on. And there is no way (at least I can't find one) to say "include a blank if an attribute is not found for a particular link".

I checked a similar use case, HTML table extraction, where a few td cells are blank or lack the expected CSS selectors because they hold no data. I run into the same problem: the arrays representing the columns have different lengths, so I cannot map the data back to a table.

It would be great if there was a way to tell the HTML extract node to return a blank when a CSS selector is not found, rather than skipping the element. Then all the arrays would have the same length and could be mapped back to a table without mixing up data from different rows.

So I am now experimenting with cheerio. I saw in a few other discussions here that n8n already uses cheerio, and I am trying out the code below:

const cheerio = require('cheerio');

const $ = cheerio.load($input.all()[0].json.data);

const allLinks = $('a');
console.log('allLinks - ' + allLinks);

const data = $.extract({
  links: [
    {
      selector: 'a',
      value: (el, key) => {
        const href = $(el).attr('href');
        return `${key}=${href}`;
      },
    },
  ],
});

return {data: data};

I can see all the links printed on the console from the console.log line. However, I get an error message:

$.extract is not a function [line 8]

I am not too familiar with JS and am finding my way around using Google, so apologies if I am missing something simple or obvious.

I am using this as a reference: Extracting Data with the extract Method | cheerio

As for the HTML I am trying to extract from, it is a page with hundreds of links carrying extra attributes.

<A HREF="..." ADD_DATE="..." LAST_MODIFIED="..." USER_ID="..." CATEGORY="..." SUB_CATEGORY="..." and a few other weird optional attributes>link description</A>

Hey @atomtr,

I am not that familiar with Cheerio, but I did find that extract is not yet available in the released version; the documentation on the website is ahead of the release (cheerio docs · Issue #3165 · cheeriojs/cheerio · GitHub).

What happens if you use the HTML node? It may be possible to build the workflow without having to use Cheerio directly.
