HTML extract broken: doesn't recognize table tags (tr, td not working) + Workaround

artildo · March 12, 2023, 7:14am

I need to extract table data from HTML like here: Table (information) - Wikipedia

I spent a lot of time trying to get the HTML node to recognize table tags. It simply doesn’t work. But there is a very strange workaround (see below).

Check out these workflows. It works only if I replace tr and td with other tags, like li and a.
I think it shouldn’t work like that.

Jon · March 13, 2023, 11:51am

Hey @artildo,

It took me a while, but after playing around with the extracted HTML in a browser as well, it looks like it was failing because the HTML is not valid. To work around that, I added a Set node that wraps the data back inside <table></table> tags, then used table > tbody > tr to extract the rows.
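
The wrapping step described above can be sketched in plain JavaScript. This is only a sketch under assumptions: the helper name wrapTableFragment is hypothetical, and it assumes the extracted value is a string that has lost its outer <table> element.

```javascript
// Hypothetical helper: put an orphaned table fragment back inside <table> tags
// so that an HTML parser sees valid table markup again.
function wrapTableFragment(fragment) {
  const trimmed = fragment.trim();
  // Leave the input alone when it already starts with a <table> tag.
  if (/^<table[\s>]/i.test(trimmed)) {
    return trimmed;
  }
  return `<table>${trimmed}</table>`;
}

const fragment = '<tbody><tr> <td>Tinu</td> <td>Elejogun</td> <td>14 </td></tr></tbody>';
console.log(wrapTableFragment(fragment));

// With Cheerio installed, the wrapped markup could then be queried, e.g.:
//   const $ = require('cheerio').load(wrapTableFragment(fragment));
//   $('table > tbody > tr').each((i, row) => console.log($(row).text()));
```

The check for an existing <table> tag is there so the helper can be applied to input that is already valid without nesting a second table around it.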

artildo · March 13, 2023, 2:39pm

Hi @Jon
In this case you can simply use tr instead of table > tbody > tr. But did you try the last node, the one with td? It still doesn’t work.

And I think something should be done about that. I got this issue from my client, who wanted to parse big tables and failed to do so. But when you simply change the tags, everything works.

Jon · March 13, 2023, 4:04pm

Hey @artildo,

I did try the last node as well, and it had empty data in it, which I assume came from the a selector not matching anything. At the same time, there is a good chance that it is failing because the HTML is not valid as well. I am not that familiar with the Cheerio package we use, but I did try copying the output into an HTML file and selecting with browser tools, and I had the same issue.

artildo · March 13, 2023, 5:33pm

I see. If something doesn’t work with certain tags but works when the tags are changed, then maybe a bug report for the node is needed.

Jon · March 13, 2023, 6:35pm

We have it here. I am still not sure it is a bug as such, as it could be a package limitation.

We will look into it more, though.

artildo · June 8, 2023, 2:31pm

@Jon, any progress here?

Jon · June 8, 2023, 2:42pm

Hey @artildo,

Nothing yet. Looking at the previous notes, this is likely to be down to the HTML not being valid anymore, and it can be reproduced outside of n8n. I will get a development ticket created so we can work out how to fix this, as it isn’t directly an n8n issue and it is technically working correctly with the data it has been provided.

Jon · June 8, 2023, 2:48pm

It is still a bit odd that it works when the tags are replaced, though, as I would expect that to not be valid as well. I will have another look when I am back on Tuesday.

Jon · June 8, 2023, 3:11pm

Curiosity got the better of me, so I put together a quick project to see what is happening. It takes the HTML that we get once extracted and loads it into Cheerio (the package used for working with HTML), and it also takes the replaced version from the workflow. It then outputs what the package sees and what the browser will see if you save the same content as an HTML file and load it.

```
const cheerio = require('cheerio');

// Extracted table fragment: note that there is no enclosing <table> element.
const table = `<caption>Age table </caption> <tbody><tr> <th>First name</th> <th>Last name</th> <th>Age </th></tr> <tr> <td>Tinu</td> <td>Elejogun</td> <td>14 </td></tr> <tr> <td>Javier</td> <td>Zapata</td> <td>28 </td></tr> <tr> <td>Lily</td> <td>McGarrett</td> <td>18 </td></tr> <tr> <td>Olatunkbo</td> <td>Chijiaku</td> <td>22 </td></tr> <tr> <td>Adrienne</td> <td>Anthoula</td> <td>22 </td></tr> <tr> <td>Axelia</td> <td>Athanasios</td> <td>22 </td></tr> <tr> <td>Jon-Kabat</td> <td>Zinn</td> <td>22 </td></tr> <tr> <td>Thabang</td> <td>Mosoa</td> <td>15 </td></tr> <tr> <td>Rhian</td> <td>Ellis</td> <td>12 </td></tr> </tbody>`;
// Same fragment with tr replaced by li and td replaced by a, as in the workflow.
const replaced = `<caption>Age table </caption> <tbody><li> <th>First name</th> <th>Last name</th> <th>Age </th></li> <li> <a>Tinu</a> <a>Elejogun</a> <a>14 </a></li> <li> <a>Javier</a> <a>Zapata</a> <a>28 </a></li> <li> <a>Lily</a> <a>McGarrett</a> <a>18 </a></li> <li> <a>Olatunkbo</a> <a>Chijiaku</a> <a>22 </a></li> <li> <a>Adrienne</a> <a>Anthoula</a> <a>22 </a></li> <li> <a>Axelia</a> <a>Athanasios</a> <a>22 </a></li> <li> <a>Jon-Kabat</a> <a>Zinn</a> <a>22 </a></li> <li> <a>Thabang</a> <a>Mosoa</a> <a>15 </a></li> <li> <a>Rhian</a> <a>Ellis</a> <a>12 </a></li> </tbody>`;

// Load both fragments the same way and print what Cheerio's parser keeps.
const $t = cheerio.load(table);
const $r = cheerio.load(replaced);

console.log(`TABLE HTML\n` + $t.html());
console.log(`\n\nREPLACED HTML\n` + $r.html());
```

Console Output

```
TABLE HTML
<html><head></head><body>Age table   First name Last name Age   Tinu Elejogun 14   Javier Zapata 28   Lily McGarrett 18   Olatunkbo Chijiaku 22   Adrienne Anthoula 22   Axelia Athanasios 22   Jon-Kabat Zinn 22   Thabang Mosoa 15   Rhian Ellis 12  </body></html>


REPLACED HTML
<html><head></head><body>Age table  <li> First name Last name Age </li> <li> <a>Tinu</a> <a>Elejogun</a> <a>14 </a></li> <li> <a>Javier</a> <a>Zapata</a> <a>28 </a></li> <li> <a>Lily</a> <a>McGarrett</a> <a>18 </a></li> <li> <a>Olatunkbo</a> <a>Chijiaku</a> <a>22 </a></li> <li> <a>Adrienne</a> <a>Anthoula</a> <a>22 </a></li> <li> <a>Axelia</a> <a>Athanasios</a> <a>22 </a></li> <li> <a>Jon-Kabat</a> <a>Zinn</a> <a>22 </a></li> <li> <a>Thabang</a> <a>Mosoa</a> <a>15 </a></li> <li> <a>Rhian</a> <a>Ellis</a> <a>12 </a></li> </body></html>
```

So you can see what is behind the issue: because the table version of the HTML is invalid (the rows are not inside a <table> element), the table tags are dropped and only the text is kept, as if there were no HTML in there. I have taken a look at the package issues and found a report of the same thing here: https://github.com/cheeriojs/cheerio/issues/3014, which was closed. This will need some proper thought on how to handle it, but hopefully this gives some insight into what is happening.

artildo · June 8, 2023, 3:41pm

So the basic tactic is to replace the wicked tags with some others, like I do now.

Jon · June 8, 2023, 3:44pm

Hey @artildo,

For now that is the best bet. I may look to see if there are better options for the future; it could be that we have to work out whether we need to manipulate the HTML to make it valid in some way.
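
The tag replacement discussed above can be sketched as a small helper. The name replaceTableTags and its mapping parameter are illustrative assumptions; the default mapping (tr to li, td to a) is the one used in this thread, and th is left untouched, which matches the console output above where the header cells lose their tags.

```javascript
// Hypothetical helper: rename table-only tags to tags the parser keeps
// outside a <table>. The default mapping mirrors the workaround in this thread.
function replaceTableTags(html, mapping = { tr: 'li', td: 'a' }) {
  const names = Object.keys(mapping).join('|');
  // Match opening and closing tags by exact name, keeping attributes as-is.
  const pattern = new RegExp(`<(\\/?)(${names})(?=[\\s>\\/])([^>]*)>`, 'gi');
  return html.replace(pattern, (match, slash, tag, rest) =>
    `<${slash}${mapping[tag.toLowerCase()]}${rest}>`);
}

const sample = '<tr> <td>Tinu</td> <td>Elejogun</td> <td>14 </td></tr>';
console.log(replaceTableTags(sample));
// prints: <li> <a>Tinu</a> <a>Elejogun</a> <a>14 </a></li>
```

The lookahead after the tag name keeps the pattern from touching unrelated tags that merely start with the same letters, such as track. A regular expression like this is only a stopgap for simple markup, not a general HTML rewriter.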
