HTML extract broken: doesn't recognize table tags (tr, td not working) + Workaround

I need to extract table data from HTML, such as the tables on the Wikipedia page "Table (information)".

I spent a lot of time trying to get the HTML node to recognize table tags. It simply doesn't work, although there is a very strange workaround (see below).

Check out these workflows. Extraction only works if I replace tr and td with other tags, like li and a.
I don't think it should work like that.

Hey @artildo,

It took me a while, but after also playing around with the extracted HTML in a browser, it looks like it was failing because the HTML is not valid. To work around that, I added a Set node that wraps the data back inside <table></table> tags, then used table > tbody > tr to extract the rows.
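
As a hedged sketch of why the wrapping helps: under the HTML parsing rules, a tr or td start tag that appears outside a table is a parse error and the tag is ignored, which would explain why bare rows match nothing while li and a survive. The TypeScript helper below is hypothetical (wrapTableFragment is not part of n8n or Cheerio). It only rebuilds the missing ancestors as a string, roughly what the Set node does, and assumes the fragment starts with the tag in question.

```typescript
// Hypothetical helper, not an n8n or Cheerio API: wraps a bare table
// fragment in the ancestor tags an HTML parser expects around it.
function wrapTableFragment(fragment: string): string {
  const trimmed = fragment.trim();

  // Read the first tag name, e.g. "tr" from '<tr class="x">'.
  const match = /^<([a-zA-Z][a-zA-Z0-9]*)/.exec(trimmed);
  const tag = (match?.[1] ?? "").toLowerCase();

  switch (tag) {
    case "thead":
    case "tbody":
    case "tfoot":
      // Row groups only need the table itself.
      return `<table>${trimmed}</table>`;
    case "tr":
      // Rows need a table and a row group.
      return `<table><tbody>${trimmed}</tbody></table>`;
    case "td":
    case "th":
      // Cells need a row as well.
      return `<table><tbody><tr>${trimmed}</tr></tbody></table>`;
    default:
      // A complete table, or not a table fragment: leave it untouched.
      return trimmed;
  }
}

console.log(wrapTableFragment("<tr><td>1</td><td>2</td></tr>"));
// prints "<table><tbody><tr><td>1</td><td>2</td></tr></tbody></table>"
```

Note that a fragment of bare cells needs a tr around it as well, which may be why a td selector can still return nothing after only the rows have been wrapped.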

Hi @Jon
In this case you can simply use tr instead of table > tbody > tr. But did you try the last node, the one with td? It still doesn’t work.

And I think something should be done about that. I got this issue from a client who wanted to parse big tables and failed to do so. But when you simply change the tags, everything works.

Hey @artildo,

I did try the last node as well, and it returned empty data, which I assume came from the a selector not matching anything. At the same time, there is a good chance it is failing because the HTML is not valid there as well. I am not that familiar with the Cheerio package we use, but I did try copying the output into an HTML file and selecting with browser tools, and had the same issue.


If extraction fails with certain tags but works once the tags are changed, then maybe a bug report for the node is needed.

We have it here. I am still not sure it is a bug as such, as it could be a package limitation.

We will look into it more, though.
