It took me a while, but after playing around with the extracted HTML in a browser as well, it looks like it was failing because the HTML is not valid. To work around that, I added a Set node that wraps the data back inside `<table></table>` tags, then used `table > tbody > tr` to extract the rows.
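For anyone who wants to do the same wrapping in code, here is a minimal sketch of the idea. The helper name `wrapInTable` and the sample row are only illustrations (the `td` markup is an assumption about what the extracted fragment looks like), not something n8n or Cheerio provides.

```typescript
// Hypothetical helper: put an orphaned table fragment back inside <table> tags
// so that an HTML parser keeps the rows instead of discarding them.
function wrapInTable(fragment: string): string {
  const trimmed = fragment.trim();
  // If the fragment already starts with a <table> tag, leave it alone.
  if (/^<table[\s>]/i.test(trimmed)) {
    return trimmed;
  }
  return `<table>${trimmed}</table>`;
}

// Assumed shape of one extracted row, based on the sample table in this thread.
const extractedRow = "<tr><td>Tinu</td><td>Elejogun</td><td>14</td></tr>";
const wrappedRow = wrapInTable(extractedRow);
console.log(wrappedRow);
```

HTML5 parsers normally add the implied `tbody` themselves once the rows sit inside a table, which would explain why `table > tbody > tr` matches after wrapping.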
Hi @Jon
In this case you can simply use `tr` instead of `table > tbody > tr`. But did you try the last node, the one with `td`? It still doesn’t work.
And I think something should be done about that. I got this issue from a client who wanted to parse big tables and couldn't. But when you simply change the tags, everything works.
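The tag swap can be sketched roughly like this, assuming it replaces `tr` with `li` and `td` with `a`, as the replaced output later in the thread suggests. `renameTag` is a hypothetical helper, and a regular expression like this is fragile on real-world HTML, so treat it as an illustration rather than a robust fix.

```typescript
// Hypothetical helper: rename every opening and closing <from> tag to <to>,
// keeping any attributes on the opening tag.
function renameTag(html: string, from: string, to: string): string {
  // Matches <from>, <from attr="...">, and </from>, but not longer names such as <track>.
  const pattern = new RegExp(`<(/?)${from}(\\s[^>]*)?>`, "gi");
  return html.replace(
    pattern,
    (_match: string, slash: string, attrs?: string) => `<${slash}${to}${attrs || ""}>`
  );
}

// Assumed input: a row that has lost its surrounding <table>.
const orphanRow = "<tr><td>Javier</td><td>Zapata</td><td>28</td></tr>";
const swappedRow = renameTag(renameTag(orphanRow, "tr", "li"), "td", "a");
console.log(swappedRow);
```

The swapped tags survive parsing because `li` and `a` are accepted outside a table, whereas `tr` and `td` are not.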
I did try the last node as well, and it had empty data in it, which I assume came from the `a` selector not matching anything. At the same time, there is a good chance it is failing because the HTML is not valid as well. I am not that familiar with the Cheerio package we use, but I did try copying the output into an HTML file and selecting with the browser tools, and had the same issue.
Nothing. Looking at the previous notes, this is likely down to the HTML not being valid anymore, and it can be reproduced outside of n8n. I will get a development ticket created so we can work out how to fix this, as it isn’t directly an n8n issue and it is technically working correctly with the data it has been provided.
It is still a bit odd that it works when the tags are replaced, though, as I would expect that to be invalid as well. I will have another look when I am back on Tuesday.
Curiosity got the better of me, so I put together a quick project to see what is happening. It takes the HTML that we get once extracted and loads it into Cheerio (the package used for working with HTML), and does the same with the replaced version from the workflow. It then outputs what the package sees and what the browser will see if you save the same content as an HTML file and load it.
```
TABLE HTML
<html><head></head><body>Age table First name Last name Age Tinu Elejogun 14 Javier Zapata 28 Lily McGarrett 18 Olatunkbo Chijiaku 22 Adrienne Anthoula 22 Axelia Athanasios 22 Jon-Kabat Zinn 22 Thabang Mosoa 15 Rhian Ellis 12 </body></html>
REPLACED HTML
<html><head></head><body>Age table <li> First name Last name Age </li> <li> <a>Tinu</a> <a>Elejogun</a> <a>14 </a></li> <li> <a>Javier</a> <a>Zapata</a> <a>28 </a></li> <li> <a>Lily</a> <a>McGarrett</a> <a>18 </a></li> <li> <a>Olatunkbo</a> <a>Chijiaku</a> <a>22 </a></li> <li> <a>Adrienne</a> <a>Anthoula</a> <a>22 </a></li> <li> <a>Axelia</a> <a>Athanasios</a> <a>22 </a></li> <li> <a>Jon-Kabat</a> <a>Zinn</a> <a>22 </a></li> <li> <a>Thabang</a> <a>Mosoa</a> <a>15 </a></li> <li> <a>Rhian</a> <a>Ellis</a> <a>12 </a></li> </body></html>
```
So you can see here what the issue is: because the table version of the HTML is invalid, the table tags are dropped and only the text is left, as if there were no HTML in there. I have taken a look at the package issues and found a report of the same thing here: https://github.com/cheeriojs/cheerio/issues/3014. That issue was closed, so this will need some proper thought on how to handle it, but hopefully this gives some insight into what is happening.
For now that is the best bet. I may look to see if there are better options for the future; it could be that we have to work out whether we need to manipulate the HTML to make it valid in some way.
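As a rough idea of what working that out could look like, here is a small heuristic sketch. `hasOrphanedTableTags` is a hypothetical name, not an existing n8n or Cheerio function, and the check is deliberately naive: it only looks for row or cell tags in a string that contains no `<table>` tag at all.

```typescript
// Hypothetical heuristic: does this HTML contain table rows/cells/sections
// without any <table> tag around them?
function hasOrphanedTableTags(html: string): boolean {
  const hasTableParts = /<(tr|td|th|tbody|thead|tfoot)[\s>]/i.test(html);
  const hasTable = /<table[\s>]/i.test(html);
  return hasTableParts && !hasTable;
}

console.log(hasOrphanedTableTags("<tr><td>14</td></tr>"));
console.log(hasOrphanedTableTags("<table><tr><td>14</td></tr></table>"));
```

It will miss cases where a complete table and an orphaned fragment appear in the same string, so it would only be a starting point for deciding when to repair the HTML before extraction.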