I am using the HTTP Request node to fetch the web page of a news article.
The site returns an HTML page that contains, among other things, a list of items, each with a header, an image, and a description of an event.
I would like to use the XML node to convert the HTML into JSON. Unfortunately, the page contains characters that are invalid in XML.
Is there a way to automatically convert or remove these unsafe characters when passing the HTML to the XML node, or is there a node I can use beforehand that makes the HTML XML-safe?
Additionally, I run into this issue quite often: articles sometimes contain unescaped characters such as ", ', ` and especially ’. That last one seems to break the JavaScript syntax highlighter, and I can't use a regex to filter or replace it.
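For the ’ character specifically, one possible workaround is to write it as a Unicode escape instead of typing it literally, so that the editor's syntax highlighter only ever sees ASCII. Below is a minimal sketch, assuming the text is available as a plain string in a code step. The helper name normalizeQuotes and the choice of which characters to map are illustrative assumptions, not part of n8n.

```javascript
// Map typographic ("smart") quotes to plain ASCII equivalents.
// The characters are written as \u escapes so this source stays pure ASCII.
const SMART_PUNCTUATION = {
  '\u2018': "'", // left single quotation mark
  '\u2019': "'", // right single quotation mark (the curly apostrophe)
  '\u201C': '"', // left double quotation mark
  '\u201D': '"', // right double quotation mark
};

// Hypothetical helper: replaces each smart quote with its ASCII counterpart.
function normalizeQuotes(text) {
  return text.replace(/[\u2018\u2019\u201C\u201D]/g, (ch) => SMART_PUNCTUATION[ch]);
}
```

Calling normalizeQuotes on the fetched text before any further processing would leave all other characters untouched.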
Error: Invalid character in entity name
Line: 5
Column: 36
Char:
at error (/usr/local/lib/node_modules/n8n/node_modules/sax/lib/sax.js:652:10)
at strictFail (/usr/local/lib/node_modules/n8n/node_modules/sax/lib/sax.js:678:7)
at SAXParser.write (/usr/local/lib/node_modules/n8n/node_modules/sax/lib/sax.js:1499:13)
at Parser.exports.Parser.Parser.parseString (/usr/local/lib/node_modules/n8n/node_modules/xml2js/lib/parser.js:327:31)
at Parser.parseString (/usr/local/lib/node_modules/n8n/node_modules/xml2js/lib/parser.js:5:59)
at /usr/local/lib/node_modules/n8n/node_modules/xml2js/lib/parser.js:342:24
at new Promise (<anonymous>)
at Parser.exports.Parser.Parser.parseStringPromise (/usr/local/lib/node_modules/n8n/node_modules/xml2js/lib/parser.js:340:14)
at Parser.parseStringPromise (/usr/local/lib/node_modules/n8n/node_modules/xml2js/lib/parser.js:5:59)
at Object.execute (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes/Xml/Xml.node.js:234:47)
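One common trigger for "Invalid character in entity name" in a strict XML parser is a bare & in the text or in a URL query string, because the parser expects an entity such as &amp; to follow. The sketch below escapes every ampersand that does not start a numeric character reference or one of the five predefined XML entities. The function name escapeBareAmpersands is a hypothetical helper, and this is only a partial pre-processing step: HTML that is not well-formed XML in other ways (unclosed tags, unquoted attributes) will still fail to parse.

```javascript
// Hypothetical helper: escape ampersands that a strict XML parser would reject.
// Left untouched: &amp; &lt; &gt; &quot; &apos; and numeric references
// such as &#8217; or &#x2019;. Everything else becomes &amp;...
function escapeBareAmpersands(html) {
  return html.replace(
    /&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)/g,
    '&amp;'
  );
}
```

Note that HTML-only named entities such as &nbsp; are not defined in plain XML, so this sketch turns them into the literal text "&nbsp;" rather than a non-breaking space.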
Well, the XML node conveniently converted the HTML into JSON, so I used it for that. It works well for some websites and does not work well for others.
I have managed to use the HTML node to break down the website I am looking at and grab the items I want.
Unfortunately, this method is rather website-specific, and I would like a more universal solution so I can scale this to more than one website.
The site you provided earlier does not return its content as XML. Some sites return XML if you send the header Accept: application/xml, but that depends on the site.
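To check whether a given site honors that header, something along these lines could be used outside of n8n. It is a rough sketch that assumes Node.js 18 or later (for the built-in fetch); the names isXmlContentType and fetchAsXml are hypothetical, and in n8n itself the same header can simply be set in the HTTP Request node.

```javascript
// Hypothetical helper: true if a Content-Type header value denotes XML.
function isXmlContentType(contentType) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return (
    mediaType === 'application/xml' ||
    mediaType === 'text/xml' ||
    mediaType.endsWith('+xml')
  );
}

// Hypothetical helper: request a URL while asking for XML, then report
// whether the server actually answered with an XML media type.
async function fetchAsXml(url) {
  const response = await fetch(url, {
    headers: { Accept: 'application/xml' },
  });
  const body = await response.text();
  return {
    isXml: isXmlContentType(response.headers.get('content-type')),
    body,
  };
}
```

If isXml comes back false, the site ignored the header and the body is most likely still HTML.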
I do not see an easy way to convert an arbitrary site to XML, as it would basically have the same issues as working with the HTML directly.
XML works well with structured data (like tables or lists of something), but HTML is a mess.