I am scraping some web content and do not want to keep hotlinking to another site and using up their bandwidth, so I want to host it myself.
When I extract the HTML content I can see the images they have uploaded, the img src links, etc. I can extract these and upload them to my own site with no issues at all. The part I am stuck on is reconstructing the final HTML so that each img src in the extracted content is replaced with the link to my own hosted copy.
Assuming you know what the URL is, I would have thought it would just be a case of using that value to start the replace, for example if you are scraping from google.com and want to replace it with duckduckgo.com. If you don't know what site you are getting the data from, or you are also downloading third-party hosted resources like JavaScript or CSS files, it would be a bit trickier, and you would need to add those CDN / resource hosts to an array to loop over as they pop up.
That makes things trickier. How dynamic are they? There could be something I am missing, but if you are scraping the URL then you need to know the URL at some stage, so can you not use that?
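As a rough sketch of the "array of hosts to loop over" idea, something like the following could work. The host names in `sourceHosts`, the `targetHost` value and the function names are made up for illustration, and it assumes the URLs in the HTML are written with a `//` before the host.

```javascript
// Hypothetical hosts seen while scraping; extend the list as new ones pop up.
const sourceHosts = ['google.com', 'cdn.example.com'];
const targetHost = 'duckduckgo.com';

// Escape regex metacharacters so the dots in a host name match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function replaceHosts(html, hosts, newHost) {
  let result = html;
  for (const host of hosts) {
    // Match the host only right after "//" (with an optional "www.") and only
    // when it ends there, so "google.com.evil.test" is left alone.
    const pattern = new RegExp(
      '//(?:www\\.)?' + escapeRegExp(host) + '(?=[/"\'?#:\\s]|$)',
      'g'
    );
    result = result.replace(pattern, '//' + newHost);
  }
  return result;
}

const rewritten = replaceHosts(
  '<img src="https://www.google.com/logo.png">',
  sourceHosts,
  targetHost
);
console.log(rewritten);
```

Anchoring the match on `//` is a deliberate choice here so that a host name mentioned in the visible article text is not rewritten by accident.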
The other option would be to use regex to search for anything that looks like a domain but that is where you can run into issues if there are links in the content.
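To show what the regex route might look like, and why it gets messy, here is a minimal sketch that collects every host found after `http://` or `https://`. The function name `findHosts` and the sample HTML are assumptions for illustration only.

```javascript
// Collect the distinct hosts referenced anywhere in an HTML string.
function findHosts(html) {
  // Capture the run of host-like characters that follows the scheme.
  const urlPattern = /https?:\/\/([a-z0-9.-]+)/gi;
  const hosts = new Set();
  for (const match of html.matchAll(urlPattern)) {
    hosts.add(match[1].toLowerCase());
  }
  return [...hosts];
}

const sample =
  '<img src="https://img.example.com/a.png">' +
  '<a href="http://Other.example.org/page">a link in the article</a>';
console.log(findHosts(sample));
```

Note that the link inside the article body is picked up as well as the image host, which is exactly the issue with this approach: the regex cannot tell a resource you want to re-host from a link the author put in the content.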
Do you have an example workflow or example data we can look at?
Yeah, I have a regex that can pick up URLs etc., but my main issue is replacing the images in the article with the correct ones that have been re-uploaded to my own host.
EDIT: I have used HTML extract to get all the img src links so have a list of them per article.
That sounds like you might have it cracked now, then. You have that handy list, so in theory it is just a case of looping over those images in a function node and doing a replace in the body to swap each original string for your new string, and everyone is a winner.
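A minimal sketch of that loop, in plain JavaScript of the kind a function node runs, might look like this. The `imageMap` shape (`original` / `uploaded` pairs) and the function name are assumptions, not something from the actual flow.

```javascript
// imageMap: hypothetical list of { original, uploaded } pairs, one per image.
function replaceImageSources(html, imageMap) {
  // Handle longer URLs first so a URL that is a prefix of another
  // (e.g. "x.png" vs "x.png?size=2") does not clobber the longer one.
  const ordered = [...imageMap].sort(
    (a, b) => b.original.length - a.original.length
  );
  let result = html;
  for (const { original, uploaded } of ordered) {
    // split/join replaces every occurrence and treats the URL as plain text,
    // so characters like "?" or "." need no regex escaping.
    result = result.split(original).join(uploaded);
  }
  return result;
}
```

Inside a function node you would probably read the HTML and the list from the incoming items (for instance a field on `items[0].json`, with whatever field names your flow uses), call the function, and write the result back before returning the items.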
I need to find some javascript that will replace the images with the uploaded ones via a function node.
So I can extract the images, but I need to wait for a different bit of the flow to be able to replace that image with the newly created one that I have uploaded.
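Once both branches have finished (in n8n that would presumably mean joining them, for example with a Merge node), the two lists need pairing up before any replace can happen. The sketch below assumes the uploaded URLs come back in the same order as the extracted ones, which may not hold in your flow; the function name and the `original` / `uploaded` field names are hypothetical.

```javascript
// Pair each extracted image URL with the URL of its re-uploaded copy.
// Assumes both arrays are in the same order (an assumption, not a given).
function buildImageMap(originalUrls, uploadedUrls) {
  if (originalUrls.length !== uploadedUrls.length) {
    throw new Error('Expected one uploaded URL per original URL');
  }
  return originalUrls.map((original, index) => ({
    original,
    uploaded: uploadedUrls[index],
  }));
}
```

If the upload step can finish out of order, it would be safer to carry the original URL along with each upload and match on that instead of on position.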
My plan is to share this flow as a template once it all works, as then we have a webpage scraper that anyone can use on n8n.