Has anyone got a neat trick for replacing URLs with other URLs on the fly?

I am scraping some web content and do not want to keep hotlinking to another site and using up their bandwidth, so I want to host the images myself.

When I extract the HTML content I can see the images they have uploaded, the img src links, etc. I can extract these and upload them to my own site with no issues at all. The part I am stuck on is reconstructing the final HTML so that each img src in the extracted content points to my own hosted copy instead of the original.

Has anyone got a clever way to do this?

Hey @RedPacketSec,

I may have missed something, but would a replace in a Function node not do the job?

Yeah, but how are you dynamically matching each old link to the correct new link from the uploads?

Assuming you know what the URL is, I would have thought it would just be a case of using that value in the replace, for example if you are scraping from google.com and want to replace it with duckduckgo.com. If you don’t know which site you are getting the data from, or you are also downloading third-party hosted resources like JavaScript or CSS files, then it would be a bit trickier, and you would need to add those CDN / resource hosts to an array to loop over as they pop up.
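For the known-host case, a minimal sketch of the "array to loop over" idea might look like the following. The host names and the `swapHosts` helper are made up for illustration, and it assumes the files keep the same path on your own host, which may not be true for your uploads.

```typescript
// Hypothetical values: swap in the hosts you actually see while scraping.
const knownHosts: string[] = ["www.example.com", "cdn.example.net"];
const myHost = "media.my-site.example";

// Replace every occurrence of "//<host>/" with "//<target>/".
// split/join does a literal (non-regex) replace of all occurrences,
// so dots in host names need no escaping.
function swapHosts(html: string, hosts: string[], target: string): string {
  let out = html;
  for (const host of hosts) {
    out = out.split(`//${host}/`).join(`//${target}/`);
  }
  return out;
}

const sampleHtml = '<img src="https://www.example.com/a.jpg">';
console.log(swapHosts(sampleHtml, knownHosts, myHost));
```

Note that this rewrites every reference to those hosts, ordinary links included, so it only fits when you really do mirror everything from them.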

The URLs will be dynamic.

That makes things trickier. How dynamic are they? There could be something I am missing, but if you are scraping the URL then you must know it at some stage, so can you not use that?

The other option would be to use a regex to search for anything that looks like a domain, but that is where you can run into issues if there are ordinary links in the content that you do not want to rewrite.
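One way to reduce that risk is to match only the src attribute of img tags rather than anything domain-shaped. This is a rough sketch with a hypothetical `extractImgSrcs` helper; a regex is not a full HTML parser, so unquoted attributes or unusual markup will slip past it.

```typescript
// Collect the unique src values of <img> tags, ignoring <a href> and other links.
function extractImgSrcs(html: string): string[] {
  // \ssrc (rather than \bsrc) avoids matching attributes such as data-src.
  const pattern = /<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const src = match[2];
    if (found.indexOf(src) === -1) {
      found.push(src); // keep first occurrence only
    }
  }
  return found;
}

const article =
  '<a href="https://www.example.com/page">link</a>' +
  '<img src="https://www.example.com/wordpress/month/day/image_name.jpg">';
console.log(extractImgSrcs(article));
```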

Do you have an example workflow or example data we can look at?

Let's say they are uploading an image…

www.example.com/wordpress/month/day/image_name.jpg <- this changes dynamically in the article as the img src, and there will be multiple of these per page

and I will be doing something similar <- I will have these URLs from the workflow as I upload them

I have DM’d you

Yeah, I have a regex that can pick up URLs etc., but my main issue is replacing the images in the article with the correct ones that have been re-uploaded to my own host.

EDIT: I have used HTML Extract to get all the img src links, so I have a list of them per article.

I then need to upload these, and later in the flow replace each original URL in the article body with the URL of my own upload of that image.
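To keep old and new links aligned, one option is to record the pairs at upload time. The sketch below uses a hypothetical `pairUrls` helper and assumes the upload step returns the new URLs in the same order as the extracted list; if your uploads can finish out of order, carry the original URL along with each upload result instead.

```typescript
// Build an old-URL -> new-URL lookup from two lists in matching order.
function pairUrls(oldUrls: string[], newUrls: string[]): Record<string, string> {
  if (oldUrls.length !== newUrls.length) {
    // A mismatch usually means an upload failed, so stop rather than misalign.
    throw new Error("Old and new URL lists differ in length");
  }
  const urlMap: Record<string, string> = {};
  oldUrls.forEach((oldUrl, i) => {
    urlMap[oldUrl] = newUrls[i];
  });
  return urlMap;
}

// Hypothetical example values.
const extracted = ["https://www.example.com/wordpress/month/day/image_name.jpg"];
const uploaded = ["https://media.my-site.example/uploads/image_name.jpg"];
console.log(pairUrls(extracted, uploaded));
```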

That sounds like you might have it cracked now, then. You have that handy list, so in theory it is just a case of looping over those images in a Function node and doing a replace in the body to swap each old string for your new one, and everyone is a winner.
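A minimal sketch of that loop, assuming you already hold an old-to-new lookup for the article (the `rewriteImageUrls` name and the sample URLs are made up):

```typescript
// Replace every old image URL in the body with its re-hosted counterpart.
function rewriteImageUrls(body: string, urlMap: Record<string, string>): string {
  // Longest URLs first, so a URL that is a prefix of another
  // (image.jpg vs image.jpg.webp) cannot clobber the longer one.
  const oldUrls = Object.keys(urlMap).sort((a, b) => b.length - a.length);
  let out = body;
  for (const oldUrl of oldUrls) {
    // split/join replaces all occurrences literally, with no regex escaping needed.
    out = out.split(oldUrl).join(urlMap[oldUrl]);
  }
  return out;
}

const articleBody =
  '<p>Intro</p><img src="https://www.example.com/wordpress/month/day/image_name.jpg">';
const lookup: Record<string, string> = {
  "https://www.example.com/wordpress/month/day/image_name.jpg":
    "https://media.my-site.example/uploads/image_name.jpg",
};
console.log(rewriteImageUrls(articleBody, lookup));
```

Because the replace is a plain string match, the URL in the lookup has to be exactly the string that appears in the body (same scheme, same query string, same HTML entity encoding), otherwise that image is left untouched.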