Anyone got a neat trick to replace urls with other urls on the fly?

RedPacketSec · June 22, 2022, 1:00pm

So i am scraping some web content and do not want to keep hotlinking to another webpage and use up their bandwidth so want to host it on my own.

When I am extracting the HTML content I can see the images they have uploaded and the img src link etc. I can extract these and upload them to my own site no issues at all, but its the reconstructing the end HTML with the links to my own hosted img src and replace the one in the extracted HTML content.

anyone got a clever way to do this?

Jon · June 22, 2022, 3:47pm

Hey @RedPacketSec,

I may have missed something but would a replace not do the job from a function node?

RedPacketSec · June 22, 2022, 4:06pm

Yeah but how are you dynamically aligning the correct old links with the correct new links from the uploads?

Jon · June 23, 2022, 10:13am

Assuming you know what the URL is I would have thought it would just be a case of using that value to start the replace. If you are scraping from google.com and want to replace it with duckduckgo.com but if you don’t know what site you are getting the data from or you are also downloading any third party hosted resources like javascript or CSS files then it would be a bit trickier and you would need to add those CDN / resource hosts to an array to loop over as they pop up.

RedPacketSec · June 24, 2022, 5:44am

the url’s will be dynmaic

Jon · June 24, 2022, 7:02am

That makes things trickier, how dynamic are they? There could be something I am missing but if you are scraping the url then you would need to know the url at some stage so can you not use that?

The other option would be to use regex to search for anything that looks like a domain but that is where you can run into issues if there are links in the content.

Do you have an example workflow or example data we can look at?

RedPacketSec · June 24, 2022, 7:05am

lets say they are uploading an image…

www.example.com/wordpress/month/day/image_name.jpg <-dynamically changes in the article as img src. there will be multiple of these per page

and i will be doing something similar < - this i will have the urls from the workflow as i upload them

RedPacketSec · June 24, 2022, 7:06am

I have DM’d you

RedPacketSec · June 28, 2022, 5:39am

yeah, i have a regex that can pick up urls’s etc but my main issue it replacing the images in the article with the correct ones that have been reuploaded to my own host

EDIT: I have used HTML extract to get all the img src links so have a list of them per article.

I then need to upload these, but later on in the flow need to replace that URL in the article body with my own upload of that image.

Jon · June 28, 2022, 9:34am

That sounds like you might have it cracked now then, You have that handy list so in theory it is just a case of looping those images in a function node and doing a replace in the body to replace that string with your new string and everyone is a winner.

RedPacketSec · July 15, 2022, 6:46am

not quite.

I need to find some javascript that will replace the images with the uploaded ones via a function node.

So I can extract the images, but I need to wait for a different bit of the flow to be able to replace that image with the newly created one that I have uploaded.

My plan is to share this flow as a template once it all works as then we have a webpage scraper that anyone can use on n8n