So I am scraping some web content and do not want to keep hotlinking to another site and using up their bandwidth, so I want to host the images myself.
When I extract the HTML content I can see the images they have uploaded, the img src links, etc. I can extract these and upload them to my own site with no issues at all. The problem is reconstructing the final HTML so that each img src points to my own hosted copy instead of the original one in the extracted HTML content.
Anyone got a clever way to do this?
I may have missed something, but would a replace in a function node not do the job?
Yeah, but how are you dynamically matching each old link with the correct new link from the uploads?
The URLs will be dynamic.
That makes things trickier. How dynamic are they? There could be something I am missing, but if you are scraping the URL then you must know it at some stage, so can you not use that?
The other option would be to use a regex to search for anything that looks like a domain, but that is where you can run into issues if there are other links in the content.
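To show what I mean about the regex route, here is a rough sketch that only looks at src attributes on img tags rather than anything domain-shaped, so ordinary links in the article are left alone. The function name `extractImgSrcs` is made up for illustration, and a regex is not a real HTML parser (it assumes quoted src values), so treat it as a starting point only. In a function node you would drop the type annotations.

```typescript
// Hypothetical helper: collect the src value of every <img> tag in an HTML string.
// Assumes src values are wrapped in single or double quotes.
function extractImgSrcs(html: string): string[] {
  // "\ssrc" requires whitespace before "src", so attributes like data-src are skipped.
  const pattern = /<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/gi;
  const srcs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    srcs.push(match[1]);
  }
  return srcs;
}
```

Links inside `<a href="...">` are never captured this way, which avoids the problem of touching URLs that are not images.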
Do you have an example workflow or example data we can look at?
Let's say they are uploading an image…
www.example.com/wordpress/month/day/image_name.jpg <- this changes dynamically in the article as the img src. There will be multiple of these per page.
And I will be doing something similar <- I will have these URLs from the workflow as I upload them.
Yeah, I have a regex that can pick up URLs etc., but my main issue is replacing the images in the article with the correct ones that have been re-uploaded to my own host.
EDIT: I have used HTML Extract to get all the img src links, so I have a list of them per article.
I then need to upload these, but later on in the flow need to replace that URL in the article body with my own upload of that image.
That sounds like you might have it cracked now then. You have that handy list, so in theory it is just a case of looping over those images in a function node and doing a replace on the body to swap each old string for your new one, and everyone is a winner.
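Something along these lines might do it. This is only a sketch: the `ImageMapping` shape and the `replaceImageUrls` name are made up, and I am assuming you keep each original src alongside the URL your host returns after the upload. In a function node you would drop the type annotations and feed it the body and list from your earlier nodes.

```typescript
// Hypothetical shape: one entry per re-uploaded image.
interface ImageMapping {
  oldUrl: string; // src exactly as it appears in the scraped HTML
  newUrl: string; // URL of the copy on your own host
}

function replaceImageUrls(body: string, mappings: ImageMapping[]): string {
  // Replace longer URLs first so a URL that is a prefix of another
  // (e.g. "img.jpg" vs "img.jpg.webp") cannot clobber the longer one.
  const ordered = mappings
    .slice()
    .sort((a, b) => b.oldUrl.length - a.oldUrl.length);

  let result = body;
  for (const { oldUrl, newUrl } of ordered) {
    if (oldUrl === "") continue; // an empty string would match everywhere
    // split/join replaces every literal occurrence, so no regex escaping is needed.
    result = result.split(oldUrl).join(newUrl);
  }
  return result;
}
```

Because the match is on the exact old string, it does not matter how dynamic the original URLs are, as long as you carry each one through the flow next to its new URL.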