Website scraping - help pls

Greetings everyone, I hope you can guide me on how to scrape a website like this:

I am using n8n on other websites without issues, but this one just does not load properly when using the HTTP Request node. What is the trick to getting this one scraped?

Thank you ahead for sharing!

Welcome to the community @koko!

I did not check that page deeply, but the reason will probably be that JavaScript is involved. The HTTP Request node just loads the HTML of the page and does not execute any JavaScript. To make it work, you would either have to use an external API which renders everything and then returns the resulting HTML (those are paid services), or check out some other posts in this forum which talk about using Puppeteer with n8n. But that is for sure some more work, and will require some deeper technical knowledge to get it working.
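To see concretely why a plain fetch is not enough, here is a small offline sketch (the sample page and file names are made up for illustration). A plain download returns the HTML byte-for-byte, with the JavaScript still sitting unexecuted inside its script tag:

```shell
# Stand-in for a JavaScript-driven page: the visible content only
# exists after a browser runs the script.
cat > page.html <<'EOF'
<html><body><div id="app"></div>
<script>document.getElementById('app').textContent = 'Price: 9.99';</script>
</body></html>
EOF

# A plain fetch (effectively what the HTTP Request node does):
curl -s "file://$PWD/page.html" -o raw.html

# The result is the source verbatim -- the script has not been run,
# so the #app div is still empty. A headless browser or a rendering
# API is needed to get the final DOM.
grep -q '<script>' raw.html && echo "script tag still present: JS not executed"
```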


Thanks a lot @jan for your response. Even though it’s not what I was hoping for, you clarified it for me. This is much appreciated. A good community is key!

It is NOT a JavaScript issue; it has something to do with how this site is secured.
I’ve run a number of tests:
- Postman on Windows - works
- Postman online - doesn’t work
- wget on Windows - doesn’t work
- wget on Windows with a user agent - works
- curl on Windows - doesn’t work
- curl on Linux - doesn’t work, but returns “access denied”

The verbose logs suggest it has something to do with SSL/TLS, but I don’t have time to dig further.
If you are at least a little techie and it’s a life-or-death situation for you, describe your problem to ChatGPT and move on from there step by step.
No way in hell they can secure it 100% :slight_smile:
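For what it’s worth, the “with a user agent” variant of the tests above can be sketched as a small helper. Note that `fetch_as_browser` and the User-Agent string are illustrative choices, not part of any tool:

```shell
# Hypothetical helper: fetch a URL while impersonating a desktop browser.
# Some site protections block the default curl/wget User-Agent outright.
fetch_as_browser() {
  local url="$1" out="$2"
  # -A sets the User-Agent, -s hides progress output, -f makes curl
  # return a non-zero exit code on HTTP errors.
  curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36" \
       -sf "$url" -o "$out"
}

# Usage (placeholder URL):
# fetch_as_browser "https://example.com/product" page.html
```

If the server also fingerprints the TLS handshake (not just the header), a user agent alone may not be enough, which would match the mixed curl/wget results above.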


Thanks a lot @mikeon !

I guess in this case it should also work with n8n if the User Agent is set.

I failed to make it work on Linux with or without a user agent. I’m guessing your n8n runs on Linux (Docker), and it probably uses curl or similar.

@mikeon, thanks for all your efforts, much appreciated! I guess I am going to try the ChatGPT path :wink:

Regarding the User Agent suggestion, I did try it before I posted my question here. I did not have much luck with it either, but I will give it another go. Not much I can lose. :thinking:

@mikeon, how about that: I managed to get what I need using wget.

/usr/bin/wget -U "<USER AGENT>" "{{ $json.ProductUrl }}" -O file.html 2>/dev/null 

After this it was an easy task to read the binary file and extract the content.
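In case it helps anyone else, the extraction step after the wget download can be as simple as a sed one-liner. The `<title>` selector and the sample page below are just examples; `file.html` matches the `-O` target in the wget command above:

```shell
# Stand-in for the page saved by wget (-O file.html above):
cat > file.html <<'EOF'
<html><head><title>Sample Product - Shop</title></head>
<body><h1>Sample Product</h1></body></html>
EOF

# Pull the <title> text out of the saved page. This assumes the tag
# and its contents sit on one line, which is common but not
# guaranteed -- adapt the pattern to the element you need.
title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' file.html | head -n 1)
echo "$title"
```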


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.