Hi everyone, I’m currently working through a CSV list with a website URL on each line. For each URL I run a GET request to fetch the page content, then look for certain keywords with the HTML Extract node. I’ve noticed a small problem: some sites that work perfectly fine when I visit them from my browser throw errors in the HTTP Request node.
That got me curious about what could be happening, so I searched for the error code plus “scraping” and found this:
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, which is easily detected). Try setting a known browser user agent with:
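For reference, here is a minimal Python sketch of what the quoted answer describes: setting a browser-like User-Agent with the standard library's urllib. It is not the code from the quoted answer. The `BROWSER_UA` value and the `build_request` and `fetch` helper names are assumptions made up for illustration.

```python
import urllib.request

# Assumption: an example browser-like User-Agent string. Any current
# browser's value should behave similarly; this exact string is illustrative.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Fetch a page with the custom User-Agent and return its body as text."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch("https://example.com/")` would perform the actual request. Whether a given site then stops returning errors depends on how that site filters traffic, so treat this as something to try rather than a guaranteed fix.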
I wonder if there is any plan to add the ability to set user agent properties in the HTTP Request node for scraping use cases? I actually need this myself, so I may just contribute this feature. If anyone could direct me to the right area in the project, I can give it a crack.
Hey @chrisgereina, thank you for creating this feature request!
I’ve not used the HTTP Request node extensively for scraping and hence didn’t come across the exact same issues. However, there were cases where I needed to specify the User-Agent. I passed it in the Header parameters and it worked for me. Did you try passing the User-Agent header via the Header parameters?
Yes, I just tested it, and it works fine. If you want to try, just make an HTTP request to something like https://webhook.site/ where you can explore the request.
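If you would rather not send test traffic to a third-party service, the same check can be approximated locally. This is a rough Python sketch, not n8n code: it starts a throwaway echo server on 127.0.0.1, sends one GET request to it, and returns the headers the server received. The `EchoHeaders` class and `headers_seen_by_server` function are hypothetical names made up for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Reply to every GET with the received request headers as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items())).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def headers_seen_by_server(request_headers):
    """Send one GET with the given headers and return what the server saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeaders)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        req = urllib.request.Request(url, headers=request_headers)
        # An empty ProxyHandler keeps the request off any configured proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(req, timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
```

For example, `headers_seen_by_server({"User-Agent": "MyAgent/1.0"})` returns a dict of the headers as they arrived. The client may change the capitalisation of header names before sending, so it is safer to compare names case-insensitively.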
Is there any difference between cURL and the HTTP Request node in n8n?
Yes, you can probably do more with cURL. But for what you are trying to do, there should not be any problems.
But when using the HTTP Request node I get a 403.
It works for me when changing the user agent or not sending it at all. It has something to do with how the site handles sessions.
By the way, for those running into the issue where the added header name is only ever sent in lowercase: go to Options > Lowercase Headers, select it, then disable it, so that the header is sent in whatever case you specified. Otherwise it will always be sent in lowercase.