Set user agent in HTTP node to avoid 403 forbidden error while scraping

Hi everyone, I’m currently working down a CSV list with one website URL per line: for each URL I run a GET request and look for certain keywords in the response with the HTML Extract node. I’ve noticed a small problem: some sites that work perfectly fine when I visit them in my browser are throwing 403 errors in the HTTP Request node.

That got me curious about what could be happening, so I looked up the error code plus “scraping” and found this:

This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like `Python-urllib/3.3.0`, it’s easily detected). Try setting a known browser user agent with:
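The quoted answer is about Python’s urllib, and the idea looks roughly like the sketch below: replace the default `Python-urllib/x.y` user agent with a browser-like string. The URL and User-Agent value here are illustrative, not from the original post.

```python
import urllib.request

# Illustrative URL and browser-like User-Agent string (assumptions,
# not taken from the original thread).
url = "https://example.com/"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)

# urllib normalizes header names via str.capitalize(), so the header
# is stored under the key "User-agent".
print(req.get_header("User-agent"))
# The actual fetch would then be: urllib.request.urlopen(req).read()
```

With the default user agent left in place, security modules like mod_security can match the `Python-urllib` prefix and return 403 before the request body is ever considered.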

I wonder if there are any plans to add the ability to set the user agent on the HTTP Request node for scraping use cases? I actually need this myself, so I may just contribute this feature; if anyone could point me to the right area of the project, I can give it a crack.

Hey @chrisgereina, thank you for creating this feature request!

I’ve not used the HTTP Request node extensively for scraping, so I haven’t come across exactly the same issues. However, there were cases where I needed to specify the user agent. I passed it in the Header parameters and it worked for me. Did you try passing the User-Agent header via the Header parameters?
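Combining that suggestion with the CSV workflow from the question, here is a minimal sketch in plain Python of the same pipeline with the User-Agent header applied. The file name, User-Agent string, and keywords are all assumptions for illustration; in the HTTP Request node this corresponds to adding a Header parameter named `User-Agent`.

```python
import csv
import urllib.request

# Assumed browser-like User-Agent string (illustrative).
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def fetch_html(url: str) -> str:
    """GET a page, sending an explicit User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def find_keywords(html: str, keywords: list) -> list:
    """Return the keywords that appear in the page (case-insensitive)."""
    lower = html.lower()
    return [kw for kw in keywords if kw.lower() in lower]

def scan_csv(path: str, keywords: list) -> dict:
    """One URL per CSV line -> which keywords each page contains."""
    results = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                results[row[0]] = find_keywords(fetch_html(row[0]), keywords)
    return results
```

The keyword check is a plain substring match here; the HTML Extract node offers CSS-selector-based extraction instead, so treat this only as a rough stand-in for that step.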


As @harshil1712 already mentioned, you should be able to accomplish this by passing the User-Agent in the header.