Webscraping - Proxy with captcha solver

mumudu22 · March 24, 2021, 9:43am

Hi guys,

I am trying to avoid using a dedicated web scraping tool (like Webscraper.io), for both financial and technical reasons (I would have to create the scraper via an API, download the results file, parse it, etc.).

I have created a workflow with HTTP Request and HTML Extract nodes to get the data.
But the website that I am scraping is blocking me and asking for a captcha.

Would you know if it’s possible to use a proxy URL in the HTTP Request node and solve the captcha?

Regards

{
  "nodes": [
    {
      "parameters": {
        "dataPropertyName": "data_brand",
        "extractionValues": {
          "values": [
            {
              "key": "watch_url",
              "cssSelector": ".article-item-container",
              "returnValue": "html",
              "returnArray": true
            }
          ]
        },
        "options": {}
      },
      "name": "HTML Extract",
      "type": "n8n-nodes-base.htmlExtract",
      "typeVersion": 1,
      "position": [
        730,
        300
      ]
    },
    {
      "parameters": {
        "dataPropertyName": "watch_url",
        "extractionValues": {
          "values": [
            {
              "key": "link",
              "cssSelector": "a",
              "returnValue": "attribute",
              "attribute": "href"
            }
          ]
        },
        "options": {}
      },
      "name": "HTML Extract1",
      "type": "n8n-nodes-base.htmlExtract",
      "typeVersion": 1,
      "position": [
        900,
        300
      ]
    },
    {
      "parameters": {
        "url": "https://www.chrono24.fr/rolex/index.htm",
        "responseFormat": "string",
        "dataPropertyName": "data_brand",
        "options": {}
      },
      "name": "HTTP Request1",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [
        550,
        300
      ]
    }
  ],
  "connections": {
    "HTML Extract": {
      "main": [
        [
          {
            "node": "HTML Extract1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTTP Request1": {
      "main": [
        [
          {
            "node": "HTML Extract",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}

jan · March 24, 2021, 10:53pm

Yes, it is possible to use a proxy with the HTTP Request node by setting it via “Options → Proxy”.
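
As a rough sketch of what that option looks like in an exported workflow, the HTTP Request node from the workflow above could carry a proxy entry under its options. The proxy URL here is a placeholder, not a real service, and the `proxy` key name is an assumption about how the node stores the “Options → Proxy” field, so compare it with an export from your own n8n version.

```json
{
  "parameters": {
    "url": "https://www.chrono24.fr/rolex/index.htm",
    "responseFormat": "string",
    "dataPropertyName": "data_brand",
    "options": {
      "proxy": "http://proxy.example.com:8080"
    }
  },
  "name": "HTTP Request1",
  "type": "n8n-nodes-base.httpRequest",
  "typeVersion": 1,
  "position": [
    550,
    300
  ]
}
```

Setting the proxy only changes where the request goes out from; it does not solve a captcha by itself.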

Here is another recent discussion about it:

But I am not sure that solves your problem, because you would still have to find a proxy that does the captcha solving for you or that routes the traffic through many different IPs. Either way, you would probably need an external paid service.

Miquel_Colomer · March 24, 2021, 11:03pm

He can use this to create a proxy service with auto-rotating IPs.

But this needs a Docker installation plus a cloud provider (Amazon, DigitalOcean, …) to allocate the required pool of servers.

This doesn’t include a captcha solver (there are probably open-source solutions for that).

mumudu22 · March 25, 2021, 8:50am

Thanks guys for the answers.
As I am not a master of Docker, DigitalOcean, etc., I think that the best solution for me is indeed to use an external service.

I managed to do so with webscraper.io. I will get used to paying for it until I can master other solutions.

Regards