How do you scrape Cloudflare-protected websites in n8n?

I’m building a daily digest workflow that fetches content from HVAC industry blogs. Three of my sources (HouseCall Pro, Jobber Academy, HVAC Informed) are protected by Cloudflare. When my HTTP Request node hits these URLs, it gets a 403 “Just a moment…” Cloudflare challenge page instead of the actual content.

These sites don’t have RSS feeds either, so RSS Feed Read nodes don’t work.

What I’ve tried:

  • HTTP Request node (GET) — returns Cloudflare challenge HTML

  • RSS Feed Read node — returns XML parse errors (pages aren’t RSS)

Questions:

  1. Is there a way to bypass Cloudflare bot protection in n8n natively?

  2. Should I use Apify, ScrapingBee, or BrightData for these sources?

  3. Is there a community node that handles Cloudflare-protected sites?

  4. Would adding browser-like headers (User-Agent, Accept-Language, etc.) help?

@Asim_Arman I did a quick check on your 3 sources before recommending anything, and there’s good news for one of them: HouseCall Pro isn’t actually on Cloudflare, and its /feed/ works fine if you just add a real browser User-Agent header to the HTTP Request node. I tested with a Mozilla/5.0 UA against HouseCall Pro and got valid RSS XML back. Your 403 there is almost certainly because n8n’s HTTP Request node sends a generic UA by default (or none at all) and the site has a basic UA filter. Switch to RSS Feed Read pointed at the HouseCall Pro /feed/ with a User-Agent override and you’re done for that one, no scraping needed.
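For anyone who wants to check the User-Agent theory outside n8n first, here is a minimal sketch. The helper names and the exact UA string are made up for illustration (any current browser UA should behave the same), and it assumes Node 18+ for the global `fetch`.

```javascript
// Example browser User-Agent; the exact string is an assumption, not something the site requires.
const BROWSER_UA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";

// Build the same request the HTTP Request node would send (hypothetical helper).
function buildFeedRequest(feedUrl) {
  return {
    method: "GET",
    url: feedUrl,
    headers: {
      "User-Agent": BROWSER_UA,
      Accept: "application/rss+xml, application/xml;q=0.9, */*;q=0.8",
    },
  };
}

// Cheap sanity check: did we get a feed back, or an HTML block page?
function looksLikeFeed(body) {
  const head = body.slice(0, 500).toLowerCase();
  return head.includes("<rss") || head.includes("<feed");
}

// Fetch the feed and fail loudly if the response is not feed XML.
async function fetchFeed(feedUrl) {
  const { url, ...init } = buildFeedRequest(feedUrl);
  const res = await fetch(url, init);
  const body = await res.text();
  if (!res.ok || !looksLikeFeed(body)) {
    throw new Error(`Expected a feed from ${feedUrl}, got HTTP ${res.status}`);
  }
  return body;
}
```

If `fetchFeed` succeeds with the header and fails without it, the 403 was a UA filter and not a Cloudflare challenge.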

Jobber (getjobber.com) is genuinely on Cloudflare (the cf-ray header is present in responses) and there’s no RSS for the academy section. HVAC Informed also doesn’t expose RSS. Those two genuinely need a scraping service.

On the bypass-Cloudflare-natively question: there’s no real way to do it in n8n. The challenge requires executing JS to solve a proof-of-work plus behavioral fingerprinting, and the HTTP Request node is a plain HTTP client with no browser engine. The cloudscraper-style libs from 2022-2023 are all broken against current Cloudflare too, so the few community nodes wrapping them don’t work either.

Browser-like headers fix basic UA filters (which is what HouseCall Pro’s 403 actually was, ironically) but do nothing against the real Cloudflare challenge. Cloudflare fingerprints the TLS handshake (JA3), request timing, and JS execution, not just headers.
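To tell the two kinds of 403 apart before choosing a fix, something like this rough sketch could go in a Code node after the HTTP Request. It relies only on the two signals mentioned in this thread (the cf-ray header and the "Just a moment" challenge title). The function names and return labels are invented for illustration, and the heuristic is not exhaustive.

```javascript
// Header lookup that tolerates either casing of header names.
function getHeader(headers, name) {
  const key = Object.keys(headers).find(
    (k) => k.toLowerCase() === name.toLowerCase()
  );
  return key ? String(headers[key]) : "";
}

// Classify a response as ok, a Cloudflare challenge, or a plain block.
function classifyResponse({ status, headers = {}, body = "" }) {
  const viaCloudflare = getHeader(headers, "cf-ray") !== "";
  const challengePage = /<title>\s*Just a moment/i.test(body);

  if (status >= 200 && status < 300 && !challengePage) return "ok";
  if (viaCloudflare && challengePage) return "cloudflare-challenge";
  // A 403 without the challenge page is worth retrying with browser headers.
  if (status === 403) return "blocked-maybe-ua-filter";
  return "other-error";
}
```

A result of `blocked-maybe-ua-filter` is the cheap case (try a browser User-Agent), while `cloudflare-challenge` means headers alone will not help.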

On ScrapingBee vs Apify vs BrightData: for your use case (low-volume daily digest, just 2 sites needing real scraping), ScrapingBee or Firecrawl is the cleanest. Rough comparison:

{
  "scrapingbee": {"per_request": "~$0.001-0.005", "fit": "low-mid volume, simple, fast"},
  "scraperapi": {"per_request": "~$0.001-0.003", "fit": "comparable to scrapingbee"},
  "firecrawl": {"per_request": "~$0.001", "fit": "scrape-to-markdown, LLM-ready output"},
  "brightdata": {"per_request": "~$0.005-0.02", "fit": "hardest targets, expensive but most reliable"},
  "apify": {"per_request": "varies", "fit": "complex multi-page flows"}
}

ScrapingBee called via HTTP Request looks like:

{
  "method": "GET",
  "url": "https://app.scrapingbee.com/api/v1/",
  "qs": {
    "api_key": "<your_key>",
    "url": "https://www.getjobber.com/academy/",
    "render_js": "true",
    "premium_proxy": "true"
  }
}
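The same call as a plain JavaScript sketch, in case it helps to see how the query string is assembled. The endpoint and the four parameters come from the config above. The function names are hypothetical, and it assumes Node 18+ for the global `fetch`.

```javascript
// Build the ScrapingBee request URL; URLSearchParams handles encoding the target URL.
function buildScrapingBeeUrl(apiKey, targetUrl, { renderJs = true, premiumProxy = true } = {}) {
  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    render_js: String(renderJs),
    premium_proxy: String(premiumProxy),
  });
  return `https://app.scrapingbee.com/api/v1/?${params.toString()}`;
}

// Fetch the rendered HTML of a page through ScrapingBee.
async function scrapePage(apiKey, targetUrl) {
  const res = await fetch(buildScrapingBeeUrl(apiKey, targetUrl));
  if (!res.ok) {
    throw new Error(`ScrapingBee returned HTTP ${res.status} for ${targetUrl}`);
  }
  return res.text();
}
```

The target URL has to be URL-encoded inside the query string. The HTTP Request node does that for you when you use its query parameter fields, which is one reason to avoid pasting everything into the URL box.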

End state is one RSS Feed Read for HouseCall Pro and two ScrapingBee HTTP calls for Jobber and HVAC Informed, then parse → digest. Way cheaper than treating all 3 as Cloudflare-protected when only one actually is.

Thanks a lot! I’m trying this now and will let you know.

You’re welcome! Let me know how it works!

I would like to add that Apify has the Smart Article Extractor, which can (usually) detect article links on a page automatically and then scrape them one by one.

It emulates a browser, so it can get past Cloudflare too. It’s also a usage-only actor, so a once-a-day call to three blogs will probably fit into their free tier.

For more general (non-article) cases, you can get structured content by using their Website Content Crawler and then running the output through a small AI model, such as Claude Haiku or one of the GPT mini models.

For more complex sites like Jobber and HVAC Informed, if you run a self-hosted n8n instance with Docker and prefer to avoid paid solutions, an interesting alternative is to run SeleniumBase in UC (Undetected ChromeDriver) mode in a separate Docker container. UC mode is currently one of the most robust open-source tools for bypassing Cloudflare protections.

You can then configure this container to run your scraping scripts and trigger them directly from n8n using the Execute Command node (or using a custom tool if you’ve structured your workflow around an n8n AI Agent).

One thing nobody’s mentioned yet: before reaching for a scraping service, check if these sites have a sitemap or Atom feed hiding behind a non-standard path.

Try hitting these before paying for anything:

  • /sitemap.xml
  • /sitemap_index.xml
  • /feed/
  • /atom.xml
  • /blog/feed/ or /blog/rss/

Many WordPress/HubSpot-powered blogs (which HouseCall Pro and Jobber both use) have feeds enabled but not linked in the page’s <head>. The RSS Feed Read node will work fine if you hit the right URL.
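Probing those paths by hand gets tedious with more than a couple of sites, so here is a small sketch that generates the candidates and returns the first one that answers with XML. The path list is the one above. The function names and the content-type check are assumptions made for illustration, and the probe needs Node 18+ for `fetch`.

```javascript
// Paths worth probing before paying for a scraping service.
const FEED_PATHS = [
  "/sitemap.xml",
  "/sitemap_index.xml",
  "/feed/",
  "/atom.xml",
  "/blog/feed/",
  "/blog/rss/",
];

// Turn any page URL on the site into the list of candidate feed/sitemap URLs.
function candidateFeedUrls(siteUrl) {
  const origin = new URL(siteUrl).origin;
  return FEED_PATHS.map((path) => origin + path);
}

// Return the first candidate that responds 2xx with an XML-ish content type, else null.
async function findFirstFeed(siteUrl, userAgent) {
  for (const url of candidateFeedUrls(siteUrl)) {
    try {
      const res = await fetch(url, { headers: { "User-Agent": userAgent } });
      const type = res.headers.get("content-type") || "";
      if (res.ok && /xml|rss|atom/i.test(type)) return url;
    } catch {
      // Network error on this path: move on to the next candidate.
    }
  }
  return null;
}
```

A sitemap is not a feed, but it still gives you article URLs with last-modified dates, which is often enough for a daily digest.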

For the ones that genuinely have no feed and Cloudflare blocks you:

Cheapest production-grade approach: Firecrawl or ScrapingBee via HTTP Request node.

HTTP Request → POST https://api.firecrawl.dev/v1/scrape
Headers: Authorization: Bearer <your_key>
Body: { "url": "https://target.com/blog", "formats": ["markdown"] }

Returns clean markdown — no HTML parsing needed. Run it on a Schedule Trigger once daily. At 3 sites × 1 request/day, you’ll stay well inside any free tier.

Production hardening (the part most tutorials skip):

  1. Add a retry with backoff on the HTTP Request node, since scraping APIs occasionally return a 429
  2. Store the last-fetched URL hash in a Postgres table so you don’t re-process the same article tomorrow (idempotency)
  3. Add an Error Trigger workflow, so if your digest silently fails on Monday you know before Friday
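Points 1 and 2 are easier to picture in code. This is only a sketch under stated assumptions: the function names are made up, the in-memory `Set` stands in for the Postgres table, and the key is the normalized URL itself (hash it before storing if you prefer a fixed-width column).

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run fn, retrying on any thrown error with exponential backoff.
async function withRetry(fn, { retries = 3, baseMs = 1000, sleep = defaultSleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
  throw lastError;
}

// Normalize a link so tracking params, fragments and trailing slashes don't create duplicates.
function articleKey(link) {
  const u = new URL(link);
  return (u.origin + u.pathname).replace(/\/$/, "");
}

// Keep only items whose key is not in seenKeys, and record the new keys.
function filterUnseen(items, seenKeys) {
  const fresh = [];
  for (const item of items) {
    const key = articleKey(item.link);
    if (seenKeys.has(key)) continue;
    seenKeys.add(key);
    fresh.push(item);
  }
  return fresh;
}
```

In n8n itself the retry part maps to the node’s Retry On Fail setting. The dedupe part is what you would do with a lookup against the table before the digest step.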

The “just scrape it” part is easy. Making it run reliably every day for 6 months without silent failures is where most daily digest workflows break.

Hey, really appreciate the detailed breakdown, achamm. This was super helpful and I ended up following most of it.

What I followed exactly:

  • HouseCall Pro - confirmed it wasn’t Cloudflare, added the User-Agent header to the HTTP Request node, pointed it at /feed/, parsed through an XML node, then a Code node. Works perfectly now.

  • Jobber and HVAC Informed - used Firecrawl instead of ScrapingBee (same price tier you mentioned, just preferred the markdown output for passing to Claude).

    Two-step approach: scrape the listing page first to extract article URLs, then scrape each article individually via a second HTTP Request.
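For anyone copying the two-step approach, the first step could look roughly like this. It assumes the listing page comes back as markdown (as Firecrawl returns it) and that article URLs share a path prefix with the listing. The function name and the `pathPrefix` parameter are hypothetical.

```javascript
// Pull same-site article links out of a listing page's markdown.
function extractArticleLinks(markdown, listingUrl, pathPrefix) {
  const base = new URL(listingUrl);
  const listingPath = pathPrefix.replace(/\/$/, "");
  const links = new Set();
  // Matches [text](url) and [text](url "title"), but not images (![alt](url)).
  const linkPattern = /(?<!!)\[[^\]]*\]\(([^)\s]+)[^)]*\)/g;

  for (const match of markdown.matchAll(linkPattern)) {
    let url;
    try {
      url = new URL(match[1], base); // resolves relative links against the listing
    } catch {
      continue; // skip malformed hrefs
    }
    if (url.host !== base.host) continue; // external link
    if (!url.pathname.startsWith(pathPrefix)) continue; // nav, footer, etc.
    if (url.pathname.replace(/\/$/, "") === listingPath) continue; // the listing itself
    url.hash = "";
    links.add(url.href);
  }
  return [...links];
}
```

Each returned URL then goes through the second HTTP Request, one article per item.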

What we added beyond your recommendations:

  • A Parse Code node after each Firecrawl request to normalize all sources into identical fields (title, content, link, feedSource) before merging
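A rough sketch of what such a normalizing step can look like, in case it helps someone building the same thing. The four output fields are the ones listed above. The input shapes are assumptions: an RSS item with `title`/`link`/`description` as produced by an XML parsing step, and a Firecrawl response shaped like `{ data: { markdown, metadata: { title, sourceURL } } }`. The function names are made up.

```javascript
// Map one parsed RSS <item> to the shared shape (assumed fields: title, link, description).
function fromRssItem(item, feedSource) {
  return {
    title: String(item.title || "").trim(),
    content: String(item["content:encoded"] || item.description || "").trim(),
    link: String(item.link || "").trim(),
    feedSource,
  };
}

// Map one Firecrawl scrape result to the shared shape (assumed response layout).
function fromFirecrawl(result, feedSource, fallbackUrl = "") {
  const data = (result && result.data) || {};
  const meta = data.metadata || {};
  const markdown = String(data.markdown || "");
  // Fall back to the first markdown H1 when the metadata has no title.
  const heading = markdown.match(/^#\s+(.+)$/m);
  return {
    title: String(meta.title || (heading ? heading[1] : "")).trim(),
    content: markdown.trim(),
    link: String(meta.sourceURL || fallbackUrl).trim(),
    feedSource,
  };
}
```

Because both functions return the same four keys, the Merge node and everything after it never has to know which source an item came from.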

Thanks so much man, glad to have you in the community

You’re welcome, happy we were able to help you!

If you can self-host a tool (for example via Docker on the same server as your n8n) and want to avoid the quota limits of paid platforms (like Apify, ScrapingBee or Firecrawl), you should look into FlareSolverr.
It’s an open-source proxy that launches a headless browser in the background to solve the Cloudflare challenge and hands back the page’s HTML. You just need to query it locally with a simple HTTP Request node in n8n.
Since FlareSolverr only returns raw HTML, you’ll need an HTML node or a Code node right after it to parse and structure the data into JSON yourself.
A small example:

{
  "nodes": [
    {
      "parameters": {
        "jsCode": "const html = $json.solution.response;\n\nfunction getText(html, startMarker, endMarker) {\n  const start = html.indexOf(startMarker);\n  if (start === -1) return '';\n  const end = html.indexOf(endMarker, start);\n  if (end === -1) return '';\n  return html.substring(start + startMarker.length, end).trim();\n}\n\nfunction getMetaContent(html, name) {\n  const regex = new RegExp(`<meta name=\"${name}\" content=\"([^\"]+)\"`, 'i');\n  const match = html.match(regex);\n  return match ? match[1] : '';\n}\n\nlet title = getText(html, '<h1 class=\"story-title\">', '</h1>');\n// cleans the title of tags\ntitle = title.replace(/<a[^>]*>([^<]*)<\\/a>/, '$1').trim();\n\nconst description = getMetaContent(html, 'description');\n\nlet content = getText(html, '<div class=\"articlebody clear cf\" id=\"articlebody\">', '<div class=\"stophere\" id=\"hiddenH1\">');\n// cleans the content: drop scripts, styles and remaining tags, then collapse whitespace\ncontent = content\n  .replace(/<script[^>]*>[\\s\\S]*?<\\/script>/gi, '')\n  .replace(/<style[^>]*>[\\s\\S]*?<\\/style>/gi, '')\n  .replace(/<[^>]+>/g, ' ')\n  .replace(/\\s+/g, ' ')\n  .trim();\n\nreturn [{\n  json: {\n    title: title,\n    description: description,\n    content: content\n  }\n}];"
      },
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [-912, 4480],
      "id": "a6275095-c4ed-427f-a791-620bcd83710b",
      "name": "Extract from HTML"
    },
    {
      "parameters": {
        "method": "POST",
        "url": "=http://flaresolverr:8191/v1",
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            { "name": "Content-Type", "value": "application/json" }
          ]
        },
        "sendBody": true,
        "bodyParameters": {
          "parameters": [
            { "name": "cmd", "value": "request.get" },
            { "name": "url", "value": "={{ $json.url }}" },
            { "name": "maxTimeout", "value": 90000 }
          ]
        },
        "options": {
          "timeout": 120000
        }
      },
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.4,
      "position": [-1120, 4480],
      "id": "4c884f8a-09c7-4c13-b7e8-ac38098b00f6",
      "name": "FlareSolverr",
      "retryOnFail": true
    }
  ],
  "connections": {
    "Extract from HTML": {
      "main": []
    },
    "FlareSolverr": {
      "main": [
        [
          {
            "node": "Extract from HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  },
  "pinData": {},
  "meta": {
    "templateCredsSetupCompleted": true,
    "instanceId": "03428eda2f2a76af33393f7fa59419dea9e2e74e0d630500b36df8cf5b8e161d"
  }
}
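If you want to test the FlareSolverr container before wiring it into n8n, the same request can be sketched in plain JavaScript. The payload fields (`cmd`, `url`, `maxTimeout`) and the `solution.response` path are the ones used in the workflow above. The function names are made up, the `flaresolverr:8191` hostname assumes the same Docker network, and `fetch` needs Node 18+.

```javascript
// Request body for FlareSolverr's /v1 endpoint, matching the workflow above.
function buildFlareSolverrBody(targetUrl, maxTimeoutMs = 90000) {
  return { cmd: "request.get", url: targetUrl, maxTimeout: maxTimeoutMs };
}

// Pull the page HTML out of a FlareSolverr reply, or fail with its message.
function extractHtml(flareResponse) {
  if (!flareResponse || flareResponse.status !== "ok" || !flareResponse.solution) {
    const reason = (flareResponse && flareResponse.message) || "no solution in response";
    throw new Error(`FlareSolverr did not solve the challenge: ${reason}`);
  }
  return flareResponse.solution.response;
}

// End-to-end call: POST the JSON body, return the solved page's HTML.
async function fetchViaFlareSolverr(targetUrl, endpoint = "http://flaresolverr:8191/v1") {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildFlareSolverrBody(targetUrl)),
  });
  return extractHtml(await res.json());
}
```

Checking `status` before reading `solution.response` matters here: when the challenge is not solved in time, the reply is still valid JSON, and the Code node would otherwise fail on an undefined `solution`.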