Scrape any site with Crawl4AI and n8n
If data is oil, then scraping is the oil rig that keeps it flowing.
One prominent use case for n8n is delivering custom scraping solutions in minutes when combined with niche service providers such as Crawl4AI, a free, open-source library developed by unclecode.
This tutorial will be of tremendous value to anyone working in:
- Real Estate: we will scrape Sotheby’s property listings for an AI agent.
- Backend: using Supabase to store our data and run RAG.
- DevOps: setting up a DigitalOcean API endpoint for scraping.
Prerequisites
- An n8n account to run your automation; check this tutorial.
- A Supabase account to host your vector store, which I go over here.
- A DigitalOcean account, which we will go over in this tutorial.
For the visual learners, here’s the full YT tutorial.
Scraping Practices
Before we dive into the tooling, consider the following facts behind a successful scraping operation. If the target website provides its data via an API, use it. For example, you can retrieve Reddit data via its API — n8n tutorial here.
If too much scraping is detected from a certain IP address, you may get blocked, making future scraping impossible.
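A simple mitigation is to throttle your own requests so the target never sees a rapid-fire burst from one IP. Here is a minimal Python sketch of that idea; the delay value is an arbitrary assumption, not an official rate limit.

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping between requests
    so the target site never sees a rapid-fire burst from one IP."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        results.append(fetch(url))
    return results
```

In n8n you can get the same effect by placing a Wait node inside the loop.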
Tutorial
Step 1 — Checking If a Website Can Be Scraped
Before writing a single line of automation, you need to confirm whether the target site allows scraping. Two quick checks:
- robots.txt — append /robots.txt to the domain. If only admin or login routes are disallowed, you’re free to scrape the public listings.
- sitemap.xml — append /sitemap.xml to the domain. This file lists all available URLs that search engines (and scrapers) can crawl. For real estate sites, you’ll often find a dedicated sitemap for property listings.
This is the fuel you’ll feed into your scraper.
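If you want to automate that first check, Python’s standard library can parse a robots.txt body directly. The rules below are a hypothetical example with only admin and login routes disallowed, not Sotheby’s actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice, fetch the real one
# from https://<domain>/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def can_scrape(path: str, agent: str = "*") -> bool:
    """True if the crawler `agent` is allowed to fetch `path`."""
    return parser.can_fetch(agent, path)
```

With rules like these, public listing pages come back as allowed while the admin routes are blocked.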
Step 2 — Building the n8n Workflow
Our n8n workflow has two jobs:
- Scrape listings with Crawl4AI.
- Insert structured data into Supabase for retrieval (RAG).
Here’s how the flow starts:
- HTTP Request Node → points to the sitemap.xml of Sotheby’s Sydney listings.
- Convert XML to JSON → n8n handles JSON more reliably, so we convert early.
- Split in Batches → sitemap URLs arrive as one big block. We split them so each listing can be processed one at a time.
- Loop Node → cycles through each listing, sending it to our crawler.
This modular approach makes the workflow scalable. You can always pause, test, and re-run batches without overwhelming the system.
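For intuition, here is what those nodes do, sketched in Python: parse the sitemap XML into a list of URLs, then split that list into batches. The three-listing sitemap is a made-up stand-in for the real feed.

```python
import xml.etree.ElementTree as ET

# Stand-in sitemap; the HTTP Request node fetches the real one.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/listing/1</loc></url>
  <url><loc>https://example.com/listing/2</loc></url>
  <url><loc>https://example.com/listing/3</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """XML in, plain URL list out (the 'convert early' step)."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

def split_in_batches(items: list, size: int):
    """Yield fixed-size chunks, mirroring n8n's Split in Batches node."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each yielded batch corresponds to one pass of the loop node, which is what lets you pause and re-run without reprocessing everything.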
Step 3 — Deploying Crawl4AI on DigitalOcean
Crawl4AI doesn’t run in n8n directly. Instead, we deploy it as a containerized service on DigitalOcean.
Here’s the setup:
- Head to DigitalOcean App Platform → choose Container Image.
- Pull from DockerHub: unclecode/crawl4ai:latest.
- Configure resources: give it 4 GB RAM (crawling often spikes in memory).
- Set Port 11235 (per Crawl4AI docs).
- Deploy. Wait a few minutes, and you’ll get a live URL endpoint for your crawler.
Step 4 — Securing Your Crawler
Scrapers without authentication are an open invitation for abuse. We lock it down with an API Token:
In DigitalOcean → Settings → Environment Variables, add:
- Key: CRAWL4AI_API_TOKEN
- Value: your-secret-token
Then in n8n:
- Go to Credentials → Generic Header Auth.
- Add a header:
  - Name: Authorization
  - Value: Bearer your-secret-token
This ensures only your workflow can trigger the crawler.
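Under the hood this is nothing more than a bearer token in the Authorization header. The sketch below builds, but does not send, the equivalent authenticated request in Python; the deployment URL and the request body shape are assumptions for illustration, not the exact Crawl4AI API contract.

```python
import json
import urllib.request

CRAWLER_URL = "https://your-app.ondigitalocean.app/crawl"  # hypothetical deployment URL
API_TOKEN = "your-secret-token"  # must match CRAWL4AI_API_TOKEN on the server

def build_crawl_request(url_to_scrape: str) -> urllib.request.Request:
    """Build a POST request carrying the bearer token; the server
    rejects anything whose token doesn't match CRAWL4AI_API_TOKEN."""
    body = json.dumps({"urls": [url_to_scrape]}).encode()
    return urllib.request.Request(
        CRAWLER_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

n8n’s Generic Header Auth credential attaches exactly this header to every HTTP Request node that uses it.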
Step 5 — Scraping in Action
Now when the loop node passes URLs to Crawl4AI:
- n8n sends an authenticated request.
- Crawl4AI scrapes the property page.
- The structured HTML/JSON is returned to n8n.
At this stage, the listings flow into Supabase, where they’re stored as vector embeddings for RAG. That way, our AI agent can retrieve them later.
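As a sketch of that hand-off, the helper below chunks a scraped listing and shapes it into rows for a Supabase documents table. The column names follow a common Supabase + pgvector convention and the fixed-size chunking is a naive assumption; in the real workflow, n8n’s embeddings step fills in the vector for each row.

```python
def chunk_text(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking before embedding; real pipelines
    usually split on paragraph or sentence boundaries instead."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def to_document_rows(url: str, markdown: str) -> list[dict]:
    """Shape one scraped listing into vector-store rows. The
    content/metadata column names are a common convention for
    Supabase + pgvector, not a fixed schema."""
    return [
        {"content": chunk, "metadata": {"source": url, "chunk": i}}
        for i, chunk in enumerate(chunk_text(markdown))
    ]
```

Keeping the source URL in the metadata is what lets the agent cite which listing an answer came from.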
In the next section, we’ll test the end-to-end flow by asking the AI agent:
“Are there any four-bedroom listings in Sydney?”
and watch it pull the answer directly from Sotheby’s scraped data.
Conclusion
In this tutorial, we combined n8n, Crawl4AI, DigitalOcean, and Supabase to build a production-ready scraping workflow in minutes. By using robots.txt and sitemap.xml, we respected the site’s structure, and by containerizing Crawl4AI, we gained a scalable and secure way to fetch structured data. Finally, storing the scraped results in a vector store enabled retrieval-augmented generation (RAG), letting our AI agent answer real questions with fresh, scraped knowledge.
This setup is flexible enough to adapt to almost any domain — real estate, e-commerce, research, or internal data. And with n8n orchestrating the pieces, you can extend the workflow to alerts, reports, or even full automation pipelines.