Unable to generate correct pagination

I am unable to generate the correct order of pagination. The base URL pagination produces the correct pages on the input side, but the output still uses the pages from base URL 1.

My goal is to start with base URL 1, fetch agents' profiles until there are none left, and then move on to base URL 2 and repeat the same process. Can anyone help me?

Hey @Rafay_Saleem hope all is well. See the workflow below, I think it is pretty close to what you were looking for, only simplified.

You can also see the result of running this in your document under the TEST2 tab. Hope it helps.

There was a small flaw; here is an updated version.
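
For readers who cannot open the workflow, here is a rough sketch in plain Python of the order the question asks for. It does not reproduce the workflow's nodes. `crawl_in_order`, `fake_fetch`, and the example.com URLs are hypothetical names made up for illustration, and "an empty page means this base URL is done" is an assumption about how the site signals the last page.

```python
from typing import Callable, Iterable, Iterator, List


def crawl_in_order(
    base_urls: Iterable[str],
    fetch_page: Callable[[str, int], List[dict]],
) -> Iterator[dict]:
    """Yield every profile from the first base URL, then the second, and so on."""
    for base_url in base_urls:
        page = 1  # the page counter restarts for every base URL
        while True:
            profiles = fetch_page(base_url, page)
            if not profiles:  # assumption: an empty page marks the end
                break
            for profile in profiles:
                yield profile
            page += 1


# Stand-in for the real HTTP call (hypothetical data):
# two pages under base URL "a", one page under base URL "b".
FAKE_SITE = {
    ("https://example.com/a", 1): [{"name": "a1"}, {"name": "a2"}],
    ("https://example.com/a", 2): [{"name": "a3"}],
    ("https://example.com/b", 1): [{"name": "b1"}],
}


def fake_fetch(base_url: str, page: int) -> List[dict]:
    return FAKE_SITE.get((base_url, page), [])


names = [
    profile["name"]
    for profile in crawl_in_order(
        ["https://example.com/a", "https://example.com/b"], fake_fetch
    )
]
print(names)  # ['a1', 'a2', 'a3', 'b1']
```

The key point is that the page counter lives inside the loop over base URLs, so it resets to 1 each time and never carries pages from base URL 1 over to base URL 2.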

@jabbson Thank you so much for your help. I have a small question: what if I want to add a wait after each base URL completes, so the requests don't get blocked?

You can space them out with the "interval between requests" setting in the pagination options of the same HTTP Request node, but that would apply to all requests.

Alternatively, you can use batching with intervals, where you wait x ms after every y requests.
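
To show what "wait x ms after every y requests" means in practice, here is a minimal Python sketch of the idea outside n8n. `run_in_batches`, `batch_size`, and `interval_ms` are hypothetical names chosen for illustration, not n8n option names, and the handler is a stand-in for a real request.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_batches(
    items: Sequence[T],
    handler: Callable[[T], R],
    batch_size: int,
    interval_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call handler on every item, pausing interval_ms after each full batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[R] = []
    for index, item in enumerate(items, start=1):
        results.append(handler(item))
        # Pause after every batch_size items, but not after the very last one.
        if index % batch_size == 0 and index < len(items):
            sleep(interval_ms / 1000)
    return results


# Demo with a recorded sleep so nothing actually waits.
pauses: List[float] = []
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
fetched = run_in_batches(
    urls,
    handler=lambda url: f"fetched {url}",  # stand-in for a real HTTP request
    batch_size=2,
    interval_ms=500,
    sleep=pauses.append,
)
print(pauses)  # [0.5, 0.5]
```

With five URLs and a batch size of two, the pause happens after the second and fourth request. To wait once per base URL instead, the same pattern applies with the base URLs as the items.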

@jabbson Thanks for your help. Much appreciated :folded_hands:

@jabbson Is it possible to use an open-source web scraper with it? Since I am self-hosting n8n, I am curious whether I can use an open-source web scraper API instead of the native HTTP Request node.

You can choose from this list of solutions I recently came across or heard of (I am sure there are more if you search):

  1. Cheerio + HTTP Request Node (Built-in)
  2. Puppeteer (via Docker or External API)
  3. Scrapy (Python-based)
  4. Playwright (via external API)
    Use services like Browserless or ScrapingBee
  5. Open-Source APIs with Web Scraping Functionality
    Simple Scraper
    Go-Scraper or ScrapFly

@jabbson Can you provide a Cheerio or Puppeteer tutorial? I'm unable to find one.

This is what comes up in a quick Google search:

Try YouTube too; there must be some guidance or step-by-step tutorials.

thanks @jabbson

Hello @jabbson, I installed the Puppeteer node, but when scraping websites like realtor.com or zillow, I get a 429 error code.
I'm using a self-hosted n8n setup with Docker and Portainer, cloud-hosted on Oracle, with this Puppeteer node: drudge/n8n-nodes-puppeteer on GitHub (n8n node for browser automation using Puppeteer).

Thank you!

429 is a rate-limiting response code (Too Many Requests); try spacing your requests out in time.
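
As a generic illustration of spacing requests out after a 429, here is a minimal Python sketch that retries with a growing delay and uses the `Retry-After` header when the server sends it as a number of seconds. `get_with_backoff` and the `send` callable are hypothetical names for illustration. The sketch ignores the HTTP-date form of `Retry-After` and assumes the header key is spelled exactly `Retry-After`. It applies only to services whose terms allow automated access.

```python
import time
from typing import Callable, List, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def get_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry while the server answers 429, waiting longer after each attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429 or attempt == max_attempts - 1:
            break  # success, a different error, or out of attempts
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)  # the server said how long to wait
        else:
            delay = base_delay_s * (2 ** attempt)  # exponential backoff
        sleep(delay)
    return status, headers


# Demo with scripted responses and a recorded sleep so nothing actually waits.
scripted = iter([(429, {"Retry-After": "3"}), (429, {}), (200, {})])
waits: List[float] = []
final_status, _ = get_with_backoff(lambda: next(scripted), sleep=waits.append)
print(final_status, waits)  # 200 [3.0, 2.0]
```

The first wait comes from the header; the second falls back to the exponential schedule (1 s doubled once, since it is the second attempt).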

@jabbson I did, and I also checked by going through the realtor’s website, and it was working fine in the browser.

Well, you see, the people who run these services like their data, and they like it when you can't get it, at least not easily. While one group of smart people is thinking about how to scrape all the data, make it available, and make money off it, another group is thinking about how to protect themselves from exactly that. Bot detection is getting as sophisticated as web scraping, and it is a never-ending battle.

While on this topic, both services you've mentioned strictly prohibit data scraping and the use of automated tools to access or extract data from their platforms without explicit written permission. Doing so is unethical and can carry legal consequences.

But Realtor provides an API as well, and I think it's costly @jabbson

And that is exactly the point: if you want the data, they want you to pay for it, and this is exactly why they will try their best to detect and stop any bot activity on their resources.