Unable to fix the pagination in the HTTP node

Hello,
Can anyone help me correct this pagination? I am unable to get it working.
Please see these URLs to get an idea of the structure:

`century21.com/real-estate-agents/geary-ok/LCOKGEARY`
`century21.com/real-estate-agents/geary-ok/LCOKGEARY/?s=12`
`century21.com/real-estate-agents/geary-ok/LCOKGEARY/?s=24`

Hey @Rafay_Saleem ,

It depends on how you want to handle the pagination loop and then the results. Multiple ways are possible, as always, but here is a suggestion:

Let me know if it makes sense for you!

See you!

Thanks, mate @theo. I have a question about sites like remax.com or kellerwilliams, where the profile URLs are not directly present in the raw HTML. Perhaps they are loaded dynamically via JavaScript after the page has initially loaded. How can we get those URLs?

How can we scrape those URLs or sites?

Thank you

Hi there, I found a workflow that you can maybe use for reference.
Here is the link:
Ultimate Scraper Workflow for n8n | n8n workflow template

Hey! You’re welcome.

I’m not sure what you’re specifically looking at on each page, but you can add a condition after each page result (an “If next page” node) to decide whether or not to continue the workflow, and then increment the page parameter (“s”) using the $runIndex variable, which is the number of times the loop has run, starting at 0 (which explains the “+ 1”).
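For reference, a minimal sketch of that kind of expression for the “s” query parameter in the HTTP Request node, assuming a step of 12 as in the Century 21 URLs above:

```js
// Expression for the "s" query parameter (sketch, step of 12 assumed).
// $runIndex is 0 on the first loop run, so the first paginated request
// gets s=12, the next one s=24, and so on (hence the "+ 1").
{{ ($runIndex + 1) * 12 }}
```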

@theo Thanks, mate. Now I am stuck here. As you can see in the screenshot, I used an If node with an expression like {{ $('Extract Agent Info').item.json.foundCards.length }} is greater than 0, so the true branch goes to the set-field expression, where it paginates until no profile is left. The false branch goes to the Split Out node, but the URLs are not going forward to the Code node and the sheet append. Can you please check?

I also tried putting the If node and the set-field node at the end, after the append-to-sheet step, but then it generates the same pages multiple times: for example, 10 is generated 10 times and 20 is generated 20 times.

Thanks again for your help; I look forward to hearing back on this. Btw, it’s a Yelp URL.

Hey @Rafay_Saleem,

Why don’t you connect the true branch as well to the “spreadsheet” section of the workflow?

This way:

  • I’m not sure you need the loop item node before it (the 3rd node in the workflow), since the loop isn’t going back to it.

@theo Thank you, but this didn’t work: it kept running even when there were no profiles left, continuously generating pages. Regarding your question about the loop item, it checks whether the URL has already been processed or not.

You can refer to the screenshot; there should be a maximum of about 8-9 pages for the 2 URLs, but it kept running and only returned 3 profile listings.

Hello @Rafay_Saleem

Sorry, but for some reason I can’t run your request node to try it out with your version:

[
  {
    "status": "failed",
    "status_code": 613,
    "message": "We were unable to scrape the target. You can try the following: 1) Use the Advanced Scraping API: https://dashboard.decodo.com/web-scraping-api/scraper?target=universal, 2) Switch to a different geo-location, 3) Retry the request later.",
    "task_id": "7347967433770038273"
  }
]

Some clues, perhaps, but you need to log the results step by step to debug:

  • Does the increment node work as expected for each new request?
  • Are the requests returning different values?
  • Since your “If next page1” node sends items to the TRUE branch only if the current item matches your condition, the problem is probably there: maybe foundCards.length is always > 0?
  • Having a LOOP over all URLs impacts the runIndex, because it keeps being incremented for each new URL: at the 100th URL you could have something like runIndex = 320 (or any value greater than 100). So I’m afraid this approach won’t work in the end; see the sketch below for one possible alternative.
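To make the offset independent of the global $runIndex, here is a minimal sketch of one workaround, assuming a Code node inside the per-URL loop carries its own offset field (the field name currentOffset and the step of 12 are assumptions based on the ?s=12 / ?s=24 URLs above):

```js
// Hypothetical n8n Code node (mode: "Run Once for All Items"):
// compute the next "s" offset per URL instead of relying on $runIndex,
// which keeps growing across every URL handled by the outer loop.
const STEP = 12; // assumed page size, matching the ?s=12 / ?s=24 example URLs

return $input.all().map((item) => {
  // currentOffset is a hypothetical field carried along with each item;
  // it is missing (i.e. 0) on the first request for a given URL.
  const current = item.json.currentOffset ?? 0;
  return {
    json: {
      ...item.json,
      currentOffset: current + STEP, // offset to use in the next HTTP request
    },
  };
});
```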

I hope it helps

Hello @theo,
Can you please check this version now?

Yes, the increment node works as expected for each new request. However, I’m unsure why it doesn’t stop when there are no profiles left and keeps generating pages.

As you can see in the screenshot, I am getting this HTML back instead of an empty result, so it must be false.

Hello @Rafay_Saleem,

Thanks for sharing more info.

Your “If next page1” node checks the length of foundCards: it must be > 0.

The “foundCards” array, as I can see in your screenshot, contains 1 element (at index = 0): so the length here is 1, and it satisfies the “If next page1” condition.

This is how arrays and array length work. Here is a short demo with a JavaScript sample:
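A minimal JavaScript sketch of that point (the foundCards value below is a hypothetical placeholder):

```js
// An array with a single element still has length 1, so a "length > 0"
// check passes even when there are no real results in it.
const foundCards = ["/adredir?ad_business_id..."]; // hypothetical sample value

console.log(foundCards.length);     // 1
console.log(foundCards.length > 0); // true -> the If node keeps paginating
```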

So you can’t use foundCards.length > 0 as a condition to check whether there are still results.

I was able to run your workflow on my end by copy-pasting it.

I can see that you always have “/adredir?ad_business_id…” as the first element of foundCards (meaning index = 0 :wink:). But if you have real results (the ones you want), they start at index = 1 (length = 2):

So to conclude: ...foundCards.length > 1 in “If next page1” should work :slight_smile:
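For reference, the corrected condition would then look something like this in the If node, using the node name from the earlier expression:

```js
// Corrected "If next page1" condition: require at least one real card
// beyond the "/adredir?ad_business_id..." entry that always sits at index 0.
{{ $('Extract Agent Info').item.json.foundCards.length > 1 }}
```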