Thanks, mate @theo. I have a question: on sites like remax.com or kellerwilliams.com, the profile URLs are not directly present in the raw HTML. Perhaps they are loaded dynamically via JavaScript after the page has initially loaded. How can we get those URLs?
I’m not sure what you’re specifically looking at on each page, but you can add a condition after each page result (an “If next page” node) to decide whether the workflow continues, and then increment the page parameter (“s”) using the $runIndex variable, which is the number of times the loop has run, starting at 0 (hence the “+ 1”).
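A minimal sketch of that increment logic (the parameter name “s” and the helper name are illustrative, not from the actual workflow):

```javascript
// $runIndex starts at 0 on the first loop run, so "+ 1" yields
// page 1, 2, 3, ... — in an n8n expression this would look like
// {{ $runIndex + 1 }} assigned to the "s" query parameter.
function nextPageParam(runIndex) {
  return runIndex + 1;
}

console.log(nextPageParam(0)); // → 1 (first run requests page 1)
console.log(nextPageParam(4)); // → 5 (fifth run requests page 5)
```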
@theo Thanks, mate. Now I am stuck here. As you can see in the screenshot, I used an If node with an expression like {{ $(‘Extract Agent Info’).item.json.foundCards.length }} is greater than 0, so the true branch goes to the field expression, which paginates until no profiles are left. The false branch is wired to the Split Out node, but the URLs are not going forward to the Code and Append to Sheet nodes. Can you please check?
I also tried placing the If node and Set Field at the end, after Append to Sheet, but then pages are generated repeatedly: page 10 is generated 10 times, page 20 is generated 20 times.
Thanks again for your help; I look forward to your reply. By the way, it’s a Yelp URL.
@theo Thank you, but this didn’t work: it kept running even when there were no profiles left, continuously generating pages. Regarding your query about the loop item, it checks whether the URL has already been processed or not.
You can refer to the screenshot; the maximum was about 8-9 pages for 2 URLs, but it kept running and only returned 3 profile listings.
Sorry, but I can’t run your request node for some reason to try it out with your version:
[
{
"status": "failed",
"status_code": 613,
"message": "We were unable to scrape the target. You can try the following: 1) Use the Advanced Scraping API: https://dashboard.decodo.com/web-scraping-api/scraper?target=universal, 2) Switch to a different geo-location, 3) Retry the request later.",
"task_id": "7347967433770038273"
}
]
Some clues perhaps, but you need to log the results step by step to debug:
Does the increment node work as expected for each new request?
Are the requests returning different values?
Since your “If next page1” node returns items to the TRUE branch only if the current item matches your condition, the problem may be here: is foundCards.length perhaps always > 0?
Having a loop over all URLs impacts the runIndex, because it keeps being incremented at each new URL: at the 100th URL you could have something like runIndex = 320 (or any value greater than 100). So I’m afraid this approach won’t work in the end.
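Here is a sketch of why that breaks (the URL names and page counts are made up for illustration; this is not the actual workflow logic):

```javascript
// A single runIndex shared by the outer URL loop and the inner
// pagination loop keeps growing across URLs, so the computed page
// number is only correct for the very first URL.
let runIndex = 0;
const urls = ['url-1', 'url-2'];
const pagesPerUrl = [3, 2]; // pretend url-1 has 3 pages, url-2 has 2
const requestedPages = [];

urls.forEach((url, i) => {
  for (let p = 0; p < pagesPerUrl[i]; p++) {
    // "page" derived from the global runIndex is wrong for url-2:
    requestedPages.push({ url, page: runIndex + 1 });
    runIndex++;
  }
});

console.log(requestedPages);
// url-2 starts at page 4 instead of page 1.
```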
Hello @theo,
Can you please check this version now?
Yes, the increment node works as expected for each new request. However, I’m unsure why it doesn’t stop when there are no profiles left; it just keeps generating pages.
As you can see in the screenshot, I am getting this HTML instead of an empty result, so the condition must be false.
Your “If next page1” node checks that the length of foundCards is > 0.
The “foundCards” object, as I can see in your screenshot, contains 1 element (at index = 0): so the length here is 1 and it passes the “If next page1” condition.
This is how arrays and array lengths work. Here is a short demo with a JavaScript sample:
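A small demo of the point (the array content is illustrative):

```javascript
// An array with a single element still has length 1, so a
// "length > 0" check is true even when that one element is not
// a real result (e.g. only the ad-redirect entry).
const foundCards = ['/adredir?ad_business_id=...'];
console.log(foundCards.length);     // 1
console.log(foundCards.length > 0); // true — the If condition passes

const empty = [];
console.log(empty.length > 0);      // false — only a truly empty
                                    // array fails the condition
```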
So you can’t use foundCards.length > 0 as a condition to check whether there are still results.
I was able to run your workflow from my end by copy-pasting yours.
I can see that you always have "/adredir?ad_business_id........ as the first element (index = 0) of foundCards. But when there are real results (the ones you want), they start at index = 1 (so length = 2):
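So one possible fix, assuming the ad-redirect entry is always present at index 0 (the sample URLs below are hypothetical), is to check for length > 1 instead of length > 0:

```javascript
// A page with real results: ad redirect at index 0, real profile
// links from index 1 onward (length >= 2).
const pageWithResults = ['/adredir?ad_business_id=...', '/biz/some-agent'];
// A page with no real results: only the ad redirect (length = 1).
const pageWithoutResults = ['/adredir?ad_business_id=...'];

// "There is a next page" only when something beyond index 0 exists.
const hasRealResults = (cards) => cards.length > 1;

console.log(hasRealResults(pageWithResults));    // true
console.log(hasRealResults(pageWithoutResults)); // false
```

In the If node, that would mean comparing foundCards.length against 1 rather than 0.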