Scraping a Dynamic Website That Uses AJAX to Load Content

Friends & Family, I am running into a challenge with scraping a highly dynamic website. With my next-to-nonexistent tech background I really tried hard to find ways. I learned a lot in the process, but not enough to help myself. Therefore, I was really hoping someone here could lend me a helping hand.

THE CHALLENGE:
I want to scrape all companies listed on this website: HKSFC - Public Register of Licensed Persons and Registered Institutions
The companies are only displayed after filters have been selected and a search has been initiated. The website then uses JavaScript to render the content that is loaded via AJAX. So the content isn't actually embedded in the HTML when I scrape the HTML or markdown (and the website URL doesn't change), which makes scraping difficult.

MY FAILED ATTEMPTS

  1. n8n HTTP request to an API - the website doesn't have an API (I also emailed them about it)

  2. n8n HTTP request: I read that AJAX calls can be intercepted and their content extracted. So I went into the website's network tab and found the input and output names from the AJAX call. I put these into the body of my HTTP node to 'populate the filter criteria' via the HTTP node. The HTTP node worked and returned HTML, but the HTML DOES NOT include the companies (also not when I convert it to markdown to make it easier to read)


  3. n8n HTTP request using firecrawl: firecrawl is apparently great with dynamic data. I configured firecrawl's cURL and populated the body. Same as before, the HTTP node worked and returned HTML (so my configuration is OK), but the HTML DOES NOT include the companies
    {
      "url": "HKSFC - Public Register of Licensed Persons and Registered Institutions",
      "formats": [
        "markdown"
      ]
    }

  4. n8n HTTP request using firecrawl, second try: I updated the body JSON to include the input and output variables that I got from the website's network tab. But it returns a 400 error "Unrecognized key in body – please review the v1 API documentation for request body changes". ChatGPT suggests that I shouldn't use the variables from the network tab in the firecrawl API, because the firecrawl API documentation doesn't include them.

So here I am, running out of ideas and really hoping that one of you fine people can guide me to a solution, even if it's just a DIY video tutorial. Sorry for the lengthy text, but I really tried to do it myself and wanted to guide any helper through the possible solutions that I have already ruled out. Thanks a lot, n8n fambam!

They do have an API: go to the dev console -> Network tab and look for XHR requests.

curl 'https://apps.sfc.hk/publicregWeb/searchByRaJson?_dc=1751375086604' \
-X 'POST' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Accept: */*' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Accept-Language: de-DE,de;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Origin: https://apps.sfc.hk' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.5 Safari/605.1.15' \
-H 'Content-Length: 87' \
-H 'Referer: HKSFC - Public Register of Licensed Persons and Registered Institutions' \
-H 'Connection: keep-alive' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Cookie: JSESSIONID=A3F3233456A21BF15FB5A02FD5FFFE91; TS0173272d=01ee710898b9f4ef257982d63f6b99ffb6e3f8256f481599b65ca3d8b075d89a3c1f43756b976722faf0483c3ae8c405cf81e3bbe4; BIGipServerPOOL_SFCAPPS_HTTPS=%15%84%13%1B%A8%B9E%08%FE%83%01%FD%DD%E7%BF%E7%8A%B8%7D)9%AD%AF%B6UTmm%06%D8%F5Y%EAL%87%E1%84f%0D%C2%1EYC%22%AA~%85%A4Ql%40%99%FC2A%F6B%7F%1A%00%00%00%01; _ga=GA1.1.280013404.1751375070; _ga_PSYEMG5425=GS2.1.s1751375070$o1$g0$t1751375070$j60$l0$h0; _gat=1; _gid=GA1.2.139666631.1751375070; TS019a838e=01ee710898b9f4ef257982d63f6b99ffb6e3f8256f481599b65ca3d8b075d89a3c1f43756b976722faf0483c3ae8c405cf81e3bbe4; abb0e204f3eee1c5a6ef12a16ed04d43=f64efff916c142a2f1356d278a77a035' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'Priority: u=3, i' \
--data 'licstatus=active&roleType=individual&ratype=1&nameStartLetter=A&page=1&start=0&limit=20'
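
For anyone who prefers to test the call outside of n8n first, here is a minimal Python sketch of the same POST using only the standard library. It only builds the request and prints it, it does not send anything. Assumptions on my side, not confirmed by the site: that the `_dc` query parameter is just a cache buster and can be dropped, that `page` is `start / limit + 1`, and that the browser-only headers are optional. Whether the server also needs the session cookie is something you would have to try. The function name `build_search_request` is made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl command above, without the _dc parameter
# (assumed to be a cache buster, not verified).
SEARCH_URL = "https://apps.sfc.hk/publicregWeb/searchByRaJson"


def build_search_request(letter="A", start=0, limit=20):
    """Build (but do not send) the POST request for one page of results."""
    form = {
        "licstatus": "active",
        "roleType": "individual",
        "ratype": "1",
        "nameStartLetter": letter,
        # Assumption: page is 1-based and derived from start and limit.
        "page": str(start // limit + 1),
        "start": str(start),
        "limit": str(limit),
    }
    body = urlencode(form).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "*/*",
    }
    return Request(SEARCH_URL, data=body, headers=headers, method="POST")


req = build_search_request()
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it, pass req to urllib.request.urlopen() and
# parse the response body with json.load().
```

With the defaults, the printed body is the same form string as in the `--data` line of the curl command.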


Thanks Baudrillard! I am still new to all of this and appreciate your pointer. I had initially looked for the API documentation, but an official doc doesn't exist, and when I emailed them they also said there is none. I didn't know that XHR is where I had to look. In the below screenshot of the network tab, the orange circle marks the payload, which is the filter settings, and the blue marks the outputs → this seems to be going in the right direction.
I used these in my HTTP firecrawl node below, but it returns a 400 error: "Unrecognized key in body – please review the v1 API documentation for request body changes". ChatGPT suggests that I should not use the website's API variables from the network tab in the firecrawl API, because the firecrawl API doesn't know what to do with them. Could you throw me another pointer as to how I can address this? Sorry for troubling you with this.

You don't need to use firecrawl. The issue is in n8n's curl import and a malformed POST request, which is what you get when copying from the network tab.
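
One way to avoid a malformed body is to split the copied payload into individual form fields and enter those one by one in the HTTP Request node (as form-urlencoded body parameters) instead of pasting the raw string. A small Python sketch of that split, using the payload from the curl command in this thread; the helper name `form_fields` is made up for this example.

```python
from urllib.parse import parse_qsl

# Payload string as copied from the network tab / curl --data line.
raw_body = (
    "licstatus=active&roleType=individual&ratype=1"
    "&nameStartLetter=A&page=1&start=0&limit=20"
)


def form_fields(body):
    """Split a form-urlencoded body into a name -> value dictionary."""
    return dict(parse_qsl(body, keep_blank_values=True))


fields = form_fields(raw_body)
for name, value in fields.items():
    # Each pair becomes one body parameter in the HTTP node.
    print(f"{name} = {value}")
```

This gives seven name/value pairs, which are the filter settings the website sends with the search.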

Here is a working HTTP request from which you can try to set up a flow to get all data from the site. The workflow below will return data like this:

[
  {
    "totalCount": 320,
    "items": [
      {
        "ceref": "AAA259",
        "name": "AU-YEUNG YEUNG Kwok Lin Jennifer",
        "nameChi": "歐陽楊幗蓮",
        "entityName": null,
        "entityOtherName": null,
        "entityType": null,
        "isIndi": true,
        "isEo": false,
        "isCorp": false,
        "isRi": false,
        "hasActiveLicence": "Y",
        "hasActiveLicenceAmlo": "N",
        "isDeemedLicence": "N",
        "isDeemedLicenceAmlo": "N",
        "isActiveEo": "N",
        "address": null,
        "raDetails": null,
        "raDetailsAmlo": null
      },
      {
        "ceref": "AAA969",
        "name": "AU Chong Kit, Stanley",
        "nameChi": "區宗傑",
        "entityName": null,
        "entityOtherName": null,
        "entityType": null,
        "isIndi": true,
        "isEo": false,
        "isCorp": false,
        "isRi": false,
        "hasActiveLicence": "Y",
        "hasActiveLicenceAmlo": "N",
        "isDeemedLicence": "N",
        "isDeemedLicenceAmlo": "N",
        "isActiveEo": "N",
        "address": null,
        "raDetails": null,
        "raDetailsAmlo": null
      },

Thanks a bunch for mocking up this draft node with the SFC inputs! I should be able to amend and loop this! I will have a proper look tomorrow, with a fresh pair of eyes. Thanks again Baudrillard!
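
For the looping part, here is a rough Python sketch of how the paging parameters could be generated from `totalCount`. It assumes that `start` is a zero-based offset, `limit` is the page size, and `page` is 1-based, which is how the payload in this thread looks but is not documented anywhere. The helper name `page_params` is made up for this example.

```python
import math


def page_params(total_count, limit=20):
    """Return one dict of paging parameters per page needed to cover total_count."""
    pages = math.ceil(total_count / limit)
    return [
        {"page": p + 1, "start": p * limit, "limit": limit}
        for p in range(pages)
    ]


# totalCount of 320 as reported by the first response, page size 20.
params = page_params(320, 20)
print(len(params))   # 16
print(params[0])     # {'page': 1, 'start': 0, 'limit': 20}
print(params[-1])    # {'page': 16, 'start': 300, 'limit': 20}
```

In n8n the same idea would be one first request to read `totalCount`, then one request per entry in this list, repeated for each `nameStartLetter` you want to cover.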


Thanks again - initially I had a question here about why the number of results doesn't match between the website and the API (329 vs 208), although the names all match. So there seems to be something off with the counter. Thanks!!!

Hi,

I see 320 items in totalCount from my HTTP request. Could you post your node here so I can have a look at it?


I had a different filter on the website than in the API. I picked it up as I wrote the reply here and double-checked my inputs so as not to waste your time. So that's a wrap. Thanks again a lot for your help with this!
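
In case someone else hits the same kind of mismatch: a quick way to spot it is to compare the payload from the website's network tab with the one in the HTTP node, field by field. A minimal Python sketch; the two example payloads are made up for illustration (the thread does not say which filter actually differed), and `diff_filters` is a hypothetical helper name.

```python
def diff_filters(site, api):
    """Return {field: (site_value, api_value)} for every field that differs."""
    keys = sorted(set(site) | set(api))
    return {
        key: (site.get(key), api.get(key))
        for key in keys
        if site.get(key) != api.get(key)
    }


# Hypothetical payloads: same search, but a different start letter.
site_payload = {"licstatus": "active", "roleType": "individual", "nameStartLetter": "A"}
api_payload = {"licstatus": "active", "roleType": "individual", "nameStartLetter": "B"}

print(diff_filters(site_payload, api_payload))
# {'nameStartLetter': ('A', 'B')}
```

An empty result means both sides send the same filters, so a different count would have to come from somewhere else.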

