Friends & Family, I am running into a challenge scraping a highly dynamic website. With my next-to-nonexistent tech background I tried hard to find a way. I learned a lot in the process, but not enough to help myself. So I am really hoping someone here can lend me a helping hand.
THE CHALLENGE:
I want to scrape all companies listed on this website HKSFC - Public Register of Licensed Persons and Registered Institutions
The companies are only displayed after filters have been selected and a search has been started. The website then uses JavaScript to render content that is loaded via AJAX. So the content isn't actually embedded in the HTML when I scrape the HTML or markdown (and the website URL doesn't change), which makes scraping difficult.
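To make the AJAX part concrete, here is a rough Python sketch of replaying such a background request outside the browser. Everything specific in it is an assumption: the endpoint URL, the form field names, and the `items` key in the response are placeholders, not values taken from the SFC site. The real ones would have to be copied from the request shown in the browser's Network tab, and the site may also require cookies or tokens that this sketch does not handle.

```python
import json
import urllib.parse
import urllib.request

# Placeholders only: copy the real values from the browser's Network tab.
ENDPOINT = "https://example.invalid/search"          # hypothetical endpoint
FORM_FIELDS = {"filter": "value", "start": "0"}      # hypothetical field names


def build_request(endpoint, form_fields, referer=None):
    """Build a POST request that looks like the one the page's JavaScript sends."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints expect this
        "Accept": "application/json, text/javascript, */*",
    }
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def parse_items(response_text, list_key="items"):
    """Extract the result list, assuming the response is JSON with a list under list_key."""
    data = json.loads(response_text)
    return data.get(list_key, [])


def fetch_items(endpoint, form_fields, referer=None):
    """Send the request and return the parsed rows (performs a network call)."""
    request = build_request(endpoint, form_fields, referer)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_items(response.read().decode("utf-8"))
```

If the replayed request returns the page shell instead of data, the usual suspects are a wrong endpoint (the page URL instead of the AJAX URL), a missing header, or a missing session cookie.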
MY FAILED ATTEMPTS
- n8n HTTP request to an API: the website doesn't have an API (I also emailed them about it).
- n8n HTTP request: I read that AJAX calls can be intercepted and their content extracted. So I went into the website's Network tab and found the input and output names from the AJAX call. I put these into the body of my HTTP node to "populate the filter criteria". The HTTP node worked and returned HTML, but the HTML DOES NOT include the companies (also not when I convert it to markdown to make it easier to read).
- n8n HTTP request using Firecrawl: Firecrawl is apparently great with dynamic data. I configured Firecrawl's cURL and populated the body. Same as before, the HTTP node worked and returned HTML (so my configuration is OK), but the HTML DOES NOT include the companies. This is the body I sent:
{
"url": "HKSFC - Public Register of Licensed Persons and Registered Institutions ",
"formats": [
"markdown"
]
}
- n8n HTTP request using Firecrawl, second try: I updated the body JSON to include the input and output variables that I got from the website's Network tab. But it returns a 400 error: "Unrecognized key in body -- please review the v1 API documentation for request body changes". ChatGPT suggests that I shouldn't use the variables from the Network tab in the Firecrawl API, because the Firecrawl API documentation doesn't include them.
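For context on that 400 error: the Firecrawl body only accepts keys from its own schema, so the site's form fields cannot be added as extra top-level keys. As far as I understand the v1 scrape endpoint, page interaction is instead described through an `actions` list (click, wait, and so on). The Python sketch below only builds such a body; the page URL and CSS selectors are made-up placeholders, and the exact action fields are an assumption that should be checked against the Firecrawl documentation.

```python
import json


def build_scrape_body(page_url, click_selectors, wait_ms=2000):
    """Build a request body that clicks each selector and waits after each click.

    The "actions" shape used here is an assumption about the Firecrawl v1 API;
    verify it against the official documentation before relying on it.
    """
    actions = []
    for selector in click_selectors:
        actions.append({"type": "click", "selector": selector})
        actions.append({"type": "wait", "milliseconds": wait_ms})
    # Only keys from the API's own schema go at the top level.
    return {"url": page_url, "formats": ["markdown"], "actions": actions}


# Hypothetical URL and selectors, for illustration only.
body = build_scrape_body(
    "https://example.invalid/register",
    ["#hypothetical-filter", "#hypothetical-search-button"],
)
print(json.dumps(body, indent=2))
```

The printed JSON is what would go into the body of the n8n HTTP node in place of the form fields from the Network tab.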
So here I am, running out of ideas and really hoping that one of you fine people can guide me to a solution, even if it's just a "DIY video tutorial". Sorry for the lengthy text, but I really tried to do it myself and wanted to walk any helper through the options I have already ruled out. Thanks a lot, n8n fambam!