PuppeteerJS Execution (or other Web Scraping)

hckdotng · October 4, 2022, 10:34am

What’s your environment? Are Docker and Docker Compose updated to the latest version? I tested this docker-compose file on macOS and Ubuntu Server 22 (at the time of my test n8n was at 0.195.5; I hope there is no breaking change in 0.196.0).

Seavia · January 23, 2023, 4:16pm

Hi @hckdotng, do you have another copy of your n8n Puppeteer docker-compose file? Your previous link gives a 404 error.

Anthony · April 29, 2023, 4:51pm

I am running an API-first web scraping SaaS, and I thought I should add my 2 cents to the discussion. Running Puppeteer for web scraping tasks can be a cumbersome and costly endeavor in terms of debugging, developer UX, and hardware resources. I have written a couple of writeups on my blog on this topic.

A web scraping task usually consists of two stages:

1. Retrieve HTML/JSON from a target website (this step involves using certain proxy geos, retries, and headers).
2. Extract useful machine-readable data from the retrieved raw data.

My idea is that a basic HTTP scraping engine and a real Chrome browser should be interchangeable bricks in the retrieval step, while the extraction process (a cheerio JS function, or “extractors” in the ScrapeNinja implementation) should not care which rendering engine was used.
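To make the "interchangeable brick" idea concrete, here is a minimal sketch, not ScrapeNinja's actual implementation. The function names (fetchWithHttp, fetchWithBrowser, extractTitle, scrape) are hypothetical, and a simple regex stands in for a cheerio-based extractor so the example has no dependencies beyond Node 18+ (Puppeteer is only loaded if the browser engine is actually used).

```javascript
// Stage 1, option A: plain HTTP retrieval (Node 18+ ships a global fetch).
async function fetchWithHttp(url) {
  const res = await fetch(url, { headers: { "User-Agent": "Mozilla/5.0" } });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}

// Stage 1, option B: a real Chrome browser via Puppeteer.
// Required lazily so the HTTP path works without puppeteer installed.
async function fetchWithBrowser(url) {
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Stage 2: the extractor only ever sees an HTML string, so it does not
// care which engine produced it. (Stand-in for a cheerio function.)
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Glue: any retrieval function can be paired with any extractor.
async function scrape(url, retrieve, extract) {
  const html = await retrieve(url);
  return extract(html);
}
```

With this shape, switching engines is a one-argument change: scrape(url, fetchWithHttp, extractTitle) versus scrape(url, fetchWithBrowser, extractTitle).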

If you decide to run Puppeteer yourself, consider using the puppeteer-extra-plugin-stealth and proxy-chain npm packages; both are very useful for web scraping. Consider disabling image and CSS downloads to cut web page rendering time in half. Use hardware monitoring and provision more RAM (at least 4GB, preferably 8GB).
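One common way to skip image and CSS downloads is Puppeteer's request interception. The sketch below assumes you already have a Puppeteer page object; the helper names (shouldBlock, blockHeavyResources) and the exact set of blocked types are my own choices for illustration.

```javascript
// Resource types to skip. "image" and "stylesheet" are values returned by
// Puppeteer's request.resourceType(); extend the set if you need to.
const BLOCKED_TYPES = new Set(["image", "stylesheet"]);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Call this once on a page before page.goto().
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort(); // do not download images or CSS
    } else {
      request.continue(); // let everything else through
    }
  });
}
```

Usage would look like: const page = await browser.newPage(); await blockHeavyResources(page); await page.goto(url);. Note that some sites rely on CSS for layout-dependent scripts, so it is worth checking that the extracted data is unchanged after enabling this.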

ScrapeNinja.net has two endpoints for these two engines: /scrape (a high-performance, basic Node.js-wrapped, curl-like utility with a Chrome TLS fingerprint, which helps bypass basic Cloudflare protection) and /scrape-js (a customized Puppeteer engine). They are vastly different in terms of implementation, but their API surface is as similar as possible. I always have them running on separate cloud instances because Puppeteer is very resource-hungry.
I have ScrapeNinja no-code integration packages for some of the non-open-source n8n competitors, but it’s just an HTTP API call, so it’s not too hard to use ScrapeNinja with n8n. I will be happy to help and answer your questions.

Cheers,
Anthony.

Alexis_Sanchis · May 28, 2023, 4:45pm

Hi @Seavia and @hckdotng,

I am experiencing the same issue as you. I’m running into trouble trying to make Puppeteer work in a Docker environment, especially since I’m not well-versed in Docker.

I attempted to build an image using the configuration provided in this git: Running puppeteer node in n8n · GitHub. However, I’m now encountering an error in n8n when running the Puppeteer node. The error message states: “ERROR: Could not find expected browser (chrome) locally. Run npm install to download the correct Chromium revision (1002410).”

I’m not sure where the problem lies exactly - whether it’s within the Docker image or on the Linux host (Ubuntu 22.04), or if it’s a version compatibility issue.

Does anyone have a solution for this problem or perhaps a simpler alternative?

Thanks!